More than fifteen years ago, at the 1976 ALLC symposium in Oxford, I outlined some of the features of the software for processing textual data which we were then developing at Tübingen University. I illustrated them by the solution they can offer to specific problems occurring when preparing critical editions (cf. CHum 13 (1979) 29-35).
As of last year (thanks to a substantial grant we received from the State of Baden-Württemberg from 1985 to 1989) TUSTEP, the "TUebingen System of TExt processing Programs", is bilingual: in addition to German, it now understands English commands and produces English messages. Before that, TUSTEP had been accessible as a tool for German-speaking users only. At present (Jan. 1992), the German version is being used by humanities projects in about 75 universities, most of them on the European continent. Its bilingual version now makes it more readily accessible to users outside the German-speaking part of the world.
For this reason, it may be worthwhile to review some of TUSTEP's goals and principles of design:
Word processing and desktop publishing programs do not meet these basic requirements: they contain too many functions in a single program unit. For example, a word processing program normally contains the functions of entering, correcting, formatting and printing a text. The same applies to index and concordance programs, where the functions for finding the index entries in the text, defining a collation sequence for each entry, re-arranging entries in the required order, building index entries from successive identical or partially identical entries after the sort, and formatting them for printing are all contained in a single program.
In TUSTEP's modular approach, each of these basic functions is assigned to a program of its own. For example, the functions of entering a text are separate from those required for formatting, which in turn are separate from those used to print the formatted text. On the other hand, the capacity of these basic functions themselves has been enhanced in order to meet the additional requirements of handling scholarly problems.
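The separation of functions described above can be illustrated with a minimal Python sketch. It is not TUSTEP code and does not reproduce any TUSTEP command or parameter; the function names (decompose, sort_key, sort_entries, merge_entries, format_index) and the sample text are hypothetical and chosen only to show each index function as a step of its own whose output feeds the next.

```python
import re
from itertools import groupby


def decompose(lines):
    """Step 1: split the text into (word form, line number) entries."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        for word in re.findall(r"[^\W\d_]+", line):
            entries.append((word, line_no))
    return entries


def sort_key(entry):
    """Step 2: define a collation key (assumed here: case-insensitive word, then line)."""
    word, line_no = entry
    return (word.casefold(), line_no)


def sort_entries(entries):
    """Step 3: re-arrange the entries in the required order."""
    return sorted(entries, key=sort_key)


def merge_entries(sorted_entries):
    """Step 4: build one index entry from successive identical word forms."""
    merged = []
    for word, group in groupby(sorted_entries, key=lambda e: e[0].casefold()):
        refs = sorted({line_no for _, line_no in group})
        merged.append((word, refs))
    return merged


def format_index(merged):
    """Step 5: format the merged entries for printing."""
    return [f"{word}  {', '.join(map(str, refs))}" for word, refs in merged]


text = ["Arma virumque cano", "Troiae qui primus ab oris", "arma cano"]
index = format_index(merge_entries(sort_entries(decompose(text))))
for entry in index:
    print(entry)
```

Because every step only consumes the output of the previous one, a single step (for example the collation key) can be exchanged without touching the others, which is the point of the modular design.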
The upshot of this approach is that even simple tasks, such as creating a list of word forms occurring in a literary text, require that a separate TUSTEP program be invoked for each step:
One of the advantages of this approach is that it provides an unrivalled degree of flexibility. This becomes even more evident when one examines the procedure required for generating a critical apparatus from the results of automatic collation. It is more or less identical to the one for the index just outlined, with the following differences: after opening the files, the program COMPARE is called (instead of PINDEX), because now, instead of decomposing the text into its basic elements (word forms), the differences found between the basic text (version A) and version B are to be written to a file. This procedure will be repeated for every other version C, D, . . . , with the results being recorded in the same file of variants. Before executing SORT, an additional program, PRESORT, will be used to establish the necessary sort keys. Then, as in the index example, the programs SORT, GINDEX (with the parameters adjusted to the other format of the entries), FORMAT and PRINT are used to generate the apparatus entries in their final form.
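The apparatus procedure can be sketched in the same spirit. The following Python fragment is only an analogy under stated assumptions: it uses the standard difflib module for word-level comparison, which is not the algorithm of COMPARE, and the names collate and apparatus, the sigla and the sample readings are invented for illustration. It shows the sequence of collating each version against the base text, collecting all variants in one place, sorting them, merging identical readings, and formatting the entries.

```python
from difflib import SequenceMatcher


def collate(base_words, witness_words, siglum):
    """Record every difference between the base text and one other version."""
    variants = []
    matcher = SequenceMatcher(a=base_words, b=witness_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        lemma = " ".join(base_words[i1:i2])
        reading = " ".join(witness_words[j1:j2])
        variants.append((i1, lemma, reading, siglum))
    return variants


def apparatus(base, witnesses):
    """Collate all versions, then sort, merge and format the variants."""
    variants = []
    for siglum, text in witnesses.items():
        # All versions write to the same collection of variants.
        variants.extend(collate(base.split(), text.split(), siglum))
    # The tuple (position, lemma, reading, siglum) serves as the sort key.
    variants.sort()
    merged = {}
    for pos, lemma, reading, siglum in variants:
        merged.setdefault((pos, lemma, reading), []).append(siglum)
    return [
        f"{pos + 1} {lemma}] {reading} {' '.join(sigla)}"
        for (pos, lemma, reading), sigla in merged.items()
    ]


base_a = "arma virumque cano Troiae qui primus ab oris"
others = {
    "B": "arma virumque cano Troie qui primus ab oris",
    "C": "arma virumque cano Troie qui primus ab horis",
}
entries = apparatus(base_a, others)
for entry in entries:
    print(entry)
```

In this sketch the word position in the base text stands in for the page-line reference a real apparatus would use, and additions or omissions are not given any special notation.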
For preparing the critical edition itself, the apparatus entries must be inserted into the text before typesetting can be carried out. This can be accomplished by first using the program COPY to add the necessary codes to the apparatus entries and then employing the program CORRECT for inserting the entries into the text at the appropriate locations.
Regarding point b, performance: naturally, the speed of a program depends on the hardware being used. Thus, TUSTEP (though relatively fast in any given environment) runs faster on a large mainframe or a workstation than on an XT-compatible PC. Due to its portability, TUSTEP lets the user switch to a faster hardware environment whenever the PC becomes too slow. Although there are limits to the amount of data that it can handle, TUSTEP has tried to set these unavoidable limits at a level far above those needed for normal purposes; for example, up to 1,000 lines per printed page are allowed, and the maximum size of a single text file which the TUSTEP editor (and the other programs) can handle is 2 GB.
Point a, the requirements of scholarly text processing (as opposed to the requirements of an office environment), can be illustrated by the impact they have even on simple functions like search-and-replace instructions. For example, in TUSTEP, the instructions necessary for automatically marking the elisions occurring in a Latin hexameter poem can be given in a minimum of 3 lines. Instructions of this type are available not only in the editor (which can also be used as a powerful search engine for large text files), but also in the other TUSTEP programs, where they are used to establish, for example, the sort keys for index entries containing non-Latin alphabetic characters and diacritics, or the sort keys for bibliographic records having a complex structure. The same pattern-matching instructions are required for finding, after automatic collation, any variants that contain orthographic differences only (inasmuch as these can be recognized by formulated rules), so that they can be separated from those to be listed in the apparatus.
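To give an impression of what such pattern-matching rules involve, here is a rough Python sketch using regular expressions. It does not show TUSTEP's own search-and-replace syntax. The elision rule is a deliberate simplification (a final vowel, or vowel plus m, before a word beginning with a vowel or h plus vowel, with word-initial i before a vowel treated as a consonant), the underscore marker is an arbitrary choice, and the sort key handles only Latin-script letters with combining diacritics.

```python
import re
import unicodedata

# Simplified rule: word-final vowel (optionally followed by "m"), whitespace,
# then a word starting with a vowel, optionally preceded by "h".
# A word-initial "i" followed by a vowel is treated as consonantal.
ELISION = re.compile(
    r"([aeiouy]m?)\s+(?=h?(?:[aeouy]|i(?![aeiou])))",
    re.IGNORECASE,
)


def mark_elisions(verse):
    """Join the two words of each presumed elision with an underscore."""
    return ELISION.sub(r"\1_", verse)


def sort_key(entry):
    """Sort key that ignores diacritics and case, with the original as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", entry)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), entry)


marked = mark_elisions("multum ille et terris iactatus et alto")
print(marked)

ordered = sorted(["Zola", "Érasme", "eclogue", "Écrit"], key=sort_key)
print(ordered)
```

A rule set adequate for real verse would also have to deal with punctuation between the words, with prodelision, and with hiatus, which is why such rules need to be formulated by the scholar rather than built into the program.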
The consequences that this systemic integration has for the design of a single program can be illustrated by the TUSTEP composing program. Naturally, the process of typesetting alters the physical representation of the text as seen in the editor. Automatic composition and pagination add not only hyphens but also page numbers and (in critical editions) line numbers. In other words, it alters the information needed for making references to individual text parts and their contents. Those references are required for generating tables of contents and all types of indexes and concordances.
The importance of preserving information concerning the three-dimensional structure of the text (pages, lines, and words or characters within the line) as it will appear in print is also the key to understanding one important feature in the format of TUSTEP text files: every line of text has its own page- and line-number. This page-line number is not part of the text but rather an internal prefix to every line. It is always available, but can be made visible to the degree desired by the user.
It is this page-line number that is updated when the text is reformatted by the composing program. In addition to the file needed for driving the photocomposer, the TUSTEP typesetting program provides a second output file. In this file, the text is encoded in the same way in which it was found in the input file (which is also the format required by the rest of the programs), but its page and line division is identical to that of the composed page.
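The role of the page-line number can be made concrete with a small Python model. It is a sketch under assumptions and says nothing about the actual TUSTEP file format: the Record class, the show and repaginate functions, and the greedy line filling without hyphenation are all hypothetical. What it illustrates is that the number is a prefix kept apart from the text, that it can be shown or hidden, and that re-flowing the text yields the same text with a page and line division matching the composed page.

```python
from dataclasses import dataclass


@dataclass
class Record:
    page: int   # internal prefix, not part of the text
    line: int   # internal prefix, not part of the text
    text: str


def show(records, with_numbers=True):
    """Render the records with or without their page-line prefix."""
    if with_numbers:
        return [f"{r.page}.{r.line}  {r.text}" for r in records]
    return [r.text for r in records]


def repaginate(records, width, lines_per_page):
    """Re-flow the words greedily and assign fresh page-line numbers."""
    words = [w for r in records for w in r.text.split()]
    lines, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return [
        Record(i // lines_per_page + 1, i % lines_per_page + 1, text)
        for i, text in enumerate(lines)
    ]


source = [
    Record(1, 1, "arma virumque cano Troiae qui primus ab oris"),
    Record(1, 2, "Italiam fato profugus Laviniaque venit"),
]
composed = repaginate(source, width=20, lines_per_page=2)
for row in show(composed):
    print(row)
```

An index built from the re-numbered records refers to the pages and lines of the printed book, not to those of the input file.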
This high degree of portability has effects also on the character sets which must be available within the system itself. We therefore try to provide TUSTEP with the special characters and fonts which are required for humanities applications (presently including Greek, Hebrew, Coptic, Syriac, Cyrillic, Arabic, and the International Phonetic Alphabet), as well as less-common combinations of diacritics. We try to ensure that these characters are available in all versions of the system, be it a PC, a workstation or a mainframe, and on all output devices that TUSTEP supports, usually giving priority to their availability for the device used for the final output, the photocomposer.