TUSTEP: design principles

Modularity, Professionality, Integration:
A Conception Revisited

Wilhelm Ott, Universität Tübingen

Paper prepared for the ALLC-ACH 92 Conference, Oxford, 6-9 April 1992
First published in: ALLC-ACH 92, Conference Abstracts and Programme, Oxford: Christ Church 1992, pp. 197-199

More than fifteen years ago, at the 1976 ALLC symposium in Oxford, I outlined some of the features of the software for processing textual data which we were then developing at Tübingen University. I illustrated them by the solutions they can offer to specific problems occurring in the preparation of critical editions (cf. CHum 13 (1979) 29-35).

As of last year (thanks to a substantial grant we received from the State of Baden-Württemberg from 1985 to 1989), TUSTEP, the "TUebingen System of TExt processing Programs", is bilingual: in addition to German, it now understands English commands and produces English messages. Before that, TUSTEP had been accessible as a tool for German-speaking users only. At present (Jan. 1992), the German version is being used by humanities projects in about 75 universities, most of them on the European continent. The bilingual version now makes it more readily accessible to users outside the German-speaking part of the world.

For this reason, it may be worthwhile to review some of TUSTEP's goals and principles of design:

Modularity: TUSTEP is a collection of relatively independent programs, each of which offers a well-defined subset of basic operations for processing textual data. It is the task of the user to combine these basic functions in order to arrive at the solution to a given problem.
Professionality: The individual modules are especially designed to meet the requirements of processing textual data in professional academic research. These requirements go beyond those found in the office environment.
Integration: The functions provided by TUSTEP are designed to handle all stages of a project: initial data input, the various steps of analysis and processing, and the final output for conventional publications or for exporting the results to a database. Normally, therefore, there is no need to switch to a different system during the course of a project.
Portability: Programs and data can be transferred unchanged between the most common types of micro and mainframe computers. This feature not only eliminates problems caused by the fast changes in hardware technology but also allows scholars at different sites to collaborate.

1. Modularity

TUSTEP's basic design concept proceeds from the idea that the user should be entrusted with the responsibility for the data processing part of his project as well. This rules out providing the user with ready-made solutions. On the other hand, the scholar must be provided with tools that allow him to achieve satisfactory results without having to write a program himself. These tools consist of the above-mentioned program modules for the basic functions of text analysis and processing, which the user has to combine in order to obtain the solution to the problem at hand.

Word processing and desktop publishing programs lack these basic requirements: they contain too many functions in a single program unit. For example, a word processing program normally contains the functions of entering, correcting, formatting and printing a text. The same applies to index and concordance programs, where the functions for finding the index entries in the text, defining a collation sequence for each entry, re-arranging entries in the required order, building index entries from successive identical or partially identical entries after the sort, and formatting them for printing are all contained in a single program.

In TUSTEP's modular approach, each of these basic functions is assigned to a program of its own, for example, the functions of entering a text are separate from those required for formatting, which in turn are separate from those used to print the formatted text. On the other hand, the capacity of these basic functions themselves has been enhanced in order to meet the additional requirements of handling scholarly problems.

The upshot of this approach is that even simple tasks, such as creating a list of word forms occurring in a literary text, require that a separate TUSTEP program be invoked for each step:

Open the file containing the text (using the command OPEN).
Create a file for the list of word forms (command CREATE).
Decompose the text into word forms with the program PINDEX, where user-defined parameters specify how the text is to be decomposed and define the collation sequence for the subsequent sort.
Arrange the entries in alphabetic order (command SORT).
Compile index entries from the sorted word forms (program GINDEX).
Prepare the index for n-column printing by using the program FORMAT and supplying the appropriate parameters.
Print the index on the desired printer (command PRINT).
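
The division of labour in the steps above can be sketched in a general-purpose language. The following Python fragment is only an illustration of the modular principle, not TUSTEP code: the function names, the case-insensitive collation rule, and the column layout are assumptions made for this sketch. Each function does one job, and the output of each serves as the input to the next.

```python
import re


def decompose(text):
    """Split a text into word forms (the step the paper assigns to PINDEX)."""
    return re.findall(r"[^\W\d_]+", text)


def sort_key(word_form):
    """Assumed collation rule: compare word forms without regard to case."""
    return word_form.casefold()


def sort_entries(word_forms):
    """Arrange the word forms in collation order (the SORT step)."""
    return sorted(word_forms, key=sort_key)


def compile_index(sorted_forms):
    """Merge successive identical entries into (form, count) pairs
    (the step the paper assigns to GINDEX)."""
    entries = []
    for form in sorted_forms:
        key = sort_key(form)
        if entries and entries[-1][0] == key:
            entries[-1][1] += 1
        else:
            entries.append([key, 1])
    return [(form, count) for form, count in entries]


def format_columns(entries, columns=2, width=16):
    """Lay the entries out in n columns (the FORMAT step)."""
    cells = [f"{form} ({count})".ljust(width) for form, count in entries]
    rows = [cells[i:i + columns] for i in range(0, len(cells), columns)]
    return "\n".join("".join(row).rstrip() for row in rows)


if __name__ == "__main__":
    text = "Arma virumque cano, Troiae qui primus ab oris arma"
    index = compile_index(sort_entries(decompose(text)))
    print(format_columns(index, columns=2))
```

Because the collation rule lives in a single function, a different alphabet or a different treatment of diacritics would only require replacing `sort_key`; the other steps remain untouched.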

One of the advantages of this approach is that it provides an unrivalled degree of flexibility. This becomes even more evident when one examines the procedure required for generating a critical apparatus from the results of automatic collation. It is more or less identical to the one for the index just outlined, with the following differences: after opening the files, the program COMPARE is called (instead of PINDEX), because now, instead of decomposing the text into its basic elements (word forms), the differences found between the basic text (version A) and version B are to be written to a file. This procedure will be repeated for every other version C, D, . . . , with the results being recorded in the same file of variants. Before executing SORT, an additional program, PRESORT, will be used to establish the necessary sort keys. Then, as in the index example, the programs SORT, GINDEX (with the parameters adjusted to the other format of the entries), FORMAT and PRINT are used to generate the apparatus entries in their final form.
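
The same idea can be sketched for the apparatus. The Python fragment below is a hedged illustration only: it uses the standard library's `difflib` as a stand-in for automatic collation, which is not the algorithm of the COMPARE program, and the entry layout (position, lemma, reading, sigla) is an assumption of this sketch. Variants from every version are written to one common list, sorted, and then merged where successive entries agree.

```python
from difflib import SequenceMatcher


def compare(base_words, witness_words, siglum):
    """Record the differences between the base text and one other version."""
    matcher = SequenceMatcher(a=base_words, b=witness_words, autojunk=False)
    variants = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        lemma = " ".join(base_words[a1:a2])       # empty for an addition
        reading = " ".join(witness_words[b1:b2])  # empty for an omission
        variants.append((a1, lemma, reading, siglum))
    return variants


def build_apparatus(base_words, witnesses):
    """Collect the variants of all versions, sort them, and merge
    successive identical readings into one entry with several sigla."""
    variants = []
    for siglum, words in witnesses.items():
        variants.extend(compare(base_words, words, siglum))
    variants.sort()
    entries = []
    for position, lemma, reading, siglum in variants:
        if entries and entries[-1][:3] == [position, lemma, reading]:
            entries[-1][3].append(siglum)
        else:
            entries.append([position, lemma, reading, [siglum]])
    return entries


def format_entry(entry):
    """Render one apparatus entry; word positions are counted from 1."""
    position, lemma, reading, sigla = entry
    return f"{position + 1} {lemma}] {reading} {' '.join(sigla)}"


if __name__ == "__main__":
    base = "in principio erat verbum".split()
    witnesses = {
        "B": "in principio erat uerbum".split(),
        "C": "in initio erat uerbum".split(),
    }
    for entry in build_apparatus(base, witnesses):
        print(format_entry(entry))
```

Only the first step differs from the index sketch in spirit: sorting, merging successive entries, and formatting follow the same pattern.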

For preparing the critical edition itself, the apparatus entries must be inserted into the text before typesetting can be carried out. This can be accomplished by first using the program COPY to add the necessary codes to the apparatus entries and then employing the program CORRECT to insert the entries into the text at the appropriate locations.

2. Professionality

Professionality of (text) data processing tools for academic research entails three aspects.

a. The functions offered by the individual programs must meet the requirements of scholarly research;
b. The capacity of these programs (i.e., their performance and the amount of data they can handle) must conform to professional needs; and
c. The (visible) results must not look amateurish; if printed, they have to show professional letterpress quality.

Point c refers to printing the results of computer-aided studies (for example an edition, a bibliography, a concordance, or even a monograph) in the conventional manner as found in a typeset publication. Although TUSTEP's typesetting program cannot guarantee that the final product will conform to professional typographic standards (that, of course, depends on the user), it at least makes it possible to reach them, one reason being that it supports not only laser printers, but also professional composing equipment.

Regarding point b, performance: naturally, the speed of a program depends on the hardware being used. Thus, TUSTEP (though relatively fast in any given environment) runs faster on a large mainframe or a workstation than on an XT-compatible PC. Due to its portability, TUSTEP lets the user switch to a faster hardware environment whenever the PC becomes too slow. Although there are limits to the amount of data that it can handle, TUSTEP has tried to set these unavoidable limits at a level far above those needed for normal purposes; for example, up to 1,000 lines per printed page are allowed, and the maximum size of a single text file which the TUSTEP editor (and the other programs) can handle is 2 GB.

Point a, the requirements of scholarly text processing (as opposed to the requirements of an office environment), can be illustrated by the impact they have even on simple functions like search-and-replace instructions. For example, in TUSTEP, the instructions necessary for automatically marking the elisions occurring in a Latin hexameter poem can be given in a minimum of 3 lines. Instructions of this type are available not only in the editor (which can also be used as a powerful search engine for large text files), but also in the other TUSTEP programs, where they are used to establish, for example, the sort keys for index entries containing non-Latin alphabetic characters and diacritics, or the sort keys for bibliographic records having a complex structure. The same pattern-matching instructions are required for finding, after automatic collation, any variants that contain orthographic differences only (inasmuch as these can be recognized by formulated rules), so that they can be separated from those to be listed in the apparatus.
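
To give an impression of what such a pattern-matching instruction has to express, here is a minimal sketch in Python; it does not reproduce TUSTEP's own instruction syntax. It assumes a deliberately simplified rule: a word ending in a vowel or diphthong, optionally followed by m, is elided before a word beginning with a vowel or with h plus a vowel. The bracket notation for the mark is likewise an assumption of this sketch. Hiatus, prodelision, consonantal i and u, and elision across punctuation are ignored.

```python
import re

# Simplified elision rule (an assumption of this sketch, not TUSTEP syntax):
# a word-final vowel or diphthong, optionally followed by "m", standing
# before a word that begins with a vowel or with "h" plus a vowel.
ELISION = re.compile(
    r"((?:ae|oe|au|[aeiouy])m?)(\s+)(?=h?[aeiouy])",
    re.IGNORECASE,
)


def mark_elisions(line):
    """Enclose every elided word ending in parentheses."""
    return ELISION.sub(r"(\1)\2", line)


if __name__ == "__main__":
    print(mark_elisions("litora, multum ille et terris iactatus et alto"))
    # prints: litora, mult(um) ill(e) et terris iactatus et alto
```

A rule of this kind is stated once as a pattern; the same pattern could equally serve to select lines, to count elisions, or to build a sort key.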

3. Integration

The integration necessary for a system such as TUSTEP is achieved by a feature which is as simple as it is fundamental for a flexible system: the output of any program may serve as the input to any other program. Thus, any information added by one program can be processed by all other programs.

The consequences that this systemic integration has for the design of a single program can be illustrated by the TUSTEP composing program. Naturally, the process of typesetting alters the physical representation of the text as seen in the editor. Automatic composition and pagination add not only hyphens but also page numbers and (in critical editions) line numbers. In other words, it alters the information needed for making references to individual text parts and their contents. Those references are required for generating tables of contents and all types of indexes and concordances.

The importance of preserving information concerning the three-dimensional structure of the text (pages, lines, and words or characters within the line) as it will appear in print is also the key to understanding one important feature in the format of TUSTEP text files: every line of text has its own page- and line-number. This page-line number is not part of the text but rather an internal prefix to every line. It is always available, but can be made visible to the degree desired by the user.

It is this page-line number that is updated when the text is reformatted by the composing program. In addition to the file needed for driving the photocomposer, the TUSTEP typesetting program provides a second output file. In this file, the text is encoded in the same way in which it was found in the input file (which is also the format required by the rest of the programs), but its page and line division is identical to that of the composed page.
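
The role of the page-line number can be illustrated by a small Python sketch. It is not a description of TUSTEP's file format: the `Line` record, the `compose` function, and the use of `textwrap` in place of real composition (which would also hyphenate) are assumptions made for this illustration. The point is that the re-broken text is returned in the same record format as the input, with page and line numbers matching the new layout, so that a reference generator works on either file.

```python
import textwrap
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """One line of text with its page-line number as a prefix field."""
    page: int
    line: int
    text: str


def compose(lines, width, lines_per_page):
    """Re-break the text to the given measure and return it in the input
    format, numbered according to the composed pages and lines."""
    running_text = " ".join(ln.text for ln in lines)
    broken = textwrap.wrap(running_text, width=width)
    return [
        Line(i // lines_per_page + 1, i % lines_per_page + 1, text)
        for i, text in enumerate(broken)
    ]


def references(lines, word):
    """List the page.line references of every line containing the word."""
    return [f"{ln.page}.{ln.line}" for ln in lines if word in ln.text.split()]


if __name__ == "__main__":
    manuscript = [
        Line(1, 1, "arma virumque cano Troiae qui"),
        Line(1, 2, "primus ab oris"),
    ]
    composed = compose(manuscript, width=22, lines_per_page=2)
    print(references(manuscript, "oris"))  # reference in the input file
    print(references(composed, "oris"))    # reference in the composed text
```

An index built from the second file therefore cites the pages and lines of the printed book rather than those of the input file.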

4. Portability

A text-processing system that is to be used in large projects must also offer certain organizational features. Among the most important are its applicability in a variety of hardware environments and its continued availability over many years. TUSTEP therefore tries to hide from the user the differences between, for example, PC and mainframe, between an ASCII-based and an EBCDIC-based machine, or between different (versions of) operating systems. It does so by incorporating functions such as file handling and the definition of new commands, functions that are normally assumed by the job control language of the operating system. This not only saves the user the trouble of having to relearn commands when switching to a computer with a different operating system, but also lets him transport his data to the new environment, where he can continue to use the existing TUSTEP command sequences unchanged.

This high degree of portability also affects the character sets which must be available within the system itself. We therefore try to provide TUSTEP with the special characters and fonts which are required for humanities applications (presently including Greek, Hebrew, Coptic, Syriac, Cyrillic, Arabic, and the International Phonetic Alphabet), as well as less common combinations of diacritics. We try to ensure that these characters are available in all versions of the system, be it on a PC, a workstation or a mainframe, and on all output devices that TUSTEP supports, usually giving priority to their availability for the device used for the final output, the photocomposer.
