Ott, Wilhelm, Universität Tübingen, Germany, wilhelm.ott@uni-tuebingen.de
Ott, Tobias, Stuttgart Media University, Germany, ott@hdm-stuttgart.de
Gasperlin, Oliver, pagina GmbH publication technologies, Germany, oliver.gasperlin@pagina-tuebingen.de
With TXSTEP, we present and put up for discussion the prototype of a new, powerful XML-based tool for scholarly research in the text-based humanities. Its architecture is based on more than 40 years of experience in supporting humanities projects at the University of Tübingen and beyond.
The purpose of TXSTEP is not to provide another toolbox containing ready-made solutions for pre-defined problems. Of course, tools like these are adequate for many purposes; but we see no urgency to add a further one to the existing packages of this kind.
Rather, TXSTEP has been designed as a high-performance scripting environment for the serious humanities scholar and other professionals in text data processing who face problems not easily solvable by XSLT or other means. TXSTEP gives them complete control over every detail of the data processing part of their projects.
Software for serious humanities research has to have certain basic qualities:
TXSTEP tries to take these somewhat contradictory requirements into account by defining the fundamental operations necessary for the processing of textual data, and by providing a separate program module for each of these basic functions, which can be used without any knowledge of conventional programming or scripting languages.
These modules may be combined almost arbitrarily: each module reads from and writes to a single basic file structure, so the modules can be chained like Unix filters.
Where necessary, the user can adapt individual modules to special requirements by changing default parameters (e.g. to provide a sort key for a non-Latin alphabet) or by supplying additional ones (e.g. to omit the definite article from the sort key for titles in bibliographic records).
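The effect of such an additional sort-key parameter can be illustrated with a minimal sketch. It is written in Python, not in TXSTEP or TUSTEP syntax; the function name `title_sort_key` and the article list `ARTICLES` are hypothetical and chosen only for this illustration.

```python
# Illustration only: plain Python, not TXSTEP/TUSTEP syntax.
# ARTICLES is an assumed, deliberately short list of definite articles.
ARTICLES = ("the ", "der ", "die ", "das ")

def title_sort_key(title):
    """Return a case-insensitive sort key with a leading definite article omitted."""
    key = title.casefold()
    for article in ARTICLES:
        if key.startswith(article):
            return key[len(article):]
    return key

titles = ["The Tempest", "Hamlet", "Die Räuber", "Macbeth"]
sorted_titles = sorted(titles, key=title_sort_key)
print(sorted_titles)
```

The titles keep their original form in the output; only the key used for ordering is changed, which is the distinction between a record and its sort key.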
However limited the scope of the individual modules may be, the flexibility of their combination is illustrated by the fact that, for example, there is no dedicated program for generating an alphabetical word list. For this purpose the user combines three modules: the module for text decomposition (for which the user supplies the parameters defining the individual elements and the sort keys), the SORT module, and the module that reduces identical or partially identical records in the sorted file to single index entries and adds, when required, information such as frequency counts and/or references to the source text.
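The three-step combination described here (decompose, sort, reduce) can be sketched as follows. This is a Python illustration of the principle under simplifying assumptions, not TXSTEP or TUSTEP code: the function names are hypothetical, a "record" is assumed to be a tuple of sort key, word form and line number, and the sort key is simply the lower-cased word.

```python
from itertools import groupby

# Illustration only: plain Python, not TXSTEP/TUSTEP syntax.

def decompose(lines):
    """Step 1: split the text into (sort_key, word, line_no) records."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        for raw in line.split():
            word = raw.strip(".,;:!?\"'()")
            if word:
                records.append((word.lower(), word, line_no))
    return records

def reduce_to_index(sorted_records):
    """Step 3: collapse records sharing a sort key into one index entry
    carrying a frequency count and the line references."""
    index = []
    for key, group in groupby(sorted_records, key=lambda r: r[0]):
        group = list(group)
        refs = sorted({line_no for _, _, line_no in group})
        index.append((key, len(group), refs))
    return index

def word_list(lines):
    """Chain the steps; step 2 is an ordinary sort on the records."""
    return reduce_to_index(sorted(decompose(lines)))

text = ["The cat saw the dog.", "The dog ran."]
for entry, freq, refs in word_list(text):
    print(entry, freq, refs)
```

Each step consumes and produces the same kind of record list, which is what makes it possible to recombine the steps freely, for instance by replacing the sort key to obtain a reverse index.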
The modules provided by TXSTEP include:
As the output of any one of these modules may serve as input to any other module, the range of research problems for which this system may be helpful is quite wide.
In fact, TUSTEP, the Tübingen System of Text Processing Tools, has been developed over the past 40 years along these lines. It has been, and still is, used successfully for many humanities projects in the German-speaking part of the world, as can be seen at www.tustep.org.
However, since TUSTEP’s syntax is proprietary, not intuitive and reputed to be difficult to learn, users tend to help themselves with other, often less effective, tools or less specialized programming languages.
TXSTEP responds to this situation by providing a user-friendly XML syntax that allows beginners and advanced programmers alike to use the whole scope of TUSTEP services in a modern, established scripting environment. The benefits are clear: support of an open standard, widespread dissemination, programming in any XML editor, syntax highlighting, code completion and intelligible APIs. Moreover, TXSTEP benefits from the fact that the actual core of the program does not need to be changed. TUSTEP itself is open source, and TXSTEP will soon be as well.
Development of TXSTEP began in 2009, when Tobias Ott, research associate and lecturer at Stuttgart Media University and CEO of pagina GmbH (a service provider for publishing houses), first came up with the idea of building an XML interface to the syntax of TUSTEP commands. This would remove, all at once, most of the barriers that usually prevent people from using TUSTEP:
In the meantime, this idea resulted in a prototype of TXSTEP which we plan to demonstrate in more detail during the poster session. The prototype already contains the most important features of all the modules of TUSTEP listed above.
Not contained in TXSTEP is TUSTEP’s typesetting module, which has been designed to meet the ambitious layout demands of publications in humanities research, including those of critical editions. Users may, however, run it in the original TUSTEP environment to publish in print the results obtained with TXSTEP, or may even include it, in original TUSTEP syntax, in their TXSTEP scripts.
One of the features of TXSTEP is its capability to process almost all forms of textual data, whether XML data or plain text files. TXSTEP is therefore also the right tool when textual data first have to be processed in order to obtain, for example, TEI data, or when the markup of insufficiently tagged XML data has to be enhanced.
The proposed demo is based on this prototype and shows the current state of our work in progress. The demonstration of TXSTEP’s functionality will include tasks which cannot easily be performed by existing XML tools.