Authors & Presenters

Bauman, Syd, Brown University, USA, Syd_Bauman@Brown.edu
Hoover, David, New York University, USA, david.hoover@nyu.edu
van Dalen-Oskam, Karina, Huygens Institute, The Netherlands, karina.van.dalen@huygens.knaw.nl
Piez, Wendell, Mulberry Technologies, Inc., USA, wapiez@mulberrytech.com

Aim and Organization

The main aim of this panel discussion is to bring together text encoding specialists and text analysis researchers. Recent DH conferences have comprised, in addition to other activities, two distinct sub-conferences – one focusing on text encoding in general and TEI in particular, and the other on text analysis, authorship attribution, and stylistics. The separation between the two is so extreme that their participants often meet only at breaks and social events. This is reflected in the way text encoding specialists and text analysis scholars do their work as well: they hardly ever work together on the same projects. Because of this lack of connection, some of the long-promised benefits of markup for analysis remain unrealized. This panel takes a step toward bridging the gap between markup and analysis.

We focus on both the causes for the gap and possible solutions. What could and should markup do that it doesn’t currently do? Why do analysts rarely work with the huge number of texts already encoded? How can text encoders and those who process encoded texts make their work more useful to text analysts, and how can text analysis specialists help encoders make their texts more useful? What opportunities currently exist for collaboration and cooperation between encoders and analysts, and how can more productive opportunities be created?

Panel Topic

The reasons for the present gap between markup and analysis are partly technical and partly non-technical, and arise from the disparate aims and methods of the two camps. While markup systems have generally been designed to meet the needs of (scholarly) publishing, markup adherents have often claimed that their markup is also useful for analytic purposes. However, the very concept of ‘markup’ itself is different for the two constituencies. XML, in particular, isn’t ‘markup’ in the full sense used by specialists in text processing. Rather, it is a data structuring methodology that imposes a single unitary hierarchy upon the text. Consequently, it is a poor instrument for the complete interpretive loop or spiral, where we start with text (with or without markup), perform analysis, use markup to record or ‘inscribe’ our findings into the text, and then return to analysis at a higher level. This is largely because the inscription step is usually inhibited by any prior (XML) markup. Consider two flowcharts of document processing workflows at http://piez.org/wendell/papers/dh2010/743_Fig2a.jpg (what XML provides) and http://piez.org/wendell/papers/dh2010/743_Fig2b.jpg (what we need). (As noted on the page at http://piez.org/wendell/papers/dh2010, these images were presented as part of a paper delivered at Digital Humanities 2010 in London [Piez 2010].) The difference between these is essentially that in the current (XML-based) architecture, extending and amending our document schemas and processing require re-engineering the system itself; the stable system (which is designed to support publishing, not research) does not naturally sustain that activity. A system that supported markup in the sense that text analysis requires – one supporting, among other possibilities, multiple concurrent overlapping hierarchies (including rhetorical, prosodic, narrative, and other organizations of texts) and arbitrary overlap (including overlap between similar types of elements or ranges) – would also support incremental development of processing to take advantage of any and all markup that researchers see fit to introduce.
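
To illustrate the overlap problem with a deliberately simplified, hypothetical fragment: a sentence that runs across a verse-line boundary cannot be tagged directly in XML, because the sentence and line elements would overlap, so encoders must fall back on workarounds such as empty milestone elements, which ordinary element-based processing does not see as structure:

    <!-- What the analysis calls for: a sentence (<s>) crossing a verse-line
         boundary. This is NOT well-formed XML, because <s> and <l> overlap. -->
    <l><s>A sentence may begin on one verse line</l>
    <l>and end on the next.</s></l>

    <!-- A common workaround: one hierarchy is reduced to empty milestones,
         and its 'elementhood' is lost to ordinary element-based processing. -->
    <l><anchor type="s-start"/>A sentence may begin on one verse line</l>
    <l>and end on the next.<anchor type="s-end"/></l>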

Part of the solution to this problem lies in the emergence of standard methodologies for encoding annotations above or alongside one or more ‘base’ layers of markup, perhaps using standoff markup or alternatives to XML. The details are less important than the capabilities inherent in a data model not limited to unitary trees (see Piez 2010 for discussion; for a more wide-ranging critique of markup, see Schmidt 2010). Over the long term, given suitable utilities and interfaces, textual analysts may be able to use such systems productively; in the medium term this is more doubtful.
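
By way of illustration, a standoff approach might keep analytic annotations outside the base encoding and point back into it, roughly as follows (a hypothetical, simplified sketch using TEI-style pointing, offered as an illustration rather than an endorsement of any particular scheme); the annotation layers are free to overlap one another and the base hierarchy:

    <!-- Base layer: a lightly tokenized text with stable identifiers -->
    <p><w xml:id="w1">Call</w> <w xml:id="w2">me</w> <w xml:id="w3">Ishmael</w>.</p>

    <!-- Standoff layers: analytic annotations pointing into the base text -->
    <spanGrp type="discourse">
      <span from="#w1" to="#w3" ana="#direct-address"/>
    </spanGrp>
    <spanGrp type="names">
      <span from="#w3" to="#w3" ana="#personal-name"/>
    </spanGrp>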

This leads to the non-technical part of the problem: largely because XML is not very well suited to their requirements, text analysis tools typically cannot handle arbitrary encoding in XML along with the texts themselves, while at the same time there is not yet a viable alternative encoding technology, specified as a standard and suitable for supporting interchange. And so an analyst must begin by processing the encoded text into a usable form. While the markup specialist may be able (perhaps easily) to perform such a transformation on his or her XML texts using XSLT, this is usually more difficult for the text analyst, for several reasons. Moreover, we submit that many or most of these difficulties are not simply due to known limitations of current text-encoding technologies as described above, but will also persist in new, more capable environments.
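
To make the asymmetry concrete: for someone already fluent in XSLT, a first, naive version of such a transformation can be very small. The following sketch (which assumes a TEI P5 document in the usual namespace, and defers every complication discussed below) simply skips the header and some editorial apparatus and serializes the remaining text content:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <xsl:output method="text"/>

      <!-- Start from the text proper, skipping the metadata in <teiHeader> -->
      <xsl:template match="/">
        <xsl:apply-templates select="//tei:text"/>
      </xsl:template>

      <!-- Suppress a few things we do not want treated as running text -->
      <xsl:template match="tei:note | tei:figDesc"/>

      <!-- The built-in rules copy all remaining text nodes to the output -->
    </xsl:stylesheet>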

Markup processing technologies such as XSLT are rarely part of the text analyst’s armamentarium, and the benefits of XSLT (and we imagine this would be the case with any successor as well) are often not clear enough to the text analyst to justify its significant learning curve. XSLT need not be difficult, but it can be challenging to assimilate – especially because it seems superficially similar to text analysis. Those who do learn XSLT will find some of the tasks most useful to them relatively easy (e.g., filtering markup and splitting/aggregating texts). XSLT 2.0 also includes regular expressions, grouping, and stylesheet functions, and handles plain text input gracefully, making it a much more hospitable environment than XSLT 1.0 for text analysis. Yet more often than not, these features serve only to make XSLT tantalizing as well as frustrating to those for whom markup processing cannot be a core competency.
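
For instance, a handful of lines of XSLT 2.0 will read a plain text file and emit a rough word-frequency list; in the sketch below, the input file name and the regular expression standing in for ‘word’ are placeholders that a real project would need to adjust:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- 'novel.txt' is a hypothetical input, supplied as a parameter -->
      <xsl:param name="src" select="'novel.txt'"/>

      <xsl:template name="main">
        <!-- Tokenize on anything that is not a letter or an apostrophe -->
        <xsl:for-each-group group-by="."
            select="tokenize(lower-case(unparsed-text($src)), '[^\p{L}'']+')[. ne '']">
          <xsl:sort select="count(current-group())" order="descending"/>
          <xsl:value-of select="concat(current-grouping-key(), ' ',
                                       count(current-group()), '&#10;')"/>
        </xsl:for-each-group>
      </xsl:template>
    </xsl:stylesheet>

(With Saxon, for example, this could be run by naming main as the initial template and passing src on the command line.)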

Moreover, intimate familiarity with the encoding system is typically necessary to process it correctly, yet the text analyst is frequently working with texts that he or she was not a party to creating. Many texts are encoded according to the TEI Guidelines, but analysts are often not experts in TEI in general, let alone expert in a particular project’s application of TEI. But such expertise is often required, as even ‘simple’ word extraction from an arbitrary TEI text can be problematic. Certainly, plenty of TEI encodings make the task non-trivial. Consider in particular the task of ingesting an arbitrary unextended TEI P5 text and emitting a word list, assuming that we can ignore one important detail: word order. What assumptions are necessary? What information will the data provider need to supply to the transformation process? Below are some preliminary considerations toward a tool that would present the user with a form to fill out and return an XSLT stylesheet for extracting words, followed by a sketch of what such a stylesheet might look like (note that many of these considerations would benefit from collaboration between the text encoder and the text analyst):

  • What metadata can be ignored? E.g., if a default rendition of display:none applies to an element, are its contents words?
  • Which element boundaries always/never/sometimes imply word breaks? When ‘sometimes’, how can we tell?
  • Which hierarchical content structures (if any) should be ignored? (E.g., colophons, forewords, prefaces) Which trappings? (E.g., <interpGrp>, <figDesc>)
  • Which elements pointed to by an <alt> element (or child of a <choice>) get ignored, and which get included?
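
Given answers to questions such as these, the generated stylesheet might amount to little more than a handful of template rules that freeze those answers in place. The skeleton below is purely illustrative: the elements suppressed or treated as word-breaking here stand in for whatever a particular project’s answers turn out to be.

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <xsl:output method="text"/>

      <!-- Metadata and descriptive trappings contribute no words -->
      <xsl:template match="tei:teiHeader | tei:figDesc | tei:interpGrp"/>

      <!-- Of the alternatives inside <choice>, keep only the regularized readings -->
      <xsl:template match="tei:choice">
        <xsl:apply-templates select="tei:reg | tei:corr | tei:expan"/>
      </xsl:template>

      <!-- These element boundaries imply word breaks; others are left alone -->
      <xsl:template match="tei:p | tei:l | tei:head | tei:item">
        <xsl:apply-templates/>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>

      <!-- Everything else falls through to the built-in rules, which emit text -->
    </xsl:stylesheet>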

The complexity of this analysis compounds the difficulty already described. And it must be performed anew for every encoded text or set of texts the analyst wishes to consider.

Furthermore, the publishing focus noted above means that text-encoding projects rarely encode the elements most useful for analysis, and you can only get out of markup what the encoder puts in. An analyst interested in how names are used in literary texts, for example, will find that even projects that encode proper names often encode only the names of authors and publishers (not very interesting for text analysis), or only personal names (the analyst may also be interested in place names and names of other things). Likewise, adding the desired markup to encoded texts requires that the analyst conform to (or perhaps modify) the existing encoding scheme, imposing another learning curve above that for the XSLT needed to extract the names once they have been encoded. Consequently, scholars interested in text analysis typically find it is more efficient to use plain texts without any markup, and then apply an ad hoc system of manual tagging, or named entity recognition (NER) tools in combination with manual correction of tagging. The analyst who wants to extract the speeches of characters in already-encoded plays will sometimes discover that the speeches are encoded in such a way that automatic extraction is quite problematic (e.g., at Documenting the American South). It is a rare literary encoding project that provides texts with the dialog encoded for speaker so that the speech of each character can be extracted (even apart, again, from the overlap problem) – a kind of analysis made popular by Burrows’s pioneering work (Burrows 1987), but very labor-intensive.
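
Where the dialog is encoded with TEI’s <sp> elements carrying @who attributes, the extraction itself can be a short grouping exercise. The sketch below assumes the simplest possible case, with consistent @who values, speech content held in <p>, <l>, or <lg> children, and none of the overlap complications noted above:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <xsl:output method="text"/>

      <xsl:template match="/">
        <!-- Group every speech in the play by the character it is assigned to -->
        <xsl:for-each-group select="//tei:sp" group-by="@who">
          <xsl:value-of select="current-grouping-key()"/>
          <xsl:text>&#10;</xsl:text>
          <!-- Emit the speech content, leaving out the <speaker> labels -->
          <xsl:value-of select="current-group()/(tei:p | tei:l | tei:lg/tei:l)"
                        separator="&#10;"/>
          <xsl:text>&#10;&#10;</xsl:text>
        </xsl:for-each-group>
      </xsl:template>
    </xsl:stylesheet>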

Finally, even if texts with the desired encoding are available and the analyst is willing to learn the basics of XSLT, the XSLT typically has to be rewritten or tweaked for each new collection of texts examined, because of differences in encoding schemes. And the ambitious encoding projects most likely to encode elements of interest to the text analyst are also the most likely to have complex, individualized, and difficult-to-process encoding (e.g., the Brown University Women Writers Project). From the analyst’s point of view, the process of using existing encoding may be more time-consuming and frustrating than doing the work manually. Surely this state of affairs is undesirable.

Yet there is another way forward besides text analysts learning new skills or text encoders offering their own text-extraction tools. While the problems described here add up to a formidable challenge, the very fact that we can enumerate them suggests that we are not entirely helpless. There is much work that can be done both to refine our understanding, and to develop tools and methodologies that will help to bridge this divide. We suggest beginning with closer collaboration between the two camps. If each supports, works with, and learns from the other, both sides will benefit, and we will have a better foundation of understanding on which to build the next generation of technologies – technologies that will be valuable for both camps.

References

Willa Cather Archive. http://cather.unl.edu/

The Brown University Women Writers Project. http://www.wwp.brown.edu/

Documenting the American South. http://docsouth.unc.edu/

Piez, W. (2010). Towards hermeneutic markup: an architectural outline. DH2010, King’s College London, July 9.

Burrows, J. (1987). Computation into Criticism. Oxford: Clarendon Press.

Schmidt, D. (2010). The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing 25: 337-356.

Burnard, L., and S. Bauman (eds.). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 1.7.0. 2010-07-06. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html