XSLT has often been criticized for its verbosity.1 Control structures, in particular, can require many more lines of code (and many more indications of start and end state) than nearly any modern programming language. Yet for all this, XSLT has features that make it an ideal language both for code generation (automatic generation of XSLT using XSLT itself) and as a target output format for other languages. In this paper, we discuss our attempts to tackle the problem of collections interoperability by exploiting both features of XSLT.
The rules governing the validity of particular XML instances are usually set forth in one of several standard forms of specification (such as RELAX NG,2 XML Schema,3 or the older Document Type Definition4). Such schemas typically specify the nature, number, and sequence of permissible elements, ensure their correspondence to particular data types, and enforce the overall referential integrity of the instance. We have developed a system (called Abbot) that uses schema definitions located in a target schema to generate a stylesheet that will effect the transformation of one or more document collections into instances that validate against that same schema.5
In the trivial case, where the target schema describes a proper subset of the collections in question, Abbot operates more-or-less automatically, but more complex transformations are also possible. One can, for example, give the system two collections and have it generate a stylesheet that makes one collection conform to the schema of the other. One can also make several collections target an entirely different schema. In these latter cases, it becomes necessary to describe particular mappings in a configuration file, but that configuration uses a simple syntax unrelated to that of either a schema language or XSLT.
The key step here is the automatic generation of an XSLT stylesheet. Our choice of XSLT as the language that generates that stylesheet might at first seem slightly perverse, but because XSLT is a homoiconic language – a language in which the primary representation of the language is itself a data structure in that same language – code generation can be undertaken through the use of metaprogramming (in which code is passed into another, more abstract layer and evaluated).6
The Abbot system begins by running a ‘meta-stylesheet’ (analogous to a higher-order function in a traditional functional language) on both a target schema and a configuration file. The configuration file, while not written in XSLT, is nonetheless converted into XSLT by the surrounding runtime (using a translation method we discuss below). By default, that target schema describes TEI Analytics – a TEI subset that provides an encoding scheme optimized for text analysis. This meta-stylesheet generates, as its only output, a conversion stylesheet used for the actual transformation of the documents. This latter transformation yields files that will, in the majority of cases, validate against the target schema.
When Abbot reads the target schema, it accounts for all elements and associated attributes and generates a default XSLT template for each element. These default templates reflect the general assumption that elements and attributes in the input files resemble their counterparts in the target schema. If, for example, <foo n="001"/> exists in the input file and is specifically allowed in the target schema, then Abbot will pass the element through unaltered under the assumption that the element is fully valid. Anything beyond that needs to be articulated in the configuration file.
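The default-template step can be illustrated with a minimal Python sketch. It is not Abbot's meta-stylesheet (which is itself written in XSLT); it only shows the idea of reading element and attribute declarations from a schema and emitting one pass-through template per element. The miniature RELAX NG schema and the function names `own_attributes`, `harvest`, and `default_stylesheet` are hypothetical, and the sketch assumes attributes are declared inline rather than through `ref` patterns, which a real TEI schema would use heavily.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Hypothetical miniature target schema; a real target schema is far larger.
SAMPLE_SCHEMA = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="foo"/></start>
  <define name="foo">
    <element name="foo">
      <optional><attribute name="n"/></optional>
      <text/>
    </element>
  </define>
</grammar>
"""


def own_attributes(element):
    """Collect attribute names declared inside an element pattern,
    without descending into nested element patterns."""
    names = []
    queue = list(element)
    while queue:
        node = queue.pop(0)
        if node.tag == f"{{{RNG_NS}}}element":
            continue  # attributes in here belong to the nested element
        if node.tag == f"{{{RNG_NS}}}attribute" and node.get("name"):
            names.append(node.get("name"))
        queue.extend(node)  # keep walking through optional, group, etc.
    return names


def harvest(schema_text):
    """Map each named element in the schema to its inline attribute names."""
    root = ET.fromstring(schema_text)
    declared = {}
    for element in root.iter(f"{{{RNG_NS}}}element"):
        name = element.get("name")
        if name:
            attrs = declared.setdefault(name, [])
            for attr in own_attributes(element):
                if attr not in attrs:
                    attrs.append(attr)
    return declared


def default_stylesheet(declared):
    """Emit one pass-through template per declared element."""
    ET.register_namespace("xsl", XSL_NS)
    sheet = ET.Element(f"{{{XSL_NS}}}stylesheet", {"version": "2.0"})
    for name, attrs in sorted(declared.items()):
        template = ET.SubElement(sheet, f"{{{XSL_NS}}}template", {"match": name})
        copy = ET.SubElement(template, f"{{{XSL_NS}}}copy")
        if attrs:
            # Copy only the attributes that the schema allows on this element.
            select = "|".join(f"@{a}" for a in attrs)
            ET.SubElement(copy, f"{{{XSL_NS}}}copy-of", {"select": select})
        ET.SubElement(copy, f"{{{XSL_NS}}}apply-templates", {"select": "node()"})
    return ET.tostring(sheet, encoding="unicode")


print(default_stylesheet(harvest(SAMPLE_SCHEMA)))
```

Because the stylesheet is built as an XML tree and only serialized at the end, the output is well-formed by construction, which is the same property that makes XSLT convenient as a code-generation target.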
Ultimately, the custom transformations set forth in the configuration file need to be instantiated as XSLT templates and included in the conversion stylesheet at runtime. For example, to replace the <temphead> element with <teiHeader> (its TEI P5 counterpart), the system would need to generate the following:
<transformation type="xslt" activate="yes">
<desc>substitute 'temphead' with 'teiHeader'</desc>
<xsl:template match="*[lower-case(name())='temphead']" priority="1">
Replacing spaces with underscores in the extent attribute of the <gap> element (a considerably more complex operation) requires substantially more code:
<transformation type="xslt" activate="yes">
<desc>add underscore to 'gap' @extents containing a space</desc>
<xsl:template match="*[lower-case(name())='gap']" priority="1">
<xsl:when test="contains(@extent,' ')">
<xsl:when test="contains(.,' ')">
<xsl:value-of select="replace(.,' ','_')"/>
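To make the intended effect of this template concrete, the following Python sketch applies the same rewrite directly to a small sample document. It illustrates what the transformation does to the data and does not reconstruct the XSLT itself. The function name `underscore_gap_extents` and the sample markup are hypothetical, and the sketch assumes un-namespaced input.

```python
import xml.etree.ElementTree as ET


def underscore_gap_extents(xml_text):
    """Replace spaces with underscores in the extent attribute of every
    gap element, matching the element name case-insensitively."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Skip comment/PI nodes, whose tag is not a string.
        if isinstance(element.tag, str) and element.tag.lower() == "gap":
            extent = element.get("extent")
            if extent is not None and " " in extent:
                element.set("extent", extent.replace(" ", "_"))
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: one gap needs rewriting, the other is left alone.
sample = '<p>text <GAP extent="2 words"/> more <gap extent="1"/></p>'
print(underscore_gap_extents(sample))
```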
Depending on the particular situation, the configuration file might have to contain dozens of hand-built templates for performing subtle transformations that cannot be deduced from the schema. But here, we undertake a second code-generation step using Clojure – a dialect of Lisp that runs on the Java Virtual Machine.
Because Lisp is also a homoiconic language, it too is well suited to code that reads and writes code. Moreover, XML is itself a first-class data structure in Clojure, which can be easily (and lazily) transformed into a map object in which descendant nodes are represented as nested vectors. The problem of parsing a configuration file (in which complicated XSLT transformations are rendered in the form of a radically simplified DSL) becomes a matter of parsing the file into a map structure. Clojure can then trivially transform that map directly into XML (XSLT), which can be inserted at runtime into the conversion stylesheet. The first XSLT example above becomes something like:
temphead -> teiHeader
The second, more complicated example might be expressed as:
gap[@extent='/ /'] -> gap[@extent='/_/']
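The translation from such a rule into XSLT can be sketched in a few lines. Abbot performs this step in Clojure; the Python below is only an illustration of the idea and handles just the simple rename form (`temphead -> teiHeader`). The function name `rename_rule_to_xslt`, the rule grammar, and the body of the generated template (copy the attributes, then process the children) are assumptions; only the `match` and `priority` attributes follow the template shown earlier.

```python
import re
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Assumed grammar for the simplest rule form: "source -> target".
RENAME_RULE = re.compile(r"^\s*([A-Za-z_][\w.-]*)\s*->\s*([A-Za-z_][\w.-]*)\s*$")


def rename_rule_to_xslt(rule):
    """Translate a rename rule into an xsl:template, built as an XML tree."""
    match = RENAME_RULE.match(rule)
    if match is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    source, target = match.groups()

    ET.register_namespace("xsl", XSL_NS)
    template = ET.Element(
        f"{{{XSL_NS}}}template",
        {"match": f"*[lower-case(name())='{source.lower()}']", "priority": "1"},
    )
    # Assumed template body: emit the target element, keep the attributes,
    # and let other templates handle the children.
    replacement = ET.SubElement(template, target)
    ET.SubElement(replacement, f"{{{XSL_NS}}}copy-of", {"select": "@*"})
    ET.SubElement(replacement, f"{{{XSL_NS}}}apply-templates")
    return ET.tostring(template, encoding="unicode")


print(rename_rule_to_xslt("temphead -> teiHeader"))
```

The attribute-rewriting rule in the second example would need a richer grammar (a predicate plus a pattern substitution), so this sketch rejects it with a `ValueError`.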
In this way, Abbot becomes not merely a framework for effecting interoperability of XML document collections, but a general-purpose XML transformation framework that avoids the need for XSLT itself.7
Thinking of XSLT as an intermediate form – a language that is targeted much as a compiler might target assembly – allows us to imagine radically simplified document transformation languages that can (potentially) exploit the full range of XSLT itself. In the case of Abbot, radical simplification is possible, in part, because the problem domain is itself highly constrained. But such constraints constitute precisely the rationale for domain-specific languages that try to map a user’s domain knowledge to a simplified syntax. Such languages, while smaller and simpler than more general-purpose languages, often still require the full range of language design tools (lexers, parser generators, the specification of a grammar, and so forth). Exploiting the homoiconicity of languages that possess this feature – including XSLT itself – makes the process of designing a ‘mini-language’ considerably easier.
Adler, S. (1997). A Proposal for XSL. World Wide Web Consortium (W3C) http://www.w3.org/TR/NOTE-XSL.html. (accessed 31 October 2011).
McIlroy, D. (1960). Macro Instruction Extensions of Compiler Languages. Communications of the ACM 3(4): 214-220.
Pytlik-Zillig, B. (2009). TEI Analytics: Converting Documents into a TEI Format for Cross-Collection Text Analysis. Literary and Linguistic Computing 24(2): 187-192.
Pytlik-Zillig, B. (2011). TEI Texts that Play Nicely: Lessons from the MONK Project. Journal of the Chicago Colloquium on Digital Humanities and Computer Science 1(3): 1-5.
Unsworth, J. (2011). Computational Work with Very Large Text Collections: Interoperability, Sustainability, and the TEI. Journal of the Text Encoding Initiative 1 http://jtei.revues.org/215. (accessed 21 October 2011).
1. A Web search for ‘XML verbosity’ will bear this out amply, though we note that this ‘loquaciousness’ is itself consistent with the design goals of XML, which was intended to be a human-readable format. In the first formal XSL proposal, the authors explicitly state that ‘Terseness in XSL markup is of minimal importance’ (see Adler 1997).
4. The Document Type Definition is currently defined by the XML specification itself, though it descends from earlier specifications associated with Standard Generalized Markup Language (SGML). The Extensible Markup Language (XML) specification is at http://www.w3.org/TR/REC-xml
5. We have called this method ‘schema harvesting’ (see Pytlik-Zillig 2011).
6. Homoiconicity is a property of several programming languages, including REBOL, SNOBOL, PostScript, XQuery, Prolog, and all dialects of Lisp. The concept of homoiconicity is first set forth in Douglas McIlroy’s 1960 article, ‘Macro Instruction Extensions of Compiler Languages’ (see McIlroy 1960).
7. Abbot cannot, of course, replace XSLT in all circumstances, since the DSL we are describing is not intended to capture the entire semantics of XSLT. Still, we imagine that Abbot can be usefully employed in many situations in which large bodies of texts are being transformed, even if interoperability is not a primary concern.