This paper will discuss recent research carried out in the context of two grant projects: ‘Poetry and Contingency’ (funded by a Social Sciences and Humanities Research Council of Canada [SSHRC] Insight Development Grant – Digital Economy Priority Area) and ‘Applying search and stylometry techniques to OED2 and poetic text corpora’ (funded by a UWaterloo SSHRC Institutional Grant). It will discuss the processes and methods developed for analyzing and comparing large, heterogeneous text datasets, as well as preliminary results of these analyses and their applications in and implications for traditional literary studies. With implications for digital humanities research, computational approaches to language and literature, linguistics, lexicography, poetics, and literary criticism, the paper will address conference themes of ‘cultures, languages, and methods’ and several of their intersections.
The Oxford English Dictionary (OED) is widely considered to be the greatest philological and lexicographical achievement in English. The core of the work is its 2.5 million quotations, a significant portion of them from poetic and other literary texts, which both shape and illustrate the various sense definitions of roughly 600,000 English words and word forms. Conversely, since its publication, poets have relied on the OED to guide their deployments and arrangements of English words in poems. This reciprocal intertextuality has led to two striking facts which have received insufficient scholarly attention: 1) that the OED’s definitions of English words depend to a significant degree on poetic language, which is notable because, by any standard account, poetic usage tends away from the denotative or definitional and towards the connotative and metaphorical; and 2) that much English poetry of the last hundred years contains a philological, etymological, and lexicographical dimension, informed by the OED.
Although the Second Edition of the OED (Murray et al. 1989) was among the earliest large books to be presented to public and academic communities in digitized, marked-up form, and despite the ongoing comprehensive rewriting of the Dictionary (Simpson & Weiner 2000-), no version has ever been marked for quotation genre. As a result, the reciprocal influences between dictionary and author have until now been difficult to identify, and impossible or impracticable to quantify with reference to specific literary genres (e.g. poetry or verse drama). My projects use the 1989 electronic OED2, digitized at the University of Waterloo, alongside electronic corpora of poetry, derived from Project Gutenberg, Chadwyck-Healey, and other datasets, to generate quantitative and qualitative assessments in two broad fields of inquiry: 1) What has been the influence of poetry on the English language’s most comprehensive lexicographical work? and 2) What influence has the OED had on English-language poetry?
The first prong of the project involves creating a parallel OED2 in which poetic quotations are marked for genre, to allow for advanced search and comparison of poetic quotations. Marking OED2 for quotation genre will allow for fast, comparative analysis of the influence of poetic writing on the compilers of the OED. I will discuss the challenges of this task and the methods developed to achieve it, as well as the search and comparison scripts developed to query the new resource. I will give examples and discuss the kinds of queries and comparisons made possible by this new resource, and their implications for literary studies and lexicography.
If the first prong is designed to generate questions and answers about poetic influence on OED2, the second prong investigates a more intricate and subtle problem: the influence of the OED on poetry. For instance, it is a trivial thing for a critic to suppose that Geoffrey Hill’s talk of ‘wrinching and spraining the text’ is a reference to Milton, since it contains nonstandard spelling, is italicized, and appears in a poem with Milton in the title (Hill 2008: 5). But OED2 also quotes exactly those words from Of Reformation (Milton 1698: 269) under both ‘wrench, v.’ and ‘sprain, v.’, suggesting a second possible source text for Hill. Testing an individual poem for potential OED2 intertextuality requires a set of text comparison techniques plus an appropriate method of applying these. The process can also be applied systematically, to an electronic corpus (or several corpora) of poetry written since c.1884, the year OED1 fascicles began to be published. Relatively simple text-similarity approaches to the two corpora (including, e.g., string, n-gram, functional n-gram, and low-probability statistical approaches) will yield hundreds if not thousands of instances like the one above, without depending on the poet to flag his or her source. More sophisticated stylometric techniques will give more promising indications of potential poetic intertextuality with various dictionary entries, including their etymology, definition, and quotation fields. Using these tools, we can quickly identify a very large number of candidates for intertextuality in English poems. I will discuss the comparison methods developed, as well as preliminary results of these, and their implications for literary studies.
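The simplest of the text-similarity approaches named above, shared word n-grams, can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the project's actual comparison scripts: the function names, the (headword, quotation) record layout, and the toy data are all hypothetical. In particular, the quotation strings are abbreviated to the phrase cited above, and the poem line is an invented stand-in rather than Hill's text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the set of contiguous word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(poem, quotations, n=3):
    """Map each headword to the sorted n-grams its quotations share with the poem."""
    poem_grams = ngrams(tokenize(poem), n)
    hits = defaultdict(set)
    for headword, quotation in quotations:
        overlap = poem_grams & ngrams(tokenize(quotation), n)
        if overlap:
            hits[headword] |= overlap
    return {hw: sorted(" ".join(g) for g in grams) for hw, grams in hits.items()}


# Hypothetical toy records: (headword, quotation text). The quotation strings
# are placeholders abbreviated to the phrase discussed above, not full OED2 text.
TOY_QUOTATIONS = [
    ("wrench, v.", "wrinching and spraining the text"),
    ("sprain, v.", "wrinching and spraining the text"),
    ("text, n.", "a placeholder quotation with no overlap"),
]

# Invented stand-in line (not Hill's actual verse) containing the cited phrase.
TOY_POEM = "he spoke of wrinching and spraining the text again"

if __name__ == "__main__":
    for headword, grams in shared_ngrams(TOY_POEM, TOY_QUOTATIONS).items():
        print(headword, "->", grams)
```

On this toy input, the two entries that carry the shared phrase are flagged and the unrelated entry is not. A match of this kind identifies only a candidate for intertextuality: it cannot by itself show whether the poet's source was the dictionary or the original text.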
A few facts make OED2 intertextuality a special case, overlapping with cases of attribution or allusion, but not identical to these. First, OED2 is already a multiauthored and intertextual text, written and compiled by hundreds of lexicographers over more than a hundred years, following varying practices and relying on thousands of sources comprising millions of quotations. Second, though OED2 carries linguistic information (pronunciations, etymologies, definitions, etc.) as well as historical usage information (in the quotations) about every English word that is likely to occur in a poem or anywhere else, it is not often the only text to carry any one subset of this information. This means that finding OED2 in poems may point to a source for a poetic passage other than, or in addition to, OED2 itself. Comparing the etymology field of a word in OED2 to its occurrence in a poetic text, for instance, might point to an etymological play on words, without conclusively attributing this to OED2. Or comparing the quotation fields may suggest an allusion to a text that happens to be quoted in OED2, even if OED2 is not itself the source of the allusion. But it may also point to true influence, in the form of a poetic allusion or reference which has been occasioned by OED2 and not the original text. I will discuss, with the help of preliminary results, the value and implications of each type of discovery, and ways of differentiating among these possibilities when appropriate. This is, I will argue, a discussion which takes up the crucial question of how, and to what extent, computing technologies can benefit the field of literary criticism, as a species of literary scholarship with its own goals and commitments.
Brewer, C. (2008). Treasure-House of the Language: The Living OED. New Haven: Yale.
Brewer, C. (2009). Literary Quotations in the OED. Review of English Studies 61: 93-125.
Brewer, C. (2011). ‘Happy Copiousness’? OED’s Recording of Female Authors of the Eighteenth Century. Review of English Studies 62: 86-117.
Burchfield, R. W. (1989). Unlocking the English Language. London: Faber.
Forstall, C. W., and W. J. Scheirer (2010). Features from Frequency: Authorship and Stylistic Analysis Using Repetitive Sound. Proceedings of the Chicago Colloquium on Digital Humanities and Computer Science, 1(2). https://letterpress.uchicago.edu/index.php/jdhcs/article/view/56/67 (accessed March 2012).
Heaney, S. (1996). Opened Ground: Poems 1966-1996. London: Faber.
Hill, G. (2008). A Treatise of Civil Power. London: Penguin.
Hollander, J. (1981). The Figure of Echo. Berkeley: U of California P.
Holmes, D. I. (1998). The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic Computing 13(3): 111-17.
Irwin, W. (2001). What is an Allusion? The Journal of Aesthetics and Art Criticism 59(3): 287-297.
Koppel, M., and J. Schler (2004). Authorship Verification as a One-Class Classification Problem. Proceedings of the 21st International Conference on Machine Learning, pp. 489-95.
Koppel, M., J. Schler, and E. Bonchek-Dokow (2007). Measuring Differentiability: Unmasking Pseudonymous Authors. Journal of Machine Learning Research 8: 1261-1276.
Literature Online. http://lion.chadwyck.com (accessed March 2012).
Milton, J. (1698). Historical, Political, and Miscellaneous Works. Amsterdam.
Muldoon, P. (2001). Poems, 1968-1998. London: Faber.
Murray, J., et al., eds. (1989). Oxford English Dictionary. 2nd ed., compiled by J. A. Simpson and E. S. C. Weiner, 20 vols. Oxford: Oxford UP.
Neumann, P. Statistical metalinguistics and Zipf/Pareto/Mandelbrot. http://www.csl.sri.com/users/neumann/#12a (accessed March 2012).
Project Gutenberg. http://gutenberg.org (accessed March 2012).
Ricks, C. (2002). Allusion to the Poets. Oxford: Oxford UP.
Ruthven, K. K. (1969). The Poet as Etymologist. Critical Quarterly 11(1): 9-37.
Simpson, J., and E. S. C. Weiner (2000-). OED Online. 3rd ed., rev. J. A. Simpson et al. Oxford: Oxford UP.
Tompa, F., E. Blake, and T. Bray (1991). Shortening the OED: Experience with a grammar-defined database. Waterloo: UW Centre for the New Oxford Dictionary and Text Research.
Trillini, R. H., and S. Quassdorf (2010). A ‘Key to All Quotations’? A Corpus-Based Parameter Model of Intertextuality. Literary and Linguistic Computing 25(3): 269-86.
Vogel, C., and G. Lynch (2008). Computational Stylometry: Who’s in a Play? Lecture Notes in Computer Science 5042: 169-186.