Introduction to Stylometric Analysis using R

Eder, Maciej, Pedagogical University, Kraków, Poland, maciejeder@gmail.com

Rybicki, Jan, Jagiellonian University, Kraków, Poland, jkrybicki@gmail.com

Brief Description

Stylometry, or the study of measurable features of (literary) style, such as sentence length, vocabulary richness and various frequencies (of words, word lengths, word forms, etc.), has been around at least since the middle of the 19th century, and has found numerous practical applications in authorship attribution research. These applications usually rest on the belief that conscious or unconscious elements of personal style can help detect the true author of an anonymous text; that stylistic fingerprints can betray the plagiarist; and that the oldest authorship disputes (St. Paul’s epistles or Shakespeare’s plays) can be settled with more or less sophisticated statistical methods.
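As a rough illustration of what "measurable features" means here, the sketch below computes mean sentence length, a type-token ratio (one simple proxy for vocabulary richness) and a word-length distribution. It is written in Python purely for illustration and is not taken from the workshop's R scripts; the function name `simple_style_features` and the naive sentence and word splitting rules are assumptions.

```python
import re
from collections import Counter


def simple_style_features(text):
    """Return a few basic stylometric measurements for a text."""
    # Naive sentence split on ., ! or ? (an assumption; real sentence
    # segmentation is language-dependent and considerably harder).
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    # Words are runs of letters, optionally joined by apostrophes, lower-cased.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words")
    word_lengths = Counter(len(w) for w in words)
    return {
        # Average number of words per sentence.
        "mean_sentence_length": len(words) / len(sentences),
        # Distinct word forms divided by running words.
        "type_token_ratio": len(set(words)) / len(words),
        # How many words there are of each length, shortest first.
        "word_length_counts": dict(sorted(word_lengths.items())),
    }


sample = "The cat sat. The dog sat too! Did the bird sit?"
features = simple_style_features(sample)
print(features)
```

Measures such as the type-token ratio depend strongly on text length, so in practice they are only comparable across samples of similar size.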

While specific issues remain largely unresolved (or, if closed once, are sooner or later reopened), a variety of statistical approaches have been developed that make it possible, often with spectacular precision, to identify texts written by several authors based on a single example of each author’s writing. But even more interesting research questions arise beyond bare authorship attribution: patterns of stylometric similarity and difference also provide new insights into relationships between different books by the same author; between books by different authors; between authors differing in terms of chronology or gender; between translations of the same author or group of authors. These insights, in turn, help to find new ways of looking at works that seem to have been studied from all possible perspectives. Nowadays, in the era of ever-growing computing power and of ever more literary texts available in electronic form, we are able to perform stylometric experiments that our predecessors could only dream of.

This half-day workshop is a hands-on introduction to stylometric analysis in the programming language R, using an emerging tool: a collection of Maciej Eder’s and Jan Rybicki’s scripts that perform multivariate analyses of the frequencies of the most frequent words, the most frequent word n-grams, and the most frequent letter n-grams. One of the scripts produces Cluster Analysis, Multidimensional Scaling, Principal Component Analysis and Bootstrap Consensus Tree graphs based on Burrows’s Delta and other distance measures; it applies additional (and optional) procedures, such as Hoover’s ‘culling’ and pronoun deletion. As a by-product, it can also generate various frequency lists; a stand-alone word-frequency-maker is also available. Another script provides insight into state-of-the-art supervised techniques of classification, such as Support Vector Machines, k-Nearest Neighbor classification, or, more classically, Delta as developed by Burrows. Our scripts have already been used by other scholars to study Wittgenstein’s dictated writings or, believe it or not, DNA sequences!
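The core idea of a Delta-based analysis can be sketched in a few lines. The example below is a minimal Python illustration, not the workshop's R scripts: it takes the relative frequencies of the most frequent words, converts them to z-scores across the corpus, and computes Burrows's Delta as the mean absolute difference of z-scores between two texts. The names `relative_frequencies`, `burrows_delta`, `min_share` and `stop_words` are hypothetical, the toy corpus is invented, and the "culling" shown is a simplified document-frequency threshold that may differ from Hoover's procedure and from the scripts' own options.

```python
import re
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    """Lower-case the text and return its runs of letters as words."""
    return re.findall(r"[^\W\d_]+", text.lower())


def relative_frequencies(texts, n_mfw, min_share=0.0, stop_words=()):
    """Relative frequencies of the n_mfw most frequent words in each text.

    min_share is a simplified stand-in for culling: a word is kept only if
    it occurs in at least that share of the texts. stop_words can hold,
    for example, pronouns to be deleted. Both names are hypothetical.
    """
    counts = {name: Counter(tokenize(t)) for name, t in texts.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    vocab = [
        w for w, _ in total.most_common()
        if w not in stop_words
        and sum(w in c for c in counts.values()) >= min_share * len(counts)
    ]
    mfw = vocab[:n_mfw]
    freqs = {
        name: [c[w] / sum(c.values()) for w in mfw]
        for name, c in counts.items()
    }
    return mfw, freqs


def burrows_delta(freqs):
    """Pairwise Delta: mean absolute difference of per-word z-scores."""
    names = list(freqs)
    z = {n: [] for n in names}
    for column in zip(*(freqs[n] for n in names)):
        mu, sd = mean(column), stdev(column)
        for n, value in zip(names, column):
            # A word with no variation carries no information; score it 0.
            z[n].append((value - mu) / sd if sd > 0 else 0.0)
    return {
        a: {b: mean([abs(x - y) for x, y in zip(z[a], z[b])])
            for b in names if b != a}
        for a in names
    }


# Invented toy corpus; "anon" reuses the words of "A1" in another order.
corpus = {
    "A1": "the sea and the ship and the storm",
    "A2": "the sea and the wind and the storm",
    "B1": "a garden of roses and a garden of light",
    "anon": "the storm and the ship and the sea",
}
mfw, freqs = relative_frequencies(corpus, n_mfw=8, min_share=0.5)
delta = burrows_delta(freqs)
nearest = min(delta["anon"], key=delta["anon"].get)
print(mfw, nearest)
```

Attributing the anonymous text to its nearest neighbour by Delta is the simplest form of the classification mentioned above; the same distance table could equally feed a cluster analysis or multidimensional scaling.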

The workshop will be an opportunity to see this in practice on a variety of text collections, investigated for authorial attribution, translatorial attribution, genre, gender and chronology. Text collections in a variety of languages will be provided; workshop attendees are welcome to bring even more texts (in either plain text format or TEI-XML). No previous knowledge of R is necessary: our script is very user-friendly (and very fast)!

Tutorial Outline

During a brief introduction, (1) R will be installed on the users’ laptops from the Internet (if it has not already been installed); (2) participants will receive CDs/pendrives with the script(s), a short quickstart guide and several text collections prepared for analysis; (3) some theory behind this particular stylometric approach will be discussed, and the possible uses of the tools presented will be summarized. After that and (4) a short instruction, participants will move on to (5) hands-on analysis to produce as many different results as possible to better assess the various aspects of stylometric study; (6) additional texts might be downloaded from the Internet or added by the participants themselves. The results, both numerical and visual, will be analyzed. For those more advanced in R (or S, or Matlab), details of the script (R methods, functions, and packages) will be discussed.

Special Requirements

Participants should come with their own laptops. We have versions of the scripts for Windows, MacOS and Linux. The workshop also requires a projector and an Internet connection in the workshop room.

References

Baayen, H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge: Cambridge UP.

Burrows, J. (1987). Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon Press.

Burrows, J. F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing 17(3): 267-287.

Craig, H. (1999). Authorial attribution and computational stylistics: if you tell authors apart, have you learned anything about them? Literary and Linguistic Computing 14(1): 103-113.

Craig, H., and A. F. Kinney, eds. (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge UP.

Eder, M. (2010). Does size matter? Authorship attribution, small samples, big problem. Digital Humanities 2010: Conference Abstracts. King’s College London, pp. 132-135.

Eder, M. (2011). Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics 6: 101-116.

Eder, M., and J. Rybicki (2011). Stylometry with R. Digital Humanities 2011: Conference Abstracts. Stanford University, Stanford, pp. 308-311.

Eder, M., and J. Rybicki (2012). Do birds of a feather really flock together, or how to choose test samples for authorship attribution. Literary and Linguistic Computing 27 (in press).

Hoover, D. L. (2004). Testing Burrows’s Delta. Literary and Linguistic Computing 19(4): 453-475.

Jockers, M. L., and D. M. Witten (2010). A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing 25(2): 215-223.

Koppel, M., J. Schler, and S. Argamon (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology 60(1): 9-26.

Rybicki, J. (2012). The great mystery of the (almost) invisible translator: stylometry in translation. In M. Oakes and M. Ji (eds.), Quantitative Methods in Corpus-Based Translation Studies. Amsterdam: John Benjamins.

Oakes, M., and A. Pichler (2012). Computational Stylometry of Wittgenstein’s Diktät für Schlick. Bergen Language and Linguistic (Bells) Series, (in press).

Rybicki, J., and M. Eder (2011). Deeper Delta across genres and languages: do we really need the most frequent words? Literary and Linguistic Computing 26(3): 315-321.