This paper presents a subset of the results of a larger project which explores ways to recognize and classify a narrative feature – speech, thought and writing representation (ST&WR) – automatically, using surface information and methods of computational linguistics.
Speech, thought and writing can be represented in various ways. Common categories in narratology are direct representation (He thought: ‘I am hungry.’), free indirect representation, which takes on characteristics of the character’s voice as well as the narrator’s (Well, where would he get something to eat now?), indirect representation (He said that he was hungry.) and reported representation, which can be a mere mention of a speech, thought or writing act (They talked about lunch.). ST&WR is a feature central to narrative theory, as it is important for constructing a fictional character and sheds light on the narrator-character relationship and the narrator’s stance. The favored techniques not only vary between authors and genres, but have also changed and developed over the course of literary history. Therefore, an automated annotation of this phenomenon would be valuable, as it could quickly deal with a large number of texts and allow a narratologist to study regularities and differences between different time periods, genres or authors.
The approach presented here specifically aims at applying digital methods to the recognition of features conceptualized in narrative theory. This sets it apart from other digital approaches to literary texts, which often operate purely on a vocabulary level and are more focussed on thematic issues or on author- or group-specific stylistics. Recent approaches to the goal of automatically identifying ST&WR are either not concerned with narrativity at all, like Krestel et al., who developed a recognizer of direct and indirect speech representation in newspaper texts to identify second-hand information, or not interested in the techniques themselves, like Elson et al., who use recognition of direct speech representation to extract a network of interrelations of fictional characters. Also, both approaches are designed for the English language and therefore cannot be used in this project.
The basis for the research is a corpus containing 13 short narratives in German written between 1786 and 1917 (about 57 000 tokens). The corpus has been manually annotated with a set of ST&WR categories adapted from narratological theory. This step is comparable to the annotation project conducted by Semino and Short for a corpus of English literary, autobiographical and newspaper texts, but is something that has never been done for German literary texts before. The manual annotation gives empirical insight into the surface structures and the complex nature of ST&WR, but also serves as training material for machine learning approaches and, most importantly, as reference for evaluation of the automatic recognizer.
The main focus of this paper is the automatic recognition. Rule-based as well as machine learning approaches have been tested for the task of detecting instances of the narratological categories. In the scope of this paper, a subset of these strategies is presented and compared.
For the rule-based approach, simple and robust methods are favored, which do not require advanced syntactic or semantic preprocessing, automatic reasoning or complex knowledge bases. The modules make use of conventions like punctuation, as well as lexical and structural characteristics for different types of ST&WR. A central feature is a list of words that signal ST&WR, e.g. to say (sagen), to whisper (flüstern). For the recognition of indirect ST&WR specifically, patterns of surface forms and morphological categories are used to match the dependent propositions (e.g. Er sagte, dass er hungrig sei. [He said that he was hungry.]: signal word – followed by comma – followed by a conjunction – followed by a verb in subjunctive mood). This method achieves F1 scores of up to 0.71 in a sentence-based evaluation. Direct representation can be detected with an F1 score of 0.84 by searching for quotation patterns and framing phrases (e.g. he said). Annotating reported representation, which is quite diverse, achieves an F1 score of up to 0.57.
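The indirect-representation pattern described above (signal word, comma, conjunction, subjunctive verb) can be illustrated with a minimal Python sketch. It is not the project's GATE implementation. The word lists and the function name looks_like_indirect are hypothetical, and a lookup in a small set of subjunctive forms stands in for the morphological analysis a real system would need.

```python
import re

# Hypothetical, tiny lexicons for illustration only; the project's actual
# signal word list and morphological tagging are not reproduced here.
SIGNAL_WORDS = {"sagte", "sagt", "flüsterte", "meinte", "dachte"}
CONJUNCTIONS = {"dass", "ob"}
SUBJUNCTIVE_FORMS = {"sei", "habe", "wäre", "hätte", "könne", "werde"}

# Split into word tokens and single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def looks_like_indirect(sentence):
    """Return True if the sentence matches the surface pattern
    signal word + comma + conjunction + (later) subjunctive verb form."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence)]
    for i, tok in enumerate(tokens):
        if tok not in SIGNAL_WORDS:
            continue
        # The signal word must be directly followed by a comma ...
        if tokens[i + 1:i + 2] != [","]:
            continue
        # ... and the comma by a subordinating conjunction.
        if i + 2 >= len(tokens) or tokens[i + 2] not in CONJUNCTIONS:
            continue
        # A subjunctive form must occur somewhere in the dependent clause.
        if any(t in SUBJUNCTIVE_FORMS for t in tokens[i + 3:]):
            return True
    return False


print(looks_like_indirect("Er sagte, dass er hungrig sei."))  # True
print(looks_like_indirect("Er sagte nichts."))                # False
```

Requiring strict adjacency of signal word, comma and conjunction is a simplifying assumption of this sketch; it keeps the rule robust but would miss sentences in which other material intervenes.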
The machine learning approach uses a random forest learning algorithm trained on the manually annotated corpus. Features like sentence length, number of certain word types and types of punctuation are used as attributes. The advantage of this approach lies in the fact that it can also be used to handle types of ST&WR which have less obvious structural or lexical characteristics, like free indirect representation, for which an F1 score of 0.43 can be achieved when performing sentence-based cross validation on the corpus. The F1 score for detecting direct representation is 0.81, for indirect representation 0.57 and for reported representation 0.51.
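To make the kind of attributes mentioned above more concrete, the following Python sketch computes a few sentence-level surface features (length, signal word count, punctuation counts). The exact feature set of the project is not specified here, so the feature names, the signal word list and the function sentence_features are assumptions chosen for illustration.

```python
import re

# Split into word tokens and single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical lexicon and quote inventory, for illustration only.
SIGNAL_WORDS = {"sagte", "dachte", "fragte", "flüsterte"}
QUOTE_MARKS = {'"', "'", "‘", "’", "„", "“", "”", "»", "«"}


def sentence_features(sentence):
    """Map one sentence to a dictionary of simple surface features."""
    tokens = TOKEN_RE.findall(sentence)
    lowered = [t.lower() for t in tokens]
    return {
        "n_tokens": len(tokens),
        "n_signal_words": sum(t in SIGNAL_WORDS for t in lowered),
        "n_commas": tokens.count(","),
        "n_question_marks": tokens.count("?"),
        "n_exclamation_marks": tokens.count("!"),
        "n_quote_marks": sum(t in QUOTE_MARKS for t in tokens),
    }


features = sentence_features("Well, where would he get something to eat now?")
print(features["n_tokens"], features["n_question_marks"])  # 11 1
```

Feature dictionaries of this kind, paired with the manual annotation as labels, could then be passed to any random forest implementation for sentence-based cross validation; which toolkit and which attributes the project actually uses is not stated in this abstract.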
The components of the automatic recognizer are modular and realized as working prototypes in the framework GATE (General Architecture for Text Engineering) (http://gate.ac.uk). When the project is finished, it is intended to publish these components as GATE plugins and make them available as a free download.
In the paper, the advantages and disadvantages of rule-based and machine learning approaches as well as possibilities for combination are discussed. For example, rules can be used for ST&WR strategies with clear patterns and conventions, like direct and – to an extent – indirect representation, but machine learning for the more elusive types like free indirect representation. It is also possible to get results for the same ST&WR category from different modules and use those to calculate scores. For instance, merging the results of rule-based and ML methods improves the overall F1 score for recognizing direct representation in the corpus.
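One simple way to combine module outputs, sketched here under stated assumptions, is to treat each module's result as a set of sentence indices, merge the sets by union, and compare sentence-based F1 scores against the manual annotation. The merging strategy actually used in the project is not detailed in this abstract, and the sentence indices below are invented toy data, not results from the corpus.

```python
def f1_score(gold, predicted):
    """Sentence-based F1: both arguments are sets of sentence indices
    annotated with a given ST&WR category."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def merge_union(rule_hits, ml_hits):
    """Accept a sentence if at least one module flagged it."""
    return rule_hits | ml_hits


# Invented toy data: indices of sentences marked as direct representation.
gold = {1, 2, 3, 4}
rule_hits = {1, 2}       # precise, but misses sentences 3 and 4
ml_hits = {2, 3, 9}      # finds sentence 3, adds one false positive
merged = merge_union(rule_hits, ml_hits)

print(round(f1_score(gold, rule_hits), 2))  # 0.67
print(round(f1_score(gold, ml_hits), 2))    # 0.57
print(round(f1_score(gold, merged), 2))     # 0.75
```

A union raises recall at the cost of precision, so it helps only when the modules make complementary errors, as in this toy case; an intersection or a vote threshold would trade in the opposite direction.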
Though the figures above give a rough idea of expected success rates, evaluation is in fact an analytic task itself: results not only differ considerably between types of ST&WR, but also depend on the exact configuration of the recognizer modules. There is also the question of what kind of results should be prioritized for narratological research and how to deal with cases which are problematic even for a human annotator. However, the modular structure of the recognizer is designed to allow for customization, and the main goal of the project is to shed light on the relationship between the manual annotation, generated with narratological concepts in mind, and the possibilities and limitations of ultimately surface-based automatic annotation.
Cunningham, H., et al. (2011). Text Processing with GATE (Version 6): http://www.tinyurl/gatebook (accessed 14 March 2012).
Elson, D. K., N. Dames, and K. R. McKeown (2010). Extracting Social Networks from Literary Fiction. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics 2010, pp. 138-147.
Krestel, R., S. Bergler, and R. Witte (2008). Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles. Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC 2008), Marrakech, Morocco, May 28-30 2008.
Semino, E., and M. Short (2004). Corpus stylistics. Speech, writing and thought presentation in a corpus of English writing. London, New York: Routledge.