The European Holocaust Research Infrastructure (EHRI)1 project aims to create a sustainable Holocaust research infrastructure that brings together documentary evidence from dispersed archives for historical research. EHRI involves 20 partner organisations in 13 countries and aims to provide open access to Holocaust material such as documents, objects, photos, film and art. One of the challenges of the project is that the dispersed archives of interest to EHRI often lack the means to digitise their resources sufficiently. Even where resources are available digitally, they often remain inaccessible to searching and browsing by researchers.
For EHRI, we investigated how we can use the open source OCR infrastructure developed at King’s College London for the Ocropodium project (Bryant et al. 2010) so that its output can feed the semantic extraction of metadata useful for research discovery and analysis. Current commercial OCR technology serves such specific research interests in historical document collections poorly, as it cannot easily be customised. Most commercial OCR software products are proprietary ‘black boxes’ that give digitisation staff little scope for understanding their behaviour or for customising the parameters under which they run. At the source level, OCR software manufacturers are markedly reluctant to allow access to their code, even in a collaborative environment.
In the context of Ocropodium, we developed a workflow tool, Ocropodium Web Processing (OWP), with which various open source OCR tools can be combined to create the best possible OCR solution for specific document types. The OWP workflow environment is based on the principles of visual programming environments (Cox & Gauvin 2011). It allows the user to build custom workflows composed of discrete processes. These workflows need not be purely linear; instead they take the form of a graph, specifically a directed acyclic graph (DAG) of connected nodes, each of which performs a discrete function. Nodes can be thought of as functions that take one or more inputs and evaluate them to produce a result. OWP thus allows archive staff to embed further services beyond OCR. For this paper, we experimented with information extraction (IE) services to semantically enrich archival descriptions. We present our experiments evaluating how common off-the-shelf IE services behave on potentially noisy OCR’d texts.
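The node-as-function model underlying such a DAG workflow can be sketched in a few lines of Python. This is a minimal illustration of the principle, not OWP’s actual implementation; the node names (‘load’, ‘binarise’, ‘ocr’) are invented for the example:

```python
# Minimal sketch of a DAG workflow: each node is a function plus a list
# of the upstream nodes that supply its inputs. Illustrative only.

def evaluate(dag, node, cache=None):
    """Recursively evaluate `node`, memoising results shared by downstream nodes."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    func, inputs = dag[node]
    args = [evaluate(dag, parent, cache) for parent in inputs]
    cache[node] = func(*args)
    return cache[node]

# Hypothetical three-node workflow: load -> binarise -> ocr
dag = {
    "load":     (lambda: "raw image", []),
    "binarise": (lambda img: f"binarised({img})", ["load"]),
    "ocr":      (lambda img: f"text from {img}", ["binarise"]),
}

print(evaluate(dag, "ocr"))  # text from binarised(raw image)
```

Because the graph is acyclic, evaluation always terminates, and memoisation means a node shared by several branches runs only once.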
Semantically enriched library and archive federations have recently become an important part of research in digital libraries (Kruk & McDaniel 2009), especially as research users often demand more semantics than archival metadata generally provides. For instance, in archival finding aids place names are often mentioned only in free-form narrative text and not indicated in controlled access points for places. Researchers would like to search for these locations, and place name extraction from the descriptions could support this. For the future EHRI infrastructure, we want to use IE services to enrich the researchers’ experience. In our experiments, we concentrated on extracting name and place facets, both immensely important for Holocaust research.
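The basic idea of surfacing place facets from free-form description text can be illustrated with a toy gazetteer lookup. This is our own sketch under invented data; real IE services use far richer gazetteers and context models:

```python
import re

# Toy gazetteer; a real one would come from an authority file or
# a community-maintained list of Holocaust-relevant place names.
GAZETTEER = {"Berlin", "Wilmersdorf", "Amsterdam"}

def extract_places(text):
    """Return gazetteer entries mentioned in a free-form description."""
    tokens = set(re.findall(r"[A-Za-z]+", text))
    return sorted(GAZETTEER & tokens)

description = "The family lived in Wilmersdorf before fleeing Berlin in 1939."
print(extract_places(description))  # ['Berlin', 'Wilmersdorf']
```

The extracted names could then populate controlled access points for places, making the descriptions browsable by location.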
Our experiments demonstrate the principal workflow using IE tools. In these proofs of concept, we did not address the manifold larger problems of IE from historical texts. As with OCR, off-the-shelf commercial IE software often has problems delivering acceptable results here. On the problems of extracting semantic information from low-quality textual data, cf. Packer et al. (2010); Warner and Clough (2009) describe plans for a larger extraction project at the UK National Archives. Instead of concentrating on improving the IE itself, we were mainly interested in mapping and evaluating the current state of the art. For the presentation at DH2012, we will deliver exact evaluations of off-the-shelf IE tools against various levels of underlying ‘dirty’ OCR text, as commonly encountered in OCR’d historical resources.
Even the off-the-shelf IE tools have delivered encouraging results. For instance, we OCR’d PDF files of survivor testimonies sent by the Wiener Library,2 an EHRI partner. The documents were typical, fairly low-resolution (612×790) grey-scale scans of typed documents. Because of the low resolution, we needed some advanced preprocessing of the images: we scaled them by a factor of four using an anti-aliasing filter to approximate a typical 300 DPI scan, and then binarised them (converting a colour or grey-scale image to one containing only black and white). After binarisation, additional filters were applied to deskew the images and remove edge noise.
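A preprocessing pipeline of this shape might look as follows using the Pillow imaging library. This is an illustrative sketch, not the tool chain we actually used: the threshold value is an assumption, and the deskewing and edge-noise steps are omitted:

```python
from PIL import Image

def preprocess(img, scale=4, threshold=128):
    """Upscale with an anti-aliasing filter, then binarise to black/white.

    `scale` and `threshold` are illustrative values, not tuned parameters.
    """
    # Anti-aliased (Lanczos) upscaling approximates a higher-DPI scan.
    w, h = img.size
    img = img.resize((w * scale, h * scale), Image.LANCZOS)
    # Convert to grey-scale, then threshold to a 1-bit black/white image.
    grey = img.convert("L")
    return grey.point(lambda p: 255 if p > threshold else 0, mode="1")

page = Image.new("L", (612, 790), 200)   # stand-in for a scanned page
binarised = preprocess(page)
print(binarised.size, binarised.mode)    # (2448, 3160) 1
```

In a deployed workflow, steps like these would each be a node in the OWP graph, so that staff can swap in different binarisation or filtering tools per document type.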
The resulting transcript produced by the open source Tesseract OCR engine (Smith 2007) was fairly low quality, with around 90% character accuracy. We ignored further possible improvements from combining Tesseract with more advanced binarisation and preprocessing tools: as noted, in this particular experiment we were interested in how standard IE services would react to low-quality textual input. We used the off-the-shelf OpenCalais service by Thomson Reuters to extract semantic information (Goddard & Byrne 2010). Even this standard setup proved to produce useful results. OpenCalais was successful at detecting the presence of a personal name in the transcript even when the OCR was imperfect. For example, it detected that ‘Dmulaltsr Tappe’ (Dr. Walter Tappe) was a name. It also marked up several instances of places, such as Berlin and Wilmersdorf. Other incorrectly OCR’d locations, such as ‘Slchsischestraeae’ (Schlesisches), were also marked up as places thanks to the (correctly OCR’d) preceding phrase ‘lived in’. Further semantic data marked up by OpenCalais included positions (‘lawyer’, ‘auditor’, ‘actor’) and industry terms (‘food’). In several OCR transcripts it detected the topic as ‘politics’. Social tags included ‘Berlin’, ‘Geography’ and ‘Geography of Europe’. We repeated our experiment with other IE tools, such as the open source ANNIE component of GATE (Bontcheva et al. 2002), and could further improve our results, especially as we could rely on advanced gazetteers of Holocaust place and person names built up by long-running community projects.3
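The role of context cues such as ‘lived in’ can be illustrated with a toy pattern matcher. This is our own sketch, not how OpenCalais works internally (its models are proprietary and far more sophisticated), but it shows why a garbled name can still be classified correctly from its context:

```python
import re

# Toy context rule: a capitalised token following the cue phrase 'lived in'
# is treated as a place candidate, even if OCR has mangled the name itself.
PLACE_CONTEXT = re.compile(r"\blived in\s+([A-Z][A-Za-z]+)")

def place_candidates(text):
    """Return capitalised tokens that directly follow 'lived in'."""
    return PLACE_CONTEXT.findall(text)

noisy = "Dr Tappe lived in Slchsischestraeae and later lived in Berlin."
print(place_candidates(noisy))  # ['Slchsischestraeae', 'Berlin']
```

The correctly OCR’d cue phrase carries the classification, which is exactly the behaviour we observed with ‘Slchsischestraeae’ above.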
These initial successful results mean that we will develop our OCRing of finding aids into a full service in the context of the EHRI project. We think we can thereby significantly enhance the research experience of using historical archives.
Bontcheva, K., H. Cunningham, D. Maynard, V. Tablan, and H. Saggion (2002). Developing reusable and robust language processing components for information systems using GATE. In Database and Expert Systems Applications. Proceedings 13th International Workshop. Berlin: Springer, pp. 223-227.
Bryant, M., T. Blanke, M. Hedges, and R. Palmer (2010). Open Source Historical OCR: The OCRopodium Project Research and Advanced Technology for Digital Libraries. In M. Lalmas et al. (eds.), Research and advanced technology for digital libraries. Berlin: Springer.
Cox, P. T., and S. Gauvin (2011). Controlled dataflow visual programming languages. Proceedings of the 2011 Visual Information Communication – International Symposium. Hong Kong: ACM.
Goddard, L., and G. Byrne (2010). Linked Data tools: Semantic Web for the masses. First Monday 15, 1.
Kruk, S. R., and B. McDaniel (2009). Semantic Digital Libraries. Berlin: Springer.
Packer, T. L., J. F. Lutes, A. P. Stewart, D. W. Embley, E. K. Ringger, K. D. Seppi, and L. S. Jensen (2010). Extracting person names from diverse and noisy OCR text. Proceedings of the fourth workshop on Analytics for noisy unstructured text data. Toronto, ON, Canada: ACM.
Smith, R. (2007). An Overview of the Tesseract OCR Engine. Document Analysis and Recognition. Proceedings 9th International Conference. Berlin: Springer, pp. 629-633.
Warner, A., and P. Clough (2009). A Proposal for Space Exploration at The National Archives. Available: http://ir.shef.ac.uk/cloughie/papers/York2009.pdf