The European Holocaust Research Infrastructure (EHRI)
For EHRI, we investigated how the open source OCR infrastructure developed at King’s College London for the Ocropodium project (Bryant et al. 2010) can be used so that its output feeds the semantic extraction of metadata useful for research discovery and analysis. Current commercial OCR technology does not serve such specific research interests in historical document collections well, as it cannot be easily customised. Most commercial OCR software products are proprietary ‘black boxes’ which give digitisation staff little scope for understanding their behaviour or customising the parameters under which they run. At the source level, OCR software manufacturers are markedly reluctant to allow access to their code, even in a collaborative environment.
In the context of Ocropodium, we developed a workflow tool, Ocropodium Web Processing (OWP), with which various open source OCR tools can be combined to create the best possible OCR solution for specific document types. The OWP workflow environment is based on the principles of visual programming environments (Cox & Gauvin 2011). It allows the user to build custom workflows composed of discrete processes. The workflows need not be purely linear; instead they take the form of a directed acyclic graph (DAG). The DAG consists of connected nodes, each of which performs a discrete function. Nodes can be thought of as functions which take one or more inputs and evaluate these to produce a result. Because any such function can serve as a node, OWP allows archive staff to embed further services beyond OCR. For this paper, we experimented with information extraction (IE) services to semantically enrich archival descriptions. We present our experiments to evaluate how common off-the-shelf IE services behave against potentially noisy OCR’d texts.
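The idea of a workflow as a DAG of function-like nodes can be illustrated with a minimal sketch. This is not the OWP implementation: the node names, the node functions and `run_workflow` are hypothetical, and simple string operations stand in for real OCR steps such as binarisation or recognition. The only library used is Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical node functions. Each takes a list of upstream results
# and returns one result, mirroring "nodes as functions" in the text.
def load(_inputs):
    return "  Dr. Walter Tappe lived in Berlin  "

def strip(inputs):
    return inputs[0].strip()

def upper(inputs):
    return inputs[0].upper()

def length(inputs):
    return len(inputs[0])

def combine(inputs):
    text, n = inputs
    return f"{text} ({n} chars)"

# Node name -> (function, names of the upstream nodes it reads from).
# "strip" feeds two branches that rejoin in "combine", so the
# workflow is a graph rather than a purely linear pipeline.
WORKFLOW = {
    "load": (load, []),
    "strip": (strip, ["load"]),
    "upper": (upper, ["strip"]),
    "length": (length, ["strip"]),
    "combine": (combine, ["upper", "length"]),
}

def run_workflow(workflow, target):
    """Evaluate every node once, in dependency order, and return the target's result."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    # static_order() yields upstream nodes first and raises CycleError
    # if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func([results[d] for d in deps])
    return results[target]

print(run_workflow(WORKFLOW, "combine"))
```

In such a design, adding an information extraction step amounts to registering one more node that reads from the OCR output node.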
Semantically enriched library and archive federations have recently become an important part of research in digital libraries (Kruk & McDaniel 2009), especially as research users often make more demands on semantics than archival metadata generally provides. For instance, in archival finding aids place names are often mentioned only in free-form narrative text and are not indicated in controlled access points for places. Researchers would like to search for these locations, and place name extraction from the descriptions might support this. For the future EHRI infrastructure, we want to use IE services to enrich the researchers’ experience. In our experiments, we concentrated on extracting name and place facets, both immensely important for Holocaust research.
Our experiments demonstrate the principal workflow using IE tools. In our proofs of concept, we did not address the larger problems of IE from historical texts, which are manifold. As with OCR, off-the-shelf commercial IE software often has problems delivering acceptable results here. For the problems of extracting semantic information from low-quality textual data, see Packer et al. (2010), while Warner & Clough (2009) describe plans for a larger extraction project from the UK National Archives. Instead of concentrating on improving the IE itself, we were mainly interested in mapping and evaluating the current state of the art. For the presentation at DH2012, we will deliver exact evaluations of using off-the-shelf IE tools against various levels of underlying ‘dirty’ OCR text, commonly encountered in OCRed historical resources.
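An evaluation against several noise levels could be set up along the following lines. This is a sketch under stated assumptions, not the evaluation procedure used in the project: character accuracy is taken here to be one minus the edit distance divided by the reference length, noise is simulated by random letter substitution, and the helper names (`character_accuracy`, `add_noise`, `entity_recall`) are hypothetical.

```python
import random

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def character_accuracy(reference, ocr_text):
    """Assumed definition: 1 - edit distance / reference length, floored at 0."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - levenshtein(reference, ocr_text) / len(reference))

def add_noise(text, error_rate, seed=0):
    """Simulate 'dirty' OCR by replacing letters with random lowercase letters."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(alphabet) if ch.isalpha() and rng.random() < error_rate else ch
        for ch in text
    )

def entity_recall(expected, extracted):
    """Share of the expected entities that an IE tool actually returned."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(extracted)) / len(expected)

reference = "Dr. Walter Tappe lived in Berlin"
for rate in (0.0, 0.05, 0.1, 0.2):
    noisy = add_noise(reference, rate)
    # A real run would send `noisy` to an IE service and pass the returned
    # entities, together with a gold standard, to entity_recall().
    print(rate, round(character_accuracy(reference, noisy), 2))
```

Plotting entity recall against character accuracy for each tool would then show how quickly extraction quality degrades as the OCR gets worse.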
Even the off-the-shelf IE tools have already delivered encouraging results. For
instance, we OCR’d PDF files of survivor testimonies sent by the Wiener Library.
The resulting transcript produced by the open source Tesseract OCR engine (Smith 2007) was of fairly low quality, with around 90%
character accuracy. We ignored further possible improvements by combining Tesseract
with more advanced binarisation and pre-processing tools. As noted above, in this
particular experiment we were interested in how standard IE services would react to
low-quality textual input. We used the off-the-shelf OpenCalais service by
Thomson Reuters to extract semantic information (Goddard
& Byrne 2010). Even this standard setup produced useful results.
OpenCalais proved successful at detecting the presence of a personal name in the
transcript, even when the OCR was imperfect. For example, it detected that
‘Dmulaltsr Tappe’ (Dr. Walter Tappe) was a name. It also marked up several instances
of places, such as Berlin and Wilmersdorf. Other incorrectly OCR’d locations such as
‘Slchsischestraeae’ (Schlesisches) were also marked up as places, due to the
(correctly OCR’d) preceding phrase ‘lived in’. Further semantic data marked up by
OpenCalais included positions (‘lawyer’, ‘auditor’, ‘actor’) and industry terms
(‘food’). In several OCR transcripts it detected the topic as ‘politics’. Social
tags given included ‘Berlin’, ‘Geography’ and ‘Geography of Europe’. We repeated our
experiment with other IE tools such as the open source ANNIE tool from GATE (Bontcheva et al. 2002) and could further improve our
results, especially as we could rely on advanced gazetteers of place and person
names of the Holocaust that are based on long-running community projects.
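How a gazetteer can help with imperfectly OCR’d names may be sketched with approximate string matching. The example below is illustrative only and does not reproduce the GATE setup used in the experiments: the three-entry gazetteer, the `match_gazetteer` helper and the similarity cutoff of 0.6 are assumptions, and the matching relies on `difflib.get_close_matches` from the Python standard library rather than on any GATE component.

```python
import difflib

# Hypothetical mini-gazetteer; a real one would come from the
# community-maintained lists of Holocaust-related places and persons.
GAZETTEER = ["Berlin", "Wilmersdorf", "Walter Tappe"]

def match_gazetteer(candidate, gazetteer=GAZETTEER, cutoff=0.6):
    """Return the closest gazetteer entry for an OCR'd string, or None."""
    # Compare case-insensitively, but return the entry in its original form.
    lowered = {entry.lower(): entry for entry in gazetteer}
    hits = difflib.get_close_matches(
        candidate.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None

for token in ["Berlin", "Wilmersd0rf", "xyzq"]:
    print(token, "->", match_gazetteer(token))
```

The cutoff trades recall against false matches: a lower value recovers more badly garbled names but also maps more unrelated strings onto gazetteer entries, so it would need tuning against real OCR output.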
Given these initial successful results, we will develop our OCRing of finding aids into a full service in the context of the EHRI project. We think this can significantly enhance the research experience of using historical archives.
Bontcheva, K., H. Cunningham, D. Maynard, V. Tablan, and H.
Saggion (2002). Developing reusable and robust language processing
components for information systems using GATE. In Database
and Expert Systems Applications. Proceedings 13th International
Workshop. Berlin: Springer, pp. 223-227.
Bryant, M., T. Blanke, M. Hedges, and R. Palmer (2010).
Open Source Historical OCR: The OCRopodium Project.
In M. Lalmas et al. (eds.), Research and Advanced Technology for Digital
Libraries. Berlin: Springer.
Cox, P. T., and S. Gauvin (2011). Controlled dataflow
visual programming languages. Proceedings of the 2011
Visual Information Communication - International
Symposium. Hong Kong: ACM.
Goddard, L., and G. Byrne (2010). Linked Data tools:
Semantic Web for the masses. First Monday 15, 1.
Kruk, S. R., and B. McDaniel (2009). Semantic Digital Libraries. Berlin: Springer.
Packer, T. L., J. F. Lutes, A. P. Stewart, D. W. Embley, E.
K. Ringger, K. D. Seppi, and L. S. Jensen (2010). Extracting person
names from diverse and noisy OCR text. Proceedings of the
fourth workshop on Analytics for noisy unstructured text data.
Toronto, ON, Canada: ACM.
Smith, R. (2007). An Overview of the Tesseract OCR
Engine. Document Analysis and Recognition. Proceedings 9th International
Conference. Berlin: Springer, pp. 629-633.
Warner, A., and P. Clough (2009). A
Proposal for Space Exploration at The National Archives. 2011.
Available: http://ir.shef.ac.uk/cloughie/papers/York2009.pdf