
Baechler, Micheal, University of Fribourg, Switzerland, micheal.baechler@unifr.ch
Fischer, Andreas, University of Bern, Switzerland, afischer@iam.unibe.ch
Naji, Nada, University of Neuchatel, Switzerland, nada.naji@unine.ch
Ingold, Rolf, University of Fribourg, Switzerland, rolf.ingold@unifr.ch
Bunke, Horst, University of Bern, Switzerland, bunke@iam.unibe.ch
Savoy, Jacques, University of Neuchatel, Switzerland, jacques.savoy@unine.ch

The HisDoc project aims at developing automatic, generic tools to support cultural heritage. More precisely, we have designed and implemented a system that performs a fully automated conversion of manuscripts into the corresponding electronic format and provides effective search capabilities. To demonstrate the effectiveness of the proposed solution, we have conducted a large-scale experiment using medieval handwritten manuscripts written in Middle High German. The corpus used in our experiments is based on the well-known medieval epic poem Parzival, attributed to Wolfram von Eschenbach. The first version dates to the first quarter of the thirteenth century. Several versions (with variations) exist today; the St. Gall collegiate library cod. 857 is the one used for our experimental evaluation.

The complete system does not make strong assumptions about the current corpus, and it can be adapted with little effort to handle other types of documents and languages. The proposed HisDoc system is subdivided into three main components, namely: image analysis, text recognition and information retrieval.

The image analysis module has two essential goals. The first is image enhancement, which aims at improving image quality so as to ease all subsequent processing steps. The second is layout analysis, which provides a structural description of the scanned document pages. This meta-information may include the number of columns, the number of lines per column, the location of the headline, etc. These old manuscripts suffer from ink bleeding, holes and stitches in the parchment, as well as artifacts surrounding the non-uniform handwritten text, all of which pose a difficult challenge at this phase and in the phases that follow.
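Purely as an illustration of the kind of layout meta-information produced at this stage, a minimal container could look as follows in Python; the field names are our own assumption and do not reflect HisDoc’s actual data model:

    from dataclasses import dataclass
    from typing import List, Tuple

    # Hypothetical container for per-page layout meta-information;
    # the field names are illustrative, not HisDoc's actual data model.
    @dataclass
    class PageLayout:
        n_columns: int                            # number of text columns on the page
        lines_per_column: List[int]               # line count for each column
        headline_box: Tuple[int, int, int, int]   # headline location (x, y, width, height)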

The resulting digital images and their meta-information are then passed to the text recognition phase, the second module of HisDoc, in order to produce the corresponding digital transcription. Here, flexible and robust recognition systems based on Hidden Markov Models (Marti & Bunke 2001) and Neural Networks (Graves et al. 2009) are used for the task of automating the transcription of historical texts (Fischer et al. 2009). In this context, flexibility means that the system can be adapted to new writing styles without great effort, and robustness means that the recognizer should attain a high rate of correctly recognized words. Besides the automatically generated version, an error-free transcription was created manually with the help of experts. This latter version forms our ground-truth text and is used to assess the performance levels in the different stages of our project.

For the analysis and recognition of historical documents, only little work has been reported in the literature so far. Bunke & Varga (2007) and Likforman-Sulem et al. (2007) present surveys of approaches to off-line handwriting recognition and of state-of-the-art text line extraction methods, respectively. Commercial systems with high recognition accuracy are available only for restricted domains with modern handwritten scripts, e.g., postal address reading (Srihari et al. 1996) and bank check reading (Gorski et al. 2001).

The HisDoc recognition module has improved steadily and currently achieves a word accuracy close to 94%, i.e., a word-error rate of around 6%, using Hidden Markov Models and character similarity features (Fischer et al. 2010). Flawless recognition is unfortunately impossible to obtain, especially in the presence of the aging and decoration issues mentioned above. The error rate depends on various factors, such as the accuracy of the recognition system, the quality of the contrast between the background and the ink, and the regularity of the handwriting. Finally, the size of the training set also plays a role, but relatively good performance can already be achieved with a fairly small training set (around 6,000 words).

In order to reduce the human effort needed to generate training samples, HisDoc’s text recognition module also includes methods for efficient ground-truth creation and for text-image alignment when an electronic text edition already exists (Fischer et al. 2011).

Performing effective searches on the resulting transcriptions is the goal of the third and last module in HisDoc, the information retrieval (IR) module. The cascading effects of manuscript quality, graphical characteristics and recognition error rate all have an impact on the performance level. We must also consider additional factors particular to medieval writing. Our corpus was handwritten during the thirteenth century, when spelling was not standardized, which introduces yet another challenge. Moreover, the co-existence of different spellings referring to the same entity within the same manuscript reduces retrieval effectiveness. One should also keep in mind that the grammar of medieval languages differed markedly from that of today: it allowed more flexibility for the writer and varied from one region to another, or even between writers in the same region. All of these grammatical and orthographic factors (spelling variations, punctuation, capitalization of initials, etc.), in addition to the challenges faced in the first two modules of HisDoc, burden the retrieval process and lower the performance level if the retrieval system is built in the conventional way. To quantify this, the retrieval effectiveness is compared to that obtained using the error-free transcription.

In HisDoc’s retrieval module, we introduced techniques that integrate possible spelling variants in order to enhance retrieval effectiveness. These techniques allow a form of soft matching between the query terms representing the user’s information need and four different text surrogates of the transcription, each of which incorporates a different intensity of variant inclusion. As a concrete example, the term ‘Parzival’ appears in the original manuscript as ‘Parcifal’, ‘Parcival’ and ‘Parzifal’. All of these variants are possible and must be considered correct spellings. At this lexical level, one must also consider inflectional morphology: various suffixes could be added to nouns, adjectives and names to indicate their grammatical case. With this additional aspect, the proper name ‘Parcival’ may also appear as ‘Parcivale’, ‘Parcivals’ or ‘Parcivalen’, increasing the number of potential spelling variants that we need to take into account.
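To make the idea of soft matching more concrete, the following sketch shows how a query term could be expanded with its known spelling and inflectional variants before being matched against a transcription surrogate. It is a minimal illustration in Python; the variant table and the function name are ours and do not correspond to HisDoc’s actual implementation.

    # Hypothetical illustration of variant-aware query expansion; the variant
    # table and function name are ours, not HisDoc's actual implementation.
    VARIANTS = {
        "parzival": {"parcifal", "parcival", "parzifal",
                     "parcivale", "parcivals", "parcivalen"},
    }

    def expand_query(terms, variants=VARIANTS):
        """Return the query terms plus any known spelling/inflection variants."""
        expanded = set()
        for term in terms:
            term = term.lower()
            expanded.add(term)
            expanded.update(variants.get(term, set()))
        return sorted(expanded)

    print(expand_query(["Parzival"]))
    # ['parcifal', 'parcival', 'parcivale', 'parcivalen', 'parcivals', 'parzifal', 'parzival']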

Regarding text representations, we have implemented a whole-word representation, short overlapping sequences of characters within each word (n-grams) and the first n characters of each word (trunc-n). During this indexing stage, we have also evaluated the effect of different word normalization procedures such as stemming. This procedure tries to extract the stem from a surface form by automatically removing its inflectional suffixes (number, gender) (light stemmer) or both its inflectional and derivational suffixes (aggressive stemmer). As for search strategies, we have evaluated the classical vector-space model (tf-idf) and its variants as well as several probabilistic models.
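A minimal sketch of the two word-fragment representations described above is given below, assuming illustrative parameter values (n = 4 for n-grams, n = 5 for trunc-n) that do not necessarily match those used in our experiments:

    # Sketch of the n-gram and trunc-n representations; parameter values and
    # function names are illustrative only.
    def ngrams(word, n=4):
        """Overlapping character n-grams of a word (the word itself if shorter)."""
        if len(word) <= n:
            return [word]
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    def trunc_n(word, n=5):
        """Keep only the first n characters of the word."""
        return word[:n]

    print(ngrams("parcivalen"))   # ['parc', 'arci', 'rciv', 'civa', 'ival', 'vale', 'alen']
    print(trunc_n("parcivalen"))  # parci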

Based on our experiments with a well-known evaluation methodology (Voorhees & Harman 2005) and with either short (single-term) or long (three-term) queries, we found that probabilistic IR models tend to produce better retrieval effectiveness (Naji & Savoy 2011). Regarding the stemming procedure, aggressive stemming tends to produce slightly better retrieval performance for longer queries, whereas ignoring stemming normalization usually offers the best performance for short queries.

The presence of compound words was also analyzed in order to improve retrieval quality. The German language is known for compound constructions, which occur more frequently than in English. The main concern here is that the same concept can be expressed with or without a compound form (e.g., Bankpräsident or Präsident der Bank). Applying an automatic decompounding strategy may provide some improvement, particularly for longer queries.
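A toy dictionary-based decompounding routine of the kind alluded to above might look as follows; the lexicon and the splitting rule are deliberately simplistic and are not HisDoc’s actual algorithm:

    # Toy dictionary-based decompounding; the lexicon and splitting rule are
    # illustrative, not HisDoc's actual algorithm.
    LEXICON = {"bank", "präsident"}

    def decompound(word, lexicon=LEXICON):
        """Split a word into two known lexicon entries, if possible."""
        w = word.lower()
        for i in range(3, len(w) - 2):      # require parts of at least 3 characters
            head, tail = w[:i], w[i:]
            if head in lexicon and tail in lexicon:
                return [head, tail]
        return [w]                          # leave non-compounds unchanged

    print(decompound("Bankpräsident"))      # ['bank', 'präsident']
    print(decompound("Parzival"))           # ['parzival']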

We have assessed the various recognition corpora against the ground-truth version. Compared to this error-free version, the simplest transcription surrogate (the classical output of the recognition system, i.e., including no variants) shows a mean degradation in retrieval performance of around -10.24% for single-term queries. Systematically considering three or seven variants per input word usually causes the retrieval effectiveness to decrease significantly (from -5.42% for three-term queries to -64.34% for single-term queries). Including more possible terms per recognized word is therefore not an effective solution. We have thus proposed a wiser strategy that includes word variants according to the likelihood probabilities obtained during the recognition stage. This approach produces clearly better retrieval performance. To put this achievement in perspective, we can compare the best results obtained using two versions of an English-language corpus having character error rates of 5% and 20%: the retrieval performance degradation was around -17% and -46% respectively (Voorhees & Harman 2005). For our best results on the Parzival corpus, which has a word-error rate of around 6%, the retrieval performance degradation was limited to around -5%.
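The probability-driven inclusion of variants can be sketched as follows: instead of always indexing a fixed number of alternatives per word, only those whose recognition likelihood passes a threshold are kept. The threshold value and function name below are illustrative assumptions; HisDoc’s exact selection rule may differ.

    # Hedged sketch of probability-driven variant inclusion; the threshold and
    # names are illustrative, not HisDoc's exact selection rule.
    def select_variants(candidates, threshold=0.05):
        """Keep recognition alternatives whose probability reaches the threshold,
        always retaining at least the single best hypothesis."""
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        kept = [word for word, prob in ranked if prob >= threshold]
        return kept if kept else [ranked[0][0]]

    # One word position with its recognition alternatives and probabilities.
    alternatives = [("parcival", 0.81), ("parcifal", 0.12),
                    ("parzifal", 0.04), ("barcival", 0.01)]
    print(select_variants(alternatives))    # ['parcival', 'parcifal']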

The best practices and conclusions drawn from our experiments can then be applied to further manuscripts, making them digitally accessible and effectively searchable at the lowest possible cost by partially or entirely eliminating the need for manual transcription, which in turn saves considerable resources (time, human effort, money, etc.). With these documents fully searchable by digital means, more users’ information needs can be satisfied through a web-service-based search engine.

Acknowledgement

The authors wish to thank Prof. Michael Stolz, Dr. Gabriel Viehhauser (University of Bern, Switzerland), Prof. Anton Näf and Mrs. Eva Wiedenkeller (University of Neuchatel) for their valuable support. This research is supported by the Swiss NSF under Grant CRSI22_125220.

References

Bunke, H., and T. Varga (2007). Off-line Roman Cursive Handwriting Recognition. In B. Chaudhuri (ed.), Digital Document Processing: Major Directions and Recent Advances. Berlin: Springer, pp. 165-173.

Fischer, A., M. Wüthrich, M. Liwicki, V. Frinken, H. Bunke, G. Viehhauser, and M. Stolz (2009). Automatic Transcription of Handwritten Medieval Documents. Proceedings of the 15th International Conference on Virtual Systems and Multimedia, pp. 137-142.

Fischer, A., K. Riesen, and H. Bunke (2010). Graph Similarity Features for HMM-Based Handwriting Recognition in Historical Documents. Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition, pp. 253-258.

Fischer, A., E. Indermühle, V. Frinken, and H. Bunke (2011). HMM-Based Alignment of Inaccurate Transcriptions for Historical Documents. Proceedings of the 11th International Conference on Document Analysis and Recognition, pp. 53-57.

Gorski, N., V. Anisimov, E. Augustin, O. Baret, and S. Maximov (2001). Industrial Bank Check Processing: The A2iA Check Reader. International Journal on Document Analysis and Recognition 3(4): 196-206.

Graves, A., M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber (2009). A Novel Connectionist System for Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5): 855-868.

Likforman-Sulem, L., A. Zahour, and B. Taconet (2007). Text Line Segmentation of Historical Documents: A Survey. International Journal on Document Analysis and Recognition 9(2): 123-138.

Marti, U.-V., and H. Bunke (2001). Using a Statistical Language Model to Improve the Performance of an HMM-Based Cursive Handwriting Recognition System. International Journal of Pattern Recognition and Artificial Intelligence 15(1): 65-90.

Naji, N., and J. Savoy (2011). Information Retrieval Strategies for Digitized Handwritten Medieval Documents. Proceedings of the Asian Information Retrieval Symposium, Dubai, December 2011, LNCS #7097. Berlin: Springer, pp. 103-114.

Srihari, S. N., Y. Shin, and V. Ramanaprasad (1996). A System to Read Names and Addresses on Tax Forms. Proceedings of the IEEE 84(7): 1038-1049.

Voorhees, E. M., and D. K. Harman (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA: MIT Press.