The European Library holds several million bibliographic records from the national libraries of Europe. The National Bibliographies are one of the main bibliographic data sources in each country and are key for Digital Humanities scholars.
Their purpose is to list every publication in a country, under the auspices of a national library or other government agency. Depending on the country, either all publishers are required to send a copy of every published work to the national legal deposit, or a national organisation is responsible for collecting all publications.
Given that the publisher domain is very heterogeneous and that thousands of publishers might exist in a country, National Bibliographies are effectively the single point of reference with which to comprehensively identify all the publications in a country.
In Europe, National Bibliographies are typically created and maintained by national libraries. Whenever a book is published in a country, it is recorded in the corresponding national library catalogue from where the national bibliography is derived.
Currently, The European Library holds approximately 75 million bibliographic records in its centralised repository. This number is constantly increasing, as more national libraries’ catalogues are included in the centralised repository. By the end of 2012 the total bibliographic universe of The European Library is expected to be approximately 200 million records.
The centralisation of European bibliographic data in The European Library is creating new possibilities for the exploitation of this data, in order to improve existing services, enable the development of new ones, or provide a good setting for research.
The centralisation of the bibliographic data enables the automatic linkage of the National Bibliographies across countries, through the use of data mining technologies. Our ongoing work focuses on the linkage of the main entities used to describe bibliographic resources: persons, organisations, locations, historical periods, and subjects.
In this poster, we present the current status of our work on the consolidation of authors across the National Bibliographies of Europe. When complete, it will allow the exploration of an author’s work across time and space in the European bibliographic universe.
Entity resolution is the process of determining, in a given context, whether two or more references correspond to the same entity. An entity might have multiple different representations, and each representation might match the description of multiple objects (i.e., reference and referent ambiguity). The variations found in the descriptions may have multiple origins, such as misspellings, typing errors, different conventions for abbreviations, names varying over time, heterogeneous data schemas, etc.
Entity resolution is a common problem to many different research communities, although the term used is not always the same. Common designations include record linkage, record matching, merge-purge, data de-duplication, instance identification, database hardening, name matching, reference reconciliation, reference disambiguation, and object consolidation (Elmagarmid et al. 2007; Dong et al. 2005).
Different communities have proposed several techniques, but most frequently we find applications of algorithms from machine learning, artificial intelligence and data mining (Elmagarmid et al. 2007).
Entity resolution is highly dependent on the context: the processes need to be adapted to the data they are applied to in order to achieve acceptable results. Our work focuses on exploring the structural and semantic richness of the bibliographic data, as it exists in the National Bibliographies.
In bibliographic records, references to persons are found as authors or contributors of works, and sometimes as the subject of the work. The value of these references being as complete as possible is recognised by cataloguing rules, which indicate that the references should contain not only the name of the person but also their birth and death years (ALA et al. 2002). However, these dates do not always exist, as they are not always known by the cataloguers. It is also often the case that these dates are approximations of real dates – cataloguing rules comprise conventions for these cases. Although this information is not always fully structured, the common conventions used when encoding the information allow it to be reliably parsed automatically.
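As a minimal sketch of how such conventions can be parsed automatically, consider the following. The heading layout (e.g. "Dickens, Charles, 1812-1870", with "ca." marking approximate dates) and the `parse_heading` helper are illustrative assumptions, not the system's actual parser; real conventions vary by national bibliography.

```python
import re

# Hypothetical person heading layout: "Surname, Forename, 1812-1870",
# optionally with "ca." before an approximate birth year, and an
# optional death year ("Smith, John, 1820-").
HEADING = re.compile(
    r"^(?P<name>.+?)"
    r"(?:,\s*(?P<approx>ca\.\s*)?(?P<birth>\d{3,4})\s*-\s*(?P<death>\d{3,4})?)?$"
)

def parse_heading(heading):
    """Split a person heading into a name and optional life dates."""
    m = HEADING.match(heading.strip())
    if m is None:
        return None
    return {
        "name": m.group("name").strip(),
        "birth": int(m.group("birth")) if m.group("birth") else None,
        "death": int(m.group("death")) if m.group("death") else None,
        "approximate": m.group("approx") is not None,  # the "ca." convention
    }
```

A heading without dates ("Smith, John") parses to a name with both dates set to `None`, mirroring the case where the cataloguer did not know them.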
Our approach leverages two key aspects of the National Bibliographies and of the author consolidation work already carried out by libraries.
The first aspect is that national libraries already individually perform a manual consolidation of authors through their ongoing work to maintain National Bibliographies.
The second aspect is that some European national libraries actively work on the construction of the Virtual International Authority File, or VIAF (Bennett et al. 2006). VIAF is a joint project of several national libraries from all continents. It hosts a consolidated data set containing data that national libraries have gathered for many years about the authors of the bibliographic resources held at the libraries. It is available as open data.
Using VIAF, we can already consolidate authors across the VIAF participating countries, and we will soon exploit this resource to consolidate authors from other countries. By extracting statistics about authors consolidated in VIAF from the National Bibliographies of VIAF participants, we expect to derive a probabilistic model that will allow us to consolidate authors from countries not participating in VIAF.
The author consolidation system is being built as an ETL (Extraction, Transformation and Loading) process, a typical approach for performing consolidation of data in data warehouses.
The process starts with the preparation of data for consolidation processing. This step comprises tasks for selecting the relevant data from the National Bibliographies that will be used to represent an author during the consolidation process. The following data about the authors is gathered at the level of each individual national bibliography:
The author's name, with birth and death dates when available.
Known titles of publications.
Known co-authors.
Known publishers.
As is typical in ETL processes, the final decision to match two author references is made by reasoning on the similarity scores obtained by comparing each of the above data elements.
Based on previous research on this topic (Freire et al. 2008), we use the Jaro-Winkler similarity metric (Jaro 1989) for calculating the similarity between names.
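The Jaro-Winkler metric rewards strings that agree on a common prefix, which suits surname-first name headings. The following is a self-contained sketch of the standard formulation, not the project's actual implementation:

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, penalising
    transpositions among the matched characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that appear out of order.
    transpositions, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score by the length of the common prefix (up to 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

For example, `jaro_winkler("MARTHA", "MARHTA")` evaluates to approximately 0.961: the single TH/HT transposition costs little, and the shared "MAR" prefix lifts the score further.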
Comparison of the remaining data fields is based on the number of common values found in the two records versus the total number of available values. For example, the similarity of co-authors is given by the number of co-authors shared by two authors, with similar calculations for titles and publishers.
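One plausible reading of this overlap measure is a Jaccard-style set similarity; the function name and exact normalisation below are our illustrative assumptions:

```python
def overlap_similarity(values_a, values_b):
    """Share of common values over all distinct values in the two records,
    e.g. co-author lists, title lists, or publisher lists."""
    a, b = set(values_a), set(values_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For two co-author lists sharing one name out of three distinct names in total, e.g. `overlap_similarity(["Borbinha", "Martins"], ["Martins", "Freire"])`, the score is 1/3.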
This information is compared across National Bibliographies to identify, for example, two authors who can actually be considered to be the same.
The solution for reasoning on the outcome of the comparisons will be based on a supervised machine learning model. It will process the results of comparing two author records, as described above, and determine the likelihood that the two records refer to the same entity. Several machine learning techniques for classification will be tested.
As ground truth for building and testing the model, we will use a data set extracted from the National Bibliographies of VIAF participants, thereby removing the need to create a manually annotated collection.
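Assuming each candidate pair of author records is reduced to a vector of similarity scores and labelled using the VIAF-derived ground truth, the matching decision could be learned as follows. This is a minimal logistic-regression sketch in pure Python; the feature layout and toy data are illustrative, and the actual system may use other classifiers:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic-regression matcher on similarity-score vectors
    by plain stochastic gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted match probability
            g = p - yi                        # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def match_probability(w, b, features):
    """Likelihood that two author records refer to the same entity."""
    z = sum(wj * xj for wj, xj in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy training pairs: [name similarity, title overlap, co-author overlap],
# labelled 1 when the ground truth says the references are the same author.
X = [[0.98, 0.60, 0.50], [0.95, 0.40, 0.30],
     [0.55, 0.05, 0.00], [0.40, 0.00, 0.00]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

A pair with a high name similarity and overlapping titles then scores close to 1, while a pair agreeing on nothing scores close to 0, and a threshold on this probability yields the final match decision.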
This poster presented the current status of our work on the consolidation of authors across the National Bibliographies of Europe. When complete, it will allow the exploration of an author’s work across time and space, in a much more comprehensive way than is possible today.
We believe that this consolidation will provide new opportunities for the creation of statistical datasets, resulting from data analysis and mining of bibliographic data. Although no concrete plans are ready at this time, we expect these kinds of data sets to emerge as a future result.
This work will also have an impact on intellectual property rights identification processes, such as those of the ARROW project1 (Accessible Registries of Rights Information and Orphan Works). In such processes, the identification of all publications of an author’s works is essential and can benefit from these consolidated author bibliographies.
ALA, CLA, CILIP (2002). Anglo-American Cataloguing Rules: 2002 Revision.
Bennett, R., C. Hengel-Dittrich, E. O’Neill, and B. B. Tillett (2006). VIAF (Virtual International Authority File): Linking Die Deutsche Bibliothek and Library of Congress Name Authority Files.
Dong, X., A. Halevy, and J. Madhavan (2005). Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD international Conference on Management of Data. SIGMOD ’05. ACM: New York, NY, pp. 85-96.
Elmagarmid, A. K., P. G. Ipeirotis, and V. S. Verykios (2007). Duplicate Record Detection: A Survey. IEEE Transactions on knowledge and data engineering 19(1): 1-16, DOI: 10.1109/TKDE.2007.250581.
Freire, N., J. Borbinha, and B. Martins (2008). Consolidation of References to Persons in Bibliographic Databases. In ICADL 2008: The 11th International Conference on Asian Digital Libraries, Universal and Ubiquitous Access to Information. Berlin: Springer-Verlag, pp. 256-265.
Jaro, M. A. (1989). Advances in record linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84(406): 414-420.
1. Accessible Registries of Rights Information and Orphan Works website: http://www.arrow-net.eu/