The Chymistry of Isaac Newton Project (http://www.chymistry.org) added a new component to its website that allows researchers and interested readers to use the extensive results of its Latent Semantic Analysis studies of Newton’s alchemical manuscripts (NSF STS Project #0620868).
The new LSA web component on the Chymistry website allows researchers to form their own queries to discover correlated passages across the published corpus. The component returns 250-word or 1000-word passages that share significant vocabulary, drawn from a published corpus of 62 manuscripts and over 450,000 words of seventeenth-century English, French, and Latin. At the user’s request, the component will also draw network graphs that visualize the structure of semantic relations among the passages in the documents. These network graphs can be examined in Network Workbench, which was developed at the Cyberinfrastructure for Network Science Center in Indiana University’s School of Library and Information Science.
The new Chymistry LSA component will also allow researchers to use regular expressions to form queries to investigate the relationships between words and draw concept maps as network graphs. Unlike many Bayesian topic analysis techniques in which the words drop out in favor of topics (Blei et al., 2003), latent semantic analysis keeps the vocabulary visible and available for direct text analysis.
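To illustrate the kind of regular-expression querying described above, the sketch below filters a small, invented vocabulary list for all forms of a word stem. The term list and the query are hypothetical examples, not drawn from the actual Chymistry corpus index.

```python
import re

# Hypothetical sample of terms from an LSA vocabulary
# (illustrative only; not the actual Chymistry term index).
vocabulary = ["mercurius", "mercury", "sublimate", "sublimation", "antimony"]

# A regular-expression query matching any term beginning with "sublim".
query = re.compile(r"^sublim", re.IGNORECASE)

matches = [term for term in vocabulary if query.search(term)]
print(matches)  # ['sublimate', 'sublimation']
```

Because the vocabulary remains visible in LSA (unlike topics in Bayesian topic models), a regex match like this can be used directly to select the terms whose relationships are then drawn as a concept map.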
The new LSA web component on the Chymistry website currently provides results for the 62 manuscripts that have been released in the Chymistry Project’s digital edition. All 119 manuscripts and an expanded LSA component will be released in 2012.
Walsh and Hooper presented initial results of this LSA work with Newton’s alchemical papers and with the poetry and literary criticism of Algernon Charles Swinburne at DH 2011. New directions in their collaboration include further LSA work that combines the King James Version of the Bible with Swinburne’s corpus to detect patterns of influence and to identify Swinburne’s extensive borrowing of language from the King James Bible. Similarly, they intend to combine the Geneva Bible with the corpus of Newton’s theological and alchemical texts to detect patterns of influence, allusion, and borrowing or quoting in Newton’s texts.
Since that presentation, there has been considerable interest in their LSA methods from other projects and independent digital humanists.
The Digital Library Program at Indiana University is a partner institution in Project Bamboo, whose mission is to deliver a research environment for humanities scholars and a corresponding infrastructure for librarians and technologists supporting humanities research. Cowan and Jiao of the IU Digital Library Program have been involved in current Project Bamboo activities. They have worked on creating a digital humanities research environment using the HubZero platform, a collaboration and research platform originally created to serve research communities in the natural sciences. Cowan and Jiao have built several digital humanities tools, such as a Java-based page-turning tool and a tool that provides topic modeling analysis of textual contents. Hooper and Walsh from the Chymistry of Isaac Newton project will work with Cowan and Jiao from the Digital Library Program to port the LSA algorithms into a tool on HubZero.
Both the Chymistry Project and the Swinburne Project are partners with the IU Digital Library Program, which provides the technical infrastructure for both projects. Cowan and Jiao raised the possibility with Hooper and Walsh of modifying the code they had developed for those projects and making it available as part of Project Bamboo.
Cowan, Jiao, and Hooper argue that every humanities project interprets and analyzes unique content, and that each project’s public end-user website likewise has its own design constraints. What needs to be provided is a framework that allows each project to bring its data sets to a processing interface and submit them, with the full and reasonable expectation of receiving standardized outputs that the project can process and exploit in its own design context.
The Chymistry and Swinburne projects use Perl scripts to extract data streams from TEI/XML-encoded documents, then process the streams in MATLAB on a supercomputer at Indiana University to produce CSV data that is subsequently imported into MySQL databases for use on their public websites.
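The extraction step in that pipeline can be sketched as follows. The projects themselves use Perl; this is a minimal Python analogue that pulls plain-text paragraphs out of a TEI-like document for downstream LSA processing. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented TEI-like fragment, standing in for one of the encoded manuscripts.
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Take of the mercurial water one part.</p>
    <p>Digest gently for seven days.</p>
  </body></text>
</TEI>"""

# TEI documents declare a default namespace, so queries must use it.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)

# Extract the plain-text stream: one string per paragraph element.
paragraphs = [p.text for p in root.findall(".//tei:p", ns)]
print(paragraphs)
```

The resulting text stream is what would then be handed to the numerical (MATLAB) stage of the workflow.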
Our collaboration, however, has decided to provide tools that accomplish just the numerical processing step in MATLAB: we ask user projects to submit text data in simple text input structures, and we return simple, well-defined CSV data that can be exploited in any manner desired, whether for end-user web interfaces or for further numerical or algorithmic processing.
The LSA analysis workflow in our MATLAB programs contains several components. First, the corpus is read into the system and converted to a term/document frequency matrix; terms of very low frequency and of high frequency (user-defined stop words) are removed. Second, a TF/IDF (Term Frequency/Inverse Document Frequency) matrix is calculated from the term/document frequency matrix. Then we apply singular value decomposition (SVD) to the TF/IDF matrix and create the matrices used to calculate document distances, as well as term/document matrices, which are saved to CSV files as the standard outputs returned to users. Several steps, such as the SVD calculation and the distance matrix computations, are computationally expensive and require high-performance computing resources. The MATLAB-based workflow is therefore located and executed on a cluster of IBM servers running Red Hat Linux at Indiana University, and the process is invoked and monitored by the HubZero submit module. The technical details of the LSA algorithms are thus hidden from end users: digital humanities researchers can use our LSA algorithms in their research without having to worry about the supercomputing aspect. To use this service, they need only preprocess their corpus into the simple format that the tool accepts and post-process the outputs of the algorithms, such as the document/document distances. We expect such a tool to broaden the use of LSA in digital humanities research, and to serve as a model for expanding the application of computationally expensive algorithms in the digital humanities.
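The workflow described above can be sketched end-to-end in a few lines. This is a Python/NumPy illustration rather than the project’s MATLAB code, and the three-document corpus, the number of retained dimensions, and the weighting details are all illustrative assumptions.

```python
import numpy as np

# Toy corpus standing in for the preprocessed text streams.
docs = [
    "mercury sublimate mercury antimony",
    "antimony regulus antimony",
    "mercury regulus sublimate",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# 1. Term/document frequency matrix (terms x documents).
tf = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for t in doc:
        tf[vocab.index(t), j] += 1

# 2. TF-IDF weighting (one common variant; the exact formula may differ).
df = np.count_nonzero(tf, axis=1)        # document frequency per term
idf = np.log(len(docs) / df)
tfidf = tf * idf[:, None]

# 3. Singular value decomposition, truncated to k latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the reduced space

# 4. Cosine distances between documents, the kind of matrix that
#    would be written out as CSV for the end-user website.
unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
dists = 1.0 - unit @ unit.T
print(np.round(dists, 3))
```

The CSV outputs returned to users would simply serialize matrices like `dists`, leaving all presentation and further analysis to each project’s own environment.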