Technical Challenges

Challenges

The hand-crafted, specialised nature of many online medieval resources presented us with a number of challenges when it came to developing a clustering methodology for Manuscripts Online:

  • How do we pull together such a diverse range of resources when some of them are freely available, some are only available through subscription and some are poorly maintained?
  • How do we enable users to search consistently across a body of data when non-Latin characters have been represented in different ways, spelling is not standardised and different languages are used?
  • How do we encourage a culture of collaboration and sharing within the manuscript studies research community?

Gathering the Content

A key aspect of this project was our relationship with the content providers, which included individual academics, large organisations such as The British Library and The National Archives, and publishers such as ProQuest and Gale Cengage Learning. The project undoubtedly benefited from relationships and trust which had been built during the Connected Histories project.

We had a dedicated Project Manager who was responsible for managing these relationships: communicating our aims, negotiating the Material Transfer Agreements (content licences) and arranging access to the data. We also organised our data processing workflow into three ‘data bundles’, each bundle reflecting our estimate of how quick or protracted negotiations with the content providers would be. For example, all data owned by the project's partner institutions were scheduled as Bundle #1, consent for these being the easiest to acquire.

Perhaps the most complicated negotiations with our content providers concerned the EEBO-TCP (Early English Books Online – Text Creation Partnership), a body of full-text transcriptions that can be accessed in three ways: via ProQuest, via JISC Historic Books and via EEBO-TCP itself. EEBO-TCP is a consortium of UK and North American HEIs, all with a stake in the data. Here we discovered that the licensing of digital data can be complex where no single organisation owns the entire dataset and multiple gateways to the same (or slightly different) content are available to end users. Our solution in the case of EEBO was to take time to understand the complex licensing arrangements already in place between the stakeholders, to clarify exactly what we were asking permission to do and from whom, and then to provide two URLs as points of access to the same content (all three points of access will become available eventually).

Searching Consistently

Providing users with a consistent search experience was the greatest challenge for this project. Materials for this period use un-standardised spellings (e.g. church, chirch, chirche, cherche, churche, kirk and kirke), non-Latin characters (e.g. ð, þ and 3) and a range of languages (e.g. English, Latin and Anglo-Norman French). These problems meant that it would be difficult for users to acquire a useful body of results when undertaking a search: a search for the word 'therefore' would ignore possible variations such as 'therfore', 'therfor', 'therefoure', 'þerefore', 'þerfore' and 'þerfor'. This was complicated even further by the different approaches which content creators had taken in representing non-Latin characters. For example, the 'yogh' character (3) can be represented using a UTF-8 glyph, the visually similar numeral 3, the HTML entity &yogh;, the decimal entity &#541;, editorial notation such as [yogh] or character substitution (when the non-Latin character is replaced by its Latin equivalent, which in the case of yogh might be transcribed as gh, z, y or th depending on its context). We addressed the problem by implementing a data processing workflow and a search architecture which together sought to remove these anomalies and achieve consistency.

For the data processing workflow we undertook an audit of each dataset to establish the presence of non-Latin characters and their different types of representation. We then resolved to apply consistency to the three commonest non-Latin characters - yogh, thorn and eth - by substituting each representation with a single Unicode character within the actual data prior to any further processing. We made no attempt to reverse character substitution (where a transcriber has already replaced a non-Latin character with its Latin equivalent), because it is impossible to know that this has taken place without consulting the original document.
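To illustrate, a substitution pass of this kind might look like the following Python sketch. The representation lists and the normalise function are illustrative assumptions on our part rather than the project's actual code; the numeral 3 standing for yogh is left unhandled here, since distinguishing it from a genuine digit requires contextual rules.

```python
import re

# Illustrative mappings from observed representations to a single Unicode
# character per non-Latin letter. The exact lists compiled during the
# project's audit are not reproduced here; these are assumptions based on
# the examples given above.
CANONICAL_FORMS = {
    "\u021D": [r"&yogh;", r"&#541;", r"\[yogh\]"],    # yogh (ȝ)
    "\u00FE": [r"&thorn;", r"&#254;", r"\[thorn\]"],  # thorn (þ)
    "\u00F0": [r"&eth;", r"&#240;", r"\[eth\]"],      # eth (ð)
}

def normalise(text: str) -> str:
    """Replace each known representation with its canonical Unicode character.
    The numeral 3 standing for yogh is not handled: it cannot be told apart
    from a real digit without contextual rules."""
    for canonical, patterns in CANONICAL_FORMS.items():
        for pattern in patterns:
            text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text

print(normalise("&thorn;ou&#541; he seide"))  # -> "þouȝ he seide"
```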

In order to address un-standardised spellings, we have provided users with the facility to include variant spellings in their search. When this option is checked, the search keyword is parsed in order to generate variant forms. There are two types of variant:

  • Dictionary variants – known spelling variations drawn, for example, from the Middle English Dictionary for keywords and from the Gough Map for places.
  • Generated variants – common character substitutions are programmatically generated. For example, a search for 'þou3' contains both the thorn (þ) and the yogh (3) characters. Thorn can be substituted with 'th' as well as with the character ð (eth), whilst yogh can be substituted with gh, z, y and th. We therefore end up with a programmatically generated list of the following variants: þou3, þough, þouz, þouy, þouth, thou3, though, thouz, thouy, thouth, ðou3, ðough, ðouz, ðouy, ðouth etc. Further, character substitution is performed on common Middle English spelling forms, such as '-oun' as an alternative to '-on' and 'i' instead of 'j' (the letter 'j' was not used in medieval times). This process of character substitution is performed on each dictionary variant, so the final search query can become quite long.

Some of the programmatically generated terms in the search query will be nonsensical, but each variant is presented as a link which the user can click to re-run the search using that specific form, should they identify a useful variant.
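A minimal sketch of such variant generation, assuming keywords have already been normalised so that yogh appears as the Unicode character ȝ; the substitution table below reproduces only the substitutions named above and is not the project's full table:

```python
from itertools import product

# Substitution options per character, limited to the examples in the text;
# each character maps to itself plus its documented substitutes.
SUBSTITUTIONS = {
    "þ": ["þ", "th", "ð"],             # thorn -> 'th' or eth
    "ȝ": ["ȝ", "gh", "z", "y", "th"],  # yogh -> 'gh', 'z', 'y' or 'th'
}

def generate_variants(keyword: str) -> list[str]:
    """Expand a keyword into every combination of character substitutions."""
    options = [SUBSTITUTIONS.get(ch, [ch]) for ch in keyword]
    return ["".join(combo) for combo in product(*options)]

print(generate_variants("þouȝ"))
# ['þouȝ', 'þough', 'þouz', 'þouy', 'þouth', 'thouȝ', 'though', ... 15 variants]
```

Spelling-form substitutions such as '-on'/'-oun' and 'i'/'j' could be layered on in the same way, and the whole expansion repeated for each dictionary variant, which is why the final query grows so quickly.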

Adding Structure to Search

We developed Natural Language Processing algorithms, specifically a technique called automated entity recognition, in order to help us identify and automatically tag different categories of data. We sought to capture different languages, person names, place names, dates and document references. The techniques involved gazetteer lookups and grammar-based rules.

The development work for identifying person names, place names and dates had been largely established during Connected Histories, although un-standardised spelling, different conventions for recording names and vaguer dates meant that the NLP identified entities less successfully than it had in Connected Histories. However, the nature of the datasets often worked to our advantage – entities were often already tagged in the small, hand-crafted datasets we used, albeit in a variety of forms. Place name tagging drew heavily on place name gazetteers, including the Taxatio dataset and the Gough Map. The Gough Map was an unplanned addition to the Manuscripts Online resources, identified during our research into viable online sources of historical place name data.
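By way of illustration only, a gazetteer lookup of this kind might be sketched as follows; the gazetteer entries and the tag format are invented for the example, whereas the project's real gazetteers (the Taxatio dataset, the Gough Map) are of course far larger:

```python
import re

# A tiny illustrative gazetteer mapping attested spellings to a modern
# identifier; the project drew on far larger sources such as the Taxatio
# dataset and the Gough Map.
GAZETTEER = {
    "ebor": "York",
    "york": "York",
    "kirkstall": "Kirkstall",
}

def tag_places(text: str) -> str:
    """Wrap recognised place names in a simple XML-style tag (the tag
    format here is invented for illustration)."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        identifier = GAZETTEER.get(word.lower())
        if identifier:
            return f'<place key="{identifier}">{word}</place>'
        return word
    return re.sub(r"[A-Za-zþðȝ]+", replace, text)

print(tag_places("the abbey at Kirkstall near York"))
# -> the abbey at <place key="Kirkstall">Kirkstall</place> near
#    <place key="York">York</place>
```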

A range of languages is used within medieval manuscript sources, even those which originate from Britain: English, Latin, Anglo-Norman French and Greek. We used a statistical and dictionary-based approach to attempt to identify Latin and French phrases, enabling the user to search, for example, for 'benefice' but only within a Latin context. However, the large number of words common to more than one of these languages meant that it was often difficult to identify the languages and, in particular, the boundaries between them. This was exacerbated because Latin and French often appear in short phrases of only three or four words.

We decided that we would not tag Greek because it is relatively uncommon in the sources and its non-Latin alphabet would require specific functionality to be built into the search interface (e.g. a Greek keyboard).

We also decided that we would not attempt to distinguish between Old English, Middle English and Modern English (Modern English is particularly prevalent in resources that comprise manuscript descriptions). The boundaries between these phases in the development of English are not clear-cut, and in some cases it would be impossible to say that a given word was Middle English rather than Modern English. Given that the most prevalent language within our datasets was English (whether Old, Middle or Modern), we chose to treat anything not identified as Latin or Anglo-Norman as English by default, so we only needed to analyse the datasets for two languages. Our Latin algorithms were supported by vocabulary and grammar rules derived from the Perseus Digital Library, whilst our Anglo-Norman algorithms were supported by vocabulary derived from the Anglo-Norman Online Hub.
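The sketch below illustrates one way a dictionary-based classifier with an English default might work. The word lists, threshold and scoring are illustrative assumptions; the project's actual vocabularies were derived from the Perseus Digital Library and the Anglo-Norman Online Hub, as noted above.

```python
# Tiny illustrative word lists; the threshold and scoring here are
# assumptions, not the project's published method.
LATIN_WORDS = {"in", "nomine", "patris", "et", "filii", "ecclesia", "beneficium"}
ANGLO_NORMAN_WORDS = {"le", "la", "de", "et", "seigneur", "eglise"}

def identify_language(phrase: str, threshold: float = 0.6) -> str:
    """Label a phrase 'latin' or 'anglo-norman' if enough of its words match
    the corresponding vocabulary; otherwise default to 'english'."""
    words = phrase.lower().split()
    if not words:
        return "english"
    scores = {
        "latin": sum(w in LATIN_WORDS for w in words) / len(words),
        "anglo-norman": sum(w in ANGLO_NORMAN_WORDS for w in words) / len(words),
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "english"

print(identify_language("in nomine patris et filii"))  # -> latin
print(identify_language("the kyng rode to london"))    # -> english
```

Note that 'et' appears in both lists: exactly the kind of shared vocabulary that makes short Latin and French phrases hard to separate.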

The extent to which our NLP has been successful is open to question. However, we anticipate that its accuracy will continue to improve as new resources are added, providing new names that can be incorporated into the NLP. In the meantime, we recommend that users fall back on basic keyword searching if searching by language, person, place, date or reference does not appear to yield the results they expect.

The Way of API

Manuscripts Online has a Web API at the heart of its architecture, communicating between the user interface and the search engine. Many projects cite the development of an API as a means of facilitating greater data re-use by third parties. The Manuscripts Online API is integral to the system architecture, however, and although we too have documented it for others to use, it is the benefits of an API-based approach to in-house development which have already proved most valuable. By using an API for communication between the interface and the search engine - essentially bridging two separate processes - we have found it much easier for different personnel to work on different components and quicker to resolve bugs and other issues. Further, we hope that the API will assist with the long-term sustainability of the site, because the user interface and the search engine can each be modified with minimal impact on the other.

You can view documentation explaining how to use the Manuscripts Online API here: https://www.manuscriptsonline.org/api
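As a purely hypothetical example of how a client might call such an API (the endpoint path, parameter names and response shape below are invented for illustration; the documentation at the URL above defines the actual interface):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and parameters, invented for illustration; see the
# API documentation linked above for the real routes and options.
BASE_URL = "https://www.manuscriptsonline.org/api/search"

def search(keyword: str, variants: bool = True) -> dict:
    """Send a keyword search to the (hypothetical) endpoint and return the
    decoded JSON response."""
    query = urlencode({"keyword": keyword, "variants": int(variants)})
    with urlopen(f"{BASE_URL}?{query}") as response:
        return json.load(response)

results = search("þerfore")
print(results)
```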

Changing the Culture

During the Manuscripts Online project we attempted to take tentative steps towards changing the culture of manuscript studies. Despite this research community being an early adopter of digital techniques - converging image digitisation and text analysis tools to reinvent the 'critical edition' - manuscript studies appears to remain a conservative discipline on matters concerning user-generated content. For example, we originally explored an idea whereby users would be able to assign vocabulary to dialectal regions, using crowdsourcing techniques to judge the results, and thereby develop a community-generated linguistic atlas of Middle English. However, medievalists on the team felt that users of the website would not have the knowledge or expertise to make these judgements. Given that the majority of our users are anticipated to be medievalists at postgraduate and lecturer levels, some of us were surprised by these views. Similarly, there were concerns about introducing a feature whereby users would be able to explore the search paths of other users, for fear of making an individual's research agenda public. This contradicted the HRI's experience of other projects, in which researchers from other disciplines have actively requested this type of feature. The lesson the project team takes away from this is perhaps obvious: not all humanities disciplines are the same. When developing research infrastructure for a specific discipline, one needs to appreciate fully the discipline's research methods and intellectual values as well as its knowledge domain. These are, after all, what make the digital humanities more than simply a branch of information science.

Our solution was twofold: we introduced a facility for creating comments and storing search pathways, which can be made public or private; and we developed a mapping feature whereby users can plot their comments on Google Maps if the comments have a geographical significance. Neither the commenting nor the mapping feature carries instructions dictating what constitutes an appropriate contribution, beyond the requirement that contributions must not be offensive. In other words, rather than trying to predict what users might wish to do with the site, we have simply provided them with some tools for generating content and will now leave them to it. The mapping feature was a last-minute addition to the design which delayed the launch of the site by one month. We hope that these features will provide a first step towards changing this culture, and that the global community of medievalists will respond positively to the opportunity we have created.

Cite this page:

"About the Project" Manuscripts Online (www.manuscriptsonline.org, version 1.0, 19 March 2024), https://www.manuscriptsonline.org/technical