Guest post by Minyue Dai, Carrie Yang, Reeve Ingle, and Meaghan J. Brown.
Hundreds of years ago, scholars might spend hours in a library searching through thousands of pages to find a useful paragraph. Things get much easier when we can work with digitized text. Optical Character Recognition (OCR) systems can automatically recognize text on images (such as printed books, handwritten letters, and photos) and convert it to a readable and editable digital format. Such technology supports the creation of digital library materials and makes available an incredible amount of text across the Internet. With OCR’d texts, people can easily search for whatever they want.
OCR technologies were first developed to read printed text, so early modern handwriting poses new challenges. The Folger Shakespeare Library’s Early Modern Manuscripts Online (EMMO) project provides access to some of the Folger’s sixteenth- and seventeenth-century English manuscripts through images and highly accurate transcriptions, along with related metadata. EMMO supports researchers in the humanities, allowing them to investigate early modern manuscripts and their contents more conveniently. Technology researchers also benefit from the programming challenges of this dataset. Even humans have trouble with secretary hand; it’s really hard for computers. This summer the Folger gave the GoogleOCR team a dataset of 100 early modern manuscript letters to experiment with.
The GoogleOCR team in Google AI Perception focuses on providing accurate and efficient OCR systems for Google products and external users. Cloud Vision, which provides image analysis based on Google’s pretrained models, employs an OCR system for recognizing text in an image.1 For those interested, there is an online demo. Currently this production system works well on printed text, but historical documents pose a more challenging task. Those manuscripts might be written in a historical language which the current system does not support, and authors used a variety of historical “hands” or ways of forming characters. The Elizabethan secretary hand is particularly challenging. Here is an example of a page image from EMMO and how the current production model and trained historical model perform.
The production system is given a language hint, but the OCR result has many errors and the full transcript is unreadable. Such mistakes are perhaps to be expected, since the system has not been trained on historical text. In comparison, our trained historical model performs much better and we can better understand the content. To support as wide a range of input images as possible, this summer we worked on improving the OCR system’s performance on historical manuscripts.
The lack of data is one of the biggest challenges in recognizing historical manuscripts. Google’s OCR system is based on machine learning models, which require a large amount of data for training a good model and a small set of test data for evaluating how this model performs. An ideal dataset would contain the original digitized page image, accurate transcriptions of all of the text in the image, as well as position information that specifies the location of each line of text on the page. However, historical manuscript annotation often requires domain-knowledge of the specific language and writing style, making it quite challenging to obtain high-quality annotations. EMMO is an important resource that provides highly accurate transcriptions of historical manuscripts, and we are grateful that the Folger Shakespeare Library is willing to authorize us to use data from the EMMO project.
While the need for accurate transcriptions is not surprising, the line position information is also crucial for training OCR models. Unfortunately, the EMMO data do not contain line position information. Of course the “brute-force” solution is hiring somebody to draw a line bounding box on page-level data manually, but we developed an iterative algorithm to automatically generate line-bounding boxes utilizing the OCR system. Here is a flow diagram visualizing the process.
As shown above, our OCR system outputs layout information, including paragraph, line, word, and symbol bounding boxes, along with the corresponding OCR result. Although the OCR quality may not be great, it is still possible to match the OCR result with a given transcript in the dataset. Here is an example line from the page shown above.
The matching is computed based on character error rate (CER), which denotes the difference between two sentences—specifically, it counts the number of character insertion, deletion, and substitution operations needed to transform one sentence into the other and divides this number by the length of the correct or “reference” sentence. For convenience, the result is multiplied by 100 to yield a percentage-like quantity. Some user-defined parameters control the matching requirement and the cropped image quality.
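To make the definition concrete, here is a minimal Python sketch of CER as described above: the edit distance between the OCR output and the reference, divided by the reference length and multiplied by 100. The function names `edit_distance` and `cer` are ours, chosen for illustration; this is not the GoogleOCR team's code.

```python
def edit_distance(hyp: str, ref: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref` (Levenshtein distance)."""
    prev = list(range(len(ref) + 1))  # distances from the empty prefix of hyp
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from hyp
                            curr[j - 1] + 1,      # insert r into hyp
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of `hyp` against the reference `ref`,
    expressed as a percentage-like number."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # An OCR misreading scored against the correct transcription.
    print(round(cer("sliakespeare", "shakespeare"), 2))
```

Because the divisor is the reference length, the value can exceed 100 when the OCR output is much longer than the reference, which is why the post calls it a "percentage-like" quantity.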
In our experiments on other historical datasets, even a small set of training data significantly improved the model’s performance on the corresponding test data. Thus we retrain the OCR system with all generated line-level data from the EMMO dataset and then repeat the OCR result matching process in order to extract more data with the more accurate OCR model. Our algorithm also supports two types of transcripts: those with line breaks and those without. The EMMO dataset provides a line-level transcript for each line in the page image, but other historical datasets provide only page transcriptions without line breaks. This data extraction pipeline runs on a distributed system that can process a large amount of data in parallel.
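For transcripts that do have line breaks, one matching pass might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `OcrLine` type, the `max_cer` threshold (standing in for the user-defined parameters mentioned earlier), and the greedy best-match strategy. The post does not describe the actual matching rules, so read this as one plausible shape of the idea rather than the pipeline itself.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels (assumed layout)


@dataclass
class OcrLine:
    """Hypothetical container for one line of OCR output."""
    text: str
    box: Box


def edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if h == r else 1)))
        prev = curr
    return prev[-1]


def match_lines(ocr_lines: List[OcrLine], transcript_lines: List[str],
                max_cer: float = 50.0) -> List[Tuple[str, Box, float]]:
    """Pair each transcript line with its closest unused OCR line.

    A pair is kept only if its CER is at most `max_cer`; kept pairs give
    (transcript text, bounding box, CER), i.e. line-level training data.
    """
    matches = []
    used = set()
    for ref in transcript_lines:
        if not ref:
            continue  # CER is undefined for an empty reference
        best = None
        for idx, line in enumerate(ocr_lines):
            if idx in used:
                continue
            score = 100.0 * edit_distance(line.text, ref) / len(ref)
            if best is None or score < best[0]:
                best = (score, idx)
        if best is not None and best[0] <= max_cer:
            used.add(best[1])
            matches.append((ref, ocr_lines[best[1]].box, best[0]))
    return matches
```

In the iterative scheme described above, the matched lines would be used to retrain the model, and the same matching pass would then be rerun on the improved OCR output so that lines which previously missed the threshold have a chance to match.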
This algorithm enables us to train OCR models and test model accuracy on the EMMO dataset. The bar chart demonstrates how much line-level data is extracted by our line-matching procedure.
There are 2520 lines in the EMMO dataset, and we are able to use 75% of the data without any extra manual annotation. The retraining and recropping steps help to extract 16% more data. Adding this new data into our model helps us get 21.25% final CER, which can be interpreted as 1 error in 5 reference symbols on average.
The CER drops from 37% to 21%, and in terms of word recognition, over half of the words in the EMMO test dataset had no error. While in absolute terms the error rate is still higher than for modern printed text, it is clear that our specifically trained OCR system achieves significant improvements on historical manuscript recognition. This final historical Latin-script model benefits from our line extraction pipeline, given that 85% of the data does not provide line position information. This pipeline can also be used to improve historical OCR accuracy for other languages where line-annotated data may not be available. While this historical model is not yet publicly available, improvements in OCR quality of historical material may be accessible through the Cloud Vision API.
In the future, GoogleOCR aims to keep improving OCR quality and coverage; for historical manuscript recognition there are many possible applications. For example, historical data transcription and annotation requires domain knowledge and relies on a limited number of volunteers. A high-quality OCR system can preprocess manuscripts and generate reference OCR results for transcribers. If the character accuracy of the OCR system is high enough, this can be a considerable time-saving tool for transcribers, who only need to validate and correct occasional OCR errors.
Another more ambitious application is searchable historical documents. Google already supports searching text in printed documents. For example, we can perform a web search for a short phrase like “filii et heredis,” and find many instances of the OCR’d phrase in printed books, localized within the page image. In principle we should be able to apply the same technique to historical documents. Imagine that an online library provides historical document images and their OCR’d text. Since our OCR system also returns the positions of words, users can search “shakespeare” and the library will show all documents containing the keyword “shakespeare” and highlight the word on the page image! More importantly, fuzzy search can include all similar words. If the OCR system recognizes the word as “shakesper” or “shakspeare” (both legitimate spellings of the name in the early modern period), or labels it as “sliakespeare” by mistake, ideally the search engine would still be able to find and display this word, robust to our 20% character error rate.
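As a rough illustration of the fuzzy search idea, the Python sketch below compares a query against every OCR’d word on a page and keeps the words whose character error rate relative to the query stays under a threshold, together with their bounding boxes for highlighting. The function `fuzzy_find`, the `(word, box)` input format, and the 20% default threshold are assumptions for this example; a real search engine would use an index rather than scanning every word.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels (assumed layout)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if ca == cb else 1)))
        prev = curr
    return prev[-1]


def fuzzy_find(query: str, ocr_words: List[Tuple[str, Box]],
               max_cer: float = 20.0) -> List[Tuple[str, Box]]:
    """Return the OCR'd words (with boxes) within `max_cer` of the query.

    Matching is case-insensitive; the query plays the role of the reference.
    """
    if not query:
        raise ValueError("query must not be empty")
    q = query.lower()
    hits = []
    for word, box in ocr_words:
        score = 100.0 * edit_distance(word.lower(), q) / len(q)
        if score <= max_cer:
            hits.append((word, box))
    return hits


if __name__ == "__main__":
    page = [("shakesper", (10, 10, 90, 30)),
            ("shakspeare", (10, 40, 95, 60)),
            ("sliakespeare", (10, 70, 110, 90)),
            ("spear", (10, 100, 50, 120))]
    for word, box in fuzzy_find("shakespeare", page):
        print(word, box)
```

With the threshold set near the model's character error rate, the variant spellings and the OCR misreading from the paragraph above are all found, while an unrelated short word such as "spear" is not.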
While some additional technology probably needs to be developed to fully enable such a scenario, the value of such capability in assisting scholars is compelling, enabling them to quickly find manuscripts and analyze historic handwriting. There are many other possibilities, and hopefully the collaboration between tech and humanists will continue to be fruitful.
Minyue Dai and Carrie Yang were interns on the GoogleOCR team in Summer 2018. Reeve Ingle is a software engineer on the GoogleOCR team.