For several years now Ancestry has been publishing collections of records from the U.S. that have been “transcribed” using a method we call Entity Extraction.
One example is the U.S. City Directory collection. A precursor to modern telephone books, city directories listed all of the inhabitants of a city, along with their address, occupation, and often their spouse’s name. These directories were published nearly every year for larger cities, creating billions of entries over time.
These directories are too large to transcribe by hand, so the best treatment had been to use Optical Character Recognition (OCR) to turn the scanned images of the books into text that can be crudely searched. But without the ability to attach any meaning to the words, a search for my ancestor John G. Brown would return any instance of any of those words on the page, many of which would be in advertisements or other text that had nothing to do with a person.
Using a patented new system for processing these records, we have learned how to extract key information about a person, separate all words and phrases into searchable fields, and present them as results in a search query. These records have led to tens of millions of new discoveries for our members and have added interesting color to the stories of our ancestors.
Having solved that tricky problem and produced more than 2 billion new records for our members, we are moving on to a harder problem: printed books in German.
The main challenge with Entity Extraction of German content is surprisingly not the language. Our Natural Language Processing algorithms can be adapted to other languages with relatively little effort. Rather, the main challenge with German is the common use of the Gothic or Fraktur font when printing. This special script-like font is particularly difficult to recognize using the best OCR tools available today. Words and especially names can be misread as many of the characters in this special font are extremely similar.
To deliver a quality experience with these new databases, we want a high degree of precision and recall when searching them. Precision is the share of the returned results that are actually correct. Recall is the share of all the correct results that were actually returned, in other words whether we found everything or missed some. With German Fraktur, the precision numbers were pretty low.
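To make the two measures concrete, here is a minimal sketch in Python. The function name and the record IDs are made up for illustration and are not part of our production system.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one search.

    returned: IDs of the records the search gave back
    relevant: IDs of the records that truly match the query
    """
    returned, relevant = set(returned), set(relevant)
    correct = returned & relevant  # results that were both returned and right
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four records returned, three truly relevant.
returned = {"rec1", "rec2", "rec3", "rec4"}
relevant = {"rec1", "rec2", "rec5"}
precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up search, two of the four returned records are correct, so precision is one half, and two of the three relevant records were found, so recall is two thirds.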
To improve precision, we have come up with a learning QA system that allows our internal reviewers to import a list of all surnames and given names extracted via entity extraction into a QA program that checks each instance against a German name authority. Those words that are not found in the authority are sent to an interface where the reviewer can mark them as “New” (words to add to the authority), “Delete” (words to delete from the database), or “Map” (words that the OCR engine misread, but that we can correct).
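The review step can be pictured with a small sketch like the one below. It is only an illustration under simple assumptions: the authority is a plain set of names, reviewer decisions are a dictionary, and every name, function, and misreading shown is invented for the example.

```python
from collections import Counter


def find_unknown_names(extracted_names, authority):
    """Count each extracted name that is missing from the name authority."""
    counts = Counter(extracted_names)
    return {name: n for name, n in counts.items() if name not in authority}


def apply_decisions(extracted_names, authority, decisions):
    """Apply reviewer decisions to the extracted names.

    decisions maps a name to ("new",), ("delete",) or ("map", corrected_name).
    Returns the cleaned list of names and the updated authority.
    """
    updated_authority = set(authority)
    cleaned = []
    for name in extracted_names:
        decision = decisions.get(name)
        if decision is None:            # already known, or not reviewed yet
            cleaned.append(name)
        elif decision[0] == "new":      # a real name: add it to the authority
            updated_authority.add(name)
            cleaned.append(name)
        elif decision[0] == "map":      # an OCR misread: store the correction
            cleaned.append(decision[1])
        elif decision[0] == "delete":   # not a name at all: drop it
            continue
        else:
            raise ValueError(f"unknown decision for {name!r}: {decision!r}")
    return cleaned, updated_authority


# Hypothetical data: a tiny authority and a few extracted surnames.
authority = {"Müller", "Schmidt"}
extracted = ["Müller", "Schmibt", "Schmidt", "Anzeige", "Brandl", "Schmibt"]

unknown = find_unknown_names(extracted, authority)
decisions = {
    "Schmibt": ("map", "Schmidt"),  # misread surname
    "Anzeige": ("delete",),         # "advertisement", not a person
    "Brandl": ("new",),             # real surname missing from the authority
}
cleaned, updated_authority = apply_decisions(extracted, authority, decisions)
print(unknown)
print(cleaned)
```

Counting how often each unknown word occurs lets a reviewer handle the most frequent ones first, since a single decision then fixes many records at once.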
Using this new correction system iteratively, we can process a German collection, learn where the OCR engine is struggling with similar characters, fix those errors, and run it again. Each time we make a pass over the data, the quality improves. Over time we expect the precision rates to improve significantly.
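One way to picture "learning where the OCR engine is struggling" is to tally which characters were swapped in the corrections reviewers have already made. The sketch below assumes the corrections are available as (misread, corrected) pairs; the pairs and the function name are invented, and a real system would need proper alignment for words whose lengths differ.

```python
from collections import Counter


def character_confusions(mapped_pairs):
    """Count (read_as, should_be) character swaps in reviewer corrections.

    Only pairs of equal length are compared, position by position.
    Pairs whose lengths differ are skipped in this simplified sketch.
    """
    confusions = Counter()
    for misread, corrected in mapped_pairs:
        if len(misread) != len(corrected):
            continue
        for got, expected in zip(misread, corrected):
            if got != expected:
                confusions[(got, expected)] += 1
    return confusions


# Hypothetical corrections collected during one review pass.
pairs = [
    ("Fchmidt", "Schmidt"),
    ("Fchulz", "Schulz"),
    ("Vauer", "Bauer"),
    ("Schmid", "Schmidt"),  # different lengths, so skipped here
]
confusions = character_confusions(pairs)
for (got, expected), count in confusions.most_common():
    print(f"read {got!r} where {expected!r} was printed: {count} time(s)")
```

The most frequent swaps show which characters to check first on the next pass over the data.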
When these records are finished, a random list of German words in a very difficult-to-read font will have been turned into a set of records about people that looks a lot like an annual census.
Once this data is fielded or structured to be searchable, all sorts of analytics can be run over the data. The questions are endless. How many blacksmiths living in Berlin were married to someone named Sarah? What was the most common surname in Hannover in 1880? How many people were employed as butchers in Munich in 1910?
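As a rough illustration of what fielded data makes possible, the sketch below answers those three questions over a handful of invented records. The field names and every record are assumptions made for the example, not the schema or contents of the real collection.

```python
from collections import Counter

# Invented sample records; the field names are assumptions for this sketch.
records = [
    {"surname": "Müller", "occupation": "blacksmith", "spouse": "Sarah", "city": "Berlin", "year": 1880},
    {"surname": "Schmidt", "occupation": "blacksmith", "spouse": "Anna", "city": "Berlin", "year": 1880},
    {"surname": "Meyer", "occupation": "baker", "spouse": None, "city": "Hannover", "year": 1880},
    {"surname": "Meyer", "occupation": "tailor", "spouse": "Emma", "city": "Hannover", "year": 1880},
    {"surname": "Schulze", "occupation": "butcher", "spouse": None, "city": "Hannover", "year": 1880},
    {"surname": "Huber", "occupation": "butcher", "spouse": "Maria", "city": "Munich", "year": 1910},
    {"surname": "Bauer", "occupation": "butcher", "spouse": None, "city": "Munich", "year": 1910},
    {"surname": "Wagner", "occupation": "blacksmith", "spouse": "Sarah", "city": "Berlin", "year": 1910},
]

# How many blacksmiths living in Berlin were married to someone named Sarah?
berlin_blacksmiths_sarah = sum(
    1 for r in records
    if r["occupation"] == "blacksmith" and r["city"] == "Berlin" and r["spouse"] == "Sarah"
)

# What was the most common surname in Hannover in 1880?
hannover_1880 = Counter(
    r["surname"] for r in records if r["city"] == "Hannover" and r["year"] == 1880
)
top_surname, top_count = hannover_1880.most_common(1)[0]

# How many people were employed as butchers in Munich in 1910?
munich_butchers_1910 = sum(
    1 for r in records
    if r["occupation"] == "butcher" and r["city"] == "Munich" and r["year"] == 1910
)

print(berlin_blacksmiths_sarah, top_surname, munich_butchers_1910)
```

None of these questions could be answered from raw OCR text, because the text alone does not say which word is a surname, an occupation, or a spouse.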
This new collection of German data will become available on Ancestry over the next few years as we process millions of images of new content.
While you are waiting for the German databases to be published, you can search the U.S. City Directories collection on Ancestry.