Posted by Laryn Brown on May 6, 2016 in Operations

For several years now Ancestry has been publishing collections of records from the U.S. that have been “transcribed” using a method we call Entity Extraction.

One example is the U.S. City Directory collection. A precursor to modern telephone books, city directories listed the inhabitants of a city along with their addresses, occupations, and often their spouses' names. These directories were published nearly every year for larger cities, creating billions of entries over time.

The collection is far too large to transcribe by hand, so the best treatment had been to use Optical Character Recognition (OCR) to turn the scanned images of the books into text that could be crudely searched. But without any way to attach meaning to the words, a search for my ancestor John G. Brown would return every instance of any of those words on a page, many of them from advertisements or other text that had nothing to do with a person.

Using a patented new system for processing these records, we have learned how to extract key information about a person, separate all words and phrases into searchable fields, and present them as results in a search query. These records have led to tens of millions of new discoveries for our members and have added interesting color to the stories of our ancestors.

Having solved that tricky problem and produced more than 2 billion new records for our members, we are moving on to a harder problem—printed books in German.

The main challenge with Entity Extraction of German content is, surprisingly, not the language. Our Natural Language Processing algorithms can be adapted to other languages with relatively little effort. Rather, the main challenge with German is the common use of the Gothic or Fraktur typeface in printing. This special script-like typeface is particularly difficult to recognize using the best OCR tools available today. Words, and especially names, can easily be misread because many of the characters in this typeface look extremely similar.

To provide a quality experience with these new databases, we want a high degree of precision and recall when searching them. Precision measures what fraction of the results returned are actually correct. Recall measures what fraction of all the correct results were returned, or how many were missed. With German Fraktur, the precision numbers were pretty low.
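The two measures can be computed from the overlap between what a search returns and what it should have returned. Here is a minimal sketch; the names and counts below are invented purely for illustration, not actual measurements from our collections.

```python
# Hypothetical query results; the names are made up for illustration.
returned = {"Johann Braun", "Johann Brauer", "Johanna Braun"}   # what the search gave back
relevant = {"Johann Braun", "Johanna Braun", "Johann Braune"}   # what actually matches

true_positives = returned & relevant  # correct results that were returned

precision = len(true_positives) / len(returned)  # share of returned results that are correct
recall = len(true_positives) / len(relevant)     # share of correct results that were found

print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

A search engine tuned only for recall returns everything and drowns the user in noise; one tuned only for precision returns a few safe hits and misses the rest, which is why both numbers matter.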

To improve precision, we have built a learning QA system. Our internal reviewers import the full list of surnames and given names produced by entity extraction into a QA program that checks each instance against a German name authority. Words not found in the authority are sent to an interface where a reviewer can mark them as “New” (words to add to the authority), “Delete” (words to delete from the database), or “Map” (words that the OCR engine misread, but that we can correct).
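The triage step above can be sketched in a few lines. Everything here is illustrative: the authority list, the sample names, and the `known_fixes` table are invented, and the real system's interfaces are of course more involved.

```python
# Hypothetical German name authority (illustrative only).
authority = {"Müller", "Schmidt", "Braun", "Hoffmann"}

# OCR misreadings that reviewers have already confirmed via "Map" decisions,
# e.g. Fraktur B misread as V. These example entries are invented.
known_fixes = {"Vraun": "Braun", "Schmibt": "Schmidt"}

def triage(extracted_name):
    """Classify one extracted name for reviewer action."""
    if extracted_name in authority:
        return ("OK", extracted_name)                 # matches the authority; no review needed
    if extracted_name in known_fixes:
        return ("Map", known_fixes[extracted_name])   # known misread; correct automatically
    # Everything else goes to a reviewer, who marks it "New" or "Delete".
    return ("Review", extracted_name)

print(triage("Braun"))   # ('OK', 'Braun')
print(triage("Vraun"))   # ('Map', 'Braun')
print(triage("Xqzt"))    # ('Review', 'Xqzt')
```

Each reviewer decision feeds back into the authority or the mapping table, so the share of names requiring manual review shrinks with every batch.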

By applying this correction system recursively, we can process a German collection, learn where the OCR engine is struggling with similar characters, fix those errors, and run it again. Each pass over the data improves the quality, and over time we expect the precision rates to improve significantly.
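The iterative idea can be illustrated with a toy loop: apply the character substitutions learned from reviewer “Map” decisions, keep a correction only if it produces a name the authority recognizes, and repeat. The substitution table and names below are invented examples, not our production rules.

```python
def correction_pass(names, authority, char_fixes):
    """One pass: apply learned character substitutions; keep a fix only if
    it turns the name into one the authority recognizes."""
    corrected = []
    for name in names:
        fixed = name
        for bad, good in char_fixes.items():
            fixed = fixed.replace(bad, good)
        corrected.append(fixed if fixed in authority else name)
    return corrected

authority = {"Braun", "Schmidt"}
char_fixes = {"V": "B", "b": "d"}          # confusions learned from reviewer decisions (invented)
names = ["Vraun", "Schmibt", "Xqzt"]

for _ in range(3):                          # each pass over the data can improve quality
    names = correction_pass(names, authority, char_fixes)

print(names)  # ['Braun', 'Schmidt', 'Xqzt']
```

Gating each substitution on the authority keeps the loop conservative: a fix is only accepted when it lands on a known name, so repeated passes cannot drift away from valid spellings.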

When these records are finished, a random list of German words in a very difficult-to-read font will have been turned into a set of records about people that looks a lot like an annual census.

Once this data is fielded, or structured to be searchable, all sorts of analytics can be run over it. The questions are endless. How many blacksmiths living in Berlin were married to someone named Sarah? What was the most common surname in Hannover in 1880? How many people were employed as butchers in Munich in 1910?
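Once each entry carries fields like surname, occupation, city, and year, questions like those above become simple filters and counts. A minimal sketch, using a handful of invented records rather than real extracted data:

```python
# Toy records illustrating the structured output; all values are invented.
records = [
    {"surname": "Braun",   "occupation": "butcher",    "city": "Munich", "year": 1910},
    {"surname": "Müller",  "occupation": "butcher",    "city": "Munich", "year": 1910},
    {"surname": "Schmidt", "occupation": "blacksmith", "city": "Berlin", "year": 1880},
]

# How many people were employed as butchers in Munich in 1910?
butchers = sum(
    1 for r in records
    if r["occupation"] == "butcher" and r["city"] == "Munich" and r["year"] == 1910
)
print(butchers)  # 2
```

The same filter-and-count pattern answers the other questions too; none of this is possible over raw, unfielded OCR text.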

This new collection of German data will become available on Ancestry over the next few years as we process millions of images of new content.

While you are waiting for the German databases to be published, you can search the U.S. City Directories on Ancestry.

Laryn Brown

Laryn is a Sr. Product Manager at Ancestry.com and joined the company in 1998 as the first product manager, then went on to launch Ancestry.co.uk as the first international website with the Ancestry brand. Currently he is the product manager for a small Research and Development team focused on natural language extraction from OCR and web crawled source material. Prior to working in R&D, Laryn managed the Document Preservation team. This team digitizes and indexes all of Ancestry’s historical records globally. He has also worked as the head of content partnership development, based in London. Working in genealogy as a profession and a hobby, Laryn is actively involved in the genealogy community. The threads of his own genealogy include Birmingham bricklayers, Canadian homesteaders, American colonists, and Norwegian farmers.

3 Comments

  1. Mary D. Taffet

    Hello Laryn! We met years ago at one of the GenTech conferences. I’d be interested in knowing more about the new system, mostly because I am in the NLP field, and Information Extraction is my specialty at this time. Are you using a home-grown system, or one of the systems by one of the typical vendors, with some domain customization? I realize that this is likely to be proprietary competitive information, but if you can point me to the patent, I’d love to hear more.

  2. Garth

    Is this similar to how Canada Voters Lists were OCR’d? While I am very appreciative of the records I must say that the indexing of MANY of the lists is so bad as to be unusable. I hope that as processes improve that you consider re-indexing the lists.

Comments are closed.