At Ancestry we have a specific team dedicated to working on new technologies to extract information from historical sources of “Big Data.”
One example is the U.S. City Directory collection. A precursor to modern telephone books, city directories listed all of the inhabitants of a city, along with their addresses, occupations, and often their spouses’ names. These directories, starting in the early 1800s, were published nearly every year for larger cities, creating billions of entries over time.
In the summer of 2011 a newly formed team at Ancestry.com took on the challenge of making this large collection of data more useful to our customers. As the collection was too massive to transcribe by hand, we had been using the latest Optical Character Recognition (OCR) technology to turn digital images of directory books into text that could be crudely searched. The shortcoming of this approach was that it could not apply any meaning to the words in these historical documents. Searching for my ancestor John G. Brown would return any instance of any of those words on the page, many of which would be advertisements or other text that had nothing to do with a person.
To help solve this challenge we brought in experts in Natural Language Processing (NLP) to wrestle this billion-name gorilla, taking a page of seemingly random text and turning it into records about real people.
One of the high-level, industry-wide challenges with NLP is balancing precision and recall. Precision measures how many of the results you return are actually correct. Recall measures how many of the correct results you returned, as opposed to missing them.
For example, a precision problem would be searching for my ancestor’s name, John Brown, and receiving a result for a “Brown Shoe Polish” ad instead of a record for my ancestor. Ancestry.com members would not be pleased with that. A recall problem would be searching for “John Brown” and having “John G. Brown” omitted because it wasn’t an exact match, even though it was the correct record I was looking for.
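To make the two measures concrete, here is a minimal Python sketch that computes them for a single search. The record identifiers and the idea of a hand-labeled set of "relevant" records are assumptions made for illustration; they are not taken from our production system.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for one search.

    returned: IDs of the records the search returned
    relevant: IDs of the records that are truly correct matches
    """
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    # Precision: share of returned results that are correct.
    precision = true_positives / len(returned) if returned else 0.0
    # Recall: share of correct results that were returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search for "John Brown": one correct hit, one ad,
# and one correct record ("John G. Brown") that was missed.
relevant = {"john_brown_1880", "john_g_brown_1881"}
returned = {"john_brown_1880", "brown_shoe_polish_ad"}
precision, recall = precision_recall(returned, relevant)
print(precision, recall)
```

In this made-up case the shoe polish ad lowers precision and the missed “John G. Brown” record lowers recall.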
First we tackled precision, creating algorithms that omit pages that look like advertising or front matter. Various features of these pages, such as font size, number of lines on the page, density of words, and other page formatting clues, allow us to determine whether a page is good or bad. Then we look at individual OCR “zones,” or blocks of text, for similar features. Finally, we examine each line to determine the likelihood that we have a person and not a “Brown Shoe Polish” ad.
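As a rough illustration of the page-level step, the sketch below scores a page from a few of the formatting features mentioned above. The `PageFeatures` class, the feature names, and every threshold are hypothetical and chosen only to show the shape of the idea; the real algorithms use more features and were tuned on real pages.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    median_font_size: float  # in points (hypothetical unit)
    line_count: int          # number of text lines on the page
    words_per_line: float    # average word density per line


def looks_like_listing_page(page: PageFeatures) -> bool:
    """Guess whether a page holds resident listings.

    Assumption: listing pages use small type, carry many lines,
    and keep each line short. All thresholds are made up.
    """
    return (
        page.median_font_size <= 9.0
        and page.line_count >= 60
        and page.words_per_line <= 12.0
    )


listing = PageFeatures(median_font_size=7.5, line_count=110, words_per_line=8.0)
advert = PageFeatures(median_font_size=18.0, line_count=14, words_per_line=5.0)
print(looks_like_listing_page(listing), looks_like_listing_page(advert))
```

The same kind of check could then be repeated on each zone and each line, with features suited to that level.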
Once the data has been filtered, the NLP engine processes each line and tries to apply meaning to each word. Is “Bank” a person’s last name, part of a business name (First National Bank), part of an address, or an occupation? Contextual clues and no small amount of magic allow us to determine what each component of the line is.
These components are then reassembled to form given name, last name, spouse’s given name, spouse’s last name, occupation, employer name, street address, residence city, residence state, and residence year fields.
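The sketch below shows, in a very simplified form, how position on the line can act as a contextual clue. It assumes one hypothetical line layout, "Surname Given (Spouse), occupation, h address", and uses a single regular expression; the actual engine is not a regular expression and handles far more variation. The `parse_line` function, the field keys, and the rule that a spouse shares the resident's surname are all assumptions made for illustration, and the employer field is left out for brevity.

```python
import re
from typing import Optional

# Hypothetical layout: "Surname Given (Spouse), occupation, h address"
LINE_PATTERN = re.compile(
    r"^(?P<surname>[A-Z][a-z]+)\s+"
    r"(?P<given>[A-Z][A-Za-z. ]*?)"
    r"(?:\s+\((?P<spouse>[A-Z][a-z]+)\))?"
    r",\s*(?P<occupation>[^,]+)"
    r",\s*(?:h|r|bds)\s+(?P<address>.+)$"
)


def parse_line(line: str, city: str, state: str, year: int) -> Optional[dict]:
    """Turn one directory line into a fielded record, or None if it does not fit."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # e.g. an advertisement line
    spouse = match.group("spouse")
    return {
        "given_name": match.group("given").strip(),
        "last_name": match.group("surname"),
        "spouse_given_name": spouse,
        # Assumption: a listed spouse shares the resident's surname.
        "spouse_last_name": match.group("surname") if spouse else None,
        "occupation": match.group("occupation").strip(),
        "street_address": match.group("address").strip(),
        "residence_city": city,
        "residence_state": state,
        "residence_year": year,
    }


print(parse_line("Bank William, clerk First National Bank, h 4 Elm",
                 "Rochester", "New York", 1880))
print(parse_line("Brown Shoe Polish, best in town",
                 "Rochester", "New York", 1880))
```

Here the first “Bank” is read as a last name because it opens the line, while the second stays inside the occupation text. The shoe polish line does not fit the layout and is rejected.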
When finished, we were able to turn a bag of random words into a set of searchable records about people that looked a lot like an annual census. Best of all, we made the collection more useful to our customers.
Once the data is fielded or structured to be searchable, all sorts of analytics can be run over the data. The questions are endless. How many blacksmiths living in the Bronx were married to someone named Sarah? What was the most common surname in Flint, Michigan in 1880? How many people were employed by Eastman Kodak in Rochester, New York in 1967?
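Questions like these become simple filters and counts once every record has named fields. The sketch below runs two such queries over a handful of invented records; the sample data, field keys, and helper names are hypothetical and exist only to show the pattern.

```python
from collections import Counter

# Invented sample records, not real directory data.
records = [
    {"last_name": "Brown", "given_name": "John", "spouse_given_name": "Sarah",
     "occupation": "blacksmith", "residence_city": "Flint", "residence_year": 1880},
    {"last_name": "Brown", "given_name": "Mary", "spouse_given_name": None,
     "occupation": "teacher", "residence_city": "Flint", "residence_year": 1880},
    {"last_name": "Smith", "given_name": "Adam", "spouse_given_name": "Sarah",
     "occupation": "blacksmith", "residence_city": "Flint", "residence_year": 1880},
    {"last_name": "Smith", "given_name": "Eli", "spouse_given_name": "Sarah",
     "occupation": "blacksmith", "residence_city": "Bronx", "residence_year": 1880},
]


def most_common_surname(records, city, year):
    """Return (surname, count) for the most frequent surname, or None."""
    surnames = Counter(
        r["last_name"] for r in records
        if r["residence_city"] == city and r["residence_year"] == year
    )
    return surnames.most_common(1)[0] if surnames else None


def count_matching(records, **criteria):
    """Count records whose fields equal every given criterion."""
    return sum(
        all(r.get(field) == value for field, value in criteria.items())
        for r in records
    )


print(most_common_surname(records, "Flint", 1880))
print(count_matching(records, occupation="blacksmith",
                     spouse_given_name="Sarah", residence_city="Bronx"))
```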
This new collection of fielded data has been available in beta form on Ancestry.com for about a year now. Weighing in at 1.1 billion people, it could be the largest single collection of historical big data records about people in the world.
We are still working on improving both precision and recall on our collections. An upcoming update to the current U.S. City Directory collection will remove approximately 90 million incorrect results (think “Brown Shoe Polish”) and add 315 million new entries that were missed the first time.
Check out the U.S. City Directories collection to find your own ancestors and to see NLP extraction in action (if you’re not a member, you can access it through a 14-day free trial).