Posted by Laryn Brown on July 10, 2014 in Big Data, Uncategorized

When interpreting historical documents with the aim of researching your ancestors, you are often presented with less-than-perfect data. Many of the records that are the backbone of family history research are bureaucratic scraps of paper filled out decades ago in some government building. We should hardly be surprised when the data entered is vague, confusing, or just plain sloppy.

Take, for example, a census form from the 1940s. One of the columns of information is the place of birth of each individual in the household. Given no other context, these entries can be extremely vague and, in some cases, completely meaningless to the modern generation.

Here are some examples:

  • Prussia
  • Bohemia
  • Indian Territory

Additionally, there are entries that seem clear on the face of it, but take on new complexity with more context:

  • Boston (England)
  • Paris (Idaho)
  • Provo (Bosnia)

And finally, we have entries that are terrifically vague and cannot be resolved without more context:

  • Springfield
  • Washington
  • Lincoln

If we add the complexity of automatic place parsing, where we try to infer meaning from the data and normalize it to a common form that we can search on, the challenges grow.

In the above example, if I feed “Springfield” into our place authority, which is a tool that normalizes different forms of place names to a single ID, I get 63 possible matches in a half dozen countries. This is not that helpful. I can’t put 63 different pins on a map, or try to match 63 different permutations to create a good DNA or record hint.
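To make the place-authority idea concrete, here is a minimal sketch of what such a lookup might look like. The data structures, names, and the three sample entries are all invented for illustration; the real authority would index far more candidates (63 for “Springfield” alone).

```python
from dataclasses import dataclass

@dataclass
class Place:
    """One normalized candidate place in the authority (illustrative)."""
    place_id: int
    name: str
    jurisdiction: str  # e.g. "Illinois, USA"

# A tiny stand-in for the real authority's index.
AUTHORITY = {
    "springfield": [
        Place(101, "Springfield", "Illinois, USA"),
        Place(102, "Springfield", "Missouri, USA"),
        Place(103, "Springfield", "Massachusetts, USA"),
        # ...the real authority returns 63 candidates in half a dozen countries
    ],
}

def lookup(raw: str) -> list:
    """Return every candidate the authority knows for a raw place string."""
    return AUTHORITY.get(raw.strip().lower(), [])
```

The key point is that the lookup returns a *list* of candidates with stable IDs, not a single answer: the ambiguity is exposed, not hidden.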

I need more context to narrow down the field to the one Springfield that represents the intent of that census clerk a hundred years ago.

One rather blunt approach is to sort the list by population. Statistically, more people will be from a larger city of Springfield than from a smaller one. But this has all sorts of flaws, such as excluding rural places from ever being legitimate matches. If you happen to be from Paris, Idaho, we are never going to find your record.
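A sketch of that blunt heuristic makes the flaw obvious. The candidate data below is invented for illustration:

```python
# Candidates for the raw string "Paris": (name, jurisdiction, population).
# Figures are illustrative, not authoritative.
candidates = [
    ("Paris", "France", 2_100_000),
    ("Paris", "Texas, USA", 25_000),
    ("Paris", "Idaho, USA", 500),
]

def pick_by_population(cands):
    """The blunt heuristic: always return the most populous match."""
    return max(cands, key=lambda c: c[2])
```

Because the function always returns the biggest city, a record that genuinely belongs to Paris, Idaho can never win, no matter what other evidence exists.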

Another approach would be to implement a bunch of logical rules, where for the case of a name that matches a U.S. state we would say things like “Choose the largest jurisdiction for things that are both states and cities.” So “Tennessee” must mean the state of Tennessee, not the five cities in the U.S. that share the same name. Even if you like those results, there are always going to be exceptions that break the rule and require a second rule – such as the state of Georgia and the country of Georgia. The new rule would have to say “Choose the largest jurisdiction for things that are both states and cities, but don’t choose Georgia the country, because here it really means the state.”
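Here is a sketch of how those stacked rules might look in code, using invented candidate lists. Even in this toy version you can see the exception creeping in, and every new exception means another special case:

```python
# Illustrative candidate lists; jurisdiction types rank country > state > city.
CANDIDATES = {
    "tennessee": [("state", "Tennessee, USA"),
                  ("city", "Tennessee, Illinois, USA")],
    "georgia":   [("country", "Georgia"),
                  ("state", "Georgia, USA"),
                  ("city", "Georgia, Vermont, USA")],
}
RANK = {"country": 3, "state": 2, "city": 1}

def resolve(name):
    cands = CANDIDATES[name.lower()]
    # Rule 1: choose the largest jurisdiction.
    best = max(cands, key=lambda c: RANK[c[0]])
    # Rule 2 (the exception): a census "Georgia" almost always means
    # the U.S. state, not the country -- so override rule 1.
    if name.lower() == "georgia" and best[0] == "country":
        best = next(c for c in cands if c[0] == "state")
    return best[1]
```

`resolve("Tennessee")` gives the state, as rule 1 intends; `resolve("Georgia")` only gives the state because rule 2 patches rule 1. Each patch invites the next one.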

It is clear that a rules-based approach will not work. But since we still need to resolve ambiguity, how is it to be done?

I propose a blended strategy that takes three approaches.

  1. Get context from wherever you can to limit the number of possibilities. If the birth location for Grandpa is Springfield and the record set you are studying is the Record of Births from Illinois, then the additional context may give you enough data to conclude that Springfield = Springfield, Illinois, USA. What seems obvious to a human observer is actually pretty hard for automated systems. These systems need to learn where to find this additional context, and natural language parsers or other systems need to be fed more context from the source to facilitate a good parse.
  2. Preserve all unresolved ambiguity. If the string I am parsing is “Provo” and my authority has a Provo in Utah, South Dakota, Kentucky, and Bosnia, I should save all of these as potential normalized representations of “Provo.” It is a smaller set to match on when doing comparisons and you may get help later on to pick the correct city.
  3. Get a human to help you. We are all familiar with applications and websites that give us that friendly “Did you mean…” dialogue. This approach lets a user, who may have more context, choose the “Provo” that they believe is right. We can get into a lot of trouble by trying to guess what is best for the customer instead of presenting a choice to them. Maybe Paris, Idaho is the Paris they want, maybe not. Let them choose.
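The three steps above can be sketched together in one small function. The candidate data and the context-matching logic are invented here for illustration; a production system would use structured jurisdiction IDs rather than substring matches:

```python
# Candidates for the raw string "Provo" (illustrative data).
PROVO = [
    ("Provo", "Utah, USA"),
    ("Provo", "South Dakota, USA"),
    ("Provo", "Kentucky, USA"),
    ("Provo", "Bosnia and Herzegovina"),
]

def normalize(raw_candidates, record_context=None):
    """Return (resolved_place, remaining_candidates).

    resolved_place is None when a human should be asked instead.
    """
    cands = raw_candidates
    # Step 1: apply whatever context the record set provides.
    if record_context:
        filtered = [c for c in cands if record_context in c[1]]
        if filtered:
            cands = filtered
    # Step 2: if exactly one candidate survives, we can resolve it.
    if len(cands) == 1:
        return cands[0], []
    # Step 3: otherwise preserve ALL the ambiguity and defer to a
    # human via a "Did you mean..." style prompt.
    return None, cands

resolved, ask_user = normalize(PROVO, record_context="Utah")
# context narrows it to one place; nothing to ask the user

resolved, ask_user = normalize(PROVO)
# no context: keep all four candidates and present the choice
```

Note that when context fails to narrow the field, the function does not guess; it hands back the full candidate list so nothing is lost before the user weighs in.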

In summary, context is the key to resolving ambiguity when parsing data, especially ambiguous place names. A blended approach that makes use of all available context, preserves any remaining ambiguity, and presents those ambiguous results to the user for resolution seems like the most promising strategy for solving the problem.

About Laryn Brown

Laryn is a Sr. Product Manager at Ancestry.com and joined the company in 1998 as the first product manager, then went on to launch Ancestry.co.uk as the first international website with the Ancestry brand. Currently he is the product manager for a small Research and Development team focused on natural language extraction from OCR and web crawled source material. Prior to working in R&D, Laryn managed the Document Preservation team. This team digitizes and indexes all of Ancestry’s historical records globally. He has also worked as the head of content partnership development, based in London. Working in genealogy as a profession and a hobby, Laryn is actively involved in the genealogy community. The threads of his own genealogy include Birmingham bricklayers, Canadian homesteaders, American colonists, and Norwegian farmers.

2 Comments

Tien Le 

I need to go back to college to understand this article.

July 14, 2014 at 1:06 am
ichard770 

very important subject sent 1998, wish personal number for each person shows up with name and date location, and each system has it own. how to do this? has ancestry answer this?

August 13, 2014 at 5:03 pm

