In my previous post, I outlined some of the problems and strategies we use at Ancestry.com to determine if two people who appear in the same household are related.
As promised, I want to focus this time on how to resolve ambiguous results.
In my early days of doing family history research, I assumed that finding a William Anderson married to an Isabella in 1901 in a small town in Scotland was enough evidence to establish a link to my ancestor and to serve as the basis for further research. Unfortunately, what I learned after many months of further research was that “Anderson” is a very common Scottish surname, and “William” and “Isabella” are very common given names in that region. A traditional naming pattern ends up creating cousins who are about the same age and who have the same name. Even though I was dealing with a small town with a tiny population, it turns out that there were four men named William Anderson who married a woman named Isabella in 1901 in that town.
The question then becomes how to tell whether the features being examined are conclusive or ambiguous. This question matters not only for family historians, who can use intuition, experience, and convention, but more importantly for natural language processing and automated systems.
I understand that there are statistical models and other tools to narrow down the likelihood that the features being compared are conclusive, but I want to take a detour from that line of thinking and instead model what a genealogist does to determine whether the evidence is strong enough to draw a conclusion or whether the conclusion should remain ambiguous.
Ambiguity comes from insufficient data. Is a birthplace that has been listed as “Springfield” in Illinois or Massachusetts? In her article, “When No Record Proves a Point,” professional genealogist Elizabeth Shown Mills argues that in the absence of conclusive data we need to build our case with reliable alternative data. In the case of “Springfield” our team uses what we call a “reference place,” other place data that exists in the metadata about the record.
For example, suppose we are parsing an obituary and the text states that the deceased was from Springfield but gives no other data. Pulling all place details from the metadata on the record, such as the title of the newspaper, when it was published, and, importantly, where it was published, can generate a new feature, the reference location, that can assist in disambiguation.
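The reference-place idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the `Place` type, the tiny `GAZETTEER` list, and the `resolve_with_reference` function are all hypothetical, and the sketch assumes the publication place has already been pulled from the record metadata as a state name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Place:
    name: str
    state: str


# Hypothetical gazetteer; a real one would hold many more entries.
GAZETTEER = [
    Place("Springfield", "Illinois"),
    Place("Springfield", "Massachusetts"),
    Place("Springfield", "Missouri"),
]


def candidates(name: str) -> List[Place]:
    """Return every known place whose name matches, ignoring case."""
    wanted = name.strip().lower()
    return [p for p in GAZETTEER if p.name.lower() == wanted]


def resolve_with_reference(name: str, reference_state: Optional[str]) -> List[Place]:
    """Narrow the candidates using the reference place from the metadata.

    If there is no reference place, or it matches none of the candidates,
    every candidate is returned so the ambiguity is preserved.
    """
    found = candidates(name)
    if reference_state is None:
        return found
    narrowed = [p for p in found if p.state.lower() == reference_state.lower()]
    return narrowed or found


# Obituary text says only "Springfield"; the newspaper was published in Illinois.
matches = resolve_with_reference("Springfield", "Illinois")
```

A single match here is a lead rather than proof, since a newspaper can carry obituaries for people from elsewhere. That is why the function hands back the full candidate list whenever the reference place does not help.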
Another effective strategy is to eliminate all other possibilities. Sir Arthur Conan Doyle had Sherlock Holmes say it this way: “… that when you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
When faced with several possible interpretations of a natural language parse of data, business rules can be applied to eliminate impossible conclusions. These rules are often not 100% accurate, but within certain tolerances they can function very effectively at reducing false positives. For example, checking a name authority for occurrences of the term “Restaurant” finds a very low likelihood that we are looking at a given name. Further queries show that the term occurs frequently in business names. From this we draw the conclusion that we have processed a business listing in a directory and should not include it as a personal name.
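A rule like the “Restaurant” check might look something like the following sketch. The two lookup tables, their counts, and the 5% threshold are invented for illustration; a real name authority would be a much larger service, and the tolerance would need tuning against known data.

```python
# Hypothetical occurrence counts; real authorities would be far larger.
GIVEN_NAME_COUNTS = {"william": 250_000, "isabella": 90_000, "restaurant": 3}
BUSINESS_TERM_COUNTS = {"restaurant": 48_000, "william": 1_200, "isabella": 150}


def classify_term(term: str, threshold: float = 0.05) -> str:
    """Classify a term as 'person', 'business', or 'ambiguous'.

    The share of occurrences that are given names decides the outcome.
    Anything between the two tolerance bands stays ambiguous.
    """
    key = term.strip().lower()
    as_name = GIVEN_NAME_COUNTS.get(key, 0)
    as_business = BUSINESS_TERM_COUNTS.get(key, 0)
    total = as_name + as_business
    if total == 0:
        return "ambiguous"  # never seen: no evidence either way
    name_share = as_name / total
    if name_share < threshold:
        return "business"
    if name_share > 1 - threshold:
        return "person"
    return "ambiguous"


label = classify_term("Restaurant")  # "business" with the sample counts above
```

Returning "ambiguous" for unknown or borderline terms keeps the rule from forcing a conclusion that the counts do not support.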
Other examples of such rules might include: children are not born before their parents; spouses are usually married after age 12; and, in many cultures, children are given the surname of their parents.
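Rules like these can be written as small predicates and applied together to strike out impossible candidates. The sketch below is one hypothetical way to do that; the `Person` type, the function names, and the birth years are invented for illustration and are not taken from real records.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Person:
    name: str
    birth_year: int


def born_before(parent: Person, child: Person) -> bool:
    """A parent must be born before the child."""
    return parent.birth_year < child.birth_year


def old_enough_to_marry(person: Person, marriage_year: int, min_age: int = 12) -> bool:
    """Spouses are usually at least min_age at marriage (a tolerance, not a law)."""
    return marriage_year - person.birth_year >= min_age


def eliminate_impossible(
    candidates: List[Person], rules: List[Callable[[Person], bool]]
) -> List[Person]:
    """Keep only the candidates that pass every rule."""
    return [c for c in candidates if all(rule(c) for rule in rules)]


# Four hypothetical men named William Anderson; which could have married in 1901
# and fathered a child born in 1903?
williams = [
    Person("William Anderson", 1875),
    Person("William Anderson", 1893),
    Person("William Anderson", 1878),
    Person("William Anderson", 1899),
]
child = Person("James Anderson", 1903)

remaining = eliminate_impossible(
    williams,
    [
        lambda p: old_enough_to_marry(p, 1901),
        lambda p: born_before(p, child),
    ],
)
```

With these sample birth years, two of the four candidates survive, so the rules shrink the problem without resolving it. The surname convention is left out on purpose: because it holds only in many cultures, it fits better as a weighting clue than as a rule that eliminates candidates.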
Getting these business rules correct is where bringing in an experienced researcher to consult can really help. You need someone who has proven conclusions before with limited data and has probably made some mistakes along the way, someone with intuition and experience who knows the conventions for establishing conclusions. Very often, your best source of business rules is the person who has been doing this process manually.
In summary, drawing conclusions from ambiguous data can be the greatest challenge of Big Data, the holy grail of Big Data parsing algorithms. At Ancestry.com we have found that starting with a good feature set, teasing out any additional clues from metadata or other related data, and then eliminating the impossible using business rules informed by an expert in the area can greatly reduce the occurrences of ambiguity in the data. Where ambiguity can’t be resolved, it should be allowed to persist in the data, and perhaps even in the presentation to the customer, so that a firm conclusion can be drawn later when additional context becomes available.