Edward Snowden's recent revelations about NSA activity show that the relationships between people can be some of the most valuable information inferred from big data. Knowing who a person knows, who they have contacted, and who they are related to is apparently critical for determining threats to society.
At Ancestry.com, we are not in the business of identifying national security threats, but the relationships inside our genealogical data are just as valuable. Linking a person to their parents, grandparents, spouse, and children is the core activity our customers are engaged in, and we consider any technology that can make that process easier.
In the most recent census years available (1880-1940), the United States Federal Census has included a field for “relationship to head of household” that can be used to reconstruct family relationships. This allows our customers to automatically link individuals in their family tree to each other. In earlier census years (1870, 1860, 1850) and in similar time periods internationally, this field was omitted. The “household” is still identified, so we can tell who is living at the same address, but we can’t determine how they are related to each other.
The Content Services team at Ancestry.com focuses on ways to enrich our collections with more meaning. We took a look at this problem and wondered whether natural language processing, combined with a lot of research on real-world cases, could enable us to infer those relationships with an acceptable degree of accuracy.
At first we looked at some simple cases from the 1880 U.S. Federal Census where the relationship to head of household is given. If that field were removed, how closely could our algorithm reproduce the relationship data that was entered by the census enumerator all those years ago? Here is a case:
It would seem natural to assume that we have a married couple with two teenage children. When we reveal the relationships as originally entered on the form, this assumption is confirmed.
Cathcart, John | 45 | Male | Head (inferred)
Cathcart, Ellen | 44 | Female | Wife
Cathcart, Annie | 16 | Female | Daughter
Cathcart, Penrose | 13 | Male | Son
This looks like an easy problem to solve using surname, age, line order, and gender as features to determine the inferred relationship.
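As a rough illustration of how those four features might combine, here is a minimal Python sketch. It is an illustration only, not a production algorithm: the `Person` class, the `infer_relationships` function, and the 15-year age thresholds are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Person:
    surname: str
    given: str
    age: int
    sex: str  # "M" or "F"


def infer_relationships(household):
    """Guess each person's relationship to the head of household.

    Assumptions (all hypothetical): the first line is the head, a spouse
    shares the head's surname and is within 15 years of the head's age,
    and a child shares the surname and is at least 15 years younger.
    """
    head = household[0]
    labels = ["Head"]
    spouse_found = False
    for person in household[1:]:
        same_surname = person.surname == head.surname
        age_gap = head.age - person.age
        if (same_surname and not spouse_found
                and person.sex != head.sex and abs(age_gap) <= 15):
            labels.append("Wife" if person.sex == "F" else "Husband")
            spouse_found = True
        elif same_surname and age_gap >= 15:
            labels.append("Daughter" if person.sex == "F" else "Son")
        else:
            # Anything the simple rules cannot explain is left unlabeled.
            labels.append("Unknown")
    return labels


# The Cathcart household in line order. Annie is entered as female,
# consistent with her listed relationship of Daughter.
cathcarts = [
    Person("Cathcart", "John", 45, "M"),
    Person("Cathcart", "Ellen", 44, "F"),
    Person("Cathcart", "Annie", 16, "F"),
    Person("Cathcart", "Penrose", 13, "M"),
]
print(infer_relationships(cathcarts))
# ['Head', 'Wife', 'Daughter', 'Son']
```

Rules this simple only cover the nuclear-family case. Anyone with a different surname, or with an age that fits neither rule, falls through to "Unknown".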
Let’s look at some more challenging examples.
We see two older adults living with a younger pair of adults and young children, all with the same surname, but different from the older adults. Then we see two other adults in the household with different surnames. Hmmm.
The original data reveals a very common household pattern, but one that may not be readily apparent from the available features. We see a married daughter living with her family in her parents’ home, along with two servants.
Johnson, Samuel | 79 | Male | Head (inferred)
Johnson, Louisa | 68 | Female | Wife
Cobb, Asa S. | 43 | Male | Son in Law
Cobb, Louise | 37 | Female | Daughter
Cobb, Franklin S. | 16 | Male | Grand Son
Cobb, Olive L. | 12 | Female | Grand Daughter
Cobb, George J. | 8 | Male | Grand Son
Cobb, Edith | 3 | Female | Grand Daughter
Collyer, Mary | 23 | Female | Servant
Brokaw, Garret Q. | 19 | Male | Servant
The further from the traditional family structure we get, the more ambiguous the data becomes. Uncles living with the family. Visiting nephews with the same surname. Distinguishing between a married couple with no children and a pair of adult siblings living in the same household. All of these present some real challenges to our algorithm.
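To make the ambiguity concrete, here is a small Python sketch that lists every relationship a set of simple rules cannot rule out, instead of committing to one answer. The function name `candidate_relationships`, the labels, and the 15-year thresholds are assumptions made for illustration, not a description of our algorithm.

```python
def candidate_relationships(head_age, head_sex, head_surname,
                            age, sex, surname):
    """Return every relationship to the head that the features allow.

    Hypothetical rules: people of a similar age (within 15 years) could
    be a spouse or a sibling; people at least 15 years younger could be
    a child or a niece/nephew. A different surname tells us very little.
    """
    candidates = []
    gap = head_age - age
    if surname != head_surname:
        return ["Unknown"]
    if abs(gap) <= 15:
        if sex != head_sex:
            candidates.append("Spouse")
        candidates.append("Sibling")
    if gap >= 15:
        candidates.append("Child")
        candidates.append("Niece/Nephew")
    return candidates or ["Unknown"]


# Two adults of similar age with the same surname: couple or siblings?
print(candidate_relationships(40, "M", "Smith", 38, "F", "Smith"))
# ['Spouse', 'Sibling']

# A young person with the head's surname: child or visiting nephew?
print(candidate_relationships(40, "M", "Smith", 12, "M", "Smith"))
# ['Child', 'Niece/Nephew']
```

Whenever the list has more than one entry, surname, age, and gender alone cannot settle the question, which is exactly the situation described above.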
In my next blog post, I’ll give some specifics on how we dealt with the issue of ambiguous interpretations and some precision/recall measurement challenges we had in determining if the algorithm was “good enough” to infer these relationships and release the upgraded data to our customers.