Posted by Ancestry Team on December 13, 2013 in Big Data

The recent revelations about NSA activity by Edward Snowden show that relationships between people can be some of the most valuable information inferred from big data. Knowing who a person knows, who they have contacted, and who they are related to is apparently critical for determining threats to society.

At Ancestry.com, we are not in the business of identifying national security threats, but the relationships inside our genealogical data are just as valuable. Linking a person to their parents, grandparents, spouse, and children is the core activity our customers engage in, and any technology that makes that process easier is worth considering.

In the most recent years available (1880-1940), the United States Federal Census included a field for “relationship to head of household” that can be used to reconstruct family relationships. This allows our customers to automatically link individuals in their family tree to each other. In earlier census years (1850, 1860, 1870) and in similar time periods internationally, this field was omitted. The “household” is still identified, so we can tell who is living at the same address, but we can’t determine how they are related to each other.

The Content Services team at Ancestry.com focuses on ways to enrich our collections with more meaning. We looked at this problem and wondered whether natural language processing, combined with a lot of research on real-world cases, could enable us to infer those relationships with an acceptable degree of accuracy.

We started with some simple cases from the 1880 U.S. Federal Census, where the relationship to head of household is given. If that field were removed, how closely could our algorithm reproduce the relationship data that the census enumerator entered all those years ago? Here is a case:

[Image: census household listing with the relationship column hidden]

It would seem natural to assume that we have a married couple with two teenage children. When we reveal the relationships as originally entered on the form, this assumption is confirmed.

Name | Age | Sex | Relationship
Cathcart, John | 45 | Male | Head (inferred)
Cathcart, Ellen | 44 | Female | Wife
Cathcart, Annie | 16 | Female | Daughter
Cathcart, Penrose | 13 | Male | Son

This looks like an easy problem to solve using surname, age, line order, and gender as features to determine the inferred relationship.
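To make the idea concrete, here is a minimal sketch of how surname, age, line order, and gender could drive a rule-based guess for a simple household like the one above. The `Person` class, the `infer_relationships` function, and the 15-year and 16-year thresholds are all assumptions made up for illustration; this is not the algorithm we actually built.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    surname: str
    given: str
    age: int
    sex: str  # "Male" or "Female"


def infer_relationships(household: List[Person]) -> List[str]:
    """Guess each person's relationship to the head of household.

    Illustrative rules only (the thresholds are assumptions):
    - line order: the first person listed is treated as the head
    - an opposite-sex adult with the head's surname and an age within
      15 years of the head is treated as the spouse (at most one)
    - anyone else with the head's surname who is at least 15 years
      younger than the head is treated as a child
    - everyone else is left as "Unknown"
    """
    head = household[0]
    labels = ["Head"]
    spouse_found = False
    for person in household[1:]:
        same_surname = person.surname == head.surname
        age_gap = head.age - person.age
        female = person.sex == "Female"
        if (same_surname and not spouse_found and person.sex != head.sex
                and abs(age_gap) < 15 and person.age >= 16):
            labels.append("Wife" if female else "Husband")
            spouse_found = True
        elif same_surname and age_gap >= 15:
            labels.append("Daughter" if female else "Son")
        else:
            labels.append("Unknown")
    return labels


# The household from the example above. Annie's sex is taken as Female,
# in line with her recorded relationship of Daughter.
cathcarts = [
    Person("Cathcart", "John", 45, "Male"),
    Person("Cathcart", "Ellen", 44, "Female"),
    Person("Cathcart", "Annie", 16, "Female"),
    Person("Cathcart", "Penrose", 13, "Male"),
]
print(infer_relationships(cathcarts))
# → ['Head', 'Wife', 'Daughter', 'Son']
```

Rules this naive only hold up for the traditional family structure; anyone who does not share the head's surname simply falls through to "Unknown".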

Let’s look at some more challenging examples.

[Image: a larger census household listing with the relationship column hidden]

We see two older adults living with a younger pair of adults and several children. The younger adults and the children share one surname, which differs from that of the older adults. Then we see two other adults in the household, each with a different surname. Hmmm.

The original data reveals a very common household pattern, though not one that is readily apparent from the available features. We see a married daughter living with her husband and children in her parents’ home, along with two servants.

Name | Age | Sex | Relationship
Johnson, Samuel | 79 | Male | Head (inferred)
Johnson, Louisa | 68 | Female | Wife
Cobb, Asa S. | 43 | Male | Son in Law
Cobb, Louise | 37 | Female | Son’s wife
Cobb, Franklin S. | 16 | Male | Grand Son
Cobb, Olive L. | 12 | Female | Grand Daughter
Cobb, George J. | 8 | Male | Grand Son
Cobb, Edith | 3 | Female | Grand Daughter
Collyer, Mary | 23 | Female | Servant
Brokaw, Garret Q. | 19 | Male | Servant

The further from the traditional family structure we get, the more ambiguous the data becomes. Uncles living with the family. Visiting nephews with the same surname. Distinguishing between a married couple with no children and a pair of adult siblings living in the same household. All of these present some real challenges to our algorithm.
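One way to see the ambiguity is to list every relationship that the visible features leave open, rather than forcing a single answer. The sketch below does that for a person who shares the head's surname. The function name `candidate_relationships`, the label set, and the age thresholds are hypothetical choices for illustration, and the Smith household is invented.

```python
from typing import List


def candidate_relationships(head_surname: str, head_age: int, head_sex: str,
                            surname: str, age: int, sex: str) -> List[str]:
    """Return every relationship label the visible features leave open.

    The thresholds (a 16-year generation gap, 18 for adulthood) are
    illustrative assumptions, not values taken from the census.
    """
    if surname != head_surname:
        return ["Unknown"]
    female = sex == "Female"
    gap = head_age - age  # positive when the person is younger than the head
    candidates = []
    if gap >= 16:
        # A generation younger: a child, or a visiting niece or nephew.
        candidates += ["Daughter" if female else "Son",
                       "Niece" if female else "Nephew"]
    elif gap <= -16:
        # A generation older: a parent, or an aunt or uncle.
        candidates += ["Mother" if female else "Father",
                       "Aunt" if female else "Uncle"]
    elif age >= 18:
        # Same generation: a sibling, or a spouse if of the opposite sex.
        candidates.append("Sister" if female else "Brother")
        if sex != head_sex:
            candidates.append("Wife" if female else "Husband")
    return candidates or ["Unknown"]


# Invented household: a 40-year-old male head named Smith.
print(candidate_relationships("Smith", 40, "Male", "Smith", 38, "Female"))
# → ['Sister', 'Wife']
print(candidate_relationships("Smith", 40, "Male", "Smith", 12, "Male"))
# → ['Son', 'Nephew']
```

Surname, age, and gender alone cannot break these ties, which is why additional evidence is needed before a single relationship can be asserted.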

In my next blog post, I’ll give some specifics on how we dealt with the issue of ambiguous interpretations and some precision/recall measurement challenges we had in determining if the algorithm was “good enough” to infer these relationships and release the upgraded data to our customers.
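As a rough picture of what such a measurement involves, the sketch below hides the recorded relationships, compares them with inferred ones, and reports precision and recall per label. The `precision_recall` helper and the sample labels are invented for illustration; they are not our evaluation code or our results.

```python
from typing import Dict, List, Tuple


def precision_recall(truth: List[str],
                     predicted: List[str]) -> Dict[str, Tuple[float, float]]:
    """Per-label (precision, recall) for two equal-length label lists."""
    pairs = list(zip(truth, predicted))
    scores = {}
    for label in sorted(set(truth) | set(predicted)):
        tp = sum(t == label and p == label for t, p in pairs)  # correct hits
        fp = sum(t != label and p == label for t, p in pairs)  # wrong guesses
        fn = sum(t == label and p != label for t, p in pairs)  # missed cases
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


# Invented example: recorded labels versus a hypothetical algorithm's output.
recorded = ["Wife", "Son", "Daughter", "Servant"]
inferred = ["Wife", "Son", "Son", "Unknown"]
scores = precision_recall(recorded, inferred)
print(scores["Son"])   # → (0.5, 1.0)
print(scores["Wife"])  # → (1.0, 1.0)
```

In this made-up case every true son is found (recall 1.0), but only half of the people labeled "Son" really are sons (precision 0.5), which is the kind of trade-off that has to be weighed before releasing inferred data.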
