The DNA matching research and development life cycle

Posted by Julie Granka on August 19, 2014 in DNA, Science

Research into matching patterns of over a half-million AncestryDNA members translates into new DNA matching discoveries 

Among over 500,000 AncestryDNA customers, more than 35 million 4th cousin relationships have been identified – a number that continues to grow at an exponential rate.  While that means millions of opportunities for personal discoveries by AncestryDNA members, it also means a lot of data that the AncestryDNA science team can put back into research and development for DNA matching.

At the Institute for Genetic Genealogy Annual Conference in Washington, D.C. this past weekend, I spoke about some of the AncestryDNA science team’s latest exciting discoveries – made by carefully studying patterns of DNA matches in a 500,000-member database.


Graph showing growth in the number of 4th cousin matches between pairs of AncestryDNA customers over time


DNA matching means identifying pairs of individuals whose genetics suggest that they are related through a recent common ancestor. But DNA matching is an evolving science.  By analyzing the results from our current method for DNA matching, we have learned how we might be able to improve upon it for the future.


Life cycle of AncestryDNA matching research and development


The science team targeted our research on this DNA matching data to gain insight into two specific steps of the DNA matching procedure.

Remember that a person gets half of their DNA from each of their parents – one full copy from their mother and one from their father.  The problem is that your genetic data doesn’t tell us which parts of your DNA you inherited from which parent.  The first step of DNA matching, called phasing, determines the strings of DNA letters that a person inherited from each of their parents.  In other words, phasing distinguishes the two separate copies of a person’s genome.


Observed genetic data only reveals the pairs of letters that a person has at a particular genetic marker.  Phasing determines which strings of letters of DNA were inherited as a unit from each of their parents.


If we had DNA from everyone’s parents, phasing someone’s DNA would be easy.  But unfortunately, we don’t.  So instead, phasing someone’s DNA is often based on a “reference” dataset of people in the world who are already phased.  Typically, those reference sets are rather small (around one thousand people).
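As a toy illustration of why parental DNA makes phasing easy (a hypothetical sketch, not the AncestryDNA algorithm), here is how a child’s unordered genotype pairs can be split into maternal and paternal copies when both parents’ genotypes are available:

```python
# Minimal trio-based phasing sketch with made-up data: each genotype is
# an unordered pair of letters at one marker. With both parents'
# genotypes in hand, most heterozygous sites in the child can be
# assigned to a parental copy.

def phase_with_parents(child, mother, father):
    """Split a child's unphased genotypes into (maternal, paternal)
    haplotypes. Ambiguous sites (e.g. everyone heterozygous for the
    same two letters) are returned as None."""
    maternal, paternal = [], []
    for c, m, f in zip(child, mother, father):
        a, b = c
        # Allele a could only have come from the mother, b from the father
        if a in m and b in f and not (b in m and a in f):
            maternal.append(a); paternal.append(b)
        # The reverse assignment
        elif b in m and a in f and not (a in m and b in f):
            maternal.append(b); paternal.append(a)
        # Homozygous: both copies carry the same letter
        elif a == b:
            maternal.append(a); paternal.append(b)
        else:
            maternal.append(None); paternal.append(None)
    return maternal, paternal

child  = [("A", "G"), ("C", "C"), ("T", "A")]
mother = [("A", "A"), ("C", "T"), ("T", "T")]
father = [("G", "G"), ("C", "C"), ("A", "A")]
mat, pat = phase_with_parents(child, mother, father)
print(mat)  # ['A', 'C', 'T']
print(pat)  # ['G', 'C', 'A']
```

Without parental genotypes, reference-based phasing has to infer these assignments statistically from the haplotypes seen in the reference panel – which is why a larger, better-matched panel improves both speed and accuracy.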

Studies of customer data led us to find that we could incorporate data from hundreds of thousands of existing customers into our reference dataset.  The result?  Phasing that is more accurate, and faster.  Applying this new approach would mean a better setup for the next steps of DNA matching.

The second step in DNA matching is to look for pieces of DNA that are identical between individuals.  For genealogy research, we’re interested in DNA that’s identical because two people inherited it from a recent common ancestor.  This is called DNA that is identical by descent, or IBD.  IBD DNA is what leads to meaningful genealogical discoveries: allowing members to connect with cousins, find new ancestors, and collaborate on research.

But there are other reasons why two people’s DNA could be identical. After all, the genomes of any two humans are 99.9% identical. Pieces of DNA could be identical between two people because they are both human, because they are of the same ethnicity, or because they share some other, more ancient history.  We call these pieces of DNA only identical by state (IBS), because the DNA could be identical for a reason other than a recent common ancestor.
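The distinction can be sketched in code (illustrative data and thresholds, not Ancestry’s actual method): scan two phased copies of the genome for runs of identical markers, where long runs are candidate IBD segments and short identical stretches are expected by chance between any two humans:

```python
# Sketch of segment detection: long runs of identical markers between
# two phased haplotypes are candidate IBD segments; short identical
# stretches are common to all humans and treated as IBS.
# The min_markers cutoff here is illustrative only.

def identical_runs(hap1, hap2, min_markers=5):
    """Return (start, end) marker ranges (end exclusive) where hap1 and
    hap2 agree for at least `min_markers` consecutive markers."""
    runs, start = [], None
    for i, (x, y) in enumerate(zip(hap1, hap2)):
        if x == y:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_markers:
                runs.append((start, i))
            start = None
    if start is not None and len(hap1) - start >= min_markers:
        runs.append((start, len(hap1)))
    return runs

h1 = list("ACGTTACGTAGGC")
h2 = list("ACGTTACGTTGGC")
print(identical_runs(h1, h2, min_markers=5))  # [(0, 9)]
```

In practice the cutoff is expressed in genetic distance (centimorgans) rather than a raw marker count, but the principle is the same: the longer the identical run, the more likely it reflects a recent common ancestor.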

We sought to understand the causes of identical pieces of DNA among more than half a million AncestryDNA members.  Our in-depth study of these matches revealed that in certain places in the genome, thousands of people were estimated to have DNA identical to one another.

We found that thousands of people all having matching DNA isn’t a signal that they are all closely related to one another.  Instead, it’s likely a hallmark of a more ancient shared history among those thousands of individuals – or IBS.


Finding places in the genome where thousands of people all have identical DNA is likely a hallmark of IBS, but not IBD.


In other words, our analysis revealed that in a few cases where we thought people’s DNA was identical by descent, it was actually identical by state.  These striking matching patterns only became apparent once we viewed matching data at this massive scale.

So while the data suggested that our algorithms had room for improvement, that same data gave us the solution.  After exploring a large number of potential fixes and alternative algorithms, we found that the best way to address the problem was to use the observed DNA matches themselves to determine which matches are meaningful for genealogy (IBD) – and to distinguish them from those due to more ancient shared history.  In other words, the matching data itself has the power to help us tease apart the matches that we want to keep from those that we want to throw away.
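A minimal sketch of this filtering idea (hypothetical window size and cutoff, not the production algorithm): count how many reported matches fall in each genomic window, flag windows where an implausibly large number of pairs all match as likely IBS, and keep only segments that extend beyond those windows:

```python
# Sketch of using the match data itself to separate IBD from IBS.
# Windows where a huge number of member pairs all "match" are likely
# hallmarks of ancient shared history; window_size and cutoff are
# illustrative values for this toy example.

from collections import Counter

def flag_ibs_windows(match_segments, window_size=10, cutoff=3):
    """match_segments: (start, end) marker ranges, one per reported
    match. Returns the set of window indices whose match count exceeds
    `cutoff` (i.e. suspiciously popular regions)."""
    counts = Counter()
    for start, end in match_segments:
        for w in range(start // window_size, (end - 1) // window_size + 1):
            counts[w] += 1
    return {w for w, n in counts.items() if n > cutoff}

def keep_match(segment, ibs_windows, window_size=10):
    """Keep a match only if it touches at least one non-flagged window."""
    start, end = segment
    windows = range(start // window_size, (end - 1) // window_size + 1)
    return any(w not in ibs_windows for w in windows)

# Five unrelated pairs all "match" at markers 20-29, so that window
# looks like ancient shared history; one long segment spans it and more.
segments = [(20, 30)] * 5 + [(0, 40)]
ibs = flag_ibs_windows(segments)
print(ibs)                        # {2}
print(keep_match((20, 30), ibs))  # False: confined to the IBS window
print(keep_match((0, 40), ibs))   # True: extends well beyond it
```

The appeal of this approach is that it needs no outside reference: the database’s own matching patterns reveal which regions produce spurious matches.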

The AncestryDNA science team’s efforts – poring through mounds and mounds of DNA matches – have paid off.  From preliminary testing, it appears that these latest discoveries relating to both steps of DNA matching may lead to dramatic improvements. In the future, this may translate to a higher-quality list of matches for each AncestryDNA member: fewer false matches, and a few new matches too.

In addition to the hard work of the AncestryDNA science team, the huge amount of DNA matching data from over a half-million AncestryDNA members is what has enabled these new discoveries.  Carefully studying the results from our existing matching algorithms has now allowed us to complete the research and development “life cycle” of DNA matching: translating real data into future advancements in the AncestryDNA experience.

