Jeremy Pollack, an engineer on the DNA Pipeline Team, and I presented together at QCon San Francisco this week. It was a real tag team effort from two different points of view – the “Manager” and the “Developer” view of the same project. Having both of us on stage was a first, but it seemed to work really well. We kept the focus on how Ancestry.com uses Hadoop and HBase to scale the DNA matching steps in the pipeline. It was a bit intimidating to present at the same time Facebook was talking about how they do Continuous Delivery, but we had a large crowd who were interested in our tech story.
I started the presentation with an informal survey. I asked people to raise their hands if they were using Hadoop, and to keep their hands up if they were also using HBase (about ½ the Hadoop hands went down), and then finally asked them to keep their hand up if they were doing anything at all with DNA. To my surprise, one hand was still in the air. I immediately pointed to that hand and asked the individual to come up after the talk. I ended up meeting Eric Turcotte, a Solutions Architect working on “technology pipeline solutions” at Monsanto. It just so happens that they use Hadoop and HBase to create pipelines to process plant DNA. I was really glad he stuck around after the talk. Finding someone else working with the same technologies (plants, not humans) was really unexpected.
The AncestryDNA project really is an Agile software development story, with population geneticists and bioinformatics PhDs working alongside software engineers to create a unique product that provides another way to deliver family history discoveries to our users. We followed a “measure everything” principle that clearly showed us the steps in our pipeline and predicted with surprising accuracy exactly when they would become a problem. As the slide below shows, there were three different pain points:
- Static steps whose run time depended only on the batch size we were processing (we usually add 1,000 samples each run). For the most part, we were aware of these steps but could safely ignore them.
- Linear steps that grew as the DNA pool size grew. We had to watch these and continually make tweaks to get improvements.
- Time bombs – the steps that were going quadratic. Specifically, the matching steps in the pipeline. No getting around it, these steps had to be addressed.
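The quadratic “time bomb” is easy to see with a little arithmetic. A minimal sketch (illustrative only, not our actual pipeline code; the function name and fixed batch size of 1,000 are assumptions for the example) of why matching work explodes as the pool grows, even though each run adds the same number of new samples:

```python
# Illustrative sketch: why pairwise matching is a "time bomb".
# Every new sample must be compared against the entire existing pool,
# so per-run work grows with pool size -- quadratic in total.

def comparisons_per_run(pool_size: int, batch_size: int = 1000) -> int:
    """Pairwise comparisons needed when a new batch joins the pool."""
    # Each new sample is compared against every existing sample...
    against_pool = pool_size * batch_size
    # ...plus against the other new samples in its own batch.
    within_batch = batch_size * (batch_size - 1) // 2
    return against_pool + within_batch

# The batch size never changes, but the work per run keeps climbing.
for pool in (10_000, 100_000, 1_000_000):
    print(f"pool={pool:>9,}  comparisons={comparisons_per_run(pool):>13,}")
```

Static and linear steps stay manageable under this kind of growth; the pairwise comparison count is the term that eventually dominates everything else.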
This led to a complete rewrite of the GERMLINE DNA matching algorithm using Hadoop and HBase (we call the rewrite “Jermline”) that resulted in a 1700% performance improvement. The graph below shows the dramatic results we achieved after releasing Jermline. It really was a huge, innovative step forward.
What advantages do we get from these technologies? HBase allows us to hold our results permanently between runs, provides an easy way to continually add new samples (simply add another column to the table), and gives us an efficient way to retrieve and compare the data during the matching step. Hadoop allows us to run multiple samples in parallel and process the DNA data quickly and efficiently. Hadoop and HBase have been the right tools to get the job done.
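The column-per-sample idea is worth a quick illustration. A minimal sketch (assumed schema and plain Python stand-in for an HBase wide table, not our actual data model) of how adding a sample becomes a column write, and how one row read fetches every sample’s data for a segment during matching:

```python
# Sketch of the HBase "add a column per sample" idea using a
# dict-of-dicts as a stand-in for a wide table:
#   table[row_key][column_qualifier] -> value
from collections import defaultdict

table = defaultdict(dict)

def add_sample(sample_id: str, segments: dict) -> None:
    """Add a new sample by writing one column into each segment row."""
    for segment_key, value in segments.items():
        table[segment_key][sample_id] = value

def segment_row(segment_key: str) -> dict:
    """One row read returns every sample's value for this segment,
    which is what the matching step compares against."""
    return table[segment_key]

# Hypothetical segment keys and values, purely for illustration.
add_sample("sample_A", {"chr1:seg42": "h1", "chr2:seg07": "h9"})
add_sample("sample_B", {"chr1:seg42": "h1"})
print(segment_row("chr1:seg42"))  # both samples share this segment
```

Because rows are keyed by segment rather than by person, new samples never require restructuring the table, and match candidates for a segment come back in a single row scan.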
Ancestry.com will continue to evolve our DNA product and improve the matching. As the total DNA pool grows in size, we’ll face and resolve new technical challenges. I’ll keep you posted.
Finally, anytime you do a presentation there is always feedback. At QCon the crowd can pick a green card, yellow card, or red card. Pretty obvious how you’re voting. We had 219 people attend (5 walked out), and our color scores were: 61% green, 37% yellow, and 2% red. It just proves that developers are a tough crowd to please. I welcome comments from anyone who was at the presentation or anyone who looks through the slides.