Posted by Ancestry Team on September 26, 2013 in Agile, Big Data

I recently had the opportunity to present the story of the Ancestry.com DNA pipeline project at the Utah Big Mountain Conference put on by Utah Geek Events. It really is a great story:

  • It has a unique cast of characters – Population Geneticists and Bioinformatics PhDs working alongside Software Engineers.
  • It is a real-world Agile software development project: start with something from academia that works but needs to scale in a production environment, get things working, and iterate.
  • A “measure everything” philosophy that clearly showed us the steps in our process that would go quadratic and predicted with surprising accuracy exactly when they would become a problem.
  • A complete rewrite of the GERMLINE DNA matching algorithm using Hadoop and HBase (we call the rewrite “Jermline”) that resulted in a 1700% performance improvement.
  • A walkthrough of how DNA matching works, using Battlestar Galactica characters. This made for a fun example. I really expected a few more fans of the show in the audience, but only a few enthusiastic individuals showed their appreciation.
  • And the story will continue as our Software and DNA Science Teams continue to work together to improve the pipeline. That’s the best part of the story – Ancestry.com is in the middle of the journey.
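
To make the "measure everything" point concrete, here is a minimal sketch of how timing measurements for one pipeline step could be checked for quadratic growth and projected forward. It is not our actual tooling: the function names, the sample counts, the timings, and the 24-hour budget are all made up for illustration, and it assumes the step's run time is dominated by a single n-squared term.

```python
import math


def loglog_slope(sizes, seconds):
    """Least-squares slope of log(seconds) against log(sizes).

    A slope close to 2 suggests the step grows quadratically with input size.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def fit_quadratic_coefficient(sizes, seconds):
    """Least-squares fit of the one-parameter model seconds = a * n**2."""
    return sum(n * n * t for n, t in zip(sizes, seconds)) / sum(n ** 4 for n in sizes)


def size_at_budget(a, budget_seconds):
    """Input size at which the fitted model a * n**2 reaches the time budget."""
    return math.sqrt(budget_seconds / a)


# Hypothetical measurements (NOT real pipeline data): samples processed
# and wall-clock seconds for one step at each size.
sizes = [10_000, 20_000, 40_000, 80_000]
seconds = [50.0, 200.0, 800.0, 3200.0]

slope = loglog_slope(sizes, seconds)
a = fit_quadratic_coefficient(sizes, seconds)
limit = size_at_budget(a, budget_seconds=24 * 60 * 60)  # assumed 24-hour budget

print(f"log-log slope: {slope:.2f}")
print(f"fitted coefficient a: {a:.3e}")
print(f"step hits the budget at roughly {limit:,.0f} samples")
```

With real measurements the slope will not be exactly 2, and a fit that also includes linear and constant terms may be more appropriate. The point is only that a handful of timings per step is enough to see which steps are heading for trouble and roughly when.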

One of the key slides shows how the project delivered incremental changes, continually improving and adjusting while supporting the business.

[Figure: pipeline steps graph, showing the releases]

You can see the various releases where the team delivered incremental improvements. There are two key releases:

  • H1 moved the ethnicity processing to Hadoop.
  • H2 replaced GERMLINE with Jermline, a distributed implementation that uses Hadoop and HBase.

Let’s talk about the event. I live in Draper and work in Provo (near the river bottoms). Every day I pass the Adobe building in Lehi, and this was my first time inside that building. I gave my talk in the “Atrium”, a very large room with lots of windows and glass and a very high ceiling. There were chairs set up on one end, a projector and screen, and no microphone, which is a bit intimidating if you are going to give a presentation. I was the second talk of the day in that room and I managed to pull in about 80 people who were interested in the subject.

It was a very diverse and interesting audience. There were people who were new to Hadoop and HBase, others who had significant experience with both, and even people who were working with DNA at their jobs. There were great questions during the talk and at the end. It really is nice to see a “geek subculture” in Utah that is interested in learning about Big Data and sharing their experiences. If Utah is going to create a technology corridor, there needs to be more collaboration and knowledge sharing between Utah developers and their companies. Take a lesson from Silicon Valley and create a culture of meetups, open source projects, and a willingness to help others.

As for me, I need to share the cool Big Data technology projects Ancestry.com is working on and exactly what we’ve learned. That should help others avoid the mistakes we made and give them the confidence to try new technologies and innovate.

Comments

  1. Brad Stone

    The performance improvements using Hadoop and HBase for your DNA pipeline are very impressive. I notice in some of your later posts that your team is working with the Bay Area HBase community. I live in Provo and am very interested in Big Data – especially how it can be used to solve impossible problems. Do you know if communities have formed in the Utah tech corridor since you posted this message?

    • Bill Yetman

      Brad,

      Thank you for the comment and the interest in what we’re doing. There is a Utah Hadoop User’s Group (UHUG – http://www.uhug.org/) and they are running a Big Data competition around Utah’s air quality. The other group to look into is Utah Geek Events (http://www.utahgeekevents.com/). You will see the Big Data competition mentioned there as well. Try both of these groups.

      Thanks,
      -Bill-
