I recently had the opportunity to present the story of the Ancestry.com DNA pipeline project at the Utah Big Mountain Conference put on by Utah Geek Events. It really is a great story:
- It has a unique cast of characters – Population Geneticists and Bioinformatics PhDs working alongside Software Engineers.
- A real-world Agile software development project: start with something from academia that works but needs to scale in a production environment, get it working, and iterate.
- A “measure everything” philosophy that clearly showed us the steps in our process that would go quadratic and predicted with surprising accuracy exactly when they would become a problem.
- A complete rewrite of the GERMLINE DNA matching algorithm using Hadoop and HBase (we call the rewrite “Jermline”) that resulted in a 1700% performance improvement.
- A walkthrough of how DNA matching works, using Battlestar Galactica characters. This made for a fun example, though I expected more fans of the show in the audience; only a few enthusiastic individuals showed their appreciation.
- And the story will continue as our Software and DNA Science Teams continue to work together to improve the pipeline. That’s the best part of the story – Ancestry.com is in the middle of the journey.
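To make the “measure everything” point concrete, here is a minimal sketch of how timing measurements can predict when a quadratic step becomes a problem. It is an illustration only, not our pipeline code: the sample counts, the run times, the 24-hour budget, and the function names are all hypothetical, and it assumes the step's run time grows purely as the square of the number of samples.

```python
import math


def fit_quadratic_coefficient(sizes, hours):
    """Least-squares fit of hours ~= a * size**2 (one coefficient, no intercept)."""
    numerator = sum((n ** 2) * t for n, t in zip(sizes, hours))
    denominator = sum(n ** 4 for n in sizes)
    return numerator / denominator


def size_at_budget(a, budget_hours):
    """Sample count at which the fitted curve reaches the time budget."""
    return math.sqrt(budget_hours / a)


# Hypothetical measurements: number of samples vs. hours for one pipeline step.
sizes = [10_000, 20_000, 40_000]
hours = [0.5, 2.0, 8.0]

a = fit_quadratic_coefficient(sizes, hours)
limit = size_at_budget(a, budget_hours=24.0)  # assumed 24-hour processing window

print(f"fitted coefficient: {a:.3e} hours per sample^2")
print(f"step exceeds the budget at roughly {limit:,.0f} samples")
```

Doubling the input quadruples the run time in this toy data, which is the signature of a quadratic step. Combining the fitted curve with a growth forecast for the sample count turns "this will be a problem" into an approximate date.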
One of the key slides shows how the project delivered incremental changes across its releases, continually improving and adjusting while supporting the business. Two of those releases were key:
- H1 moved the ethnicity processing to Hadoop.
- H2 replaced GERMLINE with Jermline, a distributed implementation that uses Hadoop and HBase.
Let’s talk about the event. I live in Draper and work in Provo (near the river bottoms). Every day I pass the Adobe building in Lehi, and this was my first time inside it. I gave my talk in the “Atrium”, a very large room with lots of windows and glass and a very high ceiling. There were chairs set up on one end, a projector and screen, and no microphone, which is a bit intimidating if you are going to give a presentation. Mine was the second talk of the day in that room, and I managed to pull in about 80 people who were interested in the subject.
It was a very diverse and interesting audience. There were people who were new to Hadoop and HBase, others who had significant experience with both, and even people who work with DNA at their jobs. There were great questions during the talk and at the end. It really is nice to see a “geek subculture” in Utah that is interested in learning about Big Data and sharing experiences. If Utah is going to create a technology corridor, there needs to be more collaboration and knowledge sharing between Utah developers and their companies. We should take a lesson from Silicon Valley and create a culture of meetups, open source projects, and a willingness to help others.
As for me, I need to share the cool Big Data technology projects Ancestry.com is working on and exactly what we’ve learned. That should help others avoid the same mistakes and give them the confidence to try new technologies and innovate.