A typical web application starts with a blank page. Then in further sprints, you can add features to it. (I sound like one of your Agile coaches, don’t I?) But in reality, the business needs you to deliver more value than a blank page. So, how can you quantify the minimum value you are delivering in a product release?
Here is the approach our DNA Backend Engineering team took, using Agile processes, in the release of AncestryDNA:
Let me start with some background. In May of 2012, Ancestry.com produced a revolutionary new DNA testing service—AncestryDNA. At a high level, this test gives users a percentage breakdown of their ethnicity, and connects them to distant cousins based on DNA matches.
In preparation for the launch, we kicked off the software development of the DNA backend pipeline late in 2011. We faced two main challenges: first, the pipeline needed to be able to process the raw DNA data to yield ethnicity and matching predictions; second, the performance needed to be acceptable.
The first task was easy; we defined the acceptance criteria for the accuracy of our ethnicity and matching predictions and used Test-Driven Development (TDD) to get it to the done-done stage.
The second challenge, performance, proved to be more difficult because the realistic answer is that “it depends” on multiple factors. Our pipeline processes DNA samples in batches. As our business grows and the size of the DNA database increases, we will need bigger batches. We calculated that if we did not improve the pipeline, the running time would be “X” in two months. On top of this, different parts of our DNA pipeline respond differently to growth: some stay constant, some grow linearly and some grow quadratically.
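To make that kind of “what-if” calculation concrete, here is a minimal sketch in Python. It is not our production code: the stage names, the baseline hours and the choice of which stage is constant, linear or quadratic are all made-up assumptions for illustration. It only shows how a projection behaves when the parts of a pipeline scale differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline part. All values used below are hypothetical."""
    name: str
    base_hours: float  # assumed run time at today's batch size
    exponent: int      # 0 = constant, 1 = linear, 2 = quadratic

    def projected_hours(self, growth: float) -> float:
        # growth = future batch size divided by today's batch size
        return self.base_hours * growth ** self.exponent


def project_pipeline(stages, growth):
    """Return the projected run time per stage for a given growth factor."""
    return {stage.name: stage.projected_hours(growth) for stage in stages}


# Hypothetical stages; neither the names nor the numbers are measurements.
STAGES = [
    Stage("setup", 1.0, 0),      # assumed constant
    Stage("ethnicity", 2.0, 1),  # assumed linear
    Stage("matching", 3.0, 2),   # assumed quadratic
]

if __name__ == "__main__":
    for growth in (1, 2, 4):
        hours = project_pipeline(STAGES, growth)
        total = sum(hours.values())
        slowest = max(hours, key=hours.get)
        print(f"growth x{growth}: total {total:.1f}h, slowest part: {slowest}")
```

With these assumed numbers, doubling the batch size takes the total from 6 to 17 hours, and nearly all of the increase comes from the quadratic part. That is the kind of result that tells you which part to fix first.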
Our next step was a plan to address the growth: first, upgrade the hardware; second, adopt Apache Hadoop to address ethnicity; third, improve disk management to adopt HBase for the academic algorithm Germline, which finds hidden family relationships within a reservoir of DNA (my colleague’s series of posts addresses how we scaled this academic algorithm). As you can imagine, this original Agile plan evolved as our “what-if” scenarios changed. We then juggled these scenarios again and planned performance enhancement features to solve the next “what-if” scenarios.
The above chart illustrates a snapshot of the running time of all pipeline parts at the end of 2012, when we resolved our scalability challenges. We made the pipeline scale horizontally in almost every part (we really love the “stable” flat line there). The pipeline turned out to be one that we modified constantly. Because we frequently reached done-done and rolled out code, we were able to increase the batch size several times throughout the period, so the overall performance improvement was greater than the scale drawn in the above chart. Our hard work on this project, together with appropriate planning and performance goals, enabled us to deliver value to the business and customers early on. Creating a scalable pipeline also saved us from overinvesting in engineering resources. 2012 had a happy ending for the DNA team: we now have, in hand, a capable and steady pipeline that allows us to process DNA samples at scale.
Now that you have the background on our DNA pipeline, my coworkers and I will blog about other development efforts in DNA in future posts.
About Aaron Ling
Aaron Ling is director of engineering at Ancestry.com. He oversees the company’s data engineering efforts to implement Big Data strategies, such as creating efficient and scalable data warehouse and Hadoop pipelines, log messaging and DNA data processing. Aaron came to Ancestry.com with over 10 years of experience in various manager and developer roles at companies such as IBM DemandTec and Ariba. He holds an M.S. in Computer Science from Yale University and a B.S. in Computer Science from Nanjing University. When he’s not busy solving complex engineering problems, you’ll find him spending time with his family or, when he is alone, playing video games.