Tech Roots: posts by Aaron Ling on the Ancestry.com Tech Roots blog (http://blogs.ancestry.com/techroots)

On Track to Data-Driven
By Aaron Ling, Wed, 25 Dec 2013

Ancestry.com is becoming increasingly aware of the value of the data our website generates every single day. Many customers come to the website to discover, preserve, and share their family history. They come from different parts of the world, looking for information that helps them tell the story of their ancestors’ lives and learn more about themselves. We certainly acknowledge and to some extent “celebrate the heterogeneity” of our customers, as Professor Fader of the Wharton School put it in his Coursera marketing course, where he argued that relying on data is the key. But where do we start?

Most companies begin their big data journey by capturing customer data, and soon run into the problem of how to store and process such a huge volume of it. Smart people at Google had great ideas for solving this problem with MapReduce and the Google File System, and people at Yahoo! and other companies soon followed with an open source implementation called Hadoop. We use Hadoop at Ancestry.com, and I am glad we were not the ones who first ran into a big data problem without a solution, though we have had to develop our own innovative methods to scale with a growing business and growing content.

We have become increasingly familiar with Hadoop through the success of the AncestryDNA project, which I covered in a previous blog post. Now we have a place to store and process the data, which is great, but we have to ask ourselves: did we capture all of the necessary data? In our case, the answer is yes and no. We started by looking at the data we already had, and much of it was web server logging, which gives us only limited information about page visits. Thanks to the Ancestry Framework team, we are now able to capture logging events through aspect-oriented programming (AOP) and to stitch together the information from the multiple stacks that collectively serve a single end-user web request.
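As a loose illustration (not the Ancestry Framework's actual implementation), an aspect-style wrapper can tag every handler call with a correlation id, so log events emitted by different stacks serving the same request can later be stitched together. The `frontend`/`backend` handlers below are hypothetical stand-ins:

```python
import functools
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("request-log")

def logged(handler):
    """Aspect-style wrapper: emit an event before and after each handler call,
    tagged with a correlation id shared across every stack touching the request."""
    @functools.wraps(handler)
    def wrapper(request, *args, **kwargs):
        # first stack to see the request mints the id; downstream stacks reuse it
        rid = request.setdefault("correlation_id", uuid.uuid4().hex)
        log.info("enter %s correlation_id=%s", handler.__name__, rid)
        result = handler(request, *args, **kwargs)
        log.info("exit  %s correlation_id=%s", handler.__name__, rid)
        return result
    return wrapper

@logged
def frontend(request):
    return backend(request)   # a second "stack" serving the same request

@logged
def backend(request):
    return {"status": 200, "correlation_id": request["correlation_id"]}

response = frontend({})
print(response["status"])
```

Grouping all four log lines by `correlation_id` reconstructs the full path of one end-user request across stacks.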

How would we get the newly instrumented logging data into Hadoop? Let's look at how Hadoop was adopted in its infancy. Historically, people would drop data into Hadoop and run long-running MapReduce programs to process it. Very soon, they realized they needed something faster than a batch processing job, and many stream processing frameworks were invented as a result.

At Ancestry.com we're working toward real-time processing in Hadoop as well. After looking at a few frameworks, we chose Apache Kafka to stream the data into Hadoop. (Again, thanks to those who ran into difficult problems and created these elegantly designed systems.) With help from LinkedIn's Kafka and Hadoop folks, along with the open source community, we went live with Kafka on a few major stacks just before Christmas this year, and we plan a full-scale rollout to the rest of the stacks next. This meets our need to pump massive amounts of data into Hadoop, and it positions us well for the real-time stream processing that might come next.
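For flavor, a producer pushing log events into a Kafka topic is mostly configuration. The fragment below is a hypothetical sketch using present-day producer property names (the 0.8-era producer we would have used at the time named some of these differently), with placeholder hostnames:

```
# Hypothetical Kafka producer settings for streaming log events into a topic
bootstrap.servers=kafka1:9092,kafka2:9092   # placeholder broker list
acks=1                                      # leader ack is enough for log data
compression.type=snappy                     # cheap compression for high volume
batch.size=16384                            # batch events per partition
linger.ms=5                                 # trade a little latency for throughput
```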

We hope all of this will help our Analytics group and Data Science team understand user data better, so that we can improve our customers' experience. After a quick holiday break, the data engineering team will be busy again making sure 2014 is on track to be a data-driven year.

Happy Holidays!

How Ancestry.com Practices Agile to Solve Challenges with Consumer DNA Testing
By Aaron Ling, Thu, 29 Aug 2013

A typical web application starts with a blank page; in subsequent sprints, you add features to it. (I sound like one of your Agile coaches, don't I?) But in reality, the business needs you to deliver more value than a blank page. So how can you quantify the minimum value you are delivering in a product release?

Here is how our DNA Backend Engineering team applied Agile processes in the release of AncestryDNA.

Let me start with some background. In May of 2012, Ancestry.com launched a revolutionary new DNA testing service, AncestryDNA. At a high level, this test gives users a percentage breakdown of their ethnicity and connects them to distant cousins based on DNA matches.

In preparation for the launch, we kicked off development of the DNA backend pipeline late in 2011. We faced two main challenges: first, the pipeline needed to process raw DNA data to yield ethnicity and matching predictions; second, its performance needed to be acceptable.

The first task was the easier one: we defined acceptance criteria for our ethnicity and matching prediction accuracy and used Test-Driven Development (TDD) to reach the done-done stage.
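As a hypothetical sketch of what such an acceptance criterion might look like (the reference panel, the stub predictor, and the threshold below are invented for illustration; they are not Ancestry's actual test suite or data):

```python
# Hypothetical acceptance test: the ethnicity predictor must reach a minimum
# accuracy on a labeled reference panel before the story counts as done-done.

REFERENCE_PANEL = [
    ("sample1", "Scandinavia"),
    ("sample2", "West Africa"),
    ("sample3", "East Asia"),
    ("sample4", "Scandinavia"),
]

def predict(sample_id):
    # Stub predictor; the real pipeline derives this from raw genotype data.
    return {"sample1": "Scandinavia", "sample2": "West Africa",
            "sample3": "East Asia", "sample4": "Central Europe"}[sample_id]

def accuracy(panel):
    hits = sum(1 for sid, truth in panel if predict(sid) == truth)
    return hits / len(panel)

def test_ethnicity_accuracy():
    # Illustrative acceptance threshold, not the real one.
    assert accuracy(REFERENCE_PANEL) >= 0.75

test_ethnicity_accuracy()
print("accuracy:", accuracy(REFERENCE_PANEL))
```

Framing accuracy as an executable test is what lets "done-done" be a binary, automatable state rather than a judgment call.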

The second challenge, performance, proved more difficult, because the honest answer is always "it depends" on multiple factors. Our pipeline processes DNA samples in batches, and as the business grows and the DNA database increases in size, the batches need to get bigger. We calculated that "if we don't improve this," the numbers will be "X" in two months. On top of that, different parts of our DNA pipeline scale differently with batch size: some are constant, some linear, and some quadratic.
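That kind of "what-if" projection can be sketched with a toy cost model. The coefficients below are invented for illustration, but the shape (a fixed setup cost plus a linear per-sample part plus a quadratic all-pairs matching part) mirrors the mix of behaviors described above:

```python
# Toy runtime model: total pipeline time as a function of batch size n.
# Coefficients are made-up illustrative values, not measured Ancestry numbers.

def pipeline_hours(n, fixed=2.0, per_sample=0.001, pairwise=1e-7):
    # fixed setup cost + linear per-sample steps + quadratic all-pairs matching
    return fixed + per_sample * n + pairwise * n * n

today = pipeline_hours(10_000)   # 2 + 10 + 10 = 22 hours
soon = pipeline_hours(20_000)    # 2 + 20 + 40 = 62 hours
print(today, soon)
```

Doubling the batch only doubles the linear term but quadruples the quadratic one, which is why the matching stage dominates the projection and why it was the part that most needed a scalability fix.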

Our plan to address the growth had three steps: first, upgrade the hardware; second, adopt Apache Hadoop for the ethnicity computation; third, improve disk management and adopt HBase for GERMLINE, the academic algorithm that finds hidden family relationships within a reservoir of DNA (my colleague's series of posts covers how we scaled this academic algorithm). As you can imagine, this original Agile plan evolved as our "what-if" scenarios changed; we then juggled those scenarios again and planned performance enhancement features to solve the next ones.

[Figure: Final_Results chart of the running time of each pipeline part at the end of 2012]

The above chart shows a snapshot of the running time of each pipeline part at the end of 2012, when we resolved our scalability challenges. We made almost every part of the pipeline scale horizontally (we really love that stable flat line). The pipeline was being modified constantly; as a result of frequent done-done stories and code rolls, we increased the batch size several times over the period, so the overall performance improvement was even greater than the chart suggests. Our hard work on this project, with appropriate planning and performance goals, enabled us to deliver value to the business and customers early on, and building a scalable pipeline also saved us from overinvesting in engineering resources. 2012 had a happy ending for the DNA team: we now have in hand a capable, steady pipeline that lets us process DNA samples at scale.

Now that you have the background on our DNA pipeline, my coworkers and I will blog about other DNA development efforts in future posts.
