Ancestry.com becomes more and more aware of the value of the data our website generates every single day. We have a lot of customers coming to the website to discover, preserve and share their family history. They come from different parts of the world and are looking for information that helps them tell the story of their ancestors’ lives and to learn more about themselves. We certainly acknowledge and to some extent “celebrate the heterogeneity” of our customers, as Professor Fader from Wharton School once stated as a marketing principle in his Coursera course as he mentioned that relying on data is the key. But where do we start?
Most companies begin their big data journey by capturing customer data, and soon run into the problem of how to store and process their humongous set of data. Smart people from Google had great ideas on how to solve this problem via MapReduce and Google File System, and soon people from Yahoo! and other companies caught up with an open source version called Hadoop. We are using Hadoop at Ancestry.com, and I am glad we were not the ones running into a Big Data problem without a solution in the first place, though we have had to develop our own innovative methods to scale to handle a growing business and content.
We are getting more and more familiar with Hadoop from the success we’ve had with the AncestryDNA project, which I covered in a previous blog post. Now, we have the place to store and process the data, which is great, but we have to ask ourselves, did we capture all of the necessary the data? In our case, the answer is yes and no. We started by looking at the existing data we had, and much of it was webserver logging. It only gives us limited information about page visits. Thanks to the Ancestry Framework team, we are now able to capture logging events through aspects (AOP) and are able to stitch the information from multiple stacks that collectively serve a single end user web request.
How would we get the newly instrumented logging data into Hadoop? Let’s take a look at how Hadoop was adopted at its infancy. Historically, people would drop the data into Hadoop and run long-running MapReduce programs to process the data. Very soon, those people realized they needed something faster than a batch processing job. Many stream processing frameworks were invented as a result.
At Ancestry.com we’re working towards real-time processing in Hadoop as well. After looking at a few frameworks, we chose Apache Kafka to stream the data into Hadoop. (Again, thanks to those who ran into difficult problems and created these elegantly designed systems.) With help from the LinkedIn‘s Kafka and Hadoop folks, along with the open source community, we were able to go live with Kafka on a few major stacks collecting data just before Christmas this year. We then plan to have a full scale rollout to the rest of the stacks. This would achieve our need for handling the massive amounts of data by pumping it into Hadoop, as well as better positioning us for real-time stream processing, which might come next.
We are hoping all this will help our Analytics group and Data Science team to understand user data better in order to improve our customers’ experience. After a quick holiday break, the data engineering team will be busy again to ensure 2014 is on track to become a data-driven year.
About Aaron Ling
Aaron Ling is director of engineering at Ancestry.com. He oversees the company’s data engineering efforts to implement Big Data strategies, such as creating efficient and scalable data warehouse and Hadoop pipelines, log messaging and DNA data processing. Aaron came to Ancestry.com with over 10 years of experience in various manager and developer roles at such companies as IBM DemandTec and Ariba. He holds an M.S. in Computer Science from Yale University and a B.S. in Computer Science from Nanjing University. When he’s not busy solving complex engineering problems you'll find him spending time with his family or playing video games when alone.