About Xuyen On

Xuyen On is a Senior Software Engineer at Ancestry.com who works on the Data Services team, where he is building out a new infrastructure to collect Big Data and make it available to the company.

Past Articles

Lessons Learned Building a Messaging Framework

Posted on July 1, 2014 in Big Data

We have built out an initial logging framework with Kafka 0.7.2, a messaging system developed at LinkedIn. This blog post will go over some of the lessons we’ve learned by building out the framework here at Ancestry.com. Most of our application servers are Windows-based and we want to capture IIS logs from these servers. However,… Read more

Using Mappers to Read and Partition Large amounts of Data from Kafka into Hadoop

Posted on April 8, 2014 in Big Data

In my previous posts, I outlined how to import data into Hive tables using Hive scripts and dynamic partitioning. However, we’ve found that this only works for small batch sizes and it is not scalable for larger jobs. Instead, we found that it is faster and more efficient to partition the data as they are… Read more

Handling Dynamic JSON Schemas

Posted on February 5, 2014 in Big Data

In my last post, I introduced our first steps in creating a scalable, high volume messaging system and would like to provide an update on our progress. We have built out a 0.7.2 Kafka cluster to start ingesting data from our servers. The cluster consists of the following: 5 x Kafka nodes • Dual 6… Read more

First steps to building a scalable high volume messaging system

Posted on November 16, 2013 in Big Data

At Ancestry.com we are becoming more data driven. That means we want to capture more data about our systems, including how our users are interacting with them. Part of that strategy is to capture the log files from our application servers and put them into our Hadoop cluster. We have tried using MSMQ and RabbitMQ… Read more

A Quick and Efficient Way to Update Hive Tables Using Partitions

Posted on August 7, 2013 in Big Data

In my previous post, I outlined a strategy to update mutable data in Hadoop by using Hive on top of HBase. In this post, I will outline another strategy to update data in Hive. Instead of using a backend system to update data like… Read more

Using Hive and HBase to Query and Maintain Mutable Data

Posted on May 23, 2013 in Big Data

Hive is good at querying immutable data like log files. These are files that do not change after they are written. But what if you want to query data that can change? For example, users of our site frequently make modifications to their family trees. Some of this data sits in very large and frequently… Read more