Xuyen On is a Senior Software Engineer at Ancestry.com who works in the Data Services Team where he is building out a new infrastructure to collect Big Data and make it available to company.
In my previous posts, I outlined how to import data into Hive tables using Hive scripts and dynamic partitioning. However, we’ve found that this only works for small batch sizes and it is not scalable for larger jobs. Instead, we found that it is faster and more efficient to partition the data as they are… Read more
In my last post, I introduced our first steps in creating a scalable, high volume messaging system and would like to provide an update on our progress. We have built out a 0.7.2 Kafka cluster to start ingesting data from our servers. The cluster consists of the following: 5 x Kafka nodes • Dual 6… Read more
At Ancestry.com we are becoming more data driven. That means we want to capture more data about our systems, including how our users are interacting with them. Part of that strategy is to capture the log files from our application servers and put them into our Hadoop cluster. We have tried using MSMQ and RabbitMQ… Read more
A Quick and Efficient Way to Update Hive Tables Using Partitions In my previous post, I outlined a strategy to update mutable data in Hadoop by using Hive on top of HBase. In this post, I will outline another strategy to update data in Hive. Instead of using a backend system to update data like… Read more
Hive is good at querying immutable data like log files. These are files that do not change after they are written. But what if you want to query data that can change? For example, users of our site frequently make modifications to their family trees. Some of this data sits in very large and frequently… Read more