It is interesting to reflect on how we thought we would work with Big Data and compare that to our day-to-day processes. We anticipated writing MapReduce jobs in Java to process our data, transform it, and produce aggregate results. Reality is somewhat different. It turns out to be much more efficient to use scripting (Perl, Python, etc.) to get data ingestion going quickly, and then partition the data into Hive tables. The common, repeatable pattern is simple: ingest the data with a script, then register each new batch as a partition in a Hive table.
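As an illustration of that pattern, here is a minimal Python sketch of the partition-registration step that follows ingestion. The table name, partition column, and HDFS paths are hypothetical placeholders, not our actual schema:

```python
from datetime import date

def build_partition_ddl(table, hdfs_root, day):
    """Build the HiveQL statement that registers one day's ingested
    data as a new partition of an existing partitioned table.
    (Table name and paths here are illustrative only.)"""
    partition_path = f"{hdfs_root}/dt={day.isoformat()}"
    return (
        f"ALTER TABLE {table} "
        f"ADD IF NOT EXISTS PARTITION (dt='{day.isoformat()}') "
        f"LOCATION '{partition_path}'"
    )

# After the ingestion script lands a day's files, emit the DDL that a
# driver script would pass to the hive CLI (e.g. hive -e "<ddl>"):
ddl = build_partition_ddl("web_logs", "/data/raw/web_logs", date(2013, 4, 1))
print(ddl)
```

Because the partition is just a pointer to the ingested files, the raw data becomes queryable in Hive as soon as this statement runs, with no extra copy step.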
You need some kind of workflow infrastructure to control the execution of your scripts. You may be tempted to write your own job scheduler and process control, but take a hard look at Azkaban or Oozie first. Either framework can run a workflow that performs the ingestion, creates the Hive partitions, and runs the analysis jobs. We are currently evaluating both.
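To give a feel for how little glue this takes, here is a sketch of such a three-step flow in Azkaban's flat `.job` file format; the script and file names are hypothetical, and the three stanzas are shown together for brevity (each would normally live in its own `.job` file). The `dependencies` lines are what make Azkaban run ingestion, partitioning, and analysis in order:

```
# ingest.job -- run the ingestion script
type=command
command=python ingest_logs.py

# add_partition.job -- register the new data as a Hive partition
type=command
dependencies=ingest
command=hive -f add_partition.hql

# run_analysis.job -- daily analysis queries over the new partition
type=command
dependencies=add_partition
command=hive -f daily_analysis.hql
```

An Oozie workflow expresses the same chain as an XML `workflow.xml` with action nodes; the choice between the two is largely a matter of operational preference.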
Once the data is partitioned into Hive tables, it is available to your analysts for ad hoc investigation, and they can leverage their SQL skills through HiveQL. If an insight is identified and a set of queries needs to run regularly to help run the business, it is easy to create a simple script that extracts the data at regular intervals and feeds it into your daily processing.
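A recurring extraction like that can be as small as a script that fills in the previous day's date and hands the query to the hive CLI. A minimal sketch, assuming a hypothetical `web_logs` table with a `dt` partition column:

```python
import subprocess
from datetime import date, timedelta

def daily_query(day):
    """HiveQL aggregate over a single day's partition (hypothetical
    web_logs table and columns, for illustration only)."""
    return (
        "SELECT page, COUNT(*) AS hits "
        f"FROM web_logs WHERE dt='{day.isoformat()}' "
        "GROUP BY page ORDER BY hits DESC"
    )

def run_extract(day, out_path):
    """Hand the query to the hive CLI and capture the result as TSV."""
    with open(out_path, "w") as out:
        subprocess.run(["hive", "-e", daily_query(day)], stdout=out, check=True)

# Typically scheduled for the previous day, e.g. from cron, Azkaban, or Oozie:
# run_extract(date.today() - timedelta(days=1), "daily_page_hits.tsv")
```

Because the query only touches one partition, Hive prunes the scan to that day's files, which keeps a daily extract cheap even as the table grows.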
We are finding that scripting and Hive are pretty efficient ways to ingest and process our unstructured data.
This blog is focused on the technology used behind the scenes at Ancestry.com. It’s a place to learn about the experiences we have, the challenges we face, and the solutions our engineering and tech teams use to create the Ancestry.com experience.