Posted by Ancestry Team on August 10, 2013 in Big Data

It is interesting to compare how we thought we would work with Big Data against our actual day-to-day processes. We anticipated writing MapReduce jobs in Java that process our data, transform it, and produce aggregate results. Reality is somewhat different. It turns out to be much more efficient to use scripting (Perl, Python, etc.) to get data ingestion going quickly, and then partition the data into Hive tables. The common, repeatable steps are:

  • Identify the data you need to ingest, then put together a simple script that copies the raw data into Hadoop. The script can run at regular intervals.
  • Next, create a script that validates the data and creates the correct Hive partitions and tables. For example, for our log data the natural partitions are year, month, and day, and our tables break out the User ID, Session ID, etc. (a sketch of these first two steps follows this list).
  • Now you can run analytics jobs against the Hive tables. Once again, a script can run the queries and write the results to a separate table or text file.
  • An optional final step is to move the output into our data warehouse, where it is available for business reports.
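To make the first two steps concrete, here is a minimal Python sketch, assuming an external Hive table named logs partitioned by year/month/day. The paths, table name, and file layout are hypothetical, not our production setup.

```python
#!/usr/bin/env python
# Hypothetical sketch: copy one day's raw log files into HDFS, then
# register the matching Hive partition. All paths and names are
# illustrative.
import glob
import subprocess
from datetime import date

today = date.today()
local_dir = "/var/logs/app/%s" % today.strftime("%Y-%m-%d")
hdfs_dir = "/data/raw/logs/year=%d/month=%d/day=%d" % (
    today.year, today.month, today.day)

# Step 1: copy the raw data into Hadoop.
# (-mkdir -p is Hadoop 2.x syntax; plain -mkdir creates parents on 1.x.)
subprocess.check_call(["hadoop", "fs", "-mkdir", "-p", hdfs_dir])
for path in glob.glob(local_dir + "/*.log"):
    subprocess.check_call(["hadoop", "fs", "-put", path, hdfs_dir])

# Step 2: point a Hive partition at the new directory
# (an external, partitioned table is assumed).
ddl = ("ALTER TABLE logs ADD IF NOT EXISTS PARTITION "
       "(year=%d, month=%d, day=%d) LOCATION '%s'"
       % (today.year, today.month, today.day, hdfs_dir))
subprocess.check_call(["hive", "-e", ddl])
```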

You need some kind of workflow infrastructure to control the execution of your scripts. You may be tempted to write your own job scheduler and process control; instead, take a hard look at Azkaban or Oozie. Either framework can drive a workflow that runs the ingestion, creates the Hive partitions, and kicks off the analysis jobs, as sketched below. We are currently evaluating both.
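For illustration, here is roughly what that dependency chain might look like as classic Azkaban job files; the file names and commands are hypothetical, and Oozie would express the same ordering as an XML workflow definition instead.

```
# ingest.job -- copy the raw data into Hadoop
type=command
command=python ingest_logs.py

# partition.job -- runs only after ingest succeeds
type=command
command=python create_partitions.py
dependencies=ingest

# analyze.job -- run the analysis queries last
type=command
command=python run_analysis.py
dependencies=partition
```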

Once the data is partitioned into Hive tables, it is available to your analysts for ad hoc investigation, and they can leverage their SQL skills through HiveQL. If a query surfaces an insight and needs to run regularly to help run the business, it is easy to wrap it in a simple script that extracts the data at regular intervals and add that script to your daily processing.
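As a hypothetical example of promoting an ad hoc query into the daily run, a small script might aggregate yesterday's partition into a results table. The table and column names here are invented for illustration.

```python
#!/usr/bin/env python
# Hypothetical sketch: run a recurring HiveQL aggregation and land the
# results in a reporting table that feeds the daily processing.
import subprocess
from datetime import date, timedelta

d = date.today() - timedelta(days=1)  # report on yesterday's partition
query = """
INSERT OVERWRITE TABLE daily_session_counts
PARTITION (year=%d, month=%d, day=%d)
SELECT user_id, COUNT(DISTINCT session_id)
FROM logs
WHERE year = %d AND month = %d AND day = %d
GROUP BY user_id
""" % ((d.year, d.month, d.day) * 2)
subprocess.check_call(["hive", "-e", query])
```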

We are finding that scripting and Hive are pretty efficient ways to ingest and process our unstructured data.
