It is interesting to reflect on how we thought we would work with Big Data and compare that to our day-to-day processes. We anticipated writing MapReduce jobs in Java to process our data, transform it, and produce aggregate results. Reality is somewhat different. It turns out to be much more efficient to use scripting (Perl, Python, etc.) to get data ingestion going quickly, and then partition the data into Hive tables. The common, repeatable steps are:
- Identify the data you need to ingest and then put together a simple script to copy the raw data into Hadoop. The script can be executed at regular time intervals.
- Next, create a script that validates the data and creates the correct Hive partitions and tables. For example, for our log data the natural partitions are year, month and day. Our tables break out the User ID, Session ID, etc.
- Now you can start running analytics jobs against the Hive tables. Once again, a script can be used to run the queries and output the results to a separate table or text file.
- An optional final step is to move the output into your data warehouse, where it is available for business reports.
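The first two steps above can be sketched as a short Python script. Note that the HDFS paths, table name (`web_logs`), and partition layout here are hypothetical stand-ins; substitute your own log locations and Hive schema.

```python
"""Sketch of the ingest-then-partition steps: copy raw logs into HDFS,
then register the new day as a Hive partition. Paths and table names
are hypothetical examples, not a prescribed layout."""
import subprocess


def hdfs_copy_command(local_glob, year, month, day):
    """Build the `hadoop fs` command that copies raw log files into HDFS,
    one directory per day so the layout matches the Hive partitions."""
    target = "/data/raw/logs/%04d/%02d/%02d" % (year, month, day)
    return ["hadoop", "fs", "-put", local_glob, target]


def add_partition_statement(year, month, day):
    """Build the HiveQL that adds the new day as a partition of a
    (hypothetical) external table pointed at the raw log directory."""
    return (
        "ALTER TABLE web_logs ADD IF NOT EXISTS "
        "PARTITION (year=%d, month=%d, day=%d) "
        "LOCATION '/data/raw/logs/%04d/%02d/%02d'"
        % (year, month, day, year, month, day)
    )


def ingest(local_glob, year, month, day):
    """Run both steps for one day; intended to be kicked off on a schedule."""
    subprocess.check_call(hdfs_copy_command(local_glob, year, month, day))
    subprocess.check_call(["hive", "-e", add_partition_statement(year, month, day)])
```

A cron entry (or a workflow scheduler, discussed next) would call `ingest()` once per day with yesterday's date.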
You need some kind of workflow infrastructure to control the execution of your scripts. You may be tempted to write your own job scheduler and process control; instead, take a hard look at Azkaban or Oozie. Setting up a workflow that runs the ingestion, creates the Hive partitions, and runs the analysis jobs can be done with either framework. We are currently investigating both.
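As a rough sketch, chaining the three steps in Azkaban takes only a few small job files; the script names below are hypothetical placeholders for your own scripts.

```properties
# ingest.job -- copy raw data into Hadoop
type=command
command=python ingest_logs.py

# partition.job -- validate data and create Hive partitions
type=command
command=python create_partitions.py
dependencies=ingest

# analyze.job -- run the analytics queries
type=command
command=python run_daily_queries.py
dependencies=partition
```

Azkaban uses the `dependencies` property to order the steps, so the partition job only runs after a successful ingest; Oozie expresses the same flow as an XML workflow definition.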
Once the data is partitioned into Hive tables, it is available for your analysts to do ad hoc investigations, and they can leverage their SQL skills in HiveQL. If an insight is identified and a set of queries needs to be run regularly to help run the business, it is easy to create a simple script that extracts the data at regular intervals and add it into your daily processing.
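Promoting an ad hoc query into a scheduled extract might look like the following; the tables and columns (`web_logs`, `daily_sessions`, `user_id`, `session_id`) are hypothetical examples standing in for your own schema.

```python
"""Sketch of turning an analyst's HiveQL query into a recurring extract
that rolls one day's partition into a summary table. All table and
column names are hypothetical."""
import subprocess


def daily_sessions_query(year, month, day):
    """HiveQL that summarizes one day's log partition into a results
    table, ready for reporting or export to the data warehouse."""
    return (
        "INSERT OVERWRITE TABLE daily_sessions "
        "PARTITION (year=%d, month=%d, day=%d) "
        "SELECT user_id, COUNT(DISTINCT session_id) "
        "FROM web_logs "
        "WHERE year=%d AND month=%d AND day=%d "
        "GROUP BY user_id"
        % (year, month, day, year, month, day)
    )


def run_daily_extract(year, month, day):
    """Execute the query via the Hive CLI; schedule from cron or a
    workflow tool such as Azkaban or Oozie."""
    subprocess.check_call(["hive", "-e", daily_sessions_query(year, month, day)])
```

Because the query filters on the partition columns, Hive reads only that day's data rather than scanning the whole table.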
We are finding that scripting and Hive are pretty efficient ways to ingest and process our unstructured data.
About Bill Yetman
Bill Yetman has served as VP of Engineering at Ancestry.com since January 2014. Bill has held multiple positions with Ancestry.com from August 2002, including Senior Director of Engineering, Director of Sites, Mobile and APIs, Director of Ad Operations and Ad Sales, Senior Software Manager of eCommerce and Senior Software Developer. Prior to joining Ancestry.com, he held several developer and programmer roles with Coresoft Technologies, Inc., Novell/Word Perfect, Fujitsu Systems of America and NCR. Mr. Yetman holds a B.S. in Computer Science and a B.A. in Psychology from San Diego State University.