Hive is good at querying immutable data such as log files, which do not change after they are written. But what if you want to query data that changes? For example, users of our site frequently make modifications to their family trees. Some of this data sits in very large, frequently updated transactional tables in our databases, with records that are inserted, updated, and sometimes deleted. We would like to query this data in order to see how our users interact with the site. So how do you maintain mutable data in Hive? Below we outline the strategy we use.
First we need to get the data from the database into Hive staging tables. We do this with an ETL (Extract, Transform, and Load) process that exports the SQL transaction logs for each table we want into CSV (Comma-Separated Value) files. These CSV files record every insert, update, or delete operation applied to a given table, and they need three fields in addition to the fields you want to export into Hive:
1. row_id – This field is a unique identifier for each row. It should be an incrementing integer that reflects the order in which the operations were applied to the table.
2. op – This field identifies the update operation: insert, update, or delete. The column can be an enumeration or an integer type (e.g., insert = 1, update = 2, delete = 3).
3. op_timestamp – This field is a timestamp that records when the update operation was applied to the table.
Let’s take a simple example: two tables, UserActivity and ActiveUser, that track active users on the site. Here are two example rows from the UserActivity transaction log:

| row_id | op     | op_timestamp        | user_id | activity_timestamp  | ancestor_ids           |
|--------|--------|---------------------|---------|---------------------|------------------------|
| 1      | Insert | 2013-01-01 01:30:00 | 100     | 2013-01-01 01:00:00 | 1000, 1010, 1011       |
| 2      | Update | 2013-01-02 13:10:00 | 100     | 2013-01-02 13:00:00 | 1000, 1010, 1011, 1050 |
These tables are used to track the activity of active users. Here we have a user named John Smith, who created a new account on 2013-01-01. He found and added three ancestors to his family tree (IDs 1000, 1010, and 1011), so we create a new insert record for this user in UserActivity and ActiveUser. John logged in the following day and found another ancestor, with ID 1050, so he added it to his family tree; we record this as a new row with an update operation and the updated list of ancestor IDs: 1000, 1010, 1011, and 1050. Now suppose a user leaves the site and doesn’t come back for a year. The system looks for these inactive accounts and deletes them, and that delete is recorded by row_id 2 in the ActiveUser table.
So we need to create the Hive staging tables. For example:
CREATE TABLE IF NOT EXISTS UserActivityStg (
row_id INT,
op STRING,
op_timestamp STRING
-- ...plus the columns exported from the source table (user_id, ancestor IDs, etc.)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/hive_stage/UserActivityStg/';
CREATE TABLE IF NOT EXISTS UserStg (
row_id INT,
op STRING,
op_timestamp STRING
-- ...plus the columns exported from the source User table
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/hive_stage/UserStg/';
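The CSV files can then simply be copied into the staging directories specified in the CREATE statements above; no separate load step is required. Here is a minimal sketch of that copy using the Hadoop FileSystem API (the local export path and file name are hypothetical, stand-ins for whatever the ETL process produces):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageCsvFiles {
  public static void main(String[] args) throws Exception {
    // The default FileSystem comes from the Hadoop configuration on the classpath (HDFS in our case).
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical local export file produced by the ETL process.
    Path localCsv = new Path("/data/export/UserActivity_20130102.csv");

    // Staging directory declared as the LOCATION of UserActivityStg above.
    fs.copyFromLocalFile(localCsv, new Path("/hive_stage/UserActivityStg/"));
    fs.close();
  }
}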
Analysts may want to query the historical usage and update patterns of these tables, and that history can get quite large, so we store the full transaction logs in Hive tables rather than in a database. But the staging tables alone don’t capture the current state of the data. For that, we have a Java process that reads from the Hive staging tables through a JDBC connector and applies each transaction operation to a Hive table that is backed by HBase. This is an example of how these tables would be created within Hive:
CREATE TABLE UserActivity (
key STRING, row_id INT, op STRING, op_timestamp STRING,
user_id INT, first_name STRING, last_name STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:row_id,cf1:op,cf1:op_timestamp,cf1:user_id,cf1:first_name,cf1:last_name")
TBLPROPERTIES ("hbase.table.name" = "UserActivity");
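The apply step itself is easy to sketch. The snippet below is a minimal, illustrative version of that Java process: it reads the staged operations over Hive JDBC in row_id order and replays each one against the underlying HBase table using the HBase client API of the time. The JDBC URL, the use of user_id as the HBase row key, and the assumption that the staging table carries the user_id/first_name/last_name columns are all illustrative, not our exact production code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ApplyUserActivityOps {
  public static void main(String[] args) throws Exception {
    // Read the staged operations in the order they were applied to the source table.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection hive = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default", "", "");
    Statement stmt = hive.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT row_id, op, user_id, op_timestamp, first_name, last_name " +
        "FROM UserActivityStg ORDER BY row_id");

    // The HBase table behind the UserActivity Hive table defined above.
    HTable table = new HTable(HBaseConfiguration.create(), "UserActivity");
    byte[] cf = Bytes.toBytes("cf1");

    while (rs.next()) {
      byte[] key = Bytes.toBytes(rs.getString(3));   // user_id as the HBase row key (assumption)
      String op = rs.getString(2);                   // 'Insert', 'Update', or 'Delete'

      if ("delete".equalsIgnoreCase(op)) {
        table.delete(new Delete(key));               // remove the row entirely
      } else {
        // Inserts and updates are both puts; HBase keeps the most recent value per column.
        Put put = new Put(key);
        put.add(cf, Bytes.toBytes("row_id"), Bytes.toBytes(rs.getString(1)));
        put.add(cf, Bytes.toBytes("op"), Bytes.toBytes(op));
        put.add(cf, Bytes.toBytes("op_timestamp"), Bytes.toBytes(rs.getString(4)));
        put.add(cf, Bytes.toBytes("user_id"), Bytes.toBytes(rs.getString(3)));
        put.add(cf, Bytes.toBytes("first_name"), Bytes.toBytes(rs.getString(5)));
        put.add(cf, Bytes.toBytes("last_name"), Bytes.toBytes(rs.getString(6)));
        table.put(put);
      }
    }
    table.close();
    hive.close();
  }
}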
Applying the inserts, updates, and deletes directly through the HBase Java API while mapping Hive tables on top of HBase gives us the best of both worlds: we can mutate the data through HBase and still query it, joins included, through Hive. So everything should be great and our problems solved, right? Not quite. Although we were able to run join queries against these HBase-backed tables, they were four to five times slower than the same queries against regular, file-backed Hive tables, and query speed decreased further as more columns were added.
So we are exploring different options to get everything working at an acceptable speed. One option is to create another Hive table backed by regular files and schedule a job that simply copies the data out of the HBase-backed table (sketched below). This works for smaller data sets, but it is still too slow for the larger ones, so we are also tuning HBase and Hive to see whether we can optimize the process.
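For reference, the copy itself is a single Hive statement. A minimal sketch of that scheduled job, again over Hive JDBC, assuming a file-backed table named UserActivitySnapshot with the same columns already exists (the table name and JDBC URL are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SnapshotUserActivity {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection hive = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default", "", "");
    Statement stmt = hive.createStatement();

    // Rebuild the file-backed copy from the current state of the HBase-backed table,
    // so joins can run against ordinary Hive files instead of HBase.
    stmt.execute("INSERT OVERWRITE TABLE UserActivitySnapshot SELECT * FROM UserActivity");

    hive.close();
  }
}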
If you have any experience querying and maintaining mutable data, we would love to hear from you.