Posted by on August 7, 2013 in Big Data

A Quick and Efficient Way to Update Hive Tables Using Partitions

In my previous post, I outlined a strategy to update mutable data in Hadoop by using Hive on top of HBase. In this post, I will outline another strategy to update data in Hive. Instead of using a backend system to update data like HBase, it may be better to simply overwrite the data with the new values. But updating Big Data can be very expensive if you have to scan all the data. One way to mitigate this is to use partitions in Hive to limit the amount of data that is processed. Hive treats partitions as directories and this allows us to limit the amount of data that we have to scan. For example, let’s say we have a very large table that logs customers’ actions on the site. It is created with the following hive schema:

CREATE TABLE IF NOT EXISTS CustomerLog (
uid string comment ‘User id’,
action_id int comment ‘id to identify some user action’,
action_date timestamp comment ‘Timestamp of action operation’,
description string comment ‘Description of user action’,
)
PARTITIONED BY (dateString string);

The CustomerLog table has these columns:
1. uid – Stores user id.
2. action_id – An integer value to identify some sort of action on the site, for example 1 = sign up, 2 = buy, 3 = cancel.
3. action_date – Timestamp value of when action occured.
4. description – Detailed description of the action, for example “Product was added to cart”.
5. dateString – This is a special column that is used for partitioning. It is a string representation of the action_date that is in the format yyyy-mm-dd.

You might wonder why you would have 2 columns to store the same information. Furthermore, dateString only stores down to the day whereas timestamp can store down to the millisecond. We use dateString because it is a good way to partition our data. I use this example because I find that most data will have a time component, which provides an easy way to filter data. Hive even has a built-in UDF (user defined function) called to_date() to convert timestamps to the date string format I outlined earlier. You can use it in a select statement. Let’s say we have another table called CustomerLogSource, which contains the raw logs. We can then use this query: SELECT uid, action_date, description, to_date(action_date) FROM CustomerLogSource. This will convert a timestamp ex: “2013-01-01 01:00:00″ to “2013-01-01″. This allows you to perform simple filter operations dateString ex: SELECT * from CustomerLog WHERE dateString >= ’2012-01-01′ AND dateString <= ’2012-01-10′. Filter operations on timestamps in Hive is more complicated because you have to use UDF’s rather than the familiar <,>.= operators.

Now that we have an easy way to filter and limit the amount of data that we have to process, we can use this as way to quickly and efficiently update data in large tables within Hive. Let’s say that our CustomerLog table has millions of rows that spans over decades of logs. However, we only want to update the last 10 days worth of logs. Without partitioning, we would have to scan all of our record data. But with partitioning, we can limit the number of records to scan because Hive will only have to scan the data in the directories within the partition range that we specify. In this example, Hive would scan the data within these directories:
dateString=’2013-01-01′,dateString=’2013-01-02′,dateString=’2013-01-03′ … dateString=’2013-01-10′.

Here is a sample Hive query that will update the CustomerLog data, but only in the folders within the partition range of 2013-01-01 to 2013-01-10:

INSERT OVERWRITE TABLE CustomerLog PARTITION (dateString)
SELECT uid, action_id, action_date, description, to_date(action_date) AS dateString
FROM CustomerLogSource
WHERE dateString >= ’2013-01-01′ AND dateString <= ’2013-01-10′

In conclusion, we have outlined a method to quickly and efficiently update data in Hive tables by partitioning the data and filtering on string fields with dates. By partitioning the data into smaller chunks, we limit the amount of data that we have to scan and this makes our queries quicker and more efficient. In my next tech blog, I’ll discuss how we ingest comma separated value (CSV) files into Hive tables.

About Xuyen On

Xuyen On is a Senior Software Engineer at Ancestry.com who works in the Data Services Team where he is building out a new infrastructure to collect Big Data and make it available to company.


We really do appreciate your feedback, and ask that you please be respectful to other commenters and authors. Any abusive comments may be moderated.

Commenting is open until Wednesday, 21 August 2013