I decided to write this blog post to help people who are working with Big Data and Hadoop and would benefit from my experience. I always learn more from mistakes, and I have lots of scars to prove that point. Even so, this post is a bit painful to write.
As you start working with Hadoop, it can be overwhelming to see the infrastructure that needs to be written in order to continually ingest data, validate it, partition it for Hive, and handle various failure modes. What is obvious now, given 20-20 hindsight, is that focusing on this infrastructure first will keep you from delivering what the business needs: value from analyzing the data. With that in mind, here are my suggestions to avoid making this mistake:
- Focus on getting the data into your cluster in the simplest way possible. Write simple, manually executed scripts that ingest a representative set of that data. Set up very simple Hive partitions.
- Get this initial “test data” in the hands of an analyst to see if it has value to the business. Can they derive insights from the test data you’ve collected?
- Once an analyst verifies that the data is correct and has value (if the data has no value, delete it, and move to something else), use the scripts you wrote to set up a simple, repeatable process to ingest the data at regular intervals. This can be as simple as running them by hand once a day or once a week – whatever makes sense. Get the data flowing and start adding business value!
- Now it is time to work on the automated ingestion of the data. Even this should be done in steps. Automate the process with a minimum amount of error handling. If errors occur, delete the last data set and rerun the process. Keep it simple.
- The final step is to make the ingestion process fault tolerant and able to handle errors automatically. This “process hardening” step should be the last item you do. Once you have one full ingestion pipeline in place, there will probably be pieces of it you can enhance and reuse.
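To make the "keep it simple" stage concrete, here is a minimal sketch of an ingestion script with only the bare-minimum error handling described above: if anything goes wrong, delete the last data set and rerun. It is an illustration under stated assumptions, not the process we actually ran. A local directory stands in for HDFS, and the function names (`partition_dir`, `ingest_day`) and the `dt=` partition key are hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path


def partition_dir(warehouse: Path, table: str, day: date) -> Path:
    """Build a Hive-style partition directory, e.g. <warehouse>/events/dt=2014-01-31.

    The dt=<date> layout is an assumed naming convention for this sketch.
    """
    return warehouse / table / f"dt={day.isoformat()}"


def ingest_day(source_file: Path, warehouse: Path, table: str, day: date) -> Path:
    """Copy one day's file into its partition directory.

    Error handling is deliberately minimal: on failure, delete the
    partially written data set and rerun once, then give up.
    """
    target = partition_dir(warehouse, table, day)
    for attempt in (1, 2):
        try:
            # Start from a clean partition so a rerun never mixes old and new data.
            shutil.rmtree(target, ignore_errors=True)
            target.mkdir(parents=True)
            shutil.copy2(source_file, target / source_file.name)
            return target
        except OSError:
            # Delete the last data set; retry once, then surface the error.
            shutil.rmtree(target, ignore_errors=True)
            if attempt == 2:
                raise
    return target  # not reached; the loop always returns or raises
```

Because each run wipes and rewrites the whole partition, running the script twice for the same day is harmless. That property is what makes "delete and rerun" a workable recovery strategy before any real hardening is done.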
The focus has to be on providing value to the business, especially for the first few projects you take on. Since the data provides the value, make sure the initial steps of any project focus on getting that data into your cluster as quickly and simply as possible. Remember, you and your team will be green and learning. Projects will take longer than you expect. Easy wins on early projects will lead to long-term success.
About Bill Yetman
Bill Yetman has served as Senior Director of Engineering at Ancestry.com since January 2011. Bill has held multiple positions with Ancestry.com since August 2002, including Senior Director of Engineering, Director of Sites, Mobile and APIs, Director of Ad Operations and Ad Sales, Senior Software Manager of eCommerce and Senior Software Developer. Prior to joining Ancestry.com, he held several developer and programmer roles with Coresoft Technologies, Inc., Novell/WordPerfect, Fujitsu Systems of America and NCR. Mr. Yetman holds a B.S. in Computer Science and a B.A. in Psychology from San Diego State University.