A little over 8 months ago, I was asked to build a data mining cluster at Ancestry using Hadoop. Even though Ancestry has been using Hadoop for nearly 3 years, this was my first exposure to the technology and the company’s first attempt to collect all of its data in one place. Honestly, I did not know where or how to start. This blog post represents some of what I have learned. I hope it will guide you if you have been entrusted with a similar assignment. Here are eight key insights that will get you moving in the right direction.
- Hadoop is a young technology that is still evolving rapidly. Do not underestimate it; the learning curve is steep. Find and hire a great Hadoop administrator.
- There is an ecosystem of open source projects under the Hadoop umbrella: Hive, HBase, Mahout, Kafka, ZooKeeper, Pig, and more. Understand your needs and pick the projects that will work for you. We started with Hive.
- The Enterprise Data Warehouse (EDW) and Big Data teams need to be in the same organization. This will increase the synergy and cooperation that both teams need to succeed.
- Put all your data in one production cluster. This cluster should be the reservoir for everything. Data is heavy; store it in one place and bring the processing to the data.
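To make "bring the processing to the data" concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer ship to the nodes that hold the data instead of the data moving to them. The function names, sample input, and local shuffle are my own illustration, not code from Ancestry's cluster; in a real Streaming job the mapper and reducer would be separate executables reading stdin.

```python
#!/usr/bin/env python
# Word count in the Hadoop Streaming style. In a real job these two
# functions would be standalone scripts passed to hadoop-streaming
# (-mapper / -reducer); here they are plain functions so the logic can
# be exercised locally.

from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (word, 1) pairs, one per word, as a Streaming mapper would."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Sum counts per word. Assumes pairs arrive sorted by key,
    which Hadoop's shuffle/sort phase guarantees."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    shuffled = sorted(mapper(sample))  # stands in for Hadoop's shuffle/sort
    for word, total in reducer(shuffled):
        print(f"{word}\t{total}")
```

The point of the pattern is that only the small summed counts leave each node, not the raw text, which is exactly why heavy data should live where the processing runs.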
- Use a free distribution of Hadoop. Cloudera, MapR, and Hortonworks all offer them, and these distributions are well tested and validated: specific versions of projects such as Pig or Hive are certified against each distribution. All three vendors also sell licensed versions of their software. Start with the free distributions and learn.
- If you can’t find them, train them. Hiring Hadoop engineers is a long, hard process. These developers are in high demand and are highly compensated. Try looking for people within your organization; software developers with data and ETL experience seem to adapt quickly. Training can be as simple as telling them to read and experiment. It will take time, but you need to build Hadoop expertise within your organization.
- Hiring data scientists is even more difficult than finding Hadoop engineers. Once again, look for people within the organization with the intelligence, math skills, and curiosity to be data scientists.
- Start small. Identify pain points in your current EDW and attack those first. Ancestry has a well-established EDW that contains some very large data sets used to track customer behavior. Forcing that data into relational tables is a bit like pounding a square peg (unstructured data) into a round hole (structured SQL tables). We are moving those data sets into Hadoop, aggregating them, and sending the results to the EDW.
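The "aggregate in Hadoop, load the results into the EDW" pattern above can be sketched as follows. The event fields and the rollup are hypothetical examples of my own, but the pattern is the one described: collapse raw behavioral events into small daily summaries, and only those summaries flow into the warehouse. In production this logic would run inside the cluster (for example as a Hive query or MapReduce job), not as local Python.

```python
# Sketch of the aggregation step: reduce raw behavioral events to daily
# rollups sized for a warehouse table. Field names and sample events are
# hypothetical, chosen only to illustrate the shape of the transform.

from collections import Counter

def daily_rollup(events):
    """Count events per (date, event_type). The result maps naturally
    onto a narrow fact table in the EDW."""
    counts = Counter()
    for event in events:
        day = event["timestamp"][:10]  # ISO date prefix, e.g. "2014-05-01"
        counts[(day, event["type"])] += 1
    return dict(counts)

raw_events = [
    {"timestamp": "2014-05-01T09:13:00", "type": "search"},
    {"timestamp": "2014-05-01T09:14:10", "type": "record_view"},
    {"timestamp": "2014-05-01T10:02:33", "type": "search"},
    {"timestamp": "2014-05-02T08:45:12", "type": "search"},
]

rollup = daily_rollup(raw_events)
# Each line below is the kind of row that would be exported to the EDW:
for (day, event_type), count in sorted(rollup.items()):
    print(f"{day}\t{event_type}\t{count}")
```

The square peg stays in Hadoop; only the round, structured summary reaches the SQL tables.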
Remember, each company’s data, processes, and problems are different. What works for one company may not work for yours. Have well-defined long-term goals, but execute incrementally, pivot, and adjust. Be ready for a wild ride. I plan to keep you informed as my journey with Big Data at Ancestry continues.