Adventures in Big Data: How do you start?

Posted by Ancestry Team on April 20, 2013 in Big Data

A little over 8 months ago, I was asked to build a data mining cluster at Ancestry using Hadoop. Even though Ancestry has been using Hadoop for nearly 3 years, this was my first exposure to the technology and the company’s initial attempt to collect everything. Honestly, I did not know where or how to start. This blog post represents some of what I have learned. I hope it will guide you if you have been entrusted with a similar assignment. Here are eight key insights that will get you moving in the right direction.

Hadoop is a young technology that is still being developed. Do not underestimate it. The learning curve is steep. Find and hire a great Hadoop administrator.
There is an ecosystem of open-source projects under the Hadoop umbrella: Hive, HBase, Mahout, Kafka, ZooKeeper, Pig, and more. Understand your needs and pick the projects that will work for you. We started with Hive.
The Enterprise Data Warehouse (EDW) and Big Data teams need to be in the same organization. This will increase the synergy and cooperation that both teams need to succeed.
Put all your data in one production cluster. This cluster should be the reservoir for everything. Data is heavy: put it in one place and bring the processing to the data.
Use a free distribution of Hadoop, such as Cloudera, MapR, or Hortonworks. These distributions are well tested and validated, and each one certifies specific versions of projects such as Pig or Hive. All three vendors sell licensed versions of their software. Use the free distributions and learn.
If you can’t find them – train them. Hiring Hadoop engineers is a long, hard process. These developers are in high demand and are highly compensated. Try looking for people within your organization. Software developers with data and ETL experience seem to adapt quickly. Training can be as simple as telling them to read and experiment. It will take time, but you need to build Hadoop expertise within your organization.
Hiring data scientists is even more difficult than finding Hadoop engineers. Once again, look for people within the organization with the intelligence, math skills, and curiosity to be data scientists.
Start small. Identify pain points in your current EDW and attack those first. Ancestry has a well-established EDW that contains some very large data sets used to track customer behavior. Keeping those data sets in the EDW is a bit like pounding a square peg (unstructured data) into a round hole (structured SQL tables). We are moving those data sets into Hadoop, aggregating them, and sending the results to the EDW.
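
As a rough illustration of the move, aggregate, and send pattern in the last item, here is a minimal Python sketch. It is not Ancestry's pipeline: the tab-separated event format, the field names, and the function names are all hypothetical, and a real job would run on the cluster (for example as a Hive query) rather than in a single process. It only shows the shape of the idea: raw behavior events go in, and a small summary table suitable for a SQL warehouse comes out.

```python
from collections import Counter

# Hypothetical raw event lines: "user_id<TAB>date<TAB>action".
RAW_EVENTS = [
    "u1\t2013-04-01\tsearch",
    "u1\t2013-04-01\tsearch",
    "u2\t2013-04-01\tview_record",
    "u1\t2013-04-02\tview_record",
]


def map_event(line):
    """Map step: turn one raw line into a ((user, date, action), 1) pair."""
    user_id, date, action = line.split("\t")
    return (user_id, date, action), 1


def aggregate(lines):
    """Reduce step: sum the counts for each (user, date, action) key."""
    totals = Counter()
    for line in lines:
        key, count = map_event(line)
        totals[key] += count
    return totals


def to_edw_rows(totals):
    """Flatten the totals into sorted rows, ready to load into a SQL table."""
    return [(u, d, a, n) for (u, d, a), n in sorted(totals.items())]


if __name__ == "__main__":
    for row in to_edw_rows(aggregate(RAW_EVENTS)):
        print(row)
```

The point of the split is that the bulky raw events stay in the cluster, and only the much smaller aggregated rows travel to the EDW.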

Remember, each company’s data, processes, and problems are different. What works for one company may not work for yours. Have well-defined long-term goals, but execute incrementally, pivot, and adjust. Be ready for a wild ride. I plan to keep you informed as my journey with Big Data at Ancestry continues.

-Bill-

Comments

Clarence

July 30, 2013 at 7:40 pm

Thanks for the pointers. I am a WebFOCUS Data Architect, Business Analyst, Certified SAP Consultant, and also a certified Systems Engineering Business Analysis professional. I am the kind of person who loves processes and methodologies and thinking outside the box, and strong problem-solving skills come naturally to me. I am thinking about becoming a Data Scientist to further satisfy this uncontrollable thirst for knowing how to do everything. I am looking to acquire some advanced statistics skills and more knowledge of Big Data principles. Do you have any recommendations for training schools where I can become a certified Data Scientist?

Rajendra

August 3, 2013 at 5:53 pm

Very useful, thank you.
