Tech Roots » Big Data | Ancestry.com Tech Roots Blog

Big Data for Developers at Ancestry
Seng Lin Shee | September 25, 2014

Big Data has been all the rage. Business, marketing and project managers like it because they can plot out trends to make decisions. To us developers, Big Data is just a bunch of logs. In this blog post, I would like to point out that Big Data (or logs with context) can be leveraged by development teams to understand how our APIs are used.

Developers have implemented logging for a very long time. There are transaction logs, error logs, access logs and more. So, how has logging changed today? Big Data is not all that different from logging. In fact, I would consider Big Data logs to be logs with context. Context allows you to do interesting things with the data. Now, we can correlate user activity with what's happening in the system.

A Different Type of Log

So, what are logs? Logs are records of events, frequently created by applications with very little user interaction. It goes without saying that many of them are transaction logs or error logs.

However, there is a difference between forensic and business logs. Big Data is normally associated with the events, actions and behaviors of users when using the system. Examples include records of purchases, which are linked to a user profile and span across time. We call these business logs. Data and business analysts would love to get hold of this data, run some machine learning algorithms, and finally predict the outcome of a certain decision to improve the user experience.

Now back to the developer. How does Big Data help us? On our end, we can utilize forensic logs. Logs get more interesting and helpful if we can combine records from multiple sources. Imagine hooking in and correlating IIS logs, method logs and performance counters together.

Big Data for Monitoring and Forensics

I would like to advocate that Big Data can and should be leveraged by web service developers to:

  1. Better understand the system and improve the performance of critical paths
  2. Investigate failure trends that might lead to errors or exacerbate current issues

Logs can include:

  1. Method calls (including the context of the call – user login, IP address, parameter values, return values, etc.; see the sample entry after this list)
  2. Execution time of each method
  3. Chain of calls (e.g., method names, server names, etc.)
    This can be used to trace where method calls originate
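To make "logs with context" concrete, here is a minimal sketch of what a single structured entry might look like, serialized with Jackson. The MethodCallLogger class and its field names are illustrative assumptions, not our production logging schema:

import java.util.LinkedHashMap;
import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;

public class MethodCallLogger {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Builds one "log with context" entry as a JSON string.
    public static String logEntry(String method, String userLogin, String ipAddress,
                                  long elapsedMs, String callChain) throws Exception {
        Map<String, Object> event = new LinkedHashMap<String, Object>();
        event.put("timestamp", System.currentTimeMillis()); // when the call happened
        event.put("method", method);                        // which API method was called
        event.put("userLogin", userLogin);                  // who called it
        event.put("ipAddress", ipAddress);                  // where the call came from
        event.put("elapsedMs", elapsedMs);                  // execution time of the method
        event.put("callChain", callChain);                  // e.g. "ExternalAPI>ServiceA>ServiceB"
        return MAPPER.writeValueAsString(event);
    }
}

An entry like this can be forwarded to the logging infrastructure and later correlated with IIS logs and performance counters by user, IP address, or call chain.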

With the various data being logged for every single call, it is important that the logging system is able to hold and process a huge volume of data. Big Data has to be handled on a whole different scale. The screenshots below are charts from Kibana. Please refer here to find out how to set up data collection and a dashboard display using this suite of open source tools.

Example Usage

Depending on what kind of monitoring is required, the relevant information (e.g., context, method latency, class/method names) should be included in the Big Data logs.

Detecting Problematic Dependencies

Plotting the time spent in classes of incoming and outgoing components gives us visibility into the proportion of time spent in each layer of the service. The plot below revealed that the service was spending more and more time in a particular component, thus warranting an investigation.

Time in Classes

Discovering Faulty Queries

Logging all exceptions, together with the appropriate error messages and details, allows developers to determine the circumstances under which a method fails. The plot below shows that MySQL exceptions started occurring at 17:30. Because the team includes query parameters in the logs, we were able to determine that invalid queries (typos and syntax errors) were being used.

Exceptions
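As a hedged illustration of including parameters alongside exceptions (this sketch uses SLF4J; the class and parameter names are placeholders rather than our actual code), logging the exact statement and the caller together with the stack trace is what makes this kind of diagnosis possible:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QueryRunner {
    private static final Logger LOG = LoggerFactory.getLogger(QueryRunner.class);

    public void runQuery(String sql, String userLogin) {
        try {
            executeQuery(sql);
        } catch (Exception ex) {
            // Log the context (user and the exact query) together with the stack trace,
            // so a spike of MySQL exceptions can be traced back to the offending statements.
            LOG.error("Query failed for user={} sql={}", userLogin, sql, ex);
            throw new RuntimeException(ex);
        }
    }

    private void executeQuery(String sql) throws Exception {
        // database call omitted in this sketch
    }
}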

Determine Traffic Pattern

Tapping into the IP addresses of incoming requests reveals very interesting traffic patterns. In the example below, the graph indicates a spike in traffic. However, upon closer inspection, the graph shows that the spike spanned ALL countries. This suggests that the spike in traffic was not due to user behavior, which led us to investigate other possible causes (e.g., DoS attacks, simultaneous updates of mobile apps, errors in logs, etc.). In this case, we found it was a false positive: repeated reads by log forwarders within the logging infrastructure.

Country Traffic With Indicator

Determine Faulty Dependents (as opposed to dependencies)

Big Data log generation can be enhanced to include IDs that track the chain of service calls from clients through the various services in the system. The first column below indicates that traffic from the iOS mobile app passes through the External API gateway before reaching our service. Other columns indicate different flows, giving developers enough information to detect and isolate problems in different systems if needed.

Event Flows
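One common way to capture such a chain is a correlation ID that is minted at the first hop and forwarded by every service. The sketch below assumes a servlet-based service and SLF4J's MDC; the X-Correlation-Id header name is an illustrative choice, not necessarily what our gateway uses:

import java.io.IOException;
import java.util.UUID;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class CorrelationIdFilter implements Filter {
    private static final String HEADER = "X-Correlation-Id"; // illustrative header name

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String id = ((HttpServletRequest) req).getHeader(HEADER);
        if (id == null || id.isEmpty()) {
            id = UUID.randomUUID().toString(); // the first hop in the chain mints the ID
        }
        MDC.put("correlationId", id); // every log line for this request now carries the ID
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId");
        }
    }

    public void init(FilterConfig config) {
    }

    public void destroy() {
    }
}

Each service forwards the same header on its outgoing calls, which is what allows the flows above to be grouped end to end.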

Tracking Progression Through Various Services

Ancestry.com has implemented a Big Data framework across all services to support call tracking across different services. This helps developers (who are knowledgeable about the underlying architecture) to debug whenever a scenario doesn't work as expected. The graph below depicts different methods being exercised across different services, where each color refers to a single scenario. Such data provides full visibility into the interactions among different services across the organization.

Test Tracking

Summary

Forensic logs can be harnessed with Big Data tools and frameworks to greatly improve the effectiveness of development teams. By combining various views (such as the examples above) into a single dashboard, we are able to provide developers with a health snapshot of the system at any time, in order to determine failures or to improve architectural designs.

By leveraging Big Data for forensic logging, we, as developers, are able to determine faults and reproduce error messages without conventional debugging tools. We have full visibility into the various processes in the system (assuming we have sufficient logs). Gone are the days when we needed to instrument code on live boxes because an issue only occurred in the live environment.

All of this work is done independently of the business analysts and is, in fact, crucial to the team's agility in reacting quickly to issues and continuously improving the system.

Do your developers use Big Data as part of daily development and maintenance of web services? What would you add to increase visibility in the system and to reduce bug-detection time?

The Importance of Context in Resolving Ambiguous Place Data
Laryn Brown | July 10, 2014

When interpreting historical documents to research your ancestors, you are often presented with less-than-perfect data. Many of the records that are the backbone of family history research are bureaucratic scraps of paper filled out decades ago in some government building. We should hardly be surprised when the data entered is vague, confusing, or just plain sloppy.

Take, for example, a census form from the 1940s. One of the columns of information is the place of birth of each individual in the household. Given no other context, these entries can be extremely vague and, in some cases, completely meaningless to the modern generation.

Here are some examples:

  • Prussia
  • Bohemia
  • Indian Territory

Additionally, there are entries that on their face seem clear, but with more context take on new complexity:

  • Boston (England)
  • Paris (Idaho)
  • Provo (Bosnia)

And finally, we have entries that are terrifically vague and cannot be resolved without more context:

  • Springfield
  • Washington
  • Lincoln

If we add the complexity of automatic place parsing, where we try to infer meaning from the data and normalize it to a common form that we can search on, the challenges grow.

In the above example, if I feed “Springfield” into our place authority, which is a tool that normalizes different forms of place names to a single ID, I get 63 possible matches in a half dozen countries. This is not that helpful. I can’t put 63 different pins on a map, or try to match 63 different permutations to create a good DNA or record hint.

I need more context to narrow down the field to the one Springfield that represents the intent of that census clerk a hundred years ago.

One rather blunt approach is to sort the list by population. Statistically, more people will be from a larger city named Springfield than from a smaller one. But this has all sorts of flaws, such as excluding rural places from ever being legitimate matches. If you happen to be from Paris, Idaho, we are never going to find your record.

Another approach would be to implement a bunch of logical rules, where for the case of a name that matches a U.S. state we would say things like “Choose the largest jurisdiction for things that are both states and cities.” So “Tennessee” must mean the state of Tennessee, not the five cities in the U.S. that share the same name. Even if you like those results, there are always going to be exceptions that break the rule and require a second rule – such as the state of Georgia and the country of Georgia. The new rule would have to say “Choose the largest jurisdiction for things that are both states and cities, but don’t choose Georgia as a country because it is really a state.”

It is clear that a rules-based approach will not work. But since we still need to resolve ambiguity, how is it to be done?

I propose a blended strategy that takes three approaches.

  1. Get context from wherever you can to limit the number of possibilities. If the birth location for Grandpa is Springfield and the record set you are studying is the Record of Births from Illinois, then the additional context may give you enough data to make a conclusion that Springfield=Springfield, Illinois, USA. What seems obvious to a human observer is actually pretty hard with automated systems. These systems need to learn where to find this additional context and Natural Language parsers or other systems need to be fed more context from the source to facilitate a good parse.
  2. Preserve all unresolved ambiguity. If the string I am parsing is “Provo” and my authority has a Provo in Utah, South Dakota, Kentucky, and Bosnia, I should save all of these as potential normalized representations of “Provo.” It is a smaller set to match on when doing comparisons, and you may get help later on to pick the correct city (see the sketch after this list).
  3. Get a human to help you. We are all familiar with applications and websites that give us that friendly “Did you mean…” dialogue. This approach lets a user, who may have more context, choose the “Provo” that they believe is right. We can get into a lot of trouble by trying to guess what is best for the customer instead of presenting a choice to them. Maybe Paris, Idaho is the Paris they want, maybe not. But let them choose for you.
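To make the first two approaches concrete, here is a rough sketch of a resolver that narrows candidates using whatever context is available and otherwise preserves the full ambiguous set. PlaceAuthority and PlaceCandidate are illustrative stand-ins, not our actual place authority API:

import java.util.ArrayList;
import java.util.List;

public class PlaceResolver {
    // Illustrative stand-ins for the real place authority service.
    interface PlaceAuthority { List<PlaceCandidate> lookup(String name); }
    interface PlaceCandidate { boolean isWithin(String region); }

    private final PlaceAuthority authority;

    public PlaceResolver(PlaceAuthority authority) {
        this.authority = authority;
    }

    // Approach 1: use available context (e.g. the record set's state) to narrow the list.
    // Approach 2: if more than one candidate survives, keep them all rather than guessing.
    public List<PlaceCandidate> resolve(String rawPlace, String contextRegion) {
        List<PlaceCandidate> candidates = authority.lookup(rawPlace); // e.g. 63 Springfields
        List<PlaceCandidate> narrowed = new ArrayList<PlaceCandidate>();
        for (PlaceCandidate candidate : candidates) {
            if (contextRegion == null || candidate.isWithin(contextRegion)) {
                narrowed.add(candidate);
            }
        }
        // If the context eliminated everything, fall back to the full ambiguous set
        // and let a human (approach 3) make the final choice.
        return narrowed.isEmpty() ? candidates : narrowed;
    }
}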

In summary, context is the key to resolving ambiguity when parsing data, especially ambiguous place names. Using a blended approach that makes use of all available context, preserves any remaining ambiguity, and presents those ambiguous results to the user for resolution seems like the most successful strategy for solving the problem.

Lessons Learned Building a Messaging Framework
Xuyen On | July 1, 2014

We have built out an initial logging framework with Kafka 0.7.2, a messaging system developed at LinkedIn. This blog post will go over some of the lessons we’ve learned by building out the framework here at Ancestry.com.

Most of our application servers are Windows-based, and we want to capture IIS logs from these servers. However, Kafka does not include any producers that run on the Microsoft .Net platform. Thankfully, we were able to find an open source project where someone else wrote .Net libraries that can communicate with Kafka. This allowed us to develop our own custom producers to run on our Windows application servers. You may find that you will also need to develop your own custom producers, because every platform is different. You might have applications running on different OSs, or your applications might be written in different languages. The Apache Kafka site lists all the different platforms and programming languages that it supports. We plan on transitioning to Kafka 0.8, but we could not find corresponding library packages like there were for 0.7.

Something to keep in mind when you design your producer is that it should be as lean and efficient as possible. The goal is the highest possible throughput for sending messages to Kafka while keeping the CPU and memory overhead as low as possible, so as not to overload the application server. One design decision we made early on was to have compression in our producers in order to make communication between the producers and Kafka more efficient and faster. We initially used gzip because it was natively supported within Kafka. We achieved very good compression ratios (10:1) and also had the added benefit of saving storage space.

We have two kinds of producers. One runs as a separate service that simply reads log files from a specified directory where all the log files to be sent are stored. This design is well suited for cases when the log data is not time critical, because the data is buffered in log files on the application server. This is useful because if the Kafka cluster becomes unavailable, the data is still saved locally. It’s a good safety measure against network failures and outages. The other kind of producer is coded directly into our applications: the messages are sent to Kafka straight from code. This is good for situations where you want to get the data to Kafka as fast as possible, and it could be interfaced with a component like Samza (another project from LinkedIn) for real-time analysis. However, messages can be lost if the Kafka cluster becomes unavailable, so a failover cluster would be needed to prevent message loss.
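For illustration, a minimal file-reading producer along these lines might look like the sketch below, written against the Kafka 0.8 Java producer API (our actual producers are .NET; the broker list, topic name, and directory handling here are placeholders):

import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class LogDirectoryProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("metadata.broker.list", "kafka01:9092,kafka02:9092"); // placeholder broker list
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("compression.codec", "gzip"); // compress batches before they leave the app server

        Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("/var/logs/iis"))) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    producer.send(new KeyedMessage<String, String>("iis-logs", line));
                }
                // A real producer would mark or archive the file here so it is not sent twice.
            }
        } finally {
            producer.close();
        }
    }
}

Because the log files stay on disk until they are sent, a broker outage only delays delivery instead of losing data, which is the buffering behavior described above.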

To get data out of Kafka and into our Hadoop cluster, we wrote a custom Kafka consumer job that is a Hadoop map application. It is a continuous job that runs every 10-15 minutes. We partitioned our Kafka topics to have 10 partitions per broker. We have 5 Kafka brokers in our cluster that are treated equally, which means that a message can be routed to any broker, as determined by a load balancer. This architecture allows us to scale out horizontally; if we need to add more capacity to our Kafka cluster, we can just add more broker nodes. Conversely, we can take out nodes as needed for maintenance. Having many partitions allows us to scale out more easily because we can increase the number of mappers in our job that read from Kafka.

However, we have found that splitting up the job into too many pieces may result in too many files being generated. In some cases, we were generating a bunch of small files that were smaller than the Hadoop block size, which was set to 128 MB. This problem became evident when a large ingest loaded over 40 million small files into our Hadoop cluster. It caused our NameNode to go down because it was not able to handle the sheer number of file handles within the directory. We had to increase the Java heap size to 16 GB just to be able to do an ls (listing contents) on the directory. Hadoop likes to work with a small number of very large files (they should be much larger than the block size), so you may find that you will need to tweak the number of partitions used for the Kafka topics, as well as how long you want your mapper job to write to those files. Longer map times with fewer partitions will result in fewer and larger files, but it will also take longer before the messages can be queried in Hadoop, and it can limit the scalability of your consumer job since you will have fewer possible mappers to assign to the job.

Another design decision we made was to partition the data within our consumer job. Each mapper creates a new file each time a new partition value is detected, and the topic and partition values are recorded in the filename. We created a separate process that looks in the staging directory in HDFS where the files are generated. This process looks at the file names and determines whether the corresponding table and partitions already exist in Hive. If they do, it simply moves the files into the corresponding directory under the Hive external table directory in HDFS. If a partition does not already exist, it dynamically creates a new one. We also compress the data within the consumer job to save disk space. We initially tried gzip, which gave good compression rates, but it dramatically slowed down our Hive queries due to the processing overhead. We are now trying bzip2, which gives less compression, but our Hive queries are running faster. We chose bzip2 because of its lower processing overhead, but also because it is a splittable format. This means that Hadoop can split a large bz2 file and assign multiple mappers to work on it.
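As a rough sketch of that separate mover process (the paths, table name, and filename convention here are assumptions for illustration), the Hadoop FileSystem API is enough to promote a staged file into its Hive partition directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionFileMover {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path staging = new Path("/data/staging/eventdata");    // assumed staging directory
        Path tableRoot = new Path("/hive/external/eventdata"); // assumed Hive external table root

        for (FileStatus file : fs.listStatus(staging)) {
            String name = file.getPath().getName();
            // Filenames encode the partition, e.g. "eventdata__partitiondate=2014-01-10__...".
            int start = name.indexOf("partitiondate=");
            if (start < 0) {
                continue; // not one of our partitioned files
            }
            int end = name.indexOf("__", start);
            String partition = (end < 0) ? name.substring(start) : name.substring(start, end);

            Path partitionDir = new Path(tableRoot, partition); // .../eventdata/partitiondate=2014-01-10
            fs.mkdirs(partitionDir);                            // creates the partition directory if it is new
            fs.rename(file.getPath(), new Path(partitionDir, name));
        }
        // A real mover would also register any new partition with the Hive metastore
        // (for example via ALTER TABLE ... ADD PARTITION) so that queries can see it.
        fs.close();
    }
}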

That covers a few of the lessons learned thus far as we build out our messaging framework here at Ancestry. I hope you will be able to use some of the information covered here so that you can avoid the pitfalls we encountered.


Adventures in Big Data: Commodity Hardware Blues
Bill Yetman | June 20, 2014

One of the real advantages of a system like Hadoop is that it runs on commodity hardware. This will keep your hardware costs low. But when that hardware fails at an unusually high rate it can really throw a wrench into your plans. This was the case recently when we set up a new cluster to collect our custom log data and experienced a high rate of hard drive failures. Here is what happened in about one week’s time:

  • We set up a new 27-node cluster, installed Hadoop 2.0, got the system up and running, and started loading log files.
  • By Friday (two days later), the cluster was down to 20 functioning nodes as data nodes began to fall out due to hard drive failures. The primary name node had failed over to the secondary name node.
  • By Monday, the cluster was down to 12 nodes and the name node had failed over.
  • On Wednesday, the cluster was down to 6 nodes and we had to shut it down.
  • The failures coincided with the increased data load on the system. As soon as we started ingesting our log data, putting pressure on the hard drives, the failures started.

It makes you wonder what happened during the manufacturing process. Did a forklift drop a pallet of hard drives, and were those drives the ones installed in the machines sent to us? Did the vendor simply skip the quality control steps for this batch of hard drives? Did someone on the assembly line sneeze on the drives? Did sunspots cause this? Over 20% of the hard drives in this cluster had to be replaced in the first three weeks that this system was running. There were three or more nodes failing daily for a while. We started running scripts that looked at the S.M.A.R.T. monitoring information for the hard drives. Any drives that reported failures or predicted failures were identified and replaced. We had to do this proactively on all nodes in the cluster.

One interesting side note about Hadoop: our system never lost data. The HDFS file system check showed that replication had failed, but we had at least one instance of every data block. As we rebuilt the cluster, the data was replicated three times.

What are we doing about this? First, we are having the vendor who is staging our hardware run a set of diagnostics before sending the hardware to us. It is no longer “good enough” to make sure the systems power on. If problems are found, they will swap out the hardware before we receive it. Second, we’ve set minimum failure standards for our hardware and keep track of failures. If we’re seeing too many failures, we work proactively with the vendor on replacement hardware.

One of my Hadoop engineers said this, “If you purchase commodity memory, it may be slower but it runs. If you purchase commodity CPUs, they also run. If you purchase the least expensive commodity hard drives, they will fail.” He’s absolutely right.

If all else fails, get different commodity disk drives (enterprise level).

 

Ancestry.com to Present Jermline on DNA Day at the Global Big Data Conference
Jeremy Pollack | April 9, 2014

Interested in genealogy? Curious about DNA? Fascinated by the world of big data? If so, come check out my talk at the Global Big Data Conference on DNA Day this Friday, April 25, at 4 p.m. PT in the Santa Clara Convention Center! I’ll cover Jermline, our massively scalable DNA matching application. I’ll talk about our business, give a run-through of the matching algorithm, and even throw in a few Game of Thrones jokes. It’ll be fun! Hope to see you there.

 

Update: Thanks to everyone who attended my presentation! You can find the slides on the Ancestry.com Slideshare account for your reference.

Match list

Using Mappers to Read and Partition Large Amounts of Data from Kafka into Hadoop
Xuyen On | April 8, 2014

In my previous posts, I outlined how to import data into Hive tables using Hive scripts and dynamic partitioning. However, we’ve found that this only works for small batch sizes and does not scale for larger jobs. Instead, we found that it is faster and more efficient to partition the data as it is read from the Kafka brokers in a MapReduce job. The idea is to write the data into partitioned files and then simply move the files into the directories for the respective Hive partitions.

The first thing we need to do is look into the data stream and build a JSON object for each record or line read in. This is done by using a fast JSON parser like the one from the http://jackson.codehaus.org project:

// Requires: import org.codehaus.jackson.JsonNode; import org.codehaus.jackson.map.ObjectMapper;

private String DEFAULT_DATE = "1970-01-01";

public String getKey(byte[] payload) {
    ObjectMapper objectMapper = new ObjectMapper();
    String timeStamp = DEFAULT_DATE;
    try {
        String inString = new String(payload, "UTF-8");      // the raw Kafka message as text
        JsonNode jsonNode = objectMapper.readTree(inString);  // a record is read into the ObjectMapper
        timeStamp = parseEventDataTimeStamp(jsonNode);        // parse the partition date from the record
    } catch (Exception ex) {
        // fall back to DEFAULT_DATE if the record cannot be parsed
    }
    return String.format("partitiondate=%s", timeStamp);
}

Once we have the data in a JSON object, we can then parse out the partition date. In this example we want to get the EventDate from the record and extract the date value with the regular expression "(^\\d{4}-\\d{2}-\\d{2}).*$". This regex matches dates in the format 2014-01-10, and the value is used as part of the filename that will be generated for this data. The method below shows how to parse the partition date:

Sample input JSON string:

{"EventType":"INFO","EventDescription":"Test Message","EventDate":"2014-01-10T23:06:22.8428489Z", ...}

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

private String parseEventDataTimeStamp(JsonNode jsonNode) {
    String timeStamp = jsonNode.path("EventDate").getValueAsText();
    if (timeStamp == null) {
        timeStamp = DEFAULT_DATE;
    }
    Pattern pattern = Pattern.compile("(^\\d{4}-\\d{2}-\\d{2}).*$"); // this matches yyyy-MM-dd
    Matcher matcher = pattern.matcher(timeStamp);
    if (matcher.find()) {
        timeStamp = matcher.group(1);
    } else {
        timeStamp = DEFAULT_DATE;
    }
    return timeStamp;
}

The getKey method returns the date as a string in the form "partitiondate=2014-01-10". We can use this to specify which directory the data should be moved to for the Hive table. So in this example, let’s say we have a Hive table called EventData. There would be a directory named EventData for the Hive table in HDFS, and there would be subdirectories for each Hive partition. We have a separate application that manages these files and directories; it gets all of the necessary information from the filename generated by our process. So we would create a partition/directory named partitiondate=2014-01-10 in the directory for EventData and place the file with all records of 2014-01-10 in there.

The getKey method can be used in a MapReduce Java app to read from the Kafka brokers using multiple Hadoop nodes. You can use the method below to generate a data stream in the mapper process:

// This is a hash map which caches the output streams keyed by partitiondate
private Map<String, Data> cachedMessageStreams = new HashMap<>();

public Data getMessageStream(byte[] payload) throws IOException, CompressorException {
    // Generate a key for the partitiondate using information within the input payload message from Kafka.
    String key = fileFormat.getKey(payload);

    // We store the partitiondate information in a cache so that we only generate new files for records
    // with new partitiondates. If the record belongs to an existing file that is already open with the
    // same partitiondate, we don't have to create a new file and just write to the existing one.

    if (!cachedMessageStreams.containsKey(key)) {
        // Execute this block if the record contains a new partitiondate.
        String tempFileName = getFormattedFileName(key); // see the method definition below

        // Create a new output stream to write to a new file. We have custom formats for json and gzip files.
        OutputStream outputStream =
            formatFactory.getCompressedStream(externalFileSystem.createFile(tempFileName));
        MessageStream messageStream = fileFormat.getStream(outputStream);

        Data data = new Data(messageStream, tempFileName);
        cachedMessageStreams.put(key, data);
        return data;
    } else {
        // If the record contains a partitiondate key that is already cached (i.e., a file and output
        // stream are already open for that partitiondate), reuse the existing entry and write the data
        // to the corresponding output stream and file.
        return cachedMessageStreams.get(key);
    }
}

// This is a generic method where you would write the data using the getMessageStream() method.
public void writeMessage(byte[] messageBytes) throws IOException {
    try {
        Data data = getMessageStream(messageBytes);
        data.messageStream.writeMessage(messageBytes);
        ++data.messageCount;
    } catch (CompressorException ce) {
        throw new IOException(ce);
    }
}

// This is an example of how you could generate a filename with the partitiondate information
// along with other pertinent information. We have custom methods that read the Kafka topic names
// from config files so that we can write them into the filenames of the output files.
private String getFormattedFileName(String key) {
    String fileName = String.format("%s__%s__start=%s__mapper=%s", topicProperties.topic, key,
        getStartTime(), getHostName());
    // The record count and file extension (e.g. "__count=5273.json.gz") are appended when the
    // file is finalized (not shown here).
    return Paths.get(topicProperties.tempDir, fileName).toString();
}

I’ve given an overview of a way to read message data from Kafka into partitioned files using MapReduce. The partition information is written to the filenames generated by the methods outlined above. An example of a generated filename is:

eventdata__partitiondate=2014-01-10__start=2014.01.10_14.36.08__mapper=hadoopnode01__count=5273.json.gz

We store the Kafka topic name, the partitiondate, the creation time of the partition file, the hostname of the Hadoop node the mapper runs on, and a count of the number of records in the file. The file extension tells you that the file format is JSON compressed with the gzip algorithm.

We use this information in a separate application to move the file to the respective directory in HDFS and load it into the corresponding Hive table. We found this process to be much faster and more efficient than using Hive scripts to dynamically generate the partitions for us while loading Hive data.

What has your experience been in importing large amounts of data into Hadoop and Hive? What has worked, and what hasn’t?

DNA and the Masses: The Science and Technology Behind Discovering Who You Really Are
Melissa Garrett | March 12, 2014

Originally published on Wired Innovation Insights, 3-12-14.

There is a growing interest among mainstream consumers to learn more about who they are and where they came from. The good news is that DNA tests are no longer reserved for large medical research teams or plot lines in CSI. Now, the popularity of direct-to-consumer (DTC) DNA tests is making self-discovery a reality, and is leading individuals to learn more about their genetic ethnicity and family history. My personal journey has led to discoveries about my family history outside of the United States. On a census questionnaire I am White or maybe Hispanic. My genetics, however, show I am Southern European, Middle Eastern, Native American, Northern African, and West African. And who knew that DNA would connect me with several cousins who have family living just 20 miles from where my mom was born in central Cuba?

Major strides have been made in recent years to better understand and more efficiently analyze DNA. Where are we today?

  • Easier: DNA testing used to require a blood draw. Now, you can spit in a tube in the comfort (and privacy) of your own home.
  • Cheaper: In 2000, it took about 15 years and $3 billion to sequence the genome of one person. Today you could get your genome sequenced for a few thousand dollars. To put that into context, if a tank of gas could get you from New York to Boston in 2000, and fuel efficiency had improved at the same pace as DNA sequencing, today you could travel to Mars (the planet) and back on the same tank of gas.
  • Faster: Companies of all kinds are quickly innovating to keep up with demand and to make DNA testing more readily available and affordable. Illumina recently announced a whole-genome sequencing machine that could sequence 20,000 entire genomes per year.
  • More information: We can now tell you things about your ethnicity, find distant cousins, tell you whether a drug is likely to benefit or harm you, and tell your risk of diseases like breast and colon cancer.

It isn’t all roses. There is a joke among the genetic community that you can get your DNA sequenced for $1,000, but it will cost $1,000,000 to interpret it. DNA is complex. Each of us contains six billion nucleotides that are arranged like letters in a book that tell a unique story. And while scientists have deciphered the alphabet that makes up the billions of letters of our genome, we know woefully little about its vocabulary, grammar and syntax. The problem is that if you want to learn how to read, you need books, lots of them, and up until recently we had very few books to learn from.

To illustrate how complex it can be, let’s look at how to determine a person’s genetic ethnic background. Say you are given three books written in English, Chinese and Arabic. Even if you don’t speak the languages you can use the letters in those books to determine what percent of a fourth book is written in each of the respective languages, since those three languages are so distinct. But that is like determining whether someone is African, White or Asian, which doesn’t require a genetic test. What if the three books were written in English, French and German that use a similar alphabet? That is like telling someone that is White that they are a mix of various ethnic groups. That is a much harder problem and one that usually requires a genetic test.

So how do we distinguish the different ethnicities using DNA? Since we don’t have a genetic dictionary that tells us what we are looking for, scientists use the genetic signatures of people who have a long history in a specific region, religion, language, or otherwise practiced a single culture as a dictionary. Once enough of those genetic sequences are gathered, teams of geneticists and statisticians use the dictionary to define what part of your genome came from similar regions.

How does big data play into all of this science?

DNA was “big data” before the term became popularized. The real question should not be about how much data you have, but what you do with the data. Big data allows companies like Ancestry.com to compare 700,000 DNA letters for a single individual against the 700,000 DNA letters of several hundred thousand other test takers to find genetic cousins. That’s a lot of computational power, and the problem grows exponentially. To make all of this possible, big data and statistical analytics tools, such as Hadoop and HBase, are used to reduce the time associated with processing DNA results.

Given how far we have come in such a short time, what should we expect for the future of consumer DNA? The technology is moving so fast that it is almost worthless to predict. But what is clear is that we won’t come out of this genetic revolution the same. We are going to live better, healthier lives, and we are going to learn things about our species and ourselves we never dreamed of. And importantly, putting genetic ethnicity and family connection in the hands of individuals is going to tear down our notion of race and show how we are all family – literally. Maybe we’ll even treat each other a little better.

Ken Chahine is Senior Vice President and General Manager for Ancestry.com DNA.

 

Ancestry.com to Lead Core Conversation at SXSW
Melissa Garrett | March 6, 2014

Headed to SXSW Interactive? Join EVP of Product, Eric Shoup and Senior Director of Product at Tableau, Francois Ajenstat, for an engaging Core Conversation about how using big data can tell personalized stories.

Big Data is a game changer for storytelling. Too often, the data we pull is cold, factual and dehumanized. Technologies can now analyze and turn individual data points into prose and fascinating personal stories. We can bring the humanity back into the bite-sized stories we tell with data by seeking out, understanding and incorporating the inherent narratives within it. Come join the conversation to discuss how we can bring depth and meaning to massive amounts of data.

Session Details:

Session – How Using Big Data Can Tell Personalized Stories

When – Saturday, March 8, from 12:30pm to 1:30pm CT

Location – Sheraton Austin, Capitol View South, 701 E 11th St

Session Hashtag – Join the conversation on Twitter with #datastory.

 

Storytelling 2

Inferring Familiar Relationships From Historical Data Features (Part 2)
Laryn Brown | February 28, 2014

In my previous post, I outlined some of the problems and strategies we use at Ancestry.com to determine if two people who appear in the same household are related.

As promised, I want to focus this time on how to resolve ambiguous results.

In my early days of doing family history research, I made an assumption that finding a William Anderson married to an Isabella in 1901 in a small town in Scotland was enough evidence to establish a link to my ancestor and become the basis for further research. Unfortunately, what I learned after many months of further research was that “Anderson” is a very common Scottish surname. “William” and “Isabella” are very common given names in that region. A traditional naming pattern ends up creating cousins who are about the same age and who have the same name. Even though I was dealing with a small town with a tiny population, it turns out that there were four men named William Anderson who married a woman named Isabella in 1901 in that town.

The question becomes then, how to tell if the features being examined are conclusive or ambiguous. This question is not only for family historians who can use intuition, experience, and convention, but more importantly for natural language processing and automated systems.

I understand that there are statistical models and other tools to narrow down the likelihood that the features being compared are conclusive, but I want to take a detour from that line of thinking and instead model what a genealogist does to determine whether the evidence is strong enough to draw a conclusion or whether the conclusion should remain ambiguous.

Ambiguity comes from insufficient data. Is a birthplace that has been listed as “Springfield” in Illinois or Massachusetts? In her article, “When No Record Proves a Point,” professional genealogist Elizabeth Shown Mills argues that in the absence of conclusive data we need to build our case with reliable alternative data. In the case of “Springfield” our team uses what we call a “reference place,” other place data that exists in the metadata about the record.

For example, if we are parsing an obituary and the text states that the deceased was from Springfield, but gives no other data, pulling all place details from the metadata on the record such as the title of the newspaper, when it was published, and importantly, where it was published can generate a new feature, the reference location, that can assist in disambiguation.

Another effective strategy is to eliminate all other possibilities. Sir Arthur Conan Doyle had Sherlock Holmes say it this way: “… when you have eliminated the impossible, whatever remains, however improbable, must be the truth.”

When faced with several possible interpretations of a natural language parse of data, business rules can be applied to eliminate impossible conclusions. These rules are often not 100% accurate, but within certain tolerances they can be very effective at reducing false positives. For example, checking a name authority for occurrences of the term “Restaurant” shows a very low likelihood that we are looking at a given name. Further queries show that the term occurs frequently in business names. From this we draw the conclusion that we have processed a business listing in a directory and should not include it as a personal name.

Other examples of such rules might include: children are not born before their parents; spouses are usually married after age 12; and in many cultures, children are given the surname of their parents.
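As a rough illustration (the rule set, Candidate fields, and NameAuthority interface below are hypothetical, not our production rules engine), such rules can be expressed as simple predicates that eliminate impossible interpretations and leave the rest alone:

import java.util.ArrayList;
import java.util.List;

public class RecordRules {
    // Hypothetical lookup: how likely is this token to be a given name rather than, say, a business name?
    interface NameAuthority {
        double givenNameLikelihood(String token);
    }

    // Hypothetical parsed candidate: one possible interpretation of a directory or census line.
    static class Candidate {
        String givenName;
        Integer childBirthYear;
        Integer parentBirthYear;
    }

    // Keep only the candidates that no rule eliminates.
    static List<Candidate> applyRules(List<Candidate> candidates, NameAuthority authority) {
        List<Candidate> kept = new ArrayList<Candidate>();
        for (Candidate c : candidates) {
            // Rule: tokens that almost never occur as given names (e.g. "Restaurant") imply a business listing.
            if (c.givenName != null && authority.givenNameLikelihood(c.givenName) < 0.01) {
                continue;
            }
            // Rule: children are not born before their parents.
            if (c.childBirthYear != null && c.parentBirthYear != null
                    && c.childBirthYear < c.parentBirthYear) {
                continue;
            }
            kept.add(c);
        }
        return kept;
    }
}

The 0.01 threshold is arbitrary here; in practice each rule carries its own tolerance, which is exactly where the experienced researcher described below earns their keep.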

Getting these business rules correct is where bringing in an experienced researcher to consult can really help. You need someone who has proven conclusions before with limited data, and who has probably made some mistakes; someone with intuition and experience, who knows the conventions for establishing conclusions. Very often, your best source of business rules is the person who has been doing this process manually.

In summary, drawing conclusions from ambiguous data can be the greatest challenge of Big Data: the holy grail of Big Data parsing algorithms. At Ancestry.com we have found that starting with a good feature set, teasing out any additional clues from metadata or other related data, and then eliminating the impossible using business rules informed by an expert in the area can greatly reduce the occurrences of ambiguity in the data. Where ambiguity can’t be resolved, it should be allowed to persist in the data, and perhaps even in the presentation to the customer, where additional context may be available at a later time to draw a firm conclusion.

Video Q&A with Lead Engineer at Ancestry.com
Melissa Garrett | February 21, 2014

Jeremy Pollack, a lead engineer at Ancestry.com, answers questions on the technical backend of AncestryDNA in a video interview with InfoQ. The interview took place after his presentation with Bill Yetman on scaling AncestryDNA using Hadoop and HBase at QConSF in 2013. Check it out!

Jeremy P
