Tech Roots – Ancestry.com Tech Roots Blog (http://blogs.ancestry.com/techroots)

Building an Operationally Successful Component – Part 1: Transparency
By Geoff Rayback, April 14, 2014

On our team at Ancestry.com, we spend much of our time focusing on the operational success of the code that we write. An amazing feature that no one can use because the servers are constantly down is of little value and can be deemed a failure, even if the code worked from a functional perspective. Many developers resist this attitude because, in many organizations, developers do not have enough control over the operational aspects of their systems to take that kind of ownership. As more and more organizations adopt DevOps principles, this will have to change. As I see it, the mantra of a DevOps-minded software engineer is: “I will not abdicate responsibility for the operational success of my component. It is my software, so it is my job to make sure it is succeeding operationally.” Operational success is a feature that you build into your software, just like any other feature. In my experience, there are three attributes that a software component must have, to some degree, in order to be operationally successful:

 

  1. The component can report on its health.
  2. The component can overcome or correct problems itself.
  3. The component can fail quickly and gracefully when it encounters a problem it could not overcome or correct.

On our team, we are constantly striving to improve the degree to which our software has these three attributes.  This post will cover some of the things that we are doing on our team, and at Ancestry.com in general, to improve the first attribute, which has to do with transparency.

Is your component running?  How well is it running?  Are its dependencies reachable?  Do you even know what its dependencies are?  Can someone who doesn’t know anything about the component quickly get usable information about its state?  There is a very large amount of information we can gather and expose quite easily about our software.  On our team, we have built up a framework for dealing with what we think of as diagnostic data.  Our components are typically services reachable via HTTP, so we expose a number of diagnostic endpoints that we or other teams can use to get a peek into the health of the component.  Conceptually, these endpoints are:

/ping – This endpoint returns a simple heartbeat from the system. It quickly demonstrates that the server is set up to handle HTTP traffic: the web server is installed and running, our code is deployed and reasonably configured, and the server is finding our code and routing requests to it. Obviously that isn’t everything we need to know to be sure the system is working, but it is a great start, and the call returns quickly enough that we can hit it frequently without impacting the performance of the system. We use this type of endpoint as a heartbeat to ensure that broken servers don’t take traffic.

/health – This endpoint runs a suite of health tests that we have built into our components. The purpose of these tests is to assert on the health of various aspects of the system. We have broken them into three categories: general health tests, dependency tests, and required-dependency tests. The general tests check things like the version of code deployed, configuration settings, and other aspects of the system that need to be correct for it to function. Dependency tests do things like ensuring that our IOC system injected the right types for the various dependencies, and that each system we depend on is reachable and responding. We make a distinction between required and non-required dependencies. If a required dependency is down, the system cannot correctly handle traffic (think of a database that doesn’t have a viable fallback). If a non-required dependency is down, the system will continue to handle user requests, but may not be able to log errors or report its statistics. Any component we build that depends on another system is required to have a health test suite built into it. These suites are discoverable using reflection, so as we add or remove components, the health test engine automatically finds all the tests and runs them.
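
The post does not show what such a health test looks like in code, so here is a minimal, hypothetical Java sketch of the idea: a HealthTest interface with the three categories described above, a runner that discovers implementations, and one example required-dependency test. The interface, class names, host name, and the use of ServiceLoader for discovery are all invented for illustration (the real engine described above uses reflection).

import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical sketch of a discoverable health-test suite; names are invented for illustration.
interface HealthTest {
    enum Category { GENERAL, DEPENDENCY, REQUIRED_DEPENDENCY }

    String name();
    Category category();
    // Returns null on success, or a human-readable failure message.
    String run();
}

class HealthTestEngine {
    // Discover every HealthTest implementation on the classpath.
    // The post describes doing this with reflection; ServiceLoader is used here to keep the sketch
    // self-contained (implementations would be registered in META-INF/services).
    public List<String> runAll() {
        List<String> failures = new ArrayList<>();
        for (HealthTest test : ServiceLoader.load(HealthTest.class)) {
            String error = test.run();
            if (error != null) {
                failures.add(String.format("[%s] %s: %s", test.category(), test.name(), error));
            }
        }
        return failures;
    }
}

// Example: a required-dependency test that simply checks that a database host answers on its port.
class DatabaseReachableTest implements HealthTest {
    public String name() { return "user-database-reachable"; }
    public Category category() { return Category.REQUIRED_DEPENDENCY; }
    public String run() {
        try (java.net.Socket socket = new java.net.Socket("userdb.example.internal", 5432)) {
            return null; // reachable
        } catch (java.io.IOException ex) {
            return "cannot connect: " + ex.getMessage();
        }
    }
}

A /health handler would simply run the engine and report which tests failed, grouped by category, so both humans and automated monitors can act on the result.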

/statistics – We have instrumented our code so that whenever something we are interested in happens, we record some data about it. We track all sorts of things, from the number of individual requests a machine is taking, to the network bandwidth it is using, to the rate of exceptions encountered by the server. Each component gathers this data up and exposes it to any system that asks for it. We can then pull the data from all our servers periodically and dump it into a central reporting system to generate graphs and other visualizations of what is happening on our machines. Whenever we have a new question about the system that we aren’t getting an answer to, all we have to do is add instrumentation for it, and we can see the data we need to make an informed decision. We frequently add temporary counters to help us debug specific problems or to gather metrics needed by the business.
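
To make the shape of these endpoints concrete, here is a minimal, hypothetical sketch of /ping and /statistics wired up with the JDK’s built-in HttpServer and a simple counter registry. This is not the framework described in the post (that stack is not shown); the port, counter names, and response format are illustrative only.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch only: a plain JDK HttpServer exposing /ping and /statistics style endpoints.
public class DiagnosticEndpoints {
    // Named counters, incremented wherever something interesting happens in the component.
    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    public static void increment(String counterName) {
        COUNTERS.computeIfAbsent(counterName, k -> new LongAdder()).increment();
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // /ping: a cheap heartbeat proving the web stack is up and routing requests to our code.
        server.createContext("/ping", exchange -> respond(exchange, "pong"));

        // /statistics: dump every counter so a central system can scrape, graph, and alert on it.
        server.createContext("/statistics", exchange -> {
            StringBuilder body = new StringBuilder();
            COUNTERS.forEach((name, value) -> body.append(name).append('=').append(value.sum()).append('\n'));
            respond(exchange, body.toString());
        });

        server.start();
    }

    private static void respond(HttpExchange exchange, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }
}

A load balancer can poll /ping to decide whether a box should take traffic, and a scraper can pull /statistics on an interval, which mirrors how the endpoints are consumed in the examples below.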

These three fairly rudimentary endpoints give us a tremendous amount of visibility into the system.  They are all automatable (“an API for everything” as Amazon would say), and they are easily extendable.  We have found that with these three endpoints, there is virtually no problem that we cannot quickly diagnose.  Here are some examples of how we use these tools:

  • The load balancer monitors the /ping endpoint to see if it needs to add or remove servers from the pool.
  • The company-wide statistical gathering system pulls from our /statistics endpoint.
  • Our team runs through the /health test suite using an automated tool whenever we think there is risk, like right after we roll out new code or when someone reports site problems.
  • There is a company-wide monitor that tries to walk the dependency tree when there is a site issue and determine exactly how deep the problem lies.  We have mapped our required dependency tests at the /health endpoint into this system (two for the price of one!).
  • There is a system in place to watch the data collected from the /statistics endpoint and send out notifications if the values rise above or drop below a threshold.

This framework is even good for situations where we hadn’t anticipated the problem beforehand:

  • We discovered that we didn’t have a good way to know if an individual server was actually in the load balancer pool or not (i.e. is it taking traffic?).  Well guess what, one new health test at /health and some better monitoring of existing counters at /statistics and now we know if our servers are dropping out or being removed!
  •  We found that our deployment system was occasionally failing to deploy the correct version of code, choosing instead to redeploy the existing version (out of spite I guess?).  First we added a test at /health that simply reports the code version (helpful for a human who might read the test result but not automatable).  We had the test deployed to our production environment within 15 minutes of having the idea.  Next we added a counter at /statistics so that we could graph the code version on each machine in the pool.  If we ever see two lines, then we know something went wrong.  A single line means all servers are on the same version.  Again, 15 minutes after the idea, we were live with the statistics (single line, phew!).  Later, when we had more time, we came back and added a health test that looks at the code version and compares it to the deployment system’s records (change management anyone?) and we can actually assert that the code version is correct or incorrect.  This took a day or so to write and then we rolled it out right away.

My point here is that having some kind of framework that allows us to quickly gain visibility into specific operational problems has proven to be invaluable.  Bad load balancing?  Now we can see it.  Bad deployment?  Now we can see that too.  In fact, I have yet to find an operational issue that our tools cannot begin identifying within a day or so of us realizing it is a problem, even if it is something brand new that we would never have dreamed up.  This gives us a tremendous amount of confidence that our system is running the way we painstakingly designed it to run, which means that the cool new feature we slaved over is actually going to provide value instead of failing because it won’t run correctly.

In the next two posts, I’ll discuss some of the ways we have made our components be error-resistant and self-correcting, and some of the ways we help prevent site-wide catastrophes by not allowing problems to cascade out of control.

 

Using Mappers to Read and Partition Large Amounts of Data from Kafka into Hadoop
By Xuyen On, April 8, 2014

In my previous posts, I outlined how to import data into Hive tables using Hive scripts and dynamic partitioning. However, we’ve found that this only works for small batch sizes and does not scale for larger jobs. Instead, we found that it is faster and more efficient to partition the data as it is being read from the Kafka brokers in a MapReduce job. The idea is to write the data into partitioned files and then simply move the files into the directories for the respective Hive partitions.

The first thing we need to do is look into the data stream and build a JSON object for each record or line read in. This is done by using a fast JSON parser, like the one from the http://jackson.codehaus.org project:

private String DEFAULT_DATE = "1970-01-01";

public String getKey(byte[] payload) {
    org.codehaus.jackson.map.ObjectMapper objectMapper = new ObjectMapper();
    String timeStamp = DEFAULT_DATE;
    try {
        // Decode the raw Kafka payload bytes into a JSON string (assumes UTF-8 encoded messages)
        String inString = new String(payload, "UTF-8");
        // A record is read into the ObjectMapper
        JsonNode jsonNode = objectMapper.readTree(inString);
        // Parse the partition date from the record
        timeStamp = parseEventDataTimeStamp(jsonNode);
    }
    catch (Exception ex) {
        // Fall back to DEFAULT_DATE if the record cannot be parsed
    }
    return String.format("partitiondate=%s", timeStamp);
}

Once we have the data in a JSON object, we can parse out the partition date. In this example we want to get the EventDate from the record and extract the date value with the regular expression ("(^\\d{4}-\\d{2}-\\d{2}).*$"). This regex matches dates in the format 2014-01-10, and the matched value is used as part of the filename generated for this data. The method below shows how to parse the partition date:

Sample Input JSON String:

{"EventType":"INFO","EventDescription":"Test Message","EventDate":"2014-01-10T23:06:22.8428489Z", …}

private String parseEventDataTimeStamp(JsonNode jsonNode) {
    String timeStamp = jsonNode.path("EventDate").getValueAsText();
    if (timeStamp == null) {
        timeStamp = DEFAULT_DATE;
    }
    Pattern pattern = Pattern.compile("(^\\d{4}-\\d{2}-\\d{2}).*$"); // This matches yyyy-mm-dd
    Matcher matcher = pattern.matcher(timeStamp);
    if (matcher.find()) {
        timeStamp = matcher.group(1);
    }
    else {
        timeStamp = DEFAULT_DATE;
    }
    return timeStamp;
}

The date is returned as a string in the form "partitiondate=2014-01-10". We use this to specify which directory the data should be moved to for the Hive table. In this example, let’s say we have a Hive table called EventData. There would be a directory named EventData for the Hive table in HDFS, with subdirectories for each Hive partition. We have a separate application that manages these files and directories; it gets all of the necessary information from the filename generated by our process. So we would create a partition directory named partitiondate=2014-01-10 under the EventData directory and place the file with all of the 2014-01-10 records in there.
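
The separate application that does this move is not shown in the post, but the core of the idea is easy to sketch with the Hadoop FileSystem API. The warehouse path, class name, and metastore step below are assumptions, not the actual implementation:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of the "separate application" that moves staged files into Hive partition directories.
public class PartitionFileMover {
    // Assumed warehouse layout: /user/hive/warehouse/<table>/partitiondate=YYYY-MM-DD/
    private static final String WAREHOUSE_ROOT = "/user/hive/warehouse";
    private static final Pattern PARTITION_IN_NAME = Pattern.compile("partitiondate=(\\d{4}-\\d{2}-\\d{2})");

    public static void moveIntoHivePartition(String tableName, Path stagedFile) throws IOException {
        Matcher matcher = PARTITION_IN_NAME.matcher(stagedFile.getName());
        if (!matcher.find()) {
            throw new IllegalArgumentException("No partitiondate found in " + stagedFile.getName());
        }
        String partition = "partitiondate=" + matcher.group(1);

        FileSystem fs = FileSystem.get(new Configuration());
        Path partitionDir = new Path(WAREHOUSE_ROOT + "/" + tableName.toLowerCase() + "/" + partition);
        fs.mkdirs(partitionDir); // create the partition directory if it does not exist yet
        fs.rename(stagedFile, new Path(partitionDir, stagedFile.getName()));

        // The new partition still has to be registered with the Hive metastore, e.g.:
        //   ALTER TABLE EventData ADD IF NOT EXISTS PARTITION (partitiondate='2014-01-10');
    }
}

Whether the metastore registration happens here or in a small Hive script is a design choice; the key point is that the expensive partitioning work has already been done by the mapper when it named the file.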

The getKey method can be used in a MapReduce Java application to read from the Kafka brokers using multiple Hadoop nodes. You can use the method below to generate a data stream in the mapper process:

// This is a hashmap which caches partitiondate keys
private Map<String, Data> cachedMessageStreams = new HashMap<>();

public Data getMessageStream(byte[] payload) throws IOException, CompressorException {
    // Generate a key for the partitiondate using information within the input payload message from Kafka.
    String key = fileFormat.getKey(payload);

    // We store the partitiondate information in a cache so that we only generate new files for records
    // with new partitiondates. If the record belongs to an existing file that is already open with the
    // same partitiondate, we don't have to create a new file and can just write to the existing one.

    // Execute this block if the record contains a new partitiondate
    if (!cachedMessageStreams.containsKey(key)) {
        String tempFileName = getFormattedFileName(key); // See the method definition below

        // Create a new output stream to write to a new file. We have custom formats for json and gzip files.
        OutputStream outputStream =
            formatFactory.getCompressedStream(externalFileSystem.createFile(tempFileName));
        MessageStream messageStream = fileFormat.getStream(outputStream);

        Data data = new Data(messageStream, tempFileName);
        cachedMessageStreams.put(key, data);
        return data;
    }
    else {
        // If the record contains a partitiondate key that is already cached (i.e. a file and output
        // stream are already open for that partitiondate), reuse the existing entry and write the data
        // to the corresponding output stream and file.
        return cachedMessageStreams.get(key);
    }
}

// This is a generic method where you would write the data using the getMessageStream() method.
public void writeMessage(byte[] messageBytes) throws IOException {
    try {
        Data data = getMessageStream(messageBytes);
        data.messageStream.writeMessage(messageBytes);
        ++data.messageCount;
    }
    catch (CompressorException ce) {
        throw new IOException(ce);
    }
}

// This is an example of how you could generate a filename with the partitiondate information
// along with other pertinent information.
private String getFormattedFileName(String key) {
    // We have custom methods that get the Kafka topic from config files;
    // the topic is written into the filenames of the output files.
    String fileName = String.format("%s__%s__start=%s__mapper=%s", topicProperties.topic, key,
        getStartTime(), getHostName());
    // The record count and the file extension (e.g. .json.gz) are added when the file is finalized (not shown here).
    return Paths.get(topicProperties.tempDir, fileName).toString();
}

I’ve given an overview of a way to read message data from Kafka into partitioned files using MapReduce. The partition information is written to the filenames generated by the methods outlined above. An example of a generated filename is:

eventdata__partitiondate=2014-01-10__start=2014.01.10_14.36.08__mapper=hadoopnode01__count=5273.json.gz

We store the Kafka topic name, the partitiondate, the creation time of the partition file, the hostname of the Hadoop node the mapper is running on, and a count of the number of records in the file. The file extension tells you that the file format is JSON compressed with the gzip algorithm.

We use this information in a separate application to move the file to the respective directory in HDFS and load it into the corresponding Hive table. We found this process to be much faster and more efficient than using Hive scripts to dynamically generate the partitions for us while loading Hive data.

What has your experience been in importing large amounts of data into Hadoop and Hive? What has worked, and what hasn’t?

Ignite at Ancestry.com
By Chris, April 1, 2014

What is Ignite?

Ignite is a format for giving a talk on any subject. A speaker uses twenty slides, which auto-advance every 15 seconds, to deliver a five-minute talk. The purpose of this article is to elaborate on these talks, explain why we’re doing them, and suggest why you should try it at your organization as well! The official Ignite site has videos and explanations if you’d like to browse further.

What makes Ignite special?

At its core, the short format and forced slide changes push a speaker to be brief and concise. Speakers are also encouraged to talk about the things they are most passionate about, and the combination produces something quite beautiful.

Why is it so cool?

Stories are the oldest kind of magic. They allow you to experience the life and views of another person, with the added bonus that the storyteller is right there, so you can pick up on cues such as tone and body language. Ignite boils this down with a quick format and forced slide changes, enabling a person to share their experiences in the most wonderfully brief way they can. The first Ignite-style talk I saw was from Jin, a software development manager at Ancestry.com who also started the group, and he spoke about South Korean airlines. I have never been to Korea, and the history of its airline service was not something I thought would interest me, but I was wrong. This is the crux of Ignite: to see and experience something you perhaps would never have considered.

During my first Ignite talk here at Ancestry.com, I messed up on a lot of slides and had written too much text. I stuttered and panicked because I had too much to say; in reality, I simply hadn’t boiled it down enough. For my second talk, I chose a topic I knew a great deal about, something I had done for more than a decade: lion dancing. It is something I loved and remember fondly, even if Wikipedia later disproved some of the things I had learned from my teachers. My second talk went a lot better. I kept within the time frame on (almost) all my slides, and something interesting happened: people could see my story. I could see it in their eyes, as something truly surreal takes over a group of people sharing a vivid story with another person. I learned a lot that day, and maybe a few other people did too, through the lens of my life.

What does Ignite do for you, the speaker?

You’re always going to need to communicate, about just about everything. There is no facet of your life that being a better communicator won’t help you with. We human beings are social animals, and interacting with others is something of a requirement. The unique, fast-paced format forces you, as the speaker, to cut right to the point and to be quick on your feet as well as creative. Even before you say a single word, the act of preparing for such a performance is an interesting experiment unto itself, which, as a person who has given many speeches, left me pleasantly surprised. This was also one of the primary reasons the founder wanted to give Ignite talks: Jin wanted to practice and hone his communication skills, for what I can only assume is some kind of presidential candidacy (Jin Lee 2024!).

What does Ignite do for you, the listener?

People live, and their lives are intricately tied to the people they interact with and listen to. My very favorite talk was given by Charlene Chen, Sr. Product Manager at Ancestry.com, on how to take a good picture. I had taken a whole year of photography in school, and Charlene’s simple tricks and the story of her photography taught me more than that entire year. Since then I have, in fact, been taking better pictures. This honestly took me by surprise, but it is the secret sauce of Ignite talks. Ignite forces people to be brief, and by encouraging people to talk about what they are most passionate about, the format becomes a blueprint for engaging and memorable stories: stories that stick with you and hopefully leave everyone involved a little more enriched. And after you take all this great stuff from your peers, perhaps you’ll be encouraged to tell your own story. Everyone has one, and there’s nothing that can’t be turned into a good talk.

Relationships, how do those work?

It’s easy to miss all the great people around you when you work for a big company. Not many people can connect with the hundreds or thousands of other people they work with, but things like Ignite can help. I had seen Peter Graham, a fellow software engineer, around the office before, but after he gave his talk about how Japanese animation is produced, we had something very large in common (I’m a huge anime fanboy). Even if people are speaking about something you know nothing about, the Ignite format allows them to show you a very important piece of themselves. It’s a great feeling, and in the short time since we started Ignite I’ve really gotten to see and know my colleagues in a new light, which is something any person at any organization can benefit from.

AncestryDNA Regions by the Numbers
By Julie Granka, March 25, 2014

Since May of 2012, when we first released AncestryDNA, we’ve returned results to over a quarter of a million customers.

Based on feedback that we have received, those 300,000 customers have learned a great deal about their family history – their deep ancestral origins and their genetic relatives.

As it turns out, AncestryDNA has also learned a great deal from our customers.  We’ve uncovered some interesting statistics about ethnicity estimates that may help you to learn a bit more about your own family history – and we’ll share them with you in this blog post.

At AncestryDNA, we estimate a customer’s genetic ethnicity as a set of percentages in 26 regions around the world. See a map of these regions below.

[Image: map of the 26 AncestryDNA ethnicity regions]

We estimate the amount of DNA that a customer likely inherited from each of these regions by comparing a customer’s DNA with a reference set of DNA samples – with corresponding documented family trees – from each of these regions. For a deeper dive into the science of ethnicity estimation, take a look at my previous blog post on the subject.

Below is an example of an AncestryDNA ethnicity estimate.  In this post, we’ll explore what AncestryDNA ethnicity estimates look like across all of our customers – specifically, how many of these 26 regions show up in someone’s estimate?

[Image: an example AncestryDNA ethnicity estimate]

Based on the percentages estimated for a customer, we place each region into one of three categories.  Main Regions are the primary regions from which you likely inherited DNA (the regions, pictured above, that you see when you first view your ethnicity estimate); Trace Regions have less evidence of being part of your genetic ethnicity (and are viewed by clicking on the “+” button); Other Regions Tested have even less or no evidence, and do not show up as part of your ethnicity estimate.

In exploring the aggregated genetic ethnicity results of customers who opted in to scientific research, here are a few fun facts we’ve found about the diversity of regions found in customers’ estimates:

  • Ethnicity at a continental level – First, it’s interesting to view a person’s ethnicity estimate by continent. Our 26 regions can be broken into six different continental regions – such as Africa, Europe, and West Asia (see the estimate above). On average, we see that customers can trace their DNA back to 2.3 different continents.  While half of our customers have 2 continents or more as part of their ethnicity estimate, some have only one continent — and others have all six!
  • Main Regions in an ethnicity estimate – According to U.S. Census data on census.gov, “the overwhelming majority (97 percent) of the total U.S. population reported only one race in 2010. This group totaled 299.7 million. Of these, the largest group reported white alone (223.6 million), accounting for 72 percent of all people living in the United States.” This is thought-provoking because while most Americans self-identify with only one ethnicity, our database shows that some customers can be linked to as many as 11 main regions (or ethnicities), and the average is nearly four regions! A person’s ethnicity is likely far more nuanced than what they may report on a census.
  • Expanding to include Trace Regions – While main regions are those with strong evidence that they are part of someone’s genetic ethnicity, trace regions are those that have a smaller amount of evidence (and that you must click on the “+” sign to view). When we count up regions in both of these categories, customers can be traced back, on average, to 8.5 different regional ethnicities.  This really affirms that our customers hail from a variety of cultures and regions across the world.  Some customers even have 24 out of the possible 26 regions as part of their estimate!
  • African regions – We recently made an exciting new finding: African Americans have, on average, more than three African regions in their estimates. This shows that African Americans, too, are a melting pot of many unique African ethnicities.

These statistics and averages demonstrate the diversity of regions often found in an AncestryDNA customer’s ethnicity estimate, and they show that Americans are truly a mix of cultures and influences from across the globe.

Advances in science and DNA research are just now beginning to make a significant impact on how we understand ourselves and society at large. While DNA testing often confirms the expected, it can also reveal the completely unexpected. How do your AncestryDNA results compare to our findings?

Utah Code Camp 2014 – A Success for Ancestry.com Tech Team and Whole Community
By Mitchell Harris, March 20, 2014

Utah Code Camp 2014 came and went this weekend. With more than 850 people attending and more than 70 sessions, it was the largest code camp in Utah history. Thanks to Pat, Craig, Nate, and Kerry of Utah Geek Events for putting it all on.

Ancestry.com participated in a pretty big way. In addition to the many Ancestry.com employees in attendance, we had three speak. Bressain Dinkelman presented a session titled “Yes, You Belong Here,” about impostor syndrome in the tech world and how to overcome it. Craig Peterson presented a session titled “High Performance Web Services with Apache Thrift.” I presented a session called “Making Your Own Domain Specific Language,” and I was also called in at the last minute to pinch-hit and speak on RavenDB for a presenter who came down with pneumonia.


In addition to employees attending and speaking, Ancestry.com also sponsored the event. Did you find our spot next to the elevators on the first floor? We were handing out T-shirts, hand sanitizer, pens, and candy, and evangelizing how great it is to work here. We’re proud to be a sponsor of Utah Code Camp, and we hope for another great camp next year.

All Work and Some Play
By Jeff Lord, March 19, 2014

When I joined Ancestry.com, we were a small start-up of a few hundred employees in some cramped offices behind the post office in Orem, Utah. Now, almost 15 years later, we’re an international organization of more than 1,400 employees with offices around the world. Yet despite our growth, Ancestry.com has continued to provide a close-knit work environment that makes you feel like you’re part of a family and not just another employee.

One way Ancestry.com shows its appreciation for all of our hard work is with a monthly morale budget, which our front-end development (FED) team has used for some pretty interesting team-building activities over the years. We started off just going to lunch together as a team, then slowly branched out to the occasional movie. Over time, we’ve started pushing the envelope and found some pretty interesting ways to spend our monthly morale budget.
[Photo: the FED team bowling]
Some of our more memorable morale activities have included indoor rock climbing, frisbee golf, the Nickelcade, a family BBQ, bowling, and even archery. Eventually, simply doing these activities wasn’t enough, so our intern now doubles as our audio/video specialist and documents all the fun we’re having.

Competition as Collaboration – Ancestry.com Handwriting Recognition Competition
By Michael Murdock, March 14, 2014

We are excited to announce that the Ancestry.com handwriting recognition competition proposal was accepted as one of seven official International Conference on the Frontiers of Handwriting Recognition (ICFHR-2014) competitions. As part of our competition on word recognition from segmented historical documents, we are announcing the availability of a new image database [1], ANWRESH-1, which contains segmented and labeled documents for use by researchers in the document analysis community.

We invite you to visit our competition website to learn more about what the competition entails and the prizes offered, and to register if you are interested. A few key dates to note:

  • Competition Registration Deadline: March 24, 2014
  • Submission Deadline: April 1, 2014
  • Benchmark Database Availability: April 2, 2014
  • Results Announced: September 4, 2014

Read on to learn about the ICFHR conference, the Ancestry.com competition and database, and why we are so excited to be sponsoring this competition.

Since 1990, the document analysis research community has been meeting every two years for a series of conferences called ICFHR, the International Conference on the Frontiers of Handwriting Recognition.


Quoting from the ICFHR home page:

ICFHR is the premier international forum for researchers and practitioners in the document analysis community for identifying, encouraging and exchanging ideas on the state-of-the-art technology in document analysis, understanding, retrieval, and performance evaluation. The term document in the context of ICFHR encompasses a broad range of documents from historical forms such as palm leaves and papyrus to traditional documents and modern multimedia documents. … The ICFHR provides a forum for researchers in the areas of on-line and off-line handwriting recognition, pen-based interface systems, form processing, handwritten-based digital libraries, and web document access and retrieval.

The format of the conference is fairly typical with a variety of pre-conference tutorials and the conference proper consisting of multiple parallel tracks of oral and poster presentations. A fairly modern innovation for these kinds of conferences is the inclusion of sponsored competitions that take place in the months leading up to the conference with the results announced and discussed (and in some cases, debated) in sessions on the last day of the conference.

The ANWRESH-1 Database

An important part of our competition is the new database, ANWRESH-1, that we are making available to the document analysis research community. We expect many in the research community will find it interesting and helpful in their work. It consists of about 800,000 “image snippets” of handwritten text drawn from about 4,000 images from the 1920 and 1930 U.S. Censuses. Specifically, we have located (segmented) on each image the Name, Relation, Age, Marital Condition, and Place of Birth fields and labeled them with their ground truth values. An example image is shown below in Figure 1. Note that I have shown in this figure one row (called a record), with each of the fields we are using in this competition labeled with its field type and highlighted in yellow.


Figure 1. Example document with one row emphasized and the fields of interest highlighted in yellow.

 

The challenge in this competition is to use the ANWRESH-1 database to create field-specific recognizers that can take segmented image snippets of handwritten text and automatically transcribe them (or assist with the transcription) to produce the corresponding textual representations for these fields.

One possible approach for the Birth Place field that takes advantage of the repetition of values common in this kind of collection might be to develop a mathematical model that clusters the ink strokes in a snippet using some distance metric such that similar words (under this metric) belong to the same cluster. The following snippets would be “close together” under this metric and thus, would be in the same (green) cluster.

[Image: sample Birth Place snippets assigned to the green cluster]

This clustering algorithm wouldn’t have the slightest idea what characters are formed from the ink strokes, but it would know that the following snippets are different from the snippets in the green cluster (and thus belong together in the blue cluster):

[Image: sample Birth Place snippets assigned to the blue cluster]

This approach is very powerful when you encounter a document containing birthplace entries like the following:

[Image: a column of census entries with the same birthplace value repeated]

 

Once a human keyer identifies the very first occurrence as the text “alabama” [2], the clustering algorithm can automatically label the rest of the alabama fields as being similar or the same, and those labels can then be quickly and easily reviewed by the human keyer. In some cases, the repetition of field values could allow this kind of algorithm to reduce the number of fields that must be keyed by one or two orders of magnitude.
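
To make the clustering idea concrete, here is a small, hypothetical Java sketch, not the approach required by the competition. It assumes each snippet has already been reduced to a fixed-length feature vector (the hard part, extracting features from ink strokes, is not shown), uses plain Euclidean distance as a stand-in metric, and assigns snippets greedily to the nearest existing cluster within a threshold. Once a human keys one member of a cluster, the text propagates to the rest as a suggestion.

import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of clustering plus label propagation; all names are invented for this sketch.
class SnippetCluster {
    final double[] representative;                      // feature vector of the cluster's first member
    final List<Integer> members = new ArrayList<>();    // indices of the snippets assigned to this cluster
    String label;                                       // filled in once a human keys any member

    SnippetCluster(double[] firstMember, int index) {
        representative = firstMember.clone();
        members.add(index);
    }
}

class GreedyClusterer {
    private final double threshold;
    private final List<SnippetCluster> clusters = new ArrayList<>();

    GreedyClusterer(double threshold) {
        this.threshold = threshold;
    }

    // Assign a snippet (already reduced to a feature vector) to the nearest cluster, or start a new one.
    void add(double[] features, int snippetIndex) {
        SnippetCluster best = null;
        double bestDistance = Double.MAX_VALUE;
        for (SnippetCluster cluster : clusters) {
            double d = euclidean(features, cluster.representative);
            if (d < bestDistance) {
                bestDistance = d;
                best = cluster;
            }
        }
        if (best != null && bestDistance <= threshold) {
            best.members.add(snippetIndex);
        } else {
            clusters.add(new SnippetCluster(features, snippetIndex));
        }
    }

    // When a keyer transcribes one snippet, every other member of its cluster inherits the text
    // as a suggested label to be reviewed rather than keyed from scratch.
    void keySnippet(int snippetIndex, String keyedText) {
        for (SnippetCluster cluster : clusters) {
            if (cluster.members.contains(snippetIndex)) {
                cluster.label = keyedText;
                return;
            }
        }
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}

With hundreds of “alabama” entries on a page, keying one member and reviewing the propagated suggestions is where the one-to-two-orders-of-magnitude reduction described above would come from.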

 

Is Competition a Good Thing?

One might ask what we hope to gain by sponsoring this competition. Developing handwriting recognition technology, and helping the document analysis community advance it, is a strategic initiative for Ancestry.com. As we have discussed in previous blog posts, the process of converting images of historical documents containing handwritten names, dates, relationships, and places into a textual representation suitable for searching is almost all done manually. This transcription process is expensive and time-consuming, and it is thus a limiting factor in large-scale efforts to extract the data contained in the vast libraries of archived historical documents. Considering the billions of valuable historical documents currently residing on microfilm, microfiche, and paper, it’s clear that automating (or even partially automating) the transcription process with better handwriting recognition systems could be hugely beneficial.

In sponsoring ANWRESH-2014, we are reaching out to researchers developing technologies in word recognition, word spotting, word clustering, machine learning, and other related fields to encourage their participation and collaboration. Initially, we want our efforts in this area to generate interest and awareness, foster connections, and enable collaboration. We hope this competition and the ANWRESH-1 database will enable fresh, unconventional approaches to this difficult, multi-faceted problem. At the conclusion of the competition, at a minimum, we hope to have a much better understanding of the current state of the art in handwritten word recognition on historical documents. As we proceed beyond this competition, we anticipate that a spectrum of innovative techniques will emerge. And as a growing, diverse community uses ever larger, cleaner, and, most importantly, shared databases of historical documents to characterize these techniques, we will see real, albeit incremental, progress. That progress will enable us to unlock and make available to family historians valuable document collections that are simply out of reach with today’s technologies.

 

Notes:

1. The name of our competition and database, ANWRESH, stands for ANcestry.com Word REcognition from Segmented Historical Documents.

2. The lower-case “a” in “alabama” is because of our “key-as-seen” policy: If the text looks like a lower-case letter, that’s the way it is keyed.

 

 

DNA and the Masses: The Science and Technology Behind Discovering Who You Really Are
By Melissa Garrett, March 12, 2014

Originally published on Wired Innovation Insights, 3-12-14.

There is a growing interest among mainstream consumers in learning more about who they are and where they came from. The good news is that DNA tests are no longer reserved for large medical research teams or plot lines in CSI. Now, the popularity of direct-to-consumer (DTC) DNA tests is making self-discovery a reality and leading individuals to learn more about their genetic ethnicity and family history. My personal journey has led to discoveries about my family history outside of the United States. On a census questionnaire I am White, or maybe Hispanic. My genetics, however, show I am Southern European, Middle Eastern, Native American, Northern African, and West African. And who knew that DNA would connect me with several cousins who have family living just 20 miles from where my mom was born in central Cuba?

Major strides have been made in recent years to better understand and more efficiently analyze DNA. Where are we today?

  • Easier: DNA testing used to require a blood draw. Now, you can spit in a tube in the comfort (and privacy) of your own home.
  • Cheaper: In 2000, it took about 15 years and $3 billion to sequence the genome of one person. Today you could get your genome sequenced for a few thousand dollars. To put that into context, if a tank of gas could get you from New York to Boston in 2000, and fuel efficiency had improved at the same pace as DNA sequencing, today you could travel to Mars (the planet) and back on the same tank of gas.
  • Faster: Companies of all kinds are quickly innovating to keep up with demand and to make DNA testing more readily available and affordable. Illumina recently announced a whole-genome sequencing machine that could sequence 20,000 entire genomes per year.
  • More information: We can now tell you things about your ethnicity, find distant cousins, tell you whether a drug is likely to benefit or harm you, and tell your risk of diseases like breast and colon cancer.

It isn’t all roses. There is a joke among the genetic community that you can get your DNA sequenced for $1,000, but it will cost $1,000,000 to interpret it. DNA is complex. Each of us contains six billion nucleotides that are arranged like letters in a book that tell a unique story. And while scientists have deciphered the alphabet that makes up the billions of letters of our genome, we know woefully little about its vocabulary, grammar and syntax. The problem is that if you want to learn how to read, you need books, lots of them, and up until recently we had very few books to learn from.

To illustrate how complex it can be, let’s look at how to determine a person’s genetic ethnic background. Say you are given three books written in English, Chinese and Arabic. Even if you don’t speak the languages you can use the letters in those books to determine what percent of a fourth book is written in each of the respective languages, since those three languages are so distinct. But that is like determining whether someone is African, White or Asian, which doesn’t require a genetic test. What if the three books were written in English, French and German that use a similar alphabet? That is like telling someone that is White that they are a mix of various ethnic groups. That is a much harder problem and one that usually requires a genetic test.

So how do we distinguish the different ethnicities using DNA? Since we don’t have a genetic dictionary that tells us what we are looking for, scientists use the genetic signatures of people who have a long history in a specific region, religion, language, or otherwise practiced a single culture as a dictionary. Once enough of those genetic sequences are gathered, teams of geneticists and statisticians use the dictionary to define what part of your genome came from similar regions.

How does big data play into all of this science?

DNA was “big data” before the term became popular. The real question should not be how much data you have, but what you do with it. Big data allows companies like Ancestry.com to compare 700,000 DNA letters for a single individual against the 700,000 DNA letters of several hundred thousand other test takers to find genetic cousins. That requires a lot of computational power, and the amount of work grows rapidly as the pool of test takers grows. To make all of this possible, big data and statistical analytics tools, such as Hadoop and HBase, are used to reduce the time it takes to process DNA results.
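
As a rough, toy-scale illustration (this is not AncestryDNA’s matching algorithm; the figures are simply the ones quoted above, and the class is invented for this sketch), even the most naive approach of comparing every pair of test takers marker by marker makes the arithmetic plain:

// Toy illustration only: count positions at which two genotype arrays agree.
// Real cousin matching is far more sophisticated than simple agreement counting.
public class NaiveMatchScale {
    static int sharedMarkers(byte[] personA, byte[] personB) {
        int shared = 0;
        for (int i = 0; i < personA.length; i++) {   // ~700,000 markers per person
            if (personA[i] == personB[i]) {
                shared++;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        long markers = 700_000L;
        long testTakers = 300_000L;
        long pairs = testTakers * (testTakers - 1) / 2;   // every-pair comparison
        System.out.printf("%,d pairs x %,d markers = %,d marker comparisons%n",
                pairs, markers, pairs * markers);
    }
}

That brute-force cost, tens of quadrillions of marker comparisons, is exactly the kind of work that smarter algorithms plus tools like Hadoop and HBase are used to tame.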

Given how far we have come in such a short time, what should we expect for the future of consumer DNA? The technology is moving so fast that it is almost worthless to predict. But what is clear is that we won’t come out of this genetic revolution the same. We are going to live better, healthier lives, and we are going to learn things about our species and ourselves we never dreamed of. And importantly, putting genetic ethnicity and family connection in the hands of individuals is going to tear down our notion of race and show how we are all family – literally. Maybe we’ll even treat each other a little better.

Ken Chahine is Senior Vice President and General Manager for Ancestry.com DNA.

 

My Experience as an Intern at Ancestry.com
By Bailey Stewart, March 10, 2014

I started my internship on the Front End Development team at Ancestry.com in May of 2013, and during the past 10 months I have developed skills and capabilities that I never dreamed of. Below are a few insights I have gained throughout my experience on how  to make an internship successful.

Accept Criticism

Before I began interning, I felt fully capable of completing any task handed my way. However, within my first week I realized that I was in way over my head. I needed some ‘babysitting’ at first, which I’m sure was frustrating to my mentor, but luckily he was patient and willing to answer my questions. While it was nice to be on the receiving end of patience, nothing helped me grow more than the feedback and criticism I received. I chose to welcome criticism, and I appreciated being held to the same standards as the rest of my team. Sure, I may have needed some babysitting at first – but I never wanted to be treated like a baby. It is important to look at feedback from a learning perspective. Accepting criticism is a great opportunity to learn and rise to new challenges.

Take Ownership and Prove Yourself

When I started my internship, I expected to be given a pile of dusty projects that had been sitting on the shelf for too long. Sure enough, I completed some pretty monotonous tasks. But by looking at each project as an investment, I was able to research and master different concepts. Taking ownership of a seemingly futile task showed my passion and devotion to the job, and it allowed me to prove myself. Over time, I took on more challenging projects that were important both to my team and to the company. While I definitely prefer these projects to the dusty ones, each project contributed to my growth and development.

Make Connections and Take Advantage of Opportunities

My internship marks the beginning of my career, and it is important for me to take advantage of the opportunities provided and to network with others. When I started interning, my team was planning a trip to the San Francisco office. I was shocked when I was invited to join the team, and I happily took advantage of the opportunity. Not only did this experience help me connect with my teammates, it helped me connect with Ancestry.com employees both in Provo and in San Francisco. This trip, as well as many other company events and team morale activities, has helped me establish connections that will be useful throughout my career.

I think that every part of an internship can be taken advantage of and learned from. In this post, I mention only a few of the things that I found most beneficial and valuable, but I would love to hear your thoughts. What have you found helpful in taking advantage of your internship? Please share in the comments below!

Ancestry.com to Lead Core Conversation at SXSW
By Melissa Garrett, March 6, 2014

Headed to SXSW Interactive? Join Eric Shoup, EVP of Product at Ancestry.com, and Francois Ajenstat, Senior Director of Product at Tableau, for an engaging Core Conversation about how using big data can tell personalized stories.

Big Data is a game changer for storytelling. Too often, the data we pull is cold, factual and dehumanized. Technologies can now analyze and turn individual data points into prose and fascinating personal stories. We can bring the humanity back into the bite-sized stories we tell with data by seeking out, understanding and incorporating the inherent narratives within it. Come join the conversation to discuss how we can bring depth and meaning to massive amounts of data.

Session Details:

Session – How Using Big Data Can Tell Personalized Stories

When – Saturday, March 8, from 12:30 pm to 1:30 pm CT

Location – Sheraton Austin, Capitol View South, 701 E 11th St

Session Hashtag – Join the conversation on Twitter with #datastory.

 
