Tech Roots: Ancestry.com Tech Roots Blog

Building an Operationally Successful Component – Part 3: Robustness
By Geoff Rayback, July 23, 2014


My previous two posts discussed building components that are “operationally successful.”  To me, a component cannot be considered successful unless it actually operates as expected when released into the wild.  Something that “works on my machine” cannot be considered a success unless it also works on the machine it will ultimately be running on.  For our team at Ancestry.com, we have found that we can ensure (or at least facilitate) operational success by building components that are transparent, self-correcting, and robust.

In the final post in this series I want to discuss how to handle problems that the component cannot correct or overcome.  I am calling this robustness, although you could easily argue that overcoming and correcting problems is also a part of robustness.  The main distinction I want to make between what I called “self-correction” and what I am calling “robustness” is that there are problems that the system can overcome and still return a correct result, and there are problems that prevent the system from providing a correct result.  My last post discussed how to move as many problems as possible into the first category, and this post will discuss what we do about the problems left over in the second.

I propose that there are three things that should happen when a system encounters a fatal error:

  1. Degraded response – The system should provide a degraded response if possible and appropriate.
  2. Fail fast – The system should provide a response to the calling application as quickly as possible.
  3. Prevent cascading failures – The system should do everything it can to prevent the failure from cascading up or down and causing failures in other systems.

Degraded Response

A degraded response can be very helpful in creating failure-resistant software systems.  Frequently a component will encounter a problem, like a failed downstream dependency, that prevents it from returning a fully correct response.  But often in those cases the component may have much of the data it needed for a correct response.  It can often be extremely helpful to return that partial data to the calling application, because it allows that application to provide a degraded response in turn to its clients, and so on up the chain to the UI layer.  Human users typically prefer a degraded response to an error; it is usually the software in the middle that isn’t smart enough to handle one.  For example, we have a service that returns a batch of security tokens to the calling application.  In many cases there may be a problem with a single token while the rest are generated correctly.  In these cases we can provide the correct tokens to the calling application along with the error about the one(s) that failed.  To the end user, this results in the UI displaying a set of images, a few of which don’t load, which most people would agree is preferable to an error page.

The major argument against degraded responses is that they can be confusing for the client application.  A service that is unpredictable can be very difficult to work with, and mysteriously returning data in some cases but not in others makes for a bad experience for developers consuming your service.  Because of this, when your service responds to client applications, it is important to clearly distinguish between a full response and a partial response.  I have become a big fan of the HTTP 206 status code, “Partial Content.”  When our clients see that code, they know that there was some kind of failure, and if they aren’t able to handle a partial response, they can treat the response as a complete failure.  But at least we gave them the option to treat the response as a partial success if they are able to.
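To make this concrete, here is a minimal sketch (in Python, using Flask) of a batch endpoint that returns the full set on success, an error on total failure, and HTTP 206 Partial Content when only some items succeed.  This is not our actual token service; `generate_token` and the request shape are hypothetical stand-ins.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_token(image_id):
    """Hypothetical stand-in for real token generation; raises on failure."""
    if image_id.startswith("bad"):
        raise ValueError("could not generate a token for " + image_id)
    return {"image_id": image_id, "token": "tok-" + image_id}

@app.route("/tokens", methods=["POST"])
def batch_tokens():
    image_ids = request.get_json().get("image_ids", [])
    tokens, errors = [], []
    for image_id in image_ids:
        try:
            tokens.append(generate_token(image_id))
        except Exception as exc:
            errors.append({"image_id": image_id, "error": str(exc)})

    if not errors:
        return jsonify(tokens=tokens), 200   # full success
    if not tokens:
        return jsonify(errors=errors), 502   # nothing worked at all
    # Partial success: return everything that worked, plus enough detail
    # that a client unable to handle partial data can treat it as a failure.
    return jsonify(tokens=tokens, errors=errors), 206
```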

In many ways I see the failure to use degraded responses as a cultural problem for development organizations.  It is important to cultivate a development culture where client applications and services all expect partial or degraded responses.  It should be clear to all developers that services are expected to return degraded responses when they are having problems, that client applications are expected to handle them, and that the functional tests should reflect these expectations.  If everyone in the stack is afraid that their clients won’t be able to handle a degraded response, then everyone is forced to fail completely, even if they could have partially succeeded.  But if everyone in the stack expects and can handle partial responses, then it frees everyone else in the stack to start returning them.  Chicken and egg, I know, but even if we can’t get everyone on board right away, we can all take steps to push the organizations we work with in the right direction.

Fail Fast

When a component encounters a fatal exception that doesn’t allow for even a partially successful response, it has a responsibility to fail as quickly as possible.  It is inefficient to consume precious resources processing requests that will ultimately fail.  Your component shouldn’t be wasting CPU cycles, memory, and time on something that in the end isn’t going to provide value to anyone.  And if the call into your component is a blocking call, then you are forcing your clients to waste CPU cycles, memory, and time as well.  What this means is that it is important to detect failure as early in your request flow as possible.  This can be difficult if you haven’t designed for it from the beginning.  In legacy systems that weren’t built this way, it can result in some duplication of validation logic, but in my experience the extra effort has always paid off once we got the system into production.  As soon as a request comes into the system, the code should do everything it can to determine whether it is going to be able to successfully process the request.  At its most basic level, this means validating request data for correct formatting and usable values.  On a more sophisticated level, components can (and should) track the status of their downstream dependencies and change their behavior if they sense problems.  If a component senses that a dependency is unavailable, requests that require that dependency should fail without the component even calling it.  People often refer to this kind of thing as a circuit breaker.  A sophisticated circuit breaker will monitor the availability and response times of a dependency, and if the system stops responding or the response times get unreasonably long, the circuit breaker will drastically reduce the traffic it sends to the struggling system until it starts to respond normally again.  Frequently this type of breaker will let a trickle of requests through so it can quickly sense when the dependency issue is corrected.  This is a great way to fail as fast as possible; in fact, if you build your circuit breakers correctly, you can fail almost instantly when there is no chance of successfully processing the request.
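As a rough illustration of the pattern, here is a toy circuit breaker in Python.  The thresholds and the trickle rate are made-up numbers, and a production breaker would also track response times; this sketch only shows the open/trickle/close behavior described above.

```python
import random
import time

class CircuitBreaker:
    """Toy circuit breaker: opens after repeated failures, lets a trickle of
    probe requests through while open, and closes again once calls succeed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, probe_rate=0.05):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before probing freely
        self.probe_rate = probe_rate                # fraction of calls allowed while open
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True                              # closed: normal traffic
        if time.time() - self.opened_at >= self.reset_timeout:
            return True                              # half-open: probe the dependency
        return random.random() < self.probe_rate     # open: let a trickle through

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()

def call_dependency(breaker, call):
    """Wrap a downstream call so doomed requests fail almost instantly."""
    if not breaker.allow_request():
        raise RuntimeError("dependency unavailable (circuit open), failing fast")
    try:
        result = call()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```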

Prevent Cascading Failures

Circuit breakers can also help implement my last suggestion, which is that a component should aggressively try to prevent failures from cascading outside of its boundaries.  In some ways this is the culmination of everything I have discussed in this post and my previous post.  If a system encounters a problem, and it fails more severely than was necessary (it does not self-correct and does not provide a degraded response), or the failure takes as long as, or longer than, a successful request (which is actually common if you are retrying or synchronously logging the exception), then it can propagate the failure up to its calling applications.  Similarly, if the system encounters a problem that results in increased traffic to a downstream dependency (think retries again: a dependency fails because it is overloaded, so the system calls it again, and again, compounding the issue), it has propagated the issue down to its dependencies.  Every component in the stack needs to take responsibility for containing failures.  The rule we try to follow on our team is that a failure should be faster and result in less downstream traffic than a success.  There are valid exceptions to that rule, but every system should start with it and only break it deliberately when the benefit outweighs the risk.  Circuit breakers can be tremendously valuable in these scenarios because they can make requests fail nearly instantaneously and/or throttle the traffic down to an acceptable level as a cross-cutting concern that only has to be built once.  That is a more attractive option than building complex logic around each incoming request and each call to a downstream dependency (aspect-oriented programming, anyone?).
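A small sketch of what that rule can look like in code, reusing the `CircuitBreaker` sketch above: retries are strictly bounded, and the call is skipped entirely when the breaker is open, so failures stay fast and never multiply downstream traffic unchecked.  The single-retry limit is an illustrative policy, not a recommendation for every system.

```python
def call_with_bounded_retry(breaker, call, max_attempts=2):
    """Retry at most once, and skip the call entirely when the circuit
    breaker is open, so failures stay fast and retry amplification is
    strictly bounded."""
    last_error = None
    for _ in range(max_attempts):
        if not breaker.allow_request():
            break                          # fail fast instead of piling on
        try:
            result = call()
            breaker.record_success()
            return result
        except Exception as exc:
            breaker.record_failure()
            last_error = exc
    raise RuntimeError("request failed; not retrying further") from last_error
```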

If development teams take to heart their personal responsibility to ensure that their code runs correctly in production, as opposed to throwing it over the wall to another team, or even to the customer, the result is going to be software that is healthier, more stable, more highly available, and more successful.

XX+UX happy hour at Ancestry.com with guest speaker Erin Malone
By Ashley Schofield, July 21, 2014

A career journey is a curvy path that usually takes unexpected turns, and as a designer in the growing field of UX, it’s sometimes a struggle to find the right environment to foster great design discussions with fellow UXers.

One of the things I’ve enjoyed most at Ancestry.com is the great team who have helped me grow tremendously as a designer.

I’m excited to announce that on July 31, Ancestry.com will be hosting an XX+UX happy hour to foster conversations around all things UX.

Guest speaker Erin Malone, a UXer with over 20 years of experience and co-author of Designing Social Interfaces, will share stories of her journey into user experience and talk about the mentors who have helped her along the way. Find out more about the event here: https://xxux-ancestry.eventbrite.com

The Google+ XX+UX community is made up of women in UX, design, research, and technology. The community shares useful industry news and hosts some of the best design events I’ve ever attended.

The +Google Design page recently wrote about these events: “We’re proud to support this burgeoning, international community of women in design, research, and technology who can connect, share stories, and mentor each other online and offline.”

Their events don’t have the typical networking awkwardness and encourage comfortable conversation. I was surprised by how much I learned from just mingling with other colleagues in various work environments—that had never happened to me at prior “networking” events.

Connecting with others and swapping stories at events like this helps me develop a greater understanding of my trade and grow a network of trusted colleagues I can rely on through the twists ahead in my career.

Hope to see you at the event and hear about your career journey.

Event Details:

XX+UX Happy Hour with speaker Erin Malone, hosted by Ancestry.com

July 31, 2014 from 6:00-9:00pm

Ancestry.com

153 Townsend St, Floor 8

San Francisco, CA 94107

Map: http://goo.gl/maps/RXHW2

Free pre-registration is required: https://xxux-ancestry.eventbrite.com

Ancestry.com Awarded Patent for Displaying Pedigree Charts on a Touch Device
By Gary Mangum, July 11, 2014

In 2011 Ancestry.com joined the mobile revolution and I was given the opportunity to work on a new app that would bring our rich genealogical content to iOS and Android devices.  The original app was called ‘Tree To Go’, but a funny thing about the name was that the app did not have a visual ‘tree’ anywhere in the user interface; it provided only a list of all of the people in a user’s family ‘tree’.  We joked that it would have been more appropriately named ‘List To Go’ instead.  We knew that providing a tree experience for visualizing family data would be an important feature to bring to our customers quickly.  Our small team went to work brainstorming ideas and quickly came up with some rather unique ways to visualize familial relationships.  Our team lead challenged us to ‘think outside the box’ and asked us to envision the best way to put this data in front of our users, taking advantage of what makes mobile devices unique: touch screens, accelerometers, limited screen real estate, and clumsy fingers instead of a mouse.  We needed our design to be very intuitive.  We wanted users to quickly pick up the device and start browsing the tree without reading any instructions.  This was a fun challenge, and some of the ideas we came up with ended up being described in various patent idea disclosure documents, where we had to explain why our solutions presented unique ways of solving the problem.

One night, while pondering the problem, it occurred to me that a user viewing only a small part of his family tree on a mobile device would naturally want to swipe his finger on the screen to navigate further back into his tree.  If we could continually prepare and buffer ancestral data off screen, then we could give the user the impression that he could keep swiping back through his tree until he reached his chosen destination.  And so the idea was born.
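The sketch below is purely conceptual; it is not the patented implementation or the app’s actual code, and `fetch_generation` is a hypothetical data-access callback.  It only illustrates the core idea of keeping a few generations buffered off screen so a swipe never has to wait for data.

```python
class PedigreeBuffer:
    """Keep a window of generations ready off screen so a swipe can
    immediately reveal the next generation."""

    def __init__(self, fetch_generation, visible=4, lookahead=2):
        self.fetch_generation = fetch_generation  # gen_index -> list of people
        self.visible = visible                    # generations shown on screen
        self.lookahead = lookahead                # generations buffered off screen
        self.first_visible = 0
        self.cache = {}
        self._ensure_buffered()

    def _ensure_buffered(self):
        # Pre-fetch everything from the leftmost visible generation out to the
        # lookahead margin so the next swipe never waits on I/O.
        for gen in range(self.first_visible,
                         self.first_visible + self.visible + self.lookahead):
            if gen not in self.cache:
                self.cache[gen] = self.fetch_generation(gen)

    def swipe_back(self):
        """User swipes toward older ancestors: shift the window and refill."""
        self.first_visible += 1
        self._ensure_buffered()
        return [self.cache[g] for g in
                range(self.first_visible, self.first_visible + self.visible)]
```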

We iterated on the idea as a team trying to figure out:

  • What is the correct swiping action to look for?
  • How many generations of people should be displayed on the device, and how should they be laid out on the screen?
  • What are the best algorithms for buffering and preparing the off-screen data?
  • How would we make the swiping gesture and the animations feel natural and intuitive to the user?
  • Should the user be able to navigate in both directions (back through ancestors as well as forward through descendants)? And if so, what would that look like to the user?
  • Could this idea handle both tree navigation and pinch-and-zoom?
  • Would this idea lend itself to different tree view concepts?
  • What would it mean if the user tapped on part of the tree?

After lots of work and some great user feedback, the idea finally became a reality.  The new ‘continuously swiping’ tree view became a prominent feature of the 2.0 version of the newly renamed Ancestry iOS mobile app and has given us a great platform to build on.  I’m pleased to announce that on July 1, 2014, Ancestry.com was awarded a patent for this pedigree idea (http://www.google.com/patents/US8769438).

If you’d like to experience the app for yourself, you can download it here.

[Screenshots: the continuously swiping pedigree view in the Ancestry mobile app]

The Importance of Context in Resolving Ambiguous Place Data
By Laryn Brown, July 10, 2014

When interpreting historical documents with the intent of researching your ancestors, you are often presented with less-than-perfect data. Many of the records that are the backbone of family history research are bureaucratic scraps of paper filled out decades ago in some government building. We should hardly be surprised when the data entered is vague, confusing, or just plain sloppy.

Take, for example, a census form from the 1940s. One of the columns of information is the place of birth of each individual in the household. Given no other context, these entries can be extremely vague and, in some cases, completely meaningless to the modern generation.

Here are some examples:

  • Prussia
  • Bohemia
  • Indian Territory

Additionally, there are entries that on their face seem clear, but take on new complexity with more context:

  • Boston (England)
  • Paris (Idaho)
  • Provo (Bosnia)

And finally, we have entries that are terrifically vague and cannot be resolved without more context:

  • Springfield
  • Washington
  • Lincoln

If we add the complexity of automatic place parsing, where we try to infer meaning from the data and normalize it to a common form that we can search on, the challenges grow.

In the above example, if I feed “Springfield” into our place authority, which is a tool that normalizes different forms of place names to a single ID, I get 63 possible matches in a half dozen countries. This is not that helpful. I can’t put 63 different pins on a map, or try to match 63 different permutations to create a good DNA or record hint.

I need more context to narrow down the field to the one Springfield that represents the intent of that census clerk a hundred years ago.

One rather blunt approach is to sort the list by population. Statistically, more people will be from a larger city named Springfield than from a smaller one. But this has all sorts of flaws, such as excluding rural places from ever being legitimate matches. If you happen to be from Paris, Idaho, we are never going to find your record.

Another approach would be to implement a bunch of logical rules, where for the case of a name that matches a U.S. state we would say things like “Choose the largest jurisdiction for things that are both states and cities.” So “Tennessee” must mean the state of Tennessee, not the five cities in the U.S. that share the same name. Even if you like those results, there are always going to be exceptions that break the rule and require a second rule – such as the state of Georgia and the country of Georgia. The new rule would have to say “Choose the largest jurisdiction for things that are both states and cities, but don’t choose the country of Georgia because here it is really a state.”

It is clear that a rules-based approach will not work. But since we still need to resolve ambiguity, how is it to be done?

I propose a blended strategy that takes three approaches.

  1. Get context from wherever you can to limit the number of possibilities. If the birth location for Grandpa is Springfield and the record set you are studying is the Record of Births from Illinois, then that additional context may give you enough data to conclude that Springfield = Springfield, Illinois, USA. What seems obvious to a human observer is actually pretty hard for automated systems; they need to learn where to find this additional context, and natural language parsers or other systems need to be fed more context from the source to facilitate a good parse.
  2. Preserve all unresolved ambiguity. If the string I am parsing is “Provo” and my authority has a Provo in Utah, South Dakota, Kentucky, and Bosnia, I should save all of these as potential normalized representations of “Provo.” It is a smaller set to match on when doing comparisons, and you may get help later on to pick the correct city.
  3. Get a human to help you. We are all familiar with applications and websites that give us that friendly “Did you mean…” dialog. This approach lets a user, who may have more context, choose the “Provo” that they believe is right. We can get into a lot of trouble by trying to guess what is best for the customer instead of presenting a choice to them. Maybe Paris, Idaho is the Paris they want, maybe not. But let them make the choice.

In summary, context is the key to resolving ambiguity when parsing data, especially ambiguous place names. A blended approach that makes use of all available context, preserves any remaining ambiguity, and presents ambiguous results to the user for resolution seems like the most successful strategy for solving the problem.
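As a toy illustration of that blended strategy, the sketch below narrows candidates with record-set context, preserves whatever ambiguity remains, and falls back to asking the user. `PLACE_AUTHORITY` is a made-up stand-in for our real place authority, and the candidate lists are illustrative only.

```python
# Toy place resolution: (1) use context, (2) preserve ambiguity, (3) ask a human.
PLACE_AUTHORITY = {
    "springfield": [("Springfield, Illinois, USA", "US-IL"),
                    ("Springfield, Massachusetts, USA", "US-MA"),
                    ("Springfield, Missouri, USA", "US-MO")],
    "provo": [("Provo, Utah, USA", "US-UT"),
              ("Provo, South Dakota, USA", "US-SD"),
              ("Provo, Bosnia and Herzegovina", "BA")],
}

def resolve_place(raw_name, record_context=None, ask_user=None):
    candidates = PLACE_AUTHORITY.get(raw_name.strip().lower(), [])

    # 1. Use whatever context the record set provides (e.g. "Illinois births").
    if record_context:
        narrowed = [c for c in candidates if record_context.lower() in c[0].lower()]
        if narrowed:
            candidates = narrowed

    if len(candidates) == 1:
        return candidates                  # resolved unambiguously
    # 3. If a human is available, offer a "Did you mean..." choice.
    if ask_user and len(candidates) > 1:
        return [ask_user(candidates)]
    # 2. Otherwise preserve the ambiguity rather than guessing.
    return candidates

# Example: an Illinois birth register narrows "Springfield" to a single place.
print(resolve_place("Springfield", record_context="Illinois"))
```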

Lessons Learned Building a Messaging Framework
By Xuyen On, July 1, 2014

We have built out an initial logging framework with Kafka 0.7.2, a messaging system developed at LinkedIn. This blog post will go over some of the lessons we’ve learned by building out the framework here at Ancestry.com.

Most of our application servers are Windows-based, and we want to capture IIS logs from these servers. However, Kafka does not include any producers that run on the Microsoft .Net platform. Thankfully, we found an open source project with .Net libraries that could communicate with Kafka, which allowed us to develop our own custom producers to run on our Windows application servers. You may find that you also need to develop your own custom producers, because every platform is different: your applications might run on different OS’s or be written in different languages. The Apache Kafka site lists the platforms and programming languages that it supports. We plan on transitioning to Kafka 0.8, but we could not find corresponding library packages like there were for 0.7.

Something to keep in mind when you design your producer is that it should be as lean and efficient as possible. The goal is to achieve as high a throughput for sending messages to Kafka as possible while keeping CPU and memory overhead as low as possible, so as not to overload the application server. One design decision we made early on was to compress messages in our producers in order to make communication between the producers and Kafka more efficient and faster. We initially used gzip because it was natively supported within Kafka. We achieved very good compression ratios (10:1) and also had the added benefit of saving storage space.

We have two kinds of producers. The first runs as a separate service that simply reads log files from a specified directory where all the log files to be sent are stored. This design is well suited for cases when the log data is not time critical, because the data is buffered in log files on the application server. This is useful because if a Kafka cluster becomes unavailable, the data is still saved locally; it’s a good safety measure against network failures and outages. The other kind of producer is coded directly into our applications, and the messages are sent to Kafka straight from code. This is good for situations where you want to get the data to Kafka as fast as possible, and it could be interfaced with a component like Samza (another project from LinkedIn) for real-time analysis. However, messages can be lost if the Kafka cluster becomes unavailable, so a failover cluster would be needed to prevent message loss.
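Our actual producers are written in .Net, so the sketch below is only a language-neutral illustration (in Python) of the file-based producer pattern: scan a spool directory, compress each finished log file, send it, and keep a local copy as the safety net. `send_to_kafka` is a placeholder for whatever client library you end up using.

```python
import glob
import gzip
import os
import time

def send_to_kafka(topic, payload):
    """Placeholder for a real Kafka producer call; swap in the client
    library appropriate to your platform."""
    raise NotImplementedError

def ship_log_directory(log_dir, topic, poll_interval=30):
    """Sketch of the 'separate service' producer: scan a spool directory,
    gzip each finished log file, send it, then move it out of the way."""
    while True:
        for path in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
            with open(path, "rb") as fh:
                payload = gzip.compress(fh.read())   # mirrors the ~10:1 gzip win
            send_to_kafka(topic, payload)
            os.rename(path, path + ".sent")          # keep a local copy as a safety net
        time.sleep(poll_interval)                    # data waits on disk, not in memory
```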

To get data out of Kafka and into our Hadoop cluster we wrote a custom Kafka consumer job that is a Hadoop map application. It is a continuously scheduled job that runs every 10-15 minutes. We partitioned our Kafka topics to have 10 partitions per broker. We have five Kafka brokers in our cluster that are treated equally, which means that a message can be routed to any broker as determined by a load balancer. This architecture allows us to scale out horizontally: if we need to add more capacity to our Kafka cluster, we can just add more broker nodes, and conversely, we can take out nodes as needed for maintenance. Having many partitions also lets us scale out more easily because we can increase the number of mappers in the job that reads from Kafka.

However, we have found that splitting the job into too many pieces can generate too many files. In some cases we were producing lots of small files that were smaller than the Hadoop block size, which was set to 128 MB. This problem became evident when we ingested a batch of over 40 million small files into our Hadoop cluster. It caused our NameNode to go down because it was not able to handle the sheer number of file handles within the directory; we had to increase the Java heap size to 16 GB just to be able to do an ls (listing contents) on the directory. Hadoop likes to work with a small number of very large files (they should be much larger than the block size), so you may find that you need to tweak the number of partitions used for the Kafka topics, as well as how long you want your mapper job to write to those files. Longer map times with fewer partitions will result in fewer, larger files, but it will also take longer for the messages to become queryable in Hadoop, and it can limit the scalability of your consumer job since you will have fewer possible mappers to assign to the job.
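The sizing lesson boils down to something like the sketch below: keep appending consumed messages to one output file until it is comfortably bigger than the HDFS block size, and only then roll to a new file. The paths and the two-block target are illustrative, and our real consumer is a Hadoop map job rather than a local Python script.

```python
import os

class RollingFileWriter:
    """Toy writer that rolls output files only after they comfortably
    exceed the HDFS block size, to avoid the small-files problem."""

    BLOCK_SIZE = 128 * 1024 * 1024            # 128 MB block size from our cluster

    def __init__(self, out_dir, min_file_size=2 * BLOCK_SIZE):
        self.out_dir = out_dir
        self.min_file_size = min_file_size    # aim well above one block
        self.file_index = 0
        self.current = self._open_next()

    def _open_next(self):
        path = os.path.join(self.out_dir, "part-%05d" % self.file_index)
        self.file_index += 1
        self.bytes_written = 0
        return open(path, "ab")

    def write(self, message_bytes):
        self.current.write(message_bytes + b"\n")
        self.bytes_written += len(message_bytes) + 1
        if self.bytes_written >= self.min_file_size:
            self.current.close()
            self.current = self._open_next()  # roll only after a healthy file size
```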

Another design decision we made was to partition the data within our consumer job. Each mapper creates a new file each time a new partition value is detected, and the topic and partition values are recorded in the filename. We created a separate process that looks in the HDFS staging directory where the files are generated. This process looks at the file names and determines whether the corresponding tables and partitions already exist in Hive. If they do, it simply moves those files into the corresponding directory under the Hive external table directory in HDFS; if a partition does not already exist, it dynamically creates a new one. We also compress the data within the consumer job to save disk space. We initially tried gzip, which gave good compression rates but dramatically slowed down our Hive queries due to the processing overhead. We are now trying bzip2, which gives less compression, but our Hive queries run faster. We chose bzip2 because of its lower processing overhead, and also because it is a splittable format, which means that Hadoop can split a large bz2 file and assign multiple mappers to work on it.

That covers a few of the lessons learned thus far as we build out our messaging framework here at Ancestry. I hope you will be able to use some of the information covered here so that you can avoid the pitfalls we encountered.


Controlling Costs in a Cloudy Environment
By Daniel Sands, June 24, 2014

From an engineering and development standpoint, one of the most important aspects of cloud infrastructure is the concept of unlimited resources. The idea of being able to get a new server to experiment with, or to spin up more servers on the fly to handle a traffic spike, is a foundational benefit of cloud architectures. This is handled in a variety of ways by different cloud providers, but there is one thing they all have in common:

Capacity costs money. The more capacity you use, the more it costs.

So how do we provide unlimited resources to our development and operations groups without it costing us an arm and a leg? The answer is remarkably simple: visibility is the key to controlling costs on cloud platforms. Team leads and managers with visibility into how much their cloud-based resources are costing them can make intelligent decisions with regard to their own budgets. Without decent visibility into the costs involved in a project, overruns are inevitable.

This kind of cost tracking and analysis has been the bane of accounting groups for years, but several projects have cropped up to tackle the problem. Projects like Netflix ICE provide open source tools to track costs in public cloud environments. Private cloud architectures are starting to catch up to public clouds with projects like Ceilometer in OpenStack, although determining accurate costs can be trickier there because of the variables involved in a custom internal architecture.

The most important thing in managing costs of any nature is to realistically know what the costs are. Without this vital information, effectively managing the costs associated with infrastructure overhead can be nearly impossible.

Adventures in Big Data: Commodity Hardware Blues
By Bill Yetman, June 20, 2014

One of the real advantages of a system like Hadoop is that it runs on commodity hardware, which keeps your hardware costs low. But when that hardware fails at an unusually high rate, it can really throw a wrench into your plans. This was the case recently when we set up a new cluster to collect our custom log data and experienced a high rate of hard drive failures. Here is what happened in about one week’s time:

  • We set up a new 27-node cluster, installed Hadoop 2.0, got the system up and running, and started loading log files.
  • By Friday (two days later), the cluster was down to 20 functioning nodes as data nodes began to fall out due to hard drive failures. The primary name node had failed over to the secondary name node.
  • By Monday, the cluster was down to 12 nodes and the name node had failed over.
  • On Wednesday the cluster was at 6 nodes and we had to shut it down.
  • The failures coincided with the increased data load on the system. As soon as we started ingesting our log data, putting pressure on the hard drives, the failures started.

It makes you wonder what happened during the manufacturing process. Did a forklift drop a pallet of hard drives, and those drives were the ones installed into the machines sent to us? Did the vendor simply skip the quality control steps for this batch of hard drives? Did someone on the assembly line sneeze on the drives? Did sunspots cause this? Over 20% of the hard drives in this cluster had to be replaced in the first three weeks the system was running, and for a while three or more nodes were failing daily. We started running scripts that looked at the S.M.A.R.T. monitoring information for the hard drives. Any drives that reported failures or predicted failures were identified and replaced. We had to do this proactively on all nodes in the cluster.
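Conceptually, that health sweep can be as simple as the sketch below: run smartctl’s health check against each drive and flag anything that does not report PASSED. Treat it as an illustration rather than our exact script; smartctl’s output varies by drive type and version, so the string match and device list are assumptions.

```python
import glob
import subprocess

def suspect_drives(devices=None):
    """Flag drives whose S.M.A.R.T. health check does not come back PASSED.
    A real script should also inspect individual S.M.A.R.T. attributes and
    predicted-failure flags, not just the summary line."""
    devices = devices or sorted(glob.glob("/dev/sd?"))
    suspects = []
    for dev in devices:
        result = subprocess.run(["smartctl", "-H", dev],
                                capture_output=True, text=True)
        if "PASSED" not in result.stdout:
            suspects.append((dev, result.stdout.strip()))
    return suspects   # candidates for proactive replacement
```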

One interesting side note about Hadoop: our system never lost data. The HDFS file system check showed that replication had failed, but we had at least one instance of every data block. As we rebuilt the cluster, the data was replicated three times.

What are we doing about this? First, we are having the vendor who is staging our hardware run a set of diagnostics before sending the hardware to us. It is no longer “good enough” to make sure the systems power on. If problems are found, they will swap out the hardware before we receive it. Second, we’ve set minimum failure standards for our hardware and keep track of failures. If we see too many failures, we work proactively with the vendor on replacement hardware.

One of my Hadoop engineers said this, “If you purchase commodity memory, it may be slower but it runs. If you purchase commodity CPUs, they also run. If you purchase the least expensive commodity hard drives, they will fail.” He’s absolutely right.

If all else fails, get different commodity disk drives (enterprise level).

 

Website Performance 101
By Jeremy Johnson, June 17, 2014

Here at Ancestry.com, we have a team dedicated to monitoring, measuring, and helping the company improve the performance of the website. Trying to do this is a very fun and interesting challenge. With a website that has many billions of records and other content (10 petabytes), making it fast is no small task! To illustrate some of the key concepts of performance, let me share a little story with you.

The Performance Painter

Think of a painter and canvas. The painter knows what she wants to paint and dips her brush into a color and starts to paint. Let’s say this took 3 seconds for her to take her brush, dip it into her color, and begin to paint. Those 3 seconds are how long it took to start to see the painting begin. This is the painter’s First Paint time.

Now think of the painter finishing her painting. She has all the trees, clouds, and puppies that the painting will contain, but she hasn’t signed it, written anything on the back, or put on any of the finishing touches. Let’s say it took 3 hours for her to get the initial painting done (including the First Paint time). This is the painter’s Page Load time.

Finally, think of the painter signing the painting, writing a special note on the back, and doing some additional touch-ups to get it ready to deliver to a buyer. Let’s say this took an additional 30 minutes. This is the painter’s Total Download time (3 hours, 30 minutes).

What These Performance Times Mean

A web page on the Internet is just like the painting getting created, except (we hope) much faster than 3 hours! In fact, we’ve found that anything over 3 seconds on a web page starts to feel slow to our customers!

First Paint is when our customers start to see the page get drawn on the monitor. The moment you see any kind of graphic, text, or pixel, that is the first paint time. It’s like the painter first touching her brush to the canvas.

Page Load time is when the page is mostly done, although there may be a few things that still need to be processed or shown on the page – like delay-loaded sections of the page brought in via AJAX. It’s like the painter finishing the painting, but not adding all the finishing touches.

Total Download time is when everything on the page is finished and delivered to our customer’s browser. This includes delay-loaded AJAX sections on the page or anything used to track metrics for the page. It is like the painter adding the finishing touches and signing the painting.

Each of these is important, but our business is especially focused on First Paint and Page Load time so our customers are seeing the things they care about as fast as possible.

How Do We Measure Performance?

Everything we do on the performance team comes down to improving our time in these 3 key areas, with an emphasis on First Paint and Page Load.

Think about yourself as you use websites. Would you like to sit and wait for a web page to load? We feel the same way on the performance team. Pages should load as fast as possible starting with the most important things on the page.

We built our own custom software to go out and hit the most important pages that our customers use on our website. This software runs 24 hours a day, 7 days a week, and captures how long our key pages take. We do this measuring from a few different locations as well. It is through this method of capturing performance times and maintaining a history that we can then set goals for improvement.

How Do We Improve?

The simplest way to improve website performance is to first track and measure it, and then find areas that don’t meet our performance standards. Typically, we want a page’s first paint to happen in 1.25 seconds or less, page load to happen in 2.5 seconds or less, and the total download time to happen in 3 seconds or less (though we are more lenient with this last number).
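First paint and page load need a real browser to measure, so the sketch below only times the raw HTML download for a list of key pages against the total-download budget above. It is a crude stand-in for our custom monitoring software, not how it actually works, and the page list is illustrative.

```python
import time
import urllib.request

TOTAL_DOWNLOAD_BUDGET = 3.0                      # seconds, from the goals above
KEY_PAGES = ["http://www.ancestry.com/"]         # the real monitor hits many key pages

def check_pages(pages=KEY_PAGES):
    results = []
    for url in pages:
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()                          # pull the full response body
        elapsed = time.monotonic() - start
        results.append({"url": url,
                        "seconds": round(elapsed, 3),
                        "within_budget": elapsed <= TOTAL_DOWNLOAD_BUDGET})
    return results                               # a history of these feeds our goals
```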

If we find a key area isn’t hitting these goals, we ask the question WHY. This involves a lot of performance investigation and questions with the page’s owners within the company. Most of the time, we are able to pinpoint a specific part of the page that needs improving and can schedule work to improve its performance.

Continuous Performance

Lastly, performance is not a one-time fix. As a widely used website, Ancestry.com is continually adding features or modifying existing ones to be richer and more satisfying for our customers. As such, the monitoring and improving never ends. Performance really is a feature of the website that must be considered at all times.

Building an Operationally Successful Component – Part 2: Self-Correction
By Geoff Rayback, June 10, 2014


In my last post I talked about building components that are “operationally successful,” by which I mean that the software functions correctly when it is deployed into production.  I suggested that there are three things that a software component must have, to some degree, in order to be operationally successful: transparency, self-correction, and robustness.

The subject of the last post was transparency.  You have to know (and be able to prove to yourself) that your software, and the hardware it is running on, is actually working.  But what do we do when the software isn’t working, or there is a problem with one of its dependencies or even the hardware it is running on?  If the software doesn’t work, then it isn’t operationally successful, no matter what the actual cause is.  In this and the next post, I’ll discuss how our software responds to problems.  I am making a distinction between problems that we can do something about (which I’ll cover in this post), and problems we can’t do anything about (which I’ll discuss next time).  Of course, that distinction assumes we have the wisdom to know the difference.

Let’s look at three examples of overcoming or correcting common problems that our team currently uses in our services.  Obviously self-correction can be extremely sophisticated, even going so far as to automatically rewrite code to adjust for failures, but I don’t think we need to go anywhere near that far to get real benefit from the idea.

A failing server

This is a simple example, and one that I mentioned in the previous post.  Assuming that you have enough visibility into your servers (that they can report on their health), you can take corrective action when they are unhealthy.  In our system, like many others, the servers sit in a pool behind a load balancer.  Our load balancer constantly polls the health of the servers using either our /ping endpoint, or our /health endpoint.  If a server is found to be failing, the load balancer removes it from the pool.  That is a simple step, and most modern load balancers have this feature out of the box.  But if the endpoint the balancer calls is more sophisticated than a simple ping, you can make much better decisions about what to do. 

Removing the server from the pool isn’t always the right approach.  You only want to remove it if it is broken, not if it is simply overworked.  I think it is crucial that our services provide enough visibility that we can make that distinction.  Once a server gets pulled, we can fix it offline while the rest of the pool handles the traffic.  We are working towards a more sophisticated approach where failing servers are simply rebuilt and have our code redeployed to them automatically.  If you have a fully automated configuration management system like Chef, Puppet, Ansible, etc., you don’t need a person to rebuild a failing server; the system can do it automatically.  This lets the system correct for anything but an actual hardware problem, like a failed hard drive or power supply.  This is a nice baby step on the road to true elastic capacity, which is the gold standard of self-correcting server pools.  Scaling capacity up or down dynamically and replacing problematic servers on the fly is something some companies already do well, and those that do have a huge advantage.  In my opinion, it should be a goal in the back of everyone’s mind, even if they are far away from achieving it.
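A minimal sketch of that distinction: a /ping endpoint that only proves the process is up, and a /health endpoint that separates “broken, pull me from the pool” from “healthy but busy.” The individual checks below are hypothetical placeholders, and this is Flask-flavored Python rather than our actual service code.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical checks; real ones might write a temp file, time recent
# downstream calls, or look at queue depth and CPU load.
def disk_ok():
    return True

def dependencies_ok():
    return True

def load_is_high():
    return False

@app.route("/ping")
def ping():
    return "pong"                       # cheap liveness check for the load balancer

@app.route("/health")
def health():
    broken = not (disk_ok() and dependencies_ok())
    busy = load_is_high()
    status = 503 if broken else 200     # only a broken server asks to be pulled
    return jsonify(broken=broken, busy=busy), status
```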

Corrupt data

We often run into data that is incorrect in some way.  Our team’s services relate to media, mostly images, and we often run into missing image files, incorrect metadata (like image widths and heights), and images that missed some pre-processing step like thumbnail generation. 

We found long ago that these are all easily correctable problems.  We have software solutions for each issue that could be applied behind the scenes as we run into problems.  We typically throw some kind of error when we run into a case like this, so as an easy first stab at self-correction we built a listener that watched our exception log for specific exceptions.  Whenever it found one, it would create a work item for another service that was always running, repairing these specific problems in the background.  If a user requests an image with a missing thumbnail, then one is automatically generated a few minutes later.  If the width and height are incorrect, they are repaired a few minutes later.  This obviously doesn’t help the unfortunate user who triggered the exception, but it follows the mantra that we should never make the same mistake twice, so subsequent users always get the corrected metadata. 

We have since enhanced the system so that in addition to waiting for log entries to trigger work items for it, it is constantly running through our data in the background looking proactively for things it can repair.  We are currently making further enhancements that will allow our production services to call it directly to report anomalies instead of relying on our exception logs.  This will let us report a wider range of issues, and even allows other teams to report issues with our data.  We have found that having the system perpetually correcting the data relaxes some of the data integrity requirements for new content coming into the system.  It lets us publish data that is mostly correct because we can rely on the automated correction system to repair any problems.  Since we are often in a race with our competition to get some new dataset online first, this approach (you could call it “eventual correctness” if you are into that kind of thing) can give us a leg up.  We accelerate the publishing timeline, accepting some flaws in the data, with the understanding that the flaws will be repaired automatically.
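The listener pattern itself is simple; the sketch below maps known exception signatures to repair actions and hands them to a background worker. The log format, signatures, and repair actions are all hypothetical placeholders rather than our production code.

```python
import queue
import re
import threading

# Hypothetical exception signatures mapped to repair actions.
REPAIRS = {
    r"MissingThumbnail\((?P<image_id>\w+)\)": "regenerate_thumbnail",
    r"BadImageDimensions\((?P<image_id>\w+)\)": "recompute_dimensions",
}
work_items = queue.Queue()

def watch_exception_log(lines):
    """Turn matching exception-log lines into repair work items."""
    for line in lines:
        for pattern, action in REPAIRS.items():
            match = re.search(pattern, line)
            if match:
                work_items.put((action, match.group("image_id")))

def repair_worker():
    """Run forever, repairing reported problems in the background."""
    while True:
        action, image_id = work_items.get()
        print("repairing: %s for image %s" % (action, image_id))  # placeholder fix
        work_items.task_done()

threading.Thread(target=repair_worker, daemon=True).start()
```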

A failed call to a dependency

When we make a call to a downstream system, and that call fails, we have several options.  The simplest option is to fail ourselves and let the failure bubble up the call stack.  Obviously this is undesirable, and often unnecessary.  One simple way to overcome a failure is to just try again.  In many cases a retry is helpful and appropriate, but it depends on the reason for the failure.  Retries can actually exacerbate some kinds of issues (e.g. if the dependency is failing because it is overtaxed), so it is important to think through the scenario.   

We have been refining our approach to this, and I don’t think we have a perfect solution yet, but it seems to me that a retry is appropriate when additional traffic will not compound the issue and when there is a chance that you’d see different results between identical requests.  This can happen when the dependency is behind a load balancer (one request gets routed to a failing server but a subsequent request could go to a functioning server), or if the dependency has some kind of throttling, circuit breaker, or other measures in place that could cause intermittent failures.  In these cases, when a request fails, we just try again. 

We have an extreme version of this that I would not advocate as a general rule, but which works in our specific case.  We make calls to a third party system that we don’t control, and which has very badly designed load balancing.  The load balancer frequently pins us to servers that are failing to service our requests.  This results in repeated failures over an extended period.  Our solution has been to bypass the third party system’s load balancer and build a software load balancer into our service.  This software load balancer maintains a “dead pool” which lists individual nodes that have been misbehaving.  We avoid those nodes until they stop acting up, distributing calls to the working nodes instead.  If we hit a node we thought was working and discover it is failing, we add it to the dead pool and retry on another node.
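In outline, that software balancer looks something like the sketch below: remember which nodes have misbehaved recently, avoid them for a cool-off period, and retry elsewhere when a supposedly healthy node fails. The timings, random node choice, and attempt count are illustrative; the real implementation has more nuance.

```python
import random
import time

class DeadPoolBalancer:
    """Client-side balancer that benches misbehaving nodes for a while."""

    def __init__(self, nodes, cooloff_seconds=60.0):
        self.nodes = list(nodes)
        self.cooloff = cooloff_seconds
        self.dead_pool = {}                                   # node -> time it was benched

    def _usable_nodes(self):
        now = time.time()
        return [n for n in self.nodes
                if now - self.dead_pool.get(n, 0.0) > self.cooloff]

    def call(self, make_request, attempts=3):
        for _ in range(attempts):
            candidates = self._usable_nodes() or self.nodes   # never give up entirely
            node = random.choice(candidates)
            try:
                return make_request(node)                     # e.g. an HTTP call to that node
            except Exception:
                self.dead_pool[node] = time.time()            # bench it, retry elsewhere
        raise RuntimeError("all candidate nodes are failing")
```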

Another approach we take to overcoming failing dependencies is to have designated fallback systems in place.  We typically take this approach for databases and storage systems, and ideally the fallback is geographically separate, and contains replicated data.  If the primary system is failing (or is slow, or overtaxed, or is missing data), our services automatically fall back to a secondary, and even tertiary system if necessary, to satisfy the request.  This switching is built into the software and happens automatically on a request-by-request basis (a modified retry), or globally (a circuit breaker) if a system is consistently failing.  This can increase the response times significantly, so it doesn’t work for all use cases.  Sometimes failing quickly would be preferable, so it is important to understand the requirements of the system.
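The fallback switching is essentially a loop over storage tiers, as in the small sketch below; the fetch callables are hypothetical, and a real version would also feed the circuit breaker and record which tier actually answered so that slow fallbacks show up in the metrics.

```python
def read_with_fallback(key, stores):
    """Try each storage tier in order (primary, secondary, tertiary) and
    return the first successful result; raise only if every tier fails."""
    errors = []
    for fetch in stores:                  # each 'fetch' is a callable: key -> data
        try:
            return fetch(key)
        except Exception as exc:
            errors.append(exc)            # note the failure and move down the chain
    raise RuntimeError("all storage tiers failed: %r" % errors)
```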

There are plenty of other ways that we can make our services self-correcting.  In fact, we have found that most of them are fairly obvious once we started forcing ourselves to think in those terms.  The trick for many developers is accepting that the software should be self-correcting.  It is easy to shift responsibility to others:  “My service didn’t work because the operations people deployed it wrong.”  “Our stuff is down because that other team’s stuff is down.”  “That is the DBA’s problem.  I don’t need to worry about that.”  We need to accept that those are irresponsible positions to take if we want to build highly available, highly scalable, and operationally successful systems.  Every team needs to do everything they can to make sure the systems they are responsible for continue to function – come hell or high water.  We get a real competitive edge if we foster a culture where systems correct problems instead of letting them affect other systems, a culture where our software corrects issues instead of passing the buck.

Some issues are not solvable, or the solution is difficult or impossible to automate.  We need to have a strategy for those situations as well, and that will be the subject of the next post in this series on building operationally successful components.

Dealing with Your Team’s Bell Curve
By Daniel Sands, June 6, 2014

I recently came across this article on the Intuit QuickBase blog and was intrigued by the premise. It asserts that inside any team or organization you will have a bell curve of talent and intelligence – which most would agree with. It’s not a bad thing; it just happens. Regardless of how well staffed you are or how many experts you recruit, there will always be someone who stands out above the rest and someone who lags behind. Lagging behind is, in this case, a very relative matter, and the so-called lagging individual may in fact be producing brilliant work. This curve seems to exist naturally.

While the article discusses how groups respond to the least of the group, my interest was instead piqued by another thought: how do we each perceive ourselves within the group? From where I am standing, where do I think I am on the bell curve? On my own team, I know of individuals who downplay their own value, verbally expressing that others contribute more, respond faster, or excel on whatever criteria you wish to judge by. That perspective can actually be quite dangerous, as someone of great value may view themselves as insufficient. On the other hand, someone who views themselves as a rock star may be all flash and no substance.

More than anything, the concept triggered an awareness of my own team and helped me think a little more about those around me and be more sensitive to issues and circumstances I might not otherwise have considered. All in all, it’s a good read if you have a few minutes.

I’ll echo the author’s question at the end of her article: how has the bell curve on your team affected business culture and team efficacy?
