Building an Operationally Successful Component – Part 1: Transparency

Posted by Geoff Rayback on April 14, 2014 in DevOps

On our team at Ancestry.com, we spend much of our time focusing on the operational success of the code that we write.  An amazing feature that no one can use because the servers are constantly down is a failure, even if the code works from a functional perspective.  Many developers resist this attitude because, in many organizations, they do not have enough control over the operational aspects of their systems to take that kind of ownership.  As more and more organizations adopt DevOps principles, this will have to change.  As I see it, the mantra of a DevOps-minded software engineer is: “I will not abdicate responsibility for the operational success of my component.  It is my software, so it is my job to make sure it is succeeding operationally.”  Operational success is a feature that you build into your software, just like any other feature.  In my experience, there are three attributes that a software component must have, to some degree, in order to be operationally successful:

  1. The component can report on its health.
  2. The component can overcome or correct problems itself.
  3. The component can fail quickly and gracefully when it encounters a problem it could not overcome or correct.

On our team, we are constantly striving to improve the degree to which our software has these three attributes.  This post will cover some of the things that we are doing on our team, and at Ancestry.com in general, to improve the first attribute, which has to do with transparency.

Is your component running?  How well is it running?  Are its dependencies reachable?  Do you even know what its dependencies are?  Can someone who doesn’t know anything about the component quickly get usable information about its state?  We can gather and expose a great deal of information about our software quite easily.  On our team, we have built up a framework for dealing with what we think of as diagnostic data.  Our components are typically services reachable via HTTP, so we expose a number of diagnostic endpoints that we or other teams can use to get a peek into the health of the component.  Conceptually, these endpoints are:

/ping – This endpoint returns a simple heartbeat from the system.  A successful response quickly demonstrates that the server is set up to handle HTTP traffic: the web server is installed and running, our code is deployed and configured reasonably correctly, and the server can find and serve it.  Obviously that isn’t everything we need to know to be sure our system is working, but it is a great start, and the call returns quickly enough that we can make a large number of requests to it without impacting the performance of the system.  We use this type of endpoint as a heartbeat to ensure that broken servers don’t take traffic.
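
To make this concrete, here is a minimal sketch of what a heartbeat endpoint could look like.  Python and Flask are assumptions for illustration only; this is not a description of our actual stack, just the shape of the idea.

```python
# Minimal sketch of a heartbeat endpoint.  Flask is an assumption for
# illustration; the /ping route name matches the concept described above.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/ping")
def ping():
    # Deliberately does no real work: a 200 response proves only that the
    # web server is up, the code is deployed, and routing is configured.
    return jsonify(status="ok"), 200

if __name__ == "__main__":
    app.run(port=8080)
```

Because the handler does no real work, a load balancer or monitor can hit it very frequently without adding meaningful load.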

/health – This endpoint runs a suite of health tests that we have built into our components.  The purpose of these tests is to assert on the health of various aspects of the system.  We have broken them into three categories: general health tests, dependency tests, and required dependency tests.  The general tests check things like the version of code deployed, configuration settings, and other aspects of the system that need to be correct for it to function.  Dependency tests do things like ensuring that our IOC system correctly injected the right types for the various dependencies, and ensuring that each system we depend on is reachable and responding.  We make a distinction between required and non-required dependencies.  If a required dependency is down, the system will not be able to correctly handle traffic (something like a database that doesn’t have a viable fallback).  If a non-required dependency is down, the system will continue to handle user requests, but may not be able to log errors or report its statistics.  Any component we build that depends on any other system is required to have a health test suite built into it.  These suites are discoverable using reflection, so as we add components to or remove them from the system, the health test engine automatically finds all the tests and runs them.
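
As a rough illustration of how such a suite can hang together, here is a sketch in Python.  The class names, the specific checks, and the discovery via `__subclasses__` are all assumptions standing in for the reflection-based engine described above.

```python
# Sketch of a health test suite with the three categories described above.
# The test classes and the discovery mechanism are illustrative assumptions.
from enum import Enum

class Category(Enum):
    GENERAL = "general"
    DEPENDENCY = "dependency"
    REQUIRED_DEPENDENCY = "required_dependency"

class HealthTest:
    """Base class; subclasses are discovered and run automatically."""
    category = Category.GENERAL

    def run(self):
        """Return (passed, message)."""
        raise NotImplementedError

class CodeVersionTest(HealthTest):
    category = Category.GENERAL

    def run(self):
        version = "1.2.3"  # hypothetical: read from the deployed build's metadata
        return True, f"code version {version}"

class PrimaryDatabaseTest(HealthTest):
    category = Category.REQUIRED_DEPENDENCY

    def run(self):
        reachable = True  # hypothetical connectivity check would go here
        return reachable, ("primary database reachable" if reachable
                           else "primary database unreachable")

def run_health_tests():
    """Discover every HealthTest subclass (the Python stand-in for
    reflection) and run it, returning one result per test."""
    results = []
    for test_cls in HealthTest.__subclasses__():
        passed, message = test_cls().run()
        results.append({
            "test": test_cls.__name__,
            "category": test_cls.category.value,
            "passed": passed,
            "message": message,
        })
    return results

if __name__ == "__main__":
    for result in run_health_tests():
        print(result)
```

The useful property is that adding a new test is just adding a new class; the engine picks it up without any registration step.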

/statistics – We have instrumented our code so that whenever something we are interested in happens, we record some data about it.  We track all sorts of stuff, from the number of individual requests a machine is taking, to the network bandwidth it is using, to the rate of exceptions encountered by the server.  Each component gathers this data up and exposes it to systems that ask for it.  We can then pull the data from all our servers periodically and dump it into a central reporting system to generate graphs and other visualizations of what is happening on our machines.  Whenever we have a new question about the system that we aren’t getting an answer to, all we have to do is add instrumentation for it, and we can see the data we need to make an informed decision.  We frequently add temporary counters to help us debug specific problems or to gather metrics needed by the business.
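
A thread-safe counter registry plus an endpoint that dumps it is enough to get started.  The sketch below is illustrative only; the counter names and the Flask endpoint are assumptions, not our actual instrumentation library.

```python
# Sketch of simple counter instrumentation exposed over HTTP.  The counter
# names and the Flask endpoint are illustrative assumptions.
import threading
from collections import defaultdict
from flask import Flask, jsonify

class Counters:
    """Thread-safe named counters that application code can bump anywhere."""
    def __init__(self):
        self._lock = threading.Lock()
        self._values = defaultdict(int)

    def increment(self, name, amount=1):
        with self._lock:
            self._values[name] += amount

    def snapshot(self):
        with self._lock:
            return dict(self._values)

counters = Counters()
app = Flask(__name__)

@app.route("/statistics")
def statistics():
    # A central reporting system can poll this periodically and graph it.
    return jsonify(counters.snapshot())

# Example instrumentation sprinkled through application code:
#   counters.increment("requests.handled")
#   counters.increment("exceptions.unhandled")
```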

These three fairly rudimentary endpoints give us a tremendous amount of visibility into the system.  They are all automatable (“an API for everything” as Amazon would say), and they are easily extendable.  We have found that with these three endpoints, there is virtually no problem that we cannot quickly diagnose.  Here are some examples of how we use these tools:

  • The load balancer monitors the /ping endpoint to see if it needs to add or remove servers from the pool.
  • The company-wide statistical gathering system pulls from our /statistics endpoint.
  • Our team runs through the /health test suite using an automated tool whenever we think there is risk, like right after we roll out new code or when someone reports site problems.
  • There is a company-wide monitor that tries to walk the dependency tree when there is a site issue and determine exactly how deep the problem lies.  We have mapped our required dependency tests at the /health endpoint into this system (two for the price of one!).
  • There is a system in place to watch the data collected from the /statistics endpoint and send out notifications if the values rise above or drop below a threshold (a minimal sketch of that kind of check follows this list).
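
The alerting side of that can be as simple as comparing each polled value to configured bounds.  Here is a sketch; the threshold values, counter names, and notify function are hypothetical.

```python
# Sketch of threshold-based alerting over polled /statistics data.  The
# thresholds, counter names, and notify() hook are hypothetical examples.
THRESHOLDS = {
    # counter name: (lower bound, upper bound); None means "don't check"
    "requests.handled":     (100, None),  # too few requests suggests a dead pool member
    "exceptions.unhandled": (None, 50),   # too many exceptions suggests a bad rollout
}

def notify(message):
    # Hypothetical hook: in practice this would page or email the team.
    print("ALERT:", message)

def check_thresholds(stats):
    """Compare one snapshot of counter values against the configured bounds."""
    for name, (low, high) in THRESHOLDS.items():
        value = stats.get(name, 0)
        if low is not None and value < low:
            notify(f"{name} = {value} dropped below {low}")
        if high is not None and value > high:
            notify(f"{name} = {value} rose above {high}")

# Example: check_thresholds({"requests.handled": 42, "exceptions.unhandled": 7})
```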

This framework is even good for situations where we hadn’t anticipated the problem beforehand:

  • We discovered that we didn’t have a good way to know if an individual server was actually in the load balancer pool or not (i.e. is it taking traffic?).  Well guess what, one new health test at /health and some better monitoring of existing counters at /statistics and now we know if our servers are dropping out or being removed!
  • We found that our deployment system was occasionally failing to deploy the correct version of code, choosing instead to redeploy the existing version (out of spite, I guess?).  First we added a test at /health that simply reports the code version (helpful for a human who might read the test result, but not automatable).  We had the test deployed to our production environment within 15 minutes of having the idea.  Next we added a counter at /statistics so that we could graph the code version on each machine in the pool.  A single line means all servers are on the same version; if we ever see two lines, we know something went wrong.  Again, 15 minutes after the idea, we were live with the statistics (single line, phew!).  Later, when we had more time, we came back and added a health test that looks at the code version and compares it to the deployment system’s records (change management, anyone?), so we can actually assert whether the code version is correct (a sketch of that kind of check follows this list).  This took a day or so to write, and then we rolled it out right away.
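
Here is roughly what that last test could look like.  How each side of the comparison is read is entirely hypothetical; the version file and the deployment system URL below are stand-ins that depend on your build and deployment tooling.

```python
# Sketch of a health test that asserts the deployed code version matches the
# deployment system's records.  The version.json file and the deployment
# system endpoint are hypothetical stand-ins.
import json
import urllib.request

def deployed_version(path="version.json"):
    # Hypothetical: the build stamps a version file next to the binaries.
    with open(path) as f:
        return json.load(f)["version"]

def expected_version(url="http://deploy.example.internal/api/expected-version"):
    # Hypothetical: the deployment system reports what it thinks it deployed.
    with urllib.request.urlopen(url, timeout=2) as response:
        return json.load(response)["version"]

def code_version_test():
    """Passes only when the running code matches the deployment records."""
    actual, expected = deployed_version(), expected_version()
    passed = actual == expected
    return passed, f"deployed {actual}, deployment system expected {expected}"
```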

My point here is that having some kind of framework that allows us to quickly gain visibility into specific operational problems has proven to be invaluable.  Bad load balancing?  Now we can see it.  Bad deployment?  Now we can see that too.  In fact, I have yet to find an operational issue that our tools cannot begin identifying within a day or so of us realizing it is a problem, even if it is something brand new that we would never have dreamed up.  This gives us a tremendous amount of confidence that our system is running the way we painstakingly designed it to run, which means that the cool new feature we slaved over is actually going to provide value instead of failing because it won’t run correctly.

In the next two posts, I’ll discuss some of the ways we have made our components error-resistant and self-correcting, and some of the ways we help prevent site-wide catastrophes by not allowing problems to cascade out of control.
