On our team at Ancestry.com, we spend much of our time focusing on the operational success of the code that we write. An amazing feature that no one can use because the servers are constantly down is a failure, even if the code works from a functional perspective. Many developers resist this attitude because, in many organizations, they do not have enough control over the operational aspects of their systems to take that kind of ownership. As more and more organizations adopt DevOps principles, this will have to change. As I see it, the mantra of a DevOps-minded software engineer is: “I will not abdicate responsibility for the operational success of my component. It is my software, so it is my job to make sure it is succeeding operationally.”

Operational success is a feature that you build into your software, just like any other feature. In my experience, there are three attributes that a software component must have, to some degree, in order to be operationally successful:
- The component can report on its health (visibility).
- The component can overcome or correct problems itself (self-correction).
- The component can fail quickly and gracefully when it encounters a problem it could not overcome or correct (robustness).
On our team, we are constantly striving to improve the degree to which our software has these three attributes. This post will cover some of the things that we are doing on our team, and at Ancestry.com in general, to improve the first attribute: visibility.
Is your component running? How well is it running? Are its dependencies reachable? Do you even know what its dependencies are? Can someone who doesn’t know anything about the component quickly get usable information about its state? We can gather and expose a great deal of information about our software quite easily. On our team, we have built a framework for dealing with what we think of as diagnostic data. Our components are typically services reachable via HTTP, so we expose a number of diagnostic endpoints that we or other teams can use to get a peek into the health of the component. Conceptually, these endpoints are:
/ping – This endpoint returns a simple heartbeat from the system. It quickly demonstrates that the server is set up to handle HTTP traffic: the web server is installed and running, our code is deployed and reasonably configured, and the server is routing requests to it correctly. Obviously that isn’t everything we need to know to be sure our system is working, but it is a great start, and the call returns quickly enough that we can make a large number of requests to it without impacting the performance of the system. We use this type of endpoint as a heartbeat to ensure that broken servers don’t take traffic.
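The core of such an endpoint is deliberately tiny. Here is a minimal sketch in Python (our services are not necessarily written in Python, and the routing is reduced to a single function for clarity; in a real component this would hook into whatever HTTP framework the service already uses):

```python
def handle_ping(path: str):
    """Return an (HTTP status, body) pair for a request path.

    Reaching this code at all is the real signal: it proves the web
    server is up, our code is deployed, and routing is configured.
    """
    if path == "/ping":
        return 200, "pong"  # deliberately cheap: no dependencies touched
    return 404, "not found"
```

Because the handler touches no dependencies, a load balancer or monitor can poll it aggressively without adding meaningful load.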
/health – This endpoint runs a suite of health tests that we have built into our components. The purpose of these tests is to assert on the health of various aspects of the system. We have broken them into three categories: general health tests, dependency tests, and required dependency tests. The general tests check things like the version of code deployed, configuration settings, and other aspects of the system that need to be correct for it to function. Dependency tests do things like ensuring that our IOC system correctly injected the right types for the various dependencies, and ensuring that each system we depend on is reachable and responding. We make a distinction between required and non-required dependencies. If a required dependency is down, the system will not be able to correctly handle traffic (something like a database that doesn’t have a viable fallback). If a non-required dependency is down, the system will continue to handle user requests, but may not be able to log errors or report its statistics. Any component we build that depends on any other system is required to have a health test suite built into it. These suites are discoverable using reflection, so as we add components to or remove them from the system, the health test engine automatically finds all the tests and runs them.
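To make the discovery idea concrete, here is a sketch of how such a suite might hang together. Our actual components use reflection; this Python version uses subclass discovery as the analogue. The test classes, categories, and results shown are invented for illustration:

```python
from enum import Enum

class Category(Enum):
    GENERAL = "general"
    DEPENDENCY = "dependency"
    REQUIRED_DEPENDENCY = "required_dependency"

class HealthTest:
    """Base class for health tests; the engine discovers every subclass."""
    category = Category.GENERAL

    def run(self):
        """Return a (passed, detail) pair."""
        raise NotImplementedError

class DeployedVersionTest(HealthTest):
    # Hypothetical general test: report the deployed code version.
    def run(self):
        return True, "version 1.2.3"

class DatabaseReachableTest(HealthTest):
    # Hypothetical required-dependency test: this database has no fallback.
    category = Category.REQUIRED_DEPENDENCY

    def run(self):
        return True, "database reachable"

def run_health_suite():
    """Find every HealthTest subclass and run it. New tests are picked up
    automatically; there is no central registry to maintain."""
    results = []
    for cls in HealthTest.__subclasses__():
        passed, detail = cls().run()
        results.append({"test": cls.__name__,
                        "category": cls.category.value,
                        "passed": passed,
                        "detail": detail})
    return results
```

The /health endpoint then just serializes the result list, and callers can filter on the category to decide whether a failure means "stop sending traffic" or merely "something is degraded."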
/statistics – We have instrumented our code so that whenever something we are interested in happens, we record some data about it. We track all sorts of things, from the number of individual requests a machine is taking, to the network bandwidth it is using, to the rate of exceptions encountered by the server. Each component gathers this data up and exposes it to systems that ask for it. We can then pull the data from all our servers periodically and dump it into a central reporting system to generate graphs and other visualizations of what is happening on our machines. This puts us in a place where whenever we have a new question about the system that we aren’t getting an answer to, all we have to do is add instrumentation for it, and we can see the data we need to make an informed decision. We frequently add temporary counters to help us debug specific problems or to gather metrics needed by the business.
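The instrumentation side can be as simple as a thread-safe, in-process counter registry. A sketch, with invented counter names (the /statistics endpoint would simply serialize a snapshot of the registry):

```python
import threading
from collections import defaultdict

class Statistics:
    """Thread-safe counter registry; one instance lives in the process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)

    def increment(self, name, amount=1):
        """Record that an interesting event happened."""
        with self._lock:
            self._counters[name] += amount

    def snapshot(self):
        """A consistent copy for the /statistics endpoint to serialize."""
        with self._lock:
            return dict(self._counters)

stats = Statistics()
stats.increment("requests")    # one per request handled
stats.increment("requests")
stats.increment("exceptions")  # one per exception caught
```

Adding a new counter is one `increment` call at the point of interest, which is what makes "just add instrumentation for it" a fifteen-minute job rather than a project.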
These three fairly rudimentary endpoints give us a tremendous amount of visibility into the system. They are all automatable (“an API for everything” as Amazon would say), and they are easily extendable. We have found that with these three endpoints, there is virtually no problem that we cannot quickly diagnose. Here are some examples of how we use these tools:
- The load balancer monitors the /ping endpoint to see if it needs to add or remove servers from the pool.
- The company-wide statistical gathering system pulls from our /statistics endpoint.
- Our team runs through the /health test suite using an automated tool whenever we think there is risk, like right after we roll out new code or when someone reports site problems.
- There is a company-wide monitor that tries to walk the dependency tree when there is a site issue to determine exactly how deep the problem lies. We have mapped our required dependency tests at the /health endpoint into this system (two for the price of one!).
- There is a system in place to watch the data collected from the /statistics endpoint and send out notifications if the values rise above or drop below a threshold.
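The threshold watcher in the last bullet amounts to a comparison over the pulled counters. A sketch, with invented counter names and limits:

```python
def check_thresholds(counters, limits):
    """Return an alert message for each counter outside its (low, high) band.

    counters: name -> latest value pulled from /statistics
    limits:   name -> (low, high) acceptable range
    """
    alerts = []
    for name, (low, high) in limits.items():
        value = counters.get(name, 0)  # a missing counter reads as zero
        if value < low:
            alerts.append(f"{name}={value} is below {low}")
        elif value > high:
            alerts.append(f"{name}={value} is above {high}")
    return alerts
```

Treating a missing counter as zero usually trips the low-water mark, which is exactly what you want when a server silently stops reporting.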
This framework is even good for situations where we hadn’t anticipated the problem beforehand:
- We discovered that we didn’t have a good way to know if an individual server was actually in the load balancer pool or not (i.e. is it taking traffic?). Well guess what, one new health test at /health and some better monitoring of existing counters at /statistics and now we know if our servers are dropping out or being removed!
- We found that our deployment system was occasionally failing to deploy the correct version of code, choosing instead to redeploy the existing version (out of spite I guess?). First we added a test at /health that simply reports the code version (helpful for a human who might read the test result but not automatable). We had the test deployed to our production environment within 15 minutes of having the idea. Next we added a counter at /statistics so that we could graph the code version on each machine in the pool. If we ever see two lines, then we know something went wrong. A single line means all servers are on the same version. Again, 15 minutes after the idea, we were live with the statistics (single line, phew!). Later, when we had more time, we came back and added a health test that looks at the code version and compares it to the deployment system’s records (change management anyone?) and we can actually assert that the code version is correct or incorrect. This took a day or so to write and then we rolled it out right away.
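Both version checks described above fit in a few lines each. A sketch with hypothetical version strings (the real health test pulls the expected version from the deployment system's records):

```python
def version_health_test(deployed, expected):
    """Assert the deployed code version matches the deployment record."""
    return deployed == expected, f"deployed {deployed}, expected {expected}"

def fleet_versions(machine_versions):
    """Distinct versions across the pool. More than one distinct value is
    the 'two lines on the graph' condition described above."""
    return set(machine_versions.values())
```

The first function is the assertion-style /health test; the second is the condition the /statistics graph makes visible across a whole pool at once.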
My point here is that having some kind of framework that allows us to quickly gain visibility into specific operational problems has proven to be invaluable. Bad load balancing? Now we can see it. Bad deployment? Now we can see that too. In fact, I have yet to find an operational issue that our tools cannot begin identifying within a day or so of us realizing it is a problem, even if it is something brand new that we would never have dreamed up. This gives us a tremendous amount of confidence that our system is running the way we painstakingly designed it to run, which means that the cool new feature we slaved over is actually going to provide value instead of failing because it won’t run correctly.
In the next two posts, I’ll discuss some of the ways we have made our components be error-resistant and self-correcting, and some of the ways we help prevent site-wide catastrophes by not allowing problems to cascade out of control.