Building an Operationally Successful Component – Part 3: Robustness
My previous two posts discussed building components that are “operationally successful.” To me, a component cannot be considered successful unless it actually operates as expected when released into the wild. Something that “works on my machine” cannot be considered a success unless it also works on the machine it will ultimately be running on. For our team at Ancestry.com, we have found that we can ensure (or at least facilitate) operational success by following these criteria:
- The component can report on its health (visibility).
- The component can overcome or correct problems itself (self-correction).
- The component can fail quickly and gracefully when it encounters a problem it could not overcome or correct (robustness).
In the final post in this series I want to discuss how to handle problems that the component cannot correct or overcome. I am calling this robustness, although you could easily argue that overcoming and correcting problems is also a part of robustness. The main distinction I want to make between what I called “self-correction” and what I am calling “robustness” is that there are problems that the system can overcome and still return a correct result, and there are problems that prevent the system from providing a correct result. My last post discussed how to move as many problems as possible into the first category, and this post will discuss what we do about the problems left over in the second.
I propose that there are three things that should happen when a system encounters a fatal error:
- Degraded response – The system should provide a degraded response if possible and appropriate.
- Fail fast – The system should provide a response to the calling application as quickly as possible.
- Prevent cascading failures – The system should do everything it can to prevent the failure from cascading up or down and causing failures in other systems.
Degraded Response
A degraded response can be very helpful in creating failure-resistant software systems. Frequently a component will encounter a problem, like a failed downstream dependency, that prevents it from returning a fully correct response. But often in those cases the component already has much of the data it needed for a correct response. It can be extremely helpful to return that partial data to the calling application, because it allows that application to provide a degraded response in turn to its clients, and so on up the chain to the UI layer. Human users typically prefer a degraded response to an error; it is usually the software in the middle that isn’t smart enough to handle one. For example, we have a service that returns a batch of security tokens to the calling application. In many cases there may be a problem with a single token while the rest were generated correctly. In these cases we can provide the correct tokens to the calling application along with an error for the one(s) that failed. To the end user, this results in the UI displaying a set of images, a few of which don’t load, which most people would agree is preferable to an error page.
The major argument against degraded responses is that they can be confusing for the client application. A service that is unpredictable can be very difficult to work with. Mysteriously returning data in some cases but not in others makes for a bad experience for developers consuming your service. Because of this, when your service responds to client applications, it is important to clearly distinguish between a full response and a partial response. I have become a big fan of the HTTP 206 status code – “Partial Content.” When our clients see that code, they know that there was some kind of failure, and if they aren’t able to handle a partial response, they can treat the response as a complete failure. But at least we gave them the option to treat the response as a partial success if they are able to.
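To make the token example concrete, here is a minimal Python sketch of a batch operation that reports partial success explicitly. It is not our actual service: the names `issue_tokens`, `BatchResult`, and the `make_token` callback are hypothetical, and the choice of 500 for a total failure is an assumption. Note also that HTTP formally defines 206 for range requests, so using it to flag a degraded response is a convention the service and its clients have to agree on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

HTTP_OK = 200
HTTP_PARTIAL = 206  # used here, by convention, to flag a degraded response
HTTP_ERROR = 500    # assumption: status chosen when nothing succeeded


@dataclass
class BatchResult:
    status: int
    tokens: Dict[str, str] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


def issue_tokens(item_ids: List[str],
                 make_token: Callable[[str], str]) -> BatchResult:
    """Generate one token per item, keeping the successes when some fail."""
    tokens: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for item_id in item_ids:
        try:
            tokens[item_id] = make_token(item_id)
        except Exception as exc:
            # Record the failure for this item instead of aborting the batch.
            errors[item_id] = str(exc)

    if not errors:
        status = HTTP_OK        # full response
    elif tokens:
        status = HTTP_PARTIAL   # partial response: some tokens, some errors
    else:
        status = HTTP_ERROR     # nothing usable to return
    return BatchResult(status, tokens, errors)
```

A client that cannot handle partial data only has to check `status` and treat anything other than 200 as a failure; a smarter client can use `tokens` and decide what to do about `errors`.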
In many ways I see the failure to use degraded responses as a cultural problem for development organizations. It is important to cultivate a development culture where client applications and services all expect partial or degraded responses. It should be clear to all developers that services are expected to return degraded responses if they are having problems, and client applications are expected to handle degraded responses, and the functional tests should reflect these expectations. If everyone in the stack is afraid that their clients won’t be able to handle a degraded response, then everyone is forced to fail completely, even if they could have partially succeeded. But if everyone in the stack expects and can handle partial responses, then it frees up everyone else in the stack to start returning them. Chicken and egg, I know, but even if we can’t get everyone on board right away, we can all take steps to push the organizations we work with in the right direction.
Fail Fast
When a component encounters a fatal exception that doesn’t allow for even a partially successful response, it has a responsibility to fail as quickly as possible. It is inefficient to consume precious resources processing requests that will ultimately fail. Your component shouldn’t be wasting CPU cycles, memory, and time on something that in the end isn’t going to provide value to anyone. And if the call into your component is a blocking call, then you are forcing your clients to waste CPU cycles, memory, and time as well. What this means is that it is important to try to detect failure as early in your request flow as possible. This can be difficult if you haven’t designed for it from the beginning. In legacy systems that weren’t built this way, it can result in some duplication of validation logic, but in my experience, the extra effort has always paid off once we got the system into production.
As soon as a request comes into the system, the code should do everything it can to determine whether it is going to be able to successfully process the request. At its most basic level, this means validating request data for correct formatting and usable values. On a more sophisticated level, components can (and should) track the status of their downstream dependencies and change their behavior if they sense problems. If a component senses that a dependency is unavailable, requests that require that dependency should fail without the component even calling it. People often refer to this kind of thing as a circuit breaker. A sophisticated circuit breaker will monitor the availability and response times of a dependency, and if the dependency stops responding or its response times get unreasonably long, the circuit breaker will drastically reduce the traffic it sends to the struggling system until it starts to respond normally again. Frequently this type of breaker will let a trickle of requests through so that it can quickly sense when the dependency issue is corrected. This is a great way to fail as fast as possible; in fact, if you build your circuit breakers correctly, you can fail almost instantly when there is no chance of successfully processing the request.
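As an illustration of the idea, here is a deliberately small circuit breaker in Python. It is a sketch under simplifying assumptions, not a production implementation: it counts consecutive failures only (it does not track response times), it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `probe_interval` are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 probe_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval  # seconds between probe calls
        self.clock = clock                    # injectable for testing
        self.consecutive_failures = 0
        self.next_probe_at = 0.0

    def call(self, func: Callable[[], T]) -> T:
        if self.consecutive_failures >= self.failure_threshold:
            now = self.clock()
            if now < self.next_probe_at:
                # Fail fast: the dependency is never called.
                raise CircuitOpenError("dependency marked unavailable")
            # Let one probe request through and schedule the next probe.
            self.next_probe_at = now + self.probe_interval
        try:
            result = func()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.next_probe_at = self.clock() + self.probe_interval
            raise
        # Any success closes the breaker again.
        self.consecutive_failures = 0
        return result
```

While the breaker is open, a rejected call costs a counter comparison and a clock read, and the probe call every `probe_interval` seconds plays the role of the trickle of requests that detects recovery.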
Prevent Cascading Failures
Circuit breakers can also help implement my last suggestion, which is that a component should aggressively try to prevent failures from cascading outside of its boundaries. In some ways this is the culmination of everything I have discussed in this post and my previous post. If a system encounters a problem and fails more severely than was necessary (it does not self-correct and does not provide a degraded response), or the failure takes as long as or longer than a successful request (which is actually common if you are retrying or synchronously logging the exception), then it can propagate the failure up to its calling applications. Similarly, if the system encounters a problem that results in increased traffic to a downstream dependency, it has propagated the issue down to its dependencies. Think of retries again: a dependency fails because it is overloaded, so the system calls it again, and again, compounding the issue.
Every component in the stack needs to take responsibility for containing failures. The rule we try to follow on our team is that a failure should be faster and result in less downstream traffic than a success. There are valid exceptions to that rule, but every system should start with that rule and only break it deliberately when the benefit outweighs the risk. Circuit breakers can be tremendously valuable in these scenarios because they can make the requests fail nearly instantaneously and/or throttle the traffic down to an acceptable level, as a cross-cutting concern that only has to be built once. That is a more attractive option than building complex logic around each incoming request and each call to a downstream dependency (aspect-oriented programming, anyone?).
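Retries are the usual way a failure ends up generating more downstream traffic than a success. One way to keep them in check, alongside a circuit breaker, is a retry budget that only permits retries as a fraction of regular traffic. The Python sketch below is an assumption-laden illustration of that technique, not something our team's code is known to contain; `RetryBudget`, `call_with_budget`, `ratio`, and `max_tokens` are hypothetical names, and the code is not thread-safe.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudget:
    """Earns a fraction of a retry per request; each retry spends one token."""

    def __init__(self, ratio: float = 0.1, max_tokens: float = 10.0) -> None:
        self.ratio = ratio            # retries earned per first attempt
        self.max_tokens = max_tokens  # cap so old traffic cannot bank retries
        self.tokens = 0.0

    def record_request(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(func: Callable[[], T], budget: RetryBudget,
                     max_attempts: int = 2) -> T:
    """Call func, retrying only while attempts and budget both allow it."""
    budget.record_request()
    attempt = 1
    while True:
        try:
            return func()
        except Exception:
            # Out of attempts or out of budget: fail now rather than add load.
            if attempt >= max_attempts or not budget.try_spend():
                raise
            attempt += 1
```

With `ratio=0.1`, a dependency that is failing every call sees roughly 10% extra traffic from retries at most, instead of the doubling or tripling that unconditional retries would produce.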
If development teams take to heart their personal responsibility to ensure that their code runs correctly in production, as opposed to throwing it over the wall to another team, or even to the customer, the result is going to be software that is healthier, more stable, more highly available, and more successful.