Building an Operationally Successful Component – Part 2: Self-correction
In my last post I talked about building components that are “operationally successful,” by which I mean that the software functions correctly when it is deployed into production. I suggested that there are three things that a software component must have, to some degree, in order to be operationally successful:
- The component can report on its health (visibility).
- The component can overcome or correct problems itself (self-correction).
- The component can fail quickly and gracefully when it encounters a problem it could not overcome or correct (robustness).
The subject of the last post was visibility. You have to know (and be able to prove to yourself) that your software, and the hardware it is running on, is actually working. But what do we do when the software isn’t working, or there is a problem with one of its dependencies or even the hardware it is running on? If the software doesn’t work, then it isn’t operationally successful, no matter what the actual cause is. In this and the next post, I’ll discuss how our software responds to problems. I am making a distinction between problems that we can do something about (which I’ll cover in this post), and problems we can’t do anything about (which I’ll discuss next time). Of course, that distinction assumes we have the wisdom to know the difference.
Let’s look at three examples of overcoming or correcting common problems that our team currently uses in our services. Obviously self-correction can be extremely sophisticated, even going so far as to automatically rewrite code to adjust for failures, but I don’t think we need to go anywhere near that far to get real benefit from the idea.
A failing server
This is a simple example, and one that I mentioned in the previous post. Assuming that you have enough visibility into your servers (that they can report on their health), you can take corrective action when they are unhealthy. In our system, like many others, the servers sit in a pool behind a load balancer. Our load balancer constantly polls the health of the servers using either our /ping endpoint or our /health endpoint. If a server is found to be failing, the load balancer removes it from the pool. That is a simple step, and most modern load balancers have this feature out of the box. But if the endpoint the balancer calls is more sophisticated than a simple ping, you can make much better decisions about what to do.
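A /health endpoint of the kind described above can be sketched as a handler that aggregates individual checks and returns a status the load balancer can act on. This is a minimal sketch, not our actual implementation; the check names are hypothetical.

```python
def run_check(check):
    """Run a single health check, treating any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False

def health_status(checks):
    """Aggregate named health checks into an HTTP status plus a report.

    `checks` maps a name (e.g. "database") to a zero-argument callable.
    A load balancer polling this endpoint would pull the server from
    the pool on any non-200 response.
    """
    report = {name: run_check(check) for name, check in checks.items()}
    status = 200 if all(report.values()) else 503
    return status, report
```

The per-check report (not just the overall status) is what makes richer decisions possible than a bare ping would allow.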
Removing the server from the pool isn’t always the right approach. You only want to remove it if it is broken, not if it is simply overworked. I think it is crucial that our services provide enough visibility that we can make that distinction. Once a server gets pulled, we can fix it offline while the rest of the pool handles the traffic. We are working towards a more sophisticated approach where failing servers are simply rebuilt and have our code redeployed to them automatically. If you have a fully automated configuration management system like Chef, Puppet, or Ansible, you don’t need a person to rebuild a failing server; the system can do it automatically. This lets the system correct for anything but an actual hardware problem like a failed hard drive or power supply. This is a nice baby step on the road to true elastic capacity, which is the gold standard of self-correcting server pools. Scaling capacity up or down dynamically and replacing problematic servers on the fly is something some companies already do well, and those that do have a huge advantage. In my opinion, it should be a goal in the back of everyone’s mind, even if they are far away from achieving it.
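The broken-versus-overworked distinction above can be captured in a small decision function. This is a hypothetical sketch (the action names and the load threshold are assumptions, not part of any real balancer API), but it shows why the health report needs to carry more than a pass/fail bit.

```python
def pool_action(is_healthy, load, load_threshold=0.9):
    """Decide what to do with a server based on its health report.

    `is_healthy` is the server's own verdict on whether it is broken;
    `load` is a 0.0-1.0 utilization figure from the same report.
    """
    if not is_healthy:
        # Broken: pull it from the pool and rebuild/redeploy offline.
        return "remove-and-rebuild"
    if load > load_threshold:
        # Merely overworked: removing it would shift load onto the
        # rest of the pool and make things worse. Add capacity instead.
        return "add-capacity"
    return "keep"
```

Note that an overloaded server is the one case where the naive "pull anything unhealthy" rule actively hurts, which is why the distinction matters.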
Incorrect data
We often run into data that is incorrect in some way. Our team’s services relate to media, mostly images, and we often run into missing image files, incorrect metadata (like image widths and heights), and images that missed some pre-processing step like thumbnail generation.
We found long ago that these are all easily correctable problems. We have software solutions for each issue that can be applied behind the scenes as we run into problems. We typically throw some kind of error when we run into a case like this, so as an easy first stab at self-correction we built a listener that watched our exception log for specific exceptions. Whenever it found one, it would create a work item for another service that was always running, repairing these specific problems in the background. If a user requests an image with a missing thumbnail, one is automatically generated a few minutes later. If the width and height are incorrect, they are repaired a few minutes later. This obviously doesn’t help the unfortunate user who triggered the exception, but it follows the mantra that we should never make the same mistake twice, so subsequent users always get the corrected metadata.
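The exception-log listener can be sketched as a scanner that matches known exception patterns and enqueues repair work items. The exception names and task names here are hypothetical; the real system watches a live log stream rather than a list of lines.

```python
import re
from queue import Queue

# Hypothetical mapping from log patterns to background repair tasks.
REPAIRS = {
    re.compile(r"MissingThumbnailError\((?P<id>\w+)\)"): "generate_thumbnail",
    re.compile(r"BadDimensionsError\((?P<id>\w+)\)"): "fix_dimensions",
}

def scan_log(lines, work_queue):
    """Turn known exceptions in the log into work items for the
    always-running repair service."""
    for line in lines:
        for pattern, task in REPAIRS.items():
            match = pattern.search(line)
            if match:
                work_queue.put((task, match.group("id")))
```

A consumer on the other end of the queue performs the actual repair, so the user-facing request path never pays the cost.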
We have since enhanced the system so that in addition to waiting for log entries to trigger work items for it, it is constantly running through our data in the background looking proactively for things it can repair. We are currently making further enhancements that will allow our production services to call it directly to report anomalies instead of relying on our exception logs. This will let us report a wider range of issues, and even allows other teams to report issues with our data. We have found that having the system perpetually correcting the data relaxes some of the data integrity requirements for new content coming into the system. It lets us publish data that is mostly correct because we can rely on the automated correction system to repair any problems. Since we are often in a race with our competition to get some new dataset online first, this approach (you could call it “eventual correctness” if you are into that kind of thing) can give us a leg up. We accelerate the publishing timeline, accepting some flaws in the data, with the understanding that the flaws will be repaired automatically.
A failed call to a dependency
When we make a call to a downstream system, and that call fails, we have several options. The simplest option is to fail ourselves and let the failure bubble up the call stack. Obviously this is undesirable, and often unnecessary. One simple way to overcome a failure is to just try again. In many cases a retry is helpful and appropriate, but it depends on the reason for the failure. Retries can actually exacerbate some kinds of issues (e.g. if the dependency is failing because it is overtaxed), so it is important to think through the scenario.
We have been refining our approach to this, and I don’t think we have a perfect solution yet, but it seems to me that a retry is appropriate when additional traffic will not compound the issue and when there is a chance that you’d see different results between identical requests. This can happen when the dependency is behind a load balancer (one request gets routed to a failing server but a subsequent request could go to a functioning server), or if the dependency has some kind of throttling, circuit breaker, or other measures in place that could cause intermittent failures. In these cases, when a request fails, we just try again.
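A retry along the lines described above can be sketched as a small helper. The `Retryable` exception type is a hypothetical marker for the intermittent failures worth retrying (a request that hit one bad node behind a load balancer, a throttling rejection); the backoff keeps retries from compounding an overload.

```python
import time

class Retryable(Exception):
    """Hypothetical marker for failures where an identical request
    might succeed (bad node behind a balancer, throttling, etc.)."""

def call_with_retry(fn, attempts=3, backoff=0.01):
    """Call `fn`, retrying only Retryable failures, with exponential
    backoff so retries don't pile extra traffic onto a struggling
    dependency. Non-retryable exceptions propagate immediately."""
    for i in range(attempts):
        try:
            return fn()
        except Retryable:
            if i == attempts - 1:
                raise
            time.sleep(backoff * (2 ** i))
```

The key design choice is that the caller classifies the failure: a retry is only issued when the failure mode is one where a second identical request could plausibly succeed.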
We have an extreme version of this that I would not advocate as a general rule, but which works in our specific case. We make calls to a third party system that we don’t control, and which has very badly designed load balancing. The load balancer frequently pins us to servers that are failing to service our requests. This results in repeated failures over an extended period. Our solution has been to bypass the third party system’s load balancer and build a software load balancer into our service. This software load balancer maintains a “dead pool” which lists individual nodes that have been misbehaving. We avoid those nodes until they stop acting up, distributing calls to the working nodes instead. If we hit a node we thought was working and discover it is failing, we add it to the dead pool and retry on another node.
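The dead-pool idea can be sketched as a small client-side balancer. This is a simplified illustration, not our production code: the quarantine period and the blanket `except Exception` are assumptions for brevity.

```python
import time

class DeadPoolBalancer:
    """Client-side load balancer that avoids recently failing nodes."""

    def __init__(self, nodes, quarantine_seconds=60.0):
        self.nodes = list(nodes)
        self.quarantine = quarantine_seconds
        self.dead = {}  # node -> time it was marked dead

    def live_nodes(self, now=None):
        now = time.monotonic() if now is None else now
        # Nodes leave the dead pool once their quarantine expires,
        # giving them a chance to prove they have recovered.
        self.dead = {n: t for n, t in self.dead.items()
                     if now - t < self.quarantine}
        return [n for n in self.nodes if n not in self.dead]

    def mark_dead(self, node, now=None):
        self.dead[node] = time.monotonic() if now is None else now

    def call(self, request_fn):
        """Try live nodes in order; dead-pool any that fail and
        retry the request on the next working node."""
        for node in self.live_nodes():
            try:
                return request_fn(node)
            except Exception:
                self.mark_dead(node)
        raise RuntimeError("all nodes are in the dead pool or failing")
```

This effectively reimplements, on the client side, what the third party’s load balancer should have been doing for us.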
Another approach we take to overcoming failing dependencies is to have designated fallback systems in place. We typically take this approach for databases and storage systems, and ideally the fallback is geographically separate, and contains replicated data. If the primary system is failing (or is slow, or overtaxed, or is missing data), our services automatically fall back to a secondary, and even tertiary system if necessary, to satisfy the request. This switching is built into the software and happens automatically on a request-by-request basis (a modified retry), or globally (a circuit breaker) if a system is consistently failing. This can increase the response times significantly, so it doesn’t work for all use cases. Sometimes failing quickly would be preferable, so it is important to understand the requirements of the system.
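The request-by-request fallback can be sketched as an ordered walk over backends. The backend callables here are placeholders; a full version would add per-backend timeouts and the circuit breaker mentioned above for consistently failing systems.

```python
def fetch_with_fallback(request, backends):
    """Try an ordered list of backends (primary, then secondary,
    then tertiary replica) until one satisfies the request.

    Each backend is a callable that raises on failure. Latency grows
    with each fallback, so this suits requests that value availability
    over speed; for latency-sensitive paths, failing fast may be better.
    """
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors}")
```

Collecting the individual errors before raising preserves the visibility we need to diagnose why every tier failed.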
There are plenty of other ways that we can make our services self-correcting. In fact, we have found that most of them are fairly obvious once we start forcing ourselves to think in those terms. The trick for many developers is accepting that the software should be self-correcting. It is easy to shift responsibility to others: “My service didn’t work because the operations people deployed it wrong.” “Our stuff is down because that other team’s stuff is down.” “That is the DBA’s problem. I don’t need to worry about that.” We need to accept that those are irresponsible positions to take if we want to build highly available, highly scalable, and operationally successful systems. Every team needs to do everything they can to make sure the systems they are responsible for continue to function – come hell or high water. We get a real competitive edge if we foster a culture where systems correct problems instead of letting them affect other systems: a culture where our software corrects issues instead of passing the buck.
Some issues are not solvable, or the solution is difficult or impossible to automate. We need to have a strategy for those situations as well, and that will be the subject of the next post in this series on building operationally successful components.