We have been working hard at Ancestry to build automated server health monitoring into our service infrastructure. One of the most recent efforts was the addition of some simple health tests that our load balancers could run by themselves to check on the health of our servers. We added three types of tests for each server: a simple ICMP ping; an HTTP GET request to a static resource; and a monitor that watches actual response times and HTTP response codes across the wire, looking for errors and abnormal times. The decision was made that if two of these three health monitors reported a failure over a specific time period, the server would be flagged as failing and would be pulled from the pool. Automated health tests are cool, and load balancers that can fix their own problems are extra cool, so the geek factor for this addition was quite high and we were pretty excited to have it working. To play off the famous quote attributed to George Orwell: “We sleep soundly in our beds because rough load balancers stand ready in the night to visit violence on the servers who would do us harm.” If you are thinking ahead, you can probably see the potential for problems.
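The two-of-three rule above is simple enough to sketch. The names and types here are hypothetical (our load balancers implement this internally), but the voting logic is the one described: a server is flagged only when at least two of the three monitors report failure within the evaluation window.

```python
from dataclasses import dataclass

@dataclass
class HealthReport:
    """Results of the three monitors for one server over one window."""
    ping_ok: bool     # ICMP ping succeeded
    http_ok: bool     # HTTP GET of a static resource succeeded
    traffic_ok: bool  # live response times and response codes looked normal

def should_flag(report: HealthReport) -> bool:
    """Flag the server when two or more of the three monitors failed."""
    failures = [not report.ping_ok, not report.http_ok, not report.traffic_ok]
    return sum(failures) >= 2

# Example: ping is fine, but the HTTP check and live traffic both fail.
print(should_flag(HealthReport(ping_ok=True, http_ok=False, traffic_ok=False)))  # prints True
```

Note that an overloaded server tends to fail exactly this way: the ICMP ping still answers, but the HTTP check times out and the wire monitor sees errors, so two votes land and the server gets pulled.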
One day we started to get some unusually heavy traffic. One of our services was being hit especially hard. At one point one of the servers started to return a large number of timeouts or HTTP errors because it was overloaded and was having trouble keeping up. Because the health monitors started showing a failure, they flagged the machine and it was pulled from the pool. Again, if you are thinking ahead, you can predict what happened next. The rest of the servers in that pool were now taking all of the original load, plus the added load of the server that was pulled. So, of course, they also began to be flagged and pulled from the pool one by one. We were monitoring the situation and watched that pool tick down to zero servers as we frantically tried to put servers back in fast enough.

That is what kids these days are calling a “cascading failure,” where the failure of one component triggers failure in other components. In retrospect it all seems terribly obvious, but at the time we were all so excited to have these automated health tests running that we were a little blinded by our geekish enthusiasm. We had inadvertently designed cascading failures into our automated health monitoring.

After thinking things through more clearly in the light of day, we all agreed that a better approach would be to reduce traffic to misbehaving servers instead of pulling them from the pool. Let them cool down a bit and see if they start responding again. If they do not, then maybe it is OK to pull them, but if they are simply taking too much traffic, taking them out of the pool is the last thing you want to do. During this process, we realized that we already use the correct model for several other systems, such as some of our filers and cache servers, which go into a reduced-traffic “dead pool” when they get overloaded and then get put back in the normal pool if cooling down solves their problem.
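The cooldown model can be sketched as a small state machine. This is an illustrative simplification, not our production logic, and the state names and threshold are invented for the example: a failing server drops into a reduced-traffic cooling state instead of being pulled, returns to full traffic if it recovers, and is removed only if it keeps failing after several cooled-down intervals.

```python
from enum import Enum

class ServerState(Enum):
    ACTIVE = "active"    # full share of traffic
    COOLING = "cooling"  # reduced-traffic "dead pool"
    REMOVED = "removed"  # pulled from the pool for re-provisioning

# Hypothetical threshold: how many reduced-traffic intervals a server
# gets before we conclude it is genuinely unhealthy and remove it.
COOLDOWN_ROUNDS = 3

def next_state(state, failing, rounds_cooling):
    """Advance one monitoring interval; returns (new_state, rounds_cooling)."""
    if state is ServerState.ACTIVE:
        # Failing servers are cooled down, not pulled.
        return (ServerState.COOLING, 0) if failing else (ServerState.ACTIVE, 0)
    if state is ServerState.COOLING:
        if not failing:
            # Cooling down solved the problem: back to the normal pool.
            return (ServerState.ACTIVE, 0)
        if rounds_cooling + 1 >= COOLDOWN_ROUNDS:
            # Still failing with reduced traffic: now it is OK to pull it.
            return (ServerState.REMOVED, 0)
        return (ServerState.COOLING, rounds_cooling + 1)
    return (ServerState.REMOVED, 0)
```

The key property is that a merely overloaded server recovers once its traffic share drops, so it bounces back to ACTIVE instead of being removed, and its load is never dumped wholesale onto the rest of the pool.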
Based on what we learned, we have revised the load balancer monitoring so that it follows this model more closely and it has been working great.
The purpose of building automation around load balancing and health tests is to have a more self-healing server pool. Ideally the system would react to problems on a single server by routing traffic away from that server and sending it to healthy servers instead. Once we stopped treating an overloaded server the same way as an unhealthy server, the system started doing what we wanted it to do. Unhealthy servers get pulled from the pool so we can re-provision them (or do whatever else is needed), and the healthy servers take up the slack. But when a server is simply overloaded, it stays in the pool. This is a great place to be. People don’t spend as much time monitoring server pools, and fixing problems requires less manual work.