Ad Serving Outage
Incident Report for Kevel
Resolved
The load balancer that routes traffic to our ad servers has a health check that determines whether a given node in the cluster is operating correctly. It does this by collecting periodic "heartbeat" messages from different processes that are expected to be running on the node. If these heartbeats are not received within a certain period of time, the node is marked as unhealthy and removed from operation.
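The heartbeat mechanism described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the class name, the process names, and the timeout value are all hypothetical, and the injectable clock exists only to make the example easy to exercise.

```python
import time


class HeartbeatHealthCheck:
    """Illustrative per-node health check (hypothetical, not the real implementation)."""

    def __init__(self, expected_processes, timeout_seconds, clock=time.monotonic):
        self.expected = set(expected_processes)  # processes that must report in
        self.timeout = timeout_seconds           # max allowed gap between heartbeats
        self.clock = clock                       # injectable so the sketch can be tested
        self.last_seen = {}                      # process name -> time of last heartbeat

    def record_heartbeat(self, process_name):
        """Called whenever a process on the node reports that it is alive."""
        self.last_seen[process_name] = self.clock()

    def is_healthy(self):
        """Healthy only if every expected process has reported within the timeout."""
        now = self.clock()
        for name in self.expected:
            seen = self.last_seen.get(name)
            if seen is None or now - seen > self.timeout:
                return False
        return True
```

Note that in this sketch a single silent process is enough to mark the whole node unhealthy, which is the property that matters for the rest of this report.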

We recently deprecated a process in our ad serving system that is no longer required. As we shut down this process across all nodes in the cluster, it stopped sending heartbeats to the health check process. Since the health check on each node expected heartbeats from this deprecated process, it began to report that the node was unhealthy. In turn, this caused the load balancer to remove all nodes from operation.

We noticed the problem immediately and restarted the deprecated process, causing the health check to succeed again. The load balancer then re-added all nodes to the active cluster. We have since changed the health check configuration so that it no longer expects heartbeats from this deprecated process.
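To illustrate why the configuration change resolves the issue, here is a small sketch under stated assumptions: the node names, process names, timestamps, and timeout are invented for the example, and the functions are hypothetical stand-ins for the health check and load balancer logic.

```python
def missing_heartbeats(expected, last_seen, now, timeout_seconds):
    """Return expected process names with no heartbeat within the timeout."""
    return sorted(
        name for name in expected
        if name not in last_seen or now - last_seen[name] > timeout_seconds
    )


def active_nodes(cluster, expected, now, timeout_seconds):
    """Nodes a load balancer would keep: those with no missing heartbeats."""
    return sorted(
        node for node, last_seen in cluster.items()
        if not missing_heartbeats(expected, last_seen, now, timeout_seconds)
    )


# Hypothetical state: the deprecated process stopped at t=0 and it is now t=30.
cluster = {
    "node-a": {"ad_server": 29.0, "deprecated_worker": 0.0},
    "node-b": {"ad_server": 28.5, "deprecated_worker": 0.0},
}
old_expected = {"ad_server", "deprecated_worker"}  # configuration during the outage
new_expected = {"ad_server"}                       # configuration after the fix

before = active_nodes(cluster, old_expected, now=30.0, timeout_seconds=10.0)
after = active_nodes(cluster, new_expected, now=30.0, timeout_seconds=10.0)
```

With the old configuration every node is missing a heartbeat from the stopped process, so `before` is empty. With the updated configuration the same heartbeat data yields both nodes in `after`.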

Our ad serving system was down for approximately 60 seconds. For approximately 60 seconds after that, the system was operational but with increased latency while nodes were re-added to the active cluster.
Posted Aug 29, 2013 - 16:09 EDT