Failure Detection in Distributed Systems

Jul 4

In a single process, failure can feel direct. A function throws, a thread stops, or a local dependency returns an error. In a distributed system, failure is stranger: one process rarely observes another process failing. It observes silence.

Silence is ambiguous. The remote process might have crashed. It might be slow. The network might be dropping packets. A firewall, queue, pause, DNS issue, overloaded node, or delayed scheduler might be hiding a healthy process from you. This is why perfect failure detection is impossible in a real distributed system. A missing response is evidence, not proof.

Failure detection is a guess

Failure detectors turn missing communication into an operational decision: after waiting long enough, the system treats a peer as unavailable and starts doing something else.

That decision is always a trade-off. A short timeout detects real failures quickly, which can reduce user-visible downtime. But it also increases false suspicions: a slow process can be mistaken for a dead one. A long timeout avoids some false alarms, but it delays recovery when the failure is real.

Waiting forever is usually worse than being imperfect. If a caller never gives up, it cannot retry, fail over, shed load, show an error, or trigger repair. Distributed systems need timeouts because uncertainty must eventually become action.

Request-time checks

The simplest failure detector is the normal request path. A service sends work to another service, waits for a response, and treats timeout or connection failure as a sign that the dependency is unavailable.

This is often enough when calls are infrequent or when the system does not need to react before the next real request arrives. It keeps the design simple because the system learns about failure only when it actually needs the dependency.

Pings and heartbeats

Some systems need to notice failure before useful work is sent. They use proactive checks.

A ping asks another process, "are you reachable right now?" A heartbeat is the inverse rhythm: a process sends regular liveness signals, and peers become suspicious when those signals stop.

Proactive checks are useful when processes interact frequently, when leadership or membership matters, or when quick failover is worth the extra traffic and complexity. They are less useful when interactions are rare and request-time detection gives the same practical result with fewer moving parts.

The practical rule

Do not ask for a perfect answer to "is this process dead?" A distributed system cannot give you that. Ask a better question: "how long should we wait before acting as if this process is unavailable?"

The answer depends on the cost of each mistake. If false suspicion is expensive, wait longer or require more evidence. If delayed recovery is expensive, use shorter timeouts and accept more false alarms. Failure detection is not about certainty. It is about choosing the least harmful moment to stop waiting.