Leader Election
Leader election ensures a single node coordinates system state, but network partitions inevitably cause split-brain scenarios.
- Leader election ensures a single node coordinates system state, but network partitions inevitably cause split-brain scenarios.
- Leader leases bound by monotonic time are critical to prevent old leaders from acting during network partitions.
- Slow leaders (due to GC or CPU starvation) must step down proactively before their lease expires.
- Heartbeat intervals must be carefully tuned relative to election timeouts to prevent cascading election storms.
The Problem
In many distributed systems, a single coordinator (the leader) is responsible for managing state, writing to storage, or scheduling tasks. If multiple nodes believe they are the leader simultaneously (a split-brain scenario), they will issue conflicting commands, leading to data corruption and inconsistent state.
This problem manifests during network partitions: a leader is cut off from the rest of the cluster. The remaining nodes elect a new leader. However, the old leader, unaware of the partition, continues to accept writes and execute actions. If the old leader is merely slow (due to a garbage collection pause or high CPU steal) rather than dead, it may resume execution and overwrite changes made by the newly elected leader.
Core System Idea
To prevent split-brain and ensure safety, leader election must combine consensus-based voting with time-bound leader leases.
When a node is elected leader via a consensus protocol (like Raft), it is granted a lease for a specific duration (e.g., 5 seconds). This lease is a contract: the leader is only authorized to act as the leader for the duration of the lease. The leader must continuously renew this lease by sending periodic heartbeats to a quorum of followers.
Crucially, the lease duration must be measured using monotonic clocks, not wall-clock time, to avoid issues with NTP clock drift. If a leader fails to receive acknowledgment for its heartbeats from a majority of nodes before its lease expires, it must immediately step down and cease all coordinator activities. Followers will only initiate a new election after the lease duration of the old leader has fully elapsed, ensuring that two leaders never overlap in their active windows.
System Flow
A network partition isolates Node 1, causing its lease to expire and forcing it to step down before Node 2 can be elected as the new leader.
Real-World Examples Indicative
Kubernetes controller-manager and scheduler use the coordination.k8s.io/v1 Lease API backed by etcd for leader election. Each component acquires a Lease with leaseDuration=15s and renewDeadline=10s. The active leader renews its Lease every retryPeriod (2 seconds), maintaining a 5:1 safety margin over the heartbeat interval. If the active leader fails to renew within 10 seconds (5 retry cycles), a competing instance detects the expired Lease via the etcd watch API and immediately acquires it. This 15-second leaseDuration was specifically tuned to survive etcd compaction pauses, which in production add 2-3 seconds of write latency on congested EBS volumes — a value discovered empirically during the Kubernetes 1.14 release cycle.
Vault uses Consul for leader election in HA mode. The active Vault node creates a Consul session with TTL=15s and LockDelay=15s. The LockDelay is the critical safety property: after a session expires (due to a Vault crash or network partition), Consul blocks any other node from acquiring the leadership lock for an additional 15 seconds. This delay accounts for the worst-case scenario where the failed leader had a write in-flight to the Vault storage backend — without LockDelay, a new leader could begin writing while the old leader's write was still completing, corrupting Vault's encrypted secret store. HashiCorp's production Vault reference architecture mandates both TTL and LockDelay be set to the same value, creating a hard safety window.
CockroachDB uses an internal liveness system separate from Raft leadership. Each node writes a periodic heartbeat to the system.liveness table with a monotonically increasing epoch counter. The heartbeat interval is 2.25 seconds; a node is declared dead after 4.5 seconds (2 missed heartbeats). During a 2019 production incident involving AWS instance hibernation, CockroachDB nodes failed to heartbeat due to hypervisor CPU steal, triggering liveness transitions. The epoch counter mechanism — not wall-clock time — ensured that writes from the "zombie" nodes were rejected by their Raft peers, which compared the incoming write's epoch against the latest observed liveness epoch and declined any request with a stale epoch, preventing split-brain writes during the hibernation event.
Anti-Patterns
Relying on system time (gettimeofday) to calculate lease expiration, which can be manipulated by NTP syncs, causing leases to expire early or last too long.
Continuing to write to databases or call external APIs after the lease renewal heartbeat has failed or timed out.
Configuring an election timeout of 1 second with a heartbeat interval of 800 milliseconds. Minor network jitter will trigger constant, unnecessary re-elections (election storms).
Assuming a leader is either fully healthy or dead; ignoring the "grey failure" state where a leader is alive but too slow to process requests, blocking the system.
Design Tradeoffs
| Dimension | Active Lease Push (Heartbeats) | Passive Lease Pull (Polling) |
|---|---|---|
| Failure detection latency | Low; leader failure detected within one missed heartbeat interval — typically 2-10 seconds in production clusters | High; detection is bounded by the polling interval — can be 30+ seconds for infrequent pollers |
| Network overhead | High in large clusters; the leader fans out heartbeats to all followers on every interval, scaling linearly with cluster size | Low; only polling agents generate traffic, scaling with polling frequency rather than cluster size |
| Leader health visibility | Immediate; followers know the leader is alive via continuous heartbeat acknowledgments with no polling delay | Delayed; a degraded-but-alive leader may stop responding between polls without triggering detection |
Best Practices
CLOCK_MONOTONIC in Linux) to isolate the system from NTP adjustments.When to Use / Avoid
| Use When | Avoid When |
|---|---|
| You have a single-writer bottleneck where concurrent writes from multiple nodes would corrupt the system state. | The workload is completely stateless and can be scaled horizontally behind a standard round-robin load balancer. |
| Coordinating complex cluster-wide operations like partition rebalancing, schema migrations, or cron scheduling. | High-throughput, low-latency writes are required across all nodes simultaneously (use multi-master or sharded architectures instead). |
| Implementing a control plane that must make deterministic, centralized decisions for a pool of worker nodes. | The cost of election downtime (during partitions or leader failures) violates strict real-time availability SLAs. |