← System Design Distributed Coordination
System Design

Leader Election

Leader election ensures a single node coordinates system state, but network partitions inevitably cause split-brain scenarios.

TL;DR
  • Leader election ensures a single node coordinates system state, but network partitions inevitably cause split-brain scenarios.
  • Leader leases bound by monotonic time are critical to prevent old leaders from acting during network partitions.
  • Slow leaders (due to GC or CPU starvation) must step down proactively before their lease expires.
  • Heartbeat intervals must be carefully tuned relative to election timeouts to prevent cascading election storms.

The Problem

In many distributed systems, a single coordinator (the leader) is responsible for managing state, writing to storage, or scheduling tasks. If multiple nodes believe they are the leader simultaneously (a split-brain scenario), they will issue conflicting commands, leading to data corruption and inconsistent state.

This problem manifests during network partitions: a leader is cut off from the rest of the cluster. The remaining nodes elect a new leader. However, the old leader, unaware of the partition, continues to accept writes and execute actions. If the old leader is merely slow (due to a garbage collection pause or high CPU steal) rather than dead, it may resume execution and overwrite changes made by the newly elected leader.

Core System Idea

To prevent split-brain and ensure safety, leader election must combine consensus-based voting with time-bound leader leases.

When a node is elected leader via a consensus protocol (like Raft), it is granted a lease for a specific duration (e.g., 5 seconds). This lease is a contract: the leader is only authorized to act as the leader for the duration of the lease. The leader must continuously renew this lease by sending periodic heartbeats to a quorum of followers.

Crucially, the lease duration must be measured using monotonic clocks, not wall-clock time, to avoid issues with NTP clock drift. If a leader fails to receive acknowledgment for its heartbeats from a majority of nodes before its lease expires, it must immediately step down and cease all coordinator activities. Followers will only initiate a new election after the lease duration of the old leader has fully elapsed, ensuring that two leaders never overlap in their active windows.

System Flow

flowchart TD A["Node 1: Leader"] -->|1. Send Heartbeat| B(Follower 1) A -->|2. Send Heartbeat| C(Follower 2) A -->|3. Lease Renewed for 5s| A D[Network Partition Occurs] -.-> A A -->|4. Heartbeat Fails| B A -->|5. Heartbeat Fails| C A -->|"6. Lease Expires: Step Down"| A B -->|7. Timeout Elapses| B B -->|8. Elect New Leader| C

A network partition isolates Node 1, causing its lease to expire and forcing it to step down before Node 2 can be elected as the new leader.

Real-World Examples Indicative

Kubernetes control plane — etcd Lease API, 15s leaseDuration with 5:1 heartbeat safety margin

Kubernetes controller-manager and scheduler use the coordination.k8s.io/v1 Lease API backed by etcd for leader election. Each component acquires a Lease with leaseDuration=15s and renewDeadline=10s. The active leader renews its Lease every retryPeriod (2 seconds), maintaining a 5:1 safety margin over the heartbeat interval. If the active leader fails to renew within 10 seconds (5 retry cycles), a competing instance detects the expired Lease via the etcd watch API and immediately acquires it. This 15-second leaseDuration was specifically tuned to survive etcd compaction pauses, which in production add 2-3 seconds of write latency on congested EBS volumes — a value discovered empirically during the Kubernetes 1.14 release cycle.

HashiCorp Vault HA — Consul session TTL with 15s lock-delay as dual-write defense

Vault uses Consul for leader election in HA mode. The active Vault node creates a Consul session with TTL=15s and LockDelay=15s. The LockDelay is the critical safety property: after a session expires (due to a Vault crash or network partition), Consul blocks any other node from acquiring the leadership lock for an additional 15 seconds. This delay accounts for the worst-case scenario where the failed leader had a write in-flight to the Vault storage backend — without LockDelay, a new leader could begin writing while the old leader's write was still completing, corrupting Vault's encrypted secret store. HashiCorp's production Vault reference architecture mandates both TTL and LockDelay be set to the same value, creating a hard safety window.

CockroachDB — epoch-based liveness heartbeat at 4.5s, hypervisor steal incident detection

CockroachDB uses an internal liveness system separate from Raft leadership. Each node writes a periodic heartbeat to the system.liveness table with a monotonically increasing epoch counter. The heartbeat interval is 2.25 seconds; a node is declared dead after 4.5 seconds (2 missed heartbeats). During a 2019 production incident involving AWS instance hibernation, CockroachDB nodes failed to heartbeat due to hypervisor CPU steal, triggering liveness transitions. The epoch counter mechanism — not wall-clock time — ensured that writes from the "zombie" nodes were rejected by their Raft peers, which compared the incoming write's epoch against the latest observed liveness epoch and declined any request with a stale epoch, preventing split-brain writes during the hibernation event.

Anti-Patterns

Using Wall-Clock Time for Leases

Relying on system time (gettimeofday) to calculate lease expiration, which can be manipulated by NTP syncs, causing leases to expire early or last too long.

Executing Actions After Lease Expiry

Continuing to write to databases or call external APIs after the lease renewal heartbeat has failed or timed out.

Setting Election Timeouts Too Close to Heartbeat Intervals

Configuring an election timeout of 1 second with a heartbeat interval of 800 milliseconds. Minor network jitter will trigger constant, unnecessary re-elections (election storms).

Failing to Handle Slow Leaders

Assuming a leader is either fully healthy or dead; ignoring the "grey failure" state where a leader is alive but too slow to process requests, blocking the system.

Design Tradeoffs

DimensionActive Lease Push (Heartbeats)Passive Lease Pull (Polling)
Failure detection latencyLow; leader failure detected within one missed heartbeat interval — typically 2-10 seconds in production clustersHigh; detection is bounded by the polling interval — can be 30+ seconds for infrequent pollers
Network overheadHigh in large clusters; the leader fans out heartbeats to all followers on every interval, scaling linearly with cluster sizeLow; only polling agents generate traffic, scaling with polling frequency rather than cluster size
Leader health visibilityImmediate; followers know the leader is alive via continuous heartbeat acknowledgments with no polling delayDelayed; a degraded-but-alive leader may stop responding between polls without triggering detection

Best Practices

Use Monotonic Clocks OnlyAlways measure lease durations and timeouts using monotonic timers (e.g., CLOCK_MONOTONIC in Linux) to isolate the system from NTP adjustments.
Enforce a Safety MarginSet the election timeout to be at least 3x to 5x the heartbeat interval to accommodate transient network spikes without triggering re-elections.
Implement Graceful Step-DownDesign the leader to proactively relinquish its leadership and stop processing if it detects that more than half of its heartbeat attempts are failing.
Fencing on Downstream WritesCombine leader election with downstream fencing tokens so that storage engines can reject writes from a partitioned leader that hasn't realized its lease expired.

When to Use / Avoid

Use WhenAvoid When
You have a single-writer bottleneck where concurrent writes from multiple nodes would corrupt the system state.The workload is completely stateless and can be scaled horizontally behind a standard round-robin load balancer.
Coordinating complex cluster-wide operations like partition rebalancing, schema migrations, or cron scheduling.High-throughput, low-latency writes are required across all nodes simultaneously (use multi-master or sharded architectures instead).
Implementing a control plane that must make deterministic, centralized decisions for a pool of worker nodes.The cost of election downtime (during partitions or leader failures) violates strict real-time availability SLAs.