← System Design Distributed Coordination
System Design

Consensus Algorithms

Consensus algorithms guarantee state machine replication across independent nodes, prioritizing safety over raw performance.

TL;DR
  • Consensus algorithms guarantee state machine replication across independent nodes, prioritizing safety over raw performance.
  • Raft simplifies Paxos by introducing a strong, centralized leader that manages all log replication.
  • Quorum requirements (N/2 + 1) dictate that a 5-node cluster can tolerate exactly 2 node failures.
  • Strong consistency incurs a heavy latency penalty due to mandatory disk fsyncs and multi-roundtrip network consensus.

The Problem

In a distributed database or coordination engine, data must be replicated across multiple physical machines to survive hardware failures. If nodes replicate data asynchronously, a leader crash results in data loss and state divergence. If they replicate synchronously without a formal consensus protocol, a single slow or dead node halts the entire system's ability to accept writes.

The core challenge is achieving reliable agreement on a sequence of state transitions (a log) across a network that drops, delays, and reorders messages, without relying on a single, vulnerable master node.

Core System Idea

The foundation of modern consensus is the Replicated State Machine (RSM) model. Instead of replicating database state directly, nodes replicate a log of deterministic commands. Each node processes the log entries in the exact same order, guaranteeing they arrive at the identical internal state.

Consensus algorithms like Raft and Paxos orchestrate this log replication. They operate on a quorum system: any write must be acknowledged by a majority of nodes (e.g., 3 out of 5) before it is committed and applied to the state machine.

Raft simplifies this by electing a single, strong leader. All client writes go to the leader, which appends the command to its log and replicates it to followers. Once a quorum of followers acknowledges receipt of the log entry, the leader commits it, applies it locally, and returns success to the client. This strong leader model eliminates the complex multi-leader proposal collisions inherent in basic Paxos.

System Flow

flowchart TD A[Client] -->|1. Write Command| B(Leader) B -->|2. Append Entry| C(Follower 1) B -->|2. Append Entry| D(Follower 2) C -->|3. Acknowledge| B D -->|3. Acknowledge| B B -->|"4. Commit & Apply"| B B -->|5. Success| A B -->|6. Broadcast Commit| C

The leader replicates a write to followers, commits the entry once a quorum (2 out of 3 nodes) acknowledges receipt, and then responds to the client.

Real-World Examples Indicative

etcd at Kubernetes — 5-node cluster, fsync latency on EBS drives 250ms→10ms after WAL isolation

Production etcd deployments for Kubernetes control planes run as 5-node clusters (tolerating 2 simultaneous failures). Each Raft commit requires a synchronous fsync to the WAL before the follower sends an AppendEntries acknowledgment to the leader. On AWS EBS gp2 volumes under shared IOPS contention, fsync latency spikes to 250ms P99, directly stalling Kubernetes API server write throughput. Weaveworks documented in 2020 that switching etcd WAL storage to a dedicated io1 provisioned IOPS EBS volume reduced P99 fsync latency from 250ms to under 10ms, increasing etcd write throughput by 12x. The Kubernetes community now mandates etcd WAL placement on dedicated fast local disks in all production deployment guides.

CockroachDB Multi-Raft — 64MB ranges, 10K independent Raft groups, RocksDB WAL isolation fix

CockroachDB partitions data into Ranges (initially 64MB each), with every Range independently managed by its own 3- or 5-node Raft consensus group. A 100-node CockroachDB cluster running 10,000 Ranges operates 10,000 concurrent Raft groups. During a 2021 internal scaling analysis, CockroachDB engineers identified that the bottleneck was not Raft consensus roundtrip latency (typically 2-5ms) but head-of-line blocking: WAL writes from 10,000 Raft groups co-located on shared RocksDB instances caused fsync serialization. Their fix was to partition RocksDB column families per Range to isolate Raft WAL fsync contention from KV data compaction I/O, unblocking consensus throughput at high Range density.

TiKV — Raft with dedicated NVMe WAL, 1ms P99 commit latency at TPC-C scale

TiKV (the storage layer for TiDB) implements Multi-Raft with each Region (96MB default) having its own Raft group. In PingCAP's 2022 TPC-C benchmark, TiKV achieved 1ms P99 Raft commit latency by isolating the Raft WAL to a dedicated NVMe SSD (raftdb.wal-dir) separate from the KV data RocksDB instance. This isolation prevents compaction I/O from the KV data store from introducing fsync latency jitter into the Raft consensus path — a configuration requirement explicitly documented in TiKV's production deployment guide as mandatory for single-digit millisecond commit SLAs.

Anti-Patterns

Deploying Even-Numbered Clusters

Running a 4-node cluster instead of a 3 or 5-node cluster. A 4-node cluster requires a quorum of 3 to make progress, meaning it can only tolerate 1 failure—the same as a 3-node cluster, but with higher network overhead.

Bypassing the Leader for Reads

Reading directly from follower nodes without implementing lease reads or index tracking, which exposes the client to stale data (violating linearizability).

Placing Consensus Logs on Shared HDDs

Storing the consensus write-ahead log (WAL) on slow, spinning disks or shared network storage. Every consensus write requires a synchronous disk flush (fsync); slow disk I/O directly throttles system write throughput.

Over-Sizing Clusters for Read Scale

Increasing a consensus cluster to 11 or 15 nodes to handle read traffic. This severely degrades write performance because the leader must coordinate quorums across a much larger pool of nodes.

Design Tradeoffs

DimensionRaft (Strong Leader)Multi-Paxos (Symmetric/Leaderless)
Implementation complexitySimple; unidirectional log flow from a single elected leader eliminates proposal collisions and split-log scenariosExtremely complex; simultaneous proposals from multiple coordinators require multi-phase conflict resolution and careful epoch management
Write throughput ceilingBounded by the single leader's network and disk bandwidth — all writes must serialize through one nodeHigher potential; multiple coordinators can propose concurrently, reducing the single-leader bottleneck in geo-distributed deployments
Write availability during electionsZero; the cluster cannot commit writes during the leader election phase, typically 150-300ms in well-tuned clustersHigher; no distinct election phase is required for basic operations, reducing write-availability gaps during node failures

Best Practices

Isolate the WAL DiskDedicate a high-speed, local NVMe SSD solely to the consensus write-ahead log to minimize fsync latency spikes.
Implement Batching and PipeliningBatch multiple client writes into a single consensus proposal and pipeline log replication requests to saturate network bandwidth and hide latency.
Use Lease Reads for PerformanceTo serve linearizable reads without running a full consensus roundtrip for every read, implement leader leases that guarantee the leader is still valid.
Keep Cluster Sizes SmallStandardize on 3 or 5-node consensus clusters. If you need to scale storage or throughput, partition your data into multiple independent consensus groups (sharding).

When to Use / Avoid

Use WhenAvoid When
Building metadata stores, coordination engines, distributed locks, or configuration registries where data loss is unacceptable.Storing high-volume, transient data like application logs, metrics, IoT sensor streams, or user clickstreams.
Implementing distributed transactions that require strict linearizability and ACID guarantees across shards.Building globally distributed systems where write latency must be sub-millisecond (consensus requires cross-network roundtrips).
Managing cluster membership, routing tables, or IP allocations where split-brain states would cause catastrophic failures.The application workload is read-heavy and can tolerate eventually consistent or stale data.