Consensus Algorithms
Consensus algorithms guarantee state machine replication across independent nodes, prioritizing safety over raw performance.
- Consensus algorithms guarantee state machine replication across independent nodes, prioritizing safety over raw performance.
- Raft simplifies Paxos by introducing a strong, centralized leader that manages all log replication.
- Quorum requirements (N/2 + 1) dictate that a 5-node cluster can tolerate exactly 2 node failures.
- Strong consistency incurs a heavy latency penalty due to mandatory disk fsyncs and multi-roundtrip network consensus.
The Problem
In a distributed database or coordination engine, data must be replicated across multiple physical machines to survive hardware failures. If nodes replicate data asynchronously, a leader crash results in data loss and state divergence. If they replicate synchronously without a formal consensus protocol, a single slow or dead node halts the entire system's ability to accept writes.
The core challenge is achieving reliable agreement on a sequence of state transitions (a log) across a network that drops, delays, and reorders messages, without relying on a single, vulnerable master node.
Core System Idea
The foundation of modern consensus is the Replicated State Machine (RSM) model. Instead of replicating database state directly, nodes replicate a log of deterministic commands. Each node processes the log entries in the exact same order, guaranteeing they arrive at the identical internal state.
Consensus algorithms like Raft and Paxos orchestrate this log replication. They operate on a quorum system: any write must be acknowledged by a majority of nodes (e.g., 3 out of 5) before it is committed and applied to the state machine.
Raft simplifies this by electing a single, strong leader. All client writes go to the leader, which appends the command to its log and replicates it to followers. Once a quorum of followers acknowledges receipt of the log entry, the leader commits it, applies it locally, and returns success to the client. This strong leader model eliminates the complex multi-leader proposal collisions inherent in basic Paxos.
System Flow
The leader replicates a write to followers, commits the entry once a quorum (2 out of 3 nodes) acknowledges receipt, and then responds to the client.
Real-World Examples Indicative
Production etcd deployments for Kubernetes control planes run as 5-node clusters (tolerating 2 simultaneous failures). Each Raft commit requires a synchronous fsync to the WAL before the follower sends an AppendEntries acknowledgment to the leader. On AWS EBS gp2 volumes under shared IOPS contention, fsync latency spikes to 250ms P99, directly stalling Kubernetes API server write throughput. Weaveworks documented in 2020 that switching etcd WAL storage to a dedicated io1 provisioned IOPS EBS volume reduced P99 fsync latency from 250ms to under 10ms, increasing etcd write throughput by 12x. The Kubernetes community now mandates etcd WAL placement on dedicated fast local disks in all production deployment guides.
CockroachDB partitions data into Ranges (initially 64MB each), with every Range independently managed by its own 3- or 5-node Raft consensus group. A 100-node CockroachDB cluster running 10,000 Ranges operates 10,000 concurrent Raft groups. During a 2021 internal scaling analysis, CockroachDB engineers identified that the bottleneck was not Raft consensus roundtrip latency (typically 2-5ms) but head-of-line blocking: WAL writes from 10,000 Raft groups co-located on shared RocksDB instances caused fsync serialization. Their fix was to partition RocksDB column families per Range to isolate Raft WAL fsync contention from KV data compaction I/O, unblocking consensus throughput at high Range density.
TiKV (the storage layer for TiDB) implements Multi-Raft with each Region (96MB default) having its own Raft group. In PingCAP's 2022 TPC-C benchmark, TiKV achieved 1ms P99 Raft commit latency by isolating the Raft WAL to a dedicated NVMe SSD (raftdb.wal-dir) separate from the KV data RocksDB instance. This isolation prevents compaction I/O from the KV data store from introducing fsync latency jitter into the Raft consensus path — a configuration requirement explicitly documented in TiKV's production deployment guide as mandatory for single-digit millisecond commit SLAs.
Anti-Patterns
Running a 4-node cluster instead of a 3 or 5-node cluster. A 4-node cluster requires a quorum of 3 to make progress, meaning it can only tolerate 1 failure—the same as a 3-node cluster, but with higher network overhead.
Reading directly from follower nodes without implementing lease reads or index tracking, which exposes the client to stale data (violating linearizability).
Storing the consensus write-ahead log (WAL) on slow, spinning disks or shared network storage. Every consensus write requires a synchronous disk flush (fsync); slow disk I/O directly throttles system write throughput.
Increasing a consensus cluster to 11 or 15 nodes to handle read traffic. This severely degrades write performance because the leader must coordinate quorums across a much larger pool of nodes.
Design Tradeoffs
| Dimension | Raft (Strong Leader) | Multi-Paxos (Symmetric/Leaderless) |
|---|---|---|
| Implementation complexity | Simple; unidirectional log flow from a single elected leader eliminates proposal collisions and split-log scenarios | Extremely complex; simultaneous proposals from multiple coordinators require multi-phase conflict resolution and careful epoch management |
| Write throughput ceiling | Bounded by the single leader's network and disk bandwidth — all writes must serialize through one node | Higher potential; multiple coordinators can propose concurrently, reducing the single-leader bottleneck in geo-distributed deployments |
| Write availability during elections | Zero; the cluster cannot commit writes during the leader election phase, typically 150-300ms in well-tuned clusters | Higher; no distinct election phase is required for basic operations, reducing write-availability gaps during node failures |
Best Practices
fsync latency spikes.When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Building metadata stores, coordination engines, distributed locks, or configuration registries where data loss is unacceptable. | Storing high-volume, transient data like application logs, metrics, IoT sensor streams, or user clickstreams. |
| Implementing distributed transactions that require strict linearizability and ACID guarantees across shards. | Building globally distributed systems where write latency must be sub-millisecond (consensus requires cross-network roundtrips). |
| Managing cluster membership, routing tables, or IP allocations where split-brain states would cause catastrophic failures. | The application workload is read-heavy and can tolerate eventually consistent or stale data. |