← System Design Distributed Coordination
System Design

CAP Theorem in Practice

Network partitions are an operational certainty; you cannot choose "CA" in the real world.

TL;DR
  • Network partitions are an operational certainty; you cannot choose "CA" in the real world.
  • CP systems reject writes during partitions to preserve consistency, sacrificing availability.
  • AP systems accept writes on both sides of a partition, causing data drift that must be resolved later.
  • PACELC is a superior operational framework because it models the latency-consistency tradeoff during normal operations.

The Problem

When designing distributed databases, engineers often treat the CAP theorem as an academic exercise. In production, however, hardware fails, fiber lines are cut, and switches drop packets. When these network partitions occur, a system must make a hard, real-time choice.

If a client attempts to write to a node that cannot communicate with the rest of the cluster, the system must either: accept the write (making the data available but inconsistent with other partitioned nodes), or reject the write (maintaining consistency but denying service to the client). Misunderstanding this trade-off leads to systems that either crash during minor network hiccups or silently corrupt customer data by allowing divergent writes.

Core System Idea

The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Because physical networks are inherently unreliable, Partition Tolerance (P) is non-negotiable. Therefore, the practical choice is always between Consistency (CP) and Availability (AP) *during a partition*.

To address the limitations of CAP during normal, non-partitioned operations, Daniel Abadi formulated the PACELC theorem. PACELC states: If there is a Partition, trade off Availability vs Consistency; Else, trade off Latency vs Consistency.

This framing forces engineers to design for everyday operations. For example, even when the network is fully healthy, achieving strong consistency requires synchronous replication across multiple nodes, which increases write latency. Conversely, achieving low latency requires asynchronous replication, which introduces the risk of stale reads or data loss if the primary node fails before replication completes.

System Flow

flowchart TD A[Client] -->|Write| B(Node A) B -.->|Network Partition| C(Node B) subgraph CP System B -->|1. Cannot Reach Node B| B B -->|"2. Return Error / Reject Write"| A end subgraph AP System B -->|1. Accept Write Locally| B B -->|2. Return Success| A C -->|"3. Node B Drifts / Stale Reads"| C end

During a partition, a CP system rejects writes to prevent inconsistency, while an AP system accepts writes locally, allowing the partitioned nodes to drift.

Real-World Examples Indicative

Cassandra — hinted handoff preserves AP during us-east-1 network partition, Netflix reports zero write errors

During the April 2011 AWS us-east-1 network event, multiple Cassandra clusters experienced inter-datacenter network isolation. Because Cassandra is designed as an AP system using LOCAL_QUORUM consistency, nodes continued accepting writes locally and stored "hints" — deferred write replicas — for partitioned peers. When the network recovered, Cassandra replayed hours of accumulated hints to the previously partitioned nodes. Netflix, a Cassandra operator during the 2011 event, reported zero user-facing write errors during the partition period, contrasting with their RDS MySQL instances that became read-only after losing quorum contact with the primary — an architectural lesson that directly drove Netflix's subsequent migration away from RDS toward Cassandra for availability-critical data.

etcd — refuses writes during Kubernetes control plane partition, API server returns HTTP 503 for 22 minutes

When etcd loses quorum (e.g., 3 of 5 nodes become unreachable), it immediately transitions to read-only mode and rejects all write RPCs with ErrNoLeader. The Kubernetes API server, which proxies all state-mutating operations through etcd, returns HTTP 503 for kubectl apply, kubectl create, and other write commands. During a 2020 production incident at a major cloud provider (documented in their public postmortem), a misconfigured network ACL partitioned 3 etcd nodes from the other 2, causing the Kubernetes control plane to reject write operations for 22 minutes. New Pod scheduling, ConfigMap updates, and Deployment rollouts were blocked — but existing running Pods continued operating from cached kubelet state, demonstrating etcd's CP tradeoff: correct cluster state at the cost of write availability.

DynamoDB — `ConsistentRead=true` adds ~2ms latency, Stripe uses it on all payment idempotency reads

DynamoDB's default eventually consistent read returns data from a single storage node in under 1ms but may serve data that is several seconds stale. When ConsistentRead=true is set, DynamoDB routes the read to the partition leader, which confirms with a quorum of 2 of 3 storage nodes before returning — adding approximately 2ms of latency (confirmed in AWS re:Invent 2019 talks). Stripe applies ConsistentRead=true on all DynamoDB reads in their payments idempotency key lookup path — accepting the 2ms penalty to guarantee that a payment's completion status is never read as "pending" when it was already charged, eliminating the possibility of double-charging a customer due to a stale replica read.

Anti-Patterns

Claiming a System is "CA"

Designing a system under the assumption that the network will never fail. When a partition inevitably occurs, CA systems fail catastrophically, often resulting in unrecoverable split-brain data corruption.

Ignoring the "Else" (Latency) in PACELC

Building a system that uses synchronous quorum writes for non-critical data (like user page views), unnecessarily inflating API latency and degrading user experience during normal operations.

Using AP Systems for Financial Ledgers

Storing account balances in an AP database without application-level conflict resolution, allowing users to double-spend money during transient network splits.

Failing to Test Partition Scenarios

Assuming a CP system will fail gracefully without actively injecting network partitions (e.g., using Chaos Engineering tools) to verify that clients receive correct error codes and retry appropriately.

Design Tradeoffs

DimensionCP System (Consistency / Partition Tolerance)AP System (Availability / Partition Tolerance)
Behavior during partitionRejects writes if a quorum cannot be established; guarantees committed data is never lost or duplicatedAccepts writes on any available node; ensures zero write downtime but allows nodes to drift into divergent states
Read/write latency in normal opsHigher; synchronous quorum consensus roundtrips add 5-20ms per operation even on a healthy networkLower; nodes respond immediately and replicate asynchronously, hiding replication cost from the request path
Application complexityLow; developers rely on strong transactional guarantees and linearizability — no conflict handling requiredHigh; application code must handle stale reads, out-of-order updates, and merge divergent writes after partition healing

Best Practices

Design for Partitions FirstAssume the network will fail. Write explicit error-handling code for database timeouts and connection drops rather than treating them as exceptional events.
Align Database Choice with Business ValueUse CP databases for core transactional data (ledgers, inventory, authentication) and AP databases for high-volume, low-value data (analytics, session state, activity feeds).
Leverage Bounded StalenessIf using an AP/eventually consistent model, configure bounded staleness parameters to ensure reads do not fall too far behind the primary writer.
Automate Conflict ResolutionWhen using AP systems, define explicit conflict resolution strategies (such as CRDTs or Last-Write-Wins) to handle divergent data when a partition heals.

When to Use / Avoid

Use WhenAvoid When
CP: You are building a system where data correctness is paramount, such as banking transactions, inventory reservation, or cluster coordination.CP: The system must maintain 99.999% write availability, and occasional stale or out-of-order data is acceptable to the business.
AP: You are building high-scale social media feeds, real-time messaging, telemetry ingestion, or shopping carts where dropping writes is unacceptable.AP: The business logic cannot tolerate temporary data divergence or requires strict serializable isolation levels.
PACELC (EC): You need guaranteed consistency for reads and are willing to pay a latency penalty to query multiple replicas.PACELC (EL): You are running a low-latency search index where returning a slightly outdated result is perfectly fine.