← System Design Distributed Coordination
System Design

Multi-Region Architecture

Active-Active multi-region architectures require complex conflict resolution; physical clock-based resolution (LWW) causes data loss.

TL;DR
  • Active-Active multi-region architectures require complex conflict resolution; physical clock-based resolution (LWW) causes data loss.
  • Active-Passive architectures simplify writes by routing them to a single primary region, but introduce cross-region replication lag.
  • Synchronous cross-region replication guarantees zero data loss (RPO=0) but severely degrades write latency due to the speed of light.
  • CRDTs and operational transformation are the only safe ways to handle concurrent, multi-region writes without coordination.

The Problem

To survive the complete failure of an entire cloud provider region or to provide sub-millisecond latency to a global user base, systems must deploy across multiple geographical regions. However, the speed of light limits how fast data can travel between continents (e.g., transatlantic round-trip time is ~70ms).

If a system uses synchronous replication across regions to guarantee consistency, write latency skyrockets, rendering user-facing applications unusable. If the system replicates asynchronously to keep latency low, a region outage results in data loss (RPO > 0). Furthermore, if writes are accepted in multiple regions simultaneously (Active-Active), concurrent updates to the same record will conflict, leading to silent data corruption or divergent database states.

Core System Idea

Multi-region architectures must balance the trade-off between write latency, read freshness, and disaster recovery capabilities.

In an Active-Passive (Single-Primary) model, all writes are routed to a single "home" region. This region processes transactions locally and replicates changes asynchronously to passive "read replica" regions. This eliminates write conflicts but means users far from the primary region experience higher write latency, and passive regions may serve stale reads.

In an Active-Active (Multi-Primary) model, writes are accepted in any region. To prevent data divergence, the system must employ a deterministic conflict resolution strategy.

Because physical clocks drift across regions, relying on Last-Write-Wins (LWW) based on timestamps will arbitrarily delete valid updates. Instead, systems must use Conflict-Free Replicated Data Types (CRDTs)—such as state-based grow-only counters or observed-removed sets—or implement application-level orchestration to merge concurrent writes deterministically.

System Flow

flowchart TD UserUS[US User] -->|1. Write| RegUS(US Region - Primary) UserEU[EU User] -->|2. Write| RegEU(EU Region - Replica) RegEU -->|3. Proxy Write| RegUS RegUS -->|4. Commit Write| RegUS RegUS -.->|5. Async Replication Lag| RegEU RegEU -->|6. Local Read| UserEU

In an Active-Passive multi-region setup, writes from the EU are proxied to the US primary region, while reads are served locally from the EU replica with asynchronous lag.

Real-World Examples Indicative

Google Spanner — TrueTime ±7ms uncertainty, external consistency across 3 continents at 40K TPS

Spanner uses Google's TrueTime API — hardware clocks backed by GPS receivers and atomic clocks deployed in every Google datacenter — to achieve external consistency without distributed locks. TrueTime reports wall-clock time as an interval [earliest, latest] with a guaranteed maximum uncertainty of ±7ms. Spanner's commit protocol assigns transaction timestamps within the TrueTime interval and waits out the full uncertainty window (commit-wait) before declaring the transaction committed. This guarantees that any transaction starting after commit-wait observes the committed state. Google F1 (the database backing Google Ads) runs on Spanner across US, EU, and Asia, processing 40K+ transactions/sec with linearizable cross-region reads — without any clock synchronization protocol between regions.

AWS Aurora Global Database — storage-layer replication <1s lag, <60s failover, used by Slack for workspace metadata

Aurora Global Database uses proprietary storage-layer replication — not MySQL binlog replication — to stream physical redo log pages from the primary region to up to 5 secondary regions. At the storage layer, this replication achieves under 1 second of typical lag (300-700ms P99 under 50K writes/sec, measured in AWS re:Invent 2021 demos). During a simulated primary region failure, Aurora Global Database's managed failover promotes a secondary in under 60 seconds (RTO < 60s), compared to traditional MySQL replica promotion requiring 20-40 minutes of manual coordination. Slack uses Aurora Global Database between us-east-1 (primary) and eu-west-1 (secondary) for their workspace metadata store, relying on the <1s lag guarantee to implement their read-your-own-writes session routing logic.

Netflix Cassandra — Active-Active across 3 AWS regions with LWW scoped to eventually-consistent data only

Netflix runs Cassandra Active-Active across us-east-1, us-west-2, and eu-west-1. Viewing history writes are accepted in any region and replicated asynchronously (RF=3 per region, cross-region via Cassandra's multi-datacenter replication). Netflix uses Cassandra's Last-Write-Wins (LWW) conflict resolution based on WRITETIME() timestamps. However, since physical clocks drift between AWS regions by up to 10ms, LWW occasionally silently discards valid writes. Netflix mitigates this by routing writes for correctness-critical data (billing records, entitlements) exclusively to us-east-1 (Active-Passive), using Active-Active only for eventually-consistent data (viewing history, personalization signals) where LWW conflicts are invisible to users.

Anti-Patterns

Synchronous Cross-Region Writes in User Path

Blocking an API request to perform a synchronous three-phase commit across US, EU, and Asia regions, resulting in multi-hundred-millisecond response times.

Using LWW with Untrusted Clocks

Resolving multi-region write conflicts using database-level Last-Write-Wins without a hardware-synchronized clock source (like TrueTime), leading to silent data loss during clock drift.

Ignoring Replication Lag in Read Paths

Routing a user's read request to a local secondary region immediately after they performed a write, causing the user to see stale data because the write has not yet replicated (violating read-your-own-writes consistency).

Failing to Practice Region Failovers

Treating multi-region disaster recovery as a theoretical capability without executing regular, automated region evacuation drills in production.

Design Tradeoffs

DimensionActive-Passive (Single-Primary)Active-Active (Multi-Primary)
Write conflict riskZero; all writes route through a single primary region — no concurrent regional writes to reconcileHigh; concurrent writes to the same record from different regions require deterministic conflict resolution via CRDTs or LWW
Write latency for remote usersHigh; users geographically distant from the primary pay the full cross-region round-trip latency on every writeLow; each user writes to their nearest local region, achieving local write latency regardless of geographic distribution
Failure recovery (RPO/RTO)RPO > 0 during primary region failure; asynchronous replication lag means recent writes may be lost during failoverRPO = 0; all regions are live, so traffic is instantly rerouted to surviving regions with no data loss

Best Practices

Pin Users to Home RegionsRoute a specific user's traffic to a designated "home" region where their data is mastered, converting global Active-Active complexity into localized Active-Passive execution.
Use CRDTs for Collaborative DataFor shared data structures (like shopping carts or collaborative documents), design the schema using CRDTs to allow mathematical, conflict-free merging of concurrent regional updates.
Implement Read-Your-Own-Writes via Session TokensWhen a user writes to the primary region, return a version token. Force subsequent reads in secondary regions to wait until the local replica has caught up to that version.
Automate Region EvacuationBuild automated tooling to dynamically shift DNS (using latency-based routing with health checks) and promote secondary databases during regional outages.

When to Use / Avoid

Use WhenAvoid When
You have strict high-availability SLAs (99.99%+) where the business cannot tolerate downtime even during a cloud provider's regional outage.The application is a simple, low-traffic internal tool where a few hours of downtime during a disaster is acceptable.
Serving a globally distributed user base that requires sub-100ms response times for both reads and writes.The budget is highly constrained; multi-region deployments double or triple infrastructure and data egress costs.
Building highly collaborative, event-driven systems where data can be naturally modeled as append-only event streams.The system relies on legacy, monolithic relational databases that do not natively support distributed consensus or replication.