
Disaster Recovery Design

RTO defines how fast you must recover, RPO defines how much data you can afford to lose — AWS RDS Multi-AZ achieves <60s RTO and near-zero RPO through synchronous replication, while warm standby trades RPO for lower cost by using asynchronous replication.

TL;DR
  • RTO (how fast you recover) and RPO (how much data you lose) are the two metrics that drive all DR architecture decisions. AWS RDS Multi-AZ achieves RTO <60 seconds and RPO near-zero through synchronous replication to a standby instance in a separate AZ.
  • Automated failover without a human gate risks split-brain: both the primary and promoted secondary accept writes simultaneously, creating conflicting data that cannot be automatically reconciled. For stateful databases, require human approval before promotion.
  • DNS TTL is the hidden RTO contributor. RDS endpoint TTL is 5 seconds — designed for fast failover. A custom load balancer DNS record with TTL=3600 extends effective RTO by up to 1 hour after the infrastructure failover completes.
  • Backups that have never been tested are not backups. Salesforce's DR drills and Netflix's Chaos Kong quarterly regional evacuations are the only way to validate that the recovery procedure actually works at the required RTO.
  • Active-active multi-region costs roughly 2× the infrastructure of a single-region deployment for each additional region, but provides near-zero RTO and RPO. The deciding factor is the cost of downtime: if the expected yearly cost of outages under active-passive exceeds the yearly overhead of active-active, the math favors active-active.

The Problem

A cloud region suffers a major failure. The team has a DR plan documented in a wiki last updated 14 months ago. The plan references a manual failover script, which fails because the credentials for the standby database's IAM role were rotated 6 months ago and the script still uses the old ones. While the team debugs the credentials, a junior engineer triggers the automated failover script, which attempts to promote the standby while the primary is still partially online. Both databases accept writes for 8 minutes, creating conflicting transaction records that cannot be automatically merged. The incident post-mortem reveals that the DR plan was never tested end-to-end, the manual steps assumed a cloud console layout that has since changed, and the RTO target of "4 hours" was never validated against the actual recovery procedure time.

Core System Idea

Disaster recovery architecture is governed by two metrics: RTO (Recovery Time Objective), the maximum acceptable duration from failure detection to restored service, and RPO (Recovery Point Objective), the maximum acceptable age of data at the point of recovery. These metrics directly determine the architecture tier. Three topologies:

(1) Cold standby: infrastructure is provisioned only after a disaster is declared. RTO: hours to days (depends on provisioning speed). RPO: hours (determined by backup frequency). Cost: minimal; pay only for storage of backups and IaC templates, not running compute. Use case: internal tools, dev environments.

(2) Warm standby: a scaled-down replica runs continuously, receiving asynchronous replication from the primary. RTO: minutes (scale up replica, promote database, redirect DNS). RPO: seconds to minutes (determined by replication lag). Cost: 30–50% of full production cost. Use case: SaaS products with 99.9% SLA.

(3) Active-active: full production infrastructure runs in multiple regions simultaneously, serving live traffic with synchronous or near-synchronous replication. RTO: seconds (traffic redirected without infrastructure changes). RPO: near-zero (synchronous replication). Cost: 2× per additional region. Use case: financial systems, global consumer products with <1 minute RTO requirements.

The architecture choice must be validated against actual RTO/RPO requirements: an organization that has never tested their DR procedure does not know their actual RTO.
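As a rough illustration of how RTO and RPO requirements map to a tier, the sketch below picks the cheapest topology that satisfies both. The specific numbers in `TIERS` are assumptions chosen from within the ranges above, not measurements, and `cheapest_tier` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rto_seconds: float    # assumed achievable recovery time
    rpo_seconds: float    # assumed achievable data-loss window
    relative_cost: float  # extra cost as a fraction of one production footprint


# Illustrative values only; real figures come from tested DR drills.
TIERS = [
    Tier("cold standby", rto_seconds=8 * 3600, rpo_seconds=24 * 3600, relative_cost=0.05),
    Tier("warm standby", rto_seconds=15 * 60, rpo_seconds=5 * 60, relative_cost=0.4),
    Tier("active-active", rto_seconds=30, rpo_seconds=1, relative_cost=1.0),
]


def cheapest_tier(required_rto_s: float, required_rpo_s: float) -> Optional[Tier]:
    """Return the lowest-cost tier meeting both targets, or None if none does."""
    candidates = [
        t for t in TIERS
        if t.rto_seconds <= required_rto_s and t.rpo_seconds <= required_rpo_s
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.relative_cost)
```

With these assumed values, a 4-hour RTO and 1-hour RPO resolve to warm standby, because cold standby's assumed 8-hour recovery misses the target.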

System Flow

flowchart TD
    A["Primary Region Active"] --> B["Asynchronous Data Replication"]
    B --> C["Secondary Region Standby"]
    A -- "Region Outage Detected" --> D{"Failover Decision"}
    D -- "Automated" --> E["Risk: Split-Brain"]
    D -- "Manual Gate" --> F["Verify Data Consistency"]
    F --> G["Promote Secondary to Primary"]
    G --> H["Redirect DNS and Global Traffic"]

A controlled disaster recovery failover balances automation with manual verification — automated promotion risks split-brain with asynchronous replication; a human gate validates data consistency before promotion.
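The manual-gate branch of this flow can be sketched as a check that refuses promotion until every precondition holds. The `StandbyStatus` fields, the blocker messages, and the lag threshold are illustrative assumptions, not part of any real failover tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StandbyStatus:
    replication_lag_seconds: float
    primary_confirmed_offline: bool  # fenced or shut down, not merely unreachable
    operator_approved: bool          # the human gate


def promotion_blockers(status: StandbyStatus, max_lag_seconds: float) -> List[str]:
    """List every reason promotion must not proceed yet."""
    blockers = []
    if not status.primary_confirmed_offline:
        blockers.append("primary not confirmed offline (split-brain risk)")
    if status.replication_lag_seconds > max_lag_seconds:
        blockers.append("replication lag exceeds accepted data-loss window")
    if not status.operator_approved:
        blockers.append("awaiting operator approval")
    return blockers


def failover_steps(status: StandbyStatus, max_lag_seconds: float) -> List[str]:
    """Return the next steps: either holds, or promote followed by redirect."""
    blockers = promotion_blockers(status, max_lag_seconds)
    if blockers:
        return ["hold: " + reason for reason in blockers]
    return ["promote secondary to primary", "redirect DNS and global traffic"]
```

Returning all blockers at once, rather than stopping at the first, gives the operator the full picture in a single check.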

Real-World Examples (Indicative)

AWS RDS Multi-AZ — synchronous replication with <60s RTO

AWS RDS Multi-AZ maintains a synchronous standby in a separate Availability Zone. Every write to the primary is committed to the standby before the write is acknowledged to the application — RPO is near-zero (no committed transactions are lost). Failover is automatic: when the primary fails, AWS updates the RDS CNAME endpoint (TTL=5 seconds) to point to the standby, and the standby is promoted to primary. Total failover time: 20–60 seconds for instance-level failures, up to 120 seconds for AZ-level failures. Applications use the RDS endpoint hostname throughout — no configuration change, IP address change, or connection string update required. The DNS-based failover is why RDS endpoint TTL is deliberately set to 5 seconds: a 30-minute TTL would mean 30 minutes of failed connections pointing to the unavailable primary even after the standby is promoted.
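The difference between acknowledging a write before or after the standby has it can be shown with a toy in-memory model. This is not how RDS implements replication; `ReplicatedLog` and its methods are hypothetical names used only to illustrate why synchronous commit yields near-zero RPO.

```python
class ReplicatedLog:
    """Toy primary/standby pair with synchronous or asynchronous shipping."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.standby = []
        self.acked = []      # writes the application was told succeeded
        self._pending = []   # async only: on the primary, not yet shipped

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # Standby commit happens before the acknowledgement.
            self.standby.append(record)
        else:
            # Acknowledged now, shipped later by ship().
            self._pending.append(record)
        self.acked.append(record)

    def ship(self):
        """Deliver queued records to the standby (asynchronous mode)."""
        self.standby.extend(self._pending)
        self._pending.clear()

    def lost_after_failover(self):
        """Acknowledged writes missing from the standby if the primary dies now."""
        return [r for r in self.acked if r not in self.standby]
```

In synchronous mode `lost_after_failover()` is always empty; in asynchronous mode it contains whatever was acknowledged since the last `ship()`, which is the replication-lag window.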

Netflix Active-Active regional evacuation via Chaos Kong

Netflix operates active-active across multiple AWS regions (us-east-1, us-west-2, eu-west-1) using Zuul as the regional API gateway and Eureka for service discovery. Traffic is distributed across regions at all times. During the 2011 AWS us-east-1 EBS outage, Netflix was unable to fully evacuate the region — manual steps took 4 hours and required engineer intervention at each step. Netflix subsequently built automated regional evacuation tested quarterly via Chaos Kong, a chaos tool that evacuates an entire AWS region's traffic by reconfiguring Denominator (Netflix's multi-DNS-provider abstraction) to withdraw Route 53 records for the affected region. By 2015, the evacuation procedure completed in minutes. The quarterly tests are non-negotiable: an untested evacuation procedure degrades with every infrastructure change and is effectively invalid within 6 months.

Cloudflare anycast routing with BGP-level failover

Cloudflare operates 200+ Points of Presence with anycast IP routing — multiple datacenters advertise the same IP prefix. When a datacenter fails, BGP withdraws the anycast route for that PoP; client traffic automatically routes to the nearest remaining PoP. BGP convergence typically completes within 20–90 seconds globally. RPO is zero for stateless traffic (DNS resolution, HTTP reverse proxy) — there is no state to lose. For Cloudflare's Durable Objects (stateful edge compute), state is replicated across 3 datacenters using Raft consensus: RTO is approximately 2 seconds (Raft leader re-election) and RPO is zero (Raft guarantees no committed data loss). This is the architecture distinction between stateless failover (RPO=0 trivially) and stateful failover (RPO=0 requires explicit consensus protocol).

Anti-Patterns

Automated multi-region failover on flaky triggers

Configuring automated database promotion to fire on a 30-second network timeout. A transient partition that self-heals in 60 seconds triggers a full primary-to-secondary promotion, potentially causing split-brain if the original primary recovers while the new primary is accepting writes. Use human gates for stateful database promotion.
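One way to avoid reacting to a transient partition is to require several consecutive failed probes before failover is even proposed to a human. The sketch below assumes a hypothetical `FailureDetector` class and a fixed probe interval chosen by the operator.

```python
class FailureDetector:
    """Proposes failover only after N consecutive failed health probes."""

    def __init__(self, required_consecutive_failures: int):
        if required_consecutive_failures < 1:
            raise ValueError("required_consecutive_failures must be >= 1")
        self.required = required_consecutive_failures
        self.consecutive_failures = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True once failover should be proposed."""
        if probe_ok:
            # Any success resets the streak, so a self-healing blip is ignored.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.required
```

With probes every 30 seconds and a threshold of 5, a partition that heals within 60 seconds produces at most two or three failed probes and never reaches the threshold. The trade-off is a longer detection time, which counts toward RTO.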

Untested backups

Backing up databases daily without verifying that the backup can be restored to a usable state. PostgreSQL pg_dump backups taken with --format=custom that have never been run through pg_restore may turn out to be incomplete or unrestorable only when an incident is already underway, and a dump that predates later schema migrations restores an older schema than the current application expects.
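A restore drill can be framed as a small harness that runs the restore, runs validation checks against the restored copy, and compares elapsed time to the RTO target. In this sketch `restore` and the entries in `checks` are caller-supplied callables (for example, a wrapper that runs pg_restore into a scratch database and a row-count comparison); they are assumptions, not a real API.

```python
import time
from typing import Callable, Dict


def restore_drill(
    restore: Callable[[], None],
    checks: Dict[str, Callable[[], bool]],
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run a restore plus validation checks and report against the RTO target."""
    started = clock()
    restore()  # an exception here propagates: the drill has failed outright
    failed = [name for name, check in checks.items() if not check()]
    elapsed = clock() - started
    within_rto = elapsed <= rto_seconds
    return {
        "elapsed_seconds": elapsed,
        "within_rto": within_rto,
        "failed_checks": failed,
        "passed": within_rto and not failed,
    }
```

The injectable `clock` exists so the harness itself can be tested without waiting; in a real drill the default monotonic clock is used.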

High DNS TTL on failover endpoints

Setting Route 53 TTL=3600 on the application's primary hostname. When the region fails and DNS is updated to the DR endpoint, clients cache the old record for up to 1 hour — extending the effective RTO by 60 minutes beyond the infrastructure recovery time. Set failover DNS records to TTL=60.
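The arithmetic behind this anti-pattern is simple enough to automate as an audit. The sketch below assumes resolvers honor the TTL and that a client may have cached the old record just before the update, so the worst case adds the full TTL to infrastructure recovery time. The function names and the record map are hypothetical.

```python
from typing import Dict, List


def worst_case_effective_rto(infra_recovery_seconds: float, dns_ttl_seconds: float) -> float:
    """Infrastructure recovery plus the longest a client may cache the old record."""
    return infra_recovery_seconds + dns_ttl_seconds


def high_ttl_records(records: Dict[str, int], max_ttl_seconds: int = 60) -> List[str]:
    """Return hostnames on the failover path whose TTL exceeds the limit."""
    return [name for name, ttl in records.items() if ttl > max_ttl_seconds]
```

For example, a 60-second infrastructure failover behind a TTL=3600 record gives a worst-case effective RTO of 3660 seconds, while the same failover behind TTL=60 gives 120 seconds.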

Ignoring replication lag during failover

Promoting an asynchronous standby without checking replication lag. If the primary was 90 seconds ahead of the standby at the time of failure, RPO is 90 seconds — all transactions in that window are lost. For financial data, this must be explicitly accepted and communicated before promotion proceeds.
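Before promotion, the loss window can be estimated and communicated in concrete terms. This sketch assumes the time of primary failure and the commit timestamp of the last transaction replayed on the standby are both known, and that an average commit rate is available from monitoring; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(primary_failed_at: datetime, last_replayed_commit_at: datetime) -> timedelta:
    """Realized RPO: the span of primary commits the standby never received."""
    return max(primary_failed_at - last_replayed_commit_at, timedelta(0))


def estimated_lost_transactions(window: timedelta, commits_per_second: float) -> int:
    """Rough count of lost transactions, given an assumed average commit rate."""
    return round(window.total_seconds() * commits_per_second)
```

With a 90-second window and an assumed rate of 200 commits per second, the estimate is 18,000 lost transactions, which is the number stakeholders need to accept before promotion proceeds.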

Design Tradeoffs

| Dimension | Active-Passive Warm Standby | Active-Active Multi-Region |
| --- | --- | --- |
| Cost | 30–50% of full production cost; secondary runs at reduced capacity | 2× infrastructure cost per additional region serving live traffic |
| RTO | Minutes: database promotion plus DNS TTL drain time | Seconds: traffic shifted by load balancer without infrastructure changes |
| RPO | Seconds to minutes: bounded by asynchronous replication lag | Near-zero: synchronous replication or consensus protocol eliminates lag |
| Write consistency | Simple: single write master, no conflict resolution needed | Complex: multi-master writes require conflict detection and resolution |
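The cost trade-off can be made concrete by comparing total yearly cost, meaning standby infrastructure plus expected downtime losses, for each option. All figures below are made-up inputs for illustration, and `annual_cost` is a hypothetical helper.

```python
def annual_cost(
    infra_cost_per_year: float,
    outages_per_year: float,
    rto_hours: float,
    downtime_cost_per_hour: float,
) -> float:
    """Yearly DR infrastructure cost plus expected downtime losses."""
    return infra_cost_per_year + outages_per_year * rto_hours * downtime_cost_per_hour


# Assumed inputs: DR overhead of 400k (warm) vs 1M (active-active) per year,
# one regional outage per year, 30 minutes vs about 36 seconds of downtime.
expensive_downtime = 2_000_000  # cost per hour of outage
warm = annual_cost(400_000, 1, 0.5, expensive_downtime)
active_active = annual_cost(1_000_000, 1, 0.01, expensive_downtime)
```

Under these assumptions active-active is cheaper overall; if downtime cost drops to 100,000 per hour, warm standby wins instead. The outage frequency is the input teams most often guess at, so it is worth testing a range of values.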

Best Practices

Require a human approval gate for stateful database promotion in warm standby configurations. Verify replication lag, check that the primary is confirmed offline (not just unreachable), and validate that no in-flight transactions are pending before promoting the standby.
Set DNS TTLs to 30–60 seconds on all failover-path records. The RDS endpoint uses TTL=5 for this reason: with a 3600-second TTL, clients keep resolving to the failed endpoint long after the infrastructure itself has recovered. Audit your DNS records quarterly for high TTLs on hostnames in the recovery path.
Test the DR procedure end-to-end at least quarterly, including verifying that rotated credentials still work, that runbook steps match the current console UI, and that the measured recovery time meets the RTO target. DR plans that are not tested are not DR plans; they are documentation with unknown accuracy.
Implement read-only failover mode for warm standby databases: when the primary is unreachable, serve reads from the replica immediately while the human gate decision is pending. Read-only mode limits revenue loss for read-heavy workloads during the promotion verification window, at the cost of serving data that may be stale by the replication lag.
Automate infrastructure provisioning for cold and warm standby using IaC (Terraform, CloudFormation). A standby environment that requires manual provisioning extends RTO by the time required to hand-configure infrastructure under incident stress.
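The read-only failover mode from the practices above can be sketched as a router that keeps serving reads from the replica and rejects writes while promotion is pending. Plain dictionaries stand in for the primary and replica databases here, and `ReadOnlyFailoverRouter` is a hypothetical name.

```python
class ReadOnlyFailoverRouter:
    """Routes reads to the replica and rejects writes when the primary is down."""

    def __init__(self, primary: dict, replica: dict):
        self.primary = primary
        self.replica = replica
        self.primary_reachable = True  # flipped by health checking (not shown)

    def read(self, key):
        # Replica reads may be stale by the replication lag.
        source = self.primary if self.primary_reachable else self.replica
        return source.get(key)

    def write(self, key, value):
        if not self.primary_reachable:
            # Never write to the unpromoted replica: that is how split-brain starts.
            raise RuntimeError("read-only mode: primary unreachable, promotion pending")
        self.primary[key] = value
```

Rejecting writes explicitly, rather than queueing them, keeps the failure visible to callers and leaves the decision about retries to the application.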

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Operating mission-critical services where downtime has direct revenue impact or legal consequences | Building internal tools or non-critical applications where multi-hour outages are acceptable and have no financial consequence |
| Designing systems that must survive complete cloud provider or regional infrastructure failures | Working with tight budgets where the cost of redundant infrastructure exceeds the cost of the longest expected downtime |
| Operating under regulatory requirements for geographic data redundancy and maximum data loss | Early-stage startups where product-market fit and iteration velocity are prioritized over multi-region reliability |