Disaster Recovery Design
RTO defines how fast you must recover, RPO defines how much data you can afford to lose — AWS RDS Multi-AZ achieves <60s RTO and near-zero RPO through synchronous replication, while warm standby trades RPO for lower cost by using asynchronous replication.
- RTO (how fast you recover) and RPO (how much data you lose) are the two metrics that drive all DR architecture decisions. AWS RDS Multi-AZ achieves RTO <60 seconds and RPO near-zero through synchronous replication to a standby instance in a separate AZ.
- Automated failover without a human gate risks split-brain: both the primary and promoted secondary accept writes simultaneously, creating conflicting data that cannot be automatically reconciled. For stateful databases, require human approval before promotion.
- DNS TTL is the hidden RTO contributor. RDS endpoint TTL is 5 seconds — designed for fast failover. A custom load balancer DNS record with TTL=3600 extends effective RTO by up to 1 hour after the infrastructure failover completes.
- Backups that have never been tested are not backups. Regular exercises, such as Salesforce's DR drills and Netflix's quarterly Chaos Kong regional evacuations, are the only way to validate that the recovery procedure actually works within the required RTO.
- Active-active multi-region roughly doubles infrastructure cost for each additional region, compared with the 30–50% overhead of a warm standby, but provides near-zero RTO and RPO. The deciding factor is the cost of downtime: if 1 hour of downtime costs more than 1 year of active-active overhead, the math favors active-active.
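The cost-of-downtime rule in the last point can be written as a small break-even check. This is a minimal sketch: the function name, its parameters, and the dollar figures are illustrative assumptions, not numbers from any real system.

```python
def active_active_pays_off(downtime_cost_per_hour: float,
                           expected_downtime_hours_per_year: float,
                           extra_annual_cost: float) -> bool:
    """Return True when the expected yearly downtime loss exceeds the
    extra yearly spend on active-active infrastructure."""
    expected_loss = downtime_cost_per_hour * expected_downtime_hours_per_year
    return expected_loss > extra_annual_cost


# Hypothetical figures: 1 hour of downtime per year, $300k/year of overhead.
print(active_active_pays_off(500_000, 1, 300_000))  # downtime is the larger cost
print(active_active_pays_off(10_000, 1, 300_000))   # overhead is the larger cost
```

The expected downtime is the hardest input to estimate, which is one more reason to measure actual recovery time in drills rather than assume it.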
The Problem
A cloud region suffers a major failure. The team has a DR plan documented in a wiki last updated 14 months ago. The plan references a manual failover script, which fails because the credentials for the standby database's IAM role were rotated 6 months ago and the script still uses the old ones. While the team debugs the credentials, a junior engineer triggers the automated failover script, which attempts to promote the standby while the primary is still partially online. Both databases accept writes for 8 minutes, creating conflicting transaction records that cannot be automatically merged. The incident post-mortem reveals that the DR plan was never tested end-to-end, that the manual steps assumed a cloud console that no longer looks the same, and that the RTO target of "4 hours" was never validated against the actual duration of the recovery procedure.
Core System Idea
Disaster recovery architecture is governed by two metrics: RTO (Recovery Time Objective), the maximum acceptable duration from failure detection to restored service, and RPO (Recovery Point Objective), the maximum acceptable age of data at the point of recovery. These metrics directly determine the architecture tier. Three topologies:
(1) Cold standby: infrastructure is provisioned only after a disaster is declared. RTO: hours to days (depends on provisioning speed). RPO: hours (determined by backup frequency). Cost: minimal; pay only for storage of backups and IaC templates, not running compute. Use case: internal tools, dev environments.
(2) Warm standby: a scaled-down replica runs continuously, receiving asynchronous replication from the primary. RTO: minutes (scale up replica, promote database, redirect DNS). RPO: seconds to minutes (determined by replication lag). Cost: 30–50% of full production cost. Use case: SaaS products with 99.9% SLA.
(3) Active-active: full production infrastructure runs in multiple regions simultaneously, serving live traffic with synchronous or near-synchronous replication. RTO: seconds (traffic redirected without infrastructure changes). RPO: near-zero (synchronous replication). Cost: 2× per additional region. Use case: financial systems, global consumer products with <1 minute RTO requirements.
The architecture choice must be validated against actual RTO/RPO requirements: an organization that has never tested its DR procedure does not know its actual RTO.
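The mapping from requirements to tier can be sketched as a lookup over the three topologies, ordered cheapest first. The worst-case RTO/RPO figures assigned to each tier below are illustrative assumptions chosen to fall inside the ranges given above; real values have to come from measured drills.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rto_seconds: float  # assumed worst-case recovery time for this tier
    rpo_seconds: float  # assumed worst-case data-loss window for this tier


# Ordered cheapest first; the numbers are hypothetical placeholders.
TIERS = [
    Tier("cold standby", rto_seconds=2 * 24 * 3600, rpo_seconds=24 * 3600),
    Tier("warm standby", rto_seconds=30 * 60, rpo_seconds=5 * 60),
    Tier("active-active", rto_seconds=60, rpo_seconds=1),
]


def cheapest_tier(required_rto_seconds: float,
                  required_rpo_seconds: float) -> Optional[Tier]:
    """Return the cheapest tier whose assumed RTO and RPO both meet the
    requirement, or None if no tier in the table is fast enough."""
    for tier in TIERS:
        if (tier.rto_seconds <= required_rto_seconds
                and tier.rpo_seconds <= required_rpo_seconds):
            return tier
    return None


# A 4-hour RTO with a 15-minute RPO rules out cold standby.
print(cheapest_tier(4 * 3600, 15 * 60).name)
```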
System Flow
A controlled disaster recovery failover balances automation with manual verification — automated promotion risks split-brain with asynchronous replication; a human gate validates data consistency before promotion.
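One way to picture the human gate is as a list of blockers that must all clear before promotion. This is a minimal sketch, not a real failover tool: `FailoverState`, its fields, and `promotion_blockers` are hypothetical names, and how each field gets populated (fencing checks, lag queries, approval workflow) is left as an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailoverState:
    primary_fenced: bool               # old primary confirmed unable to accept writes
    replication_lag_seconds: float     # how far the standby trails the primary
    approved_by: Optional[str]         # operator who signed off, or None
    data_loss_accepted: bool = False   # explicit acceptance of lag beyond the RPO


def promotion_blockers(state: FailoverState, rpo_seconds: float) -> List[str]:
    """Return the reasons promotion must not proceed; an empty list means go."""
    blockers = []
    if not state.primary_fenced:
        blockers.append("primary is not fenced; promoting now risks split-brain")
    if state.replication_lag_seconds > rpo_seconds and not state.data_loss_accepted:
        blockers.append("replication lag exceeds the RPO and the loss is not accepted")
    if state.approved_by is None:
        blockers.append("no operator has approved the promotion")
    return blockers


state = FailoverState(primary_fenced=False, replication_lag_seconds=90.0, approved_by=None)
for reason in promotion_blockers(state, rpo_seconds=5.0):
    print("BLOCKED:", reason)
```

Returning every blocker at once, rather than stopping at the first, gives the operator the full picture in a single check.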
Real-World Examples (Indicative)
AWS RDS Multi-AZ maintains a synchronous standby in a separate Availability Zone. Every write to the primary is committed to the standby before the write is acknowledged to the application — RPO is near-zero (no committed transactions are lost). Failover is automatic: when the primary fails, AWS updates the RDS CNAME endpoint (TTL=5 seconds) to point to the standby, and the standby is promoted to primary. Total failover time: 20–60 seconds for instance-level failures, up to 120 seconds for AZ-level failures. Applications use the RDS endpoint hostname throughout — no configuration change, IP address change, or connection string update required. The DNS-based failover is why RDS endpoint TTL is deliberately set to 5 seconds: a 30-minute TTL would mean 30 minutes of failed connections pointing to the unavailable primary even after the standby is promoted.
Netflix operates active-active across multiple AWS regions (us-east-1, us-west-2, eu-west-1) using Zuul as the regional API gateway and Eureka for service discovery. Traffic is distributed across regions at all times. During the 2011 AWS us-east-1 EBS outage, Netflix was unable to fully evacuate the region — manual steps took 4 hours and required engineer intervention at each step. Netflix subsequently built automated regional evacuation tested quarterly via Chaos Kong, a chaos tool that evacuates an entire AWS region's traffic by reconfiguring Denominator (Netflix's multi-DNS-provider abstraction) to withdraw Route 53 records for the affected region. By 2015, the evacuation procedure completed in minutes. The quarterly tests are non-negotiable: an untested evacuation procedure degrades with every infrastructure change and is effectively invalid within 6 months.
Cloudflare operates 200+ Points of Presence with anycast IP routing — multiple datacenters advertise the same IP prefix. When a datacenter fails, BGP withdraws the anycast route for that PoP; client traffic automatically routes to the nearest remaining PoP. BGP convergence typically completes within 20–90 seconds globally. RPO is zero for stateless traffic (DNS resolution, HTTP reverse proxy) — there is no state to lose. For Cloudflare's Durable Objects (stateful edge compute), state is replicated across 3 datacenters using Raft consensus: RTO is approximately 2 seconds (Raft leader re-election) and RPO is zero (Raft guarantees no committed data loss). This is the architecture distinction between stateless failover (RPO=0 trivially) and stateful failover (RPO=0 requires explicit consensus protocol).
Anti-Patterns
Configuring automated database promotion to fire on a 30-second network timeout. A transient partition that self-heals in 60 seconds triggers a full primary-to-secondary promotion, potentially causing split-brain if the original primary recovers while the new primary is accepting writes. Use human gates for stateful database promotion.
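A detector that requires several consecutive failed probes before escalating is one way to ride out transient partitions. The sketch below is an assumption-laden illustration: `FailureDetector` is a hypothetical class, and it only signals that an operator should be paged; it does not promote anything itself.

```python
class FailureDetector:
    """Escalate only after N consecutive failed health probes."""

    def __init__(self, required_consecutive_failures: int) -> None:
        self.required = required_consecutive_failures
        self._failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when escalation is warranted."""
        if healthy:
            self._failures = 0  # any success resets the streak
        else:
            self._failures += 1
        return self._failures >= self.required


# With probes every 30 seconds and a threshold of 4, a 60-second partition
# (2 failed probes) does not escalate; a sustained outage does.
detector = FailureDetector(required_consecutive_failures=4)
probes = [False, False, True, False, False, False, False]
print([detector.record(p) for p in probes])
```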
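Backing up databases daily without verifying that the backup can be restored to the same data state. A PostgreSQL pg_dump archive created with --format=custom that has never been run through pg_restore gives no evidence that it is restorable, and it will not contain schema migrations applied after the backup was created, so a restore can succeed yet produce a schema the current application cannot use.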
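A restore drill needs a pass/fail check once the backup has been loaded into a scratch database. The sketch below compares per-table row counts recorded at backup time against counts read from the restored copy. It is a simplified assumption: `verify_restore` is a hypothetical helper, collecting the counts is left out, and matching row counts alone do not prove the data is identical.

```python
from typing import Dict, Optional, Tuple


def verify_restore(expected_counts: Dict[str, int],
                   restored_counts: Dict[str, int]
                   ) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {table: (expected, actual)} for every table whose restored
    row count differs; actual is None when the table is missing."""
    mismatches = {}
    for table, expected in expected_counts.items():
        actual = restored_counts.get(table)
        if actual != expected:
            mismatches[table] = (expected, actual)
    return mismatches


# Hypothetical counts: one table restored short, one table missing.
expected = {"users": 1200, "orders": 5400, "invoices": 300}
restored = {"users": 1200, "orders": 5398}
print(verify_restore(expected, restored))
```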
Setting Route 53 TTL=3600 on the application's primary hostname. When the region fails and DNS is updated to the DR endpoint, clients cache the old record for up to 1 hour — extending the effective RTO by 60 minutes beyond the infrastructure recovery time. Set failover DNS records to TTL=60.
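The TTL contribution is simple addition in the worst case, where a client cached the old record just before the DNS update. A minimal sketch, with a hypothetical function name and an assumed 5-minute infrastructure recovery time:

```python
def effective_rto_seconds(infra_recovery_seconds: float,
                          dns_ttl_seconds: float) -> float:
    """Worst-case time until every client reaches the DR endpoint:
    infrastructure recovery plus one full TTL of stale cached records."""
    return infra_recovery_seconds + dns_ttl_seconds


# Assumed 5-minute infrastructure recovery under the two TTLs discussed above.
print(effective_rto_seconds(300, 3600) / 60, "minutes with TTL=3600")
print(effective_rto_seconds(300, 60) / 60, "minutes with TTL=60")
```

This ignores resolvers and clients that cache beyond the advertised TTL, so the real tail can be longer.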
Promoting an asynchronous standby without checking replication lag. If the primary was 90 seconds ahead of the standby at the time of failure, RPO is 90 seconds — all transactions in that window are lost. For financial data, this must be explicitly accepted and communicated before promotion proceeds.
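The size of the loss can be estimated before anyone approves the promotion. The sketch below derives lag from two timestamps and converts it to an approximate transaction count. The function names, the timestamps, and the write rate are illustrative assumptions; how the timestamps are obtained depends on the database.

```python
from datetime import datetime, timedelta


def replication_lag(primary_last_commit: datetime,
                    standby_last_replayed: datetime) -> timedelta:
    """Lag between the primary's last known commit and the standby's last
    replayed commit, floored at zero."""
    return max(primary_last_commit - standby_last_replayed, timedelta(0))


def estimated_lost_transactions(lag: timedelta, writes_per_second: float) -> float:
    """Rough count of committed transactions that promotion would discard."""
    return lag.total_seconds() * writes_per_second


# Hypothetical values matching the 90-second example above.
primary = datetime(2024, 1, 1, 12, 0, 0)
standby = primary - timedelta(seconds=90)
lag = replication_lag(primary, standby)
print(f"Promoting now loses about {estimated_lost_transactions(lag, 200):.0f} "
      f"transactions ({lag.total_seconds():.0f}s of writes)")
```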
Design Tradeoffs
| Dimension | Active-Passive Warm Standby | Active-Active Multi-Region |
|---|---|---|
| Cost | 30–50% of full production cost; secondary runs at reduced capacity | 2× infrastructure cost per additional region serving live traffic |
| RTO | Minutes — database promotion plus DNS TTL drain time | Seconds — traffic shifted by load balancer without infrastructure changes |
| RPO | Seconds to minutes — bounded by asynchronous replication lag | Near-zero — synchronous replication or consensus protocol eliminates lag |
| Write consistency | Simple — single write master, no conflict resolution needed | Complex — multi-master writes require conflict detection and resolution |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Operating mission-critical services where downtime has direct revenue impact or legal consequences | Building internal tools or non-critical applications where multi-hour outages are acceptable and have no financial consequence |
| Designing systems that must survive complete cloud provider or regional infrastructure failures | Working with tight budgets where the cost of redundant infrastructure exceeds the cost of the longest expected downtime |
| Operating under regulatory requirements for geographic data redundancy and maximum data loss | Early-stage startups where product-market fit and iteration velocity are prioritized over multi-region reliability |