Chaos Engineering

Chaos engineering proactively injects controlled failures to surface hidden systemic weaknesses before they become production incidents — the first time your circuit breakers and failover mechanisms are tested should not be during a customer-impacting outage.

TL;DR
  • Chaos engineering is not "breaking things randomly" — it follows a scientific methodology: define steady state using business metrics, formulate a falsifiable hypothesis, inject a specific fault, measure deviation, abort if thresholds are exceeded.
  • Netflix Chaos Monkey terminates 1 random EC2 instance per day per auto-scaling group during business hours only — so engineers are awake when the fault fires. The constraint is intentional: run experiments when you can respond, not at 3am.
  • Gremlin's CPU spike attack (80% CPU for 60 seconds) and network latency attack (add 100ms to all outbound connections) are the two most commonly revealing experiments — they surface synchronous call chains and missing circuit breakers more reliably than any code review.
  • AWS Fault Injection Simulator (FIS) integrates with CloudWatch: configure a CloudWatch alarm as a stop condition (for example, error rate > 5%) and FIS automatically stops the experiment without human intervention when the alarm fires.
  • Start chaos experiments on a single instance in staging, not across a production fleet. The goal is to find the first failure mode, not to validate that the system survives large-scale simultaneous failures.

The Problem

Engineers design systems assuming components fail cleanly — a process crashes, a network connection drops, and the retry logic handles it. In production, failures are messy and partial: network links degrade slowly without dropping connections, disks fill up to 95% while still accepting writes, and third-party APIs return HTTP 200 with a body that causes a NullPointerException. Circuit breakers that have never been tripped have hidden configuration bugs. Failover mechanisms tested only in staging against a 10× smaller replica database behave differently at production scale. Without proactive fault injection, these edge cases remain hidden until they trigger multi-hour incidents — and the first test of your resilience mechanisms occurs during a high-stakes crisis with real customer impact.

Core System Idea

Chaos engineering follows a strict scientific methodology, not random destruction. The method has five steps:
(1) Define steady state: identify measurable business metrics that indicate normal operation, such as checkout completion rate, successful API call percentage, and P99 latency. Do not use system metrics (CPU, memory) as the primary steady-state indicator, because chaos engineering is about user impact, not infrastructure health.
(2) Formulate a falsifiable hypothesis: state what should happen, for example "If we terminate one of five payment service pods, checkout success rate will return to above 99.9% within 30 seconds as the remaining pods absorb the load."
(3) Define blast radius and kill switch: limit the scope to 1 instance, 10% of traffic, or 1 availability zone. Configure an automated abort condition: if checkout success rate drops below 95%, halt the experiment immediately.
(4) Inject the fault: terminate an instance, inject 100ms network latency on outbound calls, or spike CPU to 80%.
(5) Analyze results and fix: if the hypothesis holds, scale the experiment; if it fails, document the failure mode and fix the systemic weakness.
The kill switch is not optional. An experiment without an automated abort condition is a production incident waiting to happen.
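The five steps can be sketched as a small experiment runner. This is a minimal illustration, not any tool's actual API: the `Experiment` class, `run_experiment`, and the callbacks are hypothetical names, and the metric samples stand in for whatever monitoring source a team already has.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Experiment:
    hypothesis: str
    pass_threshold: float   # hypothesis holds if the metric never drops below this
    abort_threshold: float  # kill switch fires if the metric drops below this


def run_experiment(
    exp: Experiment,
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    metric_samples: Iterable[float],
) -> str:
    """Return 'aborted', 'hypothesis_failed', or 'hypothesis_held'."""
    inject_fault()
    lowest = float("inf")
    try:
        for sample in metric_samples:
            lowest = min(lowest, sample)
            if sample < exp.abort_threshold:
                # Automated kill switch: stop observing and clean up immediately.
                return "aborted"
    finally:
        # The fault is always removed, even on abort or on an exception.
        remove_fault()
    return "hypothesis_held" if lowest >= exp.pass_threshold else "hypothesis_failed"


if __name__ == "__main__":
    exp = Experiment(
        hypothesis="Terminating 1 of 5 payment pods keeps checkout success >= 99.9%",
        pass_threshold=0.999,
        abort_threshold=0.95,
    )
    outcome = run_experiment(
        exp,
        inject_fault=lambda: print("fault injected"),
        remove_fault=lambda: print("fault removed"),
        metric_samples=[0.9995, 0.9993, 0.9996],
    )
    print(outcome)
```

Putting the cleanup in `finally` reflects the point that the abort path must not depend on a human remembering to remove the fault.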

System Flow

flowchart TD
    A["Define Steady-State Metrics"] --> B["Formulate Resilience Hypothesis"]
    B --> C["Define Blast Radius and Kill Switch"]
    C --> D["Inject Controlled Fault"]
    D --> E{"Steady State Maintained?"}
    E -- "Yes" --> F["Scale Experiment"]
    E -- "No" --> G["Trigger Kill Switch"]
    G --> H["Document and Fix Vulnerability"]

The chaos engineering lifecycle enforces a controlled, hypothesis-driven loop with automated abort mechanisms — any deviation from steady state halts the experiment immediately.

Real-World Examples (Indicative)

Netflix Chaos Monkey — termination constraints

Netflix Chaos Monkey (part of the Simian Army, open-sourced in 2012) runs as a scheduled service within Netflix's Spinnaker deployment platform. It terminates 1 random EC2 instance per auto-scaling group per day, but only during business hours (9am–3pm Pacific); the explicit constraint ensures engineers are awake and available to respond when the termination fires. The key metric Netflix monitors after termination is the time for the ASG to launch a replacement instance and for that instance to pass health checks and enter the load balancer rotation. Target: <60 seconds from termination to replacement ready. When Chaos Monkey exposed that a specific service took 8 minutes to recover (due to slow AMI baking), Netflix added the service to the Chaos Monkey exclusion list and fixed the slow startup before re-enabling terminations.
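The scheduling constraint can be illustrated with a short sketch. This is not Chaos Monkey's actual code: `termination_allowed` and `pick_victims` are hypothetical helpers, the weekday-only rule is an added assumption, and the caller is assumed to pass the current time already expressed in Pacific local time.

```python
import random
from datetime import datetime


def termination_allowed(local_now: datetime) -> bool:
    """True only between 09:00 and 15:00 on weekdays (assumed Pacific local time)."""
    is_weekday = local_now.weekday() < 5  # Monday=0 .. Friday=4 (assumption)
    return is_weekday and 9 <= local_now.hour < 15


def pick_victims(
    groups: dict[str, list[str]],
    local_now: datetime,
    rng: random.Random,
) -> dict[str, str]:
    """Pick at most one random instance per auto-scaling group, or none off-hours."""
    if not termination_allowed(local_now):
        return {}
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances  # skip empty groups
    }


if __name__ == "__main__":
    groups = {"checkout-asg": ["i-aaa", "i-bbb"], "search-asg": ["i-ccc"]}
    tuesday_10am = datetime(2024, 3, 5, 10, 0)
    print(pick_victims(groups, tuesday_10am, random.Random()))
```

Running this once per day from a scheduler would approximate the one-termination-per-group-per-day behaviour described above.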

Gremlin attack types at DoorDash

DoorDash uses Gremlin's commercial chaos platform for two categories of experiments. CPU spike attack (80% CPU for 60 seconds) on the delivery routing service revealed a synchronous Python route-optimization function that blocked the event loop under CPU pressure, causing 8-second P99 latency for the API that coordinates driver assignments. Network latency attack (add 100ms to all outbound TCP connections from the ETA calculation service) revealed 12 sequential synchronous API calls to restaurant systems — each adding 100ms latency meant 1.2 seconds added to the critical path during any network degradation event. Both were identified in staging via Gremlin before reaching production. Gremlin's shutdownOnExit=true safety valve ensures injected faults are automatically removed if the Gremlin agent process dies — preventing orphaned network delay rules from persisting after an aborted experiment.
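The 1.2-second figure follows from simple arithmetic, which the sketch below makes explicit. The function name and the `concurrency` parameter are illustrative; the concurrent case assumes the calls are independent and could be issued in parallel, which the original description does not state.

```python
import math


def added_latency_ms(calls: int, injected_ms: float, concurrency: int = 1) -> float:
    """Extra critical-path latency when every outbound call gains injected_ms.

    Calls are assumed to run in waves of size `concurrency`; each wave pays the
    injected delay once.
    """
    if calls < 0 or concurrency < 1:
        raise ValueError("calls must be >= 0 and concurrency >= 1")
    waves = math.ceil(calls / concurrency)
    return waves * injected_ms


if __name__ == "__main__":
    print(added_latency_ms(12, 100))                  # 12 sequential calls
    print(added_latency_ms(12, 100, concurrency=12))  # all calls issued together
```

With 12 sequential calls the injected 100ms is paid 12 times (1,200ms); if the same calls were issued concurrently, it would be paid once.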

AWS Fault Injection Simulator (FIS) GameDays at Amazon

Amazon's own teams use FIS for quarterly GameDay experiments where aws:eks:terminate-nodegroup-instances is applied to 20% of nodes in a production EKS cluster. The target: cluster autoscaler provisions replacement nodes and pods reschedule within 3 minutes. FIS integrates with CloudWatch Alarms as stop conditions: alarm_arn: arn:aws:cloudwatch:us-east-1:...:alarm:checkout-error-rate-5pct — if checkout errors exceed 5%, FIS aborts the experiment and emits a CloudWatch event that pages the on-call engineer. FIS experiment templates are stored as CloudFormation resources in the same repository as the infrastructure they test, allowing experiments to be version-controlled and reviewed in pull requests like any other infrastructure change.
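As a rough sketch, the experiment described above could be expressed as the request body for the FIS `CreateExperimentTemplate` API, built here as a plain Python dictionary. The field names reflect my reading of the FIS API and should be checked against current AWS documentation; the function name, target label, and all ARNs passed in are placeholders, not values from the original.

```python
def build_fis_template(
    nodegroup_arn: str,
    stop_alarm_arn: str,
    role_arn: str,
    termination_percent: int = 20,
) -> dict:
    """Build a template that terminates a percentage of EKS node group instances."""
    if not 1 <= termination_percent <= 100:
        raise ValueError("termination_percent must be between 1 and 100")
    return {
        "description": "GameDay: terminate a share of EKS node group instances",
        "roleArn": role_arn,
        "targets": {
            # "gameday-nodegroups" is an arbitrary label chosen for this sketch.
            "gameday-nodegroups": {
                "resourceType": "aws:eks:nodegroup",
                "resourceArns": [nodegroup_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "terminate-nodes": {
                "actionId": "aws:eks:terminate-nodegroup-instances",
                # FIS action parameters are passed as strings.
                "parameters": {"instanceTerminationPercentage": str(termination_percent)},
                "targets": {"Nodegroups": "gameday-nodegroups"},
            }
        },
        # Kill switch: FIS stops the experiment when this alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
    }


if __name__ == "__main__":
    template = build_fis_template(
        nodegroup_arn="<nodegroup-arn>",
        stop_alarm_arn="<checkout-error-rate-alarm-arn>",
        role_arn="<fis-role-arn>",
    )
    print(template["actions"]["terminate-nodes"]["parameters"])
```

The same structure maps onto a CloudFormation resource, which is what allows the template to be reviewed in a pull request alongside the infrastructure it targets.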

Anti-Patterns

Injecting faults on known broken systems

Running a chaos experiment when the service already has a missing circuit breaker or known single point of failure. The result is a predictable outage, not a discovery. Chaos engineering surfaces unknown weaknesses — known ones should be fixed before the experiment runs.

No automated kill switch

Starting an experiment without a CloudWatch stop condition, a Gremlin maxDuration limit, or equivalent abort mechanism. An experiment that cannot be automatically halted is an uncontrolled production incident waiting for human intervention at the worst possible moment.

Using system metrics as steady state

Defining steady state as "CPU < 70%" instead of "checkout success rate > 99.5%". A fault that causes CPU to spike from 30% to 75% but has zero impact on user-facing success rates would incorrectly trigger the abort condition.

Starting at fleet-wide scope

Injecting 100ms latency on all instances simultaneously as the first experiment. The right progression: start with 1 instance in staging, then 1 instance in production, then 10% of instances — each step only after the previous step's hypothesis was confirmed.
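The progression rule can be written down as a tiny gate. This is only a sketch of the policy described above; the stage labels and the `next_stage` helper are illustrative names.

```python
STAGES = [
    "staging: 1 instance",
    "production: 1 instance",
    "production: 10% of instances",
]


def next_stage(current: int, hypothesis_confirmed: bool) -> int:
    """Advance one stage only after the current stage's hypothesis was confirmed."""
    if not 0 <= current < len(STAGES):
        raise IndexError("unknown stage")
    if not hypothesis_confirmed:
        # Stay put: fix the weakness, then rerun the experiment at the same scope.
        return current
    return min(current + 1, len(STAGES) - 1)


if __name__ == "__main__":
    stage = 0
    stage = next_stage(stage, hypothesis_confirmed=True)
    print(STAGES[stage])
```

A failed hypothesis never widens the blast radius; the experiment repeats at the same scope until it passes.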

Design Tradeoffs

| Dimension | Production Chaos | Staging Chaos |
| --- | --- | --- |
| Realism | Real traffic patterns, production scale, and background noise | Often 10–100× smaller than production; synthetic traffic only |
| Customer risk | Possible if blast radius or kill switch is misconfigured | Zero customer impact regardless of experiment outcome |
| Failure discovery | Finds emergent failures from scale and real traffic interaction | Finds component failures and configuration bugs |
| Required maturity | Mature observability, SLOs, and organizational buy-in required | Suitable for teams in early reliability investment stages |

Best Practices

  • Define the kill switch before injecting the fault, not after. The abort condition must be automated and based on a business metric: if checkout_success_rate < 95% for 60 seconds, abort experiment. Manual kill switches fail during incidents when the engineer is distracted diagnosing the fault.
  • Run experiments during business hours when engineers are available. Netflix's explicit business-hours constraint on Chaos Monkey is intentional: the goal is to find weaknesses with engineers present, not to validate that auto-healing works without human awareness.
  • Progress from single-instance staging experiments to production fleet experiments over multiple quarters. The path: (1) staging single-instance, (2) production off-peak single-instance, (3) production peak single-instance, (4) production multi-instance. Each level requires a clean pass at the previous level.
  • Measure business metrics as the steady-state indicator, not infrastructure metrics. A fault that causes CPU to spike without affecting user-visible success rate is not impacting reliability, so the chaos experiment did not find a weakness worth fixing.
  • Conduct GameDays where engineers are aware of the experiment and practice diagnosing and responding. GameDays validate runbooks and alerting configuration: the goal is not just to confirm the system survives, but to confirm engineers can respond effectively when it doesn't.
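The "below 95% for 60 seconds" abort rule from the first practice can be sketched as a sustained-breach check. This is a minimal illustration: `should_abort` is a hypothetical helper, and the samples are assumed to be time-ordered (timestamp in seconds, success rate) pairs from an existing metrics pipeline.

```python
def should_abort(
    samples: list[tuple[float, float]],
    threshold: float = 0.95,
    sustain_seconds: float = 60.0,
) -> bool:
    """True if the success rate stayed below threshold for sustain_seconds or more."""
    breach_start = None
    for timestamp, rate in samples:
        if rate < threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= sustain_seconds:
                return True
        else:
            breach_start = None  # recovery resets the window
    return False


if __name__ == "__main__":
    samples = [(0, 0.99), (10, 0.94), (40, 0.93), (70, 0.92)]
    print(should_abort(samples))
```

Requiring a sustained breach avoids aborting on a single noisy sample, at the cost of up to 60 seconds of additional exposure; the window length is a tradeoff each team has to pick.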

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Operating complex distributed microservice architectures with circuit breakers and failover mechanisms that have never been triggered in production | The system is highly unstable, lacks basic monitoring, or has known single points of failure that should be fixed first |
| Validating that automated recovery mechanisms work at production scale: auto-scaling, circuit breaker tripping, failover | Building simple monolithic applications where failure modes are obvious and recovery is manual |
| Preparing engineering teams for on-call rotations by simulating real-world incident scenarios and validating runbooks | Highly regulated environments where any production fault injection requires extensive change-management approval processes |