Chaos Engineering
Chaos engineering proactively injects controlled failures to surface hidden systemic weaknesses before they become production incidents — the first time your circuit breakers and failover mechanisms are tested should not be during a customer-impacting outage.
- Chaos engineering is not "breaking things randomly" — it follows a scientific methodology: define steady state using business metrics, formulate a falsifiable hypothesis, inject a specific fault, measure deviation, abort if thresholds are exceeded.
- Netflix Chaos Monkey terminates 1 random EC2 instance per day per auto-scaling group during business hours only — so engineers are awake when the fault fires. The constraint is intentional: run experiments when you can respond, not at 3am.
- Gremlin's CPU spike attack (80% CPU for 60 seconds) and network latency attack (add 100ms to all outbound connections) are the two most commonly revealing experiments — they surface synchronous call chains and missing circuit breakers more reliably than any code review.
- AWS Fault Injection Simulator (FIS) integrates with CloudWatch: configure a stop condition at error_rate > 5% and FIS automatically aborts and rolls back the experiment without human intervention.
- Start chaos experiments on a single instance in staging, not across a production fleet. The goal is to find the first failure mode, not to validate that the system survives large-scale simultaneous failures.
The Problem
Engineers design systems assuming components fail cleanly — a process crashes, a network connection drops, and the retry logic handles it. In production, failures are messy and partial: network links degrade slowly without dropping connections, disks fill up to 95% while still accepting writes, and third-party APIs return HTTP 200 with a body that causes a NullPointerException. Circuit breakers that have never been tripped have hidden configuration bugs. Failover mechanisms tested only in staging against a 10× smaller replica database behave differently at production scale. Without proactive fault injection, these edge cases remain hidden until they trigger multi-hour incidents — and the first test of your resilience mechanisms occurs during a high-stakes crisis with real customer impact.
Core System Idea
Chaos engineering follows a strict scientific methodology, not random destruction. The five steps:
1. Define steady state: identify measurable business metrics that indicate normal operation, such as checkout completion rate, successful API call percentage, or P99 latency. Do not use system metrics (CPU, memory) as the primary steady-state indicator; chaos engineering is about user impact, not infrastructure health.
2. Formulate a falsifiable hypothesis: state what should happen, for example "If we terminate one of five payment service pods, checkout success rate will return to above 99.9% within 30 seconds as the remaining pods absorb the load."
3. Define blast radius and kill switch: limit the scope to 1 instance, 10% of traffic, or 1 availability zone. Configure an automated abort condition: if checkout success rate drops below 95%, halt the experiment immediately.
4. Inject the fault: terminate an instance, inject 100ms network latency on outbound calls, or spike CPU to 80%.
5. Analyze results and fix: if the hypothesis holds, scale the experiment; if it fails, document the failure mode and fix the systemic weakness.

The kill switch is not optional: an experiment without an automated abort condition is a production incident waiting to happen.
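The five steps can be sketched as a small control loop. This is a minimal illustration, not a real chaos tool: the `Experiment` fields, the `inject` and `rollback` hooks, and the `read_metric` callable are all hypothetical names introduced here, and the thresholds simply reuse the 99.9% and 95% figures from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str                 # falsifiable statement under test
    steady_state_min: float         # hypothesis threshold, e.g. 0.999
    abort_below: float              # kill-switch threshold, e.g. 0.95
    inject: Callable[[], None]      # hypothetical hook that starts the fault
    rollback: Callable[[], None]    # hypothetical hook that removes the fault


def run_experiment(exp: Experiment, read_metric: Callable[[], float], samples: int) -> str:
    """Return 'not_in_steady_state', 'aborted', 'hypothesis_failed' or 'hypothesis_held'."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    # Step 1: refuse to start unless the business metric is already healthy.
    if read_metric() < exp.steady_state_min:
        return "not_in_steady_state"
    observed: List[float] = []
    exp.inject()                    # Step 4: inject the fault
    try:
        for _ in range(samples):
            value = read_metric()
            observed.append(value)
            if value < exp.abort_below:   # Step 3: automated kill switch
                return "aborted"
    finally:
        exp.rollback()              # the fault is removed on every exit path
    # Step 5: the hypothesis holds if the metric has recovered by the last sample.
    return "hypothesis_held" if observed[-1] >= exp.steady_state_min else "hypothesis_failed"


# Usage with scripted readings standing in for a real metrics backend.
readings = iter([0.9995, 0.97, 0.998, 0.9993])
demo = Experiment(
    hypothesis="Terminating 1 of 5 payment pods keeps checkout success above 99.9%",
    steady_state_min=0.999,
    abort_below=0.95,
    inject=lambda: None,
    rollback=lambda: None,
)
print(run_experiment(demo, lambda: next(readings), samples=3))
```

Putting the rollback in a `finally` block is the design point worth keeping from this sketch: the fault is removed whether the experiment completes, aborts, or raises.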
System Flow
The chaos engineering lifecycle enforces a controlled, hypothesis-driven loop with automated abort mechanisms — any deviation from steady state halts the experiment immediately.
Real-World Examples (Indicative)
Netflix Chaos Monkey (part of the Simian Army, which Netflix described publicly in 2011; Chaos Monkey itself was open-sourced in 2012) runs as a scheduled service within Netflix's Spinnaker deployment platform. It terminates 1 random EC2 instance per auto-scaling group per day, but only during business hours (9am–3pm Pacific); the explicit constraint ensures engineers are awake and available to respond when the termination fires. The key metric Netflix monitors after termination is the time for the ASG to launch a replacement instance and for that instance to pass health checks and enter the load balancer rotation. Target: <60 seconds from termination to replacement ready. When Chaos Monkey exposed that a specific service took 8 minutes to recover (due to slow AMI baking), Netflix added the service to the Chaos Monkey exclusion list and fixed the slow startup before re-enabling terminations.
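The scheduling constraint described above can be sketched as a simple gate in front of a random pick. This is not Chaos Monkey's actual code: the function names are hypothetical, the datetime is assumed to already be Pacific wall-clock time, and restricting to weekdays is an added assumption (the text only says "business hours").

```python
import random
from datetime import datetime
from typing import List, Optional


def in_business_hours(pacific_now: datetime) -> bool:
    """True between 09:00 and 15:00 on weekdays (weekday rule is an assumption)."""
    return pacific_now.weekday() < 5 and 9 <= pacific_now.hour < 15


def pick_victim(
    instances: List[str],
    pacific_now: datetime,
    excluded: bool = False,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one instance of a group to terminate, or None if nothing should fire."""
    # Skip groups on the exclusion list, empty groups, and off-hours runs.
    if excluded or not instances or not in_business_hours(pacific_now):
        return None
    return (rng or random).choice(instances)


# Usage: a Wednesday morning run against a three-instance group.
group = ["i-aaa", "i-bbb", "i-ccc"]
print(pick_victim(group, datetime(2024, 1, 10, 10, 0)))   # one of the three ids
print(pick_victim(group, datetime(2024, 1, 10, 3, 0)))    # None: outside the window
```

The `excluded` flag mirrors the exclusion list mentioned above: a service with a known slow recovery is skipped until the fix lands.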
DoorDash uses Gremlin's commercial chaos platform for two categories of experiments. CPU spike attack (80% CPU for 60 seconds) on the delivery routing service revealed a synchronous Python route-optimization function that blocked the event loop under CPU pressure, causing 8-second P99 latency for the API that coordinates driver assignments. Network latency attack (add 100ms to all outbound TCP connections from the ETA calculation service) revealed 12 sequential synchronous API calls to restaurant systems — each adding 100ms latency meant 1.2 seconds added to the critical path during any network degradation event. Both were identified in staging via Gremlin before reaching production. Gremlin's shutdownOnExit=true safety valve ensures injected faults are automatically removed if the Gremlin agent process dies — preventing orphaned network delay rules from persisting after an aborted experiment.
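The latency arithmetic in the ETA example is worth making explicit. The helper below is a hypothetical illustration: it assumes every call picks up the same injected delay, and the concurrent case assumes the calls are independent enough to be issued in parallel, which the source does not claim.

```python
def added_latency_ms(calls: int, injected_ms: float, concurrent: bool = False) -> float:
    """Latency added to the critical path when each outbound call is delayed.

    Sequential calls pay the injected delay once per call; calls issued
    concurrently overlap, so the critical path pays it roughly once.
    """
    if calls <= 0:
        return 0.0
    return float(injected_ms) if concurrent else float(calls * injected_ms)


# 12 sequential calls with 100 ms injected on each: 1200 ms on the critical path.
print(added_latency_ms(12, 100))
# The same 12 calls issued concurrently (assumes they are independent): 100 ms.
print(added_latency_ms(12, 100, concurrent=True))
```

This is why a modest 100ms injection is so revealing: the added delay scales with the depth of the synchronous call chain, not with the size of the fault.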
Amazon's own teams use FIS for quarterly GameDay experiments where aws:eks:terminate-nodegroup-instances is applied to 20% of nodes in a production EKS cluster. The target: cluster autoscaler provisions replacement nodes and pods reschedule within 3 minutes. FIS integrates with CloudWatch Alarms as stop conditions: alarm_arn: arn:aws:cloudwatch:us-east-1:...:alarm:checkout-error-rate-5pct — if checkout errors exceed 5%, FIS aborts the experiment and emits a CloudWatch event that pages the on-call engineer. FIS experiment templates are stored as CloudFormation resources in the same repository as the infrastructure they test, allowing experiments to be version-controlled and reviewed in pull requests like any other infrastructure change.
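As a rough sketch of what such a template contains, the Python below builds a dictionary shaped like an FIS experiment template. The field names follow the FIS template JSON as understood here and should be checked against the current AWS documentation. The target and action labels, the description, and the `<...>` placeholder ARNs are hypothetical; the alarm ARN is copied from the text with its elision left intact. The source stores templates as CloudFormation resources, so this API-style shape is an illustration only.

```python
import json
from typing import Any, Dict


def build_fis_template(
    nodegroup_arn: str,
    alarm_arn: str,
    role_arn: str,
    termination_percentage: int = 20,
) -> Dict[str, Any]:
    """Build a dict shaped like an FIS experiment template (field names assumed)."""
    if not 1 <= termination_percentage <= 100:
        raise ValueError("termination_percentage must be between 1 and 100")
    return {
        "description": "GameDay: terminate a share of EKS node group instances",
        "roleArn": role_arn,
        "targets": {
            "gameday-nodegroup": {              # hypothetical target label
                "resourceType": "aws:eks:nodegroup",
                "resourceArns": [nodegroup_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "terminate-nodes": {                # hypothetical action label
                "actionId": "aws:eks:terminate-nodegroup-instances",
                "parameters": {
                    "instanceTerminationPercentage": str(termination_percentage)
                },
                "targets": {"Nodegroups": "gameday-nodegroup"},
            }
        },
        # The stop condition: FIS stops the experiment when this alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
    }


template = build_fis_template(
    nodegroup_arn="<nodegroup-arn>",            # placeholder, supply your own
    alarm_arn="arn:aws:cloudwatch:us-east-1:...:alarm:checkout-error-rate-5pct",
    role_arn="<fis-role-arn>",                  # placeholder, supply your own
)
print(json.dumps(template, indent=2))
```

Keeping the stop condition inside the template means the abort rule is reviewed in the same pull request as the fault it guards.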
Anti-Patterns
Running a chaos experiment when the service already has a missing circuit breaker or known single point of failure. The result is a predictable outage, not a discovery. Chaos engineering surfaces unknown weaknesses — known ones should be fixed before the experiment runs.
Starting an experiment without a CloudWatch stop condition, a Gremlin maxDuration limit, or equivalent abort mechanism. An experiment that cannot be automatically halted is an uncontrolled production incident waiting for human intervention at the worst possible moment.
Defining steady state as "CPU < 70%" instead of "checkout success rate > 99.5%". A fault that causes CPU to spike from 30% to 75% but has zero impact on user-facing success rates would incorrectly trigger the abort condition.
Injecting 100ms latency on all instances simultaneously as the first experiment. The right progression: start with 1 instance in staging, then 1 instance in production, then 10% of instances — each step only after the previous step's hypothesis was confirmed.
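The staged progression above can be sketched as a gate that only advances when the previous stage confirmed its hypothesis. The `Stage` type and the `hypothesis_held` callback are hypothetical names for illustration; in practice the callback would run a real experiment at that scope.

```python
from typing import Callable, List, NamedTuple


class Stage(NamedTuple):
    environment: str
    scope: str


# The progression named in the text, smallest blast radius first.
PROGRESSION = [
    Stage("staging", "1 instance"),
    Stage("production", "1 instance"),
    Stage("production", "10% of instances"),
]


def run_progression(
    stages: List[Stage], hypothesis_held: Callable[[Stage], bool]
) -> List[Stage]:
    """Run stages in order and stop at the first one whose hypothesis fails."""
    completed: List[Stage] = []
    for stage in stages:
        if not hypothesis_held(stage):
            break                   # fix the weakness before widening the scope
        completed.append(stage)
    return completed


# Usage: the hypothesis fails once the experiment reaches 10% of production.
done = run_progression(PROGRESSION, lambda s: s.scope != "10% of instances")
print([f"{s.environment}: {s.scope}" for s in done])
```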
Design Tradeoffs
| Dimension | Production Chaos | Staging Chaos |
|---|---|---|
| Realism | Real traffic patterns, production scale, and background noise | Often 10–100× smaller than production; synthetic traffic only |
| Customer risk | Possible if blast radius or kill switch is misconfigured | Zero customer impact regardless of experiment outcome |
| Failure discovery | Finds emergent failures from scale and real traffic interaction | Finds component failures and configuration bugs |
| Required maturity | Mature observability, SLOs, and organizational buy-in required | Suitable for teams in early reliability investment stages |
Best Practices
Automate the kill switch, for example: if checkout_success_rate < 95% for 60 seconds, abort the experiment. Manual kill switches fail during incidents when the engineer is distracted diagnosing the fault.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Operating complex distributed microservice architectures with circuit breakers and failover mechanisms that have never been triggered in production | The system is highly unstable, lacks basic monitoring, or has known single points of failure that should be fixed first |
| Validating that automated recovery mechanisms work at production scale — auto-scaling, circuit breaker tripping, failover | Building simple monolithic applications where failure modes are obvious and recovery is manual |
| Preparing engineering teams for on-call rotations by simulating real-world incident scenarios and validating runbooks | Highly regulated environments where any production fault injection requires extensive change-management approval processes |