Circuit Breaker Pattern
Circuit breakers prevent cascading failures by fast-failing calls to degraded downstream dependencies — a single slow service must not be allowed to exhaust the thread pool of every upstream caller.
- A circuit breaker is a stateful proxy around remote calls that stops sending traffic to a degraded dependency — fast-failing with a fallback instead of blocking threads waiting for timeouts.
- Netflix ran 100+ circuit breakers across their microservices using Hystrix. The defaults that became industry standard: 20 requests minimum volume, 50% failure rate threshold, 5-second open window before attempting half-open recovery.
- The half-open state lets a configurable number of probe requests through (Resilience4j's permittedNumberOfCallsInHalfOpenState, default 10). Once the probes complete, the breaker closes if their failure rate is below the threshold and re-opens otherwise.
- Only count server errors (5xx, timeouts) toward the failure rate. HTTP 400/401/404 are client-side bugs — counting them trips the breaker when the downstream service is healthy.
- Emit a metric on every state transition (Closed→Open, Open→Half-Open, Half-Open→Closed). A breaker that trips and closes silently is operationally invisible.
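The failure-classification rule in the list above can be sketched as a small predicate. This is a minimal illustration, not any library's API: the function name `counts_as_failure` and the sample outcomes are hypothetical, and `None` is assumed to stand for a call that produced no HTTP response at all.

```python
from typing import Optional

def counts_as_failure(status: Optional[int], timed_out: bool = False) -> bool:
    """Return True if an outcome should count toward the breaker's failure rate."""
    if timed_out or status is None:
        # Timeout or connection error: the dependency never answered.
        return True
    # Only server errors count; 4xx responses mean the dependency is working.
    return 500 <= status <= 599

# Hypothetical outcomes from eight recent calls (None = no response received).
outcomes = [200, 404, 401, 503, None, 200, 400, 500]
failures = sum(counts_as_failure(s) for s in outcomes)
failure_rate = 100 * failures / len(outcomes)
print(f"{failures} of {len(outcomes)} calls count as failures ({failure_rate:.1f}%)")
```

Had the three 4xx responses been counted, the same window would show a 75% failure rate and trip a 50% threshold even though the dependency answered correctly.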
The Problem
An e-commerce platform's recommendation service starts timing out at 3 seconds due to a slow database query. Every checkout page request calls the recommendation service — 300 concurrent checkout threads are now blocked waiting for 3-second timeouts. The thread pool is exhausted in 30 seconds. New checkout requests queue, then fail with connection timeout errors. The recommendation service — a non-critical, cosmetic feature — has taken down checkout revenue. The root failure is that the upstream caller had no mechanism to stop sending traffic to a dependency it already knew was failing.
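To see why latency alone exhausts a thread pool, Little's law gives the average number of in-flight calls as arrival rate times the time each call holds a thread. The sketch below assumes a hypothetical arrival rate of 100 checkout requests per second and a healthy latency of 50 ms; neither figure comes from the scenario, only the 3-second timeout and the 300-thread pool do.

```python
def threads_in_use(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average in-flight calls = arrival rate x time per call."""
    return arrival_rate_per_s * latency_s

RATE = 100.0      # assumed checkout requests per second (hypothetical)
POOL_SIZE = 300   # thread pool size from the scenario above

healthy = threads_in_use(RATE, 0.05)   # assumed 50 ms healthy latency
degraded = threads_in_use(RATE, 3.0)   # every call waits out the 3-second timeout

print(f"healthy:  ~{healthy:.0f} threads busy")
print(f"degraded: ~{degraded:.0f} threads busy, pool size is {POOL_SIZE}")
```

Under these assumptions, the same traffic that needed a handful of threads now needs the entire pool, with no change in request volume.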
Core System Idea
A circuit breaker wraps remote calls in a stateful proxy that tracks execution outcomes within a sliding window. It has three states: Closed (normal operation: all calls pass through and outcomes are tracked), Open (breaker tripped: every call immediately returns a fallback response and no network call is made), and Half-Open (recovery probe: a small number of test calls pass through; if enough succeed the breaker closes, otherwise it re-opens). The breaker trips from Closed to Open when the failure rate or slow-call rate crosses a threshold over a minimum call volume. Minimum volume is critical: without it, a single failure out of one call during low traffic produces a 100% failure rate and opens the breaker against a healthy dependency. The Hystrix defaults are a 20-request minimum volume, a 50% error threshold, and a 5-second sleep window in the Open state. Resilience4j's out-of-the-box defaults are more conservative: a count-based sliding window of 100 calls, a 50% failure threshold, a 100-call minimum before evaluation, a 60-second wait in the Open state, and 10 permitted calls in Half-Open. Either set of numbers is a starting point; production thresholds should be derived from each dependency's SLA and observed error-rate baseline.
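The three states and their transitions can be sketched in a few dozen lines. This is a simplified, single-threaded illustration, not how Hystrix or Resilience4j are implemented: the class name, parameter names, and values are assumptions chosen for the example, every exception is treated as a failure, and the clock is injectable only so the demo can skip the wait.

```python
import time
from collections import deque

class CircuitBreaker:
    def __init__(self, window=10, min_calls=10, failure_threshold=50.0,
                 open_seconds=5.0, probes=3, clock=time.monotonic):
        self.window = deque(maxlen=window)   # True = failure, False = success
        self.min_calls = min_calls
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.probes = probes
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_results = []

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                return fallback()            # fast-fail: no remote call is made
            self.state, self.probe_results = "half_open", []
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            return fallback()
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "half_open":
            self.probe_results.append(failed)
            if len(self.probe_results) >= self.probes:
                rate = 100.0 * sum(self.probe_results) / len(self.probe_results)
                if rate >= self.failure_threshold:
                    self._open()             # probes failed: back to open
                else:
                    self.state = "closed"    # dependency recovered
                    self.window.clear()
            return
        self.window.append(failed)
        if len(self.window) >= self.min_calls:
            rate = 100.0 * sum(self.window) / len(self.window)
            if rate >= self.failure_threshold:
                self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()

# Demo with a fake clock so the open window can be skipped instantly.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])

def failing_call():
    raise TimeoutError("recommendations timed out")

for _ in range(10):
    breaker.call(failing_call, fallback=lambda: [])
state_after_failures = breaker.state        # "open"

now[0] = 6.0                                # past the 5-second open window
for _ in range(3):
    breaker.call(lambda: ["item"], fallback=lambda: [])
state_after_probes = breaker.state          # "closed"
```

A production breaker would also need locking, a cap on concurrent half-open probes, and the failure classification discussed earlier; this sketch omits all three to keep the state machine visible.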
System Flow
Breaker routes traffic based on state: Closed passes all calls, Open returns fallback immediately, Half-Open sends probes to test recovery.
Real-World Examples (Indicative)
Netflix pioneered the circuit breaker pattern in microservices with Hystrix, running 100+ breakers across their services. Each breaker was configured per dependency with a dedicated thread pool: a recommendation service thread pool of size 10 meant a maximum of 10 concurrent in-flight recommendation calls, and the 11th was rejected immediately. Hystrix is now in maintenance mode, and its maintainers recommend Resilience4j for new projects. A representative Resilience4j configuration: slidingWindowSize=10, failureRateThreshold=50, waitDurationInOpenState=30s, permittedNumberOfCallsInHalfOpenState=3. This means: once 10 calls are recorded, if 5 or more of them failed, the breaker opens for 30 seconds and then lets 3 probe calls through. If at least 2 of the 3 succeed (a failure rate below 50%), it closes; otherwise it re-opens.
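The "11th call is rejected" behaviour is a bulkhead. The sketch below approximates it with a non-blocking semaphore rather than a dedicated thread pool, which is a deliberate simplification; the `Bulkhead` class and the function names are hypothetical, not Hystrix API.

```python
import threading

class Bulkhead:
    """Caps concurrent in-flight calls; excess calls get the fallback at once."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback()          # no free slot: reject without waiting
        try:
            return fn()
        finally:
            self._slots.release()

bulkhead = Bulkhead(max_concurrent=10)
started = threading.Barrier(11)        # 10 workers plus the main thread
release = threading.Event()

def slow_recommendation_call():
    started.wait()                     # signal that this call is in flight
    release.wait()                     # simulate a hung dependency
    return "recommendations"

workers = [
    threading.Thread(target=bulkhead.call,
                     args=(slow_recommendation_call, lambda: "rejected"))
    for _ in range(10)
]
for w in workers:
    w.start()
started.wait()                         # all 10 slots are now held

eleventh = bulkhead.call(lambda: "recommendations", lambda: "rejected")

release.set()                          # let the hung calls finish
for w in workers:
    w.join()
after_recovery = bulkhead.call(lambda: "recommendations", lambda: "rejected")
```

The caller of the 11th request gets an answer immediately instead of joining a queue behind ten hung calls, which is what keeps the damage confined to one dependency.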
Envoy implements two circuit breaker dimensions at the cluster level: max_connections (default: 1024 — TCP-level; new connections rejected once reached) and max_pending_requests (default: 1024 — HTTP-level; queued requests rejected when queue is full). Lyft, which created Envoy, enforces max_connections=100 per upstream cluster in production. During a database failover event, Envoy's circuit breaker prevented 50,000+ requests from queuing against an unavailable DB replica — without it, the queue would have grown until memory exhaustion.
Zalando (European e-commerce) uses Resilience4j circuit breakers on all inter-service calls. Their configuration for payment provider integrations: slowCallDurationThreshold=2s, slowCallRateThreshold=50% — this trips the breaker if 50% of calls take more than 2 seconds, even if they succeed. Slow calls that succeed are as dangerous as failing calls for thread pool exhaustion, because threads are held open for the full duration. This slow-call threshold is a Resilience4j feature that Hystrix lacked.
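The slow-call rule can be illustrated with plain arithmetic over recorded call durations. This is a sketch of the idea, not Resilience4j code: the function name and the sample durations are made up, and the thresholds mirror the 2-second and 50% values quoted above.

```python
def slow_call_rate(durations_s, slow_threshold_s=2.0):
    """Percentage of calls that took longer than the slow-call threshold."""
    slow = sum(1 for d in durations_s if d > slow_threshold_s)
    return 100.0 * slow / len(durations_s)

# Hypothetical durations (seconds) of ten calls that all returned successfully.
durations = [0.3, 2.5, 0.4, 3.1, 2.2, 0.5, 2.8, 0.2, 2.6, 0.3]

rate = slow_call_rate(durations)
should_trip = rate >= 50.0             # slowCallRateThreshold=50%
print(f"slow-call rate {rate:.0f}%, trip breaker: {should_trip}")
```

None of these calls failed, so a breaker watching only the error rate would report 0% and stay closed while half the threads sit blocked for over two seconds.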
Anti-Patterns
Setting failureRateThreshold=50% without a minimum volume means one failure during low-traffic periods trips the breaker. A single timeout at 3am should not open the circuit breaker for a healthy service.
Returning stale cached data on fallback without emitting a circuit_breaker.open metric. The on-call engineer sees normal response rates in dashboards while users are receiving stale data. Every fallback must emit a metric.
Including 400/401/422 responses in the failure rate. These are client-side errors — the downstream service is working correctly. Counting them trips the breaker when the service is healthy.
Setting waitDurationInOpenState=300s — a 5-minute recovery window. The downstream service recovers in 30 seconds but the breaker stays open for 5 minutes, unnecessarily degrading user experience.
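The silent-fallback anti-pattern above is avoided by making the fallback path itself emit a signal. In this sketch a `Counter` stands in for a real metrics client, and the function and metric names are illustrative assumptions.

```python
from collections import Counter

metrics = Counter()   # stand-in for a real metrics client

def get_recommendations(fetch, cached, breaker_open: bool, name: str):
    """Serve cached data when the breaker is open, and always record that we did."""
    if breaker_open:
        metrics[f"circuit_breaker.fallback{{name={name}}}"] += 1
        return cached
    return fetch()

live = get_recommendations(lambda: ["fresh"], ["stale"], False, "recommendations")
degraded = get_recommendations(lambda: ["fresh"], ["stale"], True, "recommendations")
print(live, degraded, dict(metrics))
```

With the counter in place, a dashboard can show fallback volume next to response rate, so stale data being served is visible even when latency and status codes look normal.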
Design Tradeoffs
| Dimension | Client-Side Breaker | Service Mesh Breaker |
|---|---|---|
| Latency overhead | Negligible (in-process) | 1–5ms (sidecar proxy hop) |
| Language support | Library required per language (Resilience4j, pybreaker) | Language-agnostic declarative YAML |
| Fallback logic | Full application-level fallbacks (cached data, defaults) | Static HTTP response codes only |
| Visibility | Per-service metrics in app instrumentation | Centralized mesh observability |
Best Practices
- Configure slowCallDurationThreshold in addition to failure rate. Slow calls that succeed still hold threads, so a breaker that only trips on errors misses the latency-degradation failure mode.
- Emit a state metric such as breaker.state{name=payment-service, state=open}. Alert when any breaker opens; this is a production signal, not background noise.
- A null return or an empty response is not a fallback; it is a silent failure waiting to cause a NullPointerException downstream.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Calling external third-party APIs or services with unpredictable SLAs | Making in-process calls or local database connections where network failure is impossible |
| Downstream services are prone to latency spikes that exhaust thread pools | Non-idempotent writes where a fast-fail response would leave data in an inconsistent state |
| Preventing a non-critical feature (recommendations, analytics) from taking down critical paths | The upstream call is the only viable path — there is no meaningful fallback to provide |