
Circuit Breaker Pattern

Circuit breakers prevent cascading failures by fast-failing calls to degraded downstream dependencies — a single slow service must not be allowed to exhaust the thread pool of every upstream caller.

TL;DR
  • A circuit breaker is a stateful proxy around remote calls that stops sending traffic to a degraded dependency — fast-failing with a fallback instead of blocking threads waiting for timeouts.
  • Netflix ran 100+ circuit breakers across their microservices using Hystrix. The defaults that became industry standard: 20 requests minimum volume, 50% failure rate threshold, 5-second open window before attempting half-open recovery.
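  • The half-open state lets a configurable number of probe requests through (Resilience4j's permittedNumberOfCallsInHalfOpenState, default 10; the example configuration later in this article uses 3). Once the probes complete, the breaker closes if their failure rate is below the threshold and re-opens otherwise.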
  • Only count server errors (5xx, timeouts) toward the failure rate. HTTP 400/401/404 are client-side errors; counting them trips the breaker while the downstream service is healthy.
  • Emit a metric on every state transition (Closed→Open, Open→Half-Open, Half-Open→Closed, Half-Open→Open). A breaker that trips and closes silently is operationally invisible.

The Problem

An e-commerce platform's recommendation service starts timing out at 3 seconds due to a slow database query. Every checkout page request calls the recommendation service — 300 concurrent checkout threads are now blocked waiting for 3-second timeouts. The thread pool is exhausted in 30 seconds. New checkout requests queue, then fail with connection timeout errors. The recommendation service — a non-critical, cosmetic feature — has taken down checkout revenue. The root failure is that the upstream caller had no mechanism to stop sending traffic to a dependency it already knew was failing.
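
The arithmetic behind this exhaustion can be sketched with Little's law (average in-flight requests = arrival rate × time each request is held). The arrival rate of 100 requests/s and the 50 ms healthy latency below are assumptions chosen for illustration; only the 3-second timeout and the 300 threads come from the scenario above.

```python
def threads_held(arrival_rate_per_s: float, hold_time_s: float) -> float:
    """Little's law: average number of threads busy at any moment."""
    return arrival_rate_per_s * hold_time_s


def max_throughput(pool_size: int, hold_time_s: float) -> float:
    """Upper bound on requests/s that a fixed-size pool can complete."""
    return pool_size / hold_time_s


# Assumed traffic: 100 checkout requests per second (hypothetical).
healthy = threads_held(100, 0.05)   # recommendation call answers in ~50 ms
degraded = threads_held(100, 3.0)   # every call waits out the 3 s timeout

print(f"threads busy when healthy:  {healthy:.0f}")
print(f"threads busy when degraded: {degraded:.0f}")
print(f"max req/s for a 300-thread pool at 3 s per call: {max_throughput(300, 3.0):.0f}")
```

Under these assumptions the same traffic that needed a handful of threads now needs the whole pool, so any further arrivals have to queue.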

Core System Idea

A circuit breaker wraps remote calls in a stateful proxy that tracks execution outcomes within a sliding window. It has three states:
  • Closed: normal operation. All calls pass through and their outcomes are tracked.
  • Open: the breaker has tripped. All calls immediately return a fallback response and no network call is made.
  • Half-Open: recovery probe. A small number of test calls pass through; if enough of them succeed the breaker closes, otherwise it re-opens.

The breaker trips from Closed to Open when the failure rate or slow-call rate crosses a threshold over a minimum call volume. Minimum volume is critical: without it, a single failure out of one call during low traffic triggers an unnecessary outage. Library defaults differ. Hystrix used a 10-second rolling window, a 20-request minimum, a 50% error threshold, and a 5-second sleep window before letting a single probe through. Resilience4j defaults to a count-based sliding window of 100 calls, minimumNumberOfCalls=100, failureRateThreshold=50, waitDurationInOpenState=60s, and permittedNumberOfCallsInHalfOpenState=10. These numbers are a starting point; production thresholds should be derived from each dependency's SLA and observed error-rate baseline.
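
The state machine described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not the implementation of Hystrix or Resilience4j: the class name, parameter names, and the choice to treat every exception as a failure are assumptions made for the example, and a production breaker would also need locking and a slow-call check.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Count-based breaker sketch. Keep min_calls <= window_size."""

    def __init__(self, window_size=10, min_calls=10, failure_rate_threshold=0.5,
                 open_wait_s=5.0, half_open_probes=3, clock=time.monotonic):
        self.min_calls = min_calls
        self.failure_rate_threshold = failure_rate_threshold
        self.open_wait_s = open_wait_s
        self.half_open_probes = half_open_probes
        self.clock = clock                          # injectable for testing
        self.state = State.CLOSED
        self.outcomes = deque(maxlen=window_size)   # True means "failed"
        self.probe_results = []
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_wait_s:
                return fallback()                   # fast-fail, no remote call
            self.state = State.HALF_OPEN            # wait elapsed, start probing
            self.probe_results = []
        try:
            result = fn()
        except Exception:                           # simplification: any error counts
            self._record(failed=True)
            return fallback()
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state is State.HALF_OPEN:
            self.probe_results.append(failed)
            if len(self.probe_results) >= self.half_open_probes:
                rate = sum(self.probe_results) / len(self.probe_results)
                if rate >= self.failure_rate_threshold:
                    self._trip()
                else:
                    self.state = State.CLOSED
                    self.outcomes.clear()
            return
        self.outcomes.append(failed)
        if len(self.outcomes) >= self.min_calls:    # minimum volume guard
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate >= self.failure_rate_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.outcomes.clear()
```

A caller would wrap each remote call as `breaker.call(fetch_recommendations, default_recommendations)`, where both function names are placeholders. Injecting `clock` keeps the Open-state wait testable without sleeping.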

System Flow

flowchart TD
    A["Client Request"] --> B{"Breaker State?"}
    B -- "Closed" --> C["Execute Remote Call"]
    C -- "Success" --> D["Return Response"]
    C -- "Failure or Timeout" --> E["Increment Failure Count"]
    E --> F{"Threshold Exceeded?"}
    F -- "Yes" --> G["Trip to OPEN"]
    F -- "No" --> D
    B -- "Open" --> H["Return Fallback"]
    B -- "Half-Open" --> I["Send Probe Request"]
    I -- "Success" --> J["Close Breaker"]
    I -- "Failure" --> G

Breaker routes traffic based on state: Closed passes all calls, Open returns fallback immediately, Half-Open sends probes to test recovery.

Real-World Examples (Indicative)

Netflix Hystrix / Resilience4j

Netflix pioneered the circuit breaker pattern in microservices with Hystrix, running 100+ breakers across their services. Each breaker was configured per dependency with a dedicated thread pool: a recommendation service thread pool of size 10 meant a maximum of 10 concurrent in-flight recommendation calls, and the 11th was immediately rejected. Hystrix is now in maintenance mode, and Resilience4j is the commonly recommended replacement. An example Resilience4j configuration (not the library defaults): slidingWindowSize=10, failureRateThreshold=50, waitDurationInOpenState=30s, permittedNumberOfCallsInHalfOpenState=3. This means: once 10 calls are recorded, if 5 or more failed, the breaker opens for 30 seconds, then lets 3 probes through. If at least 2 of them succeed, it closes; otherwise it re-opens.
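
The threshold arithmetic for that configuration can be checked with a small helper. This is a sketch, assuming the breaker trips when the failure rate is greater than or equal to the threshold and that the minimum call count equals the number of calls being evaluated; `should_open` is a hypothetical name, not a Resilience4j API.

```python
def should_open(failures: int, calls: int, threshold_pct: float, min_calls: int) -> bool:
    """Return True when the failure rate over `calls` reaches the threshold."""
    if calls < min_calls:
        return False            # not enough volume to judge yet
    return 100.0 * failures / calls >= threshold_pct


# Closed state: slidingWindowSize=10, failureRateThreshold=50
print(should_open(failures=4, calls=10, threshold_pct=50, min_calls=10))  # False (40%)
print(should_open(failures=5, calls=10, threshold_pct=50, min_calls=10))  # True  (50%)

# Half-Open state: 3 probes, same threshold
print(should_open(failures=1, calls=3, threshold_pct=50, min_calls=3))    # False, breaker closes
print(should_open(failures=2, calls=3, threshold_pct=50, min_calls=3))    # True, breaker re-opens
```

With 3 probes and a 50% threshold, one failed probe is tolerated and two are not, which is why "at least 2 of 3 succeed" is the closing condition.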

Envoy's upstream circuit breaking

Envoy enforces circuit-breaking thresholds at the cluster level, including max_connections (default: 1024; connection-level, Envoy opens no further connections to the upstream cluster once the limit is reached) and max_pending_requests (default: 1024; request-level, requests are rejected once the pending queue is full). Lyft, which created Envoy, enforces max_connections=100 per upstream cluster in production. During a database failover event, Envoy's circuit breaker prevented 50,000+ requests from queuing against an unavailable DB replica; without it, the queue would have grown until memory exhaustion.
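
Limit-based breaking of this kind (a cap on in-flight work, with immediate rejection beyond it) can be illustrated in-process with a non-blocking semaphore. This is a sketch of the idea only, not Envoy's or Hystrix's implementation; `ConcurrencyLimit` and its method names are hypothetical.

```python
import threading


class ConcurrencyLimit:
    """Reject calls immediately once max_in_flight calls are already running."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, on_reject):
        # blocking=False: never queue behind a slow dependency
        if not self._slots.acquire(blocking=False):
            return on_reject()
        try:
            return fn()
        finally:
            self._slots.release()   # free the slot even if fn raises


# A limit of 10 mirrors the thread-pool example: the 11th concurrent call is rejected.
recommendations_limit = ConcurrencyLimit(max_in_flight=10)
print(recommendations_limit.call(lambda: "ok", lambda: "rejected"))
```

The key design choice is rejecting instead of waiting: a blocking acquire would rebuild the very queue the limit is meant to prevent.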

Resilience4j at Zalando

Zalando (European e-commerce) uses Resilience4j circuit breakers on all inter-service calls. Their configuration for payment provider integrations: slowCallDurationThreshold=2s, slowCallRateThreshold=50% — this trips the breaker if 50% of calls take more than 2 seconds, even if they succeed. Slow calls that succeed are as dangerous as failing calls for thread pool exhaustion, because threads are held open for the full duration. This slow-call threshold is a Resilience4j feature that Hystrix lacked.
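
The slow-call check can be sketched as a rate over recorded durations. The sample durations are invented for illustration, and `slow_call_rate` is a hypothetical helper; the sketch assumes a call is slow when its duration exceeds the threshold and that the breaker trips when the rate is greater than or equal to 50%.

```python
def slow_call_rate(durations_s, slow_threshold_s=2.0):
    """Percentage of calls that took longer than slow_threshold_s."""
    if not durations_s:
        return 0.0
    slow = sum(1 for d in durations_s if d > slow_threshold_s)
    return 100.0 * slow / len(durations_s)


# Ten successful calls (hypothetical timings in seconds); five exceed 2 s.
durations = [0.2, 2.5, 3.1, 0.3, 2.2, 0.4, 2.8, 0.2, 2.9, 0.3]
rate = slow_call_rate(durations)
print(rate, rate >= 50.0)
```

None of these calls failed, so a breaker that only watches the failure rate would stay closed while half the threads are held for more than 2 seconds.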

Anti-Patterns

No minimum call volume

Setting failureRateThreshold=50% without a minimum volume means one failure during low-traffic periods trips the breaker. A single timeout at 3am should not open the circuit breaker for a healthy service.

Silent fallbacks without metrics

Returning stale cached data on fallback without emitting a circuit_breaker.open metric. The on-call engineer sees normal response rates in dashboards while users are receiving stale data. Every fallback must emit a metric.

Wrapping non-transient errors

Including 400/401/422 responses in the failure rate. These are client-side errors — the downstream service is working correctly. Counting them trips the breaker when the service is healthy.
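
A classifier along these lines keeps client errors out of the failure rate. This is a minimal sketch under the rule stated above (5xx and timeouts count, 4xx does not); `counts_as_failure` is a hypothetical name, and treating a missing status as a failure is an assumption for calls that never produced a response.

```python
from typing import Optional


def counts_as_failure(status: Optional[int], timed_out: bool = False) -> bool:
    """Decide whether a call outcome should count toward the breaker's failure rate."""
    if timed_out or status is None:
        return True                   # no usable response from the dependency
    return 500 <= status <= 599       # server-side errors only


for status in (200, 400, 401, 404, 422, 500, 503):
    print(status, counts_as_failure(status))
```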

Overly long open window

Setting waitDurationInOpenState=300s — a 5-minute recovery window. The downstream service recovers in 30 seconds but the breaker stays open for 5 minutes, unnecessarily degrading user experience.

Design Tradeoffs

| Dimension | Client-Side Breaker | Service Mesh Breaker |
| --- | --- | --- |
| Latency overhead | Zero (in-process) | 1–5ms (sidecar proxy hop) |
| Language support | Library required per language (Resilience4j, pybreaker) | Language-agnostic declarative YAML |
| Fallback logic | Full application-level fallbacks (cached data, defaults) | Static HTTP response codes only |
| Visibility | Per-service metrics in app instrumentation | Centralized mesh observability |

Best Practices

  • Set a minimum call volume (20–100 requests) before the breaker evaluates thresholds. This prevents low-traffic false positives during off-hours.
  • Configure a slowCallDurationThreshold in addition to failure rate. Slow calls that succeed still hold threads, so a breaker that only trips on errors misses the latency degradation failure mode.
  • Emit state transition events as metrics: breaker.state{name=payment-service, state=open}. Alert when any breaker opens; this is a production signal, not background noise.
  • Define fallbacks explicitly for every circuit-broken call. A null return or an empty response is not a fallback; it's a silent failure waiting to cause a NullPointerException downstream.
  • Use the Half-Open state's probe count conservatively: 1–3 probes. Sending too many probes under a recovering service adds load at the worst moment.
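
The metric and fallback practices above can be sketched together. The `Counter` below is a stand-in for a real metrics client, and the function names, the breaker.fallback metric name, and the fallback payload are assumptions for illustration; only the breaker.state{...} label format comes from the text.

```python
from collections import Counter

metrics = Counter()   # stand-in for a real metrics client (assumption)


def on_state_change(name: str, old_state: str, new_state: str) -> None:
    """Record every transition so an open breaker is visible on dashboards."""
    metrics[f"breaker.state{{name={name}, state={new_state}}}"] += 1
    metrics[f"breaker.transition{{name={name}, from={old_state}, to={new_state}}}"] += 1


def fallback_recommendations() -> dict:
    """Explicit fallback: a labeled default payload, never None or an empty body."""
    metrics["breaker.fallback{name=recommendation-service}"] += 1
    return {"items": ["bestseller-1", "bestseller-2"], "source": "static-default"}


on_state_change("payment-service", "closed", "open")
print(fallback_recommendations())
print(dict(metrics))
```

Tagging the fallback payload with its source lets downstream code and dashboards distinguish degraded responses from fresh ones.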

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Calling external third-party APIs or services with unpredictable SLAs | Making in-process calls or local database connections where network failure is impossible |
| Downstream services are prone to latency spikes that exhaust thread pools | Non-idempotent writes where a fast-fail response would leave data in an inconsistent state |
| Preventing a non-critical feature (recommendations, analytics) from taking down critical paths | The upstream call is the only viable path and there is no meaningful fallback to provide |