Retry Strategies

Retries mask transient failures but become self-inflicted denial-of-service attacks without jitter and retry budgets — at Twitter's scale, 20% retry amplification on 500K RPS adds 100K extra requests per second to a recovering service.

TL;DR
  • Retries mask transient failures but become self-inflicted denial-of-service attacks without jitter. Every client retrying at exactly t+1s, t+2s, t+4s creates synchronized traffic spikes that prevent a recovering service from stabilizing.
  • Full jitter: random(0, min(max_backoff, base * 2^attempt)). Randomizing the entire range, rather than adding a little noise to a fixed delay, spreads retries most evenly and is the simplest reliable way to break synchronized retry waves.
  • Retry budgets cap the percentage of traffic that can be retries. Twitter's Finagle: max 20% of requests can be retries. At 500K RPS, 20% amplification = 100K extra requests — enough to re-saturate a service that just recovered.
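  • AWS SDKs apply exponential backoff with jitter by default. Standard retry mode caps backoff at 20s and allows 3 total attempts, though the base delay and other defaults vary by SDK language and version. Avoid overriding these to be more aggressive in production.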
  • Only retry idempotent operations and explicit transient error codes (503, 429, network timeout). A 400 Bad Request or 401 Unauthorized will never succeed on retry — retrying them wastes resources.

The Problem

A database replica failover takes 30 seconds. Every service that calls the database retries immediately on connection failure — with 100 client threads each retrying every 500ms, the recovering database receives 200 requests/second during its first seconds of recovery, before it has accepted any connections. It never recovers; the retry storm keeps it saturated. Manual intervention is required. This is the canonical retry storm: a transient failure that would have self-healed in 30 seconds becomes a 2-hour incident because clients applied maximum retry load at exactly the wrong moment.

Core System Idea

A resilient retry strategy has four components:
(1) Transient error classification: only retry errors that are inherently transient, such as 503 Service Unavailable, 429 Too Many Requests, connection timeouts, and network resets. Never retry 400 Bad Request, 401 Unauthorized, or 422 Unprocessable Entity; these will not succeed on retry.
(2) Exponential backoff: double the delay between retries (100ms, 200ms, 400ms, 800ms), capped at a max (e.g., 20s). This gives the downstream service increasing recovery time.
(3) Full jitter: randomize the entire backoff range with random(0, current_cap). Equal jitter (keeping half of the backoff fixed and randomizing only the other half) narrows the spread and leaves clients partially clustered; full jitter spreads them across the whole window.
(4) Retry budgets: cap the ratio of retried requests to original requests (e.g., 10–20%). When a downstream service is in sustained failure, retry budgets shut off retries before they amplify into a storm.
Circuit breakers and retry budgets are complementary: budgets control per-client amplification, while circuit breakers stop calls entirely when the failure rate crosses a threshold.
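The difference between the two jitter variants is easiest to see side by side. The following is a minimal sketch, not a library API: the function names, the 100ms base, and the 20s cap are illustrative assumptions taken from the figures in this section.

```python
import random


def exponential_cap(attempt: int, base: float = 0.1, max_backoff: float = 20.0) -> float:
    """Upper bound (seconds) for the delay before retry number `attempt` (0-based)."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    return min(max_backoff, base * (2 ** min(attempt, 32)))


def full_jitter_delay(attempt, base=0.1, max_backoff=20.0, rng=random):
    """Full jitter: pick uniformly from the whole range [0, cap]."""
    return rng.uniform(0.0, exponential_cap(attempt, base, max_backoff))


def equal_jitter_delay(attempt, base=0.1, max_backoff=20.0, rng=random):
    """Equal jitter: keep half of the cap fixed and randomize the other half."""
    cap = exponential_cap(attempt, base, max_backoff)
    return cap / 2 + rng.uniform(0.0, cap / 2)
```

With equal jitter no client ever retries sooner than half the cap, so retries bunch into the second half of each window; full jitter uses the whole window.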

System Flow

flowchart TD
    A["Initiate Request"] --> B["Execute Call"]
    B -- "Success" --> C["Return Response"]
    B -- "Failure" --> D{"Transient Error?"}
    D -- "No" --> E["Return Error"]
    D -- "Yes" --> F{"Retry Budget Available?"}
    F -- "No" --> E
    F -- "Yes" --> G["Calculate Backoff + Jitter"]
    G --> H["Wait for Delay"]
    H --> B

Retry loop checks error transience and budget before computing jittered backoff — non-transient errors and budget exhaustion fail immediately.
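The flow can be sketched as a small loop. This is an illustration under stated assumptions rather than a production client: `CallError`, `call_with_retries`, and the `budget_available` callback are hypothetical names, and `sleep` and `rng` are injectable only so the loop can be exercised without real waiting.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}  # explicit allowlist; everything else fails fast


class CallError(Exception):
    """Hypothetical error type carrying an HTTP-like status code."""

    def __init__(self, status: int):
        super().__init__(f"status {status}")
        self.status = status


def call_with_retries(call, budget_available, max_retries=3, base=0.1,
                      max_backoff=20.0, sleep=time.sleep, rng=random):
    """Run `call`, retrying transient failures with full-jitter backoff."""
    attempt = 0
    while True:
        try:
            return call()                      # Execute Call -> Return Response
        except CallError as err:
            if err.status not in RETRYABLE_STATUS:
                raise                          # Transient Error? No -> Return Error
            if attempt >= max_retries or not budget_available():
                raise                          # Retry Budget Available? No -> Return Error
            cap = min(max_backoff, base * 2 ** attempt)
            sleep(rng.uniform(0.0, cap))       # Calculate Backoff + Jitter, then wait
            attempt += 1
```

A caller might pass `budget_available=lambda: True` while experimenting and swap in a real budget check later.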

Real-World Examples (Indicative)

AWS SDK v2 retry behavior

In "standard" retry mode, AWS SDKs make up to 3 attempts in total (the initial call plus 2 retries) with exponential backoff and jitter, capped at a max delay of 20s. The base delay differs between SDK languages and versions; taking a 1s base, the formula is random(0, min(20s, 1s * 2^attempt)). For attempt 1: 0–2s delay; attempt 2: 0–4s; a third retry, where configured: 0–8s. This breaks synchronized retries: 1,000 clients all hitting a brief AWS control plane blip will spread their retries across a window of several seconds instead of all retrying at the same instant. The SDK also implements retry quotas (500 retry tokens per client, costing 5 per retry) to prevent aggressive retry storms from a single SDK instance.
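The quota arithmetic can be made concrete with a small token bucket. This is a sketch loosely modelled on the figures quoted here (500 tokens, 5 per retry, 1s base, 20s cap), not the SDK's actual implementation; `RetryQuota`, `delay_window`, and the refund-on-success rule are assumptions for illustration.

```python
class RetryQuota:
    """Per-client token bucket that gates retries (illustrative only)."""

    def __init__(self, capacity: int = 500, retry_cost: int = 5):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost

    def try_acquire(self) -> bool:
        """Spend tokens for one retry; refuse when the bucket is too low."""
        if self.tokens < self.retry_cost:
            return False
        self.tokens -= self.retry_cost
        return True

    def release(self, amount: int = 1) -> None:
        """Assumed refund on success, never exceeding the capacity."""
        self.tokens = min(self.capacity, self.tokens + amount)


def delay_window(attempt: int, base: float = 1.0, max_backoff: float = 20.0) -> float:
    """Upper bound of the full-jitter window for a given attempt, in seconds."""
    return min(max_backoff, base * 2 ** attempt)
```

With these numbers a client can issue 500 / 5 = 100 retries in a row before the quota refuses further retries until successes refill it.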

gRPC retry policy

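gRPC defines retries in the service config, under a method's retryPolicy: maxAttempts: 3, initialBackoff: "0.1s", maxBackoff: "1s", backoffMultiplier: 2, retryableStatusCodes: ["UNAVAILABLE"]. A critical gRPC detail: a configured retry only fires while the RPC is still uncommitted, which in practice means before the client has received response headers from the server. Once a response has started arriving, gRPC does not retry. This limits duplicate processing but does not prove the server never executed the call, so list a status code as retryable only for methods where a repeat is safe. For read-only RPCs, hedging can be configured instead (a hedgingPolicy with a hedgingDelay): the client sends a second copy of the request after the delay without waiting for the first to fail.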
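As a sketch, the retry policy above can be written out as a service config document. The service name `example.Inventory` is a hypothetical placeholder; the policy fields are the ones listed in this section.

```python
import json

# Service config with a retry policy for every method of one (hypothetical) service.
service_config = {
    "methodConfig": [
        {
            "name": [{"service": "example.Inventory"}],  # placeholder service name
            "retryPolicy": {
                "maxAttempts": 3,             # total attempts, including the first call
                "initialBackoff": "0.1s",
                "maxBackoff": "1s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            },
        }
    ]
}

service_config_json = json.dumps(service_config)

# With the grpcio package, this JSON string is typically supplied as a channel option:
#   grpc.insecure_channel(target, options=[("grpc.service_config", service_config_json)])
```

Keeping the policy in config rather than in call sites means retry behavior can be tuned per method without touching application code.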

Twitter's Finagle retry budgets

Finagle popularized retry budgets as a first-class primitive: RetryBudget(ttl=10s, minRetriesPerSec=5, percentCanRetry=0.2) means at most 20% of requests can be retries over a 10-second window, with a minimum of 5 retries/second to handle low-traffic scenarios. At Twitter's peak of 500K RPS, a 20% budget allows up to 100K retry requests per second. At a 5% failure rate, one retry per failure is only 25K retries/second, well inside the budget. The budget matters when failure rates spike and every client starts retrying: if every request failed and was retried 3 times, unbudgeted clients would add up to 1.5M requests/second, while the budget holds retry traffic at 100K/second and allows the recovering service to breathe.
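A windowed budget with these three parameters can be sketched as follows. This is not Finagle's implementation; the class and method names are hypothetical, and storing one timestamp per request is chosen for clarity (a real client at high RPS would use cheaper counters).

```python
import time
from collections import deque


class RetryBudget:
    """Allow retries up to a fixed floor plus a percentage of recent requests."""

    def __init__(self, ttl=10.0, min_retries_per_sec=5, percent_can_retry=0.2,
                 clock=time.monotonic):
        self.ttl = ttl
        self.min_retries = min_retries_per_sec * ttl  # floor over the whole window
        self.percent = percent_can_retry
        self.clock = clock
        self.requests = deque()  # timestamps of original requests in the window
        self.retries = deque()   # timestamps of retries in the window

    def _expire(self):
        cutoff = self.clock() - self.ttl
        for q in (self.requests, self.retries):
            while q and q[0] <= cutoff:
                q.popleft()

    def record_request(self):
        """Call once per original (non-retry) request."""
        self._expire()
        self.requests.append(self.clock())

    def try_retry(self) -> bool:
        """Return True and record the retry if the budget allows it."""
        self._expire()
        allowed = self.min_retries + self.percent * len(self.requests)
        if len(self.retries) + 1 > allowed:
            return False
        self.retries.append(self.clock())
        return True
```

With `percent_can_retry=0.2` and no floor, 100 recorded requests permit 20 retries in the window; the floor keeps a low-traffic client from being unable to retry at all.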

Anti-Patterns

Immediate retries (no backoff)

Retrying on failure in a tight loop. 1,000 clients each retrying 10 times within 100ms = 10,000 requests in 100ms against a service that just had a blip. The backoff is what gives the downstream time to recover.

Retrying non-idempotent operations

Retrying a POST /payments that timed out. The first request may have succeeded — the timeout was on the response, not the execution. Without an idempotency key, the retry creates a duplicate charge.
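A minimal sketch of how an idempotency key makes that retry safe, using a hypothetical in-memory `PaymentService` (a real service would persist keys durably and expire them):

```python
import uuid


class PaymentService:
    """Hypothetical server that deduplicates payments by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> charge id already issued
        self.charges = []    # every charge actually created

    def create_payment(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        self._results[idempotency_key] = charge_id
        return charge_id


service = PaymentService()
key = str(uuid.uuid4())                      # generated once per logical payment
first = service.create_payment(key, 1999)
retry = service.create_payment(key, 1999)    # simulated retry after a lost response
```

The key must be created before the first attempt and reused on every retry; generating a fresh key per attempt defeats the deduplication.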

Fan-out retry amplification

Retrying at every layer of a 4-tier call chain. Services A, B, and C each make up to 3 attempts against the next tier, so a single failing request at the leaf service can generate 3 × 3 × 3 = 27 requests to the leaf. Design retries at one layer: the edge or the client, not at every hop.
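The multiplication can be checked with a one-line helper. `leaf_requests` is a hypothetical name, and the model assumes the worst case where every attempt at every layer fails.

```python
def leaf_requests(attempts_per_layer: int, retrying_layers: int) -> int:
    """Worst-case requests reaching the leaf when every layer exhausts its attempts."""
    return attempts_per_layer ** retrying_layers


three_layers = leaf_requests(3, 3)   # A, B and C each make up to 3 attempts
edge_only = leaf_requests(3, 1)      # only one layer retries
```

Moving retries to a single layer turns exponential growth in the depth of the chain into a constant factor.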

Retrying hard failures

Retrying 401 Unauthorized or 400 Bad Request. These will not succeed on retry — the request is malformed or the credentials are invalid. Every retry is wasted CPU and wasted retry budget.

Design Tradeoffs

| Dimension | Client-Side Retries | Gateway-Level Retries |
|---|---|---|
| Idempotency awareness | High: application knows which operations are safe | Low: gateway can't distinguish safe vs unsafe RPCs |
| Budget granularity | Per-service, per-operation | Uniform across all traffic |
| Implementation complexity | Requires retry logic in each client library | Centralized, no app code changes |
| Fan-out risk | Lower: client controls its own retry count | Higher: gateway retries multiply across all clients |

Best Practices

  • Apply full jitter: delay = random(0, min(max_backoff, base * 2^attempt)). This is the AWS-recommended formula and a simple, effective way to prevent synchronized retry waves at scale.
  • Cap retries at 2–3. Beyond 3 retries, you're either dealing with a sustained outage (the retry budget should stop you) or a non-transient error (which shouldn't be retried anyway).
  • Check whether the original request deadline has expired before initiating a retry. A retry started with 10ms remaining on a 100ms client deadline will almost certainly time out, so it is wasted work.
  • Classify retryable error codes explicitly: 503, 429, UNAVAILABLE, connection reset/timeout. Default all others to non-retryable. Skipping a retry on an error that was actually transient is safer than retrying one that wasn't.
  • Implement retry budgets at the service client level, not just for individual operations. The budget should track total retry load from the entire client instance, not per-endpoint.
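The deadline check in particular is cheap to implement. A minimal sketch, where `can_retry_before_deadline` and `min_attempt_time` (the shortest time in which an attempt could plausibly succeed) are illustrative assumptions:

```python
import time


def can_retry_before_deadline(deadline: float, delay: float,
                              min_attempt_time: float, now=time.monotonic) -> bool:
    """True if sleeping `delay` still leaves `min_attempt_time` seconds before `deadline`.

    `deadline` is an absolute timestamp on the same clock as `now`.
    """
    return now() + delay + min_attempt_time <= deadline
```

For a 100ms deadline with 10ms remaining and an attempt that needs about 20ms, the check fails and the caller returns the error immediately instead of sending a request that cannot finish in time.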

When to Use / Avoid

| Use When | Avoid When |
|---|---|
| Interacting with unreliable networks or third-party integrations with <99.9% SLA | Executing non-idempotent operations without idempotency keys (payment creation, resource mutation) |
| Handling known transient errors: 503, 429, TCP reset, connection timeout | Downstream service is in sustained failure; retry budgets should disable retries at this point |
| Response time SLO allows 1–2 additional round trips for retries | Deeply nested call chains where each layer already retries; fan-out amplification is unacceptable |