Timeout Design

Misconfigured timeouts, typically missing entirely or left at the library default, are a leading cause of thread pool exhaustion cascades. A single slow downstream service with no timeout constraint can freeze an entire upstream call chain.

TL;DR
  • requests.get(url) with no timeout parameter never times out on its own: requests applies no default, so the call can block indefinitely. A degraded third-party API that takes 90 seconds to respond ties up each calling thread for 90 seconds; at 100 concurrent requests, that is 100 blocked threads. requests.get(url, timeout=(3.05, 30)) is the minimum safe baseline.
  • Separate connection timeout (50ms for same-datacenter, 200ms for cross-region) from read timeout (P99 of downstream service + 50% buffer). They measure different failure modes.
  • Deadline propagation prevents executing work that the upstream client has already abandoned. When Service A has 100ms left on its 500ms deadline, it passes that budget to Service B — Service B aborts if it can't complete in 100ms.
  • Set connection timeouts aggressively — TCP handshake within the same datacenter should take <10ms. A 5-second connection timeout means you don't detect a dead host for 5 seconds.
  • Static timeouts in deep call chains compound: 5s at Layer 1 + 5s at Layer 2 + 5s at Layer 3 = 15s end-to-end timeout. Use deadline propagation instead of stacked static timeouts.

The Problem

A Python service calls a third-party analytics API using requests.get(url) — no timeout parameter. The analytics provider degrades during a maintenance window and starts responding in 120 seconds. Each request holds a thread open for 120 seconds. At 100 concurrent users, 100 threads are blocked after 60 seconds. New requests queue, then fail. The service's own health check endpoint — which calls the analytics API — also starts timing out, causing load balancers to remove the instance. The analytics API — a non-critical reporting feature — has taken down the entire service, all because of a missing timeout parameter.
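The failure mode above can be reproduced in miniature. This is a sketch under stated assumptions: a 4-thread pool stands in for the service's worker threads, a 0.3-second sleep stands in for the 120-second degraded response, and the function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4            # stand-in for the service's worker threads
SLOW_CALL_SECONDS = 0.3  # stand-in for the 120-second degraded response


def call_analytics_without_timeout() -> str:
    # Simulates a blocking downstream call that has no timeout.
    time.sleep(SLOW_CALL_SECONDS)
    return "analytics response"


def health_check(submitted_at: float) -> float:
    # Returns how long the health check waited for a free worker thread.
    return time.monotonic() - submitted_at


def simulate() -> float:
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Every worker picks up one slow analytics call.
        slow = [pool.submit(call_analytics_without_timeout) for _ in range(POOL_SIZE)]
        # The health check is cheap, but it queues behind the blocked workers.
        check = pool.submit(health_check, time.monotonic())
        waited = check.result()
        for future in slow:
            future.result()
    return waited


if __name__ == "__main__":
    print(f"health check waited {simulate():.2f}s for a free thread")
```

The health check does no slow work itself, yet it cannot start until one of the blocked workers is released. With a read timeout on the analytics call, that wait would be bounded by the timeout rather than by the provider's response time.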

Core System Idea

Defensive timeout design enforces strict time limits at each phase of a network operation. There are three distinct timeout types:

  • Connection timeout: time allowed to establish a TCP connection. Same-datacenter: 50ms; cross-region: 200ms. This detects dead hosts, full connection pools, or network partitions quickly.
  • Read timeout: time to wait for data after the connection is established. Set to P99 latency of the downstream service + 50% buffer. If the downstream P99 is 200ms, read timeout = 300ms.
  • Write timeout: time to send the request body. Rarely tuned but critical for large uploads or slow client connections.

Beyond individual call timeouts, distributed systems require deadline propagation: a request entering the system at Layer 1 carries a total time budget (e.g., 500ms). Each downstream call subtracts elapsed time and passes the remaining budget as a context header (gRPC metadata, Request-Timeout HTTP header). If Layer 3 receives a request with 10ms remaining, it aborts immediately rather than executing a 100ms database query whose result the upstream caller will never see. This prevents "dead work": computation that consumes resources but produces results nobody uses.
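A minimal sketch of the budget arithmetic described above. The Deadline class, its method names, and the choice to encode the Request-Timeout header value as integer milliseconds are assumptions made for illustration; the text only says that the remaining budget travels in a header.

```python
import time
from typing import Dict, Optional

HEADER = "Request-Timeout"  # header name from the text; value format is assumed


class DeadlineExceeded(Exception):
    """Raised when the remaining budget cannot cover the next step."""


def _now_ms() -> int:
    return time.monotonic_ns() // 1_000_000


class Deadline:
    def __init__(self, budget_ms: int, now_ms: Optional[int] = None) -> None:
        start = _now_ms() if now_ms is None else now_ms
        self.expires_at_ms = start + budget_ms

    def remaining_ms(self, now_ms: Optional[int] = None) -> int:
        current = _now_ms() if now_ms is None else now_ms
        return max(0, self.expires_at_ms - current)

    def ensure_budget(self, expected_cost_ms: int, now_ms: Optional[int] = None) -> None:
        # Abort before starting work that cannot finish within the budget.
        left = self.remaining_ms(now_ms)
        if left < expected_cost_ms:
            raise DeadlineExceeded(f"{left}ms left, need {expected_cost_ms}ms")

    def to_header(self, now_ms: Optional[int] = None) -> Dict[str, str]:
        # The caller sends only what is left, not the original budget.
        return {HEADER: str(self.remaining_ms(now_ms))}

    @classmethod
    def from_header(cls, headers: Dict[str, str], now_ms: Optional[int] = None) -> "Deadline":
        # The receiver restarts the clock locally from the remaining budget.
        return cls(int(headers[HEADER]), now_ms)
```

Passing a remaining duration instead of an absolute timestamp avoids relying on synchronized clocks between services. The cost is that network transit time is not subtracted from the budget.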

System Flow

flowchart TD
    A["Client Request"] --> B{"TCP Connect"}
    B -- "Exceeds Connect Timeout" --> C["Fail Fast"]
    B -- "Success" --> D{"Send Request"}
    D -- "Exceeds Write Timeout" --> C
    D -- "Success" --> E{"Wait for Response"}
    E -- "Exceeds Read Timeout" --> C
    E -- "Success" --> F["Process Response"]
    C --> G["Return Error / Fallback"]

Each network phase has an independent timeout; any phase failure returns immediately rather than blocking the thread indefinitely.
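The three phases in the flow can be sketched with the standard library socket module. The function name, the PhaseFailed exception, and the single recv call (a real client would loop until the full response arrives) are simplifications assumed for illustration.

```python
import socket


class PhaseFailed(Exception):
    """Carries the name of the phase that failed: connect, write, or read."""

    def __init__(self, phase: str) -> None:
        super().__init__(f"{phase} phase failed or timed out")
        self.phase = phase


def fetch_with_phase_timeouts(
    host: str,
    port: int,
    payload: bytes,
    connect_timeout: float,
    write_timeout: float,
    read_timeout: float,
) -> bytes:
    # Phase 1: TCP connect, bounded by the connect timeout.
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError as exc:
        raise PhaseFailed("connect") from exc

    with sock:  # the socket is closed on every exit path
        # Phase 2: send the request, bounded by the write timeout.
        try:
            sock.settimeout(write_timeout)
            sock.sendall(payload)
        except OSError as exc:
            raise PhaseFailed("write") from exc

        # Phase 3: wait for response bytes, bounded by the read timeout.
        try:
            sock.settimeout(read_timeout)
            return sock.recv(65536)
        except OSError as exc:
            raise PhaseFailed("read") from exc
```

Each phase catches OSError, which covers socket timeouts as well as refused or reset connections, so every failure takes the same fail-fast path to the caller's error or fallback handling.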

Real-World Examples (Indicative)

Python requests library — the most common bug

requests.get(url) with no timeout never times out: requests applies no default timeout, so the call blocks until the server responds or the connection fails. The safe baseline is requests.get(url, timeout=(3.05, 30)): a connect timeout of 3.05 seconds (the requests documentation recommends a value slightly larger than a multiple of 3 seconds, the default TCP packet retransmission window) and a read timeout of 30 seconds. For internal services where P99 is known, use timeout=(0.05, downstream_p99 * 1.5), with downstream_p99 expressed in seconds. This is the single most commonly missing reliability configuration in Python services.
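A short sketch of how these values might be wired into a call. The helper names internal_timeout and fetch are hypothetical, and returning None on timeout is one possible fallback policy, not a recommendation from the text.

```python
from typing import Tuple

SAME_DC_CONNECT_SECONDS = 0.05
CROSS_REGION_CONNECT_SECONDS = 0.2
EXTERNAL_BASELINE: Tuple[float, float] = (3.05, 30.0)  # (connect, read)


def internal_timeout(downstream_p99_seconds: float, cross_region: bool = False) -> Tuple[float, float]:
    """Build a (connect, read) tuple: topology-based connect, P99 + 50% read."""
    connect = CROSS_REGION_CONNECT_SECONDS if cross_region else SAME_DC_CONNECT_SECONDS
    return (connect, downstream_p99_seconds * 1.5)


def fetch(url: str, timeout: Tuple[float, float] = EXTERNAL_BASELINE):
    """Return the response, or None if the connect or read timeout fired."""
    # Third-party dependency, imported here so the helper above stays stdlib-only.
    import requests

    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None
```

For an internal service with a 200ms P99, fetch(url, timeout=internal_timeout(0.2)) gives a 50ms connect timeout and a 300ms read timeout.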

gRPC deadline propagation at Google

gRPC supports deadline propagation natively: when Service A holds a 200ms deadline and calls Service B with the same call context, gRPC calculates the remaining time (e.g., 140ms after Service A's own processing) and sends it in the grpc-timeout request header as 140m (the m unit means milliseconds). In Go, Service B's handler can call ctx.Err() to check whether the deadline has already expired before executing database queries. Google's internal RPC framework (Stubby, predecessor to gRPC) enforced deadline propagation as a compile-time requirement: any RPC that didn't propagate deadlines was rejected. The result: no dead work accumulates in deep call stacks.
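To make the wire format concrete, here is a sketch of encoding and decoding a grpc-timeout value. It assumes the format in the gRPC over HTTP/2 protocol description: an integer of at most 8 digits followed by a unit (H, M, S, m, u, n). gRPC libraries handle this internally; the function names are hypothetical and the code is illustrative only.

```python
MAX_TIMEOUT_VALUE = 99_999_999  # the value part is limited to 8 digits

# Unit letters and their size in milliseconds, finest first.
_UNITS_MS = [("m", 1), ("S", 1_000), ("M", 60_000), ("H", 3_600_000)]


def encode_grpc_timeout(remaining_ms: int) -> str:
    """Encode a remaining budget, using the finest unit that fits in 8 digits."""
    if remaining_ms <= 0:
        raise ValueError("deadline already expired")
    for unit, size_ms in _UNITS_MS:
        value = remaining_ms // size_ms  # round down: never extend the deadline
        if value <= MAX_TIMEOUT_VALUE:
            return f"{value}{unit}"
    raise ValueError("timeout too large to encode")


def decode_grpc_timeout(header_value: str) -> int:
    """Decode a grpc-timeout value into whole milliseconds."""
    number, unit = int(header_value[:-1]), header_value[-1]
    if unit == "u":  # microseconds
        return number // 1_000
    if unit == "n":  # nanoseconds
        return number // 1_000_000
    return number * dict(_UNITS_MS)[unit]
```

With 140ms left, encode_grpc_timeout(140) produces the 140m value from the example above.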

Netflix per-tier timeout profiles

Netflix maintains different timeout profiles by service tier. Internal service-to-service calls (same AZ): connect 50ms, read 500ms. Calls to their primary Cassandra persistence layer: connect 50ms, read 1000ms. Calls to third-party content distributors: connect 200ms, read 5000ms. The key insight: connection timeout reflects network topology (same AZ = <5ms round-trip, cross-region = up to 150ms), while read timeout reflects the service's observed P99. Conflating the two — using the same value for both — means either your connection timeout is too generous (doesn't detect dead hosts quickly) or your read timeout is too tight (fires before legitimate responses arrive).
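Per-tier profiles like these could be expressed as a small lookup table. The values come from the paragraph above; the tier keys, the TimeoutProfile class, and the validation rule (connect at least ten times smaller than read) are assumptions for illustration, not Netflix's actual configuration format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TimeoutProfile:
    connect_ms: int
    read_ms: int

    def __post_init__(self) -> None:
        # Guard against conflating the two values: connect should be
        # roughly an order of magnitude below read.
        if self.connect_ms * 10 > self.read_ms:
            raise ValueError("connect timeout should be <= read timeout / 10")

    def as_seconds(self) -> Tuple[float, float]:
        """(connect, read) in seconds, the shape most HTTP clients expect."""
        return (self.connect_ms / 1000, self.read_ms / 1000)


# Hypothetical tier names; values taken from the text above.
PROFILES: Dict[str, TimeoutProfile] = {
    "internal_same_az": TimeoutProfile(connect_ms=50, read_ms=500),
    "cassandra": TimeoutProfile(connect_ms=50, read_ms=1000),
    "third_party": TimeoutProfile(connect_ms=200, read_ms=5000),
}


def profile_for(tier: str) -> TimeoutProfile:
    # Fail loudly on unknown tiers instead of falling back to no timeout.
    return PROFILES[tier]
```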

Anti-Patterns

No timeout on HTTP clients

requests.get(url) without timeout, urllib.request.urlopen(url) without timeout, Java's HttpClient with default configuration. By default these either set no timeout at all or inherit very long defaults, so a stalled call can block for minutes or indefinitely.

Identical connection and read timeouts

Setting both to 5s. A dead host with a dropped connection should fail in 50ms, not 5 seconds. The connection timeout should be an order of magnitude smaller than the read timeout.

Static timeouts in deep call chains

5s timeout at every layer of a 5-service chain = 25s total end-to-end timeout. No user will wait 25 seconds. Use deadline propagation to budget the total end-to-end latency, not stacked per-hop timeouts.

Dead work execution

A service that receives a request with 10ms remaining on the deadline executes a 500ms database query, completes, and returns the result — to a caller that disconnected 490ms ago. Check deadline before executing heavy operations.
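One way to express the check is a guard that runs before the expensive operation. This is a sketch: the decorator name, the convention of passing the deadline as a time.monotonic() timestamp in the first argument, and the 0.5-second expected cost are all assumptions for illustration.

```python
import functools
import time
from typing import Any, Callable


class DeadlineExceeded(Exception):
    """Raised instead of starting work the caller can no longer use."""


def requires_budget(expected_seconds: float) -> Callable:
    """Skip the wrapped operation when the remaining budget is too small."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(deadline_at: float, *args: Any, **kwargs: Any) -> Any:
            # deadline_at is an absolute time.monotonic() timestamp.
            remaining = deadline_at - time.monotonic()
            if remaining < expected_seconds:
                raise DeadlineExceeded(
                    f"{remaining:.3f}s left, operation needs ~{expected_seconds}s"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_budget(expected_seconds=0.5)
def run_report_query(customer_id: str) -> str:
    # Placeholder for the 500ms database query from the example above.
    return f"report for {customer_id}"
```

Calling run_report_query(time.monotonic() + 0.010, "c1") raises DeadlineExceeded without touching the database, which is the behavior the anti-pattern lacks.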

Design Tradeoffs

| Dimension | Static Per-Hop Timeouts | Deadline Propagation |
|---|---|---|
| Configuration complexity | Simple: set once per service | Requires consistent header propagation across all services |
| Dead work prevention | No: each hop still executes even if upstream timed out | Yes: aborts immediately when deadline expires |
| Deep call chain behavior | Timeouts multiply: 5s × 5 hops = 25s | Total deadline enforced end-to-end regardless of depth |
| Infrastructure requirement | None | Requires context propagation in all service clients |

Best Practices

  • Set connection timeouts at 50ms for same-datacenter calls and 200ms for cross-region. These should be an order of magnitude below read timeouts: connection failure is a network event, not a service processing event.
  • Derive read timeouts from observed P99 latency: read_timeout = p99 * 1.5. Review and update these values quarterly as service performance changes. Static values set years ago are usually wrong.
  • Propagate deadlines explicitly in service-to-service calls. Pass the remaining budget in a Request-Timeout header or gRPC metadata. Check the deadline before executing expensive operations: if the remaining budget is less than the operation's expected latency, abort immediately.
  • When a timeout fires, close the socket immediately and return the thread to the pool. A socket that is abandoned without being closed keeps holding a file descriptor, and can sit in CLOSE_WAIT after the peer closes its end, until the process finally releases it.
  • Log timeout events with the service name, timeout type (connect/read/write), configured timeout value, and actual elapsed time. This data drives timeout tuning: comparing the configured value with the observed elapsed time shows whether timeouts are too tight or never firing.
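The read-timeout derivation and the logging fields above can be sketched together. The nearest-rank percentile method, the function names, the logger name, and the log line format are assumptions for illustration. In practice the P99 would more likely come from a metrics system than from raw samples.

```python
import logging
from typing import Dict, Sequence, Union

logger = logging.getLogger("timeouts")  # hypothetical logger name


def p99(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of observed latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def read_timeout_ms(samples_ms: Sequence[float]) -> float:
    """read_timeout = p99 * 1.5, as recommended above."""
    return p99(samples_ms) * 1.5


def log_timeout(
    service: str, phase: str, configured_ms: float, elapsed_ms: float
) -> Dict[str, Union[str, float]]:
    """Record the four fields needed to tune timeouts later."""
    fields: Dict[str, Union[str, float]] = {
        "service": service,
        "phase": phase,  # connect, read, or write
        "configured_ms": configured_ms,
        "elapsed_ms": elapsed_ms,
    }
    logger.warning(
        "timeout service=%s phase=%s configured_ms=%s elapsed_ms=%s",
        service, phase, configured_ms, elapsed_ms,
    )
    return fields
```

For 100 samples of 1ms through 100ms, the P99 is 99ms and the derived read timeout is 148.5ms.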

When to Use / Avoid

| Use When | Avoid When |
|---|---|
| Any service-to-service call over a network (HTTP, gRPC, TCP) | Long-lived background jobs with inherently variable execution time (data exports, batch ETL) |
| User-facing API endpoints where slow responses degrade user experience | WebSocket connections or long-polling endpoints designed for persistent connection |
| Protecting thread pools from exhaustion by slow downstream dependencies | Non-idempotent writes that must run to completion regardless of time (financial transactions without idempotency keys) |