Idempotency in Distributed Systems

Idempotency makes operations safe to retry by ensuring multiple identical requests produce exactly one set of side effects — the foundation of at-least-once delivery without double-charges or duplicate records.

TL;DR
  • Stripe stores Idempotency-Key headers for 24 hours — any retry within that window returns the original response byte-for-byte without re-executing business logic.
  • The atomic check-and-set must happen in a single operation: read the key, verify absent, write PENDING — all under a lock. Two concurrent requests that both read "key absent" will both execute and both charge the card.
  • Derive idempotency keys from stable input data, not timestamps or random UUIDs generated per attempt. A key that changes between retries provides no deduplication.
  • Kafka's enable.idempotence=true assigns each producer a PID and per-partition sequence number — the broker rejects duplicates within a single producer session, giving exactly-once writes to a partition.
  • Deduplication windows must be finite: 24 hours for Stripe, 5 minutes for SQS FIFO, 4 hours for Twilio SMS. After expiry, the same key is treated as a new request.

The Problem

A mobile payments app retries a charge request on timeout. The first request succeeded — the card was charged — but the response was lost in transit. The retry executes again without an idempotency key and charges the card a second time. The customer disputes the charge; the business absorbs the chargeback fee. At a payment processor handling 1M transactions/day with a 0.5% timeout retry rate, that's 5,000 potential double-charges daily — a compliance and trust catastrophe. Without idempotency, at-least-once delivery in networks and queues translates directly into at-least-once business operations.

Core System Idea

Idempotency makes an operation safe to retry by ensuring multiple identical requests produce exactly one set of side effects. The mechanism: a client generates a unique idempotency key from stable input data and includes it in the request. The server atomically checks the key in a durable store:

  • If absent, it writes the key as PENDING and executes.
  • If present as COMPLETED, it returns the cached response without re-executing.
  • If present as PENDING, it returns 409 (execution already in progress).

Two storage strategies:

  • Redis idempotency store: sub-millisecond key checks using SET key PENDING NX PX 86400000 (atomic check-and-set with 24-hour TTL). Ideal for high-throughput payment APIs. Requires careful handling of Redis failover, since a missed write during restart allows duplicate execution.
  • In-database idempotency table: key check and business write in a single ACID transaction. Stronger consistency, 2–5ms overhead. Ideal for financial operations where the idempotency record and the transaction record must be atomically consistent.

Both strategies require tracking three states: PENDING (return 409), COMPLETED (return cached response), FAILED (allow retry or return original error).
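The in-database strategy can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the table names (idempotency_keys, charges), the function charge_once, and the response format are all assumptions made for the example. Because the key row and the charge row commit in the same transaction, other connections never observe a PENDING row in this variant, so the sketch stores only the final status code and body.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # transaction boundaries below are the explicit BEGIN/COMMIT statements.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE idempotency_keys ("
        "key TEXT PRIMARY KEY, http_status INTEGER NOT NULL, body TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE charges ("
        "id INTEGER PRIMARY KEY, user_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    return conn

def charge_once(conn, key, user_id, amount_cents):
    """Return (http_status, body); a charge row is written at most once per key."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking the key
    try:
        row = conn.execute(
            "SELECT http_status, body FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            conn.execute("ROLLBACK")
            return row  # replay the stored response, status code included
        cur = conn.execute(
            "INSERT INTO charges (user_id, amount_cents) VALUES (?, ?)",
            (user_id, amount_cents),
        )
        response = (201, f"charge:{cur.lastrowid}")
        conn.execute(
            "INSERT INTO idempotency_keys (key, http_status, body) VALUES (?, ?, ?)",
            (key, response[0], response[1]),
        )
        conn.execute("COMMIT")  # key record and charge become visible together
        return response
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Calling charge_once twice with the same key returns the same tuple both times and leaves a single row in charges. A real payment flow also has side effects outside the database (the card network call), which this sketch does not cover.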

System Flow

flowchart TD
    A["Client"] --> B["Service API"]
    B --> C{"Idempotency Key Check"}
    C -- "COMPLETED" --> D["Return Cached Result"]
    C -- "PENDING" --> E["Return 409 In Progress"]
    C -- "Not Found" --> F["Write Key as PENDING"]
    F --> G["Execute Business Logic"]
    G -- "Success" --> H["Mark COMPLETED, Cache Response"]
    G -- "Failure" --> I["Mark FAILED"]
    H --> J["Return Result"]
    D --> J

Key check determines execution path: cached result, in-progress rejection, or fresh execution with atomic status tracking.
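The flow above can be approximated in a few lines of Python. This is an in-memory sketch under stated assumptions: the class IdempotentHandler, the Record type, and the choice to let a FAILED key be retried are illustrative, and a dict guarded by a lock stands in for the durable store.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status, body)

@dataclass
class Record:
    status: str                        # "PENDING", "COMPLETED", or "FAILED"
    response: Optional[Response] = None

class IdempotentHandler:
    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self._lock = threading.Lock()

    def handle(self, key: str, operation: Callable[[], Response]) -> Response:
        with self._lock:  # key check and PENDING write happen as one step
            rec = self._records.get(key)
            if rec is not None and rec.status == "PENDING":
                return 409, "in progress"
            if rec is not None and rec.status == "COMPLETED":
                return rec.response  # cached result, no re-execution
            # Key not found, or FAILED (this sketch allows a retry).
            self._records[key] = Record("PENDING")
        try:
            response = operation()  # business logic runs outside the lock
        except Exception as exc:
            with self._lock:
                self._records[key] = Record("FAILED", (500, str(exc)))
            return 500, str(exc)
        with self._lock:
            self._records[key] = Record("COMPLETED", response)
        return response
```

Whether a FAILED key should permit a retry or replay the original error depends on the error type; the sketch picks the retry branch only to keep the example short.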

Real-World Examples (Indicative)

Stripe's idempotency key system

Clients include an Idempotency-Key header on every POST. Stripe stores the key for 24 hours in a per-API-key namespace. If the same key arrives while the original request is still processing, Stripe returns 409 Conflict — it will not execute a second time. The cached response includes the full HTTP status and body from the original execution, so a 402 Payment Required is returned exactly as the original, not re-evaluated. Stripe's idempotency system processes billions of API calls per year; without it, every client retry policy would require manual deduplication.

Kafka idempotent producers

Setting enable.idempotence=true assigns each producer a unique PID (producer ID) and increments a per-partition sequence number for each message batch. The broker tracks the last sequence per (PID, partition) — any batch with a sequence ≤ last seen is rejected as a duplicate and acknowledged without re-writing. This gives exactly-once writes to a partition without a separate deduplication store, as long as the producer session doesn't restart (a new session generates a new PID).
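The sequence check described above can be modelled with a toy partition log. This is a simplified illustration, not Kafka's broker code: ToyPartitionLog and its append method are hypothetical names, the class models a single partition, and it ignores sequence gaps, which a real broker treats separately.

```python
from typing import Dict, List

class ToyPartitionLog:
    """Models one partition that remembers the last sequence per producer ID."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self._last_seq: Dict[int, int] = {}  # PID -> last accepted sequence

    def append(self, pid: int, seq: int, batch: List[str]) -> bool:
        """Return True if the batch was written, False if treated as a duplicate."""
        last = self._last_seq.get(pid, -1)
        if seq <= last:
            # Already seen: acknowledge without writing the batch again.
            return False
        self.records.extend(batch)
        self._last_seq[pid] = seq
        return True

log = ToyPartitionLog()
log.append(pid=7, seq=0, batch=["order-1"])   # written
log.append(pid=7, seq=0, batch=["order-1"])   # retry, dropped as duplicate
log.append(pid=8, seq=0, batch=["order-1"])   # new PID after restart, written again
```

The last call shows the caveat from the paragraph above: a restarted producer gets a new PID, so the log cannot recognize its resend as a duplicate.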

Twilio SMS deduplication

Twilio's X-Twilio-Idempotency-Token header deduplicates SMS sends within a 4-hour window. Without it, a carrier timeout causes the caller to retry and the customer receives two identical SMS messages. Twilio documents this as a required header for any send that a client will retry on timeout — at Twilio's scale of 2B+ messages/year, duplicate sends without idempotency keys would generate millions of duplicate customer notifications annually.

Anti-Patterns

Using transaction IDs as idempotency keys

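A transaction ID minted per attempt identifies that attempt, not the logical operation. If a retry of the same charge generates a new transaction ID, the server sees two distinct keys and both attempts execute: you have two charges, not one. Use a key that stays identical across every retry of the same operation.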

Relying solely on database unique constraints

A UNIQUE constraint prevents duplicate DB rows but doesn't prevent a credit card charge that happened before the row insert failed. Idempotency must cover all side effects, not just database writes.

Non-atomic check-and-set

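Reading the idempotency store and writing PENDING in two separate operations allows two concurrent retries to both read "key absent" and both proceed. Use Redis SET key PENDING NX, or an INSERT into a table with a UNIQUE constraint on the key (for example INSERT ... ON CONFLICT DO NOTHING in PostgreSQL), to make the check and write atomic. A bare INSERT ... WHERE NOT EXISTS without a unique constraint can still race under common isolation levels.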
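To make the atomic variant concrete, the sketch below races many threads against a single set-if-absent operation. KeyStore and race are hypothetical names; the lock inside set_if_absent plays the role that NX plays in Redis, so exactly one caller wins no matter how the threads interleave.

```python
import threading

class KeyStore:
    def __init__(self) -> None:
        self._data = {}
        self._lock = threading.Lock()

    def set_if_absent(self, key: str, value: str) -> bool:
        """Atomically write value only if key is missing; True means we won."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

def race(store: KeyStore, key: str, n_threads: int = 32) -> int:
    """Start n_threads concurrent attempts and return how many claimed the key."""
    winners = []
    barrier = threading.Barrier(n_threads)  # release all threads together

    def attempt() -> None:
        barrier.wait()
        if store.set_if_absent(key, "PENDING"):
            winners.append(threading.current_thread().name)

    threads = [threading.Thread(target=attempt) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(winners)
```

If the membership check and the write were separate calls without the lock, two threads could both pass the check before either wrote, which is the failure mode described above.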

No status tracking

Storing only key presence (not PENDING/COMPLETED/FAILED) makes it impossible to distinguish an in-progress execution from a completed one. A crashed execution looks identical to a successful one — the retry doesn't know whether to re-execute or return the cached result.

Design Tradeoffs

Dimension | Redis Idempotency Store | In-Database Idempotency Table
Key check latency | <1ms | 2–5ms (DB round-trip)
Consistency model | Requires careful TTL management; Redis failover risks | Strong ACID: key + business write in one transaction
Operational overhead | Requires Redis HA deployment | Uses existing DB, no new infrastructure
Best for | High-throughput APIs, webhook deduplication | Financial writes needing atomic key + transaction record
Expiry mechanism | TTL per key (Redis EXPIRE) | Scheduled cleanup job or partitioned table
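Both expiry mechanisms in the last row can be illustrated with one small in-memory model. This is a sketch under assumptions: DedupWindow, first_seen, and purge are hypothetical names, and the clock is injected so that expiry can be exercised without waiting. first_seen behaves like a per-key TTL, while purge resembles a scheduled cleanup job.

```python
import time
from typing import Callable, Dict

class DedupWindow:
    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def first_seen(self, key: str) -> bool:
        """True if key is new or its window has expired; False if a live duplicate."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and now < expiry:
            return False
        self._expires_at[key] = now + self._ttl  # start a fresh window
        return True

    def purge(self) -> int:
        """Delete expired keys (the cleanup-job variant); return how many were removed."""
        now = self._clock()
        expired = [k for k, exp in self._expires_at.items() if exp <= now]
        for k in expired:
            del self._expires_at[k]
        return len(expired)

now = [0.0]
window = DedupWindow(ttl_seconds=300, clock=lambda: now[0])  # 5-minute window
first = window.first_seen("msg-1")    # new key
now[0] = 299.0
second = window.first_seen("msg-1")   # still inside the window
now[0] = 300.0
third = window.first_seen("msg-1")    # window expired, treated as a new request
```

The third call shows the consequence of a finite window: after expiry, the same key is accepted again.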

Best Practices

  • Generate idempotency keys from stable input data, for example sha256 over user_id, action, amount, and date, joined with a delimiter so that field boundaries stay unambiguous. A key derived from the input is identical across every retry attempt for the same logical operation.
  • Use Redis SET key PENDING NX PX 86400000 for atomic check-and-set. If the command returns null, the key already exists: return the cached result or 409 without executing.
  • Store the full response (HTTP status code + body) alongside the key. Retries receive identical responses, including error responses: a 402 is returned as-is, not re-evaluated.
  • Track PENDING, COMPLETED, and FAILED states explicitly. A FAILED state means the execution ran but produced an error; clients can retry based on the error type. A PENDING state means execution is still running; return 409.
  • Set a deduplication window appropriate to your retry policy. Stripe uses 24 hours (matches maximum retry window). SQS FIFO uses 5 minutes. Longer windows reduce duplicate risk but consume more storage.
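The first practice, deriving the key from stable inputs, might look like the sketch below. The function name derive_key and the field set are assumptions taken from the example formula. The fields are serialized as a JSON array before hashing, because plain string concatenation lets different inputs collide: user "1" with amount "23" and user "12" with amount "3" both concatenate to "123".

```python
import hashlib
import json

def derive_key(user_id: str, action: str, amount_cents: int, date: str) -> str:
    """Return a hex SHA-256 digest that is stable across retries of one operation."""
    # A JSON array keeps field boundaries explicit, unlike raw concatenation.
    payload = json.dumps([user_id, action, amount_cents, date], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_first_attempt = derive_key("u-42", "charge", 1999, "2024-05-01")
key_retry = derive_key("u-42", "charge", 1999, "2024-05-01")
key_other_amount = derive_key("u-42", "charge", 2999, "2024-05-01")
```

One consequence of this design: two intentional, identical purchases on the same date map to the same key and are deduplicated. If that is not acceptable, the client needs an extra stable component, such as a cart or order identifier, in the hashed fields.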

When to Use / Avoid

Use When | Avoid When
Financial transactions or state-mutating external API calls are involved | Operations are inherently idempotent by nature (e.g., PUT /resource with full state)
Message queues deliver at-least-once (Kafka, SQS, RabbitMQ) | Read-only operations: GET requests have no side effects to deduplicate
Long-running operations may be retried on timeout | High-frequency internal operations where 1–5ms dedup overhead exceeds acceptable latency budget
Multiple clients may concurrently retry the same logical operation | The cost of idempotency infra exceeds the business risk of rare duplicate operations