Idempotency in Distributed Systems
Idempotency makes operations safe to retry by ensuring multiple identical requests produce exactly one set of side effects — the foundation of at-least-once delivery without double-charges or duplicate records.
- Stripe stores Idempotency-Key headers for 24 hours; any retry within that window returns the original response byte-for-byte without re-executing business logic.
- The atomic check-and-set must happen in a single operation: read the key, verify absent, write PENDING, all under a lock. Two concurrent requests that both read "key absent" will both execute and both charge the card.
- Derive idempotency keys from stable input data, not timestamps or random UUIDs generated per attempt. A key that changes between retries provides no deduplication.
- Kafka's enable.idempotence=true assigns each producer a PID and per-partition sequence number; the broker rejects duplicates within a single producer session, giving exactly-once writes to a partition.
- Deduplication windows must be finite: 24 hours for Stripe, 5 minutes for SQS FIFO, 4 hours for Twilio SMS. After expiry, the same key is treated as a new request.
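The key-derivation point above can be illustrated with a minimal sketch. The function name, the field names, and the separator are assumptions made for this example; the only requirement taken from the text is that every input stays the same across retries of one logical operation.

```python
import hashlib


def derive_idempotency_key(user_id: str, action: str, amount_cents: int, date: str) -> str:
    """Build a deterministic key from fields that do not change between retries."""
    # Join with a separator that is unlikely to occur inside the fields, so that
    # ("ab", "c") and ("a", "bc") do not collapse into the same input string.
    material = "\x1f".join([user_id, action, str(amount_cents), date])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Hypothetical values: the first attempt and its retry hash the same inputs.
first_attempt = derive_idempotency_key("user-42", "charge", 1999, "2024-05-01")
retry_attempt = derive_idempotency_key("user-42", "charge", 1999, "2024-05-01")
other_charge = derive_idempotency_key("user-42", "charge", 2500, "2024-05-01")
```

Here `first_attempt` and `retry_attempt` are equal, while `other_charge` differs because the amount differs. Including the date means two genuinely separate charges of the same amount on the same day would also share a key, so the choice of fields is a business decision, not only a technical one.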
The Problem
A mobile payments app retries a charge request on timeout. The first request succeeded — the card was charged — but the response was lost in transit. The retry executes again without an idempotency key and charges the card a second time. The customer disputes the charge; the business absorbs the chargeback fee. At a payment processor handling 1M transactions/day with a 0.5% timeout retry rate, that's 5,000 potential double-charges daily — a compliance and trust catastrophe. Without idempotency, at-least-once delivery in networks and queues translates directly into at-least-once business operations.
Core System Idea
Idempotency makes an operation safe to retry by ensuring multiple identical requests produce exactly one set of side effects. The mechanism: a client generates a unique idempotency key from stable input data and includes it in the request. The server atomically checks the key in a durable store: if absent, it writes the key as PENDING and executes; if present as COMPLETED, it returns the cached response without re-executing; if present as PENDING, it returns 409 (execution already in progress).
Two storage strategies:
- Redis idempotency store: sub-millisecond key checks using SET key PENDING NX PX 86400000 (atomic check-and-set with a 24-hour TTL). Ideal for high-throughput payment APIs. Requires careful handling of Redis failover, since a missed write during restart allows duplicate execution.
- In-database idempotency table: key check and business write in a single ACID transaction. Stronger consistency at 2–5ms overhead. Ideal for financial operations where the idempotency record and the transaction record must be atomically consistent.
Both strategies require tracking three states: PENDING (return 409), COMPLETED (return cached response), FAILED (allow retry or return original error).
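The three-state mechanism can be sketched with an in-memory store. This is an illustration only: a process-local dict guarded by a lock stands in for the durable store, the names `IdempotencyStore`, `claim`, `finish`, and `handle` are invented for the example, and FAILED is treated as "allow retry", which is one of the two options the text permits.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

PENDING, COMPLETED, FAILED = "PENDING", "COMPLETED", "FAILED"
Response = Tuple[int, str]  # (HTTP status, body)


@dataclass
class Record:
    state: str
    response: Optional[Response] = None


class IdempotencyStore:
    """In-memory stand-in for a durable store (assumption for this sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: Dict[str, Record] = {}

    def claim(self, key: str) -> Optional[Record]:
        """Write PENDING if the key is absent or FAILED, all under one lock.

        Returns None when the caller won the claim, otherwise a copy of the
        existing record.
        """
        with self._lock:
            existing = self._records.get(key)
            if existing is None or existing.state == FAILED:
                self._records[key] = Record(PENDING)
                return None
            return Record(existing.state, existing.response)

    def finish(self, key: str, state: str, response: Response) -> None:
        with self._lock:
            self._records[key] = Record(state, response)


def handle(store: IdempotencyStore, key: str, operation: Callable[[], Response]) -> Response:
    existing = store.claim(key)
    if existing is not None:
        if existing.state == PENDING:
            return (409, "a request with this key is already in progress")
        return existing.response  # COMPLETED: replay the cached response
    try:
        response = operation()
    except Exception as exc:
        store.finish(key, FAILED, (500, str(exc)))  # a later retry may re-execute
        raise
    store.finish(key, COMPLETED, response)
    return response
```

Calling `handle` twice with the same key runs `operation` once and returns the stored response the second time. A real deployment would also need expiry and persistence, which this sketch leaves out.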
System Flow
Key check determines execution path: cached result, in-progress rejection, or fresh execution with atomic status tracking.
Real-World Examples (Indicative)
Clients include an Idempotency-Key header on every POST. Stripe stores the key for 24 hours in a per-API-key namespace. If the same key arrives while the original request is still processing, Stripe returns 409 Conflict — it will not execute a second time. The cached response includes the full HTTP status and body from the original execution, so a 402 Payment Required is returned exactly as the original, not re-evaluated. Stripe's idempotency system processes billions of API calls per year; without it, every client retry policy would require manual deduplication.
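From the client's side, the important detail is that the key is created once per logical operation and reused on every attempt. The sketch below assumes a hypothetical `send` callable as the transport and a `FlakyServer` stand-in that executes the charge but loses the first response; neither represents Stripe's actual client library or server.

```python
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status, body)


def send_with_retries(
    send: Callable[[Dict[str, str], str], Response],
    idempotency_key: str,
    body: str,
    max_attempts: int = 3,
) -> Response:
    """Retry on timeout, sending the same Idempotency-Key on every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    headers = {"Idempotency-Key": idempotency_key}  # built once, never regenerated
    last_error: Optional[TimeoutError] = None
    for _ in range(max_attempts):
        try:
            return send(headers, body)
        except TimeoutError as exc:
            last_error = exc
    raise last_error


class FlakyServer:
    """Stand-in server: executes the charge, then drops the first response."""

    def __init__(self) -> None:
        self.charges = 0
        self.cache: Dict[str, Response] = {}
        self.dropped = False

    def __call__(self, headers: Dict[str, str], body: str) -> Response:
        key = headers["Idempotency-Key"]
        if key not in self.cache:
            self.charges += 1  # the side effect happens only for an unseen key
            self.cache[key] = (200, f"charged: {body}")
        if not self.dropped:
            self.dropped = True
            raise TimeoutError("response lost in transit")
        return self.cache[key]


server = FlakyServer()
response = send_with_retries(server, "order-1001-charge", "19.99 USD")
```

After this runs, `server.charges` is 1 even though the request was sent twice, which is the behavior the lost-response scenario needs.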
In Kafka, setting enable.idempotence=true on a producer assigns it a unique PID (producer ID) and a per-partition sequence number that increments with each message batch. The broker tracks the last sequence per (PID, partition); any batch with a sequence ≤ the last seen is treated as a duplicate and acknowledged without being re-written. This gives exactly-once writes to a partition without a separate deduplication store, as long as the producer session doesn't restart (a new session generates a new PID).
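The sequence check described here can be modeled in a few lines. This is a simplified toy model of a single partition, not Kafka's broker code: the class and method names are invented, and handling of sequence gaps is deliberately left out.

```python
from typing import Dict, List


class PartitionLog:
    """Toy model of one partition that deduplicates by (PID, sequence)."""

    def __init__(self) -> None:
        self.batches: List[bytes] = []
        self._last_seq: Dict[int, int] = {}  # producer ID -> last accepted sequence

    def append(self, pid: int, seq: int, batch: bytes) -> bool:
        """Return True if the batch was written, False if it was a duplicate."""
        last = self._last_seq.get(pid, -1)
        if seq <= last:
            return False  # duplicate: acknowledged without re-writing
        self.batches.append(batch)
        self._last_seq[pid] = seq
        return True


log = PartitionLog()
wrote_first = log.append(pid=7, seq=0, batch=b"order-created")
wrote_retry = log.append(pid=7, seq=0, batch=b"order-created")    # same session, retried
wrote_next = log.append(pid=7, seq=1, batch=b"order-paid")
wrote_restart = log.append(pid=8, seq=0, batch=b"order-created")  # new session, new PID
```

The retry within the same session is dropped, while the resend under a new PID is written again. That second case is the session-restart limitation noted above.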
Twilio's X-Twilio-Idempotency-Token header deduplicates SMS sends within a 4-hour window. Without it, a carrier timeout causes the caller to retry and the customer receives two identical SMS messages. Twilio documents this as a required header for any send that a client will retry on timeout — at Twilio's scale of 2B+ messages/year, duplicate sends without idempotency keys would generate millions of duplicate customer notifications annually.
Anti-Patterns
A transaction ID identifies a business transaction, not a specific attempt. If a retry for the same transaction generates a new transaction ID, both execute — you have two charges, not one.
A UNIQUE constraint prevents duplicate DB rows but doesn't prevent a credit card charge that happened before the row insert failed. Idempotency must cover all side effects, not just database writes.
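Reading the idempotency store and writing PENDING in two separate operations allows two concurrent retries to both read "key absent" and both proceed. Use Redis SET key PENDING NX, or an INSERT that relies on a unique constraint on the key column (for example INSERT ... ON CONFLICT DO NOTHING in PostgreSQL), to make the check and write atomic. A bare INSERT ... WHERE NOT EXISTS without a unique constraint can still race under concurrent transactions.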
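A single-command claim with Redis SET ... NX might look like the sketch below. It assumes a client whose `set` method accepts `nx` and `px` keyword arguments and returns a truthy value on success and `None` when the key already exists, which is how redis-py's `Redis.set` behaves. `FakeRedis` is a stand-in written for this example so the snippet runs without a server; it ignores the TTL.

```python
from typing import Dict, Optional, Protocol

TTL_MS = 86_400_000  # 24 hours in milliseconds


class SupportsSet(Protocol):
    def set(self, name: str, value: str, *, nx: bool = False,
            px: Optional[int] = None) -> Optional[bool]: ...


def try_claim(client: SupportsSet, key: str) -> bool:
    """Issue SET key PENDING NX PX 86400000 as one command.

    True means this caller owns the execution. False means the key already
    exists, so the caller should return the cached result or 409.
    """
    return bool(client.set(key, "PENDING", nx=True, px=TTL_MS))


class FakeRedis:
    """In-memory stand-in for a Redis client (assumption; TTL is not modeled)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, name: str, value: str, *, nx: bool = False,
            px: Optional[int] = None) -> Optional[bool]:
        if nx and name in self._data:
            return None  # key exists and NX was requested: nothing is written
        self._data[name] = value
        return True


client = FakeRedis()
first = try_claim(client, "charge:user-42:1999")
second = try_claim(client, "charge:user-42:1999")
```

Only the first call wins the claim. Because the check and the write are one command, there is no window in which two callers can both observe "key absent".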
Storing only key presence (not PENDING/COMPLETED/FAILED) makes it impossible to distinguish an in-progress execution from a completed one. A crashed execution looks identical to a successful one — the retry doesn't know whether to re-execute or return the cached result.
Design Tradeoffs
| Dimension | Redis Idempotency Store | In-Database Idempotency Table |
|---|---|---|
| Key check latency | <1ms | 2–5ms (DB round-trip) |
| Consistency model | Requires careful TTL management; Redis failover risks | Strong ACID — key + business write in one transaction |
| Operational overhead | Requires Redis HA deployment | Uses existing DB, no new infrastructure |
| Best for | High-throughput APIs, webhook deduplication | Financial writes needing atomic key + transaction record |
| Expiry mechanism | TTL per key (Redis EXPIRE) | Scheduled cleanup job or partitioned table |
Best Practices
- Derive keys from stable input data, for example sha256(user_id + action + amount + date). A key derived from the input is identical across every retry attempt for the same logical operation.
- Use SET key PENDING NX PX 86400000 for atomic check-and-set. If the command returns null, the key already exists; return the cached result or 409 without executing.
- Track PENDING, COMPLETED, and FAILED states explicitly. A FAILED state means the execution ran but produced an error; clients can retry based on the error type. A PENDING state means execution is still running; return 409.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Financial transactions or state-mutating external API calls are involved | Operations are inherently idempotent by nature (e.g., PUT /resource with full state) |
| Message queues deliver at-least-once (Kafka, SQS, RabbitMQ) | Read-only operations — GET requests have no side effects to deduplicate |
| Long-running operations may be retried on timeout | High-frequency internal operations where 1–5ms dedup overhead exceeds acceptable latency budget |
| Multiple clients may concurrently retry the same logical operation | The cost of idempotency infra exceeds the business risk of rare duplicate operations |