Webhook Architecture
Decouple webhook ingestion from delivery using a persistent message queue and per-subscriber isolation — one slow subscriber must never delay deliveries to thousands of healthy ones.
- One slow or failing subscriber endpoint must not delay deliveries to all other subscribers — per-subscriber queue isolation is the architectural constraint that makes webhook systems reliable at scale.
- Sign every payload with HMAC-SHA256 and include a timestamp in the signature header. Subscribers should reject any delivery whose timestamp is more than 300 seconds old; this bounds the replay window to five minutes without a separate nonce store.
- Retry on a schedule with exponential backoff rather than immediately. Stripe, for example, retries failed live-mode webhooks for up to three days. Each retry is a new task enqueued with a scheduled delay, not a sleep in a worker thread.
- Disable subscriptions automatically after sustained failure (e.g., 48 hours of 5xx responses). An endpoint that has been down for 2 days is unlikely to recover on its own, and retrying it wastes worker capacity.
- At-least-once delivery with subscriber-side idempotency is the industry standard. Exactly-once webhook delivery requires strict serialization and causes head-of-line blocking — no major provider ships it.
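The signing rule in the list above can be sketched in a few lines. This is a minimal illustration, not any provider's implementation: the function names `sign` and `verify`, the `now` parameter, and the choice to also reject timestamps too far in the future are assumptions made for the example. The signed string follows the `timestamp + '.' + payload` layout.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # replay window taken from the text above


def sign(secret: bytes, payload: bytes, timestamp: int) -> str:
    """Return the hex HMAC-SHA256 of '<timestamp>.<payload>'."""
    signed = str(timestamp).encode() + b"." + payload
    return hmac.new(secret, signed, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, timestamp: int, signature: str,
           now: Optional[float] = None) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    now = time.time() if now is None else now
    # Assumption: timestamps too far in the future are rejected as well.
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False
    expected = sign(secret, payload, timestamp)
    return hmac.compare_digest(expected, signature)
```

Because the timestamp is part of the signed string, an attacker cannot refresh a captured request by editing the timestamp alone; the signature would no longer match.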
The Problem
Inbound API calls are predictable, but outbound webhook delivery is chaotic. When your system must notify thousands of external subscriber endpoints about state changes, slow or unresponsive subscribers exhaust your delivery worker pool. If you attempt to deliver webhooks synchronously within the event's lifecycle, a single slow third-party server blocks your entire delivery pipeline. Without retry infrastructure, transient network failures silently drop events. Without per-subscriber isolation, one customer's broken webhook endpoint delays every other customer's deliveries — the failure radius of a single bad actor expands to the entire platform.
Core System Idea
A webhook delivery system decouples event production from HTTP delivery through three layers. (1) Event capture: the producing service writes a lightweight event payload to a durable store (a database outbox or an internal Kafka topic) and returns. This guarantees the event is not lost even if the delivery pipeline is down. (2) Dispatcher: a webhook dispatcher reads events from the store, resolves subscriber configurations (URL, signing secret, active status, retry policy), and enqueues delivery tasks into per-subscriber partitioned queues. Per-subscriber partitioning ensures one subscriber's delivery failures cannot delay another's. (3) Delivery workers: workers pull from subscriber queues, sign the payload with HMAC-SHA256, execute the HTTP POST with a hard timeout (5–10 seconds), and on non-2xx responses or timeouts re-enqueue the task with exponential backoff. Deliveries that exhaust all retries are routed to a dead-letter queue (DLQ). Subscriptions with sustained failure rates are automatically deactivated.
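The dispatcher and worker layers can be modelled in memory to show how per-subscriber queues contain failure. This is a simplified sketch under stated assumptions: `Dispatcher`, `DeliveryTask`, `MAX_ATTEMPTS`, and the injected `send` callable are hypothetical names, the queues are plain deques rather than a durable broker, and retries are re-enqueued immediately instead of being scheduled with a backoff delay.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List

MAX_ATTEMPTS = 5  # assumed retry budget, not taken from any provider


@dataclass
class DeliveryTask:
    subscriber_id: str
    event_id: str
    payload: bytes
    attempt: int = 0


class Dispatcher:
    """Routes tasks into one queue per subscriber so failures stay isolated."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[DeliveryTask]] = defaultdict(deque)
        self.dead_letters: List[DeliveryTask] = []

    def enqueue(self, task: DeliveryTask) -> None:
        self.queues[task.subscriber_id].append(task)

    def deliver_next(self, subscriber_id: str,
                     send: Callable[[DeliveryTask], int]) -> None:
        """Attempt the oldest task for one subscriber.

        `send` stands in for the signed HTTP POST and returns a status code.
        """
        queue = self.queues[subscriber_id]
        if not queue:
            return
        task = queue.popleft()
        status = send(task)
        if 200 <= status < 300:
            return  # delivered
        task.attempt += 1
        if task.attempt >= MAX_ATTEMPTS:
            self.dead_letters.append(task)  # retries exhausted: route to DLQ
        else:
            queue.append(task)  # a real system would schedule this with a delay
```

A failing subscriber only ever touches its own queue, so draining the healthy subscriber's queue never waits on it. A production `send` would also need to map timeouts and connection errors to a failure result.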
System Flow
Per-subscriber queues isolate failure; delivery workers sign and POST; non-2xx responses trigger backoff retry; exhausted tasks route to DLQ.
Real-World Examples (Indicative)
Stripe retries failed live-mode webhook deliveries for up to three days with exponential backoff. Secondary sources disagree on the exact schedule: some cite 1hr → 3hr → 12hr → 24hr → 48hr (5 attempts), others up to 20 attempts over 72 hours, so treat specific counts as indicative. The Stripe-Signature header contains t=<unix_timestamp>,v1=<hmac_sha256>. The receiver recomputes the signature and rejects requests whose timestamp is more than 300 seconds from the current time (the default tolerance in Stripe's libraries), which limits replay attacks. Each delivery attempt is logged: request headers, request body, response headers, response status code, and the first 500 bytes of the response body. Operators see the full delivery trace in the Stripe Dashboard without touching server logs.
GitHub webhooks fire on repository events (push, pull request, release) to potentially millions of subscriber URLs across 100M+ repositories. GitHub provides a "Recent Deliveries" page per webhook showing each attempt's request body, response headers, and HTTP status. Isolating delivery queues per repository prevents a high-traffic open-source repository's webhook volume from delaying deliveries for other repositories. GitHub expects a response within 10 seconds and does not automatically redeliver failed deliveries: a failed delivery is marked failed and logged, and it can be redelivered manually from the Recent Deliveries page or through the REST API.
Twilio fires StatusCallback webhooks on message state changes (delivered, failed, undelivered). Twilio enforces a hard 15-second response timeout — if the subscriber doesn't respond within 15 seconds, Twilio retries once immediately, then falls back to the fallbackUrl if configured. Twilio tracks per-endpoint reliability: endpoints that fail more than a threshold of deliveries receive reduced delivery priority and operators are notified via console alerts. This feedback loop prevents Twilio's delivery workers from spending disproportionate resources on endpoints that have been broken for days.
Anti-Patterns
Executing HTTP POST to subscriber URLs directly inside the event handler. One slow subscriber stalls the handler thread, exhausting the thread pool and cascading to unrelated operations.
A single global delivery queue means one subscriber's sustained failure causes their retry tasks to back up the entire queue, delaying all other deliveries. Partition by subscriber ID.
Retrying all failed deliveries at exactly 1s, 2s, 4s, 8s creates a thundering herd when a subscriber recovers, because thousands of tasks that failed together fire again at the same instants. Add randomized jitter so retries spread out: random(0, base_delay * 2^attempt).
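The jitter formula above is small enough to show directly. A minimal sketch, assuming a 1-second base delay and a 1-hour cap (the cap and the names `BASE_DELAY`, `MAX_DELAY`, and `backoff_delay` are additions for illustration, not part of the formula in the text):

```python
import random
from typing import Optional

BASE_DELAY = 1.0    # seconds; assumed
MAX_DELAY = 3600.0  # assumed cap so late attempts do not grow without bound


def backoff_delay(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Return a delay drawn uniformly from [0, base_delay * 2^attempt], capped."""
    source = rng if rng is not None else random
    ceiling = min(MAX_DELAY, BASE_DELAY * 2 ** attempt)
    return source.uniform(0, ceiling)
```

Drawing from the whole interval, rather than adding a small random offset to a fixed delay, is what spreads a burst of simultaneous failures across the retry window.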
Delivering webhook payloads without HMAC signatures allows any actor to spoof events to subscriber servers. Subscribers cannot distinguish a legitimate delivery from a forged request.
Continuing to enqueue delivery tasks for a subscriber that has returned 5xx for 72 hours wastes worker capacity and storage. Auto-disable subscriptions after sustained failure; require subscriber operators to re-enable after fixing their endpoint.
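Auto-disabling needs only the start time of the current failure streak. The sketch below is one possible shape, with assumptions labeled: `SubscriptionHealth`, `record`, and `re_enable` are hypothetical names, the 48-hour threshold is an example value, and time is passed in explicitly to keep the logic testable.

```python
from dataclasses import dataclass
from typing import Optional

DISABLE_AFTER_SECONDS = 48 * 3600  # assumed threshold; tune per platform


@dataclass
class SubscriptionHealth:
    active: bool = True
    failing_since: Optional[float] = None  # start of the current failure streak

    def record(self, status_code: int, now: float) -> None:
        """Update the streak after each delivery attempt."""
        if 200 <= status_code < 300:
            self.failing_since = None  # any success resets the streak
            return
        if self.failing_since is None:
            self.failing_since = now
        elif now - self.failing_since >= DISABLE_AFTER_SECONDS:
            self.active = False  # dispatcher should stop enqueuing for it

    def re_enable(self) -> None:
        """Called when the subscriber's operator confirms the endpoint is fixed."""
        self.active = True
        self.failing_since = None
```

Tracking a streak rather than a raw failure count means a single success resets the clock, so a flaky but mostly working endpoint is not disabled.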
Design Tradeoffs
| Dimension | At-Least-Once Delivery | Exactly-Once Delivery |
|---|---|---|
| Throughput | High — parallel workers per subscriber | Very low — strict serialization required |
| Head-of-line blocking | No — failed delivery retried independently | Yes — one stuck delivery blocks subsequent events |
| Subscriber complexity | Must implement dedup on evt_id | No dedup needed |
| Industry standard | Yes — Stripe, GitHub, Twilio, PagerDuty | No — impractical at scale |
| Dedup window needed | Yes (24h event ID cache or idempotency key) | N/A |
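On the subscriber side, the dedup requirement in the table amounts to remembering event IDs for the length of the dedup window. A minimal in-memory sketch, assuming the payload carries the ID in an `evt_id` field and that a single process handles deliveries; `EventDeduplicator` and `handle` are hypothetical names, and a real deployment would more likely use a shared store with key expiry.

```python
import time
from typing import Callable, Dict

DEDUP_WINDOW_SECONDS = 24 * 3600  # the 24h window from the table


class EventDeduplicator:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._seen: Dict[str, float] = {}
        self._clock = clock

    def seen(self, event_id: str) -> bool:
        """Return True if the ID was recorded within the dedup window."""
        recorded = self._seen.get(event_id)
        return recorded is not None and self._clock() - recorded < DEDUP_WINDOW_SECONDS

    def mark(self, event_id: str) -> None:
        now = self._clock()
        # Evict expired entries so the cache stays bounded by the window.
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < DEDUP_WINDOW_SECONDS}
        self._seen[event_id] = now


def handle(event: dict, dedup: EventDeduplicator,
           process: Callable[[dict], None]) -> int:
    """Process an event at most once per window; always acknowledge with 200."""
    if dedup.seen(event["evt_id"]):
        return 200  # retry of an event already handled: no-op
    process(event)
    dedup.mark(event["evt_id"])  # mark only after processing succeeded
    return 200
```

Marking the ID after processing means a crash mid-processing leaves the event eligible for the provider's next retry, at the cost of possible double processing under concurrent deliveries.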
Best Practices
- Assign a unique ID (e.g., evt_abc123) to every event and include it in the payload. Subscribers use this ID as their idempotency key: receiving the same event ID twice means the second delivery is a retry and should be a no-op.
- Compute the signature as HMAC-SHA256(secret, timestamp + '.' + payload). Include the timestamp in the signature header. Reject deliveries where the timestamp delta exceeds 300 seconds.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| External third-party systems need real-time notification of state changes | Synchronous bidirectional communication is required (caller must wait for the response) |
| Subscriber availability and response time are outside your control | Strong cross-service transactional consistency is required — webhooks are fire-and-forget |
| Fan-out to multiple independent subscribers on a single event | Data payload is so sensitive it cannot traverse the public internet even with TLS |
| Event replay or delivery history is required for debugging | Simple point-to-point integration where a polling API would suffice |