Webhook Architecture

Decouple webhook ingestion from delivery using a persistent message queue and per-subscriber isolation — one slow subscriber must never delay deliveries to thousands of healthy ones.

TL;DR
  • One slow or failing subscriber endpoint must not delay deliveries to all other subscribers — per-subscriber queue isolation is the architectural constraint that makes webhook systems reliable at scale.
  • Sign every payload with HMAC-SHA256 and include a timestamp in the signature header. Subscribers should reject any delivery whose timestamp is more than 300 seconds old; this bounds the replay window to five minutes without a separate nonce store.
  • Stripe retries failed live-mode webhook deliveries for up to 3 days with exponential backoff. Each retry should be a new task enqueued with a scheduled delay, not a sleep in a worker thread.
  • Disable subscriptions automatically after sustained failure (e.g., 48 hours of 5xx responses). An endpoint that has been down for 2 days is unlikely to recover without operator intervention, and retrying it wastes worker capacity.
  • At-least-once delivery with subscriber-side idempotency is the industry standard. Exactly-once webhook delivery requires strict serialization and causes head-of-line blocking — no major provider ships it.

The Problem

Inbound API calls are predictable, but outbound webhook delivery is chaotic. When your system must notify thousands of external subscriber endpoints about state changes, slow or unresponsive subscribers exhaust your delivery worker pool. If you attempt to deliver webhooks synchronously within the event's lifecycle, a single slow third-party server blocks your entire delivery pipeline. Without retry infrastructure, transient network failures silently drop events. Without per-subscriber isolation, one customer's broken webhook endpoint delays every other customer's deliveries — the failure radius of a single bad actor expands to the entire platform.

Core System Idea

A webhook delivery system decouples event production from HTTP delivery through three layers:

(1) Event capture: the producing service writes a lightweight event payload to a durable store (database outbox or internal Kafka topic) and returns. This guarantees the event is not lost even if the delivery pipeline is down.

(2) Dispatcher: a webhook dispatcher reads events from the store, resolves subscriber configurations (URL, signing secret, active status, retry policy), and enqueues delivery tasks into per-subscriber partitioned queues. Per-subscriber partitioning ensures one subscriber's delivery failures cannot delay another's.

(3) Delivery workers: workers pull from subscriber queues, sign the payload with HMAC-SHA256, execute the HTTP POST with a hard timeout (5–10 seconds), and on non-2xx responses or timeouts, re-enqueue with exponential backoff.

Deliveries that exhaust all retries route to a dead letter queue (DLQ). Subscriptions with sustained failure rates are automatically deactivated.
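The event capture layer can be sketched as a transactional outbox. This is a minimal illustration, assuming SQLite stands in for the durable store; the table names `orders` and `webhook_outbox` and the functions `record_order_paid` and `fetch_undispatched` are hypothetical and not taken from any particular system.

```python
import json
import sqlite3
import uuid


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one business table plus the outbox table.
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE webhook_outbox ("
        " event_id TEXT PRIMARY KEY,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " dispatched INTEGER NOT NULL DEFAULT 0)"
    )


def record_order_paid(conn: sqlite3.Connection, order_id: str) -> str:
    """Change state and capture the event in one transaction, then return."""
    event_id = f"evt_{uuid.uuid4().hex}"
    payload = json.dumps({"id": event_id, "type": "order.paid", "order_id": order_id})
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO webhook_outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "order.paid", payload),
        )
    return event_id  # no HTTP call happens on this path


def fetch_undispatched(conn: sqlite3.Connection, limit: int = 100) -> list:
    """What a dispatcher might poll: events not yet fanned out to subscribers."""
    cursor = conn.execute(
        "SELECT event_id, payload FROM webhook_outbox WHERE dispatched = 0 LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

Because the state change and the outbox row commit together, the event exists if and only if the state change does, regardless of whether the dispatcher is running at that moment.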

System Flow

flowchart TD
    A["Event Producer"] --> B["Event Store / Outbox"]
    B --> C["Webhook Dispatcher"]
    C --> D["Per-Subscriber Queue"]
    D --> E["Delivery Worker"]
    E --> F["HMAC Signature Engine"]
    E --> G["Subscriber Endpoint"]
    G -- "Non-2xx / Timeout" --> H["Retry Manager"]
    H -- "Retry with backoff" --> D
    H -- "Exhausted" --> I["Dead Letter Queue"]

Per-subscriber queues isolate failure; delivery workers sign and POST; non-2xx responses trigger backoff retry; exhausted tasks route to DLQ.
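The retry branch of this flow might look like the sketch below, where a failed attempt becomes a new task with a due time rather than a sleeping worker. The retry budget, the base delay, and the names `DeliveryTask` and `RetryManager` are assumptions for illustration; jitter is omitted to keep the sketch short.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional

MAX_ATTEMPTS = 5         # assumed retry budget
BASE_DELAY_SECONDS = 60  # assumed delay before the first retry


@dataclass(order=True)
class DeliveryTask:
    due_at: float  # the only field used for heap ordering
    event_id: str = field(compare=False)
    subscriber_id: str = field(compare=False)
    attempt: int = field(default=0, compare=False)


class RetryManager:
    def __init__(self) -> None:
        self.scheduled: List[DeliveryTask] = []  # min-heap ordered by due_at
        self.dead_letters: List[DeliveryTask] = []

    def on_result(self, task: DeliveryTask, status: Optional[int], now: float) -> str:
        """status is the HTTP status code, or None when the request timed out."""
        if status is not None and 200 <= status < 300:
            return "delivered"
        next_attempt = task.attempt + 1
        if next_attempt >= MAX_ATTEMPTS:
            self.dead_letters.append(task)
            return "dead_lettered"
        delay = BASE_DELAY_SECONDS * 2 ** task.attempt  # exponential backoff
        retry = DeliveryTask(now + delay, task.event_id, task.subscriber_id, next_attempt)
        heapq.heappush(self.scheduled, retry)
        return "retry_scheduled"

    def pop_due(self, now: float) -> List[DeliveryTask]:
        """Return every scheduled task whose due time has passed."""
        due = []
        while self.scheduled and self.scheduled[0].due_at <= now:
            due.append(heapq.heappop(self.scheduled))
        return due
```

A scheduler loop would call `pop_due` periodically and push the returned tasks back onto the owning subscriber's queue, so no worker slot is held while a retry waits.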

Real-World Examples (Indicative)

Stripe's webhook delivery

Stripe documents that, in live mode, it retries failed webhook deliveries for up to 3 days with exponential backoff. The Stripe-Signature header contains t=<unix_timestamp>,v1=<hmac_sha256>. Stripe's official libraries apply a default tolerance of 300 seconds between that timestamp and the current time, and reject older requests to limit replay attacks. Each delivery attempt is logged: request headers, request body, response headers, response status code, and the first 500 bytes of the response body. Operators see the full delivery trace in the Stripe Dashboard without touching server logs.
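On the subscriber side, checking a header of the form `t=<unix_timestamp>,v1=<hmac_sha256>` can be sketched as follows. This is an illustrative verifier rather than Stripe's library code; it assumes the signed string is the timestamp, a dot, and the raw request body, and the function name `verify_signature` is hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # maximum accepted age of the signed timestamp


def verify_signature(header: str, payload: bytes, secret: str,
                     now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    timestamp = None
    candidates = []
    # Parse comma-separated key=value pairs; several v1 entries may be present.
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    if abs(now - int(timestamp)) > TOLERANCE_SECONDS:
        return False  # too old (or too far in the future): possible replay
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

The payload must be the raw request bytes; re-serializing parsed JSON before verifying usually changes the bytes and makes valid signatures fail.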

GitHub's webhook infrastructure

GitHub webhooks fire on repository events (push, pull request, release) to potentially millions of subscriber URLs across 100M+ repositories. GitHub provides a "Recent Deliveries" page per webhook showing each attempt's request body, response headers, and HTTP status. Per-repository delivery queues prevent a high-traffic open-source repository's webhook volume from delaying deliveries for other repositories. GitHub does not automatically redeliver failed deliveries: a failed delivery is marked failed and logged, and it can be redelivered manually from the Recent Deliveries page or through the REST API.

Twilio StatusCallback delivery

Twilio fires StatusCallback webhooks on message state changes (delivered, failed, undelivered). Twilio enforces a hard 15-second response timeout — if the subscriber doesn't respond within 15 seconds, Twilio retries once immediately, then falls back to the fallbackUrl if configured. Twilio tracks per-endpoint reliability: endpoints that fail more than a threshold of deliveries receive reduced delivery priority and operators are notified via console alerts. This feedback loop prevents Twilio's delivery workers from spending disproportionate resources on endpoints that have been broken for days.

Anti-Patterns

Synchronous delivery in the request thread

Executing HTTP POST to subscriber URLs directly inside the event handler. One slow subscriber stalls the handler thread, exhausting the thread pool and cascading to unrelated operations.

Shared queue for all subscribers

A single global delivery queue means one subscriber's sustained failure causes their retry tasks to back up the entire queue, delaying all other deliveries. Partition by subscriber ID.
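One way to picture partitioning by subscriber ID is a set of FIFO queues served round-robin. The sketch below is an in-memory illustration only; the class name `SubscriberQueues` is hypothetical, and a production system would more likely use broker partitions or per-subscriber keys in a durable queue.

```python
from collections import OrderedDict, deque
from typing import Optional, Tuple


class SubscriberQueues:
    """One FIFO per subscriber; workers take turns across subscribers."""

    def __init__(self) -> None:
        self._queues = OrderedDict()  # subscriber_id -> deque of event IDs

    def enqueue(self, subscriber_id: str, event_id: str) -> None:
        self._queues.setdefault(subscriber_id, deque()).append(event_id)

    def next_task(self) -> Optional[Tuple[str, str]]:
        """Pop one task from the subscriber at the front, then rotate it to the back."""
        if not self._queues:
            return None
        subscriber_id, queue = next(iter(self._queues.items()))
        event_id = queue.popleft()
        if queue:
            self._queues.move_to_end(subscriber_id)  # others get served first
        else:
            del self._queues[subscriber_id]
        return subscriber_id, event_id
```

With this rotation, a subscriber holding thousands of pending tasks gets one slot per round, so a subscriber with a single pending event waits for at most one task from each other subscriber.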

Exponential backoff without jitter

Retrying all failed deliveries at 1s, 2s, 4s, 8s precisely creates a thundering herd when a subscriber recovers — thousands of tasks fire simultaneously. Add randomized jitter: random(0, base_delay * 2^attempt).
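The jitter formula above can be written as a small helper. The `max_delay` cap and the function name `backoff_with_jitter` are additions for illustration and are not part of the formula in the text.

```python
import random


def backoff_with_jitter(attempt: int, base_delay: float = 1.0,
                        max_delay: float = 3600.0, rng=random) -> float:
    """Return a delay drawn uniformly from [0, base_delay * 2**attempt], capped."""
    ceiling = min(max_delay, base_delay * 2 ** attempt)  # assumed cap on growth
    return rng.uniform(0.0, ceiling)
```

Because each task draws its own delay, retries that failed at the same instant spread out across the whole interval instead of firing together.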

Unsigned payloads

Delivering webhook payloads without HMAC signatures allows any actor to spoof events to subscriber servers. Subscribers cannot distinguish a legitimate delivery from a forged request.
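On the sending side, signing might look like the sketch below. It assumes the signed string is the timestamp, a dot, and the payload bytes; the header name `X-Webhook-Signature` and the function `sign_payload` are hypothetical choices for illustration.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def sign_payload(secret: bytes, payload: bytes,
                 timestamp: Optional[int] = None) -> Dict[str, str]:
    """Build request headers carrying a timestamped HMAC-SHA256 signature."""
    ts = int(time.time()) if timestamp is None else timestamp
    signed = str(ts).encode() + b"." + payload
    digest = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return {
        "Content-Type": "application/json",
        "X-Webhook-Signature": f"t={ts},v1={digest}",  # hypothetical header name
    }
```

The secret is per subscriber, so a leaked secret for one endpoint does not let an attacker forge deliveries to any other.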

No automatic subscription disabling

Continuing to enqueue delivery tasks for a subscriber that has returned 5xx for 72 hours wastes worker capacity and storage. Auto-disable subscriptions after sustained failure; require subscriber operators to re-enable after fixing their endpoint.
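A sustained-failure check can be as small as tracking when the current failure streak began. The sketch below assumes a 48-hour threshold and a hypothetical `SubscriptionHealth` record stored alongside each subscription.

```python
from dataclasses import dataclass
from typing import Optional

DISABLE_AFTER_SECONDS = 48 * 3600  # assumed threshold for sustained failure


@dataclass
class SubscriptionHealth:
    active: bool = True
    failing_since: Optional[float] = None  # start of the current failure streak

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failing_since = None  # any success resets the streak
            return
        if self.failing_since is None:
            self.failing_since = now
        elif now - self.failing_since >= DISABLE_AFTER_SECONDS:
            self.active = False  # the dispatcher should stop enqueueing

    def reenable(self) -> None:
        """Explicit operator action after the endpoint has been fixed."""
        self.active = True
        self.failing_since = None
```

The dispatcher would consult `active` before enqueueing, and the transition to inactive is a natural point to trigger the operator notification.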

Design Tradeoffs

| Dimension | At-Least-Once Delivery | Exactly-Once Delivery |
| --- | --- | --- |
| Throughput | High: parallel workers per subscriber | Very low: strict serialization required |
| Head-of-line blocking | No: failed delivery retried independently | Yes: one stuck delivery blocks subsequent events |
| Subscriber complexity | Must implement dedup on evt_id | No dedup needed |
| Industry standard | Yes: Stripe, GitHub, Twilio, PagerDuty | No: impractical at scale |
| Dedup window needed | Yes (24h event ID cache or idempotency key) | N/A |
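The subscriber-side dedup in the at-least-once column can be sketched as an event ID cache with a 24-hour window. This in-memory version is illustrative only; the class name `SeenEvents` is hypothetical, and a real subscriber would typically rely on a shared store with expiry or a unique constraint in its database.

```python
import time
from typing import Dict, Optional

DEDUP_WINDOW_SECONDS = 24 * 3600


class SeenEvents:
    def __init__(self) -> None:
        self._seen: Dict[str, float] = {}  # event_id -> first time seen

    def first_time(self, event_id: str, now: Optional[float] = None) -> bool:
        """Return True if the event should be processed, False if it is a repeat."""
        now = time.time() if now is None else now
        # Drop entries older than the dedup window (linear scan, fine for a sketch).
        expired = [e for e, t in self._seen.items() if now - t > DEDUP_WINDOW_SECONDS]
        for e in expired:
            del self._seen[e]
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True
```

A handler would call `first_time(event["id"])` before applying side effects and return a 2xx response either way, so the sender stops retrying a delivery that was already processed.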

Best Practices

  • Assign a globally unique event ID (e.g., evt_abc123) to every event and include it in the payload. Subscribers use this ID as their idempotency key: receiving the same event ID twice means the second delivery is a retry and should be a no-op.
  • Sign payloads with HMAC-SHA256 using a per-subscriber secret: HMAC-SHA256(secret, timestamp + '.' + payload). Include the timestamp in the signature header. Subscribers reject deliveries where the timestamp delta exceeds 300 seconds.
  • Enforce a hard 10-second timeout on outbound HTTP requests. Workers that wait indefinitely for a subscriber response hold their concurrency slot open and reduce throughput for all other subscribers.
  • Apply per-subscriber rate limiting on outbound delivery. During an event spike (e.g., 100,000 events from a single source), rate-limit delivery to 100/second per subscriber to protect their infrastructure.
  • Auto-disable subscriptions after 48 consecutive hours of failure. Notify the subscriber operator via email or dashboard. Require explicit re-enablement, so that a subscription never resumes silently against an endpoint nobody has confirmed is fixed.
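The per-subscriber rate limit can be sketched with a token bucket, one instance per subscriber. The class name `TokenBucket` and the burst capacity equal to the rate are assumptions for illustration; callers would pass a monotonic clock reading such as `time.monotonic()`.

```python
class TokenBucket:
    """Allow up to `rate` deliveries per second, with bursts up to `capacity`."""

    def __init__(self, rate: float = 100.0, capacity: float = 100.0,
                 now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so the first burst is allowed
        self.updated_at = now

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should leave the task queued and try later
```

When `try_acquire` returns False the worker leaves the task in that subscriber's queue and moves on, so throttling one subscriber does not stall deliveries to the rest.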

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| External third-party systems need real-time notification of state changes | Synchronous bidirectional communication is required (caller must wait for the response) |
| Subscriber availability and response time are outside your control | Strong cross-service transactional consistency is required; webhooks are fire-and-forget |
| Fan-out to multiple independent subscribers on a single event | Data payload is so sensitive it cannot traverse the public internet even with TLS |
| Event replay or delivery history is required for debugging | Simple point-to-point integration where a polling API would suffice |