Message Queue Design

TL;DR
  • Decouple producers and consumers using an asynchronous, lease-based message broker to handle traffic spikes.
  • Prevent message loss with at-least-once delivery guarantees backed by explicit consumer acknowledgments.
  • Protect downstream systems from poison pills by routing repeatedly failing messages to a Dead Letter Queue (DLQ).
  • Monitor queue backlog growth as a primary indicator that consumers are stalled, falling behind, or out of capacity.

The Problem

When synchronous HTTP calls couple microservices, a slowdown in one downstream service cascades upstream, exhausting connection pools and causing system-wide outages. Without an asynchronous buffer, sudden traffic spikes overwhelm downstream databases, leading to dropped requests and data loss. Additionally, transient network failures or consumer crashes during processing result in incomplete operations with no built-in mechanism for retrying failed tasks safely.

Core System Idea

The core system is an asynchronous message broker that serves competing consumers under lease-based visibility timeouts. Producers append messages to a durable queue. Instead of deleting a message as soon as it is delivered, the broker hides it from other consumers for the duration of a visibility timeout (the lease).

The consumer must process the message and send an explicit acknowledgment (ACK) back to the broker within this timeout window to permanently delete the message. If the consumer crashes or exceeds the timeout, the lease expires, and the message becomes visible again for other consumers to process. This mechanism guarantees at-least-once delivery: a consumer crash does not lose the message, but the same message may be delivered more than once, so consumers must tolerate duplicates.
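The lease mechanics described above can be sketched in a few lines. This is a minimal in-memory illustration, not any particular broker's implementation: the `LeaseQueue` class and its method names are hypothetical, time is passed in explicitly so the behavior is easy to follow, and, as a simplification, an ACK is accepted even after the lease has expired.

```python
import itertools
from dataclasses import dataclass


@dataclass
class _Entry:
    body: str
    visible_at: float = 0.0  # time at which the message may be leased again


class LeaseQueue:
    """In-memory sketch of a queue with visibility timeouts (hypothetical API)."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._entries = {}  # msg_id -> _Entry, in insertion (FIFO) order
        self._ids = itertools.count(1)

    def send(self, body: str) -> int:
        msg_id = next(self._ids)
        self._entries[msg_id] = _Entry(body)
        return msg_id

    def receive(self, now: float):
        """Lease the oldest visible message, or return None if none is visible."""
        for msg_id, entry in self._entries.items():
            if entry.visible_at <= now:
                # Hide the message instead of deleting it.
                entry.visible_at = now + self.visibility_timeout
                return msg_id, entry.body
        return None

    def ack(self, msg_id: int) -> bool:
        """Delete the message permanently; False if it was already deleted."""
        return self._entries.pop(msg_id, None) is not None
```

With `visibility_timeout=30`, a message received at `now=0` is invisible to a second `receive(now=10)`, and is handed out again by `receive(now=30)` if no ACK arrived in between.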

To prevent broken or unprocessable payloads (poison pills) from cycling indefinitely, the broker tracks the delivery count of each message. If a message's delivery count exceeds a configured threshold, the broker routes it to a Dead Letter Queue (DLQ) for isolated debugging.
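As a rough sketch of the delivery-count rule, the following tracker lets a message through until its count exceeds the threshold and then diverts it to a dead-letter list. The `DeliveryTracker` name and its interface are assumptions made for illustration; a real broker keeps this count alongside the message itself.

```python
from collections import Counter, deque


class DeliveryTracker:
    """Counts deliveries per message and diverts poison pills (hypothetical API)."""

    def __init__(self, max_receive_count: int):
        self.max_receive_count = max_receive_count
        self.receive_counts = Counter()
        self.dead_letters = deque()

    def on_delivery(self, msg_id: str, body: str) -> bool:
        """Return True if the message may be delivered, False if dead-lettered."""
        self.receive_counts[msg_id] += 1
        if self.receive_counts[msg_id] > self.max_receive_count:
            # Threshold exceeded: park the message for isolated debugging.
            self.dead_letters.append((msg_id, body))
            del self.receive_counts[msg_id]
            return False
        return True
```

With `max_receive_count=3`, the first three delivery attempts of a message succeed and the fourth moves it to `dead_letters`.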

System Flow

flowchart TD
    A[Producer] -- "Publish Message" --> B[Message Queue]
    B -- "Lease Message" --> C[Consumer Group]
    C -- "Process Success: ACK" --> B
    C -- "Process Fail: NACK" --> B
    B -- "Max Retries Exceeded" --> D[Dead Letter Queue]
    D -- "Manual Triage" --> E["Operator/Fix Script"]

Message lifecycle from ingestion through lease management, consumer acknowledgment, and dead-letter routing.
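One way to read the flow is as a small state machine for a single message. The sketch below is an interpretation of the diagram rather than a specification: the state names, the event strings, and the choice to dead-letter once `deliveries` reaches `max_deliveries` are all assumptions.

```python
from enum import Enum


class State(Enum):
    VISIBLE = "visible"
    LEASED = "leased"
    DELETED = "deleted"
    DEAD_LETTERED = "dead_lettered"


def next_state(state: State, event: str, deliveries: int, max_deliveries: int) -> State:
    """Return the state after `event`; `deliveries` counts leases granted so far."""
    if state is State.VISIBLE and event == "lease":
        return State.LEASED
    if state is State.LEASED and event == "ack":
        return State.DELETED
    if state is State.LEASED and event in ("nack", "lease_expired"):
        # Failed or abandoned: retry unless the retry budget is spent.
        if deliveries >= max_deliveries:
            return State.DEAD_LETTERED
        return State.VISIBLE
    raise ValueError(f"event {event!r} is not valid in state {state.name}")
```

Treating NACK and lease expiry as the same transition keeps the model small; a broker may apply different redelivery delays to the two cases.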

Real-World Examples (Indicative)

Amazon SQS at DoorDash

DoorDash uses SQS for order confirmation notifications. Each order publishes a message; Lambda polls in batches of 10 and calls DeleteMessage on success. Visibility timeout is set to 3 minutes (6× the Lambda function's configured 30-second timeout) so that a message is not redelivered while a healthy invocation is still working on it. Messages that fail after maxReceiveCount=3 are routed to a DLQ; DLQ depth exceeding 100 triggers a PagerDuty alert. This pattern handles 10M+ deliveries/day with zero duplicate notifications.
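The numbers in this example map onto two SQS queue attributes, `VisibilityTimeout` and `RedrivePolicy`. The sketch below only builds the attribute dictionary; the DLQ ARN is a made-up placeholder, and passing the result as the `Attributes` argument of boto3's `create_queue` or `set_queue_attributes` is left out so the snippet runs without AWS credentials.

```python
import json

FUNCTION_TIMEOUT_SECONDS = 30  # function timeout from the example
TIMEOUT_MULTIPLIER = 6         # multiplier from the example
MAX_RECEIVE_COUNT = 3          # maxReceiveCount from the example

# Hypothetical ARN, used only for illustration.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:order-notifications-dlq"

visibility_timeout = FUNCTION_TIMEOUT_SECONDS * TIMEOUT_MULTIPLIER  # 180 seconds

# SQS takes attribute values as strings; RedrivePolicy is itself a JSON document.
queue_attributes = {
    "VisibilityTimeout": str(visibility_timeout),
    "RedrivePolicy": json.dumps(
        {
            "deadLetterTargetArn": DLQ_ARN,
            "maxReceiveCount": str(MAX_RECEIVE_COUNT),
        }
    ),
}
```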

Shopify Sidekiq on Redis

Shopify processes 50M+ background jobs/day (image resizing, email delivery, inventory sync) through Sidekiq backed by Redis. Sidekiq implements the visibility timeout via a working sorted set—each job is claimed with score equal to its expiry timestamp. A heartbeat thread extends the lock; if a Sidekiq process dies, a cleanup sweep reschedules any job whose expiry has passed. Dead jobs (after exhausting retries with exponential backoff) are moved to a dead sorted set retained for 6 months for manual triage.
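The claim, heartbeat, and sweep cycle described here can be illustrated without Redis. The sketch below emulates a sorted set with a plain dictionary whose values play the role of scores; `WorkingSet` and its methods are hypothetical names and do not reflect Sidekiq's actual code. In Redis, the same idea would typically rely on `ZADD` to set the score and `ZRANGEBYSCORE` to find expired entries.

```python
class WorkingSet:
    """Emulates a sorted set of in-flight jobs scored by lease expiry."""

    def __init__(self, lease_seconds: float):
        self.lease_seconds = lease_seconds
        self.pending = []  # job IDs waiting to be claimed (FIFO)
        self.working = {}  # job_id -> expiry timestamp (the "score")

    def enqueue(self, job_id: str) -> None:
        self.pending.append(job_id)

    def claim(self, now: float):
        """Move the next pending job into the working set, or return None."""
        if not self.pending:
            return None
        job_id = self.pending.pop(0)
        self.working[job_id] = now + self.lease_seconds
        return job_id

    def heartbeat(self, job_id: str, now: float) -> None:
        """Extend the lease of a job that is still being worked on."""
        if job_id in self.working:
            self.working[job_id] = now + self.lease_seconds

    def complete(self, job_id: str) -> None:
        self.working.pop(job_id, None)

    def sweep(self, now: float) -> list:
        """Requeue every job whose expiry has passed; return their IDs."""
        expired = sorted(j for j, expiry in self.working.items() if expiry <= now)
        for job_id in expired:
            del self.working[job_id]
            self.pending.append(job_id)
        return expired
```

A job whose worker keeps calling `heartbeat` never appears in `sweep`; a job whose worker died stops being extended and is requeued on the first sweep after its expiry.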

Stripe Webhook Retry Ladder

Stripe's webhook system delivers payment events to merchant-configured URLs using SQS for durability. Failed deliveries are retried with exponential backoff (1 min, 5 min, 30 min, 2 hr, 8 hr, 24 hr) for up to 3 days in total. Jitter is added to each delay to prevent synchronized retry storms when a merchant's endpoint recovers. Permanently failed webhooks land in Stripe's DLQ, queryable via stripe events list --limit 100 for merchant debugging.
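A retry ladder of this shape is straightforward to express as a lookup plus jitter. The sketch below uses the delays listed above; the 10% jitter fraction and the `retry_delay` function are assumptions, since the text does not say how much jitter is applied.

```python
import random

# Delay ladder from the text, in seconds: 1 min, 5 min, 30 min, 2 hr, 8 hr, 24 hr.
RETRY_LADDER = [60, 300, 1800, 7200, 28800, 86400]
JITTER_FRACTION = 0.1  # assumed value, not taken from the text


def retry_delay(attempt: int, rng: random.Random):
    """Delay in seconds before retry number `attempt` (1-based), or None to give up."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > len(RETRY_LADDER):
        return None  # ladder exhausted: treat the delivery as permanently failed
    base = RETRY_LADDER[attempt - 1]
    # Spread each delay uniformly within +/- JITTER_FRACTION of its base value.
    return base * (1 + rng.uniform(-JITTER_FRACTION, JITTER_FRACTION))
```

Passing a seeded `random.Random` makes the schedule reproducible in tests, while production code would normally use an unseeded instance so that different endpoints do not retry in lockstep.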

Anti-Patterns

Using a Database as a Message Queue

Polling a relational database table (e.g., SELECT * FROM tasks WHERE status = 'pending' FOR UPDATE) causes severe lock contention and table bloat, and it limits horizontal scalability.

Infinite Retries Without Backoff

Retrying failed messages immediately and indefinitely creates a self-inflicted denial-of-service attack on downstream databases and external APIs.
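The usual alternative is a bounded retry loop with capped exponential backoff. The following is a minimal sketch under assumed defaults (five attempts, 0.5 s base delay, 30 s cap); the `call_with_backoff` helper is hypothetical, and `sleep` and `rng` are injectable only so the behavior can be tested without waiting.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random):
    """Run `operation`, retrying on exception with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Retry budget spent: surface the error so the caller can
                # dead-letter the message instead of looping forever.
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random delay in [0, ceiling] so that clients do not retry in sync.
            sleep(rng.uniform(0, ceiling))
```

Both limits matter: the cap bounds how long a single wait can grow, and `max_attempts` guarantees that a permanently failing operation eventually stops consuming downstream capacity.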

Setting Visibility Timeouts Too Short

If the visibility timeout is shorter than the actual processing time, multiple consumers will pull the same message concurrently, causing duplicate processing and write conflicts.
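A back-of-the-envelope calculation shows how quickly this goes wrong. The sketch assumes that an idle consumer leases the message the moment it becomes visible again, so leases start at 0, T, 2T, and so on until the first consumer finishes; `deliveries_before_ack` is a hypothetical helper.

```python
import math


def deliveries_before_ack(processing_seconds: float, visibility_timeout: float) -> int:
    """How many consumers receive one message before the first one finishes.

    Assumes an idle consumer is always ready to lease the message as soon as
    it becomes visible again.
    """
    if processing_seconds <= 0 or visibility_timeout <= 0:
        raise ValueError("both durations must be positive")
    return math.ceil(processing_seconds / visibility_timeout)
```

With a 30-second timeout, a 20-second job is delivered once, a 45-second job twice, and a 90-second job three times.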

Ignoring DLQ Growth

Treating a DLQ as a dumping ground without active alerting leads to silent data loss and undetected application bugs.

Design Tradeoffs

Dimension | At-Least-Once Delivery | Exactly-Once Delivery
Throughput | High; broker tracks only delivery state per consumer with no global consensus required | Low; requires distributed coordination (two-phase commit or idempotency key storage) on every delivery
Consumer complexity | Consumers must implement idempotency (UPSERT, dedup by message ID) to safely handle redeliveries | Consumers receive each message exactly once; no deduplication logic required in application code
Failure recovery | Simple; expired lease automatically re-queues messages without coordinator involvement | Complex; coordinator failures during commit leave messages in an ambiguous state requiring manual resolution
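Deduplication by message ID, the consumer-side cost of at-least-once delivery, can be as small as one constrained insert. The sketch below uses SQLite from the Python standard library for convenience; the `processed_messages` table and the `handle_once` function are hypothetical, and the `ON CONFLICT ... DO NOTHING` clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def handle_once(conn: sqlite3.Connection, message_id: str, body: str) -> bool:
    """Record the message; return True only the first time this ID is seen."""
    cursor = conn.execute(
        "INSERT INTO processed_messages (message_id, body) VALUES (?, ?) "
        "ON CONFLICT(message_id) DO NOTHING",
        (message_id, body),
    )
    conn.commit()
    # rowcount is 1 for a fresh insert and 0 when the ID already existed.
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY, body TEXT)"
)
```

A consumer would run its side effects only when `handle_once` returns True, ideally inside the same transaction as the insert so that the dedup record and the business write commit together.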

Best Practices

Ensure Consumer Idempotency

Design consumer database writes using unique transaction IDs or natural keys (e.g., UPSERT or INSERT ON CONFLICT DO NOTHING) to safely handle duplicate message deliveries.

Configure DLQ Alerts

Set up high-priority alerts on the depth of the Dead Letter Queue to detect code bugs or downstream API outages immediately.

Align Visibility Timeout with Max Processing Time

Set the visibility timeout to at least 2-3× the maximum expected processing time to account for network latency and GC pauses.

Implement Exponential Backoff with Jitter

When retrying failed operations, increase the delay exponentially and add random noise to prevent synchronized retry storms.

Monitor Queue Age

Track the age of the oldest unacknowledged message rather than just queue depth; a small queue with a very old message indicates a stuck consumer.
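The queue-age signal is simple to compute once enqueue timestamps are available. In this sketch, `enqueued_at` is an assumed mapping from message ID to enqueue time in seconds, and both function names are hypothetical; managed brokers usually expose an equivalent metric directly.

```python
def oldest_message_age(enqueued_at: dict, now: float) -> float:
    """Age in seconds of the oldest unacknowledged message (0.0 if none)."""
    if not enqueued_at:
        return 0.0
    return now - min(enqueued_at.values())


def is_consumer_stuck(enqueued_at: dict, now: float, max_age_seconds: float) -> bool:
    """Flag the queue when its oldest message has waited longer than allowed."""
    return oldest_message_age(enqueued_at, now) > max_age_seconds
```

A queue holding only two messages, one of them 900 seconds old, would pass a depth-based alert but trip this check with a 300-second limit.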

When to Use / Avoid

Use When
  • Decoupling heavy background tasks (e.g., image processing, PDF generation) from user-facing request threads.
  • Buffering unpredictable traffic spikes to protect downstream databases and third-party APIs.
  • Implementing asynchronous event-driven architectures where eventual consistency is acceptable.

Avoid When
  • Low-latency, synchronous request-response cycles are required (e.g., real-time authentication checks).
  • Strict message ordering is required across a massive volume of concurrent, unrelated entities.
  • Systems require ACID transactions spanning multiple services simultaneously.