
Task Queue Design

Task queues decouple producers from workers — Stripe processes 250M+ webhook deliveries/month without blocking API responses by offloading all delivery retries to an async queue.

TL;DR
  • SQS visibility timeout is your at-least-once delivery mechanism: a worker that crashes mid-task causes the message to reappear after the timeout, not be lost. Set it to 2× your task's P99 execution time.
  • Worker concurrency is workload-dependent: Celery I/O-bound workers (API calls, DB writes) run at 50–200 concurrency; CPU-bound workers (image processing, ML inference) should match CPU core count — typically 4–8.
  • Queue depth is your primary scaling signal: 500+ messages per worker is the auto-scale threshold, not a soft warning. Scaling on CPU is a lagging indicator.
  • Dead letter queues are not a dump — every DLQ message must carry structured metadata: task type, failure reason, retry count, and original enqueue timestamp.

The Problem

A checkout service sends confirmation emails synchronously during the payment response path. The email provider degrades during Black Friday — 3-second email timeouts cascade into 3-second checkout responses, conversion drops 60%, and the on-call engineer discovers 80,000 failed transactions because the email provider went down. The email send had no business being in the payment critical path. Without task queues, every transient failure in a non-critical downstream operation becomes a user-visible incident.

Core System Idea

A task queue separates work acceptance from work execution. A producer enqueues a serialized task into a durable broker (Redis, SQS, RabbitMQ, or Kafka) and returns immediately. A worker pool polls the broker, claims tasks using visibility locks (SQS), BRPOP (Redis), or channel prefetch (RabbitMQ), executes the work, and acknowledges completion. Unacknowledged tasks are redelivered after the visibility timeout, guaranteeing at-least-once execution.

Three design levers determine queue behavior:
  • Priority queues: separate high-priority (payment notifications) from low-priority (weekly reports) queues with dedicated worker pools per tier; Shopify runs 5 priority tiers in Resque so critical jobs never wait behind bulk CSV exports.
  • Concurrency tuning: I/O-bound tasks (HTTP calls, DB writes) tolerate high concurrency (50–200); CPU-bound tasks (transcoding, ML inference) need concurrency ≤ CPU cores; mixing them in one pool at one concurrency setting under-utilizes one class and over-commits the other.
  • Backpressure: producers must block or reject when queue depth exceeds a threshold; unbounded queue growth leads to stale tasks processed hours after their result matters.
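The claim, acknowledge, and redeliver cycle can be sketched with a toy in-memory broker. This is a minimal illustration under stated assumptions, not how SQS, Redis, or RabbitMQ are implemented: the `InMemoryBroker` class and its method names are hypothetical, and the `now` parameter exists only so the clock can be controlled in examples.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryBroker:
    """Toy broker: a claimed message stays invisible until acked or timed out."""

    visibility_timeout: float
    _ready: list = field(default_factory=list)      # (message_id, payload) awaiting a worker
    _in_flight: dict = field(default_factory=dict)  # message_id -> (payload, deadline)

    def enqueue(self, payload):
        """Producer side: store the task and return immediately."""
        message_id = str(uuid.uuid4())
        self._ready.append((message_id, payload))
        return message_id

    def claim(self, now=None):
        """Worker side: return (message_id, payload), or None if nothing is visible."""
        now = time.monotonic() if now is None else now
        # Any claim whose deadline has passed is made visible again (redelivery).
        for message_id, (payload, deadline) in list(self._in_flight.items()):
            if deadline <= now:
                del self._in_flight[message_id]
                self._ready.append((message_id, payload))
        if not self._ready:
            return None
        message_id, payload = self._ready.pop(0)
        self._in_flight[message_id] = (payload, now + self.visibility_timeout)
        return message_id, payload

    def ack(self, message_id):
        """Acknowledge completion so the message is never redelivered."""
        self._in_flight.pop(message_id, None)
```

A worker that claims a message at `now=0.0` and crashes without calling `ack` leaves it invisible until the timeout elapses; a later `claim` past the deadline returns the same message ID, which is why handlers must tolerate duplicates.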

System Flow

flowchart TD
    A["Producer"] --> B["Priority Router"]
    B --> C["High-Priority Queue"]
    B --> D["Default Queue"]
    C --> E["Worker Pool (Fast)"]
    D --> F["Worker Pool (Bulk)"]
    E --> G["Result Store"]
    F --> G
    E -- "Exhausted retries" --> H["Dead Letter Queue"]
    F -- "Exhausted retries" --> H

Priority router separates tasks by tier; dedicated worker pools per priority; exhausted-retry tasks route to DLQ for inspection.
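The routing step, combined with the backpressure lever, might look like the sketch below. It assumes bounded in-process queues from the standard library as stand-ins for broker queues; the task-type names, the `PriorityRouter` class, and `QueueFullError` are all hypothetical.

```python
import queue

# Hypothetical tier assignment; these task-type names are illustrative only.
HIGH_PRIORITY_TYPES = {"payment_notification", "inventory_reservation"}


class QueueFullError(Exception):
    """Raised to the producer when a tier has reached its depth limit."""


class PriorityRouter:
    def __init__(self, high_limit, default_limit):
        # One bounded queue per tier; each would be drained by its own worker pool.
        self.queues = {
            "high": queue.Queue(maxsize=high_limit),
            "default": queue.Queue(maxsize=default_limit),
        }

    def tier_for(self, task):
        return "high" if task["type"] in HIGH_PRIORITY_TYPES else "default"

    def submit(self, task):
        """Route the task to its tier, rejecting instead of growing without bound."""
        tier = self.tier_for(task)
        try:
            self.queues[tier].put_nowait(task)
        except queue.Full:
            raise QueueFullError(f"{tier} queue is at capacity") from None
        return tier
```

Rejecting on a full queue is one of the two options named earlier; a producer that prefers to block could call `put` with a timeout instead of `put_nowait`. Because each tier has its own limit, a flooded default queue does not stop high-priority tasks from being accepted.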

Real-World Examples (Indicative)

Stripe's webhook delivery

Stripe delivers 250M+ webhooks/month using an exponential retry schedule: initial attempt → 1s → 5s → 30s → 2min → 10min → 1hr, up to 20 attempts over 72 hours. Each retry is an independent task enqueued with a scheduled delay. Tasks that exhaust all attempts land in a dead letter queue visible in Stripe's merchant dashboard — operators see the HTTP response code and retry history for each failed delivery, giving full visibility without touching server logs.
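A schedule like the one quoted above can be expressed as a lookup that returns the delay for the next attempt. This is a sketch, not Stripe's implementation: the function name is hypothetical, and reusing the final one-hour delay for every attempt past the listed steps is an assumption, since the text does not say how the later attempts are spaced.

```python
# Delays in seconds between attempts: 1s, 5s, 30s, 2min, 10min, 1hr.
RETRY_DELAYS = [1, 5, 30, 120, 600, 3600]
MAX_ATTEMPTS = 20


def next_retry_delay(attempts_made):
    """Return seconds to wait before the next attempt, or None to dead-letter.

    attempts_made counts deliveries already tried (1 after the initial attempt).
    """
    if attempts_made < 1:
        raise ValueError("attempts_made must be at least 1")
    if attempts_made >= MAX_ATTEMPTS:
        return None  # retries exhausted: route to the dead letter queue
    # Assumption: attempts beyond the listed steps reuse the last delay.
    index = min(attempts_made - 1, len(RETRY_DELAYS) - 1)
    return RETRY_DELAYS[index]
```

A worker that fails a delivery would re-enqueue the task with the returned delay, or move it to the dead letter queue when the function returns `None`.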

Shopify's Resque infrastructure

Shopify processes 50M+ background jobs/day via Resque (Redis-backed Ruby queue). They run 5 priority tiers: critical (inventory reservations, payment captures), high (order confirmations), default (search indexing), low (analytics), bulk (CSV exports, report generation). During Black Friday, the bulk queue accumulates hours of backlog while critical jobs maintain sub-second processing. Priority queue separation — not more hardware — is what makes this operational guarantee possible.

Celery + Redis for ML workloads

Teams processing document analysis workloads run separate Celery queues: a gpu_inference queue with concurrency=4 (one slot per A10G GPU) and a preprocessing queue with concurrency=50 (I/O-bound file downloads and text extraction). Without this separation, a preprocessing surge at concurrency=50 starves GPU inference slots entirely. The right concurrency per task class is as important as the queue infrastructure itself.
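The isolation described above can be illustrated with two standard-library thread pools standing in for the two Celery worker pools. This is only an analogy under stated assumptions: a real deployment would run separate worker processes per queue, and the `submit` helper is hypothetical. The pool names and sizes mirror the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One pool per task class, sized as in the example above.
POOLS = {
    "gpu_inference": ThreadPoolExecutor(max_workers=4, thread_name_prefix="gpu"),
    "preprocessing": ThreadPoolExecutor(max_workers=50, thread_name_prefix="prep"),
}


def submit(queue_name, fn, *args):
    """Run fn on the pool dedicated to queue_name and return its Future."""
    return POOLS[queue_name].submit(fn, *args)


def which_pool():
    # Report the thread that ran the task, to show which pool served it.
    return threading.current_thread().name
```

Because each pool has its own slots, a burst of `preprocessing` submissions can only queue behind other preprocessing work; the four `gpu_inference` slots remain available regardless of how deep the preprocessing backlog grows.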

Anti-Patterns

Synchronous execution in request handlers

Sending emails, calling payment APIs, or processing images inline with the HTTP response. When the dependency degrades, every user request degrades with it. Push work onto a queue and return 202 Accepted.

Mismatched visibility timeout

SQS default visibility timeout is 30 seconds. A task that takes 5 minutes reappears as a duplicate every 30 seconds while still in-flight. Set visibility timeout to 2× your task's P99 execution time.
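The 2× P99 rule can be computed from a sample of observed task durations. The sketch below uses a nearest-rank percentile and clamps the result to 43,200 seconds (12 hours), the maximum visibility timeout SQS accepts; the function names are hypothetical.

```python
import math

SQS_MAX_VISIBILITY_TIMEOUT = 43_200  # seconds (12 hours), the SQS upper bound


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    if not samples:
        raise ValueError("need at least one duration sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding surprises.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def visibility_timeout_seconds(durations, multiplier=2):
    """Return multiplier x P99, rounded up and clamped to the SQS maximum."""
    timeout = math.ceil(multiplier * p99(durations))
    return min(timeout, SQS_MAX_VISIBILITY_TIMEOUT)
```

For example, with 100 samples ranging evenly from 1 to 100 seconds, the P99 is 99 seconds and the suggested timeout is 198 seconds.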

Single homogeneous worker pool

One pool for both CPU-intensive ML inference and I/O-bound email sending. Slow inference jobs block fast email sends. Separate pools with tuned concurrency per task class.

Ignoring queue depth

A depth of 50,000 messages means tasks wait hours before processing — results are stale, customer-facing flows have already timed out. Alert on queue depth, not just worker error rates.
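A depth-based scaling decision, using the 500-messages-per-worker threshold this article recommends, could be sketched as follows. The function name, the `max_workers` cap, and the choice to only scale out (never in) are assumptions made for illustration.

```python
import math


def desired_workers(queue_depth, current_workers,
                    per_worker_threshold=500, max_workers=100):
    """Return the worker count needed to keep backlog per worker at or under the threshold."""
    if current_workers < 1:
        raise ValueError("current_workers must be at least 1")
    needed = math.ceil(queue_depth / per_worker_threshold)
    # Scale out only: never drop below the current count, never exceed the cap.
    return max(current_workers, min(needed, max_workers))
```

With 10 workers, a depth of 4,000 leaves the pool unchanged, while a depth of 50,000 asks for 100 workers. Scale-in is deliberately left out; it usually needs a cooldown so the pool does not oscillate.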

Unstructured dead letter queues

Dumping failed tasks to a DLQ without preserving failure reason, retry count, and original timestamp makes post-incident debugging impossible. A DLQ message without context is forensic evidence with the labels removed.
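One way to keep that context is to wrap every dead-lettered task in a structured envelope. The sketch below is an assumption about shape, not a standard format: the `DeadLetter` class, its field names, and the helper functions are hypothetical.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class DeadLetter:
    task_type: str
    payload: dict
    exception_class: str
    exception_message: str
    retry_count: int
    first_enqueued_at: float  # epoch seconds of the original enqueue
    last_attempt_at: float    # epoch seconds of the final failed attempt


def to_dead_letter(task_type, payload, exc, retry_count, first_enqueued_at, now=None):
    """Build the envelope from the task and the exception that ended its retries."""
    return DeadLetter(
        task_type=task_type,
        payload=payload,
        exception_class=type(exc).__name__,
        exception_message=str(exc),
        retry_count=retry_count,
        first_enqueued_at=first_enqueued_at,
        last_attempt_at=time.time() if now is None else now,
    )


def serialize(dead_letter):
    """Serialize to JSON so the DLQ message is searchable without unpickling."""
    return json.dumps(asdict(dead_letter), sort_keys=True)
```

Storing the exception class separately from its message makes it possible to group DLQ contents by failure type during an incident.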

Generating idempotency keys at enqueue time

A retry that generates a new UUID means the second attempt looks like a new task, not a retry. Derive idempotency keys from task payload (e.g., order_id + action_type) so retries are recognized as duplicates.
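A payload-derived key and a deduplicating wrapper can be sketched as below. The in-memory set is a stand-in for a durable store, and the `idempotency_key` function and `DedupingHandler` class are hypothetical names; the `order_id` and `action_type` fields follow the example in the text.

```python
import hashlib


def idempotency_key(order_id, action_type):
    """Derive a stable key from the payload so every retry maps to the same key."""
    raw = f"{action_type}:{order_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class DedupingHandler:
    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: a real system would use a durable store

    def handle(self, task):
        """Run the handler once per key; return False when the task is a duplicate."""
        key = idempotency_key(task["order_id"], task["action_type"])
        if key in self._seen:
            return False
        self._handler(task)
        # Record the key only after success so a failed attempt can be retried.
        self._seen.add(key)
        return True
```

Delivering the same task twice runs the wrapped handler once. The check-then-record sequence here is not atomic, so concurrent workers would need the store itself to enforce uniqueness.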

Design Tradeoffs

| Dimension | Push-based | Pull-based |
|---|---|---|
| Dispatch latency | Sub-millisecond (broker pushes immediately) | 1–20ms polling interval |
| Backpressure | Broker throttles; complex to configure | Worker controls rate; natural backpressure |
| Worker overload risk | High (broker pushes regardless of worker capacity) | Low (workers fetch only when ready) |
| Best for | Real-time notifications, low-latency tasks | Batch workloads, CPU-bound processing |
| Example tools | RabbitMQ, Redis Pub/Sub | SQS, Celery + Redis BRPOP, Resque |

Best Practices

  • Design all task handlers as idempotent operations. Derive the idempotency key from the task payload (order ID + action type), not a UUID generated at enqueue time. At-least-once delivery means retries are guaranteed, not exceptional.
  • Set visibility timeout to 2× your task's P99 execution time. SQS default (30s) is wrong for tasks that take minutes. Too short creates duplicates; too long delays redelivery of genuinely stuck tasks.
  • Use separate queues for separate concurrency profiles: CPU-bound tasks at concurrency = CPU_CORES, I/O-bound at concurrency = 50–200. Celery supports this natively via named queues with --queues routing.
  • Alert and auto-scale when queue depth per worker exceeds 500 tasks. This is an earlier signal than CPU utilization: a growing queue predicts user impact before workers hit saturation.
  • Route failed tasks to a DLQ after 3–5 retries with structured metadata: original payload, exception class and message, retry count, first-enqueue timestamp, last-attempt timestamp. This metadata is your operational audit trail for post-incident diagnosis.
  • For scheduled or delayed tasks, use a separate delay queue (SQS delay queues, Celery eta parameter, Sidekiq scheduled jobs) rather than sleeping inside a worker. Sleeping workers consume concurrency slots without doing work.
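The delay-queue idea from the last practice can be illustrated with a heap ordered by due time. This is a generic sketch, not how SQS, Celery, or Sidekiq implement scheduling; the `DelayQueue` class and its method names are hypothetical.

```python
import heapq
import itertools


class DelayQueue:
    """Holds tasks until their due time so no worker has to sleep on them."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so tasks are never compared

    def schedule(self, task, eta):
        """Store the task with the time (in seconds) at which it becomes runnable."""
        heapq.heappush(self._heap, (eta, next(self._counter), task))

    def pop_due(self, now):
        """Return every task whose eta has passed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, task = heapq.heappop(self._heap)
            due.append(task)
        return due
```

A scheduler loop would call `pop_due` periodically and move the returned tasks onto the normal work queue, so concurrency slots are only occupied by tasks that are ready to run.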

When to Use / Avoid

| Use When | Avoid When |
|---|---|
| Non-critical work (email, reports, indexing) must not block API responses | Synchronous response is required (payment auth result must return inline) |
| Retries on transient failures are required | Task is sub-millisecond, so coordination overhead exceeds task duration |
| Producers and workers must scale independently | Strict global ordering across all producers is required |
| Load spikes require buffering before processing | System is simple enough that a synchronous cron job suffices |