Task Queue Design
Task queues decouple producers from workers — Stripe processes 250M+ webhook deliveries/month without blocking API responses by offloading all delivery retries to an async queue.
- Task queues decouple producers from workers — Stripe processes 250M+ webhook deliveries/month without blocking API responses by offloading retries to an async queue.
- SQS visibility timeout is your at-least-once delivery mechanism: a worker that crashes mid-task causes the message to reappear after the timeout, not be lost. Set it to 2× your task's P99 execution time.
- Worker concurrency is workload-dependent: Celery I/O-bound workers (API calls, DB writes) run at 50–200 concurrency; CPU-bound workers (image processing, ML inference) should match CPU core count — typically 4–8.
- Queue depth is your primary scaling signal: 500+ messages per worker is the auto-scale threshold, not a soft warning. Scaling on CPU is a lagging indicator.
- Dead letter queues are not a dump — every DLQ message must carry structured metadata: task type, failure reason, retry count, and original enqueue timestamp.
The Problem
A checkout service sends confirmation emails synchronously during the payment response path. The email provider degrades during Black Friday — 3-second email timeouts cascade into 3-second checkout responses, conversion drops 60%, and the on-call engineer discovers 80,000 failed transactions because the email provider went down. The email send had no business being in the payment critical path. Without task queues, every transient failure in a non-critical downstream operation becomes a user-visible incident.
Core System Idea
A task queue separates work acceptance from work execution. A producer enqueues a serialized task into a durable broker — Redis, SQS, RabbitMQ, or Kafka — and returns immediately. A worker pool polls the broker, claims tasks using visibility locks (SQS), BRPOP (Redis), or channel prefetch (RabbitMQ), executes the work, and acknowledges completion. Unacknowledged tasks are redelivered after the visibility timeout, guaranteeing at-least-once execution. Three design levers determine queue behavior: (1) Priority queues — separate high-priority (payment notifications) from low-priority (weekly reports) queues with dedicated worker pools per tier; Shopify runs 5 priority tiers in Resque so critical jobs never wait behind bulk CSV exports. (2) Concurrency tuning — I/O-bound tasks (HTTP calls, DB writes) tolerate high concurrency (50–200); CPU-bound tasks (transcoding, ML inference) need concurrency ≤ CPU cores; mixing them in one pool at one concurrency setting under-utilizes one class and over-commits the other. (3) Backpressure — producers must block or reject when queue depth exceeds a threshold; unbounded queue growth leads to stale tasks processed hours after their result matters.
System Flow
Priority router separates tasks by tier; dedicated worker pools per priority; exhausted-retry tasks route to DLQ for inspection.
Real-World Examples Indicative
Stripe delivers 250M+ webhooks/month using an exponential retry schedule: initial attempt → 1s → 5s → 30s → 2min → 10min → 1hr, up to 20 attempts over 72 hours. Each retry is an independent task enqueued with a scheduled delay. Tasks that exhaust all attempts land in a dead letter queue visible in Stripe's merchant dashboard — operators see the HTTP response code and retry history for each failed delivery, giving full visibility without touching server logs.
Shopify processes 50M+ background jobs/day via Resque (Redis-backed Ruby queue). They run 5 priority tiers: critical (inventory reservations, payment captures), high (order confirmations), default (search indexing), low (analytics), bulk (CSV exports, report generation). During Black Friday, the bulk queue accumulates hours of backlog while critical jobs maintain sub-second processing. Priority queue separation — not more hardware — is what makes this operational guarantee possible.
Teams processing document analysis workloads run separate Celery queues: a gpu_inference queue with concurrency=4 (one slot per A10G GPU) and a preprocessing queue with concurrency=50 (I/O-bound file downloads and text extraction). Without this separation, a preprocessing surge at concurrency=50 starves GPU inference slots entirely. The right concurrency per task class is as important as the queue infrastructure itself.
Anti-Patterns
Sending emails, calling payment APIs, or processing images inline with the HTTP response. When the dependency degrades, every user request degrades with it. Push work onto a queue and return 202 Accepted.
SQS default visibility timeout is 30 seconds. A task that takes 5 minutes reappears as a duplicate every 30 seconds while still in-flight. Set visibility timeout to 2× your task's P99 execution time.
One pool for both CPU-intensive ML inference and I/O-bound email sending. Slow inference jobs block fast email sends. Separate pools with tuned concurrency per task class.
A depth of 50,000 messages means tasks wait hours before processing — results are stale, customer-facing flows have already timed out. Alert on queue depth, not just worker error rates.
Dumping failed tasks to a DLQ without preserving failure reason, retry count, and original timestamp makes post-incident debugging impossible. A DLQ message without context is forensic evidence with the labels removed.
A retry that generates a new UUID means the second attempt looks like a new task, not a retry. Derive idempotency keys from task payload (e.g., order_id + action_type) so retries are recognized as duplicates.
Design Tradeoffs
| Dimension | Push-based | Pull-based |
|---|---|---|
| Dispatch latency | Sub-millisecond (broker pushes immediately) | 1–20ms polling interval |
| Backpressure | Broker throttles — complex to configure | Worker controls rate — natural backpressure |
| Worker overload risk | High (broker pushes regardless of worker capacity) | Low (workers fetch only when ready) |
| Best for | Real-time notifications, low-latency tasks | Batch workloads, CPU-bound processing |
| Example tools | RabbitMQ, Redis Pub/Sub | SQS, Celery + Redis BRPOP, Resque |
Best Practices
concurrency = CPU_CORES, I/O-bound at concurrency = 50–200. Celery supports this natively via named queues with --queues routing.eta parameter, Sidekiq scheduled jobs) rather than sleeping inside a worker. Sleeping workers consume concurrency slots without doing work.When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Non-critical work (email, reports, indexing) must not block API responses | Synchronous response is required (payment auth result must return inline) |
| Retries on transient failures are required | Task is sub-millisecond — coordination overhead exceeds task duration |
| Producers and workers must scale independently | Strict global ordering across all producers is required |
| Load spikes require buffering before processing | System is simple enough that a synchronous cron job suffices |