Background Job Processing
Offload work from the HTTP request path by returning 202 Accepted immediately and processing asynchronously — a job that takes 5 seconds does not belong in a request handler that must respond in 50ms.
- Return 202 Accepted immediately for any operation taking >200ms — enqueue a job, give the client a job ID to poll, and process asynchronously. Blocking the HTTP response for background work is the most common cause of request timeout cascades.
- Pass only the minimum identifier in the job payload — the database record ID, not the full object. Serializing large payloads into the queue bloats memory, and stale data embedded at enqueue time is processed as stale data hours later.
- Shopify processes 50M+ Sidekiq jobs/day; graceful shutdown during deploys is sidekiqctl quiet (stop accepting new jobs) → drain running jobs → restart. On Sidekiq 6+, which removed sidekiqctl, the equivalent is sending the TSTP signal. Hard-killing workers during deploy leaves jobs half-processed with no recovery path.
- AWS Lambda triggered by SQS auto-scales to queue depth. Without reserved concurrency, 1,000 queued messages at a batch size of 10 can trigger 100 concurrent Lambda invocations, exhausting a 100-connection RDS pool within seconds.
- Job status must be stored in a shared datastore (Redis, DB), not in-process. Clients polling for job completion cannot reach the worker that's running the job — the status must be independently readable.
The Problem
An API endpoint generates a PDF invoice when called. PDF generation takes 3–8 seconds depending on invoice size. At 100 concurrent requests, 100 threads are blocked waiting for PDF rendering — the server has no capacity for any other requests. When the PDF library occasionally crashes, the request thread hangs until the 30-second timeout, holding the connection and the thread open. The PDF generation has no retry mechanism: a crash means the customer never receives their invoice, and no error is logged because the failure happens inside the HTTP response path.
Core System Idea
A background job processing system separates work acceptance from work execution. The web server performs minimal input validation, writes a lightweight job payload to a durable queue, and immediately returns 202 Accepted with a job ID. Independent worker processes, isolated from the web server, poll the queue, execute the business logic, update job status in a shared store, and acknowledge completion. The client uses the job ID to poll for completion or receives a webhook/WebSocket notification. Key design decisions:
- Payload design: pass only the record ID, not the full object; workers fetch current state at execution time, preventing stale data issues.
- Idempotency: every job handler must produce the same result if run twice; derive the idempotency key from the job payload, not from a UUID generated at enqueue time.
- Concurrency profile: I/O-bound workers (sending email, calling APIs) run at high concurrency (50–200); CPU-bound workers (PDF generation, video transcoding) run at concurrency = CPU_CORES. Mix them in one pool and I/O-bound jobs dominate the slots.
- Graceful shutdown: workers must handle SIGTERM by stopping acceptance of new jobs, completing the current job or re-enqueueing it, then exiting cleanly.
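The idempotency decision can be sketched as follows. This is a minimal illustration, not a production implementation: the names idempotency_key and run_once are hypothetical, and the in-memory set stands in for a shared store such as Redis or a database table.

```python
import hashlib
import json

def idempotency_key(job_type: str, payload: dict) -> str:
    """Derive a stable key from the job type and payload, not from a random UUID."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps({"type": job_type, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Assumption: an in-memory set stands in for a shared store (Redis, DB table).
_completed = set()

def run_once(job_type: str, payload: dict, handler) -> bool:
    """Run handler unless this exact job already completed. Returns True if it ran."""
    key = idempotency_key(job_type, payload)
    if key in _completed:
        return False  # duplicate delivery: skip the side effects
    handler(payload)
    _completed.add(key)
    return True
```

Because the key is computed from the payload, a redelivered or re-enqueued copy of the same job maps to the same key. In a real shared store the check and the write would need to be atomic (for example a unique constraint or a set-if-absent operation); this sketch does not attempt that.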
System Flow
Web server enqueues and returns 202 immediately; worker processes asynchronously; client polls result store using job ID.
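The flow can be sketched end to end in a few lines. This is an illustrative model only: the function names are hypothetical, queue.Queue stands in for a durable queue, and a dict stands in for the shared result store.

```python
import queue
import uuid

job_queue = queue.Queue()  # assumption: stand-in for a durable queue (SQS, Redis)
job_status = {}            # assumption: stand-in for a shared store (Redis, DB)

def submit_invoice_job(invoice_id: int):
    """Web-server side: validate, enqueue, and return 202 with a job ID."""
    if invoice_id <= 0:
        return 400, {"error": "invalid invoice_id"}
    job_id = str(uuid.uuid4())
    job_status[job_id] = {"status": "QUEUED"}
    # Only the record ID travels in the payload, not the invoice itself.
    job_queue.put({"job_id": job_id, "invoice_id": invoice_id})
    return 202, {"job_id": job_id}

def work_one(render) -> None:
    """Worker side: take one job, run it, and record the outcome."""
    job = job_queue.get()
    job_status[job["job_id"]] = {"status": "IN_PROGRESS"}
    try:
        result = render(job["invoice_id"])
        job_status[job["job_id"]] = {"status": "COMPLETED", "result": result}
    except Exception as exc:
        job_status[job["job_id"]] = {"status": "FAILED", "error_message": str(exc)}

def get_job(job_id: str):
    """Polling side: the equivalent of GET /jobs/{id}."""
    if job_id not in job_status:
        return 404, {"error": "unknown job"}
    return 200, job_status[job_id]
```

The web-facing functions never call render, so a slow or crashing renderer affects only the worker and shows up as a FAILED status rather than a hung request.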
Real-World Examples (Indicative)
Shopify processes 50M+ Sidekiq background jobs/day using Redis as the queue backend. They run 5 named queues with separate worker pools: critical (inventory mutations, payment captures), high (order confirmations), default (search index updates), low (analytics events), mailers (transactional email). During deployments, Shopify uses sidekiqctl quiet to stop new job pickup, waits for running jobs to drain (typically <30 seconds for their job mix), then restarts workers — zero jobs are lost or left in partial state. Without graceful shutdown, a rolling deploy that hard-kills workers during job execution creates orphaned jobs that never complete.
A photo-sharing platform processes uploaded images (resize to 5 resolution variants, generate thumbnails, run content moderation) via SQS + Lambda. Lambda auto-scales to match queue depth: at 1,000 queued images, Lambda spawns concurrent invocations in batches of 10 (configurable batch size). Without reserved concurrency, this triggers 100 concurrent Lambda executions, each opening a DB connection to write results — immediately exhausting a 100-connection RDS pool and causing cascading failures across other services. Setting ReservedConcurrentExecutions = 20 on the image-processing Lambda caps DB connections while the queue drains at a controlled rate.
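The arithmetic in this example can be checked with a short calculation. It is a simplified upper bound, assuming one DB connection per invocation and that every batch runs at once; real Lambda scaling ramps up over time. The function name is hypothetical.

```python
import math
from typing import Optional

def peak_db_connections(queued: int, batch_size: int,
                        reserved_concurrency: Optional[int] = None) -> int:
    """Upper bound on simultaneous DB connections, assuming one per invocation."""
    invocations = math.ceil(queued / batch_size)
    if reserved_concurrency is not None:
        # Reserved concurrency caps how many invocations run at the same time.
        invocations = min(invocations, reserved_concurrency)
    return invocations

uncapped = peak_db_connections(queued=1000, batch_size=10)
capped = peak_db_connections(queued=1000, batch_size=10, reserved_concurrency=20)
```

Under these assumptions the uncapped case reaches the full 100-connection pool, while a cap of 20 leaves 80 connections for other services.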
Linear (the project management tool) uses BullMQ (Node.js, Redis-backed) for their issue sync and Slack notification jobs. When a user imports 10,000 Jira issues into Linear, 10,000 notification jobs queue simultaneously. BullMQ's built-in rate limiter caps Slack webhook deliveries at 100/second per workspace, preventing Linear from hitting Slack's rate limits during bulk imports. Without rate limiting, the same import that completes in 2 minutes triggers 10,000 Slack messages delivered in 30 seconds — Slack throttles the workspace and no notifications arrive for the rest of the hour.
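The idea of a per-workspace rate cap can be illustrated with a token bucket. This is a generic sketch and not BullMQ's implementation or API; the class name and parameters are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allow up to `capacity` sends in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        """Return True if one send is allowed at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should delay the job and retry later

bucket = TokenBucket(rate_per_sec=100, capacity=100)
allowed_at_t0 = sum(bucket.try_acquire(now=0.0) for _ in range(150))
allowed_at_t_half = sum(bucket.try_acquire(now=0.5) for _ in range(100))
```

At 100 deliveries per second, 10,000 queued notifications drain in about 100 seconds, which is consistent with the roughly two-minute import described above.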
Anti-Patterns
Generating PDFs, sending emails, or calling payment APIs inline with the request handler. One slow operation degrades the entire request pool. Return 202 and process asynchronously.
Embedding the entire order object in the job payload. The object is stale by the time the worker executes (minutes or hours later), and large payloads bloat queue memory. Pass only the record ID; workers fetch current state at execution time.
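A small sketch shows why the ID-only payload avoids stale data. The orders dict is a hypothetical stand-in for the database, and the function names are illustrative.

```python
import json

# Assumption: a dict stands in for the orders table.
orders = {1001: {"id": 1001, "status": "pending", "total_cents": 4999}}

def enqueue_payload(order_id: int) -> str:
    """Serialize only the record ID into the queue message."""
    return json.dumps({"order_id": order_id})

def handle(payload_json: str) -> str:
    """Worker side: load the order's current state at execution time."""
    order_id = json.loads(payload_json)["order_id"]
    order = orders[order_id]
    return order["status"]

payload = enqueue_payload(1001)
orders[1001]["status"] = "cancelled"  # the order changes while the job waits
status_seen_by_worker = handle(payload)
```

Had the full order been embedded at enqueue time, the worker would have acted on a pending order that was in fact already cancelled.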
SQS default visibility timeout is 30 seconds. A job that takes 5 minutes reappears to other workers every 30 seconds while in-flight, causing concurrent duplicate execution. Set visibility timeout to 2× P99 execution time.
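The 2× P99 rule can be computed from observed job durations. This is a rough sketch with hypothetical function names: it uses the nearest-rank percentile method and clamps to the SQS maximum visibility timeout of 12 hours.

```python
import math

SQS_MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # SQS upper limit (12 hours)

def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]

def visibility_timeout_seconds(durations):
    """Twice the P99 execution time, rounded up and clamped to the SQS limit."""
    return min(math.ceil(2 * p99(durations)), SQS_MAX_VISIBILITY_SECONDS)

# A job that consistently takes 5 minutes needs far more than the 30 s default.
five_minute_job_timeout = visibility_timeout_seconds([300] * 10)
```

For jobs whose duration is hard to bound, a worker can also extend the timeout while it is still running, but that is outside this sketch.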
Terminating worker processes with SIGKILL during rolling deploys leaves jobs in an undefined state — some side effects may have executed, others not. The job is re-delivered from the queue and partially re-executes, corrupting state. Workers must handle SIGTERM gracefully.
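A minimal sketch of a worker loop that honors SIGTERM is shown below. The Worker class and install_sigterm_handler are hypothetical names, and the in-memory list stands in for a real queue. The signal handler only sets a flag, so the job that is currently running always finishes before the loop exits.

```python
import signal

class Worker:
    def __init__(self, jobs):
        self.pending = list(jobs)  # assumption: stand-in for a real queue
        self.done = []
        self.stopping = False

    def request_stop(self, signum=None, frame=None):
        """Signal-handler compatible: only flip a flag, do no real work here."""
        self.stopping = True

    def run(self, handler):
        """Process jobs until the queue is empty or a stop was requested."""
        while self.pending and not self.stopping:
            job = self.pending.pop(0)
            handler(job)           # the current job always runs to completion
            self.done.append(job)
        return self.pending        # unstarted jobs stay available for other workers

def install_sigterm_handler(worker: Worker) -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, worker.request_stop)
```

The deploy tooling then needs to wait at least as long as the longest job before escalating to SIGKILL, otherwise the graceful path never gets a chance to finish.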
Workers that update a shared jobs table only on completion leave clients unable to distinguish "queued, not started" from "started, crashed" from "completed". Track QUEUED, IN_PROGRESS, COMPLETED, FAILED with timestamps.
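The four states and their timestamps can be modeled as a small state machine. This is an illustrative sketch: the function names are hypothetical, a dict stands in for a row in the jobs table, and it assumes that completed_at also records the time of a failure.

```python
import time

# Legal transitions; COMPLETED and FAILED are terminal in this sketch.
ALLOWED = {
    "QUEUED": {"IN_PROGRESS"},
    "IN_PROGRESS": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}
# Assumption: completed_at is reused as the finish time for failed jobs.
TIMESTAMP_FIELD = {
    "IN_PROGRESS": "started_at",
    "COMPLETED": "completed_at",
    "FAILED": "completed_at",
}

def new_job_record(now=None):
    """Create the row written at enqueue time."""
    return {
        "status": "QUEUED",
        "enqueued_at": time.time() if now is None else now,
        "started_at": None,
        "completed_at": None,
        "error_message": None,
    }

def transition(record, new_status, now=None, error_message=None):
    """Move a job to new_status, stamping the matching timestamp."""
    if new_status not in ALLOWED[record["status"]]:
        raise ValueError(f"illegal transition {record['status']} -> {new_status}")
    record["status"] = new_status
    record[TIMESTAMP_FIELD[new_status]] = time.time() if now is None else now
    if new_status == "FAILED":
        record["error_message"] = error_message
    return record
```

With started_at recorded, a job that has been IN_PROGRESS far longer than its expected duration can be flagged as a probable crash instead of looking identical to one that never started.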
Design Tradeoffs
| Dimension | Push-based Workers (RabbitMQ) | Pull-based Workers (SQS, Redis) |
|---|---|---|
| Dispatch latency | Sub-millisecond (broker pushes immediately) | 1–20ms (polling interval) |
| Backpressure | Requires configuration: cap consumer prefetch (basic.qos) so the broker stops pushing to busy workers | Natural: workers stop polling when busy |
| Worker overload risk | High if prefetch is unbounded (broker pushes regardless of worker capacity) | Low (workers fetch only when ready) |
| Auto-scaling trigger | Broker queue depth metric | Queue depth or consumer lag metric |
| Best for | Real-time, low-latency notification jobs | Variable-load batch and async workloads |
Best Practices
- Return 202 Accepted with a job ID, and expose a GET /jobs/{id} endpoint to poll for completion.
- Run separate worker pools per concurrency profile, for example celery -A app worker --queues=gpu_inference --concurrency=4 and celery -A app worker --queues=email_send --concurrency=100. Mixing them at a single concurrency starves one type.
- Handle SIGTERM in every worker: catch the signal, finish the current job (or re-enqueue it), write a completion record, then exit. This is the entire graceful shutdown contract.
- Record every job in a shared table with fields such as {status, enqueued_at, started_at, completed_at, error_message}. This table is your operational dashboard for job health and your debugging record when jobs fail silently.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Operation takes >200ms or involves external I/O that must not block API response | Client requires an immediate synchronous response containing the result (e.g., payment authorization) |
| Work must be retried on transient failure without user re-triggering | Operation is sub-millisecond — queue overhead exceeds task duration |
| Producers and workers must scale independently during load spikes | Simple, low-traffic app where in-process async (asyncio, threads) suffices |
| Operations must be auditable and recoverable across server restarts | Task requires strict global ordering across all producers |