
Background Job Processing

Offload work from the HTTP request path by returning 202 Accepted immediately and processing asynchronously — a job that takes 5 seconds does not belong in a request handler that must respond in 50ms.

TL;DR
  • Return 202 Accepted immediately for any operation taking >200ms: enqueue a job, give the client a job ID to poll, and process asynchronously. Blocking the HTTP response for slow work is a common cause of request timeout cascades.
  • Pass only the minimum identifier in the job payload — the database record ID, not the full object. Serializing large payloads into the queue bloats memory, and stale data embedded at enqueue time is processed as stale data hours later.
  • Shopify processes 50M+ Sidekiq jobs/day; graceful shutdown during deploys is sidekiqctl quiet (stop accepting new jobs; on Sidekiq 6 and later, where sidekiqctl was removed, send the TSTP signal instead) → drain running jobs → restart. Hard-killing workers during deploy leaves jobs half-processed with no recovery path.
  • AWS Lambda triggered by SQS auto-scales to queue depth. Without reserved concurrency, a backlog of 1,000 messages at a batch size of 10 can fan out to around 100 concurrent Lambda invocations, each holding a database connection, which is enough to exhaust a 100-connection RDS pool.
  • Job status must be stored in a shared datastore (Redis, DB), not in-process. Clients polling for job completion cannot reach the worker that's running the job — the status must be independently readable.

The Problem

An API endpoint generates a PDF invoice when called. PDF generation takes 3–8 seconds depending on invoice size. At 100 concurrent requests, 100 threads are blocked waiting for PDF rendering — the server has no capacity for any other requests. When the PDF library occasionally crashes, the request thread hangs until the 30-second timeout, holding the connection and the thread open. The PDF generation has no retry mechanism: a crash means the customer never receives their invoice, and no error is logged because the failure happens inside the HTTP response path.

Core System Idea

A background job processing system separates work acceptance from work execution. The web server performs minimal input validation, writes a lightweight job payload to a durable queue, and immediately returns 202 Accepted with a job ID. Independent worker processes, isolated from the web server, poll the queue, execute the business logic, update job status in a shared store, and acknowledge completion. The client uses the job ID to poll for completion or receives a webhook/WebSocket notification.

Key design decisions:
  (1) Payload design: pass only the record ID, not the full object; workers fetch current state at execution time, preventing stale data issues.
  (2) Idempotency: every job handler must produce the same result if run twice; derive the idempotency key from the job payload, not from a UUID generated at enqueue time.
  (3) Concurrency profile: I/O-bound workers (sending email, calling APIs) run at high concurrency (50–200); CPU-bound workers (PDF generation, video transcoding) run at concurrency = CPU_CORES. Mix them in one pool and I/O-bound jobs dominate the slots.
  (4) Graceful shutdown: workers must handle SIGTERM: stop accepting new jobs, complete the current job or re-enqueue it, then exit cleanly.
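The split between accepting work and executing it can be sketched in a few lines. This is a minimal illustration, not a production design: the in-memory `JOB_QUEUE`, `JOB_STATUS`, and `INVOICES` objects and the handler names are hypothetical stand-ins for a durable queue, a shared status store, and a database table.

```python
import queue
import uuid
from datetime import datetime, timezone

# In-memory stand-ins (assumptions): a real system would use a durable
# queue and a shared store such as Redis or a database table.
JOB_QUEUE = queue.Queue()
JOB_STATUS = {}
INVOICES = {42: {"customer": "acme", "total_cents": 1999}}  # fake records


def _now():
    return datetime.now(timezone.utc).isoformat()


def handle_create_invoice_pdf(invoice_id):
    """Web side: validate, enqueue a small payload, answer 202 at once."""
    if invoice_id not in INVOICES:
        return 404, {"error": "unknown invoice"}
    job_id = str(uuid.uuid4())
    JOB_STATUS[job_id] = {"status": "QUEUED", "enqueued_at": _now()}
    JOB_QUEUE.put({"job_id": job_id, "invoice_id": invoice_id})  # ID only
    return 202, {"job_id": job_id}


def handle_get_job(job_id):
    """Polling endpoint: reads the shared store, never the worker."""
    record = JOB_STATUS.get(job_id)
    return (200, record) if record else (404, {"error": "unknown job"})


def run_one_job():
    """Worker side: pull one job, fetch current state, record the outcome."""
    job = JOB_QUEUE.get_nowait()
    status = JOB_STATUS[job["job_id"]]
    status.update(status="IN_PROGRESS", started_at=_now())
    try:
        invoice = INVOICES[job["invoice_id"]]  # fresh read at execution time
        status["result"] = f"invoice-{job['invoice_id']}-{invoice['total_cents']}.pdf"
        status.update(status="COMPLETED", completed_at=_now())
    except Exception as exc:  # a real worker would retry or dead-letter here
        status.update(status="FAILED", error_message=str(exc))
```

Because the payload carries only `invoice_id`, a change made to the invoice after enqueueing is still picked up when the worker runs.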

System Flow

flowchart TD
    A["Client"] --> B["Web Server"]
    B --> C["Job Queue"]
    B -- "202 Accepted + job_id" --> A
    C --> D["Worker Process"]
    D --> E{"Success?"}
    E -- "Yes" --> F["Update Status: Completed"]
    E -- "No" --> G["Retry or DLQ"]
    F --> H["Result Store"]
    A -- "Poll job_id" --> H

Web server enqueues and returns 202 immediately; worker processes asynchronously; client polls result store using job ID.
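The "Retry or DLQ" branch in the flow could look roughly like the sketch below. It assumes a fixed retry budget (`MAX_ATTEMPTS`, a made-up value) and uses an in-memory deque in place of a real queue; real brokers usually add a backoff delay between attempts, which is omitted here.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget, not taken from any specific broker


def process_with_retries(jobs, handler, max_attempts=MAX_ATTEMPTS):
    """Run each job; re-enqueue failures until the budget is spent, then dead-letter."""
    pending = deque({"payload": j, "attempts": 0} for j in jobs)
    completed, dead_letter = [], []
    while pending:
        job = pending.popleft()
        job["attempts"] += 1
        try:
            handler(job["payload"])
            completed.append(job["payload"])
        except Exception as exc:
            job["last_error"] = str(exc)
            if job["attempts"] < max_attempts:
                pending.append(job)       # retry later, behind other work
            else:
                dead_letter.append(job)   # park for manual inspection
    return completed, dead_letter
```

Keeping the attempt count and last error on the dead-lettered job gives whoever inspects the DLQ enough context to decide whether to replay it.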

Real-World Examples (Indicative)

Shopify + Sidekiq

Shopify processes 50M+ Sidekiq background jobs/day using Redis as the queue backend. They run 5 named queues with separate worker pools: critical (inventory mutations, payment captures), high (order confirmations), default (search index updates), low (analytics events), mailers (transactional email). During deployments, Shopify uses sidekiqctl quiet to stop new job pickup (on Sidekiq 6 and later, where sidekiqctl was removed, the equivalent is sending the TSTP signal), waits for running jobs to drain (typically <30 seconds for their job mix), then restarts workers, so no jobs are lost or left in partial state. Without graceful shutdown, a rolling deploy that hard-kills workers during job execution creates orphaned jobs that never complete.

AWS SQS + Lambda for image processing

A photo-sharing platform processes uploaded images (resize to 5 resolution variants, generate thumbnails, run content moderation) via SQS + Lambda. Lambda auto-scales to match queue depth: at 1,000 queued images, Lambda spawns concurrent invocations in batches of 10 (configurable batch size). Without reserved concurrency, this triggers 100 concurrent Lambda executions, each opening a DB connection to write results — immediately exhausting a 100-connection RDS pool and causing cascading failures across other services. Setting ReservedConcurrentExecutions = 20 on the image-processing Lambda caps DB connections while the queue drains at a controlled rate.
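The arithmetic behind this example can be written down as a small helper. It is a back-of-the-envelope upper bound under stated assumptions: every invocation holds `connections_per_invocation` connections for its whole run, and Lambda's gradual ramp-up is ignored. The function name and parameters are illustrative.

```python
import math


def peak_db_connections(queued_messages, batch_size, reserved_concurrency=None,
                        connections_per_invocation=1):
    """Upper bound on simultaneous DB connections from one SQS-triggered function."""
    invocations = math.ceil(queued_messages / batch_size)
    if reserved_concurrency is not None:
        # Reserved concurrency caps how many invocations can run at once.
        invocations = min(invocations, reserved_concurrency)
    return invocations * connections_per_invocation


uncapped = peak_db_connections(1000, batch_size=10)
capped = peak_db_connections(1000, batch_size=10, reserved_concurrency=20)
```

With boto3, the cap itself can be applied through the Lambda client's `put_function_concurrency` call, which takes `FunctionName` and `ReservedConcurrentExecutions`.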

Linear + BullMQ for notification batching

Linear (the project management tool) uses BullMQ (Node.js, Redis-backed) for its issue sync and Slack notification jobs. When a user imports 10,000 Jira issues into Linear, 10,000 notification jobs queue simultaneously. BullMQ's built-in rate limiter caps Slack webhook deliveries at 100/second per workspace, so the backlog drains in roughly 100 seconds (10,000 ÷ 100/second, about 2 minutes) and Linear stays under Slack's rate limits during bulk imports. Without rate limiting, the same import pushes 10,000 Slack messages in about 30 seconds; Slack throttles the workspace and no notifications arrive for the rest of the hour.
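To make the effect of a 100/second cap concrete, here is a small token-bucket simulation. It is written in Python for illustration and is not BullMQ's API or implementation; `TokenBucket` and `drain_time_ms` are hypothetical names, and time is simulated in integer milliseconds instead of being read from a clock.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then refills at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated_ms = 0

    def try_acquire(self, now_ms):
        elapsed_ms = now_ms - self.updated_ms
        refill = elapsed_ms * self.rate_per_sec / 1000
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated_ms = now_ms
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def drain_time_ms(job_count, bucket, tick_ms=10):
    """Simulated time at which the last job of a backlog is sent."""
    now_ms, sent = 0, 0
    while sent < job_count:
        if bucket.try_acquire(now_ms):
            sent += 1
        else:
            now_ms += tick_ms  # wait for the bucket to refill
    return now_ms


elapsed = drain_time_ms(10_000, TokenBucket(rate_per_sec=100, capacity=100))
```

Under these assumptions the 10,000-job backlog takes on the order of 100 seconds to drain, which matches the "about 2 minutes" figure above.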

Anti-Patterns

Blocking the HTTP response for slow work

Generating PDFs, sending emails, or calling payment APIs inline with the request handler. One slow operation degrades the entire request pool. Return 202 and process asynchronously.

Serializing full objects into job payloads

Embedding the entire order object in the job payload. The object is stale by the time the worker executes (minutes or hours later), and large payloads bloat queue memory. Pass only the record ID; workers fetch current state at execution time.
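An ID-only payload also makes it straightforward to derive a stable idempotency key from the payload itself. The sketch below assumes a hypothetical `seen_keys` set; in a real system that set would live in a shared store with an atomic set-if-absent operation, and the check and the side effect would need to be coordinated.

```python
import hashlib
import json


def build_payload(job_type, record_id):
    """Keep the payload to the job type and the record identifier."""
    return {"type": job_type, "record_id": record_id}


def idempotency_key(payload):
    """Same payload gives the same key, so a redelivered job is recognisable."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_once(payload, handler, seen_keys):
    """Skip the handler when this payload has already been processed."""
    key = idempotency_key(payload)
    if key in seen_keys:
        return False
    handler(payload)
    seen_keys.add(key)
    return True
```

A UUID generated at enqueue time would differ between two enqueues of the same logical job, which is why the key is computed from the payload contents instead.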

Mismatched visibility timeout

SQS's default visibility timeout is 30 seconds. A job that takes 5 minutes becomes visible to other consumers again 30 seconds after it was received, while the first worker is still running it, so the same job executes concurrently more than once. Set the visibility timeout to at least 2× P99 execution time.
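The 2× P99 rule can be computed from observed job durations. This is a rough sketch: it uses a nearest-rank percentile, the helper names are made up for illustration, and the cap reflects SQS's 12-hour maximum visibility timeout.

```python
import math


def p99(durations_s):
    """Nearest-rank 99th percentile of observed job durations (seconds)."""
    ordered = sorted(durations_s)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def recommended_visibility_timeout(durations_s, factor=2, sqs_max_s=43_200):
    """Apply the 2x P99 rule, capped at the 12-hour SQS maximum."""
    return min(math.ceil(factor * p99(durations_s)), sqs_max_s)
```

For jobs whose duration is hard to bound, extending the timeout while the job is still running (a heartbeat) is a common alternative to picking one large static value.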

Hard-killing workers during deploy

Terminating worker processes with SIGKILL during rolling deploys leaves jobs in an undefined state — some side effects may have executed, others not. The job is re-delivered from the queue and partially re-executes, corrupting state. Workers must handle SIGTERM gracefully.

No per-job status tracking

Workers that update a shared jobs table only on completion leave clients unable to distinguish "queued, not started" from "started, crashed" from "completed". Track QUEUED, IN_PROGRESS, COMPLETED, FAILED with timestamps.
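A status record with explicit transitions and timestamps might be modelled as below. The `failed_at` field name and the `looks_stuck` helper are assumptions added for illustration; a plain dict stands in for a row in the shared jobs table.

```python
from datetime import datetime, timedelta

# Allowed transitions; COMPLETED and FAILED are terminal in this sketch.
ALLOWED = {
    "QUEUED": {"IN_PROGRESS"},
    "IN_PROGRESS": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}
# "failed_at" is an assumed field name for this sketch.
TIMESTAMP_FIELD = {
    "IN_PROGRESS": "started_at",
    "COMPLETED": "completed_at",
    "FAILED": "failed_at",
}


def new_job_record(now: datetime) -> dict:
    return {"status": "QUEUED", "enqueued_at": now}


def transition(record: dict, new_status: str, now: datetime, error_message=None) -> dict:
    """Move a job to a new status, rejecting transitions that make no sense."""
    if new_status not in ALLOWED[record["status"]]:
        raise ValueError(f"illegal transition {record['status']} -> {new_status}")
    record["status"] = new_status
    record[TIMESTAMP_FIELD[new_status]] = now
    if error_message is not None:
        record["error_message"] = error_message
    return record


def looks_stuck(record: dict, now: datetime, max_runtime: timedelta) -> bool:
    """'Started, crashed' shows up as IN_PROGRESS for longer than expected."""
    return record["status"] == "IN_PROGRESS" and now - record["started_at"] > max_runtime
```

Writing `IN_PROGRESS` with `started_at` before the work begins is what lets a monitor tell a crashed job apart from one that is still waiting in the queue.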

Design Tradeoffs

| Dimension | Push-based Workers (RabbitMQ) | Pull-based Workers (SQS, Redis) |
| --- | --- | --- |
| Dispatch latency | Sub-millisecond (broker pushes immediately) | 1–20ms (polling interval) |
| Backpressure | Broker throttles; requires channel flow control | Natural; workers stop polling when busy |
| Worker overload risk | High (broker pushes regardless of worker capacity) | Low (workers fetch only when ready) |
| Auto-scaling trigger | Broker queue depth metric | Queue depth or consumer lag metric |
| Best for | Real-time, low-latency notification jobs | Variable-load batch and async workloads |

Best Practices

  • Return 202 Accepted with a job ID for any operation that takes >200ms or involves external I/O. Give clients a GET /jobs/{id} endpoint to poll for completion.
  • Pass only record IDs in job payloads. Workers fetch current state from the database at execution time. Stale data serialized at enqueue time produces incorrect results hours later.
  • Run separate worker pools for CPU-bound and I/O-bound job types. Celery: celery -A app worker --queues=gpu_inference --concurrency=4 and celery -A app worker --queues=email_send --concurrency=100. Mixing them at a single concurrency starves one type.
  • Handle SIGTERM in every worker: catch the signal, finish the current job (or re-enqueue it), write a completion record, then exit. This is the entire graceful shutdown contract.
  • Track job status with timestamps in a shared store: {status, enqueued_at, started_at, completed_at, error_message}. This table is your operational dashboard for job health and your debugging record when jobs fail silently.
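The SIGTERM contract from the list above can be sketched with Python's standard `signal` module. The `Worker` class and its `fetch_job` / `handle_job` callables are hypothetical; the key idea is that the handler only sets a flag, and the loop checks that flag between jobs so the current job always finishes.

```python
import signal
import threading


class Worker:
    def __init__(self, fetch_job, handle_job):
        self._fetch_job = fetch_job      # returns the next job, or None if empty
        self._handle_job = handle_job    # runs one job to completion
        self._stop = threading.Event()

    def install_signal_handlers(self):
        # Must be called from the main thread of the worker process.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Do no real work here: just ask the loop to stop after this job.
        self._stop.set()

    def run(self):
        processed = 0
        while not self._stop.is_set():
            job = self._fetch_job()
            if job is None:
                break  # a long-lived worker would wait and poll again instead
            self._handle_job(job)  # never interrupted by the stop flag
            processed += 1
        return processed
```

The deploy tooling also needs to wait longer than the longest expected job before escalating to SIGKILL, otherwise the graceful path never gets a chance to finish.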

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Operation takes >200ms or involves external I/O that must not block API response | Client requires an immediate synchronous response containing the result (e.g., payment authorization) |
| Work must be retried on transient failure without user re-triggering | Operation is sub-millisecond; queue overhead exceeds task duration |
| Producers and workers must scale independently during load spikes | Simple, low-traffic app where in-process async (asyncio, threads) suffices |
| Operations must be auditable and recoverable across server restarts | Task requires strict global ordering across all producers |