System Design

Job Scheduling Systems

TL;DR
  • Decouple the scheduler (which decides when to run a job) from the executor (which runs the workload) — the scheduler must never execute business logic itself, only claim the execution slot and enqueue a task.
  • Prevent duplicate execution across distributed nodes with an atomic lock: UPDATE jobs SET claimed_by = $node WHERE id = $job AND claimed_by IS NULL — only the node that wins the update executes the job.
  • Generate a unique execution token per scheduled run: sha256(job_id + scheduled_time). Workers validate this token before executing. A token that has already been recorded means a duplicate invocation, and the duplicate is rejected.
  • In Apache Airflow 2.x and earlier, catchup defaults to True: if a daily DAG was paused for a week, it will backfill all 7 missed runs on resume. This is a common production surprise for new Airflow operators.
  • AWS EventBridge Scheduler's "flexible time window" distributes executions within a configured window to prevent thundering herd — 10,000 daily jobs all firing at midnight create the same database contention as a DDoS.

The Problem

A team deploys their billing cron job to a fleet of 10 application servers. Standard Linux crontab on each server fires at midnight — 10 nodes all execute the monthly invoice job simultaneously. Customers receive 10 invoices. The fix is to designate one server as the "cron server" — but when that server goes down at midnight, no invoices are generated and finance opens a ticket at 8am. This is the canonical distributed scheduling failure: naive cron on multiple nodes creates duplicate execution; single-node cron creates a silent failure when that node is unavailable.

Core System Idea

A distributed job scheduler separates two concerns:

(1) Schedule evaluation: a lightweight, highly available scheduling engine polls a persistent job registry, identifies due jobs, and atomically claims execution slots using a distributed lock or optimistic update. The scheduler writes nothing to business databases and runs no heavy logic.

(2) Job execution: after claiming a slot, the scheduler enqueues a lightweight task (containing the job ID and execution token) to a worker queue. Worker nodes consume tasks, validate the execution token against the registry to prevent double execution, and run the business logic.

The separation ensures heavy jobs don't starve the scheduling engine. Three mechanisms are critical:

  • Distributed lock: an atomic UPDATE ... WHERE claimed_by IS NULL on the job row, or a Redis SET NX on a per-job key, before execution; only the winner proceeds.
  • Misfire policy: explicit behavior when a job misses its window: FIRE_ONCE (run the most recent missed instance), FIRE_ALL (backfill all missed instances), or DO_NOTHING (skip and wait for the next window).
  • Idempotent execution token: sha256(job_id + scheduled_time) is the idempotency key. A worker that crashes and recovers presents the same token, which the registry detects as a duplicate.
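The atomic claim described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production scheduler: the jobs table layout, the try_claim helper, and the node names are assumptions made for the example. The point it shows is that the UPDATE's affected-row count tells each node whether it won the slot.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, job_id: str, node: str) -> bool:
    """Return True only for the node whose UPDATE actually changed the row."""
    cur = conn.execute(
        "UPDATE jobs SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
        (node, job_id),
    )
    conn.commit()
    # rowcount is 1 for the winner and 0 for every node that arrives later.
    return cur.rowcount == 1


# Hypothetical registry with a single unclaimed job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, claimed_by TEXT)")
conn.execute("INSERT INTO jobs (id) VALUES ('monthly-invoice')")
conn.commit()

# Two nodes attempt the same claim; only the first UPDATE matches the row.
results = {node: try_claim(conn, "monthly-invoice", node) for node in ("node-a", "node-b")}
print(results)  # {'node-a': True, 'node-b': False}
```

A node that gets False simply skips the job. Only the winner goes on to generate the execution token and enqueue the task.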

System Flow

flowchart TD
    A["Cron Evaluator"] --> B["Job Registry DB"]
    B --> C{"Atomic Lock Acquired?"}
    C -- "No" --> D["Skip: Already Claimed"]
    C -- "Yes" --> E["Generate Execution Token"]
    E --> F["Worker Queue"]
    F --> G["Worker Pool"]
    G --> H["Validate Token"]
    H -- "Duplicate" --> D
    H -- "Valid" --> I["Execute Business Logic"]
    I --> B

Scheduler claims slot atomically, enqueues token to worker queue; workers validate token before executing to prevent double-run.

Real-World Examples (Indicative)

Apache Airflow at Airbnb

Airflow's scheduler polls its metadata database every SCHEDULER_HEARTBEAT_SEC (default: 5 seconds) for due DAG runs. Airbnb runs 30,000+ DAG instances per day. The scheduler uses a database row-level lock to claim each DagRun — only one scheduler instance proceeds even in HA mode. Misfire handling is per-DAG: catchup=True backfills all missed runs (dangerous if a DAG was paused for a week — it will enqueue 7 daily runs immediately on resume); catchup=False runs only the most recent. Production deployments use CeleryExecutor or KubernetesExecutor so task execution is fully decoupled from the scheduler process.

AWS EventBridge Scheduler

Serverless cron replacement supporting 100M+ scheduled invocations/day globally. The key operational insight is "flexible time windows" — rather than all 10,000 jobs firing at exactly midnight, EventBridge distributes executions within a configured window (e.g., any time within 15 minutes of midnight). This prevents the thundering herd of database connections, Lambda cold starts, and downstream API calls that rigid cron scheduling creates. At-least-once delivery: if the Lambda target is throttled, EventBridge retries for up to 24 hours with exponential backoff.
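The idea behind a flexible time window can be illustrated in a few lines. The sketch below is not EventBridge's actual algorithm, which the text above does not describe; it is one assumed way to get the same effect by deriving a stable per-job offset from a hash of the job ID. The spread_within_window helper and the job names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def spread_within_window(job_id: str, target: datetime, window: timedelta) -> datetime:
    """Pick a stable fire time in [target, target + window) for this job."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Integer modulo keeps the offset strictly inside the window.
    offset_seconds = int.from_bytes(digest[:8], "big") % int(window.total_seconds())
    return target + timedelta(seconds=offset_seconds)


midnight = datetime(2024, 1, 1, tzinfo=timezone.utc)
window = timedelta(minutes=15)

# 10,000 jobs nominally scheduled for midnight are spread over 15 minutes.
fire_times = [spread_within_window(f"job-{i}", midnight, window) for i in range(10_000)]
print(min(fire_times), max(fire_times))
```

Because the offset comes from a hash rather than a random draw, the same job lands at the same point in the window on every run, which keeps its execution token and its logs predictable.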

Kubernetes CronJobs at Shopify

Shopify runs billing reconciliation and inventory sync CronJobs in Kubernetes with concurrencyPolicy: Forbid — if a previous run is still active when the next fire time arrives, the new run is skipped. For billing-critical jobs, they combine concurrencyPolicy: Forbid with an in-database execution token check: before doing any work, the job writes INSERT INTO executions (token) VALUES ($token) with a UNIQUE constraint. A duplicate token fails the insert and the job exits cleanly, protecting against the edge case where Kubernetes itself fires two pods simultaneously.
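The token check described above can be sketched with sqlite3 and hashlib. This is an illustration under stated assumptions, not Shopify's code: the executions table layout, the helper names, and the "|" separator between job ID and scheduled time are choices made for the example (the separator only avoids ambiguous concatenations).

```python
import hashlib
import sqlite3


def execution_token(job_id: str, scheduled_time_iso: str) -> str:
    """Deterministic token: the same job and scheduled time always hash the same."""
    # The "|" separator is an assumption of this sketch, not part of the source formula.
    return hashlib.sha256(f"{job_id}|{scheduled_time_iso}".encode("utf-8")).hexdigest()


def record_execution(conn: sqlite3.Connection, token: str) -> bool:
    """Return True if this run may proceed, False if the token was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO executions (token) VALUES (?)", (token,))
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a second insert of the same token.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE executions (token TEXT NOT NULL UNIQUE)")

token = execution_token("billing-reconciliation", "2024-01-01T00:00:00+00:00")
first = record_execution(conn, token)   # first pod: insert succeeds
second = record_execution(conn, token)  # duplicate pod: insert fails, exit cleanly
print(first, second)  # True False
```

The database, not the orchestrator, is the final arbiter here, so the guard still holds if two pods start at the same instant.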

Anti-Patterns

OS cron on multiple application nodes

crontab on every server without a coordination layer. All nodes fire at the same second, producing duplicate billing runs, duplicate emails, and duplicate external API calls.

Scheduler executing business logic directly

Running heavy processing inside the scheduling thread or process. This delays evaluation of subsequent schedules, causing missed windows for other jobs — the scheduling engine must only claim and enqueue, never execute.

Database polling without a covering index

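SELECT * FROM jobs WHERE run_at <= NOW() AND claimed_by IS NULL on an unindexed table forces a full table scan on every poll, and at high polling frequency that scan competes with the claim updates for the same rows. Index on (run_at, claimed_by) to limit scan scope, and select only the columns the scheduler needs so the index can answer the query.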
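A small sqlite3 sketch of the indexed polling query follows. The schema, the index name idx_jobs_due, and the sample rows are assumptions for illustration, and other databases will report their query plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, run_at TEXT NOT NULL, claimed_by TEXT)")
# Composite index matching the polling predicate.
conn.execute("CREATE INDEX idx_jobs_due ON jobs (run_at, claimed_by)")
conn.executemany(
    "INSERT INTO jobs (id, run_at, claimed_by) VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T00:00:00Z", None),      # due, unclaimed
        (2, "2024-01-01T00:00:00Z", "node-a"),  # due, already claimed
        (3, "2024-01-02T00:00:00Z", None),      # not due yet
    ],
)

# Select only the id instead of SELECT * and cap the batch size.
DUE_JOBS = (
    "SELECT id FROM jobs WHERE run_at <= ? AND claimed_by IS NULL "
    "ORDER BY run_at LIMIT 100"
)
now = "2024-01-01T00:05:00Z"  # ISO-8601 UTC strings sort chronologically

due_ids = [row[0] for row in conn.execute(DUE_JOBS, (now,))]
print(due_ids)  # [1]

# Inspect how SQLite plans the query; the exact wording varies by SQLite version.
for row in conn.execute("EXPLAIN QUERY PLAN " + DUE_JOBS, (now,)):
    print(row)
```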

Ignoring misfire behavior

Restarting a scheduler after 4 hours of downtime with a FIRE_ALL policy causes a per-minute job to enqueue 240 missed instances at once (4 hours × 60 runs per hour), overwhelming workers and downstream services. Define misfire policy explicitly before deployment, not during an incident.
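The effect of each misfire policy can be made concrete with a short sketch. The MisfirePolicy enum and the runs_to_enqueue helper are hypothetical names for illustration; the policy names themselves come from the text above. The example uses a job that runs every minute and a scheduler that was down for 4 hours.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class MisfirePolicy(Enum):
    FIRE_ONCE = "fire_once"    # run only the most recent missed instance
    FIRE_ALL = "fire_all"      # backfill every missed instance
    DO_NOTHING = "do_nothing"  # skip and wait for the next window


def runs_to_enqueue(
    last_fired: datetime, now: datetime, interval: timedelta, policy: MisfirePolicy
) -> list[datetime]:
    """List the missed fire times that should be enqueued under the given policy."""
    missed = []
    t = last_fired + interval
    while t <= now:
        missed.append(t)
        t += interval
    if policy is MisfirePolicy.FIRE_ALL:
        return missed
    if policy is MisfirePolicy.FIRE_ONCE:
        return missed[-1:]  # empty list if nothing was missed
    return []


last_fired = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
now = last_fired + timedelta(hours=4)  # scheduler comes back after 4 hours
every_minute = timedelta(minutes=1)

counts = {p.name: len(runs_to_enqueue(last_fired, now, every_minute, p)) for p in MisfirePolicy}
print(counts)  # {'FIRE_ONCE': 1, 'FIRE_ALL': 240, 'DO_NOTHING': 0}
```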

Clock skew across scheduler nodes

Without NTP synchronization, a node with a 5-second clock drift may fire a job 5 seconds before or after the target. Combined with a short lock TTL, this creates duplicate executions. Enforce NTP sync and set lock TTLs to at least 10× the expected clock skew.
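As a worked example of lock TTL sizing, the sketch below combines the 10× clock-skew rule above with this guide's other recommendation of 2× the maximum expected job runtime, and takes the larger of the two. Combining them with max() and the lock_ttl name are assumptions of this sketch, not a prescribed formula.

```python
from datetime import timedelta


def lock_ttl(max_job_runtime: timedelta, expected_clock_skew: timedelta) -> timedelta:
    """Pick a TTL that satisfies both the 2x runtime and the 10x skew guidance."""
    return max(2 * max_job_runtime, 10 * expected_clock_skew)


# Long job, 5 s skew: the runtime rule dominates (60 s vs 50 s).
long_job_ttl = lock_ttl(timedelta(seconds=30), timedelta(seconds=5))
# Short job, 5 s skew: the skew rule dominates (4 s vs 50 s).
short_job_ttl = lock_ttl(timedelta(seconds=2), timedelta(seconds=5))

print(long_job_ttl.total_seconds(), short_job_ttl.total_seconds())  # 60.0 50.0
```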

Design Tradeoffs

| Dimension | Centralized Scheduler | Partitioned Scheduler |
| --- | --- | --- |
| Execution guarantee | Exactly-once via distributed lock | At-least-once with token dedup |
| Throughput ceiling | Scheduler DB bottlenecks at ~10K claims/min | Scales linearly with partition count |
| Misfire handling | Explicit policy in single scheduler | Per-partition; complex to reason about globally |
| Debug complexity | Single execution log, one source of truth | Distributed correlation across partitions |
| Best for | <10K jobs/day, strong consistency required | >100K jobs/day, horizontal scale required |

Best Practices

  • Generate execution tokens as sha256(job_id + iso_scheduled_time). The token is deterministic, so a crashed-and-recovered worker generates the same token and the duplicate is detected via a UNIQUE constraint.
  • Define misfire policies per job type. Billing jobs should use FIRE_ONCE (run the most recent missed instance). Cleanup jobs should use DO_NOTHING (skip missed windows). Data export jobs should use FIRE_ALL (backfill all missed exports). The default must not be FIRE_ALL.
  • Set scheduler lock TTLs to 2× the maximum expected job execution time. A lock that expires before the job finishes allows another node to re-acquire and fire a duplicate.
  • Store all schedules in UTC. Cron expressions evaluated in local time will misfire during DST transitions: a job scheduled for 2:00am will not fire on the day clocks spring forward because that time doesn't exist.
  • Monitor scheduler heartbeat lag and missed-fire counts as primary SLIs. A scheduler that hasn't checked in within 2× its heartbeat interval is unhealthy; alert before the next scheduled job window arrives.
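The heartbeat rule in the last practice reduces to a single comparison. The sketch below assumes the scheduler records a last-heartbeat timestamp somewhere the monitor can read it; the scheduler_is_unhealthy name and the 5-second interval are illustrative.

```python
from datetime import datetime, timedelta, timezone


def scheduler_is_unhealthy(
    last_heartbeat: datetime, heartbeat_interval: timedelta, now: datetime
) -> bool:
    """Flag a scheduler that has been silent for more than 2x its heartbeat interval."""
    return now - last_heartbeat > 2 * heartbeat_interval


interval = timedelta(seconds=5)
now = datetime(2024, 1, 1, 0, 0, 30, tzinfo=timezone.utc)

recent = scheduler_is_unhealthy(now - timedelta(seconds=8), interval, now)  # within 10 s
stale = scheduler_is_unhealthy(now - timedelta(seconds=12), interval, now)  # beyond 10 s
print(recent, stale)  # False True
```

Keeping all timestamps timezone-aware and in UTC, as the previous practice recommends, avoids false alerts when the monitor and the scheduler run in different zones.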

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Business-critical tasks must run on a strict recurring schedule (billing, backups, reconciliation) | Tasks need to run immediately in response to events; use event-driven queues instead |
| Centralized visibility and audit logs for all recurring jobs are required | Execution interval is sub-second; scheduler overhead introduces unacceptable latency |
| Complex DAGs of dependent tasks must run in sequence based on time triggers | Single-instance application where in-process scheduling (APScheduler, cron) is sufficient |