Job Scheduling Systems
Decouple the scheduler (which decides when to run a job) from the executor (which runs the workload) — the scheduler must never execute business logic itself, only claim the execution slot and enqueue a task.
- Decouple the scheduler (which decides when to run a job) from the executor (which runs the workload) — the scheduler must never execute business logic itself, only claim the execution slot and enqueue a task.
- Prevent duplicate execution across distributed nodes with an atomic lock:
UPDATE jobs SET claimed_by = $node WHERE id = $job AND claimed_by IS NULL— only the node that wins the update executes the job. - Generate a unique execution token per scheduled run:
sha256(job_id + scheduled_time). Workers validate this token before executing — duplicate tokens mean duplicate invocations, and one is rejected. - Apache Airflow uses
catchup=Trueby default — if a DAG was paused for a week, it will backfill 7 × daily_runs on resume. This is a common production surprise for new Airflow operators. - AWS EventBridge Scheduler's "flexible time window" distributes executions within a configured window to prevent thundering herd — 10,000 daily jobs all firing at midnight create the same database contention as a DDoS.
The Problem
A team deploys their billing cron job to a fleet of 10 application servers. Standard Linux crontab on each server fires at midnight — 10 nodes all execute the monthly invoice job simultaneously. Customers receive 10 invoices. The fix is to designate one server as the "cron server" — but when that server goes down at midnight, no invoices are generated and finance opens a ticket at 8am. This is the canonical distributed scheduling failure: naive cron on multiple nodes creates duplicate execution; single-node cron creates a silent failure when that node is unavailable.
Core System Idea
A distributed job scheduler separates two concerns: (1) Schedule evaluation — a lightweight, highly available scheduling engine polls a persistent job registry, identifies due jobs, and atomically claims execution slots using a distributed lock or optimistic update. The scheduler writes nothing to business databases and runs no heavy logic. (2) Job execution — after claiming a slot, the scheduler enqueues a lightweight task (containing job ID and execution token) to a worker queue. Worker nodes consume tasks, validate the execution token against the registry to prevent double-execution, and run the business logic. The separation ensures heavy jobs don't starve the scheduling engine. Three critical mechanisms: Distributed lock — atomic UPDATE ... WHERE claimed_by IS NULL or Redis SET NX on the job row before execution; only the winner proceeds. Misfire policy — explicit behavior when a job misses its window: FIRE_ONCE (run the most recent missed instance), FIRE_ALL (backfill all missed instances), or DO_NOTHING (skip and wait for the next window). Idempotent execution token — sha256(job_id + scheduled_time) is the idempotency key; a worker that crashes and recovers uses the same token, detected as a duplicate by the registry.
System Flow
Scheduler claims slot atomically, enqueues token to worker queue; workers validate token before executing to prevent double-run.
Real-World Examples Indicative
Airflow's scheduler polls its metadata database every SCHEDULER_HEARTBEAT_SEC (default: 5 seconds) for due DAG runs. Airbnb runs 30,000+ DAG instances per day. The scheduler uses a database row-level lock to claim each DagRun — only one scheduler instance proceeds even in HA mode. Misfire handling is per-DAG: catchup=True backfills all missed runs (dangerous if a DAG was paused for a week — it will enqueue 7 daily runs immediately on resume); catchup=False runs only the most recent. Production deployments use CeleryExecutor or KubernetesExecutor so task execution is fully decoupled from the scheduler process.
Serverless cron replacement supporting 100M+ scheduled invocations/day globally. The key operational insight is "flexible time windows" — rather than all 10,000 jobs firing at exactly midnight, EventBridge distributes executions within a configured window (e.g., any time within 15 minutes of midnight). This prevents the thundering herd of database connections, Lambda cold starts, and downstream API calls that rigid cron scheduling creates. At-least-once delivery: if the Lambda target is throttled, EventBridge retries for up to 24 hours with exponential backoff.
Shopify runs billing reconciliation and inventory sync CronJobs in Kubernetes with concurrencyPolicy: Forbid — if a previous run is still active when the next fire time arrives, the new run is skipped. For billing-critical jobs, they combine concurrencyPolicy: Forbid with an in-database execution token check: before doing any work, the job writes INSERT INTO executions (token) VALUES ($token) with a UNIQUE constraint. A duplicate token fails the insert and the job exits cleanly, protecting against the edge case where Kubernetes itself fires two pods simultaneously.
Anti-Patterns
crontab on every server without a coordination layer. All nodes fire at the same second, producing duplicate billing runs, duplicate emails, and duplicate external API calls.
Running heavy processing inside the scheduling thread or process. This delays evaluation of subsequent schedules, causing missed windows for other jobs — the scheduling engine must only claim and enqueue, never execute.
SELECT * FROM jobs WHERE run_at <= NOW() AND claimed_by IS NULL on an unindexed table locks rows across the entire table at high frequency. Index on (run_at, claimed_by) to limit scan scope.
Restarting a scheduler after 4 hours of downtime with FIRE_ALL policy causes 240 hourly job instances to enqueue simultaneously, overwhelming workers and downstream services. Define misfire policy explicitly before deployment, not during an incident.
Without NTP synchronization, a node with a 5-second clock drift may fire a job 5 seconds before or after the target. Combined with a short lock TTL, this creates duplicate executions. Enforce NTP sync and set lock TTLs to at least 10× the expected clock skew.
Design Tradeoffs
| Dimension | Centralized Scheduler | Partitioned Scheduler |
|---|---|---|
| Execution guarantee | Exactly-once via distributed lock | At-least-once with token dedup |
| Throughput ceiling | Scheduler DB bottlenecks at ~10K claims/min | Scales linearly with partition count |
| Misfire handling | Explicit policy in single scheduler | Per-partition; complex to reason about globally |
| Debug complexity | Single execution log, one source of truth | Distributed correlation across partitions |
| Best for | <10K jobs/day, strong consistency required | >100K jobs/day, horizontal scale required |
Best Practices
sha256(job_id + iso_scheduled_time) — deterministic, so a crashed-and-recovered worker generates the same token and the duplicate is detected via UNIQUE constraint.FIRE_ONCE (run the most recent missed instance). Cleanup jobs should use DO_NOTHING (skip missed windows). Data export jobs should use FIRE_ALL (backfill all missed exports). The default must not be FIRE_ALL.When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Business-critical tasks must run on a strict recurring schedule (billing, backups, reconciliation) | Tasks need to run immediately in response to events — use event-driven queues instead |
| Centralized visibility and audit logs for all recurring jobs are required | Execution interval is sub-second — scheduler overhead introduces unacceptable latency |
| Complex DAGs of dependent tasks must run in sequence based on time triggers | Single-instance application where in-process scheduling (APScheduler, cron) is sufficient |