HLD11 min read

Design a Distributed Job Scheduler

Cron-class scheduler at planet scale: leases, exactly-once-effect, retries with back-off, DAG dependencies, and at-most-once concurrency.

hldsystem-designscheduling

Intro

A distributed job scheduler runs millions of recurring + one-off tasks across a worker fleet. The hard parts are leadership-free coordination (no single dispatcher), idempotent execution (workers crash mid-task), and per-job concurrency limits (often `at-most-one running` for cron). Think Kubernetes CronJob + AWS EventBridge + Airbnb's Airflow.

Functional

Schedule one-off + recurring (cron) jobs.
Execute on a worker pool with retries + back-off.
DAG dependencies — job B starts when job A succeeds.
Per-job concurrency limit (often 1).

Non-functional

Trigger latency p95 < 1 s after due time.
1 M scheduled jobs; 100 k peak concurrent runs.
Exactly-once-effect (idempotent + dedupe) on retries.
Worker crash tolerance — leases recovered automatically.

Components

Schedule store
Postgres / Cassandra of (job_id, next_run_at, ...)
Dispatcher
Periodically scans due jobs and enqueues runs.
Run queue
Kafka / Redis Streams partitioned by job_id.
Workers
Lease + execute with heartbeats.
Lease store
Redis with atomic Lua.
DAG engine
Tracks dependencies + emits on success.

Trade-offs

Push (broker fans to workers) vs. pull (workers poll)

Pros

Push: low latency.
Pull: workers self-throttle.

Cons

Push: needs membership info.
Pull: poll overhead.

Exactly-once semantics vs. at-least-once + idempotency