Design a Distributed Job Scheduler
Cron-class scheduler at planet scale: leases, exactly-once-effect, retries with back-off, DAG dependencies, and at-most-once concurrency.
Intro
A distributed job scheduler runs millions of recurring + one-off tasks across a worker fleet. The hard parts are leadership-free coordination (no single dispatcher), idempotent execution (workers crash mid-task), and per-job concurrency limits (often `at-most-one running` for cron). Think Kubernetes CronJob + AWS EventBridge + Airbnb's Airflow.
Functional
- Schedule one-off + recurring (cron) jobs.
- Execute on a worker pool with retries + back-off.
- DAG dependencies — job B starts when job A succeeds.
- Per-job concurrency limit (often 1).
Non-functional
- Trigger latency p95 < 1 s after due time.
- 1 M scheduled jobs; 100 k peak concurrent runs.
- Exactly-once-effect (idempotent + dedupe) on retries.
- Worker crash tolerance — leases recovered automatically.
Components
Schedule store
Postgres / Cassandra of (job_id, next_run_at, ...)
Dispatcher
Periodically scans due jobs and enqueues runs.
Run queue
Kafka / Redis Streams partitioned by job_id.
Workers
Lease + execute with heartbeats.
Lease store
Redis with atomic Lua.
DAG engine
Tracks dependencies + emits on success.
Trade-offs
Push (broker fans to workers) vs. pull (workers poll)
Pros
- Push: low latency.
- Pull: workers self-throttle.
Cons
- Push: needs membership info.
- Pull: poll overhead.
Exactly-once semantics vs. at-least-once + idempotency
Pros
- Idempotency is achievable; exactly-once across crashes is not.
Cons
- Tasks must be designed idempotent (idempotency_key).
Scale concerns
- Thundering herd at 0:00 cron boundaries.
- Long-running tasks blocking concurrency budget.
- Worker crash with held lease — must time-out cleanly.
- DAG cycles or stuck nodes.