Skip to content
Jarviix
HLD11 min read

Design a Distributed Job Scheduler

Cron-class scheduler at planet scale: leases, exactly-once-effect, retries with back-off, DAG dependencies, and at-most-once concurrency.

hldsystem-designscheduling

Intro

A distributed job scheduler runs millions of recurring + one-off tasks across a worker fleet. The hard parts are leadership-free coordination (no single dispatcher), idempotent execution (workers crash mid-task), and per-job concurrency limits (often `at-most-one running` for cron). Think Kubernetes CronJob + AWS EventBridge + Airbnb's Airflow.

Functional

  • Schedule one-off + recurring (cron) jobs.
  • Execute on a worker pool with retries + back-off.
  • DAG dependencies — job B starts when job A succeeds.
  • Per-job concurrency limit (often 1).

Non-functional

  • Trigger latency p95 < 1 s after due time.
  • 1 M scheduled jobs; 100 k peak concurrent runs.
  • Exactly-once-effect (idempotent + dedupe) on retries.
  • Worker crash tolerance — leases recovered automatically.

Components

  • Schedule store

    Postgres / Cassandra of (job_id, next_run_at, ...)

  • Dispatcher

    Periodically scans due jobs and enqueues runs.

  • Run queue

    Kafka / Redis Streams partitioned by job_id.

  • Workers

    Lease + execute with heartbeats.

  • Lease store

    Redis with atomic Lua.

  • DAG engine

    Tracks dependencies + emits on success.

Trade-offs

Push (broker fans to workers) vs. pull (workers poll)

Pros

  • Push: low latency.
  • Pull: workers self-throttle.

Cons

  • Push: needs membership info.
  • Pull: poll overhead.

Exactly-once semantics vs. at-least-once + idempotency

Pros

  • Idempotency is achievable; exactly-once across crashes is not.

Cons

  • Tasks must be designed idempotent (idempotency_key).

Scale concerns

  • Thundering herd at 0:00 cron boundaries.
  • Long-running tasks blocking concurrency budget.
  • Worker crash with held lease — must time-out cleanly.
  • DAG cycles or stuck nodes.

Related reads