Skip to content
Jarviix
HLD11 min read

Design a Web Crawler (Googlebot-class)

Politeness-aware distributed crawler with a frontier queue, dedupe, and DNS-first discovery — scales to billions of pages.

hldsystem-designcrawler

Intro

A web crawler discovers, fetches, parses, and stores pages from across the open internet. The hard parts: politeness (one host, one connection, with backoff), dedupe (URL canonicalisation + content hashing), and scale (1 B+ URLs/day at sustained throughput).

Functional

  • Seed → BFS the open web; respect robots.txt.
  • Fetch HTML, extract links, follow recursively.
  • Re-crawl pages on a freshness schedule.
  • Store raw + parsed content for downstream indexing.

Non-functional

  • Throughput: 10 k pages/sec sustained (≈ 800 M/day).
  • Politeness: 1 in-flight request per host; respect crawl-delay.
  • Dedupe: > 99% URL dedupe, > 95% content dedupe.
  • Eventual coverage of 1 B+ URLs in catalog.

Components

  • Frontier

    Per-host FIFOs of URLs awaiting fetch; politeness-respecting.

  • Fetcher

    Workers that fetch HTML from frontier URLs.

  • Parser

    Extracts links + content; emits new frontier URLs.

  • URL store

    Bloom filter + KV for dedupe.

  • Content store

    Object storage for raw HTML; columnar for parsed text.

Trade-offs

Per-host queue vs. global priority queue

Pros

  • Per-host gives natural politeness.
  • Global priority gives importance-aware crawling.

Cons

  • Per-host can starve hot hosts.
  • Global priority needs N² politeness checks.

Scale concerns

  • DNS becomes a bottleneck at 10 k QPS — local caches mandatory.
  • Hot hosts (Wikipedia, NYT) need higher per-host concurrency than tail sites.
  • Frontier explosion — a single bad seed page can leak 10 M URLs.

Related reads