HLD11 min read

Design a Web Crawler (Googlebot-class)

Politeness-aware distributed crawler with a frontier queue, dedupe, and DNS-first discovery — scales to billions of pages.

hldsystem-designcrawler

Intro

A web crawler discovers, fetches, parses, and stores pages from across the open internet. The hard parts: politeness (one host, one connection, with backoff), dedupe (URL canonicalisation + content hashing), and scale (1 B+ URLs/day at sustained throughput).

Functional

Seed → BFS the open web; respect robots.txt.
Fetch HTML, extract links, follow recursively.
Re-crawl pages on a freshness schedule.
Store raw + parsed content for downstream indexing.

Non-functional

Throughput: 10 k pages/sec sustained (≈ 800 M/day).
Politeness: 1 in-flight request per host; respect crawl-delay.
Dedupe: > 99% URL dedupe, > 95% content dedupe.
Eventual coverage of 1 B+ URLs in catalog.

Components

Frontier
Per-host FIFOs of URLs awaiting fetch; politeness-respecting.
Fetcher
Workers that fetch HTML from frontier URLs.
Parser
Extracts links + content; emits new frontier URLs.
URL store
Bloom filter + KV for dedupe.
Content store
Object storage for raw HTML; columnar for parsed text.

Trade-offs

Per-host queue vs. global priority queue

Pros

Per-host gives natural politeness.
Global priority gives importance-aware crawling.

Cons

Per-host can starve hot hosts.
Global priority needs N² politeness checks.

Scale concerns

DNS becomes a bottleneck at 10 k QPS — local caches mandatory.
Hot hosts (Wikipedia, NYT) need higher per-host concurrency than tail sites.
Frontier explosion — a single bad seed page can leak 10 M URLs.

Intro

Components

Trade-offs

Scale concerns

Related reads