HLD11 min read
Design a Web Crawler (Googlebot-class)
Politeness-aware distributed crawler with a frontier queue, dedupe, and DNS-first discovery — scales to billions of pages.
hldsystem-designcrawler
Intro
A web crawler discovers, fetches, parses, and stores pages from across the open internet. The hard parts: politeness (one host, one connection, with backoff), dedupe (URL canonicalisation + content hashing), and scale (1 B+ URLs/day at sustained throughput).
Functional
- Seed → BFS the open web; respect robots.txt.
- Fetch HTML, extract links, follow recursively.
- Re-crawl pages on a freshness schedule.
- Store raw + parsed content for downstream indexing.
Non-functional
- Throughput: 10 k pages/sec sustained (≈ 800 M/day).
- Politeness: 1 in-flight request per host; respect crawl-delay.
- Dedupe: > 99% URL dedupe, > 95% content dedupe.
- Eventual coverage of 1 B+ URLs in catalog.
Components
Frontier
Per-host FIFOs of URLs awaiting fetch; politeness-respecting.
Fetcher
Workers that fetch HTML from frontier URLs.
Parser
Extracts links + content; emits new frontier URLs.
URL store
Bloom filter + KV for dedupe.
Content store
Object storage for raw HTML; columnar for parsed text.
Trade-offs
Per-host queue vs. global priority queue
Pros
- Per-host gives natural politeness.
- Global priority gives importance-aware crawling.
Cons
- Per-host can starve hot hosts.
- Global priority needs N² politeness checks.
Scale concerns
- DNS becomes a bottleneck at 10 k QPS — local caches mandatory.
- Hot hosts (Wikipedia, NYT) need higher per-host concurrency than tail sites.
- Frontier explosion — a single bad seed page can leak 10 M URLs.