DDSA Solutions
Case Study7 min read·

Design a Web Crawler (Google Search Bot)

System design for a web crawler at scale: URL frontier, politeness, deduplication, DNS, and distributed fetching for interview prep.

A web crawler discovers and downloads pages so a search engine can index them. Interviewers use it to test BFS thinking, distributed queues, and politeness constraints — the same graph traversal instincts as LeetCode, but with robots.txt and rate limits. Use the interview framework to clarify: are you crawling the whole public web or a fixed domain set?

Requirements

Functional

  • Seed URLs in; discover new links; fetch HTML; pass documents to indexer.
  • Respect robots.txt and per-host crawl-delay.
  • Deduplicate URLs (canonical form) and avoid revisiting unchanged pages when possible.
  • Prioritize important domains or fresh content (optional v2).

Non-functional

  • Crawl billions of pages; millions of fetches per day.
  • Politeness: max N concurrent requests per host (e.g. 1–2).
  • Handle flaky sites, redirects, and DNS failures without stalling the fleet.
  • Horizontally scale fetchers; frontier must not be a single-machine bottleneck.

Clarify scope

Google-scale crawler vs single-tenant site mirror — scale and storage differ. Most interviews want distributed BFS with politeness, not a full PageRank lecture.

High-level architecture

ComponentRole
URL frontier (Kafka / priority queues)Pending URLs to fetch, partitioned by host hash
Fetcher workersHTTP GET, follow redirects, extract links
URL dedup store (Bloom + DB)Seen URLs; Bloom filter first, DB for certainty
Robots cache (Redis)robots.txt rules per host
DNS resolver cacheAvoid repeated lookups; respect TTL
Content store (S3 / HDFS)Raw HTML for indexer pipeline
Link extractorParse <a href>, normalize, enqueue new URLs

Crawl loop (BFS at scale)

  1. Normalize URL: lowercase host, strip fragment, resolve relative paths.
  2. Check Bloom filter — if probably seen, skip (or confirm in dedup DB).
  3. Load robots.txt for host; skip if disallowed.
  4. Enqueue to host-specific queue with politeness scheduler.
  5. Fetcher pulls when host token available; HTTP GET with timeout.
  6. Store body; extract links; enqueue unseen URLs back to frontier.
  7. Emit crawl metadata (status, checksum, fetched_at) for indexer.

Politeness and per-host queues

Never hammer one server with 10K parallel requests. Partition frontier by `hash(host) % N` so all URLs for `example.com` land in one partition. A scheduler grants tokens: 1 fetch per host every X ms (robots crawl-delay or default 500ms). This is a rate limiter per host, not per user. Fetchers ask the scheduler before each GET.

URL deduplication

Billions of URLs cannot fit in one Redis set. Use a Bloom filter for fast “probably seen” checks, then a distributed key-value store (Cassandra, RocksDB) for definitive membership. Canonicalize: `http://foo.com/` vs `http://foo.com/index.html` may be the same page — normalize trailing slashes and default ports. Store content hash to skip re-indexing unchanged pages (caching checksums).

Capacity estimation

1B pages, average 50 KB HTML → 50 TB raw storage (compressed ~15 TB). 10M pages/day ≈ 115 fetches/sec average; peak 5× with batch windows. If mean page has 50 links and 20% are new, frontier grows ~1M new URLs/day — Kafka handles millions/sec ingress; bottleneck is politeness, not queue throughput. 100K hosts × 1 req/sec max = 100K concurrent polite fetches upper bound — size fetcher fleet accordingly.

Failure modes

FailureMitigation
Slow or hung hostPer-request timeout; circuit-breaker host for 1 hour
DNS failureRetry with backoff; cache negative results briefly
Redirect loopMax redirect depth 5; mark URL bad
Duplicate stormBloom false positives rare; confirm in DB before permanent skip
Fetcher crash mid-fetchAt-least-once queue; idempotent store by URL id

Freshness and recrawl

Not every URL needs daily refetch. Priority score: PageRank proxy, sitemap lastmod, or time-since-last-crawl. High-priority news sites hourly; long-tail yearly. Separate message queue topic for “recrawl due” vs “discovered link”.

Data model sketch

  • urls: url_hash, canonical_url, host, last_fetched_at, content_hash, status
  • host_politeness: host, max_concurrent, delay_ms, robots_fetched_at
  • frontier_queue: Kafka topic partitioned by host hash

DNS and robots.txt

Advertisement

Before fetching, resolve host via cached DNS (TTL respected). robots.txt is fetched once per host per day (cached in Redis). Parse Disallow rules and Crawl-delay. User-agent wildcard or specific bot name. If robots fetch fails, conservative default: slow rate or skip unknown host. This is legal/ethical requirement — not optional at Google scale.

Link extraction and normalization

  1. Parse HTML (or PDF pipeline separately); extract href from <a> tags.
  2. Resolve relative URLs against base URL of parent page.
  3. Drop javascript:, mailto:, data: schemes.
  4. Canonical link tag may override duplicate URL choice.
  5. Enqueue only http/https within allowed domain policy if scoped crawl.

Handoff to indexer

Crawler does not rank pages — it produces `(url, content_hash, fetch_time, raw_html_s3_key)` events on a Kafka topic. Indexer tokenizes, builds inverted index, computes PageRank offline. Decouple so fetch spikes do not block indexing. Poison pages (infinite SPA shells) detected by content-type and size limits.

Latency budget per fetch

StepTarget
Dedup + robots check< 5ms
Wait for politeness token0–2000ms (policy)
DNS (cached)< 2ms
HTTP GET< 5s timeout
Store S3 + enqueue links< 50ms async

Worked example: single host politeness

Host `news.site` has crawl-delay 1s and 500 URLs discovered. Only one fetcher token per second is granted for that host — 500 seconds minimum to drain the queue for that host alone. Other hosts proceed in parallel on other partitions. Interview point: politeness caps per-host throughput regardless of fleet size.

Sample opening (first three minutes)

Interviewer: "Design a web crawler." You: "I will assume we crawl the public web for search indexing. I need a distributed URL frontier, fetch workers that respect robots.txt and per-host rate limits, Bloom-filter dedup, and blob storage for HTML. I will start BFS from seed URLs, normalize links, and hand documents to a separate indexer. Politeness is non-negotiable at scale."

What to say in the last five minutes

Close with: "Kafka frontier partitioned by host, per-host politeness scheduler, Bloom + DB dedup, S3 for pages, link extractor feeds BFS." Mention failures: timeouts, circuit breakers, redirect caps. That is a complete crawler answer.

Scaling the fetcher fleet

Fetchers are stateless workers behind a load balancer pulling from partitioned Kafka consumers. Scale consumer instances up to partition count per topic. Bottleneck is rarely CPU — it is politeness tokens and target site response time. Monitor p95 fetch latency and per-host error rate. Auto-scale on queue depth per partition, not global QPS alone.

Security and abuse

  • Do not crawl login walls or paywalled content without permission.
  • Cap response body size (e.g. 10 MB) to avoid decompression bombs.
  • Sandbox malware links — indexer runs in isolated pipeline.
  • Identify crawler via User-Agent string with contact URL.

Comparison: BFS vs priority crawl

StrategyWhenTrade-off
Pure BFS from seedsSmall domain mirrorSimple; may miss deep pages
Priority queue by PageRankWeb-wide searchComplex scheduler; better crawl budget use
Sitemap-drivenCooperative sitesFast fresh URLs; incomplete without BFS fallback

Indexer pipeline contract

Define event schema: `{ url, fetch_id, http_status, content_type, body_s3_key, outlinks_count }`. Indexer acks after durable write to inverted index. Failed index retries from S3 without re-fetching URL — saves politeness budget. Dead-letter index jobs for malformed HTML.

Mock interview checklist

  1. Clarified scale and robots.txt / politeness.
  2. Described BFS with URL frontier and link extraction.
  3. Explained per-host rate limiting and queue partitioning.
  4. Named dedup strategy (Bloom + canonical URLs).
  5. Mentioned storage for raw HTML and handoff to indexer.

Closing summary

Web crawlers are distributed BFS with manners — frontier queue, polite fetchers, dedup, and blob storage. Connect to graph traversal from DSA prep; the hard part is operations at billion-URL scale.

More in this series