Design a Search Engine (Google)
System design for web search: inverted index, query processing, ranking, autocomplete, and scaling reads to billions of queries for interview prep.
Search engine design sits one layer above the web crawler: the crawler discovers pages; the search system indexes them and answers queries in milliseconds. Interviewers want inverted indexes, ranking intuition, and how you scale read-heavy query traffic — not a PhD in information retrieval. Use the framework to scope: web search vs enterprise document search vs autocomplete-only.
Requirements
Functional
- User types query; system returns ranked list of URLs with snippets.
- Support phrase search, spelling correction, and safe-search filters.
- Index updates as crawler discovers new/changed pages.
- Autocomplete suggestions as user types (optional deep dive).
Non-functional
- 5B queries/day ≈ 60K QPS average; peak 3× in business hours.
- p99 query latency under 200 ms end-to-end.
- Index freshness: hours for news, days for long-tail.
- High availability — search downtime is front-page news.
Crawler is upstream
If the interviewer has not covered crawling, sketch it in two minutes: frontier queue, politeness, dedup — then say "documents land in the indexing pipeline." Do not spend twenty minutes on robots.txt unless asked.
High-level architecture
| Component | Role |
|---|---|
| Query service | Parse query, fan-out to index shards, merge, rank, render |
| Inverted index shards | term → posting list (doc_id, positions, fields) |
| Document store | doc_id → URL, title, body text for snippets |
| Indexer pipeline | Tokenize, stem, build postings from crawler events |
| PageRank / rank features (offline) | Precomputed signals merged at query time |
| Autocomplete (typeahead) | Prefix trie + popular query log |
| Spell checker | Edit distance against dictionary / query log |
Inverted index
Core data structure: each term maps to a posting list of documents containing it. Posting entry: `(doc_id, term_frequency, field_bitmap, positions[])`. Boolean AND of terms = intersect posting lists (merge join on sorted doc_ids — same skill as merge intervals on LeetCode). Phrase query requires position check: "new york" means "new" immediately before "york" in the same field. Shards partition by term hash: all postings for "cat" live on shard `hash("cat") % N`.
Indexing pipeline
- Crawler emits `(url, html_s3_key, fetch_time)` on Kafka.
- Indexer parses HTML: title, headings, body text; drop boilerplate/nav.
- Tokenize, lowercase, stem ("running" → "run"), remove stop words (optional).
- Assign doc_id; write forward index (doc → terms) and inverted updates.
- Batch merge small index segments into larger segments (LSM-style).
- Offline job recomputes PageRank, anchor text, spam scores.
Indexing is write-heavy and async — hours of lag is acceptable. Query path reads only merged, immutable index segments plus small real-time buffer for fresh docs. Dual-path indexing (batch + incremental) is a common senior talking point.
Query path
- Normalize query: lowercase, spell-correct, expand synonyms if configured.
- Determine which index shard owns each query term (typically one shard per term, not every shard).
- Parallel RPC to those shards: each returns top-K doc candidates with scores.
- Merge shard results; apply ranking model (BM25 + PageRank + freshness + user context).
- Fetch snippets from document store; highlight query terms.
- Return JSON results; log query for analytics and autocomplete.
Ranking (what to say without overclaiming)
v1: BM25 text relevance + PageRank authority + recency decay. Senior: learning-to-rank with hundreds of features, click-through rate as feedback loop. Interview sweet spot: "I combine lexical match score with precomputed PageRank and boost recent news for time-sensitive queries." Personalization (search history) is a separate cache layer — optional.
Capacity estimation
10B web pages, average 50 terms/page after stemming → 500B postings. Compressed posting list ~4 bytes/doc entry is optimistic — real systems use delta encoding and often land in tens of TB, not 2 TB. Order-of-magnitude is enough in interviews. 60K QPS × ~3 terms avg (after stop words) ≈ one shard RPC per term → ~180K internal shard reads/sec before replication — need replicated index servers in each region. Query logs: 5B rows/day → Kafka to data warehouse, not SQL.
Sharding and replication
Shard by term hash for even spread — hot terms like "the" are stop-word removed. Each shard has read replicas in multiple regions; query router picks nearest replica. Database sharding principles apply: avoid cross-shard joins at query time; merge only top-K lists. Rebalance when shard size skews (split hot terms).
Autocomplete integration
Separate low-latency path: prefix trie built from top 10M queries by frequency, cached in Redis per prefix (typeahead article). Does not hit full inverted index. Spell correction uses edit-distance-1 against vocabulary or "did you mean" from query log co-occurrence.
Consistency model
Index segments are immutable once published — readers never see half-written postings. New crawler docs go to a "fresh" tier merged periodically. Under CAP, search is AP on reads: stale results for minutes beat unavailable search. Payment-grade consistency not required.
Failure modes
| Failure | Mitigation |
|---|---|
| Index shard timeout | Return partial results + degrade message; retry other replicas |
| Hot term shard overload | Dedicated replica pool; cache popular query results |
| Indexer backlog | Scale consumers; prioritize news domains |
| Spam / SEO abuse | Offline classifier; demote in rank features |
| Snippet store miss | Fallback to URL + title only |
API sketch
- GET /search?q=...&page=1 — ranked results + snippets
- GET /suggest?q=pre — autocomplete (separate service)
- Internal: POST /index/document — crawler pipeline only
Sample opening (first three minutes)
Interviewer: "Design Google search." You: "I will assume web-scale query volume with an existing crawler feeding an async indexer. Queries hit a router that fans out to term-sharded inverted indexes, merges top candidates, ranks with BM25 plus PageRank, and fetches snippets. Autocomplete and spell-check are separate fast paths. I will estimate QPS and index size next."
Worked example: query "distributed cache"
Tokenizer emits ["distribut", "cache"] after stemming. Router contacts shards holding each term. Shard for "cache" returns 50K doc_ids; shard for "distribut" returns 12K. Intersect sorted lists → 800 candidates. Ranker scores top 100 with BM25 + PageRank. Snippet service pulls titles and highlights "cache" in context. Total shard RPC time budget: 50 ms; merge + rank: 20 ms; snippet fetch parallelized. Misspell "distribted" → spell checker suggests "distributed" before shard fan-out.
Deep dive: posting list compression
Raw posting lists are huge. Production systems delta-encode doc_ids (store gaps, not absolute ids), variable-byte encode integers, and partition by document frequency (rare terms inline, common terms in skip lists). You do not need to implement this — saying "postings are compressed and merged in segments" signals you understand why the index fits on disk and not in one giant hash map.
What to say in the last five minutes
Close with: "Term-sharded inverted index, parallel top-K merge, BM25 plus PageRank, async indexer from crawler Kafka stream, autocomplete on separate trie." Mention compression and spell-check if time allows.
Mock interview checklist
- Explained inverted index and posting list intersection.
- Separated indexing (async) from query (sync, low latency).
- Described sharding by term and merging top-K across shards.
- Mentioned ranking signals without pretending to know ML details.
- Connected to crawler upstream and typeahead sidecar.
Closing summary
Search is inverted indexes at scale plus a ranking layer. Crawl offline, query online. If you know tries and message queues, you already have half the vocabulary — this article connects them into a complete Google-style answer.