Designing a Search Engine — Crawl, Index, Query, Ranking
Session 44 of the 48-session learning series.
Date: Mon, 2026-07-13 · Time: 18:00–20:00 IST · Track: 🏗️ System Design (SYS) · Parent 28-day topic: Day 04 · Est. read: 2 h
Why this session matters
This is Session 44 of 48 in the System Design track. "Search" looks simple until you have to build it. Inverted indices, query parsing, ranking signals, freshness — the layers are well-understood but rarely covered properly. Knowing how Google works at a sketch-level is a system-design interview classic and a real architectural skill.
Agenda
- The 4 stages — crawl, index, query, rank
- The inverted index — how it actually works, what it costs
- Query parsing, tokenisation, normalisation
- Ranking — BM25 + signals + ML re-rank
- Modern hybrid search — keyword + vector together
Pre-read (skim before the session)
- Manning, Raghavan, Schütze — Introduction to Information Retrieval (free book)
- Elastic — Anatomy of an Elasticsearch cluster
- Google — How Search organises information (PageRank historical)
- Pinecone — Hybrid search guide
Deep dive
1. The 4-stage pipeline
[ CRAWL ] → [ INDEX ] → [ QUERY ] → [ RANK ]
fetch web build index parse user sort results
respect robots tokenise, store query, expand BM25 + signals
discover URLs inverted index lookup postings ML re-rank
Each stage scales independently. Crawl is the slow, expensive front. Query latency budget is < 200 ms; everything else can be slow.
2. The inverted index
A normal index: docID → content. Inverted index: term → list of (docID, position, frequency).
"transformer" → [(d1, [3, 47]), (d7, [12]), (d99, [1, 8, 33])]
"attention" → [(d1, [4]), (d12, [2]), (d99, [2])]
Query "transformer attention" = intersect posting lists for both terms, score the intersection.
Critical for: storing efficiently (compressed integers, delta encoding), supporting fast intersection, ranking via term-frequency stats.
3. Tokenisation and normalisation
Text in → terms out. Steps:
- Tokenise (split on whitespace, punctuation; language-aware).
- Lowercase (or case-preserve where case matters — names, code).
- Remove stop words (
the,a,is) — or keep with low weight. - Stem (
running→run) or lemmatise (better→good). - Normalise unicode (NFC), strip diacritics for tolerance.
- Synonyms (
tv↔television). - N-grams (bigrams for phrase support).
Every choice changes recall + precision tradeoff. Tune by language.
4. Building the index
For a small corpus, one in-memory index. For billions of docs:
- Segments — many small immutable indices, merged in background (LSM-tree-like).
- Sharding — index split across nodes by document hash.
- Replication — each shard on N nodes for HA + read throughput.
Lucene (Elasticsearch / OpenSearch / Solr underneath) is the canonical implementation. New indices: Tantivy (Rust), Vespa (multi-modal first-class), Meilisearch (small-scale UX).
5. The query path
"best laptop for coding 2026"
↓
parse → ["best", "laptop", "coding", "2026"] (stopword "for")
↓
synonyms → ["best", "laptop", "notebook", "coding", "programming", "2026"]
↓
fetch posting lists for each
↓
intersect (AND) or union (OR) per query operator
↓
score candidates (BM25 + signals)
↓
top-K to ranker
↓
return
Lucene-style engines do steps 1–6 in milliseconds for indices in the billions.
6. BM25 — the classic ranker
A scoring function that beats vanilla TF-IDF for most workloads:
score(D, Q) = Σ IDF(qi) * f(qi, D) * (k1+1) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))
Where:
f(qi, D)— term frequency.IDF— inverse document frequency.|D|— document length;avgdl— average doc length.k1,b— tunable constants (k1=1.2, b=0.75typical).
BM25 captures: rare terms matter more (IDF), term frequency helps but saturates (k1), shorter docs scoring higher per term (b).
7. Signals on top of BM25
Real ranking blends BM25 with:
- Freshness — recent docs ranked higher for some queries.
- Popularity — clickthrough rate, view count.
- Quality — author authority, trustworthiness, spam score.
- Personalisation — user history match.
- Engagement — dwell time, return rate.
- Geographic / language — match user locale.
Each signal is a feature. Sum/weight them, or learn via LTR (next).
8. Learning to Rank (LTR)
Modern engines (Google, Bing, e-commerce search) use ML to combine signals:
- Pointwise — predict relevance score; sort.
- Pairwise — train to order pairs correctly.
- Listwise — optimise NDCG directly (LambdaMART).
LightGBM's LambdaRank is the default for many production search teams. Fast at serve, easy to interpret.
9. Hybrid search — keyword + vector
Pure keyword fails on semantic mismatch ("how to make my code faster" vs docs titled "optimization techniques"). Pure vector misses exact matches and rare terms.
Hybrid: run both retrievals, fuse results.
Fusion strategies:
- Reciprocal Rank Fusion (RRF) —
score = Σ 1/(k + rank)across retrievers. Simple, robust. - Weighted score —
final = α * keyword + (1-α) * vector. Needs normalisation. - Rerank — feed top-K from both into an LLM/cross-encoder reranker.
OpenSearch, Elasticsearch, Vespa, Pinecone, Weaviate all support hybrid out of the box now.
10. Crawl politely
If you're the search engine:
- Respect
robots.txt. - Rate limit per host (don't DDoS).
- Identify yourself in User-Agent.
- Recrawl based on change frequency (popular news = hourly; static docs = weekly).
- Discover URLs from sitemaps + link graph.
- Deduplicate (canonical URLs, shingle-based near-duplicate detection).
A polite crawl of the web is a massive project. Most teams crawl their own data only.
11. Query understanding
Hard problems:
- Spell correction —
tesla cybertuck→tesla cybertruck. - Query expansion — add synonyms.
- Intent classification — informational / navigational / transactional.
- Entity detection —
restaurants in tokyo→ location filter ontokyo. - Question answering — modern hybrid: search + LLM summarises top results.
Each is a small ML model. Stacked, they 2× search quality without changing the index.
12. Observability
- Latency — p50, p95, p99 query time.
- Recall — known good docs in top-N for sample queries.
- Click-through rate (CTR) — by position; flat curve = bad ranking.
- Zero-result rate — too high = expansion / synonyms broken.
- Click-to-action rate — quality at the answer level.
A search engine without click metrics is operating blind.
13. Reality check
A search engine for a SaaS product, day-1:
- Postgres
tsvectoror SQLite FTS5 for < 100k docs. - Elasticsearch / OpenSearch for 100k–1B docs.
- Vespa / proprietary for > 1B docs or multi-modal.
- Hybrid (keyword + vector via embedding) for any modern UX.
- Click logging from day 1.
- LTR layer when you have > 6 months of click data.
The hardest part isn't the index; it's the evaluation. Build the click-log + relevance-judgement pipeline early.
Reading material
- Introduction to Information Retrieval (Manning et al.)
- Designing Data-Intensive Applications — search chapter
- Elasticsearch — Definitive Guide
- Lucene Internals (paper / talks)
In-depth research material
- Pinecone — Hybrid search guide
- Vespa documentation
- Lin Yi — Learning to Rank survey
- Spotify — Search ranking
Video reference
▶︎ Designing a Search Engine (System Design Interview)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Search Autocomplete System
- Link: https://leetcode.com/problems/design-search-autocomplete-system/
- Difficulty: Hard
- Why this problem: Trie + frequency tracking — the in-memory data structure behind every search-as-you-type box. (We saw this earlier in S09; different lens.)
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Draw the inverted index data structure with delta encoding.
- Walk through query lookup: parse → expand → intersect → rank.
- Explain BM25 in plain English; state the role of
k1,b. - Combine keyword + vector with RRF or weighted score fusion.
- Sketch a crawl pipeline respecting
robots.txt+ rate limits. - Solve
design-search-autocomplete-system— trie-based prefix lookup, the autocomplete engine.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.