system designintermediate 12m2026-06-09

Designing a Search Engine — Crawl, Index, Query, Ranking

Session 44 of the 48-session learning series.

Date: Mon, 2026-07-13 · Time: 18:00–20:00 IST · Track: 🏗️ System Design (SYS) · Parent 28-day topic: Day 04 · Est. read: 2 h

Why this session matters

This is Session 44 of 48 in the System Design track. "Search" looks simple until you have to build it. Inverted indices, query parsing, ranking signals, freshness — the layers are well-understood but rarely covered properly. Knowing how Google works at a sketch-level is a system-design interview classic and a real architectural skill.

Agenda

The 4 stages — crawl, index, query, rank
The inverted index — how it actually works, what it costs
Query parsing, tokenisation, normalisation
Ranking — BM25 + signals + ML re-rank
Modern hybrid search — keyword + vector together

Pre-read (skim before the session)

Deep dive

1. The 4-stage pipeline

[ CRAWL ]   →  [ INDEX ]   →   [ QUERY ]    →   [ RANK ]
fetch web      build index     parse user      sort results
respect robots tokenise, store query, expand   BM25 + signals
discover URLs  inverted index  lookup postings ML re-rank

Each stage scales independently. Crawl is the slow, expensive front. Query latency budget is < 200 ms; everything else can be slow.

2. The inverted index

A normal index: docID → content. Inverted index: term → list of (docID, position, frequency).

"transformer" → [(d1, [3, 47]), (d7, [12]), (d99, [1, 8, 33])]
"attention"   → [(d1, [4]), (d12, [2]), (d99, [2])]

Query "transformer attention" = intersect posting lists for both terms, score the intersection.

Critical for: storing efficiently (compressed integers, delta encoding), supporting fast intersection, ranking via term-frequency stats.

3. Tokenisation and normalisation

Text in → terms out. Steps:

Tokenise (split on whitespace, punctuation; language-aware).
Lowercase (or case-preserve where case matters — names, code).
Remove stop words (the, a, is) — or keep with low weight.
Stem (running → run) or lemmatise (better → good).
Normalise unicode (NFC), strip diacritics for tolerance.
Synonyms (tv ↔ television).
N-grams (bigrams for phrase support).

Every choice changes recall + precision tradeoff. Tune by language.

4. Building the index

For a small corpus, one in-memory index. For billions of docs:

Segments — many small immutable indices, merged in background (LSM-tree-like).
Sharding — index split across nodes by document hash.
Replication — each shard on N nodes for HA + read throughput.

Lucene (Elasticsearch / OpenSearch / Solr underneath) is the canonical implementation. New indices: Tantivy (Rust), Vespa (multi-modal first-class), Meilisearch (small-scale UX).

5. The query path

"best laptop for coding 2026"
    ↓
parse → ["best", "laptop", "coding", "2026"]   (stopword "for")
    ↓
synonyms → ["best", "laptop", "notebook", "coding", "programming", "2026"]
    ↓
fetch posting lists for each
    ↓
intersect (AND) or union (OR) per query operator
    ↓
score candidates (BM25 + signals)
    ↓
top-K to ranker
    ↓
return

Lucene-style engines do steps 1–6 in milliseconds for indices in the billions.

6. BM25 — the classic ranker

A scoring function that beats vanilla TF-IDF for most workloads:

score(D, Q) = Σ IDF(qi) * f(qi, D) * (k1+1) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))

Where:

f(qi, D) — term frequency.
IDF — inverse document frequency.
|D| — document length; avgdl — average doc length.
k1, b — tunable constants (k1=1.2, b=0.75 typical).

BM25 captures: rare terms matter more (IDF), term frequency helps but saturates (k1), shorter docs scoring higher per term (b).

7. Signals on top of BM25

Real ranking blends BM25 with:

Freshness — recent docs ranked higher for some queries.
Popularity — clickthrough rate, view count.
Quality — author authority, trustworthiness, spam score.
Personalisation — user history match.
Engagement — dwell time, return rate.
Geographic / language — match user locale.

Each signal is a feature. Sum/weight them, or learn via LTR (next).

8. Learning to Rank (LTR)

Modern engines (Google, Bing, e-commerce search) use ML to combine signals:

Pointwise — predict relevance score; sort.
Pairwise — train to order pairs correctly.
Listwise — optimise NDCG directly (LambdaMART).

LightGBM's LambdaRank is the default for many production search teams. Fast at serve, easy to interpret.

9. Hybrid search — keyword + vector

Pure keyword fails on semantic mismatch ("how to make my code faster" vs docs titled "optimization techniques"). Pure vector misses exact matches and rare terms.

Hybrid: run both retrievals, fuse results.

Fusion strategies:

Reciprocal Rank Fusion (RRF) — score = Σ 1/(k + rank) across retrievers. Simple, robust.
Weighted score — final = α * keyword + (1-α) * vector. Needs normalisation.
Rerank — feed top-K from both into an LLM/cross-encoder reranker.

OpenSearch, Elasticsearch, Vespa, Pinecone, Weaviate all support hybrid out of the box now.

10. Crawl politely

If you're the search engine:

Respect robots.txt.
Rate limit per host (don't DDoS).
Identify yourself in User-Agent.
Recrawl based on change frequency (popular news = hourly; static docs = weekly).
Discover URLs from sitemaps + link graph.
Deduplicate (canonical URLs, shingle-based near-duplicate detection).

A polite crawl of the web is a massive project. Most teams crawl their own data only.

11. Query understanding

Hard problems:

Spell correction — tesla cybertuck → tesla cybertruck.
Query expansion — add synonyms.
Intent classification — informational / navigational / transactional.
Entity detection — restaurants in tokyo → location filter on tokyo.
Question answering — modern hybrid: search + LLM summarises top results.

Each is a small ML model. Stacked, they 2× search quality without changing the index.

12. Observability

Latency — p50, p95, p99 query time.
Recall — known good docs in top-N for sample queries.
Click-through rate (CTR) — by position; flat curve = bad ranking.
Zero-result rate — too high = expansion / synonyms broken.
Click-to-action rate — quality at the answer level.

A search engine without click metrics is operating blind.

13. Reality check

A search engine for a SaaS product, day-1:

Postgres tsvector or SQLite FTS5 for < 100k docs.
Elasticsearch / OpenSearch for 100k–1B docs.
Vespa / proprietary for > 1B docs or multi-modal.
Hybrid (keyword + vector via embedding) for any modern UX.
Click logging from day 1.
LTR layer when you have > 6 months of click data.

The hardest part isn't the index; it's the evaluation. Build the click-log + relevance-judgement pipeline early.

Link: https://leetcode.com/problems/design-search-autocomplete-system/
Difficulty: Hard
Why this problem: Trie + frequency tracking — the in-memory data structure behind every search-as-you-type box. (We saw this earlier in S09; different lens.)
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Draw the inverted index data structure with delta encoding.
Walk through query lookup: parse → expand → intersect → rank.
Explain BM25 in plain English; state the role of k1, b.
Combine keyword + vector with RRF or weighted score fusion.
Sketch a crawl pipeline respecting robots.txt + rate limits.
Solve design-search-autocomplete-system — trie-based prefix lookup, the autocomplete engine.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Online Learning, Bandits, Counterfactual Evaluation

LLM Safety — Jailbreaks, Prompt Injection, Output Filtering, Red-Teaming