ai ml · intermediate · 12m · 2026-06-09

RAG Part 2 — Retrieval, Re-Ranking, Generation, Evaluation

Session 13 of the 48-session learning series.

Date: Sat, 2026-06-20 · Time: 09:00–11:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 06 · Est. read: 2 h

Why this session matters

This is Session 13 of 48 in the LLMs & Agents track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

Retrieval — dense vs sparse vs hybrid; BM25 + embeddings together
Re-ranking — cross-encoders, ColBERT, LLM-as-reranker
Generation — context construction, system prompt, citation grounding
Evaluation — RAGAS, recall@k, faithfulness, answer relevance
Production patterns — multi-hop, hypothetical document embeddings (HyDE), routing

Deep dive

1. The full RAG pipeline

Session 9 covered ingestion: chunking, embedding, vector store. Now we cover what happens at query time.

user query
   │
   ▼
[query rewrite/expand]  ← optional, big win for short queries
   │
   ▼
[retrieval]            ← dense + sparse, k≈50
   │
   ▼
[re-rank]              ← cross-encoder, k≈5
   │
   ▼
[context construction] ← order, dedupe, fit in token budget
   │
   ▼
[generate]             ← LLM with grounded prompt
   │
   ▼
[post-process]         ← citations, hallucination check

Each stage is independently improvable. Most teams optimise retrieval and never touch the rest, which leaves roughly 30% recall on the table.

2. Dense vs sparse vs hybrid

Dense (embeddings + cosine): great for semantic matches. "How do I cancel?" finds "refund policy" even if the words don't overlap.
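A minimal sketch of what dense retrieval computes, assuming the embeddings already exist as plain lists of floats. The toy 3-d vectors and the names `cosine` and `dense_search` are illustrative assumptions, not a library API; a real system would use an embedding model and an ANN index instead of this brute-force scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dense_search(query_vec, doc_vecs, k=2):
    """Brute-force top-k by cosine similarity. doc_vecs: {doc_id: vector}."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: -item[1])[:k]

# Hand-made vectors standing in for real embeddings (assumption).
doc_vecs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "api_limits": [0.0, 0.2, 0.9],
    "cancel_account": [0.8, 0.3, 0.1],
}
top = dense_search([1.0, 0.2, 0.0], doc_vecs, k=2)
print([doc_id for doc_id, _ in top])
```

The two documents whose vectors point in roughly the same direction as the query come back first, regardless of which words they contain.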

Sparse (BM25): great for lexical matches. Acronyms, model names, exact phrases. BM25 still wins on entity-heavy queries — it's not obsolete.
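To make the lexical side concrete, here is a small sketch of Okapi BM25 scoring in pure Python. It assumes whitespace tokenisation, the common defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))`; the function name and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of docs containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long docs need more hits for the same score.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error E4021 occurs when the billing token expires",
    "how to request a refund for an annual plan",
    "rotate your api token every ninety days",
]
scores = bm25_scores("E4021 token", docs)
```

The rare identifier `E4021` carries a higher IDF than the more common `token`, so the first document scores highest and the document with no query terms scores zero.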

Hybrid: run both and fuse the results. A common fusion method is Reciprocal Rank Fusion, which needs only each retriever's ranking, not comparable scores:

# Reciprocal Rank Fusion (RRF) — k=60 is the canonical constant
def rrf(dense_ranks, sparse_ranks, k=60):
    """Each argument is a list of doc ids, best first. Returns (doc_id, score) pairs."""
    scores = {}
    for ranking in (dense_ranks, sparse_ranks):
        # Ranks are 1-based in the standard formula: score += 1 / (k + rank)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

RRF is parameter-light and robust. It almost always beats either method alone — I've seen +10-15 points of recall@10 vs dense-only on internal benchmarks.
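RRF ignores score magnitudes. When raw scores are available, another widely used option is a weighted sum of min-max-normalised scores. The sketch below is an illustration under assumptions: the `alpha` weight, the function names and the toy scores are all hypothetical, and docs missing from one retriever are simply given 0 on that side.

```python
def min_max(scores):
    """Rescale a non-empty {doc_id: score} dict to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """alpha weights the dense side; docs missing from one list get 0 there."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: -item[1])

dense = {"a": 0.82, "b": 0.80, "c": 0.40}   # cosine similarities (toy values)
sparse = {"b": 12.0, "c": 9.0, "d": 3.0}    # BM25 scores (toy values)
ranked = weighted_fusion(dense, sparse, alpha=0.5)
print([doc_id for doc_id, _ in ranked])
```

Unlike RRF, this needs a tuned `alpha` and is sensitive to outlier scores, which is one reason rank-based fusion is often the safer default.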

3. Re-ranking — bi-encoder vs cross-encoder

The retrieval embedding is a bi-encoder: query and doc encoded separately. Fast (one query embed + ANN search) but loses interaction signal.

A cross-encoder takes [CLS] query [SEP] doc [SEP] jointly and outputs a relevance score. Way more accurate, way slower. Solution: use bi-encoder to fetch top-50, cross-encoder to re-rank to top-5.

Typical stack:

Retrieval: text-embedding-3-large or bge-large-en-v1.5 → top-50.
Re-rank: bge-reranker-large or cohere/rerank-3 → top-5.
Latency: ~20 ms retrieval + ~80 ms rerank @ 50 docs.
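The two-stage shape can be sketched without any model at all. In the block below, `cheap_score` stands in for the bi-encoder stage and `joint_score` for the cross-encoder; both toy scorers (`overlap`, `phrase_bonus`) are assumptions made up for illustration, chosen so that the second stage sees word order while the first does not.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    joint_score: Callable[[str, str], float],
    fetch_k: int = 50,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scorer over the whole corpus, keep fetch_k candidates.
    candidates = sorted(corpus, key=lambda d: -cheap_score(query, d))[:fetch_k]
    # Stage 2: expensive scorer only on the candidates, keep final_k.
    rescored = [(d, joint_score(query, d)) for d in candidates]
    return sorted(rescored, key=lambda item: -item[1])[:final_k]

# Toy stand-ins (assumptions): real systems use an embedding model plus an ANN
# index for stage 1 and a cross-encoder model for stage 2.
def overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_bonus(query: str, doc: str) -> float:
    return overlap(query, doc) + (5.0 if query.lower() in doc.lower() else 0.0)

corpus = [
    "cancel your plan from the billing page",
    "plan your cancel strategy for the billing migration",
    "reset your password",
]
top = retrieve_then_rerank("cancel your plan", corpus, overlap, phrase_bonus,
                           fetch_k=2, final_k=1)
```

Stage 1 cannot tell the first two documents apart (same bag of words); the joint scorer can, which is the interaction signal the bi-encoder loses.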

4. ColBERT — the middle ground

ColBERT stores per-token embeddings instead of one vector per doc. At query time, MaxSim matches each query token to its best-matching doc token and sums those per-token maxima into the doc score. Quality near a cross-encoder, latency near a bi-encoder. Storage cost is ~30× a single-vector index, so it is usually only worth it for high-recall, low-volume corpora.
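A minimal sketch of the MaxSim scoring step, assuming token embeddings are given as lists of floats. The 2-d vectors are toy values; real ColBERT uses learned, normalised vectors and an index built for this lookup.

```python
def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token vector, take its best
    dot product against any doc token vector, then sum over query tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 2-d token embeddings (assumption).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]   # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]               # covers only the first
score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

`doc_a` wins because every query token finds a close match; `doc_b` matches the first token well but has nothing near the second.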

5. Context construction

Once you have 5 chunks, how do you present them?

Order matters. Models attend more to the start and end of the context (lost-in-the-middle, Liu et al. 2023). Put the most relevant chunk first or last, not in the middle.
Dedupe. Chunks that share >70% overlap waste budget. Hash-based or embedding-clustering dedupe.
Header each chunk with metadata: [Source: docs/billing.md, last_modified: 2026-04-12]. Cheap, often free recall.
Budget. Typical prompt: 4K system + 8K retrieved + 1K query + room for answer. Don't blow the budget: it is worth paying some latency upstream (re-ranking, dedupe) to get a shorter prompt.
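The ordering, dedupe and budget rules above can be combined in one small function. This is a sketch under assumptions: word-set Jaccard overlap stands in for the dedupe check, whitespace word count stands in for a real tokenizer, and the names `jaccard` and `build_context` are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two chunk texts, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_context(chunks, budget_tokens, overlap_threshold=0.7):
    """chunks: list of (text, score), any order. Returns texts in prompt order."""
    kept, used = [], 0
    for text, score in sorted(chunks, key=lambda c: -c[1]):
        if any(jaccard(text, k) > overlap_threshold for k in kept):
            continue  # near-duplicate of a higher-scored chunk
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost > budget_tokens:
            continue  # does not fit; a smaller chunk later may still fit
        kept.append(text)
        used += cost
    # kept is best-first. Alternate chunks between the front and the back so
    # the strongest sit at the edges and the weakest land in the middle.
    front, back = [], []
    for i, text in enumerate(kept):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

chunks = [
    ("annual plans renew every twelve months", 0.6),
    ("Refunds are issued within 5 days", 0.9),
    ("refunds are issued within 5 days", 0.8),
    ("contact support via the help page", 0.7),
]
ordered = build_context(chunks, budget_tokens=100)
```

Here the lower-scored duplicate is dropped, the best chunk goes first, the second-best goes last, and the weakest ends up in the middle.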

6. Grounded prompt template

You are a support assistant. Answer ONLY from the context below.
If the answer isn't in the context, say "I don't have that information."
Cite sources inline as [^1], [^2], …

Context:
[^1] {chunk_1.metadata} → {chunk_1.text}
[^2] {chunk_2.metadata} → {chunk_2.text}
…

Question: {user_query}

The "answer ONLY from the context" + "say I don't know" instructions are non-negotiable. Without them you get confident hallucinations.
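One way to render that template in code, as a sketch. The chunk shape (a dict with `metadata` and `text` keys) and the `build_prompt` name are assumptions for illustration; the instruction text is copied from the template above.

```python
SYSTEM = (
    "You are a support assistant. Answer ONLY from the context below.\n"
    'If the answer isn\'t in the context, say "I don\'t have that information."\n'
    "Cite sources inline as [^1], [^2], …"
)

def build_prompt(chunks, user_query):
    """chunks: list of {"metadata": str, "text": str} in final prompt order."""
    lines = [SYSTEM, "", "Context:"]
    # Number chunks from 1 so the markers match the citation format.
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[^{i}] {chunk['metadata']} → {chunk['text']}")
    lines += ["", f"Question: {user_query}"]
    return "\n".join(lines)

chunks = [
    {"metadata": "Source: docs/billing.md", "text": "Refunds are issued within 5 days."},
    {"metadata": "Source: docs/plans.md", "text": "Annual plans renew every 12 months."},
]
prompt = build_prompt(chunks, "How long do refunds take?")
print(prompt)
```

Keeping the citation index tied to chunk position makes it straightforward to map `[^n]` markers in the answer back to sources during post-processing.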

7. Evaluation — RAGAS metrics

RAGAS scores a RAG system on four axes, computed by an LLM-as-judge:

Metric	What it measures
Context precision	Of the retrieved chunks, how many are actually relevant?
Context recall	Of the chunks needed to answer, how many did we retrieve?
Faithfulness	Is every claim in the answer supported by the context?
Answer relevance	Does the answer actually address the question?

You need a small (100–500) labelled eval set: (question, ground-truth answer, optional reference docs). RAGAS uses the LLM to synthesise some of these — start with synthetic, validate on a hand-curated subset.
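Before wiring up an LLM judge, the retrieval side can be checked with plain label-based metrics. The sketch below computes recall@k and precision@k from reference doc ids; it is not the RAGAS API, only the label-based analogue of context recall and context precision, and the eval records are made-up toy data.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved doc ids that are relevant."""
    relevant = set(relevant)
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)

# Toy eval set (assumption): ranked retrieval output plus labelled reference docs.
eval_set = [
    {"retrieved": ["d1", "d7", "d3", "d9"], "relevant": ["d1", "d3"]},
    {"retrieved": ["d4", "d2", "d8", "d5"], "relevant": ["d5", "d6"]},
]
k = 3
mean_recall = sum(recall_at_k(e["retrieved"], e["relevant"], k) for e in eval_set) / len(eval_set)
mean_precision = sum(precision_at_k(e["retrieved"], e["relevant"], k) for e in eval_set) / len(eval_set)
```

The second question shows why k matters: its one retrievable reference doc sits at rank 4, so it counts as a miss at k=3.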

8. HyDE — Hypothetical Document Embeddings

Trick from Gao et al. 2022: have an LLM write a hypothetical answer to the query, embed that, and search. Why? The query is often short and lexically far from the doc; the synthetic answer is doc-shaped.

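def hyde_search(query, llm, index, embed):
    # llm, index and embed are placeholders for your own clients; embed maps text to a vector.
    hypo_answer = llm.complete(f"Write a paragraph that answers: {query}")
    # Search with the embedding of the hypothetical answer, not of the raw query.
    return index.search(embed(hypo_answer), k=10)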

Costs one extra LLM call per query — worth it for short, ambiguous queries.

9. Multi-hop and query routing

For "What did the CEO say about Q3 revenue last earnings call?":

Decompose → ["who is the CEO?", "when was last earnings call?", "what did <CEO> say about Q3?"]
Route → org-chart index, calendar index, transcripts index.
Synthesise → final answer references all three.

LangGraph / LlamaIndex agents do this; or you can hand-code a 50-line orchestrator. The simpler the better — every hop is a chance for the LLM to drift.
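A hand-coded orchestrator for the decompose, route, synthesise steps can be quite small. The sketch below is one possible shape, not how LangGraph or LlamaIndex do it: `decompose`, the per-index lookup callables and `synthesise` are all hypothetical placeholders (in practice the first and last would be LLM calls), and the stub answers are invented toy data.

```python
from typing import Callable, Dict, List, Tuple

Hop = Tuple[str, str, str]  # (index_name, sub_question, answer)

def multi_hop(
    question: str,
    decompose: Callable[[str], List[Tuple[str, str]]],
    indexes: Dict[str, Callable[[str], str]],
    synthesise: Callable[[str, List[Hop]], str],
) -> str:
    """decompose returns (index_name, template) pairs in hop order.
    Templates may use {0}, {1}, ... to refer to earlier hop answers."""
    answers: List[str] = []
    hops: List[Hop] = []
    for index_name, template in decompose(question):
        sub_q = template.format(*answers)      # fill in earlier answers
        answer = indexes[index_name](sub_q)    # route to the chosen index
        answers.append(answer)
        hops.append((index_name, sub_q, answer))
    return synthesise(question, hops)

# Stubs with made-up answers (assumptions), just to show the data flow.
def toy_decompose(_question: str) -> List[Tuple[str, str]]:
    return [
        ("org_chart", "who is the CEO?"),
        ("calendar", "when was the last earnings call?"),
        ("transcripts", "what did {0} say about Q3 on {1}?"),
    ]

indexes = {
    "org_chart": lambda q: "Jane Doe",
    "calendar": lambda q: "2026-05-01",
    "transcripts": lambda q: f"[transcript hit for: {q}]",
}

def toy_synthesise(_question: str, hops: List[Hop]) -> str:
    return " | ".join(answer for _, _, answer in hops)

result = multi_hop("What did the CEO say about Q3 revenue last earnings call?",
                   toy_decompose, indexes, toy_synthesise)
```

Keeping the hop list explicit makes each intermediate answer loggable, which helps when a later hop drifts because an earlier one was wrong.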

10. Production numbers (from a real customer support RAG)

Stage	p50 latency	p99 latency	Failure mode
Embed query	12 ms	40 ms	Embedding API timeout
ANN search	15 ms	60 ms	Index reload during deploy
Cross-encoder re-rank	80 ms	200 ms	OOM on 50-doc batch
LLM generation	1.8 s	6 s	Context window exceeded
End-to-end	2.0 s	6.5 s	—

Quality (human-rated): 87% useful answers, 9% partial, 4% wrong. Most wins came from re-ranking (+8%) and HyDE on short queries (+5%); BM25 hybrid gave +12% recall but only +3% answer quality (re-ranking made up the rest).

Link: https://leetcode.com/problems/merge-k-sorted-lists/
Difficulty: Hard
Why this problem: Min-heap of (val, list-idx); pop, push next from same list. Mirrors merging top-k from N retrievers.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain dense vs sparse vs hybrid retrieval and when each wins.
Implement Reciprocal Rank Fusion in 10 lines.
Describe the bi-encoder vs cross-encoder trade-off; when ColBERT helps.
List the 4 RAGAS metrics and what each catches.
Apply HyDE and explain when it pays back its extra LLM call.
Solve merge-k-sorted-lists — same heap-merge pattern as multi-retriever fusion.
