RAG Part 2 — Retrieval, Re-Ranking, Generation, Evaluation
Session 13 of the 48-session learning series.
Date: Sat, 2026-06-20 · Time: 09:00–11:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 06 · Est. read: 2 h
Why this session matters
This is Session 13 of 48 in the LLMs & Agents track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- Retrieval — dense vs sparse vs hybrid; BM25 + embeddings together
- Re-ranking — cross-encoders, ColBERT, LLM-as-reranker
- Generation — context construction, system prompt, citation grounding
- Evaluation — RAGAS, recall@k, faithfulness, answer relevance
- Production patterns — multi-hop, hypothetical document embeddings (HyDE), routing
Pre-read (skim before the session)
- Retrieval-Augmented Generation (Lewis et al., 2020)
- RAGAS — Automated Evaluation of RAG
- Pinecone — Hybrid Search
- LlamaIndex — Re-ranking Guide
Deep dive
1. The full RAG pipeline
Session 9 covered ingestion: chunking, embedding, vector store. Now we cover what happens at query time.
user query
│
▼
[query rewrite/expand] ← optional, big win for short queries
│
▼
[retrieval] ← dense + sparse, k≈50
│
▼
[re-rank] ← cross-encoder, k≈5
│
▼
[context construction] ← order, dedupe, fit in token budget
│
▼
[generate] ← LLM with grounded prompt
│
▼
[post-process] ← citations, hallucination check
Each stage is independently improvable. Most teams optimise retrieval and never touch the rest — you leave 30% recall on the table.
2. Dense vs sparse vs hybrid
Dense (embeddings + cosine): great for semantic matches. "How do I cancel?" finds "refund policy" even if the words don't overlap.
Sparse (BM25): great for lexical matches. Acronyms, model names, exact phrases. BM25 still wins on entity-heavy queries — it's not obsolete.
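To make the lexical-match behaviour concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenised documents. The function name, the toy corpus, and the defaults k1=1.5 and b=0.75 are illustrative assumptions, and whitespace splitting stands in for a real tokenizer; a production system would use its search engine's built-in BM25 rather than this.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenised doc against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no lexical overlap, no contribution
            # Smoothed IDF (the "+1" variant keeps it non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical); whitespace split stands in for a tokenizer.
docs = [
    "refund policy for annual plans".split(),
    "how to reset your password".split(),
    "gpt-4o rate limits and quotas".split(),
]
scores = bm25_scores("gpt-4o quotas".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the third document shares tokens with the query, so it is the only one with a non-zero score. That is exactly the exact-match behaviour that makes BM25 strong on model names and acronyms.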
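Hybrid: run both, then fuse the results. Two common fusion methods are a weighted sum of normalised scores and Reciprocal Rank Fusion (RRF), which needs only ranks: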
# Reciprocal Rank Fusion (RRF): k=60 is the canonical constant
def rrf(dense_ranks, sparse_ranks, k=60):
    """Fuse two ranked lists of doc IDs (best first) into one ranking."""
    scores = {}
    for ranking in (dense_ranks, sparse_ranks):
        # Ranks are 1-based in the standard formula: score += 1 / (k + rank)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])
RRF is parameter-light and robust. It almost always beats either method alone — I've seen +10-15 points of recall@10 vs dense-only on internal benchmarks.
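Besides RRF, the other fusion method you will commonly meet is a weighted sum of normalised scores. The sketch below assumes each retriever returns a {doc_id: score} dict; the function names, the alpha weight, and the toy scores are all hypothetical. Min-max normalisation is one choice among several, and it is needed because cosine similarities and BM25 scores live on different scales.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """alpha weights the dense retriever; (1 - alpha) weights the sparse one."""
    dense, sparse = min_max(dense_scores), min_max(sparse_scores)
    fused = {
        # A doc missing from one retriever contributes 0 from that side.
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in dense.keys() | sparse.keys()
    }
    return sorted(fused.items(), key=lambda item: -item[1])

# Toy scores (hypothetical values).
dense_scores = {"a": 0.82, "b": 0.79, "c": 0.40}   # cosine similarities
sparse_scores = {"b": 12.1, "c": 9.3, "d": 2.0}    # BM25 scores
ranking = weighted_fusion(dense_scores, sparse_scores, alpha=0.5)
ranked_ids = [doc_id for doc_id, _ in ranking]
```

Document "b" wins because both retrievers rank it highly. The trade-off against RRF is that alpha has to be tuned on your own eval set, whereas RRF works from ranks alone.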
3. Re-ranking — bi-encoder vs cross-encoder
The retrieval embedding is a bi-encoder: query and doc encoded separately. Fast (one query embed + ANN search) but loses interaction signal.
A cross-encoder takes [CLS] query [SEP] doc [SEP] jointly and outputs a relevance score. Way more accurate, way slower. Solution: use bi-encoder to fetch top-50, cross-encoder to re-rank to top-5.
Typical stack:
- Retrieval: text-embedding-3-large or bge-large-en-v1.5 → top-50.
- Re-rank: bge-reranker-large or cohere/rerank-3 → top-5.
- Latency: ~20 ms retrieval + ~80 ms rerank @ 50 docs.
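The fetch-wide-then-re-rank pattern can be sketched without committing to any model. In the sketch below, cheap_score stands in for the bi-encoder + ANN stage and pair_score stands in for the cross-encoder; both scorers, the function names, and the toy documents are hypothetical stand-ins, not real model calls.

```python
def retrieve_then_rerank(query, candidates, cheap_score, pair_score,
                         fetch_k=50, final_k=5):
    """Two-stage ranking: a cheap scorer narrows, an expensive scorer orders."""
    # Stage 1: stands in for the bi-encoder + ANN search over the whole corpus.
    shortlist = sorted(candidates, key=lambda d: cheap_score(query, d),
                       reverse=True)[:fetch_k]
    # Stage 2: stands in for the cross-encoder; runs once per (query, doc)
    # pair, which is why it only ever sees the shortlist.
    reranked = sorted(shortlist, key=lambda d: pair_score(query, d), reverse=True)
    return reranked[:final_k]

def word_overlap(query, doc):
    """Toy first-stage scorer: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_match(query, doc):
    """Toy second-stage scorer: also rewards the query as an exact phrase."""
    return word_overlap(query, doc) + (10 if query.lower() in doc.lower() else 0)

docs = [
    "cancel your subscription from the billing page",
    "subscription plans and how to cancel a trial",
    "to cancel subscription renewals open settings",
    "reset your password",
]
top = retrieve_then_rerank("cancel subscription", docs, word_overlap,
                           phrase_match, fetch_k=3, final_k=1)
```

The first three documents tie on word overlap, so the first stage cannot separate them; only the second-stage scorer, which looks at query and document together, picks out the exact-phrase match.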
4. ColBERT — the middle ground
ColBERT stores per-token embeddings instead of one vector per doc. At query time, MaxSim matches each query token to its most similar doc token and sums those maxima into the doc's score. Quality near a cross-encoder, latency near a bi-encoder. Storage cost is ~30× a single-vector index, so it is usually only worth it for high-recall, low-volume corpora.
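The MaxSim operator itself is tiny. A minimal sketch in plain Python, assuming token embeddings are already L2-normalised so a dot product equals cosine similarity; the 2-d vectors are made-up toy values, not real ColBERT output.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token take its best-matching
    doc token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-d token embeddings (hypothetical, already unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # covers both query tokens
doc_b = [[1.0, 0.0], [0.8, 0.6]]               # covers only the first well

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

doc_a scores higher because every query token finds a close match. The per-token doc vectors are what drive the storage cost: the index holds one vector per token rather than one per document.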
5. Context construction
Once you have 5 chunks, how do you present them?
- Order matters. Models attend more to the start and end of the context (lost-in-the-middle, Liu et al. 2023). Put the most relevant chunk first or last, not in the middle.
- Dedupe. Chunks that share >70% overlap waste budget. Hash-based or embedding-clustering dedupe.
- Header each chunk with metadata: [Source: docs/billing.md, last_modified: 2026-04-12]. Cheap, often free recall.
- Budget. Typical prompt: 4K system + 8K retrieved + 1K query + room for answer. Don't blow the budget: every extra token adds latency, so prefer shorter prompts.
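The ordering, dedupe, and budget rules above can be combined in one small function. This is a sketch under stated assumptions: word-set Jaccard overlap stands in for the hash-based or embedding-based dedupe, a whitespace word count stands in for a real tokenizer, and the function names and sample chunks are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two chunks, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def build_context(ranked_chunks, token_budget, max_overlap=0.7,
                  count_tokens=lambda text: len(text.split())):
    """ranked_chunks: chunk texts, most relevant first."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        if any(jaccard(chunk, k) > max_overlap for k in kept):
            continue                      # near-duplicate of a better chunk
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue                      # would blow the budget; try the next
        kept.append(chunk)
        used += cost
    # Lost-in-the-middle: alternate chunks between the front and the back so
    # the strongest sit at the edges and the weakest end up in the middle.
    front, back = kept[0::2], kept[1::2]
    return front + back[::-1]

chunks = [
    "refunds are issued within 5 days",
    "refunds are issued within 5 business days",   # near-duplicate
    "cancel from the billing page",
    "annual plans renew every year",
    "contact support by email",                    # does not fit the budget
]
ordered = build_context(chunks, token_budget=16)
```

Here the second chunk is dropped as a near-duplicate, the last one no longer fits, and the three survivors are arranged best-first, second-best-last.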
6. Grounded prompt template
You are a support assistant. Answer ONLY from the context below.
If the answer isn't in the context, say "I don't have that information."
Cite sources inline as [^1], [^2], …
Context:
[^1] {chunk_1.metadata} → {chunk_1.text}
[^2] {chunk_2.metadata} → {chunk_2.text}
…
Question: {user_query}
The "answer ONLY from the context" + "say I don't know" instructions are non-negotiable. Without them you get confident hallucinations.
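Filling the template and checking the citations that come back can be sketched as follows. The chunk dict shape, both function names, and the sample chunk text are assumptions for illustration. The citation check only verifies that every [^n] marker points at a chunk that was actually supplied; it does not verify that the cited chunk supports the claim.

```python
import re

def build_grounded_prompt(user_query, chunks):
    """chunks: list of {"metadata": str, "text": str} dicts, best first."""
    context_lines = [
        f"[^{i}] {chunk['metadata']} → {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join([
        "You are a support assistant. Answer ONLY from the context below.",
        'If the answer isn\'t in the context, say "I don\'t have that information."',
        "Cite sources inline as [^1], [^2], …",
        "Context:",
        *context_lines,
        f"Question: {user_query}",
    ])

def invalid_citations(answer, n_chunks):
    """Return cited numbers that do not correspond to any supplied chunk."""
    cited = {int(m) for m in re.findall(r"\[\^(\d+)\]", answer)}
    return sorted(c for c in cited if not 1 <= c <= n_chunks)

chunks = [
    {"metadata": "[Source: docs/billing.md, last_modified: 2026-04-12]",
     "text": "Cancel from the billing page."},   # hypothetical chunk text
]
prompt = build_grounded_prompt("How do I cancel?", chunks)
bad = invalid_citations("Open the billing page [^1], then email us [^3].", len(chunks))
```

A non-empty result from invalid_citations is a cheap signal that the model cited something it was never given, which is worth logging or retrying on.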
7. Evaluation — RAGAS metrics
RAGAS scores a RAG system on four axes, computed by an LLM-as-judge:
| Metric | What it measures |
|---|---|
| Context precision | Of the retrieved chunks, how many are actually relevant? |
| Context recall | Of the chunks needed to answer, how many did we retrieve? |
| Faithfulness | Is every claim in the answer supported by the context? |
| Answer relevance | Does the answer actually address the question? |
You need a small (100–500) labelled eval set: (question, ground-truth answer, optional reference docs). RAGAS uses the LLM to synthesise some of these — start with synthetic, validate on a hand-curated subset.
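When the eval set includes reference docs, the retrieval side can also be scored directly from labels, with no LLM judge. The sketch below computes recall@k and precision@k; these are simple label-based counterparts to context recall and context precision, not RAGAS's own implementations, and the eval rows and doc IDs are made up.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant_ids)
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

# Hypothetical eval rows: retriever output vs hand-labelled reference docs.
eval_set = [
    {"retrieved": ["d3", "d7", "d1", "d9"], "relevant": ["d1", "d2"]},
    {"retrieved": ["d4", "d5", "d6", "d8"], "relevant": ["d4"]},
]
k = 3
mean_recall = sum(recall_at_k(r["retrieved"], r["relevant"], k)
                  for r in eval_set) / len(eval_set)
mean_precision = sum(precision_at_k(r["retrieved"], r["relevant"], k)
                     for r in eval_set) / len(eval_set)
```

Tracking these per pipeline stage (after retrieval, after re-ranking) shows which stage is losing the relevant docs; faithfulness and answer relevance still need a judge, human or LLM.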
8. HyDE — Hypothetical Document Embeddings
Trick from Gao et al. 2022: have an LLM write a hypothetical answer to the query, embed that, and search. Why? The query is often short and lexically far from the doc; the synthetic answer is doc-shaped.
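def hyde_search(query, llm, index, embed, k=10):
    # llm, index, and embed are whatever clients your stack provides.
    # Embed the LLM's hypothetical answer instead of the raw query.
    hypo_answer = llm.complete(f"Write a paragraph that answers: {query}")
    return index.search(embed(hypo_answer), k=k)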
Costs one extra LLM call per query — worth it for short, ambiguous queries.
9. Multi-hop and query routing
For "What did the CEO say about Q3 revenue last earnings call?":
- Decompose → ["who is the CEO?", "when was last earnings call?", "what did <CEO> say about Q3?"]
- Route → org-chart index, calendar index, transcripts index.
- Synthesise → final answer references all three.
LangGraph / LlamaIndex agents do this; or you can hand-code a 50-line orchestrator. The simpler the better — every hop is a chance for the LLM to drift.
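A hand-coded orchestrator for the decompose, route, synthesise steps can be very small. In this sketch every callable is a hypothetical stand-in: in a real system decompose, route, and synthesise would usually be LLM calls and each index entry a search client. The stubs return placeholder strings only to show the control flow.

```python
def answer_multi_hop(question, decompose, route, indexes, synthesise):
    """Run one retrieval per sub-question, then combine the findings."""
    findings = []                              # (sub_question, index_name, evidence)
    for sub_question in decompose(question):
        index_name = route(sub_question)       # pick one index per hop
        # Earlier findings are passed along so a later hop can use them,
        # e.g. to fill in "<CEO>" once the first hop has resolved it.
        evidence = indexes[index_name](sub_question, findings)
        findings.append((sub_question, index_name, evidence))
    return synthesise(question, findings)

# Stubs standing in for LLM calls and search clients (all hypothetical).
def decompose(question):
    return ["who is the CEO?", "when was last earnings call?",
            "what did <CEO> say about Q3?"]

def route(sub_question):
    if sub_question.startswith("who"):
        return "org_chart"
    if sub_question.startswith("when"):
        return "calendar"
    return "transcripts"

indexes = {
    "org_chart": lambda q, prior: "placeholder: org-chart hit",
    "calendar": lambda q, prior: "placeholder: calendar hit",
    "transcripts": lambda q, prior: f"placeholder: transcript hit using {len(prior)} earlier findings",
}

def synthesise(question, findings):
    # A real implementation would prompt an LLM with the findings; here we
    # just report which indexes contributed.
    return [index_name for _, index_name, _ in findings]

sources_used = answer_multi_hop(
    "What did the CEO say about Q3 revenue last earnings call?",
    decompose, route, indexes, synthesise,
)
```

Keeping the loop this plain makes each hop easy to log and test in isolation, which matters because every hop is a place where the answer can drift.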
10. Production numbers (from a real customer support RAG)
| Stage | p50 latency | p99 latency | Failure mode |
|---|---|---|---|
| Embed query | 12 ms | 40 ms | Embedding API timeout |
| ANN search | 15 ms | 60 ms | Index reload during deploy |
| Cross-encoder re-rank | 80 ms | 200 ms | OOM on 50-doc batch |
| LLM generation | 1.8 s | 6 s | Context window exceeded |
| End-to-end | 2.0 s | 6.5 s | — |
Quality (human-rated): 87% useful answers, 9% partial, 4% wrong. Most wins came from re-ranking (+8%) and HyDE on short queries (+5%); BM25 hybrid gave +12% recall but only +3% answer quality, because re-ranking had already recovered most of that gap.
Reading material
In-depth research material
- HyDE — Precise Zero-Shot Dense Retrieval
- Cohere Rerank documentation
- LangChain Advanced RAG cookbook
- Anthropic — Contextual Retrieval
Video reference
▶︎ LlamaIndex — Advanced RAG Techniques
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Merge K Sorted Lists
- Link: https://leetcode.com/problems/merge-k-sorted-lists/
- Difficulty: Hard
- Why this problem: Min-heap of (val, list-idx); pop, push next from same list. Mirrors merging top-k from N retrievers.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain dense vs sparse vs hybrid retrieval and when each wins.
- Implement Reciprocal Rank Fusion in 10 lines.
- Describe the bi-encoder vs cross-encoder trade-off; when ColBERT helps.
- List the 4 RAGAS metrics and what each catches.
- Apply HyDE and explain when it pays back its extra LLM call.
- Solve merge-k-sorted-lists: same heap-merge pattern as multi-retriever fusion.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.