Day 06 — Retrieval-Augmented Generation (RAG) End-to-End
RAG is the most-shipped LLM pattern in industry today — every internal knowledge bot, support agent and code-search tool is some flavour of it. Knowing chunking…
RAG is conceptually simple — retrieve relevant chunks, stuff them in the prompt, generate — but every layer hides 5 design decisions. Below is a production-grade mental model.
🧠 Concept
Why it matters & the mental model.
1. The four-stage pipeline
RAG looks simple from the outside ("retrieve, stuff, generate") but each stage is a design surface with its own knobs and failure modes.
2. Indexing — chunking decides ceiling
- Fixed-size by tokens (512-1024 with 10-20% overlap): cheap, decent baseline.
- Semantic chunking (split on heading/section, then merge < 256 tokens): better for docs.
- Sliding window with summary header: prepend [Doc: "Q4 ops review"] [Section: "Latency"] to each chunk so it stays self-contained. This is what Anthropic's Contextual Retrieval automates by asking an LLM to write a 1-line context per chunk.
- Hierarchical / parent-child: embed small (better recall on specifics) but feed parent doc to the LLM (better answer quality).
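A minimal sketch of fixed-size chunking with overlap plus a context header. It assumes whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokenizer), and `chunk_with_header` and its parameters are illustrative names, not part of any library.

```python
def chunk_with_header(text, doc_title, section, size=512, overlap=64):
    """Split text into fixed-size windows with overlap and prepend a context header.

    Words stand in for tokens here; swap in a real tokenizer for production use.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    header = f'[Doc: "{doc_title}"] [Section: "{section}"]'
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(f"{header} {' '.join(window)}")
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `size=512` and `overlap=64` the overlap is 12.5%, inside the 10-20% range suggested above.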
3. Embeddings — pick by domain, not by leaderboard
- text-embedding-3-small (OpenAI, 1536d): strong general purpose, cheap.
- bge-large-en-v1.5 / bge-m3: open weights, top of MTEB for English/multilingual.
- cohere-embed-multilingual-v3: best out-of-box multilingual.
- For code: nomic-embed-code or Voyage-code-3.
Domain fine-tuning (contrastive on your own (query, passage) pairs) routinely gives +10-15 pts on recall@10.
4. Hybrid retrieval — BM25 + dense
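Pure dense misses exact-match queries ("error code E_42_PARSE"). Pure BM25 misses paraphrase. Hybrid = run both over the same corpus, then fuse the two ranked lists. Reciprocal Rank Fusion is the standard: each document scores the sum of 1/(60 + rank) over the lists it appears in, so BM25 scores and cosine similarities never have to be put on a common scale. The fused list covers the union of both result sets. Almost free, +5-10 pts recall.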
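Reciprocal Rank Fusion is short enough to write out. The sketch below assumes each retriever returns a list of document ids ordered best-first; the ids and the two hit lists are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d3", "d1", "d7"]   # illustrative BM25 ranking
dense_hits = ["d1", "d4", "d3"]  # illustrative dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['d1', 'd3', 'd4', 'd7']
```

Documents found by both retrievers (d1, d3) rise to the top, while documents found by only one still survive in the fused list.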
🛠 Deep Dive
Internals, code, architecture.
5. Re-ranking — the cheapest big win
Bi-encoders (used for first-pass) embed query and doc independently → fast but coarse. Cross-encoders (e.g. bge-reranker-large, cohere-rerank-3.5) take (query, doc) together and predict a relevance score — much higher fidelity. Take top-50 from vector store, rerank to top-5. Latency ~50-150 ms, quality jump is enormous.
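The retrieve-then-rerank step can be sketched independently of any particular model. Here `score_pair` is a hypothetical callable standing in for a cross-encoder, and `toy_overlap_score` exists only to make the example runnable; in practice you would plug in a real reranker's scoring call.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-score (query, doc) pairs jointly and keep the top_n best candidates.

    candidates: list of (doc_id, text) from the first-pass retriever.
    score_pair: callable (query, text) -> float; a stand-in for a cross-encoder.
    """
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


def toy_overlap_score(query, text):
    """Demo-only scorer: fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    return len(q_words & set(text.lower().split())) / max(len(q_words), 1)
```

The scorer sees the query and the document together, which is the property that separates a cross-encoder from the bi-encoder used in the first pass.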
6. Query understanding
- HyDE: ask the LLM to write a hypothetical answer first; embed that (it's closer to documents than the question).
- Multi-query: rewrite the question into 3-5 variants, retrieve for each, fuse.
- Decomposition: break compound questions into sub-questions, RAG each, then combine.
- Routing: classify the query and pick the right index (docs vs code vs tickets).
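As one example from the list above, HyDE can be sketched in a few lines. `generate` (an LLM call) and `embed` (an embedding call) are caller-supplied placeholders, and the in-memory `index` of (doc_id, vector) pairs stands in for a vector store; all of these names are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hyde_search(question, index, generate, embed, k=3):
    """HyDE: embed a hypothetical answer instead of the raw question.

    index: list of (doc_id, vector); generate and embed are caller-supplied callables.
    """
    hypothetical = generate(f"Write a short passage that answers: {question}")
    query_vec = embed(hypothetical)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

The only change from plain dense retrieval is which text gets embedded, so HyDE slots into an existing pipeline at the cost of one extra LLM call per query.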
7. Generation — keep the LLM honest
- Always pass citations (chunk ids) in the prompt and ask the model to cite by id.
- Use structured output (JSON schema) for downstream consumers.
- Set temperature=0.2-0.4 for factual QA, higher only when style matters.
- Refuse to answer when retrieved scores are all below a threshold ("I don't know" beats hallucination).
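The citation and refusal rules above can be combined in one prompt builder. This is a sketch: the shape of `hits` and the `min_score=0.3` default are assumptions, and a useful threshold depends on the retriever's score scale.

```python
def build_grounded_prompt(question, hits, min_score=0.3):
    """Return a citation-ready prompt, or None when no hit clears min_score.

    hits: list of dicts with "id", "text" and "score" keys (hypothetical shape).
    """
    usable = [h for h in hits if h["score"] >= min_score]
    if not usable:
        return None  # caller should reply "I don't know" instead of calling the LLM
    context = "\n".join(f'[{h["id"]}] {h["text"]}' for h in usable)
    return (
        "Answer using only the sources below and cite chunk ids in square brackets.\n"
        "If the sources do not contain the answer, say \"I don't know\".\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```

Returning `None` keeps the refusal decision outside the model, so a weak retrieval never reaches the generation step.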
8. Evaluation — without it you're guessing
Two layers:
- Retrieval: recall@k, MRR, nDCG against a gold (query, relevant_chunk_id) set. 50 hand-curated pairs is enough to start.
- Answer: RAGAS (faithfulness, answer relevance, context precision/recall) or LLM-as-judge with a rubric. Pair with 10-20 human-rated examples to calibrate the judge.
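The retrieval layer needs very little code. A minimal harness for recall@k and MRR, assuming `gold` is a list of (query, relevant_chunk_ids) pairs and `retrieve` is whatever function returns ranked chunk ids for a query (both names are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(gold, retrieve, k=5):
    """gold: list of (query, relevant_ids); retrieve: callable query -> ranked ids."""
    if not gold:
        raise ValueError("gold set is empty")
    runs = [(retrieve(query), relevant) for query, relevant in gold]
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Running this on every chunking or retriever change turns "it feels better" into a number you can compare across runs.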
🚀 In Practice
Trade-offs, exercises, what to ship today.
9. Failure modes & fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Off-topic answers | Chunks too big, relevant text diluted | Smaller chunks + re-rank |
| Misses obvious matches | Pure dense | Add BM25 hybrid |
| Hallucinated facts | Low retrieval score, model still answered | Threshold + "I don't know" path |
| Right doc, wrong chunk | No section context | Contextual headers |
| Stale answers | No freshness | Time decay on score, periodic re-index |
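The time-decay fix in the last row can be as simple as an exponential half-life applied to the retrieval score. The 90-day default is an arbitrary assumption, and the sketch assumes non-negative scores where higher means more relevant.

```python
def decayed_score(score, age_days, half_life_days=90.0):
    """Exponentially down-weight a retrieval score: it halves every half_life_days."""
    return score * 0.5 ** (max(age_days, 0.0) / half_life_days)
```

A chunk that is one half-life old keeps half its score, so a fresh chunk with a slightly lower raw score can outrank a stale one.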
10. Production checklist
- Re-embed on schema change (track embedding_model_version).
- Idempotent ingestion (hash chunk → upsert).
- Observability: log query, top-k ids+scores, final answer, latency budget per stage.
- Cost model: embeddings are cheap, reranker and generation are not — budget by p95.
- Security: per-tenant namespaces in vector store, row-level filters on retrieval.
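The first two checklist items fit together: keying each chunk by a hash of its text plus the embedding model version makes re-ingestion a no-op and forces a re-embed when the version changes. In this sketch a plain dict stands in for the vector store, and `EMBEDDING_MODEL_VERSION`, `chunk_key` and `ingest` are hypothetical names.

```python
import hashlib

EMBEDDING_MODEL_VERSION = "example-embedder-v1"  # hypothetical version tag

def chunk_key(text, model_version=EMBEDDING_MODEL_VERSION):
    """Stable id: same text + same embedding model version -> same key."""
    digest = hashlib.sha256(f"{model_version}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]

def ingest(chunks, store, embed, model_version=EMBEDDING_MODEL_VERSION):
    """Upsert chunks into store (a dict standing in for a vector DB); return new-write count."""
    written = 0
    for text in chunks:
        key = chunk_key(text, model_version)
        if key in store:
            continue  # already embedded with this model version, skip the embedding call
        store[key] = {
            "text": text,
            "vector": embed(text),
            "embedding_model_version": model_version,
        }
        written += 1
    return written
```

Running the same ingestion job twice writes nothing the second time, which is what makes retries and backfills safe.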
11. Beyond RAG
- Agentic RAG: model decides when to retrieve, with what query, and re-tries.
- GraphRAG: build a knowledge graph from docs and traverse it for multi-hop questions.
- Long-context: feed the whole doc to a 1M-token model and skip retrieval — viable for small corpora.
12. What to take away
"Walk me through your retrieval stack." Strong answers cover: chunking strategy, hybrid retrieval, rerank, eval harness, and one failure mode you fixed.
Resources
- 🎥 Pinecone — RAG Deep Dive (James Briggs)
- 📖 Anthropic — Contextual Retrieval
- 📖 LangChain RAG From Scratch series (with notebooks)
- 📖 Pinecone — RAG handbook
Practice Problem: Top K Frequent Elements (Medium)