ai ml | advanced | 12m | 2026-06-04

Day 06 — Retrieval-Augmented Generation (RAG) End-to-End

RAG is the most-shipped LLM pattern in industry today — every internal knowledge bot, support agent and code-search tool is some flavour of it. Knowing chunking…

RAG is conceptually simple (retrieve relevant chunks, stuff them into the prompt, generate), but every layer hides several design decisions. Below is a production-grade mental model.

1. The four-stage pipeline

Each stage is a design surface with its own knobs and failure modes.

2. Indexing — chunking decides ceiling

Fixed-size by tokens (512-1024 with 10-20% overlap): cheap, decent baseline.
Semantic chunking (split on heading/section, then merge sections shorter than 256 tokens): better for docs.
Sliding window with summary header: prepend [Doc: "Q4 ops review"] [Section: "Latency"] to each chunk so it stays self-contained. Anthropic's Contextual Retrieval automates this by asking an LLM to write a short context for each chunk.
Hierarchical / parent-child: embed small chunks (better recall on specifics) but feed the parent document to the LLM (better answer quality).
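A minimal sketch of the fixed-size baseline and the header trick described above. It assumes the text is already tokenized (whitespace tokens stand in for real tokenizer tokens here), and the function names chunk_tokens and with_header are hypothetical, chosen only for illustration.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def with_header(chunk: List[str], doc_title: str, section: str) -> str:
    """Prepend a document/section header so the chunk stays self-contained."""
    return f'[Doc: "{doc_title}"] [Section: "{section}"] ' + " ".join(chunk)


if __name__ == "__main__":
    words = "p95 latency rose after the cache change in week two".split()
    for piece in chunk_tokens(words, size=4, overlap=1):
        print(with_header(piece, "Q4 ops review", "Latency"))
```

With size=512 and overlap=64 the overlap is 12.5%, inside the 10-20% range mentioned above. In a real pipeline the token list would come from the same tokenizer the embedding model uses.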

3. Embeddings — pick by domain, not by leaderboard

text-embedding-3-small (OpenAI, 1536d) — strong general purpose, cheap.
bge-large-en-v1.5 / bge-m3 — open weights, strong MTEB results for English (bge-large-en-v1.5) and multilingual (bge-m3).
cohere-embed-multilingual-v3 — best out-of-box multilingual.
For code: nomic-embed-code or Voyage-code-3. Domain fine-tuning (contrastive on your own (query, passage) pairs) routinely gives +10-15 pts on recall@10.

4. Hybrid retrieval — BM25 + dense

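Pure dense retrieval misses exact-match queries ("error code E_42_PARSE"). Pure BM25 misses paraphrase. Hybrid = run both, then fuse the two ranked lists. Reciprocal Rank Fusion is the standard: it works on ranks rather than raw scores, so no score normalisation is needed, and each document's fused score is the sum of 1/(60 + rank) over the lists it appears in. Take the top of the fused list. Almost free, typically +5-10 pts recall.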
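Reciprocal Rank Fusion is small enough to write out in full. The sketch below assumes each retriever returns an ordered list of document ids, best first; the function name rrf_fuse and the example ids are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. Ties are broken by id to keep output stable.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]   # lexical ranking
    dense_hits = ["b", "d", "a"]  # embedding ranking
    print(rrf_fuse([bm25_hits, dense_hits]))  # ['b', 'a', 'd', 'c']
```

Documents found by both retrievers ("a" and "b") rise above those found by only one, which is the behaviour hybrid retrieval is after.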

🛠 Deep Dive

Internals, code, architecture.

5. Re-ranking — the cheapest big win

Bi-encoders (used for first-pass) embed query and doc independently → fast but coarse. Cross-encoders (e.g. bge-reranker-large, cohere-rerank-3.5) take (query, doc) together and predict a relevance score — much higher fidelity. Take top-50 from vector store, rerank to top-5. Latency ~50-150 ms, quality jump is enormous.
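The retrieve-then-rerank step can be sketched independently of any model. Below, score_pair is a hypothetical callable standing in for a cross-encoder that scores a (query, doc) pair, and overlap_score is a deliberately crude toy scorer so the example runs without model weights; neither is a real library API.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair jointly and keep the top_n best docs."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    first_pass = [
        "billing faq",
        "error code E_42_PARSE means parse failure",
        "parse tips",
    ]
    for doc, score in rerank("parse error code", first_pass, overlap_score, top_n=2):
        print(f"{score:.2f}  {doc}")
```

In practice candidates would be the top-50 from the vector store and score_pair would wrap a real reranker model; the surrounding control flow stays the same.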

6. Query understanding

HyDE: ask the LLM to write a hypothetical answer first; embed that (it's closer to documents than the question).
Multi-query: rewrite the question into 3-5 variants, retrieve for each, fuse.
Decomposition: break compound questions into sub-questions, RAG each, then combine.
Routing: classify the query and pick the right index (docs vs code vs tickets).
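As one illustration of these techniques, multi-query retrieval can be sketched with two injected callables: rewrite (which would call an LLM to produce query variants) and retrieve (which would hit the index). Both are hypothetical placeholders, and fusing with reciprocal ranks is one reasonable choice among several.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    rewrite: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    k: int = 60,
    top_n: int = 5,
) -> List[str]:
    """Retrieve for the question and each rewritten variant, then fuse by rank."""
    queries = [question] + [q for q in rewrite(question) if q != question]
    scores: Dict[str, float] = {}
    for query in queries:
        for rank, doc_id in enumerate(retrieve(query), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


if __name__ == "__main__":
    # Fake index and fake rewriter so the sketch runs without an LLM.
    fake_index = {
        "reset password": ["d1", "d2"],
        "forgot login": ["d2", "d3"],
        "recover account": ["d2", "d1"],
    }
    variants = lambda q: ["forgot login", "recover account"]
    lookup = lambda q: fake_index.get(q, [])
    print(multi_query_retrieve("reset password", variants, lookup))  # ['d2', 'd1', 'd3']
```

HyDE fits the same shape: replace rewrite with a function that returns a single hypothetical answer and retrieve on that text instead of the question.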

7. Generation — keep the LLM honest

Always pass citations (chunk ids) in the prompt and ask the model to cite by id.
Use structured output (JSON schema) for downstream consumers.
Set temperature=0.2-0.4 for factual QA, higher only when style matters.
Refuse to answer when retrieved scores are all below a threshold ("I don't know" beats hallucination).
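The citation and refusal rules above can be combined in the prompt-building step. This is a sketch under assumptions: the Hit type, the build_prompt name, the prompt wording, and the 0.35 threshold are all illustrative, and a real threshold would need tuning against your own retriever's score distribution.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    chunk_id: str
    text: str
    score: float  # retrieval or reranker score, higher is better


def build_prompt(question: str, hits: List[Hit], min_score: float = 0.35) -> Optional[str]:
    """Return a cited-sources prompt, or None when no hit clears the threshold."""
    usable = [h for h in hits if h.score >= min_score]
    if not usable:
        return None  # caller should answer "I don't know" without calling the LLM
    context = "\n".join(f"[{h.chunk_id}] {h.text}" for h in usable)
    return (
        "Answer using only the sources below. "
        "Cite the chunk id in square brackets after each claim. "
        "If the sources do not contain the answer, say \"I don't know\".\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    hits = [Hit("c1", "E_42_PARSE means the config file failed to parse.", 0.82),
            Hit("c7", "Billing runs on the first of the month.", 0.12)]
    prompt = build_prompt("What does E_42_PARSE mean?", hits)
    print(prompt if prompt is not None else "I don't know")
```

Returning None before generation keeps the refusal path cheap: no LLM call is made when retrieval found nothing usable.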

8. Evaluation — without it you're guessing

Two layers:

Retrieval: recall@k, MRR, nDCG against a gold (query, relevant_chunk_id) set. 50 hand-curated pairs are enough to start.
Answer: RAGAS (faithfulness, answer relevance, context precision/recall) or LLM-as-judge with a rubric. Pair with 10-20 human-rated examples to calibrate the judge.
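The retrieval metrics named above are short enough to implement directly. The sketch assumes binary relevance (a chunk is either in the gold set or not) and a ranked list without duplicate ids; function names are illustrative.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


if __name__ == "__main__":
    ranked = ["c3", "c1", "c9", "c2"]
    gold = {"c1", "c2"}
    print(recall_at_k(ranked, gold, 3), reciprocal_rank(ranked, gold),
          round(ndcg_at_k(ranked, gold, 4), 3))
```

MRR is the mean of reciprocal_rank over all queries in the gold set; recall@k and nDCG@k are averaged the same way.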

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Failure modes & fixes

Symptom	Likely cause	Fix
Off-topic answers	Chunks too big, dilute query	Smaller chunks + re-rank
Misses obvious matches	Pure dense	Add BM25 hybrid
Hallucinated facts	Low retrieval score, model still answered	Threshold + "I don't know" path
Right doc, wrong chunk	No section context	Contextual headers
Stale answers	No freshness	Time decay on score, periodic re-index
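The "time decay on score" fix in the last row can take several forms. The sketch below assumes exponential decay with a configurable half-life, which is one common choice and not something the table prescribes; the 90-day default is arbitrary.

```python
from typing import List, Tuple


def decayed_score(score: float, age_days: float, half_life_days: float = 90.0) -> float:
    """Halve the relevance score for every half_life_days of document age."""
    return score * 0.5 ** (max(age_days, 0.0) / half_life_days)


def rank_with_freshness(
    hits: List[Tuple[str, float, float]], half_life_days: float = 90.0
) -> List[str]:
    """hits are (doc_id, score, age_days); return ids ordered by decayed score."""
    ordered = sorted(
        hits,
        key=lambda h: decayed_score(h[1], h[2], half_life_days),
        reverse=True,
    )
    return [doc_id for doc_id, _, _ in ordered]


if __name__ == "__main__":
    hits = [("old_runbook", 0.9, 360.0), ("new_runbook", 0.7, 0.0)]
    print(rank_with_freshness(hits))  # ['new_runbook', 'old_runbook']
```

A short half-life suits fast-moving content such as tickets; reference material that rarely changes may need a much longer one, or no decay at all.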

10. Production checklist

Re-embed when the embedding model or the chunk schema changes (track embedding_model_version).
Idempotent ingestion (hash chunk → upsert).
Observability: log query, top-k ids+scores, final answer, latency budget per stage.
Cost model: embeddings are cheap, reranker and generation are not — budget by p95.
Security: per-tenant namespaces in vector store, row-level filters on retrieval.
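The first two checklist items can be sketched together: hash each chunk together with the embedding model version, and upsert under that key. The in-memory dict below is a stand-in for a vector store, and embed is a hypothetical callable; a real store would expose its own upsert call.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_key(text: str, embedding_model_version: str) -> str:
    """Stable id: same text and same model version always give the same key."""
    payload = f"{embedding_model_version}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def ingest(
    chunks: List[str],
    store: Dict[str, dict],
    embed: Callable[[str], List[float]],
    embedding_model_version: str,
) -> int:
    """Upsert chunks keyed by content hash; return how many were embedded."""
    written = 0
    for text in chunks:
        key = chunk_key(text, embedding_model_version)
        if key in store:
            continue  # already ingested with this model version, skip the embed call
        store[key] = {
            "text": text,
            "vector": embed(text),
            "embedding_model_version": embedding_model_version,
        }
        written += 1
    return written


if __name__ == "__main__":
    store: Dict[str, dict] = {}
    fake_embed = lambda text: [float(len(text))]  # placeholder, not a real model
    print(ingest(["alpha", "beta"], store, fake_embed, "v1"))  # 2
    print(ingest(["alpha", "beta"], store, fake_embed, "v1"))  # 0, re-run is a no-op
    print(ingest(["alpha", "beta"], store, fake_embed, "v2"))  # 2, new model re-embeds
```

Because the version is part of the key, bumping it forces a re-embed while re-running the same job does nothing. Rows from the old version stay in the store until a separate cleanup removes them.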

11. Beyond RAG

Agentic RAG: model decides when to retrieve, with what query, and re-tries.
GraphRAG: build a knowledge graph from docs and traverse it for multi-hop questions.
Long-context: feed the whole doc to a 1M-token model and skip retrieval — viable for small corpora.

12. What to take away

"Walk me through your retrieval stack." Strong answers cover: chunking strategy, hybrid retrieval, rerank, eval harness, and one failure mode you fixed.


Practice Problem: Top K Frequent Elements (Medium)
