ai ml | advanced | 12 min | 2026-06-04

Day 06 — Retrieval-Augmented Generation (RAG) End-to-End

RAG is the most-shipped LLM pattern in industry today — every internal knowledge bot, support agent and code-search tool is some flavour of it. Knowing chunking…

RAG is conceptually simple — retrieve relevant chunks, stuff them in the prompt, generate — but every layer hides 5 design decisions. Below is a production-grade mental model.

🧠 Concept

Why it matters & the mental model.

1. The four-stage pipeline

RAG looks simple from the outside ("retrieve, stuff, generate") but each stage is a design surface with its own knobs and failure modes.

2. Indexing — chunking decides ceiling

  • Fixed-size by tokens (512-1024 with 10-20% overlap): cheap, decent baseline.
  • Semantic chunking (split on headings/sections, then merge pieces shorter than 256 tokens): better for docs.
  • Sliding window with summary header: prepend [Doc: "Q4 ops review"] [Section: "Latency"] to each chunk so it stays self-contained. Anthropic's Contextual Retrieval automates this by asking an LLM to write a short context for each chunk.
  • Hierarchical / parent-child: embed small chunks (better recall on specifics) but feed the parent document to the LLM (better answer quality).
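
To make the fixed-size and summary-header strategies above concrete, here is a minimal sketch. It assumes tokens have already been produced by whatever tokenizer you use; `chunk_tokens` and `with_header` are illustrative names, not part of any library, and the default sizes are just picked from the ranges listed above.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def with_header(chunk: List[str], doc_title: str, section: str) -> str:
    """Prepend a context header so the chunk stays self-contained."""
    return f'[Doc: "{doc_title}"] [Section: "{section}"] ' + " ".join(chunk)
```

For a quick experiment, `text.split()` can stand in for a real tokenizer; token counts will then differ from what the embedding model sees, so treat the sizes as approximate.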

3. Embeddings — pick by domain, not by leaderboard

  • text-embedding-3-small (OpenAI, 1536d) — strong general purpose, cheap.
  • bge-large-en-v1.5 / bge-m3: open weights, strong MTEB results for English (bge-large-en-v1.5) and multilingual (bge-m3) retrieval.
  • cohere-embed-multilingual-v3: a strong out-of-the-box multilingual option.
  • For code: nomic-embed-code or Voyage-code-3. Domain fine-tuning (contrastive training on your own (query, passage) pairs) routinely gives +10-15 pts on recall@10.

4. Hybrid retrieval — BM25 + dense

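Pure dense retrieval misses exact-match queries ("error code E_42_PARSE"). Pure BM25 misses paraphrases. Hybrid = run both, then fuse the two ranked lists. Reciprocal Rank Fusion is the standard: each document scores the sum of 1/(60+rank) over the lists it appears in, which works on ranks and so avoids normalising incompatible BM25 and cosine scores. Keep the top of the fused union. Almost free, +5-10 pts recall.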
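
A minimal sketch of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of document ids (best first). The function name `rrf_fuse` and the `top_n` parameter are illustrative; the constant 60 is the one quoted above.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Documents found by both retrievers ("a" and "b" here) rise above those found by only one, without ever comparing a BM25 score to a cosine similarity.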

🛠 Deep Dive

Internals, code, architecture.

5. Re-ranking — the cheapest big win

Bi-encoders (used for first-pass) embed query and doc independently → fast but coarse. Cross-encoders (e.g. bge-reranker-large, cohere-rerank-3.5) take (query, doc) together and predict a relevance score — much higher fidelity. Take top-50 from vector store, rerank to top-5. Latency ~50-150 ms, quality jump is enormous.
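
The retrieve-then-rerank flow can be sketched as below. The scorer is injected as a plain callable: in a real system `score_pair` would wrap a cross-encoder such as the ones named above, while `toy_overlap_score` is only a stand-in so the example runs without a model. Both names are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[Tuple[str, str]],
           score_pair: Callable[[str, str], float],
           top_k: int = 5) -> List[Tuple[str, float]]:
    """Score every (query, chunk_text) pair jointly and keep the best top_k."""
    scored = [(chunk_id, score_pair(query, text)) for chunk_id, text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer (not a cross-encoder): fraction of query words in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


first_pass = [
    ("c1", "how to reset a password"),
    ("c2", "parse error E_42_PARSE in loader"),
    ("c3", "password policy"),
]
top = rerank("reset password", first_pass, toy_overlap_score, top_k=2)
```

Keeping the scorer behind a callable makes it easy to measure the reranker's latency separately and to swap models without touching the retrieval code.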

6. Query understanding

  • HyDE: ask the LLM to write a hypothetical answer first; embed that (it's closer to documents than the question).
  • Multi-query: rewrite the question into 3-5 variants, retrieve for each, fuse.
  • Decomposition: break compound questions into sub-questions, RAG each, then combine.
  • Routing: classify the query and pick the right index (docs vs code vs tickets).
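
As an illustration of the multi-query item above, the sketch below retrieves for the original question plus its rewrites and fuses the ranked lists with reciprocal rank fusion. `rewrite` (normally an LLM call) and `retrieve` (normally a vector-store query) are passed in as callables; both, along with the function name, are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def multi_query_retrieve(question: str,
                         rewrite: Callable[[str], List[str]],
                         retrieve: Callable[[str], List[str]],
                         k: int = 60,
                         top_n: int = 5) -> List[str]:
    """Retrieve for the question and each rewrite, then fuse by 1 / (k + rank)."""
    variants = [question] + rewrite(question)
    scores: Dict[str, float] = defaultdict(float)
    for variant in variants:
        for rank, chunk_id in enumerate(retrieve(variant), start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


# Fake rewriter and index so the example runs offline.
fake_index = {"q": ["a", "b"], "v1": ["b", "c"], "v2": ["b", "a"]}
hits = multi_query_retrieve(
    "q",
    rewrite=lambda _q: ["v1", "v2"],
    retrieve=lambda v: fake_index.get(v, []),
)
```

The same skeleton covers HyDE if `rewrite` returns a hypothetical answer instead of question variants.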

7. Generation — keep the LLM honest

  • Always pass citations (chunk ids) in the prompt and ask the model to cite by id.
  • Use structured output (JSON schema) for downstream consumers.
  • Set temperature=0.2-0.4 for factual QA, higher only when style matters.
  • Refuse to answer when retrieved scores are all below a threshold ("I don't know" beats hallucination).
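
One way to combine the citation and refusal rules above is to build the prompt only from chunks that clear a score threshold, and skip the LLM call entirely when none do. This is a sketch: the `min_score` default is an arbitrary placeholder to tune on your own eval set, and dropping the low-scoring chunks when others pass is a design choice of this example rather than something the rules require.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[str, str, float]  # (chunk_id, text, retrieval_score)


def build_prompt(question: str, chunks: List[Chunk], min_score: float = 0.35) -> Optional[str]:
    """Return a prompt with id-tagged sources, or None if no chunk clears min_score."""
    usable = [c for c in chunks if c[2] >= min_score]
    if not usable:
        return None  # caller should answer "I don't know" without calling the LLM
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text, _ in usable)
    return (
        "Answer using only the sources below and cite chunk ids like [c12].\n"
        'If the sources do not contain the answer, reply "I don\'t know".\n\n'
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


prompt = build_prompt(
    "What is p95?",
    [("c1", "p95 is 120 ms", 0.8), ("c2", "unrelated text", 0.1)],
)
```

Returning `None` keeps the refusal decision in ordinary code, where it can be logged and tested, instead of relying on the model to decline.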

8. Evaluation — without it you're guessing

Two layers:

  • Retrieval: recall@k, MRR, nDCG against a gold (query, relevant_chunk_id) set. 50 hand-curated pairs is enough to start.
  • Answer: RAGAS (faithfulness, answer relevance, context precision/recall) or LLM-as-judge with a rubric. Pair with 10-20 human-rated examples to calibrate the judge.
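
The retrieval metrics above are small enough to write by hand. The sketch below assumes binary relevance (a chunk id is either in the gold set or not); MRR is the mean of `reciprocal_rank` over all gold queries. Function names are illustrative.

```python
import math
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, cid in enumerate(retrieved[:k], start=1) if cid in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["x", "a", "b"]
gold = {"a", "c"}
r = recall_at_k(retrieved, gold, k=3)
rr = reciprocal_rank(retrieved, gold)
n = ndcg_at_k(retrieved, gold, k=3)
```

Run these over the gold (query, relevant_chunk_id) set after every chunking or embedding change and compare the averages before looking at answer quality.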

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Failure modes & fixes

| Symptom | Likely cause | Fix |
|---|---|---|
| Off-topic answers | Chunks too big, dilute query | Smaller chunks + re-rank |
| Misses obvious matches | Pure dense | Add BM25 hybrid |
| Hallucinated facts | Low retrieval score, model still answered | Threshold + "I don't know" path |
| Right doc, wrong chunk | No section context | Contextual headers |
| Stale answers | No freshness | Time decay on score, periodic re-index |
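
The "time decay on score" fix for stale answers can be as simple as an exponential half-life applied to the retrieval score. This is one possible form, not the only one; the 90-day half-life is a made-up default and `decayed_score` is a hypothetical helper.

```python
def decayed_score(score: float, age_days: float, half_life_days: float = 90.0) -> float:
    """Halve the retrieval score for every `half_life_days` of document age."""
    return score * 0.5 ** (max(age_days, 0.0) / half_life_days)


fresh = decayed_score(0.8, age_days=0)
old = decayed_score(0.8, age_days=90)
```

Apply it before the final sort so that a slightly less similar but recent chunk can outrank an older one.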

10. Production checklist

  • Re-embed when the embedding model or chunking scheme changes (track embedding_model_version).
  • Idempotent ingestion (hash chunk → upsert).
  • Observability: log query, top-k ids+scores, final answer, latency budget per stage.
  • Cost model: embeddings are cheap, reranker and generation are not — budget by p95.
  • Security: per-tenant namespaces in vector store, row-level filters on retrieval.
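
The "hash chunk → upsert" item can be sketched as follows. A plain dict stands in for the vector store, and the key includes embedding_model_version so a model change produces new entries instead of silently reusing old vectors. `chunk_key` and `upsert_chunk` are illustrative names, and a real store would also hold the embedding itself.

```python
import hashlib
from typing import Dict


def chunk_key(doc_id: str, text: str, embedding_model_version: str) -> str:
    """Deterministic key: same document, text and model version give the same id."""
    payload = "\x1f".join((doc_id, embedding_model_version, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def upsert_chunk(store: Dict[str, dict], doc_id: str, text: str,
                 embedding_model_version: str) -> bool:
    """Insert the chunk if its key is new; return True when a write happened."""
    key = chunk_key(doc_id, text, embedding_model_version)
    if key in store:
        return False  # re-running ingestion on unchanged content is a no-op
    store[key] = {"doc_id": doc_id, "text": text,
                  "embedding_model_version": embedding_model_version}
    return True


store: Dict[str, dict] = {}
first = upsert_chunk(store, "d1", "hello world", "v1")
again = upsert_chunk(store, "d1", "hello world", "v1")
bumped = upsert_chunk(store, "d1", "hello world", "v2")
```

Because the key is derived from content, a crashed ingestion job can simply be restarted from the beginning without creating duplicates.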

11. Beyond RAG

  • Agentic RAG: model decides when to retrieve, with what query, and re-tries.
  • GraphRAG: build a knowledge graph from docs and traverse it for multi-hop questions.
  • Long-context: feed the whole doc to a 1M-token model and skip retrieval — viable for small corpora.

12. What to take away

"Walk me through your retrieval stack." Strong answers cover: chunking strategy, hybrid retrieval, rerank, eval harness, and one failure mode you fixed.
