RAG architecture: basics and how it works

Retrieval-Augmented Generation from first principles — embeddings, vector databases, chunking, retrieval, prompt construction, evaluation, and common failure modes — enough depth to build one yourself.

RAG: a self-sufficient primer

What you'll leave with: a clear mental model of why RAG works (and when it doesn't), the specific role and math of embeddings and vector stores, a concrete blueprint for a minimal working RAG pipeline in Python, and the evaluation / debugging toolkit you need before putting one in production.

A companion Notion deep-dive is at RAG Basics on Notion; this page is designed to be readable on its own.


1. Why RAG exists — the problem it solves

Large Language Models have three hard limits for enterprise / personal-knowledge use cases:

  1. Frozen knowledge — their weights were trained on data up to a cutoff date. They don't know your company wiki, last week's incidents, or the PDF you uploaded five seconds ago.
  2. Context-window economics — even a 200k-token context is finite and expensive. You cannot paste your entire corpus into every prompt.
  3. Hallucination when under-specified — when a model has no grounded evidence, it confabulates plausible-sounding answers.

RAG = Retrieval-Augmented Generation. Instead of teaching the model new facts (fine-tuning), you retrieve the most relevant snippets from your corpus at query time and paste them into the prompt as evidence. The model's job narrows from "recall from memory" to "read this evidence and answer."

A one-line intuition: RAG turns a language model into a reading-comprehension system over a knowledge base you control.


2. The four components — and what each actually does

flowchart LR
    U[User question] --> EQ[Embed query]
    subgraph Offline
        D[Documents] --> C[Chunk]
        C --> ED[Embed chunks]
        ED --> V[(Vector store)]
    end
    EQ --> R[Top-k similarity search]
    V --> R
    R --> P["Build prompt:<br/>question + context"]
    P --> L[LLM]
    L --> A[Grounded answer + citations]

2.1 Embeddings — what they are, mathematically

An embedding model is a function f: text → ℝᵈ that maps any text into a fixed-length vector (commonly d = 384, 768, 1024, 1536, or 3072) such that semantically similar texts land near each other under cosine similarity.

Two crucial properties:

  • Not a lookup. The model has generalised: texts it has never seen still get sensible vectors.
  • Asymmetric retrieval. Many models (e.g., BGE, E5) perform best when the query and document are embedded with slightly different instructions ("query:" vs "passage:" prefixes). Skipping them gives up 5–15% recall for nothing — the sketch below shows the prefixes in use.
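
To make the geometry concrete, here is a minimal sketch of ranking passages against a query by cosine similarity — assuming an embed function like the one in §3 and the "query:"/"passage:" prefixes that E5/BGE-style models expect (harmless but unnecessary for models that ignore them):

import numpy as np

def cosine_top_k(query: str, passages: list[str], embed, k: int = 3):
    # prefix per the asymmetric-retrieval convention (E5/BGE style)
    q = np.asarray(embed(["query: " + query], "query")[0])
    P = np.asarray(embed(["passage: " + p for p in passages], "passage"))
    q = q / np.linalg.norm(q)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    sims = P @ q                           # cosine similarity after normalisation
    top = np.argsort(-sims)[:k]
    return [(passages[i], float(sims[i])) for i in top]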

Leading embedding models (late 2024 / 2025):

| Model | Dim | Context | Notes |
|---|---|---|---|
| text-embedding-3-large (OpenAI) | up to 3072 (Matryoshka) | 8192 | Good general purpose, truncatable |
| text-embedding-3-small | up to 1536 | 8192 | Cheaper, still strong |
| bge-m3 (BAAI) | 1024 | 8192 | Multilingual + multi-vector (dense, sparse, ColBERT) |
| e5-mistral-7b-instruct | 4096 | 32k | SOTA on MTEB, slow & large |
| nomic-embed-text-v1.5 | 768 (Matryoshka) | 8192 | Open-weights, runs locally |

2.2 Chunking — where almost all RAG systems go wrong

You cannot embed an entire book as one vector: the model's context is limited, and a single vector summarising 500 pages is useless for locating the one paragraph the user needs. So you chunk documents first.

Common strategies, in order of increasing sophistication:

  1. Fixed-size chunks — e.g., 500 tokens with 50-token overlap. Simple but breaks mid-sentence and mid-idea.
  2. Recursive character splitting (LangChain's default) — split on \n\n → \n → " " → "" until each piece is ≤ limit. Better, still blind to meaning.
  3. Semantic chunking — embed sentences, merge consecutive ones while similarity stays high, break when it drops. Preserves coherent ideas (sketched after this list).
  4. Structure-aware — for Markdown, HTML, or code, split on headings / sections / function boundaries; attach headings as metadata so the retrieved chunk knows what section it came from.
  5. Propositional / statement-level — use an LLM to decompose text into atomic statements, then embed each. Highest recall, highest ingestion cost.
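
As a concrete illustration of strategy 3, here is a minimal semantic-chunking sketch. It reuses the embed function from §3; the 0.75 similarity threshold is an arbitrary assumption you would tune per corpus, and a real version would also cap chunk length:

import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    vecs = np.asarray(embed(sentences, "passage"))
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # cosine similarity between consecutive sentences
        if float(vecs[i] @ vecs[i - 1]) < threshold:  # topic shift → new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks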

Rules of thumb that survive contact with real data:

  • Target 300–800 tokens per chunk. Smaller → more precise hits but fragmented context. Larger → fewer chunks to search, but each hit carries more noise around the part that matters.
  • Always keep metadata with each chunk: source URL/path, title, section heading, page number, timestamp, author. You'll need it for citations and for filters.
  • Overlap 10–20%. Prevents the answer falling on a chunk boundary.

2.3 Vector stores — what makes them different from a DB

A vector store is an index tuned for approximate nearest-neighbour (ANN) search over high-dimensional vectors. Doing exact k-NN over a million 1536-dim vectors is milliseconds on a GPU and seconds on a CPU; ANN is sub-millisecond at the cost of a tiny recall loss.

The dominant algorithm is HNSW (Hierarchical Navigable Small World) — a layered graph where each node connects to nearby neighbours; search walks from a coarse top layer down to the dense bottom layer. Parameters you'll see:

  • M — edges per node (default 16). Higher → better recall, more memory.
  • ef_construction — candidate pool during build. Higher → better index, slower build.
  • ef_search — candidate pool during query. Higher → better recall, slower query.
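
Here is how those three knobs look in practice — a minimal FAISS sketch with illustrative (not tuned) parameter values; vectors are L2-normalised so inner product equals cosine similarity:

import numpy as np
import faiss

d = 1536
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # M = 32 edges per node
index.hnsw.efConstruction = 200   # candidate pool during build
index.hnsw.efSearch = 64          # candidate pool during query

xb = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(xb)            # normalised → inner product = cosine
index.add(xb)

scores, ids = index.search(xb[:1], 5)  # top-5 approximate neighbours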

Practical options:

| Store | When to pick it |
|---|---|
| FAISS (Meta) | In-process, single-node, C++ speed, no server. Great for experiments and small RAGs. |
| Chroma | Dev-friendly, embedded or client/server, persistent. |
| Qdrant | Production-grade, Rust, strong filtering, payload indexing. |
| Weaviate | GraphQL API, hybrid search, modular. |
| pgvector | You already run Postgres. Keep everything transactional. Good up to ~10M vectors with IVFFlat or HNSW. |
| Azure AI Search | Managed, hybrid (BM25 + vector + semantic ranker), RBAC, VNET integration. |

2.4 The LLM at the end

The generator's job is to synthesise an answer from the retrieved context, ideally with citations. You constrain it with a system prompt like:

You are a careful assistant. Answer only using the provided CONTEXT. If the context is insufficient, say "I don't know based on the provided documents." Cite sources as [#] referring to the numbered chunks.

The choice between GPT-4o / Claude / Llama-3-70B / Mistral-Large matters less than you'd think if retrieval is good. Bad retrieval can't be rescued by a bigger model.


3. A minimal but real RAG pipeline in Python

This is intentionally library-light so you can see the moving parts. Swap components later.

# pip install openai chromadb tiktoken pypdf
from pathlib import Path
import tiktoken, chromadb
from openai import OpenAI
from pypdf import PdfReader

oai = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")
client = chromadb.PersistentClient(path=".chroma")
col = client.get_or_create_collection(
    name="kb",
    metadata={"hnsw:space": "cosine"}
)

# ---------- 1. Load + chunk ----------
def load_pdf(path: Path) -> list[tuple[str, dict]]:
    reader = PdfReader(str(path))
    out = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        out.append((text, {"source": str(path), "page": i + 1}))
    return out

def chunk(text: str, target=500, overlap=80) -> list[str]:
    toks = enc.encode(text)
    step = target - overlap
    return [enc.decode(toks[i:i + target]) for i in range(0, len(toks), step) if toks[i:i + target]]

# ---------- 2. Embed + index ----------
def embed(texts: list[str], kind: str) -> list[list[float]]:
    # kind is "query" or "passage" — unused by OpenAI embeddings, but kept
    # so prefix-sensitive models (BGE, E5) can be swapped in without touching callers
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest(pdf_paths: list[Path]):
    ids, docs, metas = [], [], []
    for p in pdf_paths:
        for page_text, meta in load_pdf(p):
            for j, c in enumerate(chunk(page_text)):
                ids.append(f"{p.stem}-p{meta['page']}-c{j}")
                docs.append(c)
                metas.append({**meta, "chunk": j})
    embs = []
    for i in range(0, len(docs), 100):  # batch
        embs.extend(embed(docs[i:i+100], "passage"))
    col.upsert(ids=ids, documents=docs, metadatas=metas, embeddings=embs)

# ---------- 3. Retrieve ----------
def retrieve(question: str, k: int = 6) -> list[dict]:
    q_emb = embed([question], "query")[0]
    res = col.query(query_embeddings=[q_emb], n_results=k,
                    include=["documents", "metadatas", "distances"])
    hits = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
        hits.append({"text": doc, "meta": meta, "score": 1 - dist})
    return hits

# ---------- 4. Generate ----------
SYSTEM = (
    "You are a careful assistant. Answer ONLY using the provided CONTEXT. "
    "If the context is insufficient, say you don't know. "
    "Cite sources inline as [n] where n is the chunk number."
)

def answer(question: str) -> str:
    hits = retrieve(question)
    context = "\n\n".join(
        f"[{i+1}] (source: {h['meta']['source']}, p{h['meta']['page']})\n{h['text']}"
        for i, h in enumerate(hits)
    )
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {question}"},
        ],
        temperature=0.1,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # ingest([Path("handbook.pdf")])  # run once
    print(answer("What is our refund policy for enterprise plans?"))

This is ~80 lines and it genuinely works. Everything beyond this — rerankers, hybrid search, query rewriting — is an improvement over this base, not a replacement.


4. Upgrades that actually move the needle

Once the minimal pipeline runs, these are the highest-leverage improvements, in rough ROI order:

4.1 Hybrid search (BM25 + vector)

Vector search is great at paraphrase ("how do I cancel?" ↔ "terminating your subscription"); it's bad at exact tokens like product SKUs, error codes, or function names. BM25 (a classical lexical score) nails those.

Run both, then fuse the rankings with Reciprocal Rank Fusion (RRF): each document d gets the score Σᵣ 1/(k₀ + rankᵣ(d)), summed over the rankers r, with k₀ ≈ 60 as the customary constant.
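
A minimal RRF sketch, assuming two ranked lists of chunk ids (e.g., one from BM25, one from vector search):

def rrf(rankings: list[list[str]], k0: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k0 + rank)
    # best fused score first
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_ids, vector_ids])[:5]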

Azure AI Search, Qdrant (since 1.10), and Weaviate support this natively.

4.2 Reranking with a cross-encoder

The retriever returns, say, the top-20 by vector score. A cross-encoder (e.g., bge-reranker-v2-m3, Cohere rerank-3) then scores each (query, candidate) pair with a model that has seen both simultaneously — slower, but dramatically more accurate. Keep the top 5 for the LLM. Typical gain: +10–25 points on recall@5.
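
A minimal reranking sketch using the sentence-transformers CrossEncoder wrapper on top of the §3 retrieve function; the model choice and the top-20 → top-5 cut follow the text above:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(question: str, final_k: int = 5) -> list[dict]:
    hits = retrieve(question, k=20)                 # over-retrieve, then re-score
    scores = reranker.predict([(question, h["text"]) for h in hits])
    for h, s in zip(hits, scores):
        h["rerank_score"] = float(s)
    return sorted(hits, key=lambda h: h["rerank_score"], reverse=True)[:final_k]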

4.3 Query transformations

  • Query rewriting — turn "What about refunds?" into a standalone question using chat history.
  • HyDE (Hypothetical Document Embeddings) — have the LLM draft a hypothetical answer, embed that instead of the raw question (sketched after this list). Surprisingly effective for short/underspecified queries.
  • Multi-query — generate 3–4 rephrasings, retrieve for each, dedupe, then rerank.
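
A minimal HyDE sketch built on the §3 pipeline — the draft prompt wording is an assumption, and the draft is embedded as a passage since it reads like one:

def hyde_retrieve(question: str, k: int = 6) -> list[dict]:
    # 1. Draft a plausible (possibly wrong) answer — its *shape* is what matters
    draft = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Write one short paragraph that plausibly answers: {question}"}],
        temperature=0.7,
    ).choices[0].message.content
    # 2. Embed the draft instead of the raw question, then search as usual
    q_emb = embed([draft], "passage")[0]
    res = col.query(query_embeddings=[q_emb], n_results=k,
                    include=["documents", "metadatas", "distances"])
    return [{"text": d, "meta": m, "score": 1 - dist}
            for d, m, dist in zip(res["documents"][0],
                                  res["metadatas"][0],
                                  res["distances"][0])]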

4.4 Metadata filtering

If you tag chunks with product, doc_type, date, author, you can filter before (or alongside) the vector search: product = "Foo" AND date > 2024-06-01. This is often the single biggest win for accuracy because it eliminates whole swathes of irrelevant candidates.
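
With the Chroma store from §3, the filter rides along in the query call. This sketch assumes chunks were ingested with a product string and a numeric ts (unix timestamp) in their metadata, since Chroma's range operators work on numbers:

res = col.query(
    query_embeddings=[q_emb],          # embedded query, as in retrieve()
    n_results=6,
    include=["documents", "metadatas", "distances"],
    where={"$and": [
        {"product": {"$eq": "Foo"}},
        {"ts": {"$gt": 1717200000}},   # 2024-06-01 00:00 UTC as unix seconds
    ]},
)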

4.5 Structured outputs + citations

Ask the model for JSON with { "answer": ..., "citations": [chunk_id, ...] }. Now you can programmatically check that every chunk it cites was actually in the context, and render clickable footnotes in the UI.
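
A minimal sketch of that validation step, assuming OpenAI's JSON mode (response_format={"type": "json_object"}) and numbered chunks as in §3:

import json

def validated_answer(question: str) -> dict:
    hits = retrieve(question)
    context = "\n\n".join(f"[{i+1}] {h['text']}" for i, h in enumerate(hits))
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content":
             'Answer from CONTEXT only. Reply as JSON: {"answer": str, "citations": [int]}'},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {question}"},
        ],
    )
    out = json.loads(resp.choices[0].message.content)
    valid = set(range(1, len(hits) + 1))
    out["citations"] = [c for c in out.get("citations", []) if c in valid]  # drop invented ids
    return out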


5. How to evaluate a RAG system (the part everyone skips)

"It looks good to me" is not evaluation. The minimum you need:

5.1 Retrieval metrics (no LLM needed)

Build a small eval set of (question, gold_chunk_ids):

  • Hit rate @ k — fraction of questions where at least one gold chunk is in the top-k.
  • MRR — mean reciprocal rank of the first gold hit.
  • nDCG @ k — rewards correct order, not just presence.

A hit rate @ 5 below ~85% almost always points at chunking or embedding problems, not the LLM.
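
Hit rate and MRR fit in a dozen lines. This sketch assumes a retrieve_ids(question, k) helper returning ranked chunk ids — Chroma's query results include ids, so it is a one-line variant of retrieve from §3:

def evaluate(eval_set: list[tuple[str, set[str]]], retrieve_ids, k: int = 5) -> dict:
    hits, rr = 0, 0.0
    for question, gold_ids in eval_set:
        ranked = retrieve_ids(question, k)
        # rank (1-based) of the first gold chunk, if any
        first = next((r for r, cid in enumerate(ranked, 1) if cid in gold_ids), None)
        if first is not None:
            hits += 1
            rr += 1.0 / first
    n = len(eval_set)
    return {f"hit_rate@{k}": hits / n, "mrr": rr / n}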

5.2 Generation metrics

The widely-used framework is the RAG triad:

  • Faithfulness / Groundedness — does every claim in the answer follow from the retrieved context? (LLM-as-judge: present context + answer, ask "is each claim supported?")
  • Answer relevance — does the answer actually address the question?
  • Context relevance — were the retrieved chunks useful?

Libraries: RAGAS, TruLens, DeepEval, or Azure AI Evaluation SDK. They're all thin wrappers over the same LLM-as-judge prompts — you can roll your own in a day.
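
As a taste of how thin those wrappers are, here is a minimal faithfulness judge reusing the §3 client; the 0–10 scale and the prompt wording are assumptions, not a standard:

def judge_faithfulness(answer: str, context: str) -> int:
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content":
            f"CONTEXT:\n{context}\n\nANSWER:\n{answer}\n\n"
            "On a scale of 0-10, how fully is every claim in ANSWER supported "
            "by CONTEXT? Reply with a single integer only."}],
    )
    return int(resp.choices[0].message.content.strip())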

5.3 End-to-end human eval

Maintain 30–50 golden questions with hand-written ideal answers. Run them on every change. Track a simple 1–5 Likert score from a human reviewer. This is the only metric that truly correlates with user satisfaction; automated metrics are sanity checks.


6. Common failure modes and how to diagnose them

| Symptom | Likely cause | Fix |
|---|---|---|
| Confidently wrong answers | LLM hallucinating because retrieval returned irrelevant chunks | Lower temperature, add a "say I don't know" instruction, improve retrieval, add a reranker |
| "I don't know" when the answer exists in the corpus | Recall problem: chunking too coarse, wrong embedding model, missing hybrid search | Measure hit rate @ k first; if low, fix retrieval before touching the prompt |
| Right chunks retrieved, wrong answer generated | Prompt is not actually forcing grounding, or context is too long | Tighten the system prompt, truncate context, try a better model |
| Works on short queries, fails on long/conversational ones | No query rewriting for multi-turn | Add a rewrite step using the last N messages |
| New documents aren't found | Stale index | Re-embed on ingest; schedule re-index; expose a force-refresh endpoint |
| Latency too high | Embedding model too large, no caching, or top-k too big | Cache query embeddings, use Matryoshka truncation, use a smaller reranker |

The diagnostic flow: answer wrong → inspect retrieved chunks → are they relevant? If yes, it's a generation problem. If no, it's a retrieval problem. Don't fix the wrong layer.


7. When not to use RAG

  • The knowledge never changes and is small → fine-tune or just stuff it in the system prompt.
  • The task is purely reasoning/transformation (summarise this text, translate this) → no retrieval needed.
  • You need structured data queries (e.g., "how many orders last month?") → you want text-to-SQL or a tool-calling agent, not RAG.
  • Fresh / real-time data → plain tool use (a web-search or DB query tool) often beats a vector index that's perpetually stale.

A good architecture routes queries: a small classifier decides RAG vs tool-calling vs direct answer.
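
A minimal routing sketch using an LLM as the classifier — the labels and the fallback are assumptions, and a small fine-tuned classifier would be cheaper at scale:

def route(question: str) -> str:
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content":
            "Classify the question as exactly one of: rag, sql, direct.\n"
            "rag = needs our documents; sql = needs structured/aggregate data; "
            "direct = general reasoning, no lookup.\n"
            f"Question: {question}\nLabel:"}],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in {"rag", "sql", "direct"} else "rag"  # safe default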


8. TODO — self-sufficient action list

Everything below can be done with only what's on this page plus a Python environment and an OpenAI (or local Ollama) API key.

Build the base

  • [ ] Clone your own rag-starter repo with the 80-line pipeline in §3. Ingest 3–5 real PDFs you care about (handbook, papers, meeting notes).
  • [ ] Confirm end-to-end: ingest → retrieve → answer with citations on 10 hand-written questions.

Measure

  • [ ] Write 30 eval questions with gold chunk IDs for your corpus.
  • [ ] Report hit@1, hit@5, MRR today. Put the numbers in the blog.
  • [ ] Run RAGAS (or your own LLM-judge) for faithfulness + answer relevance on those 30 questions. Record the numbers.

Improve retrieval

  • [ ] Swap naive chunking for recursive + structure-aware chunking (preserve headings in metadata). Re-measure.
  • [ ] Add BM25 and fuse with RRF. Re-measure.
  • [ ] Add a cross-encoder reranker on top-20 → top-5. Re-measure.
  • [ ] Try two embedding models (text-embedding-3-small vs bge-m3); keep the winner.

Improve generation

  • [ ] Force JSON output with {answer, citations[]} and validate that every citation is in the provided context.
  • [ ] Add query rewriting for multi-turn chats. Write 5 follow-up questions that fail without it.

Operationalise

  • [ ] Put the vector store behind a service (Qdrant or pgvector) and add a simple ingest API.
  • [ ] Log every (question, retrieved_ids, answer) — you'll need this for regression testing.
  • [ ] Add a unit test that fails if hit@5 drops > 3% on the eval set.

Stretch — decide from evidence, not vibes

  • [ ] Add metadata filtering (by doc type or date) and show a 5+ point recall gain on a filtered subset.
  • [ ] A/B a semantic chunker (e.g., using embeddings for split points) vs recursive. Keep what wins.
  • [ ] Try HyDE and multi-query on your top-20 hardest questions; report which strategy wins on which.

When hit@5 ≥ 0.9 on your eval set, faithfulness ≥ 0.9, and all citations validate, flip this post to status: published and link the benchmark numbers.
