ai ml · intermediate · 12m · 2026-06-09

RAG Part 1 — Why, Chunking, Embeddings, Vector Stores

Session 9 of the 48-session learning series.

Date: Tue, 2026-06-16 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 06 · Est. read: 2 h

Why this session matters

This is Session 09 of 48 in the LLMs & Agents track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

Why RAG exists — the limits of context windows and fine-tuning
Chunking strategies — fixed, sentence, semantic, structural
Embedding overview — dense vectors, similarity, what 'good' looks like
Vector stores — Faiss, pgvector, Pinecone, Weaviate, Milvus
What we keep for Part 2 (retrieval ranking, re-rank, eval)

Pre-read (skim before the session)

Deep dive

1. The problem RAG solves

LLMs are great at language but they don't know your private docs, last week's PRs, or this quarter's incidents. Two options:

Fine-tune them on your data — expensive, slow, freezes the knowledge.
Retrieve relevant snippets at query time and stuff them into the prompt — cheap, fresh, debuggable.

RAG = retrieve relevant context, then generate. It's now the default architecture for product chatbots, copilots, search, and anything else that needs to ground a model in your data.

2. The simplest RAG pipeline

     Ingestion (offline)                  Query (online)
     -------------------                  --------------
   docs → chunk → embed → store      q → embed → top-k search →
                       │                                       │
                       ▼                                       ▼
               vector store                     prompt = template(q, top-k) → LLM → answer

Three decisions dominate the quality you ship:

1. How you chunk the documents.
2. Which embedding model you use.
3. How you retrieve (top-k? hybrid? re-rank?).

This session covers 1 and 2 deeply, 3 lightly. Session 13 (RAG Part 2) covers retrieval, re-ranking, generation, and evaluation in depth.

3. Chunking — the most underrated decision

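Retrieval returns whole chunks, so chunk boundaries decide what the model gets to see. Chunks too big → each embedding blends several topics and retrieval gets noisy. Too small → a chunk loses the surrounding context it needs to make sense.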

Five levels (Greg Kamradt's framing):

Fixed-size — every N characters / tokens. Simple, lossy on structure.
Recursive character splitting — split on \n\n, then \n, then space, then char. Respects paragraph + sentence boundaries.
Semantic chunking — embed sentences, group adjacent sentences when embedding similarity > threshold. Computationally heavier at ingest, but often a quality bump.
Structural chunking — use the doc structure (Markdown headers, HTML divs, code blocks). Best for technical docs.
Agentic chunking — ask an LLM to chunk it. Expensive at ingest, sometimes worth it.
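A minimal sketch of the recursive character splitting described above, assuming sizes are measured in characters rather than tokens (a tokenizer-based length function would be needed for token budgets). The name `split_recursive` matches the helper used in the section 7 listing; `_pieces` and the `seps` parameter are hypothetical names chosen here for illustration.

```python
def _pieces(text, limit, seps):
    """Break text into pieces of at most `limit` chars, coarsest separator first."""
    if len(text) <= limit:
        return [text] if text else []
    sep, finer = seps[0], seps[1:]
    if sep == "":
        # Last resort: hard cut every `limit` characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    parts = text.split(sep)
    # Re-attach the separator so joining all pieces reproduces the text.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]
    out = []
    for part in parts:
        out.extend(_pieces(part, limit, finer))
    return out


def split_recursive(text, size=500, overlap=80, seps=("\n\n", "\n", " ", "")):
    """Return chunks of at most `size` chars; each chunk after the first
    starts with the last `overlap` chars of the previous chunk's body.
    `seps` must end with "" so the hard-cut fallback is always available."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    limit = size - overlap  # room left for new text once the overlap is added
    bodies, current = [], ""
    for piece in _pieces(text, limit, seps):
        if current and len(current) + len(piece) > limit:
            bodies.append(current)
            current = ""
        current += piece
    if current:
        bodies.append(current)
    chunks = []
    for i, body in enumerate(bodies):
        prefix = bodies[i - 1][-overlap:] if i > 0 and overlap else ""
        chunks.append(prefix + body)
    return chunks
```

Paragraph breaks are tried first, so a chunk boundary only falls mid-sentence when a single paragraph or line is longer than the budget.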

Defaults that work:

Chunk size: 300–800 tokens (model-dependent).
Overlap: 50–100 tokens so cross-chunk references aren't lost.
Metadata per chunk: (doc_id, chunk_idx, headings[], url, page_no, last_modified).
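For a Markdown technical doc, structural chunking and per-chunk metadata can be combined. The sketch below assumes ATX-style headings (`#` to `######`) and emits one chunk per heading section, tagged with `doc_id`, `chunk_idx`, and the heading path. `Chunk` and `chunk_markdown` are hypothetical names; the remaining metadata fields (`url`, `page_no`, `last_modified`) are left out for brevity, and sections longer than the chunk budget would still need a further split.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Chunk:
    doc_id: str
    chunk_idx: int
    headings: list  # heading path, e.g. ["Guide", "Install"]
    text: str


def chunk_markdown(doc_id, markdown):
    """One chunk per heading section, tagged with its heading path."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(doc_id, len(chunks), list(path), text))
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code are not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading path with each chunk makes it possible to show "Guide > Install" next to a retrieved snippet, or to prepend it to the chunk text before embedding.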

4. Embeddings (overview — deep dive in Session 17)

An embedding model maps text → a dense vector in R^d (typically d = 384, 768, 1024, 1536, 3072). Two texts with similar meaning map to nearby vectors (cosine similarity high).
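To make "nearby vectors" concrete, here is cosine similarity written out in plain Python. The three vectors are toy 3-dimensional values invented for illustration, not the output of any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three short texts.
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
gpu_drivers = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_policy, money_back))   # about 0.98: similar meaning
print(cosine_similarity(refund_policy, gpu_drivers))  # about 0.02: unrelated
```

If vectors are L2-normalised up front, the denominator is 1 and cosine similarity reduces to a plain inner product, which is why many stores index normalised vectors with an inner-product metric.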

For English RAG in 2026:

Model	Dim	Strength
OpenAI `text-embedding-3-small`	1536	Cheap, fast, strong baseline
OpenAI `text-embedding-3-large`	3072	Best quality, more $/req
`bge-large-en-v1.5` (BAAI)	1024	Open-source, strong on MTEB
`nomic-embed-text-v1.5`	768	Open, long-context, fast
`voyage-3` (Voyage AI)	1024	Specialised, leads several MTEB tasks

Multi-lingual: bge-m3, multilingual-e5-large. For code: voyage-code-2, bge-code.

5. Vector stores — picking the right one

Store	Hosting	Best for
Faiss	Library	Embedded, small/medium scale, full control
pgvector	In Postgres	One DB for everything, no new infra
Pinecone	Managed SaaS	Zero-ops, $$$
Weaviate	Self/Managed	Multi-modal, hybrid (vector + keyword)
Milvus / Zilliz	Self/Managed	Billion-scale, multi-tenant
Qdrant	Self/Managed	Rust core, fast, payload filters

Rule of thumb: until ~10M chunks, pgvector (or Faiss in-process) is enough. Past 10M, look at Qdrant / Weaviate / Pinecone for ANN performance + multi-tenant features.
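One way to sanity-check that threshold is to estimate raw vector storage. The sketch below assumes float32 values (4 bytes each) and counts only the vectors themselves; index structures, metadata, and replication add more on top. `raw_index_bytes` is a hypothetical helper name.

```python
def raw_index_bytes(n_vectors, dim, bytes_per_value=4):
    """Bytes needed to hold n_vectors embeddings, excluding index overhead."""
    return n_vectors * dim * bytes_per_value


for dim in (768, 1536, 3072):
    gb = raw_index_bytes(10_000_000, dim) / 1e9
    print(f"10M x {dim}-dim float32: {gb:.1f} GB")
```

At 10M chunks and 1536 dimensions this comes to roughly 61 GB of raw vectors, which is about where a single in-memory index starts to need a deliberate hardware or sharding decision.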

6. ANN — approximate nearest-neighbour

Exact NN over 1B vectors is too slow. ANN trades a tiny recall hit for huge speed:

HNSW — Hierarchical Navigable Small World graph. The usual default, supported by Faiss, Qdrant, and pgvector. Typically 90–99% recall at <10 ms for millions of vectors, depending on hardware and parameters.
IVF — cluster + inverted file. Better for billion-scale; slower to update.
ScaNN (Google), DiskANN (Microsoft) — SOTA for very large indices.

For most product use cases: HNSW with M = 16, ef_construction = 200, ef_search = 50 is a strong default.
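The "recall hit" is measurable: run the same queries through an exact (brute-force) search and through the ANN index, then compare the returned ids. A minimal sketch, assuming both searches return one list of ids per query; `recall_at_k` and the sample ids are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k that the approximate search also returned."""
    scores = []
    for approx, exact in zip(approx_ids, exact_ids):
        truth = set(exact[:k])
        scores.append(len(truth & set(approx[:k])) / len(truth))
    return sum(scores) / len(scores)


exact = [[3, 7, 1, 9, 4], [2, 8, 5, 0, 6]]    # ground truth from brute force
approx = [[3, 7, 1, 4, 11], [2, 8, 5, 0, 6]]  # what an ANN index returned

# 0.9 here: the first query missed one of its five true neighbours.
print(recall_at_k(approx, exact, k=5))
```

Raising `ef_search` generally pushes this number up at the cost of latency, so it is worth measuring on your own data before settling on parameters.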

7. A minimal RAG in 40 lines

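from pathlib import Path

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# document_paths: list of text files to index.
# split_recursive: any splitter implementing the recursive strategy from section 3.
# Neither is defined in this listing.
docs = [Path(p).read_text(encoding="utf-8") for p in document_paths]
chunks = [c for d in docs for c in split_recursive(d, size=500, overlap=80)]

def embed(texts, batch_size=256):
    # Embed in batches so a large corpus stays under per-request input limits.
    vecs = []
    for i in range(0, len(texts), batch_size):
        r = client.embeddings.create(
            model="text-embedding-3-small", input=texts[i:i + batch_size]
        )
        vecs.extend(e.embedding for e in r.data)
    return np.array(vecs, dtype="float32")

emb = embed(chunks)
faiss.normalize_L2(emb)                  # in place; on unit vectors inner product = cosine
index = faiss.IndexFlatIP(emb.shape[1])  # exact (brute-force) inner-product index
index.add(emb)

def answer(q, k=4):
    qv = embed([q])
    faiss.normalize_L2(qv)
    _, ids = index.search(qv, k)
    # Faiss pads with -1 when the index holds fewer than k vectors; skip those.
    ctx = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    prompt = f"Context:\n{ctx}\n\nQuestion: {q}\nAnswer:"
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content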

It works. It's also terrible — no re-ranking, no hybrid, no eval. Sessions 13 and 25 fix that.

8. Metadata filtering

Real RAG is rarely 'search all docs'. Filter to the user's org / language / doc-type before ANN:

# Pseudocode: the method name and filter syntax vary by vector store.
index.search_with_filter(qv, k=4, filter={"org_id": user.org, "lang": "en"})

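Qdrant and Weaviate make this a first-class feature; pgvector does it via SQL WHERE. Prefer pre-filtering (restrict the candidates, then score) over post-filtering (score everything, then drop non-matching hits): pre-filtering only scores the matching subset, while post-filtering can return fewer than k results when most of the top hits belong to other orgs.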
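The difference between filtering before and after the search shows up even with brute-force scoring. The sketch below uses plain Python lists instead of a real vector store; the function names, the `org_id` values, and the 2-dimensional vectors are all made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k):
    """Brute-force top-k by inner product; items are (vector, metadata) pairs."""
    return sorted(items, key=lambda it: dot(query, it[0]), reverse=True)[:k]


def pre_filter_search(query, items, k, org_id):
    # Restrict the candidate set first, then rank only what is allowed.
    allowed = [it for it in items if it[1]["org_id"] == org_id]
    return top_k(query, allowed, k)


def post_filter_search(query, items, k, org_id):
    # Rank everything, then drop hits from other orgs (may leave fewer than k).
    return [it for it in top_k(query, items, k) if it[1]["org_id"] == org_id]


items = [
    ([1.0, 0.0], {"org_id": "acme", "chunk": "a1"}),
    ([0.9, 0.1], {"org_id": "globex", "chunk": "g1"}),
    ([0.8, 0.2], {"org_id": "globex", "chunk": "g2"}),
    ([0.1, 0.9], {"org_id": "acme", "chunk": "a2"}),
]
query = [1.0, 0.0]

pre = [m["chunk"] for _, m in pre_filter_search(query, items, 2, "acme")]
post = [m["chunk"] for _, m in post_filter_search(query, items, 2, "acme")]
print(pre, post)
```

Here the pre-filtered search returns two chunks for the org, while the post-filtered one returns a single chunk because the other top hit belonged to a different org.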

9. Common failure modes (RAG “doesn't work”)

Retrieves the right doc, wrong section → chunking is too coarse / overlap too small.
Doesn't retrieve the right doc at all → embedding model isn't strong enough for your domain; try a domain-specific one or fine-tune.
Top-k irrelevant → add a re-ranker (cross-encoder) over top-30 → top-5. Session 13.
Answer cites wrong source → prompt isn't telling the model to cite. Add: "Cite the source for every claim using [doc_id]".
Hallucinated answer even with context → add an evaluation step (Session 25 — LLM evals).
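For the wrong-citation case, a small sketch of a citation-bound prompt plus a cheap post-check. It assumes retrieval returns `(doc_id, chunk_text)` pairs and that doc ids contain no square brackets; `build_cited_prompt` and `unknown_citations` are hypothetical helper names, and the prompt wording is one possible phrasing.

```python
import re


def build_cited_prompt(question, hits):
    """hits: list of (doc_id, chunk_text) pairs returned by retrieval."""
    context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the context below. "
        "Cite the source for every claim using [doc_id]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def unknown_citations(answer, hits):
    """Ids cited in the answer that were never part of the retrieved context."""
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    return cited - {doc_id for doc_id, _ in hits}


hits = [("runbook-12", "Restart the worker pool after a config change.")]
prompt = build_cited_prompt("How do I apply a config change?", hits)
```

A non-empty result from `unknown_citations` is a quick signal that the model cited something it was never shown, which can be logged or used to trigger a retry.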

10. What's next (Session 13 — RAG Part 2)

Retrieval depth: BM25 + vector hybrid, ColBERT late interaction
Re-ranking: cross-encoders, LLM-as-reranker
Generation prompts: structured, citation-bound
Evaluation: faithfulness, answer relevancy, context precision (RAGAS)

LeetCode practice: Design Search Autocomplete System

Link: https://leetcode.com/problems/design-search-autocomplete-system/
Difficulty: Hard
Why this problem: Trie of sentences with frequency at terminal nodes; sort matches by (freq desc, lex asc).
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

State two reasons RAG beats fine-tuning for fresh / private data.
Pick a chunking strategy for a markdown technical doc.
Choose an embedding model for English product docs and defend it.
Compare Faiss vs pgvector vs Pinecone for your scale.
Build the 40-line RAG above and improve one number (recall@5).
Solve design-search-autocomplete-system (Trie + frequency).

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.
