RAG Part 1 — Why, Chunking, Embeddings, Vector Stores
Session 9 of the 48-session learning series.
Date: Tue, 2026-06-16 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 06 · Est. read: 2 h
Why this session matters
This is Session 09 of the 48-session series, in the LLMs & Agents track. It covers one focused topic, paced so you have time to absorb it rather than rush.
Agenda
- Why RAG exists — the limits of context windows and fine-tuning
- Chunking strategies — fixed, sentence, semantic, structural
- Embedding overview — dense vectors, similarity, what 'good' looks like
- Vector stores — Faiss, pgvector, Pinecone, Weaviate, Milvus
- What we keep for Part 2 (retrieval ranking, re-rank, eval)
Pre-read (skim before the session)
- Greg Kamradt — 5 Levels Of Text Splitting (video)
- Lewis et al. 2020 — Retrieval-Augmented Generation paper
- LangChain RAG concepts
- OpenAI cookbook — RAG
Deep dive
1. The problem RAG solves
LLMs are great at language but they don't know your private docs, last week's PRs, or this quarter's incidents. Two options:
- Fine-tune them on your data — expensive, slow, freezes the knowledge.
- Retrieve relevant snippets at query time and stuff them into the prompt — cheap, fresh, debuggable.
RAG = retrieve relevant context, then generate. It's now the default architecture for product chatbots, copilots, search, anything that needs to ground a model in your data.
2. The simplest RAG pipeline
    Ingestion (offline)                 Query (online)
    -------------------                 --------------
    docs → chunk → embed → store        q → embed → top-k search
                             │                           │
                             ▼                           ▼
                        vector store    prompt = template(q, top-k) → LLM → answer
Three decisions dominate the quality you ship:
- How you chunk the documents.
- Which embedding model you use.
- How you retrieve (top-k? hybrid? re-rank?).
This session covers 1 and 2 deeply, 3 lightly. Session 13 (RAG Part 2) covers retrieval, re-ranking, generation, and evaluation in depth.
3. Chunking — the most underrated decision
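Retrieval returns whole chunks, so chunk boundaries decide what context the LLM sees. Chunks too big → noisy retrieval (the relevant sentence is diluted by unrelated text). Too small → the chunk loses the context needed to interpret it.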
Five levels (Greg Kamradt's framing):
- Fixed-size — every N characters / tokens. Simple, lossy on structure.
- Recursive character splitting — split on \n\n, then \n, then space, then char. Respects paragraph + sentence boundaries.
- Semantic chunking — embed sentences, then group adjacent sentences while their embedding similarity stays above a threshold. More compute at ingest, but usually a quality bump.
- Structural chunking — use the doc structure (Markdown headers, HTML divs, code blocks). Best for technical docs.
- Agentic chunking — ask an LLM to chunk it. Expensive at ingest, sometimes worth it.
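As an illustration of the recursive level, here is a minimal sketch in plain Python. It is not LangChain's implementation: sizes are counted in characters rather than tokens, the overlap is a raw character tail of the previous chunk (so it may start mid-word), and the function name `split_recursive` is chosen only to match the helper that the minimal RAG in section 7 assumes.

```python
def _split(text, size, seps):
    """Split `text` into pieces of at most `size` characters, coarsest separator first."""
    if len(text) <= size:
        return [text]
    if not seps or seps[0] == "":
        # Last resort: hard cut every `size` characters.
        return [text[i:i + size] for i in range(0, len(text), size)]
    sep, finer = seps[0], seps[1:]
    chunks, buf = [], ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= size:
            buf = candidate                # keep packing pieces into the current chunk
            continue
        if buf:
            chunks.append(buf)             # current chunk is full
            buf = ""
        if len(piece) <= size:
            buf = piece
        else:
            # A single piece is still too long: retry with a finer separator.
            chunks.extend(_split(piece, size, finer))
    if buf:
        chunks.append(buf)
    return chunks


def split_recursive(text, size=500, overlap=80, seps=("\n\n", "\n", " ", "")):
    """Return chunks of at most `size` characters; each chunk after the first
    starts with the last `overlap` characters of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    # Reserve room for the overlap so the final chunks never exceed `size`.
    base = [c for c in _split(text, size - overlap, seps) if c.strip()]
    out = []
    for i, chunk in enumerate(base):
        tail = base[i - 1][-overlap:] if i > 0 and overlap else ""
        out.append(tail + chunk)
    return out
```

Paragraph breaks are tried first, so short paragraphs stay intact; only an oversized paragraph falls back to line, word, and finally character cuts.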
Defaults that work:
- Chunk size: 300–800 tokens (model-dependent).
- Overlap: 50–100 tokens so cross-chunk references aren't lost.
- Metadata per chunk: (doc_id, chunk_idx, headings[], url, page_no, last_modified).
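For structural chunking with metadata, one possible shape is sketched below. It splits a Markdown document at headings and records the heading path on each chunk. The `Chunk` class and `chunk_markdown` function are hypothetical names for illustration, only three of the metadata fields listed above are carried, and sections longer than your chunk size would still need a second splitting pass.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Chunk:
    doc_id: str
    chunk_idx: int
    headings: list  # heading path, e.g. ["Guide", "Install"]
    text: str


def chunk_markdown(doc_id, markdown):
    """One chunk per heading section, tagged with its heading path."""
    chunks, path, body = [], [], []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(doc_id, len(chunks), list(path), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence        # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                        # close the section above this heading
            level = len(match.group(1))
            del path[level - 1:]           # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading path lets you prepend it to the chunk text before embedding, or show it as a breadcrumb next to a citation.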
4. Embeddings (overview — deep dive in Session 17)
An embedding model maps text → a dense vector in R^d (typically d = 384, 768, 1024, 1536, or 3072). Two texts with similar meaning map to nearby vectors, i.e. their cosine similarity is high.
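To make "nearby" concrete, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are invented toy values, not output from any real embedding model. It also shows why the Faiss snippet in section 7 can use an inner-product index: after L2 normalisation, the dot product equals the cosine.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


# Toy vectors (made up for illustration only).
refund = [0.9, 0.1, 0.0]       # "How do I get a refund?"
money_back = [0.8, 0.2, 0.1]   # "Can I have my money back?"
deploy = [0.0, 0.2, 0.9]       # "How do I deploy to staging?"

print(cosine(refund, money_back))  # close to 1: similar meaning
print(cosine(refund, deploy))      # close to 0: unrelated
```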
For English RAG in 2026:
| Model | Dim | Strength |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Cheap, fast, strong baseline |
| OpenAI text-embedding-3-large | 3072 | Best quality, more $/req |
| bge-large-en-v1.5 (BAAI) | 1024 | Open-source, strong on MTEB |
| nomic-embed-text-v1.5 | 768 | Open, long-context, fast |
| voyage-3 (Voyage AI) | 1024 | Specialised, leads several MTEB tasks |
Multi-lingual: bge-m3, multilingual-e5-large. For code: voyage-code-2, bge-code.
5. Vector stores — picking the right one
| Store | Hosting | Best for |
|---|---|---|
| Faiss | Library | Embedded, small/medium scale, full control |
| pgvector | In Postgres | One DB for everything, no new infra |
| Pinecone | Managed SaaS | Zero-ops, $$$ |
| Weaviate | Self/Managed | Multi-modal, hybrid (vector + keyword) |
| Milvus / Zilliz | Self (Milvus) / Managed (Zilliz) | Billion-scale, multi-tenant |
| Qdrant | Self/Managed | Rust core, fast, payload filters |
Rule of thumb: until ~10M chunks, pgvector (or Faiss in-process) is enough. Past 10M, look at Qdrant / Weaviate / Pinecone for ANN performance + multi-tenant features.
6. ANN — approximate nearest-neighbour
Exact NN over 1B vectors is too slow. ANN trades a tiny recall hit for huge speed:
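- HNSW — Hierarchical Navigable Small World graph. The usual default (Qdrant and Weaviate build on it; Faiss and pgvector offer it alongside IVF). Typically 90–99% recall at <10 ms for millions of vectors, depending on dimension and parameters.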
- IVF — cluster + inverted file. Better for billion-scale; slower to update.
- ScaNN (Google), DiskANN (Microsoft) — SOTA for very large indices.
For most product use cases: HNSW with M = 16, ef_construction = 200, ef_search = 50 is a strong default.
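To see how large the "tiny recall hit" is on your own data, a common check is to compare the ids an ANN index returns against exact search for a sample of queries. The sketch below is plain Python with brute-force inner-product scoring as ground truth. The function names are illustrative, and `approx_ids` stands for whatever your ANN index returned.

```python
import heapq


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k highest inner-product scores."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    return [i for _, i in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search also found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


def mean_recall(approx_results, exact_results):
    """Average recall@k over a batch of queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)
```

If mean recall is too low, raising `ef_search` is usually the first knob to try, at the cost of latency.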
7. A minimal RAG in 40 lines
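    from pathlib import Path

    import faiss
    import numpy as np
    import openai  # reads OPENAI_API_KEY from the environment

    # Assumed to exist: document_paths (iterable of file paths) and
    # split_recursive(text, size, overlap) -> list[str].
    docs = [Path(p).read_text(encoding="utf-8") for p in document_paths]
    chunks = [c for d in docs for c in split_recursive(d, size=500, overlap=80)]

    def embed(texts):
        """Embed a list of strings; returns a float32 array of shape (n, dim)."""
        r = openai.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([e.embedding for e in r.data], dtype="float32")

    emb = embed(chunks)                      # batch this call for large corpora
    faiss.normalize_L2(emb)                  # in place; unit vectors make inner product = cosine
    index = faiss.IndexFlatIP(emb.shape[1])  # exact (brute-force) inner-product index
    index.add(emb)

    def answer(q, k=4):
        qv = embed([q])
        faiss.normalize_L2(qv)
        _, I = index.search(qv, k)           # I has shape (1, k); -1 marks a missing hit
        ctx = "\n\n".join(chunks[i] for i in I[0] if i != -1)
        prompt = f"Context:\n{ctx}\n\nQuestion: {q}\nAnswer:"
        r = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content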
It works. It's also terrible — no re-ranking, no hybrid, no eval. Sessions 13 and 25 fix that.
8. Metadata filtering
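Real RAG is rarely 'search all docs'. Filter to the user's org / language / doc-type before ANN (pseudocode; the exact call depends on the store):
    index.search_with_filter(qv, k=4, filter={"org_id": user.org, "lang": "en"})
Qdrant and Weaviate make this a first-class feature; pgvector does it via SQL WHERE. Prefer pre-filtering where the store supports it: scoring happens only on the matching subset and you still get up to k matching results, whereas post-filtering can discard most of the top-k and return too few. Very selective filters can hurt graph indexes such as HNSW, so measure on your own data.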
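The difference between the two filtering orders can be sketched with brute-force scoring in plain Python. This illustrates the behaviour only, not any store's API; the item layout (`vec`, `meta`) and function names are assumptions made for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matches(meta, flt):
    """True when every key in the filter has the same value in the metadata."""
    return all(meta.get(key) == value for key, value in flt.items())


def search_prefilter(query, items, flt, k):
    """Filter first, then rank only the matching items."""
    candidates = [it for it in items if matches(it["meta"], flt)]
    candidates.sort(key=lambda it: dot(query, it["vec"]), reverse=True)
    return candidates[:k]


def search_postfilter(query, items, flt, k):
    """Rank everything, keep the top k, then drop non-matching items."""
    ranked = sorted(items, key=lambda it: dot(query, it["vec"]), reverse=True)
    return [it for it in ranked[:k] if matches(it["meta"], flt)]


items = [
    {"id": 1, "vec": [1.0, 0.0], "meta": {"org_id": "b"}},
    {"id": 2, "vec": [0.9, 0.1], "meta": {"org_id": "b"}},
    {"id": 3, "vec": [0.8, 0.2], "meta": {"org_id": "a"}},
    {"id": 4, "vec": [0.1, 0.9], "meta": {"org_id": "a"}},
]
query, flt = [1.0, 0.0], {"org_id": "a"}

pre = [it["id"] for it in search_prefilter(query, items, flt, k=2)]
post = [it["id"] for it in search_postfilter(query, items, flt, k=2)]
```

Here the two best-scoring items belong to another org, so post-filtering returns nothing while pre-filtering still returns two results for org "a".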
9. Common failure modes (RAG “doesn't work”)
- Retrieves the right doc, wrong section → chunking is too coarse / overlap too small.
- Doesn't retrieve the right doc at all → embedding model isn't strong enough for your domain; try a domain-specific one or fine-tune.
- Top-k irrelevant → add a re-ranker (cross-encoder) over top-30 → top-5. Session 13.
- Answer cites wrong source → prompt isn't telling the model to cite. Add: "Cite the source for every claim using [doc_id]".
- Hallucinated answer even with context → add an evaluation step (Session 25 — LLM evals).
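For the citation failure mode, one way to apply the fix is to tag each retrieved chunk with its id in the context and put the instruction in the prompt. The sketch below is illustrative: `build_prompt` and the chunk layout (`doc_id`, `text`) are assumed names, and the "say so" sentence is an extra instruction that is not in the original wording.

```python
def build_prompt(question, retrieved):
    """Build a prompt whose context blocks are labelled with [doc_id]."""
    blocks = [f"[{chunk['doc_id']}] {chunk['text']}" for chunk in retrieved]
    context = "\n\n".join(blocks)
    return (
        "Answer using only the context below. "
        "Cite the source for every claim using [doc_id]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


retrieved = [
    {"doc_id": "runbook-7", "text": "Restart the worker pool before re-running the job."},
    {"doc_id": "faq-2", "text": "Jobs time out after 30 minutes by default."},
]
prompt = build_prompt("Why did my job time out?", retrieved)
```

Because the ids appear verbatim in the context, you can later check each cited `[doc_id]` in the answer against the retrieved set.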
10. What's next (Session 13 — RAG Part 2)
- Retrieval depth: BM25 + vector hybrid, ColBERT late interaction
- Re-ranking: cross-encoders, LLM-as-reranker
- Generation prompts: structured, citation-bound
- Evaluation: faithfulness, answer relevancy, context precision (RAGAS)
Reading material
In-depth research material
- Lewis et al. — RAG paper (2020)
- HNSW (Malkov & Yashunin, 2018)
- ColBERT — efficient late interaction
- Lost in the Middle — long-context retrieval failure
Video reference
▶︎ Greg Kamradt — 5 Levels Of Text Splitting
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Search Autocomplete System
- Link: https://leetcode.com/problems/design-search-autocomplete-system/
- Difficulty: Hard
- Why this problem: Trie of sentences with frequency at terminal nodes; sort matches by (freq desc, lex asc).
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- State two reasons RAG beats fine-tuning for fresh / private data.
- Pick a chunking strategy for a markdown technical doc.
- Choose an embedding model for English product docs and defend it.
- Compare Faiss vs pgvector vs Pinecone for your scale.
- Build the 40-line RAG above and improve one number (recall@5).
- Solve design-search-autocomplete-system (Trie + frequency).
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.