ai ml · intermediate · 12m · 2026-06-09

RAG Part 1 — Why, Chunking, Embeddings, Vector Stores

Session 9 of the 48-session learning series.

Date: Tue, 2026-06-16 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 06 · Est. read: 2 h

Why this session matters

This is Session 9 of 48, in the LLMs & Agents track. Like the rest of the series, it sticks to one focused topic and is paced so that you have time to absorb it rather than rush.

Agenda

  • Why RAG exists — the limits of context windows and fine-tuning
  • Chunking strategies — fixed, sentence, semantic, structural
  • Embedding overview — dense vectors, similarity, what 'good' looks like
  • Vector stores — Faiss, pgvector, Pinecone, Weaviate, Milvus
  • What we keep for Part 2 (retrieval ranking, re-rank, eval)

Deep dive

1. The problem RAG solves

LLMs are great at language but they don't know your private docs, last week's PRs, or this quarter's incidents. Two options:

  • Fine-tune them on your data — expensive, slow, freezes the knowledge.
  • Retrieve relevant snippets at query time and stuff them into the prompt — cheap, fresh, debuggable.

RAG (retrieval-augmented generation) = retrieve relevant context, then generate. It's now the default architecture for product chatbots, copilots, search, and anything else that needs to ground a model in your data.

2. The simplest RAG pipeline

     Ingestion (offline)                  Query (online)
     -------------------                  --------------
   docs → chunk → embed → store      q → embed → top-k search →
                       │                                       │
                       ▼                                       ▼
               vector store                     prompt = template(q, top-k) → LLM → answer

Three decisions dominate the quality you ship:

  1. How you chunk the documents.
  2. Which embedding model you use.
  3. How you retrieve (top-k? hybrid? re-rank?).

This session covers 1 in depth, 2 as an overview, and 3 lightly. Session 13 (RAG Part 2) covers retrieval, re-ranking, generation, and evaluation in depth.

3. Chunking — the most underrated decision

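Retrieval returns whole chunks, so the chunk is the smallest unit of context the LLM ever sees. Chunks that are too big dilute the embedding and pull irrelevant text into the prompt; chunks that are too small lose the surrounding context needed to interpret them.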

Five levels (Greg Kamradt's framing):

  1. Fixed-size: every N characters / tokens. Simple, lossy on structure.
  2. Recursive character splitting: split on \n\n, then \n, then space, then char. Respects paragraph + sentence boundaries.
  3. Structural chunking: use the doc structure (Markdown headers, HTML divs, code blocks). Best for technical docs.
  4. Semantic chunking: embed sentences, then group adjacent sentences while their embedding similarity stays above a threshold. Costs more compute at ingest, but usually improves retrieval quality.
  5. Agentic chunking: ask an LLM to chunk it. Expensive at ingest, sometimes worth it.
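
As an illustration of structural chunking, here is a minimal sketch that splits a Markdown document into one chunk per heading section and keeps the heading path as metadata. The function name chunk_by_headings and the dict layout are hypothetical, chosen for this example; it only handles ATX-style headings (#, ##, ...) and skips lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading section.

    Each chunk records its heading path (e.g. ["Guide", "Install"]) so the
    path can be stored as metadata next to the text.
    """
    chunks: list[dict] = []
    path: list[str] = []   # current stack of headings, outermost first
    body: list[str] = []   # lines collected for the current section
    in_fence = False       # True while inside a fenced code block

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                        # close the previous section
            level = len(match.group(1))
            del path[level - 1:]           # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nintro\n## Install\npip install x\n## Use\nrun it\n"
for chunk in chunk_by_headings(doc):
    print(chunk["headings"], "->", chunk["text"])
```

Sections that are still too long after this pass can be handed to a size-based splitter, so the two approaches combine rather than compete.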

Defaults that work:

  • Chunk size: 300–800 tokens (model-dependent).
  • Overlap: 50–100 tokens so cross-chunk references aren't lost.
  • Metadata per chunk: (doc_id, chunk_idx, headings[], url, page_no, last_modified).
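
The size and overlap defaults above can be sketched as a sliding window. This is a simplified example: it treats whitespace-separated words as tokens (an assumption; a real pipeline would use the embedding model's tokenizer), and the names Chunk and window_chunks are hypothetical. Only some of the metadata fields are shown.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    chunk_idx: int
    text: str
    headings: list[str] = field(default_factory=list)
    url: str = ""


def window_chunks(doc_id: str, text: str, size: int = 500, overlap: int = 80) -> list[Chunk]:
    """Slide a window of `size` tokens over the text, stepping by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()  # stand-in for a real tokenizer
    step = size - overlap
    chunks: list[Chunk] = []
    for idx, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + size]
        chunks.append(Chunk(doc_id=doc_id, chunk_idx=idx, text=" ".join(window)))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


sample = " ".join(str(i) for i in range(10))
for c in window_chunks("doc-1", sample, size=4, overlap=1):
    print(c.chunk_idx, c.text)
```

Each window repeats the last `overlap` tokens of the previous one, which is what keeps a reference that straddles a boundary retrievable from at least one chunk.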

4. Embeddings (overview — deep dive in Session 17)

An embedding model maps text → a dense vector in R^d (typically d = 384, 768, 1024, 1536, 3072). Two texts with similar meaning map to nearby vectors (cosine similarity high).
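
To make "nearby" concrete, here is a small sketch of cosine similarity in plain Python. The three vectors are made-up 3-dimensional stand-ins, not the output of any real embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" with invented values, for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar direction
print(cosine_similarity(cat, invoice))  # close to 0: nearly orthogonal
```

If the vectors are L2-normalised first, the denominator is 1 and cosine similarity reduces to a plain inner product, which is why many stores index normalised vectors with an inner-product metric.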

For English RAG in 2026:

Model                            Dim    Strength
OpenAI text-embedding-3-small    1536   Cheap, fast, strong baseline
OpenAI text-embedding-3-large    3072   Best quality, more $/req
bge-large-en-v1.5 (BAAI)         1024   Open-source, strong on MTEB
nomic-embed-text-v1.5            768    Open, long-context, fast
voyage-3 (Voyage AI)             1024   Specialised, leads several MTEB tasks

Multi-lingual: bge-m3, multilingual-e5-large. For code: voyage-code-2, bge-code.

5. Vector stores — picking the right one

Store              Hosting         Best for
Faiss              Library         Embedded, small/medium scale, full control
pgvector           In Postgres     One DB for everything, no new infra
Pinecone           Managed SaaS    Zero-ops, $$$
Weaviate           Self/Managed    Multi-modal, hybrid (vector + keyword)
Milvus / Zilliz    Self/Managed    Billion-scale, multi-tenant
Qdrant             Self/Managed    Rust core, fast, payload filters

Rule of thumb: until ~10M chunks, pgvector (or Faiss in-process) is enough. Past 10M, look at Qdrant / Weaviate / Pinecone for ANN performance + multi-tenant features.
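
A rough storage estimate helps when applying this rule of thumb. The sketch below counts only the raw float32 vectors (4 bytes per dimension); index structures such as HNSW graph links and any stored metadata add to it, so treat the result as a lower bound. The function names are hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the vectors alone, with no index overhead."""
    return num_vectors * dim * bytes_per_value


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


for dim in (384, 768, 1536, 3072):
    size = to_gib(raw_vector_bytes(10_000_000, dim))
    print(f"10M vectors x {dim:>4} dims: {size:6.1f} GiB")
```

At 10M chunks and 1536 dimensions this comes to roughly 57 GiB of raw vectors, which is one reason the embedding dimension matters as much as the chunk count when sizing a store.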

6. ANN — approximate nearest-neighbour

Exact nearest-neighbour search scores the query against every vector, which is too slow at 1B vectors. ANN trades a small recall hit for a large speed-up:

  • HNSW: Hierarchical Navigable Small World graph. The usual default, available in Faiss, Qdrant, and pgvector. Typically 90–99% recall at <10 ms for millions of vectors.
  • IVF: cluster + inverted file. Better for billion-scale; costlier to keep fresh, since it needs a training step and the clusters go stale as the data drifts.
  • ScaNN (Google), DiskANN (Microsoft): SOTA for very large indices.

For most product use cases: HNSW with M = 16, ef_construction = 200, ef_search = 50 is a strong default.
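
When tuning these parameters, the number to watch is recall@k: the share of the true top-k neighbours that the ANN index actually returns. A minimal sketch of that measurement follows, using brute-force search as ground truth. The ANN result here is a hand-written stand-in, not the output of a real index, and the helper names are hypothetical.

```python
def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Brute-force inner-product search: score every vector, keep the best k ids."""
    scored = [(sum(q * v for q, v in zip(query, vec)), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [i for _, i in scored[:k]]


def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.3], [0.1, 0.9]]
query = [1.0, 0.0]

truth = exact_top_k(query, vectors, k=3)
approx = [0, 1, 4]  # hypothetical ANN output that missed one true neighbour

print(truth, recall_at_k(approx, truth))
```

Running this over a few hundred sample queries while varying ef_search gives the recall/latency curve for your own data, which is more reliable than any quoted default.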

7. A minimal RAG in 40 lines

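from pathlib import Path
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
document_paths = sorted(Path("docs").glob("**/*.md"))  # assumption: point this at your corpus

def split_recursive(text, size=500, overlap=80, seps=("\n\n", "\n", " ")):
    if len(text) <= size:
        return [text] if text.strip() else []
    if not seps:  # no separator left: cut fixed windows with overlap
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]
    out, buf = [], ""
    for part in text.split(seps[0]):
        if len(buf) + len(seps[0]) + len(part) <= size:
            buf = f"{buf}{seps[0]}{part}" if buf else part  # keep packing small pieces
        else:
            out += split_recursive(buf, size, overlap, seps[1:])
            buf = part
    return out + split_recursive(buf, size, overlap, seps[1:])

docs = [p.read_text(encoding="utf-8") for p in document_paths]
chunks = [c for d in docs for c in split_recursive(d, size=500, overlap=80)]  # sizes in characters

def embed(texts):
    r = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([e.embedding for e in r.data], dtype="float32")

emb = embed(chunks)
faiss.normalize_L2(emb)  # in place
index = faiss.IndexFlatIP(emb.shape[1])  # inner-product = cosine on L2-normalised
index.add(emb)

def answer(q, k=4):
    qv = embed([q])
    faiss.normalize_L2(qv)
    _, ids = index.search(qv, k)
    ctx = "\n\n".join(chunks[i] for i in ids[0] if i >= 0)  # -1 marks an empty result slot
    prompt = f"Context:\n{ctx}\n\nQuestion: {q}\nAnswer:"
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content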

It works. It's also terrible — no re-ranking, no hybrid, no eval. Sessions 13 and 25 fix that.

8. Metadata filtering

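Real RAG is rarely 'search all docs'. Filter to the user's org / language / doc-type before ANN (pseudocode; the exact call depends on the store):

# Pseudocode: not a real Faiss method
index.search_with_filter(qv, k=4, filter={"org_id": user.org, "lang": "en"})

Qdrant and Weaviate make this a first-class feature; pgvector does it via SQL WHERE. Pre-filtering is usually preferable to post-filtering: scoring is restricted to the filtered subset, and you still get up to k results, whereas filtering after the search can leave fewer than k.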
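
The difference between the two orders shows up in how many results survive. The sketch below uses exact search over four toy vectors as a stand-in for ANN; the data, the function names, and the filter format are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matches(item, where):
    """True when every filter field equals the item's metadata value."""
    return all(item["meta"].get(key) == value for key, value in where.items())


def top_k(query, items, k):
    return sorted(items, key=lambda it: dot(query, it["vec"]), reverse=True)[:k]


def pre_filter_search(query, items, k, where):
    # Filter first, then rank only the allowed items.
    return top_k(query, [it for it in items if matches(it, where)], k)


def post_filter_search(query, items, k, where):
    # Rank everything, then drop hits that fail the filter.
    return [it for it in top_k(query, items, k) if matches(it, where)]


items = [
    {"id": "a1", "vec": [0.6, 0.4], "meta": {"org_id": "acme"}},
    {"id": "a2", "vec": [0.5, 0.5], "meta": {"org_id": "acme"}},
    {"id": "g1", "vec": [1.0, 0.0], "meta": {"org_id": "globex"}},
    {"id": "g2", "vec": [0.9, 0.1], "meta": {"org_id": "globex"}},
]
query = [1.0, 0.0]
where = {"org_id": "acme"}

pre = [it["id"] for it in pre_filter_search(query, items, 2, where)]
post = [it["id"] for it in post_filter_search(query, items, 2, where)]
print("pre-filter: ", pre)
print("post-filter:", post)
```

Here the two best matches overall belong to another org, so the post-filter variant returns nothing for this user while the pre-filter variant still returns two results.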

9. Common failure modes (RAG “doesn't work”)

  • Retrieves the right doc, wrong section → chunking is too coarse / overlap too small.
  • Doesn't retrieve the right doc at all → embedding model isn't strong enough for your domain; try a domain-specific one or fine-tune.
  • Top-k irrelevant → add a re-ranker (cross-encoder) over top-30 → top-5. Session 13.
  • Answer cites wrong source → prompt isn't telling the model to cite. Add: "Cite the source for every claim using [doc_id]".
  • Hallucinated answer even with context → add an evaluation step (Session 25 — LLM evals).
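
For the citation failure mode, a sketch of a prompt builder that tags every chunk with its doc_id and asks for a citation per claim. The function name, the chunk dict layout, and the sample doc ids are hypothetical; the wording of the instructions is one option among many and worth testing against your own evals.

```python
def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Format retrieved chunks with [doc_id] tags and request a citation per claim."""
    context = "\n\n".join(f"[{c['doc_id']}] {c['text']}" for c in chunks)
    return (
        "Answer the question using only the context below.\n"
        "Cite the source for every claim using [doc_id].\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    {"doc_id": "runbook-7", "text": "Restart the worker after rotating the API key."},
    {"doc_id": "faq-2", "text": "API keys expire after 90 days."},
]
print(build_cited_prompt("How often do API keys expire?", retrieved))
```

Because the tags in the context match the citation format the model is asked to use, a post-processing step can check that every cited id was actually among the retrieved chunks.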

10. What's next (Session 13 — RAG Part 2)

  • Retrieval depth: BM25 + vector hybrid, ColBERT late interaction
  • Re-ranking: cross-encoders, LLM-as-reranker
  • Generation prompts: structured, citation-bound
  • Evaluation: faithfulness, answer relevancy, context precision (RAGAS)

Video reference

▶︎ Greg Kamradt — 5 Levels Of Text Splitting

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Design Search Autocomplete System

Post-session checklist

By the end of this session you should be able to:

  • State two reasons RAG beats fine-tuning for fresh / private data.
  • Pick a chunking strategy for a markdown technical doc.
  • Choose an embedding model for English product docs and defend it.
  • Compare Faiss vs pgvector vs Pinecone for your scale.
  • Build the 40-line RAG above and improve one number (recall@5).
  • Solve design-search-autocomplete-system (Trie + frequency).
