Going Deeper into RAG: Vector Databases, Embeddings, and How They Work
More details on the learning resources are published on Notion: Notion
RAG, which stands for Retrieval-Augmented Generation, is an advanced technique in natural language processing that combines information retrieval with text generation. This approach enhances the capabilities of large language models by allowing them to access and utilize external knowledge sources.
Key Components of RAG:
- Vector Databases: Specialized databases designed to store and efficiently query high-dimensional vectors, which represent semantic information about text.
- Embeddings: Dense vector representations of text that capture semantic meaning, allowing for efficient similarity comparisons.
- Retrieval Mechanism: The component that searches the vector database to find relevant information based on the input query.
- Language Model: A large language model that generates responses based on the retrieved information and the original query.
How RAG Works:
1. The input query is converted into an embedding.
2. The most similar embeddings are retrieved from the vector database.
3. The retrieved chunks are combined with the original query.
4. The language model generates a response based on this combined input.
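The four steps above can be sketched in a few lines of Python. This is a toy illustration only: `embed()` is a stand-in bag-of-words counter rather than a real embedding model, and the final step returns the assembled prompt instead of calling an LLM.

```python
def embed(text):
    # Stand-in "embedding": word counts over a tiny fixed vocabulary.
    # A real system would call an embedding model here.
    vocab = ["rag", "vector", "database", "embedding", "retrieval"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def similarity(a, b):
    # Dot product as a crude similarity score.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, store, k=2):
    # Step 2: find the k stored chunks most similar to the query.
    ranked = sorted(store, key=lambda c: similarity(query_vec, c["vec"]),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

def rag_answer(query, store):
    query_vec = embed(query)                # Step 1: embed the query
    context = retrieve(query_vec, store)    # Step 2: retrieve similar chunks
    # Step 3: combine retrieved context with the original query.
    prompt = f"Context: {' | '.join(context)}\nQuestion: {query}"
    return prompt  # Step 4 would pass this prompt to the language model

store = [{"text": t, "vec": embed(t)} for t in [
    "A vector database stores embedding vectors.",
    "Retrieval finds relevant chunks for a query.",
    "Bananas are yellow.",
]]
print(rag_answer("How does a vector database work?", store))
```

The irrelevant chunk about bananas scores zero similarity and is ranked last, so the prompt handed to the model contains only the relevant context.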
This blog will explore each of these components in depth, discussing their implementations, challenges, and best practices.
Stay tuned for the full article!
Background & Prerequisites — What You Need to Know Before Writing This Blog
To write this blog comprehensively, the following topics must be studied in depth. Each section explains the concept, why it matters for RAG, and what depth is needed.
1. Embeddings — The Foundation of Semantic Search
Why: Embeddings convert text into numerical vectors. Without understanding them, you cannot explain how RAG retrieves relevant documents.
- What are embeddings — Dense vector representations (e.g., 768 or 1536 dimensions) of text that capture semantic meaning. Similar texts have vectors that are close together in vector space.
- How embeddings are created — Neural networks (transformers) encode text into fixed-size vectors. The model is trained so that semantically similar inputs produce similar vectors.
- Popular embedding models —
- OpenAI text-embedding-ada-002 (1536 dimensions, good general-purpose)
- OpenAI text-embedding-3-small / text-embedding-3-large (newer, supports dimension reduction)
- Sentence-BERT (all-MiniLM-L6-v2, 384 dims, open-source, fast)
- Cohere embed-english-v3.0
- BGE (BAAI General Embedding, consistently near the top of the MTEB leaderboard)
- Distance metrics — Cosine similarity (angle between vectors, most common), Euclidean distance (L2 norm), Dot product (unnormalized cosine). Understand when to use each.
- Embedding drift — If you change your embedding model, all previously stored embeddings must be regenerated.
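The three distance metrics above can be compared on toy 3-dimensional vectors (real embeddings have hundreds of dimensions, but the math is identical). Note how cosine similarity ignores magnitude while Euclidean distance does not:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length
c = np.array([-1.0, 0.0, 1.0])

def cosine_similarity(u, v):
    # Angle between vectors: magnitude-invariant, most common for embeddings.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    # L2 norm of the difference: sensitive to vector magnitude.
    return float(np.linalg.norm(u - v))

def dot_product(u, v):
    # Unnormalized; equals cosine similarity when both vectors are unit length.
    return float(np.dot(u, v))

print(cosine_similarity(a, b))   # 1.0 -- identical direction
print(euclidean_distance(a, b))  # ~3.74 -- yet far apart by L2
print(dot_product(a, c))         # 2.0
```

This is why many vector databases normalize embeddings on insert: once vectors are unit length, the cheap dot product and cosine similarity give the same ranking.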
2. Chunking Strategies
Why: Documents must be split into chunks before embedding. Chunk size and strategy directly impact retrieval quality.
- Fixed-size chunking — Split by character count or token count (e.g., 500 tokens with 50-token overlap). Simple, but can break sentences/paragraphs mid-thought.
- Sentence-based chunking — Split by sentence boundaries using NLP tokenizers (spaCy, NLTK). Preserves grammatical units.
- Paragraph/section chunking — Split by markdown headers, HTML sections, or double newlines. Best for structured documents.
- Semantic chunking — Use embeddings to detect topic shifts within a document and split at transition points. Most sophisticated but expensive.
- Recursive chunking — LangChain's approach: try to split by paragraphs, then sentences, then characters, with a max chunk size.
- Overlap — Chunks should overlap (typically 10–20%) to avoid losing context at boundaries.
- Metadata enrichment — Attach the source document name, page number, section heading, etc. to each chunk for citation and filtering.
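A minimal sketch of the first strategy, fixed-size chunking with overlap. It splits on whitespace "tokens" for simplicity; a real pipeline would count tokens with the embedding model's tokenizer (e.g., tiktoken) rather than `str.split`.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Split text into chunks of `chunk_size` tokens, each sharing
    # `overlap` tokens with the previous chunk.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks

doc = " ".join(f"w{i}" for i in range(12))
for c in chunk_text(doc, chunk_size=5, overlap=2):
    print(c)
```

With a 12-token document, chunk size 5, and overlap 2, each chunk starts 3 tokens after the previous one, so the last 2 tokens of every chunk reappear at the start of the next — the overlap that preserves context across boundaries.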
3. Vector Databases
Why: You need a database optimized for storing and searching dense vectors.
- What they do — Store embedding vectors alongside metadata. Support approximate nearest neighbor (ANN) search to find the most similar vectors to a query.
- ANN algorithms —
  - HNSW (Hierarchical Navigable Small World) — Most popular. Builds a multi-layer graph for fast traversal. High recall, high memory.
  - IVF (Inverted File) — Clusters vectors, searches only nearby clusters. Lower memory, needs training.
  - PQ (Product Quantization) — Compresses vectors for lower memory. Trades accuracy for space.
- Popular options —
  - Pinecone — Fully managed, simple API, serverless option. Best for getting started.
  - Weaviate — Open-source, supports hybrid search (vector + keyword), GraphQL API.
  - Qdrant — Open-source, Rust-based, excellent performance, rich filtering.
  - ChromaDB — Lightweight, Python-native, good for prototyping. Runs in-process.
  - Azure AI Search — Enterprise-grade, integrates with the Azure ecosystem, supports hybrid search.
  - pgvector — PostgreSQL extension. Use existing Postgres infrastructure for vector search.
  - FAISS — Facebook's library. Not a database (no persistence by default), but extremely fast for batch operations.
- Filtering — Pre-filter (filter metadata before vector search) vs. post-filter (search first, filter after). Pre-filter is more efficient.
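Conceptually, a vector database answers one query: "given this vector, which stored vectors are closest, subject to a metadata filter?" The sketch below does this with an exact brute-force scan and pre-filtering; ANN indexes like HNSW or IVF exist precisely to replace this linear scan with something sublinear.

```python
import numpy as np

def search(query_vec, vectors, metadata, k=2, where=None):
    # Pre-filter: restrict candidates by metadata BEFORE the vector scan.
    idx = [i for i, m in enumerate(metadata) if where is None or where(m)]
    cand = vectors[idx]
    # Cosine similarity of the query against every remaining vector.
    sims = cand @ query_vec / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)[:k]
    return [(idx[i], float(sims[i])) for i in order]

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
metadata = [{"source": "a.pdf"}, {"source": "b.pdf"}, {"source": "a.pdf"}]
hits = search(np.array([1.0, 0.0]), vectors, metadata, k=1,
              where=lambda m: m["source"] == "a.pdf")
print(hits)  # best match among chunks from a.pdf only
```

Although the nearest overall neighbor is row 1 (`b.pdf`), the pre-filter removes it before scoring, so only `a.pdf` chunks compete — which is why pre-filtering is cheaper than scoring everything and discarding mismatches afterwards.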
4. Retrieval Strategies
Why: How you retrieve context determines the quality of the generated answer.
- Naive retrieval — Embed the query, find the top-K nearest chunks. Simple, but can miss nuance.
- HyDE (Hypothetical Document Embeddings) — Have the LLM generate a hypothetical answer first, embed that, then search. Often retrieves better results because the hypothetical answer is closer in embedding space to actual documents.
- Multi-query retrieval — Generate multiple reformulations of the user's query, retrieve for each, then merge and deduplicate results.
- Parent document retrieval — Index small chunks for precision, but return the parent (larger) chunk for context.
- Hybrid search — Combine vector search (semantic) with BM25/keyword search (lexical). Reciprocal Rank Fusion (RRF) merges the two result lists.
- Re-ranking — After initial retrieval, use a cross-encoder model (e.g., Cohere Rerank, BGE Reranker) to re-score results. Cross-encoders are more accurate than bi-encoders but slower.
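Reciprocal Rank Fusion, the merge step used by hybrid search, fits in a few lines. Each document scores the sum of `1 / (k + rank)` over every ranking it appears in; `k = 60` is the constant from the original RRF paper.

```python
def rrf(rankings, k=60):
    # Fuse several ranked lists into one, rewarding documents that
    # rank highly in (or simply appear in) multiple lists.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # lexical ranking
print(rrf([vector_hits, bm25_hits]))
```

`doc_b` wins the fused ranking because it ranks well in both lists, even though neither retriever ranked it first — the behavior that makes RRF a robust default for combining semantic and keyword results.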
5. Prompt Engineering for RAG
Why: How you inject retrieved context into the LLM prompt determines answer quality.
- System prompt design — Instruct the model to use only the provided context, cite sources, and say "I don't know" when the context is insufficient.
- Context window management — LLMs have token limits (4K, 8K, 32K, 128K). Retrieved chunks must fit within the context window minus prompt and response tokens.
- Stuffing vs. Map-Reduce vs. Refine —
  - Stuffing — Put all retrieved chunks in one prompt. Simple, but limited by the context window.
  - Map-Reduce — Summarize each chunk independently, then combine the summaries. Good for large document sets.
  - Refine — Iterate through chunks, refining the answer with each. Good for detail, but slow.
- Citation/attribution — Include chunk metadata in the prompt so the LLM can cite sources ("[Source: document.pdf, page 5]").
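A sketch of the stuffing approach with source attribution and a crude context budget. The system-prompt wording and the character-based budget are illustrative; a real implementation would budget in tokens, not characters.

```python
SYSTEM_PROMPT = (
    "Answer using ONLY the context below. Cite sources as "
    "[Source: file, page]. If the context is insufficient, say "
    "\"I don't know.\""
)

def build_prompt(question, chunks, max_context_chars=2000):
    # Stuff retrieved chunks into one prompt, tagging each with its
    # metadata so the model can cite it, and stopping at the budget.
    parts, used = [], 0
    for chunk in chunks:
        tag = f"[Source: {chunk['source']}, page {chunk['page']}]"
        block = f"{tag}\n{chunk['text']}"
        if used + len(block) > max_context_chars:
            break  # stand-in for real token-based window management
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"

chunks = [
    {"text": "RAG combines retrieval with generation.",
     "source": "rag.pdf", "page": 5},
    {"text": "Embeddings are dense vectors.",
     "source": "emb.pdf", "page": 2},
]
print(build_prompt("What is RAG?", chunks))
```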
6. Evaluation of RAG Systems
Why: You need to measure whether your RAG system is working well.
- Retrieval metrics —
  - Recall@K: What fraction of relevant documents appear in the top-K results?
  - MRR (Mean Reciprocal Rank): How high is the first relevant result?
  - NDCG: Are results ordered by relevance?
- Generation metrics —
  - Faithfulness: Does the answer only use information from the retrieved context? (No hallucination)
  - Answer relevance: Is the answer relevant to the question?
  - Context relevance: Was the retrieved context relevant?
- Frameworks — RAGAS (automated RAG evaluation), DeepEval, TruLens.
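The two simplest retrieval metrics above are easy to compute by hand, which is a good sanity check before reaching for an evaluation framework. Here `retrieved` is a ranked list of document ids from the retriever and `relevant` is the ground-truth set for the query.

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant docs that appear in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(all_retrieved, all_relevant):
    # Mean Reciprocal Rank: 1/rank of the first relevant hit,
    # averaged over all queries (0 contribution if nothing relevant).
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

retrieved = ["d3", "d1", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only d1 was found
print(mrr([retrieved], [relevant]))           # 0.5: first hit at rank 2
```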
7. End-to-End RAG Frameworks
Why: Practical tools that wire everything together.
- LangChain — Most popular. Provides chains, retrievers, memory, agents. Python and JS.
- LlamaIndex — Focused on data ingestion and indexing. Strong document parsing, node relationships.
- Semantic Kernel — Microsoft's framework. Integrates with Azure services.
- Haystack — By deepset. Pipeline-based, supports both extractive and generative QA.
TODO / Remaining Work
- [ ] Write detailed explanation of embeddings with visual diagram (2D projection)
- [ ] Implement a chunking comparison — same document, different strategies, compare retrieval quality
- [ ] Set up a vector database (ChromaDB for local, Azure AI Search for cloud)
- [ ] Build an end-to-end RAG pipeline: ingest → chunk → embed → store → retrieve → generate
- [ ] Add code examples in Python (LangChain or LlamaIndex)
- [ ] Write evaluation section with RAGAS metrics on a sample dataset
- [ ] Add architecture diagram (Mermaid) showing the full RAG flow
- [ ] Compare naive retrieval vs hybrid search vs re-ranking
- [ ] Add cost/latency analysis for different embedding models and vector DBs
- [ ] Add section on production considerations (caching, streaming, monitoring)