Embeddings, Vector Spaces, Contrastive Learning
Session 17 of the 48-session learning series.
Date: Mon, 2026-06-22 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 08 · Est. read: 2 h
Why this session matters
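This is Session 17 of 48, in the LLMs & Agents track. It covers a single topic, embeddings, paced so you have time to absorb it rather than rush. Embeddings turn retrieval, clustering, classification and recommendation into nearest-neighbour search, which is why the session spends its time on how they are trained, compared and indexed.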
Agenda
- What is an embedding — geometry of meaning
- Word2vec → Sentence-BERT → modern instruction-tuned embeddings
- Contrastive learning — InfoNCE, in-batch negatives, hard negatives
- Similarity metrics — cosine vs dot vs euclidean; when each makes sense
- ANN indexes — HNSW, IVF, PQ, ScaNN; choosing recall vs latency vs memory
Pre-read (skim before the session)
- Word2Vec — Mikolov et al. 2013
- Sentence-BERT — Reimers & Gurevych 2019
- SimCSE — Gao et al. 2021
- HNSW — Malkov & Yashunin 2018
Deep dive
1. The geometry of meaning
An embedding maps a piece of content (word, sentence, image, user) to a fixed-length vector. The trick: train so that meaning maps to geometry. Similar things → similar vectors. Then you can do retrieval, clustering, classification, recommendation as nearest-neighbour search.
king ≈ [+0.42, -0.18, ..., +0.07]
queen ≈ [+0.41, -0.20, ..., +0.09] ← close to king
banana ≈ [-0.32, +0.55, ..., -0.91] ← far from king
cos(king, queen) ≈ 0.92
cos(king, banana) ≈ 0.05
The Word2vec classic — king - man + woman ≈ queen — was a striking demonstration that semantic relationships became linear directions in the embedding space.
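The similarity numbers above come from a cosine computation, which can be sketched in a few lines. The 3-dim vectors below are hypothetical toy values that only borrow the visible components from the illustration; they are not output from any real model, so the exact numbers differ from the 0.92 and 0.05 shown.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, for illustration only.
king = [0.42, -0.18, 0.07]
queen = [0.41, -0.20, 0.09]
banana = [-0.32, 0.55, -0.91]

sim_close = cosine(king, queen)
sim_far = cosine(king, banana)
print(f"cos(king, queen)  = {sim_close:.3f}")
print(f"cos(king, banana) = {sim_far:.3f}")
```

Nearest-neighbour retrieval is then just "rank every stored vector by this score and keep the top few".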
2. The progression
Word2vec (2013): predict the centre word from its surrounding words (CBOW) or the surrounding words from the centre word (skip-gram). The original models were trained on a Google News corpus; one vector per word, regardless of context.
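The difference between the two Word2vec objectives is easiest to see in the training examples each one generates. This is a minimal sketch of example construction only (no model, no training); the function names and the `window` parameter are illustrative choices, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: the centre word is the input, each neighbour is a target."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the neighbours are the input, the centre word is the target."""
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, centre))
    return examples

sentence = "the king rules the land".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
print(sg[:3])
print(cb[1])
```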
GloVe (2014): factorise the co-occurrence matrix. Same regime, different objective.
Sentence-BERT (2019): take BERT and fine-tune it in a siamese setup on labelled sentence pairs (NLI, STS). One vector per sentence.
SimCSE (2021): contrastive learning with dropout as augmentation. Hugely improved sentence embeddings without needing labelled pairs.
Modern (2024+): text-embedding-3-large (OpenAI), voyage-3, bge-m3, nv-embed-v2. These are typically instruction-tuned, often multilingual, and often trained on hundreds of millions of contrastive pairs. Top MTEB leaderboard entries average roughly 70 (MTEB reports a mean over task-specific metrics, not a single mAP figure).
3. Contrastive learning — the modern recipe
You want anchor and positive close; anchor and negatives far. The InfoNCE loss:
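L = -log [ exp(sim(a, p) / τ) / ( exp(sim(a, p) / τ) + Σ_n exp(sim(a, n) / τ) ) ]
Here n ranges over the negatives, so the denominator covers the positive plus every negative. τ (temperature, ~0.05–0.1) sharpens or softens the distribution: lower τ sharpens it, so the hardest negatives dominate the loss.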
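To see the temperature effect numerically, here is a small sketch of InfoNCE for a single anchor, in its standard form where the denominator sums over the positive and all negatives. The similarity values are hypothetical, and `hard_negative_share` is an illustrative helper, not a standard metric.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.05):
    """InfoNCE for one anchor: -log softmax of the positive over {positive + negatives}."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

def hard_negative_share(sim_negs, tau):
    """Fraction of the negatives' softmax weight carried by the most similar negative."""
    weights = [math.exp(s / tau) for s in sim_negs]
    return max(weights) / sum(weights)

# Hypothetical cosine similarities: one positive, one hard negative, two easy negatives.
sim_pos = 0.80
sim_negs = [0.70, 0.10, -0.20]

loss_sharp = info_nce(sim_pos, sim_negs, tau=0.05)
loss_soft = info_nce(sim_pos, sim_negs, tau=1.0)
share_sharp = hard_negative_share(sim_negs, tau=0.05)
share_soft = hard_negative_share(sim_negs, tau=1.0)
print(f"tau=0.05: loss={loss_sharp:.4f}, hard-negative share={share_sharp:.4f}")
print(f"tau=1.00: loss={loss_soft:.4f}, hard-negative share={share_soft:.4f}")
```

At low temperature almost all of the negative weight sits on the single hard negative; at τ = 1 the easy negatives contribute nearly as much.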
The cleverest practical trick: in-batch negatives. Stack a batch of (anchor, positive) pairs; treat every other example's positive as a negative. Batch of 256 → 255 free negatives per anchor.
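A rough sketch of the in-batch trick, assuming dot-product similarity on unit-norm vectors: row i of the batch similarity matrix is a classification problem whose correct class is column i. The toy 2-dim embeddings are hypothetical; a real implementation would do this with one matrix multiply and a cross-entropy call in a tensor library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_info_nce(anchors, positives, tau=0.05):
    """Mean InfoNCE over a batch: positives[i] is the positive for anchors[i],
    and every other positives[j] acts as a negative for anchor i."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / tau for p in positives]  # one row of the B x B matrix
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])     # target is the diagonal entry
    return sum(losses) / len(losses)

# Hypothetical unit-norm embeddings for a batch of 3 pairs.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # each positive matches its anchor
shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]   # positives paired with the wrong anchors

loss_good = in_batch_info_nce(anchors, aligned)
loss_bad = in_batch_info_nce(anchors, shuffled)
negatives_per_anchor = len(anchors) - 1
print(loss_good, loss_bad, negatives_per_anchor)
```

The negatives cost nothing extra because the positives are already encoded for their own anchors; only the similarity matrix grows with batch size.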
Hard negative mining — random negatives are too easy. Pull negatives that are semantically close but wrong (BM25 hits that didn't match the gold answer, sibling categories, paraphrases of wrong answers). Boosts margin and quality.
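One way to sketch the mining step: score every candidate against the query, drop the gold answer, and keep the highest-scoring leftovers. The text mentions BM25 hits as a source of candidates; this sketch swaps in embedding dot products as the scorer purely to stay self-contained, and the document ids and vectors are hypothetical.

```python
def mine_hard_negatives(query_vec, candidates, gold_id, k=2):
    """Return the k highest-scoring candidate ids that are not the gold answer.

    candidates: dict mapping doc id -> embedding (assumed roughly unit-length).
    """
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in candidates.items()
        if doc_id != gold_id
    ]
    scored.sort(reverse=True)  # most similar wrong answers first
    return [doc_id for _, doc_id in scored[:k]]

query = [1.0, 0.0]
docs = {
    "gold": [0.98, 0.20],
    "near_miss": [0.95, 0.31],
    "sibling": [0.80, 0.60],
    "unrelated": [-0.60, 0.80],
}
hard = mine_hard_negatives(query, docs, gold_id="gold", k=2)
print(hard)
```

In practice the mined negatives are usually filtered further, since some high-scoring "negatives" turn out to be unlabelled positives.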
4. Similarity metrics
| Metric | Formula | Use when |
|---|---|---|
| Cosine | (a · b) / (‖a‖‖b‖) | Embeddings of varying magnitude; default for normalised embeddings |
| Dot product | a · b | Embeddings already unit-normalised → equivalent to cosine; popular because of FAISS support |
| Euclidean | ‖a - b‖ | When magnitude matters (rare for text) |
Always L2-normalise modern embeddings to unit length, then use dot product (= cosine). This skips the two norm computations and the division on every comparison.
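The equivalence is quick to check numerically. The sketch below uses two arbitrary example vectors and also verifies the related identity that, for unit vectors, squared Euclidean distance equals 2 − 2·cos, so all three metrics rank neighbours identically after normalisation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_hat, b_hat = l2_normalise(a), l2_normalise(b)

cos_ab = cosine(a, b)                  # computed on the raw vectors
dot_normalised = dot(a_hat, b_hat)     # plain dot product after normalising once
sq_dist = sum((x - y) ** 2 for x, y in zip(a_hat, b_hat))

print(cos_ab, dot_normalised, sq_dist)
```

Normalisation happens once at write time, so the per-query cost is a bare dot product.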
5. ANN indexes — exact search doesn't scale
A linear scan over 100 M float32 vectors at 1024 dims means ~400 GB of raw vectors and seconds per query. Approximate Nearest Neighbour (ANN) indexes trade a little recall for large speed-ups:
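The ~400 GB figure is straightforward arithmetic, assuming 4-byte float32 values. The helper names below are illustrative.

```python
def raw_index_bytes(n_vectors, dims, bytes_per_value=4):
    """Bytes needed to hold every vector uncompressed (4 bytes per value = float32)."""
    return n_vectors * dims * bytes_per_value

def multiply_adds_per_query(n_vectors, dims):
    """A flat scan computes one dims-long dot product per stored vector."""
    return n_vectors * dims

size_bytes = raw_index_bytes(100_000_000, 1024)
work = multiply_adds_per_query(100_000_000, 1024)

print(f"raw vectors: {size_bytes / 1e9:.1f} GB")
print(f"multiply-adds per query: {work:.3e}")
```

Every query has to stream the whole array through memory, which is why the latency is driven by bandwidth as much as by arithmetic.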
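HNSW (Hierarchical Navigable Small World) — graph-based. Each node keeps a bounded number of links per layer (the parameter M), and the number of layers grows roughly as log N. A query enters at the sparse top layer and descends greedily to the dense bottom layer. Recall 95–99% at sub-ms latency is typical for in-memory indexes. Memory: raw vectors plus graph links, up to ~2× raw vectors. The default for most workloads.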
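The core move inside an HNSW query is a greedy walk: from the current node, step to whichever neighbour is closer to the query, and stop when none is. The sketch below shows only that walk on a single hand-built layer; it is not a full HNSW implementation (no layer hierarchy, no candidate list, no index construction), and the toy graph is hypothetical.

```python
def greedy_search(graph, vectors, query, entry):
    """Walk to the neighbour closest to the query until no neighbour improves."""
    def dist2(node):
        return sum((x - q) ** 2 for x, q in zip(vectors[node], query))

    current = entry
    while True:
        best = min(graph[current], key=dist2, default=current)
        if dist2(best) >= dist2(current):
            return current  # local minimum: no neighbour is closer
        current = best

# Hypothetical toy data: five points on a line, each linked to its immediate neighbours.
vectors = {0: [0.0], 1: [1.0], 2: [2.0], 3: [3.0], 4: [4.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

found = greedy_search(graph, vectors, query=[3.2], entry=0)
print(found)
```

The upper layers of a real index exist to give this walk long-range links, so it covers most of the distance in a few hops before refining on the dense bottom layer.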
IVF (Inverted File Index) — k-means cluster the vectors into nlist cells; query scans nprobe nearest cells. Smaller memory than HNSW but slower or lower recall.
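A toy version of the IVF idea, to make the nprobe trade-off concrete. The centroids here are fixed by hand rather than learned with k-means, and the data is hypothetical; the point is only that the true nearest neighbour can sit in a cell that a small nprobe never scans.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist2(vec, centroids[c]))
        cells[nearest].append(vec_id)
    return cells

def ivf_search(query, vectors, centroids, cells, nprobe=1, k=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: dist2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in cells[c]]
    return sorted(candidates, key=lambda i: dist2(query, vectors[i]))[:k]

# Hypothetical 2-dim data with nlist = 2 hand-picked centroids.
vectors = [[0.0, 0.0], [0.2, 0.1], [4.9, 5.0], [5.1, 5.2], [2.7, 2.7]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
cells = build_ivf(vectors, centroids)

query = [2.4, 2.4]
top_nprobe1 = ivf_search(query, vectors, centroids, cells, nprobe=1)
top_nprobe2 = ivf_search(query, vectors, centroids, cells, nprobe=2)
print(cells, top_nprobe1, top_nprobe2)
```

With nprobe=1 the search misses vector 4, the true nearest neighbour, because it was assigned to the other cell; raising nprobe recovers it at the cost of scanning more vectors.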
PQ (Product Quantisation) — split the vector into M subvectors and quantise each to one of 256 codebook entries, which fits in 1 byte. With 8-dim subvectors, a 1024-dim float32 vector (4096 B) becomes 1024/8 = 128 codes × 1 B = 128 B: 32× compression, lossy. Almost always combined as IVF+PQ at billion scale.
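The compression arithmetic and the encoding step can both be sketched briefly. The byte count assumes 8-dim subvectors and 8-bit codes, matching the 128 B figure; the tiny codebooks in the encoding example are hypothetical (real PQ learns 256 entries per subspace with k-means).

```python
def pq_code_bytes(dims, dims_per_subvector=8, bits_per_code=8):
    """Bytes per vector after product quantisation (codebook storage excluded)."""
    n_subvectors = dims // dims_per_subvector
    return n_subvectors * bits_per_code // 8

def pq_encode(vec, codebooks):
    """Replace each subvector with the index of its nearest codebook entry."""
    sub_len = len(vec) // len(codebooks)
    codes = []
    for m, book in enumerate(codebooks):
        sub = vec[m * sub_len:(m + 1) * sub_len]
        nearest = min(
            range(len(book)),
            key=lambda j: sum((s - c) ** 2 for s, c in zip(sub, book[j])),
        )
        codes.append(nearest)
    return codes

raw_bytes = 1024 * 4                 # 1024 float32 values
code_bytes = pq_code_bytes(1024)     # 128 subvectors, 1 byte each
ratio = raw_bytes / code_bytes

# Hypothetical toy codebooks: 2 subspaces of 2 dims, 2 entries each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.8, 0.1], codebooks)
print(raw_bytes, code_bytes, ratio, codes)
```

The stored vector is just the list of code indices; distances at query time are approximated from per-subspace lookup tables, which is where the lossiness comes from.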
ScaNN (Google) — anisotropic vector quantisation + asymmetric hashing. Among the strongest recall/throughput trade-offs in public benchmarks, but slightly trickier to tune.
6. Choosing an index
| Vectors | Latency | Recall | Memory | Pick |
|---|---|---|---|---|
| < 1 M | < 5 ms | >99% | OK | HNSW or flat |
| 1–100 M | < 50 ms | 95–98% | ~2× vectors | HNSW |
| 100 M – 1 B | < 100 ms | 90–95% | ~0.1× vectors | IVF + PQ |
| > 1 B | < 200 ms | 90–95% | small | ScaNN or IVF+PQ sharded |
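The table reads naturally as a lookup on corpus size. The function below is a literal transcription of its rows, not a sizing tool: how the exact boundaries (1 M, 100 M, 1 B) are assigned is an assumption, and latency, recall and memory budgets should still be checked against the table.

```python
def pick_index(n_vectors):
    """Return the table's suggestion for a corpus of n_vectors embeddings."""
    if n_vectors < 1_000_000:
        return "HNSW or flat"
    if n_vectors <= 100_000_000:       # boundary handling is an assumption
        return "HNSW"
    if n_vectors <= 1_000_000_000:
        return "IVF + PQ"
    return "ScaNN or IVF+PQ sharded"

for n in (500_000, 10_000_000, 1_000_000_000, 5_000_000_000):
    print(f"{n:>13,} vectors -> {pick_index(n)}")
```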
7. Vector databases — what to pick
| DB | Stand-out | Trade-off |
|---|---|---|
| Pinecone | Fully managed, serverless | $$$, no self-host |
| Weaviate | Open-source, hybrid search built-in | More moving parts |
| Qdrant | Rust, fast, payload filtering | Smaller ecosystem |
| Milvus / Zilliz | Scale to billions; cloud version | Complex ops if self-hosted |
| pgvector | "Just Postgres" — joins, ACID, ops you already know | Slower at scale (>10 M) |
| Vespa | Powerful filter + ranking | Steep learning curve |
| FAISS (lib, not DB) | Fastest in benchmarks, no built-in persistence | You build the service |
Default rec for greenfield: pgvector until you outgrow it. You almost never do.
8. Dimensionality — bigger is not better
Higher d → more capacity but more storage, slower compute, more curse-of-dimensionality. Practical sweet spots:
- General-purpose text: 768–1024 dims.
- Specialised (one language, one domain): 384 often enough.
- Multimodal / cross-encoder: 1024+ helps.
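text-embedding-3-large shipped a clever trick: Matryoshka embeddings. One 3072-dim vector can be truncated to 1024, 512 or 256 dims without retraining, as long as you re-normalise after truncating. Search with the short prefix and re-rank the top candidates with the longer one; indexing 512 instead of 3072 dims cuts index storage 6× with little quality loss.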
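Truncation itself is a two-step operation: keep a prefix, then rescale to unit length so dot product still behaves like cosine. A minimal sketch follows; the "embedding" is a deterministic stand-in rather than real model output, and the approach only preserves quality for models trained with a Matryoshka-style objective.

```python
import math

def truncate_and_renormalise(vec, dims):
    """Keep the first `dims` components, then rescale to unit length."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# Hypothetical stand-in for a 3072-dim embedding (not model output).
full = [math.sin(i + 1) for i in range(3072)]
short = truncate_and_renormalise(full, 512)

storage_ratio = len(full) / len(short)
short_norm = math.sqrt(sum(x * x for x in short))
print(len(short), storage_ratio, round(short_norm, 6))
```

Skipping the re-normalisation is a common slip: the truncated prefix is no longer unit length, so raw dot products stop being comparable across vectors.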
9. Failure modes
- Embedding drift — if your embedder version changes, every vector in the index is stale. Re-embed everything. Plan for this.
- Distribution mismatch — train embedder on Wikipedia, query with chat-style. Recall tanks. Use a domain-fine-tuned model.
- Numerical issues — fp16 OK for inference, fp32 for index storage; mixing can drop recall.
- Tokenisation gap — embedder truncates at 512 tokens; your "doc" is 50K. Chunk before embedding (session 9).
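For the tokenisation gap, a sliding window over the token sequence is the simplest fix. This sketch assumes the document is already tokenised; the 512-token limit comes from the bullet above, while the 64-token overlap is an arbitrary illustrative choice.

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split tokens into windows of at most max_len that share `overlap` tokens."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = list(range(50_000))  # hypothetical stand-in for a 50K-token document
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Each chunk is embedded separately and stored with a pointer back to its source document, so a hit on any chunk can surface the whole document.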
10. Beyond text
- Image — CLIP, OpenCLIP, SigLIP. Same contrastive recipe.
- User/item recsys — two-tower, in-batch negatives, billion-scale (session 27).
- Code — voyage-code-2, nv-embed-code. Code-specific embedders dominate StackOverflow-like retrieval.
- Audio — Whisper hidden states, CLAP for audio↔text.
The recipe transfers; the data changes.
Reading material
- Word2Vec paper
- SimCSE — Contrastive sentence embeddings
- Matryoshka Representation Learning
- FAISS docs
In-depth research material
Video reference
▶︎ Jay Alammar — The Illustrated Word2vec
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Find K Closest Points To Origin
- Link: https://leetcode.com/problems/find-k-closest-points-to-origin/
- Difficulty: Medium
- Why this problem: Heap of size k by negative distance; mirrors ANN candidate selection.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why cosine and dot are equivalent on unit-norm vectors.
- Describe InfoNCE loss and the role of temperature.
- Pick an ANN index for 1M, 10M, and 1B scale workloads.
- Explain Matryoshka embeddings in one paragraph.
- Plan an embedding-version migration without downtime.
- Solve find-k-closest-points-to-origin — heap of size k, the same top-k selection an ANN scan performs on its candidates.
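For comparison once you have made your own attempt, here is one possible sketch of the size-k heap approach. Python's `heapq` is a min-heap, so storing negative squared distances keeps the farthest retained point at the root, ready to be evicted. The function name and return format are illustrative choices.

```python
import heapq

def k_closest(points, k):
    """Keep the k points nearest the origin using a heap of size k."""
    heap = []  # entries are (-squared_distance, x, y); root is the farthest kept point
    for x, y in points:
        entry = (-(x * x + y * y), x, y)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:  # closer than the current farthest kept point
            heapq.heapreplace(heap, entry)
    return [[x, y] for _, x, y in heap]

result = k_closest([[3, 3], [5, -1], [-2, 4]], k=2)
print(sorted(result))
```

This runs in O(n log k) time and O(k) memory, the same shape of top-k bookkeeping used when an index scans a candidate list.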