ai ml · intermediate · 12m · 2026-06-09

Embeddings, Vector Spaces, Contrastive Learning

Session 17 of the 48-session learning series.

Date: Mon, 2026-06-22 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 08 · Est. read: 2 h

Why this session matters

This is Session 17 of 48 in the LLMs & Agents track. Like the rest of the series, it covers one focused topic, paced so you have time to absorb it rather than rush.

Agenda

What is an embedding — geometry of meaning
Word2vec → Sentence-BERT → modern instruction-tuned embeddings
Contrastive learning — InfoNCE, in-batch negatives, hard negatives
Similarity metrics — cosine vs dot vs euclidean; when each makes sense
ANN indexes — HNSW, IVF, PQ, ScaNN; choosing recall vs latency vs memory

Pre-read (skim before the session)

1. What is an embedding

An embedding maps a piece of content (word, sentence, image, user) to a fixed-length vector. The trick: train so that meaning maps to geometry. Similar things → similar vectors. Then you can do retrieval, clustering, classification, recommendation as nearest-neighbour search.

king        ≈ [+0.42, -0.18, ..., +0.07]
queen       ≈ [+0.41, -0.20, ..., +0.09]   ← close to king
banana      ≈ [-0.32, +0.55, ..., -0.91]   ← far from king

cos(king, queen)  ≈ 0.92
cos(king, banana) ≈ 0.05

The Word2vec classic — king - man + woman ≈ queen — was a striking demonstration that semantic relationships became linear directions in the embedding space.
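A minimal sketch of the geometry, assuming hand-made 2-D vectors (axis 0 is "royalty", axis 1 is "gender"). The values are illustrative only and do not come from any trained model; `analogy` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption): [royalty, gender].
toy = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "banana": [-1.0, 0.0],
}

def analogy(a, b, c, vocab):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))

result = analogy("king", "man", "woman", toy)
print(result)
```

In real embedding spaces the arithmetic only lands near the expected word rather than exactly on it, which is why the input words are usually excluded from the candidates.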

2. The progression

Word2vec (2013): predict the centre word from its surrounding words (CBOW) or the surrounding words from the centre word (skip-gram). The widely used pretrained vectors were trained on Google News; one vector per word.
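To make the window idea concrete, here is a small sketch of how a sliding window turns text into (centre, context) pairs. These pairs are the raw training material; which side is the input and which is the target depends on the variant. The function name and the default window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for each token and every neighbour within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

sentence = "the king rules the land".split()
pairs = skipgram_pairs(sentence, window=1)
for centre, context in pairs:
    print(centre, "->", context)
```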

GloVe (2014): factorise the co-occurrence matrix. Same regime, different objective.

Sentence-BERT (2019): take BERT and fine-tune it with a siamese architecture on labelled sentence pairs (NLI, STS). One vector per sentence.

SimCSE (2021): contrastive learning with dropout as augmentation. Hugely improved sentence embeddings without needing labelled pairs.

Modern (2024+): text-embedding-3-large (OpenAI), voyage-3, bge-m3, nv-embed-v2. These are often instruction-tuned and multilingual, and trained on hundreds of millions of contrastive pairs. Top models on the MTEB leaderboard average near 70; MTEB reports a mean over task-specific metrics, not a single mAP.

3. Contrastive learning — the modern recipe

You want anchor and positive close; anchor and negatives far. The InfoNCE loss:

L = -log [ exp(sim(a, p) / τ)  /  ( exp(sim(a, p) / τ) + Σ_n exp(sim(a, n) / τ) ) ]

Where τ (temperature, ~0.05–0.1) sharpens or softens the distribution. Lower τ → harder negatives matter more.
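A sketch of the loss for a single anchor, assuming the common convention in which the denominator covers the positive plus all negatives. The similarity values are made up; `info_nce` is a hypothetical helper, not a library function.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.07):
    """InfoNCE for one anchor: negative log-softmax of the positive over all candidates."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denom - logits[0]

sim_pos = 0.8
hard, easy = 0.7, 0.1  # one hard negative, one easy negative (made-up values)

for tau in (1.0, 0.05):
    with_easy = info_nce(sim_pos, [hard, easy], tau)
    without_easy = info_nce(sim_pos, [hard], tau)
    # How much of the loss comes from the easy negative at this temperature.
    print(tau, with_easy - without_easy)
```

At τ = 1.0 the easy negative still adds a noticeable share of the loss; at τ = 0.05 its share is close to zero, so the hard negative drives almost all of the gradient.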

The cleverest practical trick: in-batch negatives. Stack a batch of (anchor, positive) pairs; treat every other example's positive as a negative. Batch of 256 → 255 free negatives per anchor.
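The in-batch trick can be sketched as a cross-entropy over a batch similarity matrix where the correct "class" for row i is column i. This is a plain-Python illustration with toy unit vectors; a real training loop would use a tensor library, and the function name is an assumption.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_info_nce(anchors, positives, tau=0.05):
    """Mean InfoNCE over a batch: row i's positive is column i, all other columns are negatives."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / tau for p in positives]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 3 unit vectors (assumption): each anchor equals its own positive.
e = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = in_batch_info_nce(e, e)
# Rotate the positives so every anchor is paired with the wrong one.
shuffled = in_batch_info_nce(e, e[1:] + e[:1])
print(aligned, shuffled)
```

With a batch of B pairs each anchor is scored against B candidates, so B - 1 negatives come at no extra encoding cost.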

Hard negative mining — random negatives are too easy. Pull negatives that are semantically close but wrong (BM25 hits that didn't match the gold answer, sibling categories, paraphrases of wrong answers). Boosts margin and quality.
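One way to mine hard negatives, sketched under the assumption that a first-stage retriever (BM25 or anything else) has already returned scored candidates. The document ids and scores below are hypothetical.

```python
def mine_hard_negatives(scored_candidates, gold_ids, k=2):
    """Return the k highest-scoring candidates that are not gold answers.

    scored_candidates: list of (doc_id, score) pairs from any first-stage retriever.
    gold_ids: set of doc ids labelled as correct for this query.
    """
    non_gold = [(doc_id, score) for doc_id, score in scored_candidates
                if doc_id not in gold_ids]
    non_gold.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in non_gold[:k]]

# Hypothetical retriever output for one query.
candidates = [("d1", 12.3), ("d7", 11.8), ("d3", 9.1), ("d9", 2.0)]
hard_negatives = mine_hard_negatives(candidates, gold_ids={"d1"}, k=2)
print(hard_negatives)
```

A caveat worth keeping in mind: the highest-ranked non-gold hits are sometimes unlabelled correct answers, so some pipelines skip the very top ranks or filter candidates with a stronger model.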

4. Similarity metrics

Metric	Formula	Use when
Cosine	`(a · b) / (‖a‖‖b‖)`	Magnitudes vary and only direction should count; the usual default for text
Dot product	`a · b`	Embeddings already unit-normalised (then identical to cosine); inner-product search is widely supported, e.g. in FAISS
Euclidean	`‖a - b‖`	When magnitude matters (rare for text)

Unless the model card says otherwise, L2-normalise embeddings to unit length once (at index time and at query time), then use dot product (= cosine). This avoids recomputing two norms and a division on every comparison.
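A quick numerical check of the equivalence, using two arbitrary example vectors. It also shows the related identity for unit vectors, ‖a - b‖² = 2 - 2·cos(a, b), which means all three metrics rank neighbours identically once vectors are normalised.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]   # norm 5
b = [1.0, 2.0, 2.0]   # norm 3
a_hat, b_hat = l2_normalise(a), l2_normalise(b)

cos_ab = cosine(a, b)            # 11 / 15
dot_hat = dot(a_hat, b_hat)      # same value, no norms needed at query time
sq_dist = sum((x - y) ** 2 for x, y in zip(a_hat, b_hat))

print(cos_ab, dot_hat, sq_dist, 2 - 2 * cos_ab)
```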

5. ANN indexes — exact search doesn't scale

100 M float32 vectors at 1024 dims is ~400 GB (100 M × 1024 × 4 B), and a linear scan over them takes seconds per query. Approximate Nearest Neighbour (ANN) indexes trade a little recall for speed:

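HNSW (Hierarchical Navigable Small World) — graph-based. Each node links to a bounded number of neighbours (parameter M) on each layer, and the number of layers grows roughly with log N. A query descends greedily from the sparse top layer to the dense bottom layer. Recall of 95–99% at sub-millisecond to few-millisecond latency is typical for in-memory indexes. Memory: raw vectors plus graph links; budget up to ~2× raw vectors. The default for most workloads.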

IVF (Inverted File Index) — k-means cluster the vectors into nlist cells; query scans nprobe nearest cells. Smaller memory than HNSW but slower or lower recall.
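A toy sketch of the IVF idea, assuming the centroids have already been produced by k-means (here they are hand-picked). It shows how `nprobe` trades recall for work: the true nearest neighbour sits just across a cell boundary, so probing one cell misses it. All names and data are illustrative.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        cells[nearest].append(vid)
    return cells

def ivf_search(query, vectors, centroids, cells, nprobe=1, k=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in cells[c]]
    candidates.sort(key=lambda vid: sq_dist(query, vectors[vid]))
    return candidates[:k]

# Hand-picked centroids stand in for k-means output (assumption).
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[1.0, 0.0], [0.0, 2.0], [4.9, 4.9], [9.0, 9.0], [10.0, 11.0]]
cells = build_ivf(vectors, centroids)

query = [5.5, 5.5]
top_nprobe1 = ivf_search(query, vectors, centroids, cells, nprobe=1)
top_nprobe2 = ivf_search(query, vectors, centroids, cells, nprobe=2)
print(cells, top_nprobe1, top_nprobe2)
```

Raising `nprobe` recovers the missed neighbour at the cost of scanning more vectors, which is the recall versus latency knob described above.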

PQ (Product Quantisation) — split the vector into M subvectors and quantise each to one of 256 codebook entries (1 byte per subvector). With 8-dim subvectors, a 1024-dim float32 vector (4096 B) becomes 1024/8 = 128 subvectors × 1 B = 128 B: 32× compression, lossy. Almost always combined as IVF+PQ for billion-scale.
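A toy sketch of PQ encoding and the byte arithmetic, assuming tiny hand-made codebooks (2 subvectors with 2 centroids each). Real codebooks are learned with k-means and usually hold 256 centroids per subvector; the function names are hypothetical.

```python
def pq_encode(vector, codebooks):
    """Encode a vector as one centroid index per subvector.

    codebooks[m] is the list of centroids for subvector m; all subvectors have equal length.
    """
    sub_len = len(vector) // len(codebooks)
    codes = []
    for m, book in enumerate(codebooks):
        sub = vector[m * sub_len:(m + 1) * sub_len]
        best = min(range(len(book)),
                   key=lambda j: sum((x - c) ** 2 for x, c in zip(sub, book[j])))
        codes.append(best)
    return codes

def pq_decode(codes, codebooks):
    """Rebuild the lossy approximation by concatenating the chosen centroids."""
    out = []
    for m, j in enumerate(codes):
        out.extend(codebooks[m][j])
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # centroids for dims 0-1
    [[0.0, 1.0], [1.0, 0.0]],  # centroids for dims 2-3
]
vec = [0.9, 1.1, 0.2, 0.8]
codes = pq_encode(vec, codebooks)
approx = pq_decode(codes, codebooks)

# Byte arithmetic from the text: 1024 float32 dims, 8 dims per subvector, 1 byte per code.
raw_bytes = 1024 * 4
pq_bytes = 1024 // 8
print(codes, approx, raw_bytes, pq_bytes, raw_bytes // pq_bytes)
```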

ScaNN (Google) — anisotropic quantisation loss + asymmetric hashing. Among the best recall/throughput trade-offs in public benchmarks, slightly trickier to tune.

6. Choosing an index

Vectors	Latency	Recall	Memory	Pick
< 1 M	< 5 ms	>99%	OK	HNSW or flat
1–100 M	< 50 ms	95–98%	~2× vectors	HNSW
100 M – 1 B	< 100 ms	90–95%	~0.1× vectors	IVF + PQ
> 1 B	< 200 ms	90–95%	small	ScaNN or IVF+PQ sharded
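The recall figures in a table like this are measured against exact (brute-force) search on the same data. A minimal sketch of recall@k, with made-up result lists standing in for real index output:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def mean_recall_at_k(approx_lists, exact_lists, k):
    """Average recall@k over a set of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_lists, exact_lists)]
    return sum(scores) / len(scores)

# Hypothetical results for two queries: ids returned by exact and approximate search.
exact = [[4, 9, 1, 7, 3], [2, 5, 6, 0, 8]]
approx = [[4, 1, 9, 8, 3], [2, 5, 6, 0, 8]]

per_query = [recall_at_k(a, e, 5) for a, e in zip(approx, exact)]
overall = mean_recall_at_k(approx, exact, 5)
print(per_query, overall)
```

Measuring this on a sample of your own queries is a more reliable guide than published numbers, since recall depends on the data distribution and the index parameters.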

7. Vector databases — what to pick

DB	Stand-out	Trade-off
Pinecone	Fully managed, serverless	$$$, no self-host
Weaviate	Open-source, hybrid search built-in	More moving parts
Qdrant	Rust, fast, payload filtering	Smaller ecosystem
Milvus / Zilliz	Scale to billions; cloud version	Complex ops if self-hosted
pgvector	"Just Postgres" — joins, ACID, ops you already know	Slower at scale (>10 M)
Vespa	Powerful filter + ranking	Steep learning curve
FAISS (lib, not DB)	Fastest in benchmarks, no built-in persistence	You build the service

Default rec for greenfield: pgvector until you outgrow it. You almost never do.

8. Dimensionality — bigger is not better

Higher d → more capacity but more storage, slower compute, more curse-of-dimensionality. Practical sweet spots:

General-purpose text: 768–1024 dims.
Specialised (one language, one domain): 384 often enough.
Multimodal / multilingual: 1024+ helps.

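text-embedding-3-large supports a clever trick: Matryoshka embeddings. The model is trained so that a prefix of its 3072-dim vector is itself a usable embedding, so you can truncate to 1024, 512, or 256 dims without retraining (re-normalise after truncating). A common pattern is to search with the short prefix and re-rank the top candidates with the full vector kept in cheaper storage. Truncating 3072 → 512 dims cuts index storage 6×, usually with only a small quality loss.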
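A sketch of the truncation step, assuming the vector comes from a model trained with a Matryoshka-style objective (truncating an ordinary embedding this way generally does not work). The 8-dim input is a stand-in for a real 3072-dim vector.

```python
import math

def truncate_and_renormalise(vec, dims):
    """Keep the first `dims` components and rescale to unit length,
    so that dot product on the shortened vectors is still cosine."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# Stand-in for a full-length embedding (made-up values).
full = [0.5, -0.3, 0.4, 0.1, 0.2, -0.6, 0.05, 0.3]
short = truncate_and_renormalise(full, 4)
short_norm = math.sqrt(sum(x * x for x in short))

# Storage arithmetic for float32 vectors: 3072 dims vs a 512-dim prefix.
full_bytes = 3072 * 4
short_bytes = 512 * 4
print(len(short), short_norm, full_bytes // short_bytes)
```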

9. Failure modes

Embedding drift — if your embedder version changes, every vector in the index is stale. Re-embed everything. Plan for this.
Distribution mismatch — train embedder on Wikipedia, query with chat-style. Recall tanks. Use a domain-fine-tuned model.
Numerical issues — fp16 OK for inference, fp32 for index storage; mixing can drop recall.
Tokenisation gap — embedder truncates at 512 tokens; your "doc" is 50K. Chunk before embedding (session 9).
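For the tokenisation gap, a minimal chunking sketch: split a token sequence into windows that fit the embedder's limit, with some overlap so that text near a boundary appears in two chunks. The 64-token overlap is an assumption, and `tokens` can be any list; in practice it should come from the embedder's own tokeniser.

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split tokens into windows of at most max_len, each sharing `overlap` tokens with the previous one."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Integers stand in for token ids of a 1000-token document.
doc = list(range(1000))
chunks = chunk_tokens(doc, max_len=512, overlap=64)
print([(c[0], c[-1], len(c)) for c in chunks])
```

Each chunk is then embedded and indexed separately, usually with a pointer back to the source document.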

10. Beyond text

Image — CLIP, OpenCLIP, SigLIP. Same contrastive recipe.
User/item recsys — two-tower, in-batch negatives, billion-scale (session 27).
Code — voyage-code-2, nv-embed-code. Code-specific embedders dominate StackOverflow-like retrieval.
Audio — Whisper hidden states, CLAP for audio↔text.

The recipe transfers; the data changes.

Link: https://leetcode.com/problems/find-k-closest-points-to-origin/
Difficulty: Medium
Why this problem: Heap of size k by negative distance; mirrors ANN candidate selection.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why cosine and dot are equivalent on unit-norm vectors.
Describe InfoNCE loss and the role of temperature.
Pick an ANN index for 1M, 10M, and 1B scale workloads.
Explain Matryoshka embeddings in one paragraph.
Plan an embedding-version migration without downtime.
Solve find-k-closest-points-to-origin — heap of size k, exactly the ANN scan kernel.

