Day 08 — Embeddings, Vector Spaces & Contrastive Learning
Embeddings power search, RAG, recsys, clustering, deduplication and anomaly detection. Understanding *why* a contrastive objective produces useful vectors (vs s…
An embedding is a learned map f: X → R^d such that semantic similarity in X corresponds to proximity in R^d (usually measured by cosine similarity). The same idea powers BERT sentence embeddings, CLIP image-text models, code embeddings, and recsys two-tower models.
🧠 Concept
Why it matters & the mental model.
1. Why vector spaces work
Two implicit assumptions:
- The model has been trained so that related inputs cluster and unrelated inputs push apart.
- The downstream metric (cosine or dot product) matches the training objective. A model trained with cosine should be queried with cosine; one trained with Euclidean distance (rare today) should be queried with Euclidean distance, not cosine.
2. From classification to contrastive
A naive way to get vectors is to take the penultimate layer of a classifier. It works (ImageNet features) but is suboptimal because the loss never explicitly forces similar things close and dissimilar things far. Contrastive learning does exactly that:
- For each anchor x, pick a positive x⁺ (semantically similar) and a set of negatives {x⁻}.
- Loss (InfoNCE): L = -log( exp(sim(f(x), f(x⁺))/τ) / Σ_j exp(sim(f(x), f(x_j))/τ) ), where x_j ranges over the positive and all negatives.
- Temperature τ controls sharpness; typical values are 0.05-0.1.
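A minimal sketch of the InfoNCE loss for a single anchor, assuming cosine similarity for sim and toy 2-d vectors in place of real encoder outputs. The function names and the default τ = 0.07 are illustrative choices, not taken from any particular library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE for one anchor; the denominator covers the positive and all negatives."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy vectors (assumed values, not real model outputs).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce(anchor, positive, negatives)
# Swap roles so the "positive" is actually far from the anchor.
loss_bad = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

The loss is near zero when the positive is the closest candidate and grows when a negative outranks it. A lower τ sharpens the softmax, so a well-aligned pair is rewarded more strongly.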
3. In-batch negatives — free training signal
Compute embeddings for a batch of (q, d⁺) pairs. For each q, the other d⁺ in the batch serve as negatives. This is why batch size matters: at B=256 each example sees 255 negatives. Memory banks and MoCo-style queues (a queue of past keys encoded by a momentum encoder) push the effective negative count to 65k+.
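The in-batch trick can be sketched as a cross-entropy over the B × B similarity matrix with the diagonal as the target. This is a pure-Python illustration that assumes the vectors are already L2-normalised (so dot product equals cosine); the toy batch and τ value are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_loss(queries, docs, tau=0.05):
    """Mean InfoNCE over a batch where docs[i] is the positive for queries[i].

    Every other document in the batch acts as a negative for query i.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / tau for d in docs]  # row i of the B x B matrix
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with the diagonal as target
    return total / len(queries)

# Toy batch of three unit vectors (assumed values).
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

aligned = in_batch_loss(queries, docs)
shuffled = in_batch_loss(queries, docs[1:] + docs[:1])  # positives misaligned
negatives_per_query = len(docs) - 1
```

With B pairs, each query gets B - 1 negatives at no extra encoding cost, which is where the 255 negatives at B=256 come from.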
4. Hard negative mining
In-batch negatives become too easy after a while, because they are mostly random unrelated items. Mine hard negatives instead: items the current model ranks highly but that aren't actually relevant. This is often the single biggest quality lever on production retrievers. Re-mine periodically as the model improves (curriculum).
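A minimal sketch of the mining step, assuming document vectors from the current model are held in a dict and relevance labels are available as a set of ids. The helper name `mine_hard_negatives` and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mine_hard_negatives(query_vec, doc_vecs, relevant_ids, k=2):
    """Return ids of the k highest-scoring docs that are NOT labelled relevant."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return [doc_id for doc_id in ranked if doc_id not in relevant_ids][:k]

# Toy embeddings produced by the "current model" (assumed values).
doc_vecs = {
    "d_relevant": [0.9, 0.1],
    "d_near_miss": [0.8, 0.3],
    "d_unrelated": [-0.7, 0.2],
    "d_borderline": [0.5, 0.5],
}
hard = mine_hard_negatives([1.0, 0.0], doc_vecs, relevant_ids={"d_relevant"}, k=2)
```

One caveat: with incomplete labels, some top-ranked "negatives" may really be unlabelled positives, so mined sets are usually worth spot-checking or filtering.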
🛠 Deep Dive
Internals, code, architecture.
5. Two-tower architecture
- The query encoder and document encoder can share weights or be separate models.
- At inference, encode all documents once → store in vector index. Encode each query online → ANN search.
- Cheap to scale to billions of docs. Used in large-scale retrieval and recommendation at companies such as Google, Pinterest, YouTube and TikTok.
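The offline/online split can be sketched as follows. The `encode` function is a stand-in for a trained tower (a normalised bag-of-words over an assumed toy vocabulary, shared by both towers), and the brute-force scan stands in for an ANN index.

```python
import math

VOCAB = ["red", "shoes", "blue", "jacket", "running"]  # toy vocabulary (assumption)

def encode(text):
    """Stand-in for a trained tower: L2-normalised bag-of-words counts."""
    tokens = text.lower().split()
    vec = [float(tokens.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Offline: encode every document once and store the vectors.
docs = {"p1": "red running shoes", "p2": "blue jacket", "p3": "red jacket"}
index = {doc_id: encode(text) for doc_id, text in docs.items()}

def search(query, top_k=2):
    """Online: encode the query, then score it against the stored vectors."""
    q = encode(query)
    scored = [(sum(a * b for a, b in zip(q, v)), doc_id) for doc_id, v in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]

results = search("red shoes")
```

Because documents are encoded independently of the query, the expensive work happens once at indexing time and each query costs only one encoder call plus a nearest-neighbour lookup.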
6. Multimodal — CLIP & friends
CLIP trains a text encoder and an image encoder jointly with contrastive loss on 400M (image, caption) pairs. Result: a shared space where f_text("a dog on a beach") ≈ f_image(<that image>). Drives zero-shot classification, retrieval, and modern T2I/T2V models. Same recipe scaled to text-audio (CLAP), text-video, text-code.
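Zero-shot classification in a shared space can be sketched as a softmax over similarities between one image vector and one text vector per class prompt. The vectors below are made-up 3-d stand-ins for encoder outputs, and `scale` plays the role of 1/τ (its value here is an assumption).

```python
import math

def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_vec, prompt_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between an image and class prompts."""
    img = normalise(image_vec)
    logits = {
        label: scale * sum(a * b for a, b in zip(img, normalise(vec)))
        for label, vec in prompt_vecs.items()
    }
    m = max(logits.values())
    exps = {label: math.exp(l - m) for label, l in logits.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

# Made-up embeddings; a real system would obtain these from the two encoders.
prompt_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.2],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a photo of a car": [0.1, 0.2, 0.9],
}
image_vec = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_vec, prompt_vecs)
best = max(probs, key=probs.get)
```

No classifier head is trained: adding a class only requires writing a new prompt and encoding it.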
7. Normalisation, dimensionality, distance
- L2-normalise embeddings before storing → cosine ≡ dot product, ANN libraries optimise dot.
- Higher d = more capacity but more storage and slower ANN. 384-1536 is a common sweet spot.
- For huge corpora, train Matryoshka embeddings (OpenAI v3): one model produces vectors usable at multiple truncations (e.g. 256/512/1024) for a quality/cost trade-off at query time.
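A short sketch of the two operations above: L2-normalising so that dot product equals cosine, and truncating a vector to its first dimensions followed by re-normalisation. Truncation only preserves quality when the model was trained Matryoshka-style; the vectors here are arbitrary toy values.

```python
import math

def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def truncate(v, dim):
    """Keep the first `dim` coordinates and re-normalise.

    Only meaningful for models trained so that prefixes are usable embeddings.
    """
    return l2_normalise(v[:dim])

a = [3.0, 4.0, 1.0, 0.5]
b = [1.0, 2.0, -1.0, 0.2]

cos_ab = cosine(a, b)
dot_normalised = dot(l2_normalise(a), l2_normalise(b))  # same value as cos_ab
a_short = truncate(a, 2)  # a unit vector in 2 dimensions
```

Normalising once at write time means the index can use plain dot product at query time without changing the ranking that cosine would give.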
8. ANN — making it serve fast
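Exact nearest neighbour search costs O(Nd) per query. At a billion docs, use:
- HNSW (hierarchical navigable small world): graph-based, typically ~95-99% recall at millisecond-scale latency when tuned. The default index in Qdrant and Weaviate, and available in FAISS and pgvector.
- IVF + PQ: IVF partitions vectors into coarse clusters so only a few are scanned per query; PQ compresses each vector into short codebook ids for massive memory savings.
- ScaNN, DiskANN: Google / Microsoft approaches aimed at billion-scale corpora (DiskANN keeps most of the index on SSD).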
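To make the exact-versus-approximate trade-off concrete, here is a toy inverted-file (IVF) sketch in pure Python. It assumes fixed, hand-picked centroids instead of k-means training and omits PQ compression entirely; production libraries implement both far more efficiently.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def exact_search(query, vectors):
    """Brute-force scan over all N vectors: O(N * d) per query."""
    return min(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))

def build_ivf(vectors, centroids):
    """Assign every vector to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    if not candidates:
        return None, 0
    best = min(candidates, key=lambda i: sq_dist(query, vectors[i]))
    return best, len(candidates)

rng = random.Random(0)
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]  # assumed, not learned
vectors = [[c[0] + rng.gauss(0, 1), c[1] + rng.gauss(0, 1)] for c in centroids for _ in range(50)]
lists = build_ivf(vectors, centroids)

query = [9.5, 0.5]
exact_id = exact_search(query, vectors)
approx_id, scanned = ivf_search(query, vectors, centroids, lists, nprobe=1)
full_id, _ = ivf_search(query, vectors, centroids, lists, nprobe=len(centroids))
```

Raising `nprobe` scans more lists, trading latency for recall; probing every list reduces to the exact scan.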
🚀 In Practice
Trade-offs, exercises, what to ship today.
9. Evaluation
- Retrieval: Recall@k, MRR, nDCG on a labelled (q, relevant_docs) set.
- Clustering: Silhouette, purity vs a gold labelling.
- Probing: train a linear classifier on top — measures how separable concepts are in the space.
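The retrieval metrics above can be sketched for a single query as follows, assuming binary relevance labels. MRR is then the mean of `reciprocal_rank` over all queries. The ranked list and labels are toy data.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

ranked = ["d3", "d1", "d7", "d2"]  # system output for one query
relevant = {"d1", "d2"}            # gold labels for that query

r2 = recall_at_k(ranked, relevant, 2)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, 4)
```

Recall@k ignores ordering within the top k, while MRR and nDCG reward placing relevant documents earlier.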
10. Failure modes
- Domain shift: a model trained on web text underperforms on legal/medical without fine-tuning.
- Anisotropy: vectors collapse into a narrow cone → cosine becomes uninformative. Mitigate with whitening or contrastive re-training.
- Stale embeddings: change the model and forget to re-index → silent quality regression. Always tag stored vectors with model_version.
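One rough diagnostic for anisotropy is the mean pairwise cosine over a sample of embeddings: values near 1 suggest a narrow cone. The sketch below also applies mean-centring, which is only the first step of whitening (full whitening additionally decorrelates and rescales using the covariance). The vectors are toy values chosen to share a large common offset.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_cosine(vectors):
    """Average cosine over all unordered pairs; near 1.0 indicates a narrow cone."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def centre(vectors):
    """Subtract the mean vector (a simplification of whitening)."""
    d = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]
    return [[v[i] - mean[i] for i in range(d)] for v in vectors]

# Toy embeddings dominated by a shared offset around (10, 10, 0).
vectors = [
    [10.0, 10.0, 1.0],
    [10.0, 10.0, -1.0],
    [10.5, 9.5, 0.0],
    [9.5, 10.5, 0.0],
]

before = mean_pairwise_cosine(vectors)
after = mean_pairwise_cosine(centre(vectors))
```

Before centring every pair looks almost identical under cosine; after removing the shared offset the differences between vectors become visible again.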
11. Practical recipes
- From-scratch retriever (fine-tuning a pretrained SBERT base): SBERT base + MultipleNegativesRankingLoss + 100k (q, d⁺) pairs + 1-2 epochs → roughly +20 pts recall vs zero-shot, depending on the domain.
- Cold start: use a strong off-the-shelf model (bge-m3) and add a cross-encoder reranker rather than fine-tuning the bi-encoder. Cheaper, faster wins.
- Multi-tenant: shard by tenant in the index, normalise per-tenant.
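The retrieve-then-rerank pattern from the cold-start recipe can be sketched like this. The bi-encoder vectors are toy values, and `toy_cross_score` is a hypothetical stand-in for a real cross-encoder (it just measures token overlap), so only the control flow should be read as representative.

```python
def retrieve(query_vec, index, top_k):
    """Stage 1: cheap bi-encoder scoring (dot product) over stored vectors."""
    scored = sorted(
        index.items(),
        key=lambda item: sum(a * b for a, b in zip(query_vec, item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]

def rerank(query, candidates, texts, cross_score):
    """Stage 2: re-score only the shortlisted candidates with a costlier model."""
    return sorted(candidates, key=lambda doc_id: cross_score(query, texts[doc_id]), reverse=True)

def toy_cross_score(query, text):
    """Hypothetical cross-encoder stand-in: fraction of query tokens found in the text."""
    q_tokens = query.lower().split()
    t_tokens = set(text.lower().split())
    return sum(token in t_tokens for token in q_tokens) / len(q_tokens)

texts = {
    "a": "waterproof hiking boots",
    "b": "hiking socks wool",
    "c": "leather office shoes",
}
index = {"a": [0.7, 0.7], "b": [0.9, 0.3], "c": [0.1, 0.9]}  # toy bi-encoder vectors

candidates = retrieve([1.0, 0.2], index, top_k=2)
final = rerank("waterproof hiking boots", candidates, texts, toy_cross_score)
```

The reranker sees query and document together, so it can correct ordering mistakes from the first stage, but it only runs on the top-k shortlist to keep latency bounded.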
12. What to take away
"How would you build product search?" Strong answers: two-tower retriever with in-batch + hard negatives, ANN index (HNSW), cross-encoder reranker, A/B harness with CTR / purchase as ground truth. Bonus: mention cold-start with BM25 hybrid.
Resources
- 🎥 Cohere — A Visual Guide to Embeddings
- 📖 SBERT — Sentence Transformers tutorial
- 📖 OpenAI — text-embedding-3 announcement
- 📖 CLIP paper — Radford et al.
Practice Problem: Maximum Subarray (Easy)