ai-ml · advanced · 12m · 2026-06-06

Day 08 — Embeddings, Vector Spaces & Contrastive Learning

Embeddings power search, RAG, recsys, clustering, deduplication and anomaly detection. Understanding *why* a contrastive objective produces useful vectors (vs s…

An embedding is a learned map f: X → R^d such that semantic similarity in X corresponds to proximity in R^d (usually measured by cosine similarity). The same idea powers BERT-style sentence embeddings, CLIP image-text models, code embeddings, and recsys two-tower models.

🧠 Concept

Why it matters & the mental model.

1. Why vector spaces work

Two implicit assumptions:

The model has been trained so that related inputs cluster and unrelated inputs push apart.
The downstream metric (cosine or dot product) matches the training objective. A model trained with cosine similarity should be queried with cosine similarity; one trained with Euclidean distance (rare today) should be queried with Euclidean distance, not cosine.
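
To make the second assumption concrete, here is a minimal sketch with made-up 2-d vectors (the names `query`, `doc_aligned` and `doc_long` are illustrative, not from any real model). It shows that cosine and raw dot product can rank the same two documents differently when the vectors are not normalised.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 1.0])
doc_aligned = np.array([0.5, 0.5])  # same direction as the query, small norm
doc_long = np.array([3.0, 0.0])     # 45 degrees off the query, large norm
docs = (doc_aligned, doc_long)

cos_scores = [cosine(query, d) for d in docs]
dot_scores = [float(query @ d) for d in docs]

best_by_cosine = int(np.argmax(cos_scores))  # index of the winner under cosine
best_by_dot = int(np.argmax(dot_scores))     # index of the winner under dot product
print(best_by_cosine, best_by_dot)
```

Cosine prefers the document pointing in the same direction, while the unnormalised dot product rewards the longer vector, which is why the query-time metric should match the training metric.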

2. From classification to contrastive

A naive way to get vectors is to take the penultimate layer of a classifier. It works (ImageNet features) but is suboptimal because the loss never explicitly forces similar things close and dissimilar things far. Contrastive learning does exactly that:

For each anchor x, pick a positive x⁺ (semantically similar) and a set of negatives {x⁻}.
Loss: -log( exp(sim(f(x), f(x⁺))/τ) / Σ_j exp(sim(f(x), f(x_j))/τ) ) (InfoNCE), where the sum over j runs over the positive and all negatives.
Temperature τ controls sharpness; typical 0.05-0.1.
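
The loss above can be sketched in a few lines of NumPy. This is an illustration for a single anchor, assuming all vectors are already L2-normalised so that sim is cosine; the function name `info_nce` and the toy vectors are hypothetical.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE for one anchor. Inputs are assumed L2-normalised."""
    # Row 0 is the positive; the remaining rows are the negatives.
    candidates = np.vstack([positive[None, :], negatives])
    logits = candidates @ anchor / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)

anchor = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])
positive = positive / np.linalg.norm(positive)
negatives = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]])

loss_sharp = info_nce(anchor, positive, negatives, tau=0.05)
loss_soft = info_nce(anchor, positive, negatives, tau=1.0)
print(loss_sharp, loss_soft)
```

With the positive already ranked first, the low temperature drives the loss close to zero, while τ = 1.0 leaves a much larger loss on the same vectors. That is the "sharpness" effect of τ.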

3. In-batch negatives — free training signal

Compute embeddings for a batch of (q, d⁺) pairs. For each q, the other d⁺ in the batch serve as negatives. This is why batch size matters: at B=256 each example sees 255 negatives. Memory banks and MoCo's momentum-encoder queue decouple the negative count from the batch size and push the effective number of negatives to 65k+.
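
A minimal NumPy sketch of the in-batch trick, assuming L2-normalised query and document matrices where row i of `d` is the positive for row i of `q` (the function name and the toy identity-matrix "embeddings" are illustrative). The B×B similarity matrix is treated as B classification problems whose correct class sits on the diagonal.

```python
import numpy as np

def in_batch_contrastive_loss(q, d, tau=0.05):
    """q, d: (B, dim) L2-normalised; d[i] is the positive for q[i]."""
    logits = (q @ d.T) / tau                             # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positives; every other column is a negative.
    return float(-np.mean(np.diag(log_softmax)))

q = np.eye(4)                              # four toy queries
d_matched = np.eye(4)                      # positives aligned with their queries
d_shuffled = np.roll(np.eye(4), 1, axis=0) # positives paired with the wrong query

loss_matched = in_batch_contrastive_loss(q, d_matched)
loss_shuffled = in_batch_contrastive_loss(q, d_shuffled)
negatives_per_query = q.shape[0] - 1
print(loss_matched, loss_shuffled, negatives_per_query)
```

The loss is near zero when each query is closest to its own document and large when the pairing is wrong; each query gets B-1 negatives without any extra encoder passes.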

4. Hard negative mining

In-batch negatives become too easy after a while, because they are mostly random unrelated items. Mine hard negatives instead: items the current model ranks high but that are not actually relevant. This is often one of the biggest quality levers for production retrievers. Re-mine periodically as the model improves (a form of curriculum).
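
One way to sketch the mining step, assuming you already have query and document embeddings from the current model plus a set of labelled relevant ids per query (the function name, the `relevant` structure and the toy vectors are all hypothetical):

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, relevant, k=2):
    """Per query, return the k top-scoring doc ids that are not labelled relevant."""
    scores = query_vecs @ doc_vecs.T
    ranked = np.argsort(-scores, axis=1)  # best-scoring documents first
    hard = []
    for qi, order in enumerate(ranked):
        negatives = [int(j) for j in order if int(j) not in relevant[qi]]
        hard.append(negatives[:k])
    return hard

doc_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
relevant = [{0}, {2}]  # labelled positives for query 0 and query 1

hard_negatives = mine_hard_negatives(query_vecs, doc_vecs, relevant, k=2)
print(hard_negatives)
```

A practical caveat: highly ranked unlabelled items are sometimes true positives that were never annotated, so mined negatives are usually worth filtering or spot-checking before training on them.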

🛠 Deep Dive

Internals, code, architecture.

5. Two-tower architecture

Query encoder and document encoder can share or differ.
At inference, encode all documents once → store in vector index. Encode each query online → ANN search.
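Cheap to scale to billions of docs, because document embeddings are precomputed and only the query is encoded online. Widely used in large-scale search and recommendation, for example at Google, YouTube, Pinterest and TikTok.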
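
The offline/online split can be sketched as follows. The encoder here is a deliberately crude stand-in (a hashed bag-of-words shared by both towers), and `toy_encode` and `TinyIndex` are hypothetical names; a real system would use trained towers and an ANN index instead of the brute-force matrix product.

```python
import zlib
import numpy as np

DIM = 256

def toy_encode(text):
    """Stand-in for a trained tower: hashed bag-of-words, L2-normalised."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class TinyIndex:
    def __init__(self, docs):
        self.docs = list(docs)
        # Offline step: every document is encoded once and stored.
        self.matrix = np.stack([toy_encode(doc) for doc in self.docs])

    def search(self, query, k=1):
        # Online step: only the query is encoded at request time.
        scores = self.matrix @ toy_encode(query)
        top = np.argsort(-scores)[:k]
        return [(self.docs[i], float(scores[i])) for i in top]

index = TinyIndex([
    "red running shoes",
    "wireless noise cancelling headphones",
    "cast iron skillet",
])
results = index.search("running shoes", k=1)
print(results)
```

Because the document matrix is built once, adding query traffic only costs one encoder call and one similarity lookup per request.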

6. Multimodal — CLIP & friends

CLIP trains a text encoder and an image encoder jointly with contrastive loss on 400M (image, caption) pairs. Result: a shared space where f_text("a dog on a beach") ≈ f_image(<that image>). Drives zero-shot classification, retrieval, and modern T2I/T2V models. Same recipe scaled to text-audio (CLAP), text-video, text-code.
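
Zero-shot classification in a shared space reduces to a softmax over similarities between one image vector and one text vector per class prompt. The sketch below uses hand-made 3-d vectors in place of real encoder outputs, and the temperature value is an assumption rather than a published setting.

```python
import numpy as np

def zero_shot_probs(image_vec, prompt_vecs, tau=0.01):
    """Softmax over cosine similarities between an image and class prompts."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    prompt_vecs = prompt_vecs / np.linalg.norm(prompt_vecs, axis=1, keepdims=True)
    logits = prompt_vecs @ image_vec / tau
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

labels = ["dog", "cat", "car"]
# Hypothetical text embeddings for prompts such as "a photo of a dog".
prompt_vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]])
# Hypothetical image embedding that lies closest to the "dog" prompt.
image_vec = np.array([0.8, 0.3, 0.1])

probs = zero_shot_probs(image_vec, prompt_vecs)
predicted = labels[int(np.argmax(probs))]
print(predicted)
```

No classifier head is trained here: adding a class only requires embedding a new prompt.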

7. Normalisation, dimensionality, distance

L2-normalise embeddings before storing → cosine ≡ dot product, ANN libraries optimise dot.
Higher d = more capacity but more storage and slower ANN. 384-1536 is the sweet spot.
For huge corpora, train Matryoshka embeddings (the approach behind OpenAI's text-embedding-3 models): one model produces vectors that remain usable when truncated to a prefix (e.g. 256/512/1024 dimensions), giving a quality/cost trade-off at query time.
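
The normalisation and truncation points can be checked numerically. This sketch uses random vectors, so it only demonstrates the mechanics (cosine equals dot product after L2-normalisation, and a truncated prefix must be re-normalised); truncation preserves quality only for models actually trained with a Matryoshka-style objective. The helper names are illustrative.

```python
import numpy as np

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def truncate_embedding(x, dim):
    """Keep the first `dim` coordinates, then re-normalise to unit length."""
    return l2_normalise(x[..., :dim])

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 1024))  # stand-in for raw model outputs

norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / (norms[:, None] * norms[None, :])
unit = l2_normalise(raw)
dot_of_unit = unit @ unit.T  # equals the cosine matrix for unit vectors

short = truncate_embedding(unit, 256)
bytes_full = unit.astype(np.float32).nbytes
bytes_short = short.astype(np.float32).nbytes
print(np.allclose(cosine, dot_of_unit), bytes_full, bytes_short)
```

Storing the 256-dimension prefix as float32 takes a quarter of the space of the full 1024-dimension vector.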

8. ANN — making it serve fast

Exact nearest neighbour search costs O(Nd) per query. At a billion docs, use an approximate index:

HNSW (hierarchical navigable small world): graph-based, typically ~95-99% recall at millisecond-level latency. It is the default index in Qdrant and Weaviate and is available as an index type in FAISS and pgvector.
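IVF + PQ: IVF partitions the vectors into clusters so a query scans only a few of them; PQ compresses each vector into a handful of codebook ids. Together they give massive memory savings at some cost in recall.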
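ScaNN, DiskANN: ANN libraries from Google and Microsoft respectively, aimed at very large corpora; DiskANN keeps most of the index on SSD to reach billion-scale on a single machine.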
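
To show the recall/speed dial behind IVF, here is a toy inverted-file index in NumPy. It is a sketch under simplifying assumptions: centroids are sampled data points rather than the result of k-means, there is no PQ compression, and `TinyIVF` is a hypothetical name, not a FAISS class.

```python
import numpy as np

class TinyIVF:
    """Toy inverted-file index: coarse centroids plus per-cell id lists."""

    def __init__(self, vectors, n_cells=8, seed=0):
        rng = np.random.default_rng(seed)
        self.vectors = vectors
        # Assumption: centroids are sampled data points; real systems run k-means.
        picks = rng.choice(len(vectors), size=n_cells, replace=False)
        self.centroids = vectors[picks]
        assignment = np.argmax(vectors @ self.centroids.T, axis=1)
        self.cells = [np.flatnonzero(assignment == c) for c in range(n_cells)]

    def search(self, query, k=5, n_probe=2):
        # Scan only the n_probe cells whose centroids score highest.
        nearest = np.argsort(-(self.centroids @ query))[:n_probe]
        candidates = np.concatenate([self.cells[c] for c in nearest])
        scores = self.vectors[candidates] @ query
        return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0] + 0.1 * rng.normal(size=32)
query /= np.linalg.norm(query)

ivf = TinyIVF(vectors, n_cells=8)
exact = np.argsort(-(vectors @ query))[:5]           # brute-force baseline
approx = ivf.search(query, k=5, n_probe=2)           # scans 2 of 8 cells
exhaustive = ivf.search(query, k=5, n_probe=8)       # scans every cell
recall = len(set(approx.tolist()) & set(exact.tolist())) / 5
print(recall)
```

Raising `n_probe` scans more cells and moves recall toward the exact result; probing every cell reproduces brute-force search.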

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Evaluation

Retrieval: Recall@k, MRR, nDCG on a labelled (q, relevant_docs) set.
Clustering: Silhouette, purity vs a gold labelling.
Probing: train a linear classifier on top — measures how separable concepts are in the space.
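
The retrieval metrics are short enough to write by hand. The sketch below assumes binary relevance and a single query; MRR is the mean of `reciprocal_rank` over all queries in the labelled set. The document ids are made up.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]  # system output, best first
relevant = {"d1", "d2"}            # gold labels for this query

r_at_3 = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, k=4)
print(r_at_3, rr, round(ndcg, 4))
```

Recall@k ignores ordering inside the top k, while MRR and nDCG reward placing relevant documents earlier, so the three are usually reported together.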

10. Failure modes

Domain shift: a model trained on web text underperforms on legal/medical without fine-tuning.
Anisotropy: vectors collapse into a narrow cone → cosine becomes uninformative. Mitigate with whitening or contrastive re-training.
Stale embeddings: change the model and forget to re-index → silent quality regression. Always tag stored vectors with model_version.
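
Anisotropy can be checked with a simple diagnostic: the mean cosine similarity between distinct vectors. The sketch below builds a synthetic "narrow cone" (a large shared offset plus small noise, which is an assumption made for illustration) and applies PCA whitening as one possible mitigation.

```python
import numpy as np

def mean_pairwise_cosine(x):
    """Average cosine similarity over all pairs of distinct rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

def whiten(x, eps=1e-8):
    """Centre the data, rotate onto principal axes, scale each axis to unit variance."""
    centred = x - x.mean(axis=0, keepdims=True)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return centred @ eigvecs / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
# Synthetic narrow cone: every vector shares a large offset of 5.0 per dimension.
cone = 5.0 + rng.normal(size=(200, 16))

before = mean_pairwise_cosine(cone)
after = mean_pairwise_cosine(whiten(cone))
print(round(before, 3), round(after, 3))
```

A mean pairwise cosine close to 1 means nearly every pair looks similar; after whitening the average drops to roughly zero, so cosine can discriminate again. The whitening statistics must be fitted once and applied identically to queries and documents.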

11. Practical recipes

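Train-your-own retriever: fine-tune an SBERT base model with MultipleNegativesRankingLoss on about 100k (q, d⁺) pairs for 1-2 epochs. This can yield a large recall gain over zero-shot (on the order of +20 points in favourable cases; measure it on your own evaluation set).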
Cold start: use a strong off-the-shelf model (e.g. bge-m3) and add a cross-encoder reranker rather than fine-tuning the bi-encoder. This is cheaper and delivers wins faster.
Multi-tenant: shard by tenant in the index, normalise per-tenant.
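
The cold-start recipe is a two-stage pipeline: a cheap first-stage retriever produces candidates and a more expensive scorer reorders them. In the sketch below the cross-encoder is replaced by a token-overlap stand-in (`toy_cross_encoder_score` is hypothetical), and the first-stage list is hard-coded where a bi-encoder search would normally supply it.

```python
def toy_cross_encoder_score(query, doc):
    """Stand-in for a cross-encoder: Jaccard overlap of lower-cased tokens."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    union = q_tokens | d_tokens
    return len(q_tokens & d_tokens) / len(union) if union else 0.0

def rerank(query, candidates, score_fn, top_n=2):
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # sort() is stable, so ties keep the first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Hypothetical first-stage output, in the order the retriever returned it.
first_stage = [
    "blue trail running shoes for men",
    "running shoes",
    "shoe rack",
]
reranked = rerank("running shoes", first_stage, toy_cross_encoder_score)
print(reranked)
```

The reranker only sees the short candidate list, so its per-pair cost stays bounded no matter how large the corpus is.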

12. What to take away

"How would you build product search?" Strong answers: two-tower retriever with in-batch + hard negatives, ANN index (HNSW), cross-encoder reranker, A/B harness with CTR / purchase as ground truth. Bonus: mention cold-start with BM25 hybrid.

