Day 18 — Recommender Systems — Two-Tower, Multi-Stage Ranking
Recsys drives YouTube, TikTok, Amazon, Spotify, Pinterest — and is one of the highest-ROI ML problems anywhere. The two-tower retriever + multi-stage ranker is…
Modern recsys is multi-stage because the catalogue is huge (millions of items) but the model must respond in 100 ms. Each stage trades recall for precision.
🧠 Concept
Why it matters & the mental model.
1. The three stages
- Retrieval (recall): from millions → 1k. Cheap, optimised for not missing relevant items.
- Ranker (precision): from 1k → 100. Heavy features, expensive model.
- Re-ranker (UX): diversity, freshness, fairness, business rules.
2. Retrieval methods
- Collaborative filtering: matrix factorisation (ALS) — historical, still strong baseline.
- Two-tower (DSSM): user tower + item tower → embeddings → ANN. Industry default.
- Graph (GNN): PinSage, GraphSAGE — captures multi-hop similarity.
- Sequence models (SASRec, BERT4Rec): treat user history as a sequence, predict next item.
- LLM-augmented: use LLM to summarise user intent → embed → retrieve.
3. Two-tower in detail
- User tower: f(user features, recent history) → user_emb.
- Item tower: g(item features) → item_emb.
- Train with in-batch negatives + sampled negatives + hard negatives mined from previous model.
- Loss: InfoNCE / sampled softmax / pairwise BPR.
- Index item embeddings in HNSW / ScaNN; query at runtime with user_emb.
🛠 Deep Dive
Internals, code, architecture.
4. Ranker
Heavy feature engineering:
- User × item cross features (user's avg ctr on this item's category).
- Counters (impression count last 7d, click rate last 1h).
- Embeddings (user, item, category, author).
- Context (time of day, device, locale). Model: GBDT (LightGBM) for tabular dominance, or DLRM-style deep + wide for embedding-heavy. Loss: logistic for click, multi-task for click+watch_time+like.
5. Multi-task & multi-objective
Real feeds optimise multiple objectives: click, watch time, like, save, follow. Architectures:
- MMoE (multi-gate mixture of experts) — shared experts, task-specific gates.
- PLE (progressive layered extraction). Final score = weighted sum (weights tuned via online A/B or constrained optimisation).
6. Negative sampling
Implicit feedback (clicks only, no ratings): every non-click is a potential negative, but most are uninformative.
- Random: cheap, biased toward popular items.
- Popularity-weighted: counters head bias.
- Hard: items the current model ranks high but user didn't engage with.
7. Cold start
- New user: rely on context features (location, signup source, day-1 actions).
- New item: content features (text, image, taxonomy); contextual bandit for exploration.
🚀 In Practice
Trade-offs, exercises, what to ship today.
8. Online learning & exploration
Pure exploit = filter bubble + no learning on new content. Inject exploration:
- ε-greedy: 5% random in slot N.
- Contextual bandits (LinUCB, Thompson Sampling) for the exploration slot.
- Off-policy evaluation (IPS, doubly robust) to measure new policies on logged data.
9. Diversity, freshness, fairness
Re-ranker uses DPP / MMR for diversity, decay for freshness, post-processing for fair exposure. Business rules: "no two items from same author in top 5".
10. Evaluation
- Offline: NDCG, recall@k on logged interactions (with care — biased).
- Online: A/B on CTR, watch time, retention. The truth.
- Counterfactual: replay logged actions under new policy with importance sampling.
11. What to take away
"How would you build YouTube?" Strong answers: retrieve with two-tower, rank with multi-task GBDT or DLRM, re-rank for diversity, A/B with retention as north star. Bonus: cold-start strategy + exploration.
Resources
- 🎥 Stanford CS246 — Recommender Systems
- 📖 Eugene Yan — System design for recommendations
- 📖 YouTube — Deep Neural Networks for YouTube Recommendations (paper)
- 📖 Pinterest — Pinnability ranking
Practice Problem: Find K Closest Elements (Medium)