Recommender Systems — Two-Tower, Multi-Stage Ranking
Session 27 of the 48-session learning series.
Date: Tue, 2026-06-30 · Time: 18:00–20:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 18 · Est. read: 2 h
Why this session matters
Modern recsys is a 4-stage funnel (candidate retrieval, filtering, ranking, re-ranking), and almost every consumer surface (YouTube, TikTok, Pinterest, Spotify) is some flavour of this pipeline. Understanding the shape is more valuable than memorising any single model.
Agenda
- Why CF and matrix factorisation aren't enough at scale
- The two-tower architecture — query encoder + item encoder + ANN
- Multi-stage funnel — retrieve → filter → rank → re-rank
- Loss functions — pointwise, pairwise, listwise; sampled softmax
- Evaluation — offline (NDCG, recall@K) and online (CTR, dwell, A/B)
Pre-read (skim before the session)
- YouTube Recommendations DNN (Covington et al., 2016)
- Pinterest — PinSage (KDD 2018)
- Facebook — Embedding-based Retrieval at FB
- Eugene Yan — Real-time recommendations
Deep dive
1. The recsys problem
Given a user + context (time, location, recent activity), produce an ordered list of N items that maximises some business objective. The catalog holds millions to billions of items, the latency budget is < 200 ms, and the system may serve millions of candidates per second.
You cannot score every (user, item) pair. You need a funnel.
2. The 4-stage funnel
[ ~1B items ]
│ retrieve (recall focus, very cheap)
▼
[ ~1000 candidates ]
│ filter (eligibility, business rules, blocked content)
▼
[ ~200 candidates ]
│ rank (deep model, expensive)
▼
[ ~50 ranked ]
│ re-rank (diversity, freshness, exploration, business)
▼
[ Top 10 shown ]
Each stage is a different optimisation problem with its own latency/cost/quality budget. Don't conflate them.
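The funnel above can be sketched as a plain function that chains four stage callbacks. This is only an illustrative skeleton: `run_funnel`, the `Item` record, and the toy stage functions are hypothetical names invented here, not part of any library.

```python
from typing import Callable, Dict, List

Item = Dict[str, object]  # hypothetical item record, e.g. {"id": 3, "blocked": False}


def run_funnel(
    catalog: List[Item],
    retrieve: Callable[[List[Item]], List[Item]],
    is_eligible: Callable[[Item], bool],
    rank_score: Callable[[Item], float],
    rerank: Callable[[List[Item]], List[Item]],
    n_rank: int = 50,
    n_show: int = 10,
) -> List[Item]:
    candidates = retrieve(catalog)                          # cheap, recall-focused
    candidates = [c for c in candidates if is_eligible(c)]  # eligibility, business rules
    ranked = sorted(candidates, key=rank_score, reverse=True)[:n_rank]  # expensive scorer
    return rerank(ranked)[:n_show]                          # diversity, freshness, ...


# Toy usage: each stage is a stand-in for a real system.
catalog = [{"id": i, "blocked": i % 7 == 0, "quality": (i * 37) % 101} for i in range(10_000)]
shown = run_funnel(
    catalog,
    retrieve=lambda items: items[:1000],        # stand-in for ANN retrieval
    is_eligible=lambda it: not it["blocked"],
    rank_score=lambda it: it["quality"],        # stand-in for a deep ranker
    rerank=lambda items: items,                 # identity re-rank
)
```

Keeping the stages as separate callables mirrors the point above: each one can be tuned, timed, and replaced on its own budget.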
3. Retrieval — the two-tower architecture
[ user features ] ──► [ Query Tower DNN ] ──► query embedding u
                                                    │
                                            cos sim │ → ANN over items
                                                    │
[ item features ] ──► [ Item Tower DNN ]  ──► item embedding v
Train so that cos(u, v) is high for (user, item) pairs where the user engaged with the item, and low for pairs where they did not. At serve time:
- Pre-compute all item embeddings, push to ANN index (Faiss, ScaNN, vector DB).
- Online: encode user; ANN-search top-K items.
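- Latency is roughly 5–20 ms even at ~1B items, assuming a sharded, compressed index; the exact figure depends on hardware and the recall target.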
The two towers don't share weights. The similarity score is the only meeting point (a dot product, which equals cosine similarity when both embeddings are L2-normalised), and that's what makes it ANN-able.
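A minimal sketch of the serve-time path, assuming the towers have already produced embeddings. It uses an exact brute-force scan as a stand-in for the ANN index, so it shows the interface (pre-computed item vectors, one query vector, top-K by dot product) rather than how Faiss or ScaNN work internally. The item IDs and vectors are made up for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalise(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def top_k(query: Sequence[float], item_index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Exact top-K by dot product; a real system would use an ANN index here."""
    q = l2_normalise(query)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), item_id)
        for item_id, vec in item_index.items()
    )
    return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]


# Offline: pre-compute and normalise item embeddings (hypothetical values).
item_index = {
    "cooking_101": l2_normalise([0.9, 0.1, 0.0]),
    "pasta_basics": l2_normalise([0.8, 0.3, 0.1]),
    "guitar_intro": l2_normalise([0.0, 0.2, 0.9]),
}

# Online: encode the user (here a hard-coded vector), then search.
user_embedding = [1.0, 0.2, 0.0]
results = top_k(user_embedding, item_index, k=2)
```

Because both sides are normalised, the dot product here is the cosine similarity, and the item side never needs to see the user at indexing time.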
4. Sampled softmax — making it tractable
Naive softmax over a billion items is impossible. Use:
- In-batch negatives — for a batch of B (user, item) pairs, use the other B-1 items in the batch as negatives. Free, but popular items show up as negatives more often, so they get over-penalised.
- Negative mining — sample hard negatives (similar but unclicked). Better gradient, more compute.
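- Log-Q correction — adjust for in-batch popularity bias by subtracting the log sampling probability from each training logit: score -= log(P(item)), where P(item) is the probability that the item appears in a batch. YouTube uses this.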
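The in-batch softmax with log-Q correction can be sketched in a few lines. This is a simplified, framework-free illustration: `in_batch_softmax_loss` is a hypothetical helper, the logits matrix is assumed to be square with the positive for row i in column i, and the sampling probabilities are assumed to be estimated elsewhere (for example from item frequency in the training stream).

```python
import math
from typing import Sequence


def in_batch_softmax_loss(
    logits: Sequence[Sequence[float]],
    item_sampling_prob: Sequence[float],
) -> float:
    """logits[i][j] = score(user_i, item_j); the positive for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        # Log-Q correction: subtract log P(item_j) from every logit in the row.
        corrected = [s - math.log(p) for s, p in zip(row, item_sampling_prob)]
        # Numerically stable log-sum-exp over the batch items.
        m = max(corrected)
        log_z = m + math.log(sum(math.exp(c - m) for c in corrected))
        total += log_z - corrected[i]  # cross-entropy against the positive
    return total / len(logits)


logits = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.2],
    [0.4, 0.1, 1.0],
]
item_sampling_prob = [0.6, 0.3, 0.1]  # item 0 is the popular one (made-up numbers)
loss = in_batch_softmax_loss(logits, item_sampling_prob)
```

The correction is applied only during training; at serve time the raw similarity score is used. With uniform sampling probabilities the correction shifts every logit by the same constant and has no effect on the loss.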
5. Multi-stage ranker
After retrieval and filtering, you have a few hundred candidates (~200 in the funnel above). Now you can afford a heavier model. Inputs include rich features:
- User features — age, country, language, tenure, recent watch history embedding.
- Item features — topic, language, age, view count, creator, embedding.
- Cross features — user-item interaction history, same-creator engagement.
- Context — device, time, session length, what they just saw.
Common ranker architectures:
- Wide & Deep (Google) — wide linear + deep DNN.
- DCN / DCN-V2 (Google) — cross network for explicit feature interactions.
- DLRM (Meta) — embedding tables + DNN, scales to TBs of params.
- Transformer-based — recent shift; user history as sequence.
6. Loss functions
- Pointwise — predict P(click | user, item). Treat each example independently. BCE loss. Simple but loses ranking signal.
- Pairwise — for each (item_pos, item_neg) pair, train so that score(pos) > score(neg). Margin loss, RankNet. Good ranking signal.
- Listwise — score the whole list; NDCG-aware loss (LambdaRank, ListNet). Hardest, often best.
In practice: train a multi-task model (CTR + dwell + share), use pointwise BCE per head, combine at serve with learned or hand-tuned weights.
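To make the difference concrete, here is a small sketch of a pointwise BCE loss, a RankNet-style pairwise logistic loss, and a hand-weighted combination of multi-task head outputs. The function names, head names, and weights are illustrative assumptions, not a reference implementation of any production ranker.

```python
import math
from typing import Dict


def bce_loss(logit: float, label: int) -> float:
    """Pointwise: binary cross-entropy on one (user, item) example, in a stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def pairwise_logistic_loss(score_pos: float, score_neg: float) -> float:
    """Pairwise (RankNet-style): -log sigmoid(score_pos - score_neg), in a stable form."""
    diff = score_pos - score_neg
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def combined_score(head_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Serve-time blend of per-head predictions with hand-tuned weights."""
    return sum(weights[name] * p for name, p in head_probs.items())


# Hypothetical head outputs and weights for one candidate.
score = combined_score(
    {"click": 0.5, "dwell": 0.2, "share": 0.01},
    {"click": 1.0, "dwell": 2.0, "share": 10.0},
)
```

The pointwise loss only looks at one example at a time, while the pairwise loss depends only on the score difference, which is why it carries ranking signal directly.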
7. Re-ranking — the often-skipped stage
The top-50 from the ranker is sorted by predicted relevance. But:
- All 5 top picks might be from the same creator → boring feed.
- All 5 might be 3 days old → stale.
- All 5 might be ad-adjacent → ads tank.
Re-ranking applies:
- Diversity (MMR, DPP) — penalise similarity to already-picked items.
- Freshness boost — exponential decay on item age.
- Exploration — ε-greedy injection of low-confidence items to gather data.
- Business rules — slot ads, demote restricted content, surface promoted creator.
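As one example of the diversity step, a greedy MMR (maximal marginal relevance) re-ranker can be sketched as follows. The relevance scores, the same-creator similarity function, and the trade-off weight `lam` are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List


def mmr_rerank(
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    lam: float = 0.7,
) -> List[str]:
    """Greedily pick items, trading relevance against similarity to items already picked."""
    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def mmr(item: str) -> float:
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1.0 - lam) * max_sim

        best = max(sorted(remaining), key=mmr)  # sorted() keeps tie-breaks deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: two strong items from creator A, one from B, one weaker from C.
relevance = {"a1": 0.95, "a2": 0.93, "b1": 0.90, "c1": 0.60}
creator = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
same_creator = lambda x, y: 1.0 if creator[x] == creator[y] else 0.0

feed = mmr_rerank(relevance, same_creator, k=3)
```

In this toy example the second item from creator A is pushed out of the top 3 even though its raw relevance is higher than that of the items from B and C.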
8. The cold-start nightmare
Three flavours:
- New user — no history. Solve with: onboarding survey, demographic priors, item popularity, contextual bandits.
- New item — no engagement. Solve with: content-based features (text, image), creator priors, exploration slots in the feed.
- New platform — no anything. Solve with: editorial picks, popularity, manual ranking.
Two-tower handles new items well if the item tower uses only content features (no item ID). Pure ID-embedding towers can't generalise to items they have never seen.
9. Evaluation
Offline:
- Recall@K for retrieval (did we get the clicked item in top-K?).
- NDCG@K for ranking (positional discount).
- MAP, MRR for binary relevance.
- Hit rate — quick sanity.
Watch out: offline metrics often correlate poorly with online ones. Always shadow + A/B before shipping.
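The two headline offline metrics are short enough to write out. This sketch assumes binary relevance (an item is either in the user's held-out engaged set or not); `recall_at_k` and `ndcg_at_k` are local helper names, not library functions.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top K."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """DCG with a log2 positional discount, normalised by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
r2 = recall_at_k(ranked, relevant, k=2)
n4 = ndcg_at_k(ranked, relevant, k=4)
```

Recall@K ignores position inside the top K, which suits retrieval; NDCG@K rewards putting relevant items earlier, which suits ranking.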
Online:
- CTR — easy to game (clickbait).
- Dwell time — better signal of satisfaction.
- Long-term metrics — DAU, retention, time-to-second-action. Slow but real.
- Surveys / explicit feedback — gold but small sample.
10. The popularity trap
Most metrics reward recommending popular items. You end up showing everyone the same 100 items, killing long-tail discovery, the creator economy, and your future training signal.
Mitigations: IPS (inverse propensity scoring) in training, exploration bonuses in re-rank, diversity loss, explicit "discovery" slots.
11. Feedback loops are everywhere
Train model → recommend → users engage with what's recommended → log → retrain. The model trains on data it caused. This causes:
- Filter bubbles.
- Stale catalog blind spots.
- Self-reinforcing bias.
Mitigate with: held-out random slots, off-policy correction (IPS), occasional batch refresh from broad sources.
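A minimal sketch of an IPS-style off-policy estimate, assuming each logged impression carries the probability with which the logging policy showed it. The log tuple layout and the clipping threshold are assumptions made for this example; clipping is a common variance-reduction choice, not something the session prescribes.

```python
from typing import List, Tuple


def ips_estimate(logs: List[Tuple[float, float, float]], clip: float = 10.0) -> float:
    """Estimate a new policy's average reward from logs of an old policy.

    Each log entry is (reward, logging_propensity, target_policy_prob).
    """
    total = 0.0
    for reward, p_log, p_target in logs:
        weight = min(p_target / p_log, clip)  # clipped importance weight
        total += weight * reward
    return total / len(logs)


# Made-up logs: the second impression was rarely shown by the old policy,
# so its click counts for more when evaluating a policy that favours it.
logs = [
    (1.0, 0.50, 0.25),
    (1.0, 0.05, 0.25),
    (0.0, 0.45, 0.50),
]
estimate = ips_estimate(logs)
```

The held-out random slots mentioned above are what make the logging propensities known and non-zero; without them the weights cannot be computed reliably.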
12. Reality check
A minimum viable recsys for a startup:
- ALS matrix factorisation as a baseline.
- Two-tower with item content features for retrieval.
- LightGBM ranker with hand-crafted features for ranking.
- Hand-tuned re-rank rules (recency, dedup by author).
- A/B test framework with statistical power planning.
You can serve millions of users with this stack. Add neural rankers, sequential models, and embedding tables when the business need is clear.
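For the power-planning item, a rough sample-size estimate for a CTR A/B test can be sketched with the standard normal approximation for comparing two proportions. The function name and the example numbers (5% baseline CTR, 10% relative lift) are assumptions; a real experiment platform may use a different test or variance model.

```python
import math
from statistics import NormalDist


def samples_per_arm(
    baseline_rate: float,
    min_detectable_lift: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + min_detectable_lift)  # relative lift
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


n_small_lift = samples_per_arm(baseline_rate=0.05, min_detectable_lift=0.10)
n_big_lift = samples_per_arm(baseline_rate=0.05, min_detectable_lift=0.20)
```

Halving the detectable lift roughly quadruples the required sample, which is why small ranking tweaks need long-running experiments.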
Reading material
- YouTube DNN (Covington et al., 2016)
- PinSage (KDD 2018)
- Embedding-based Retrieval at Facebook
- Eugene Yan — System design for recsys
In-depth research material
- Recsys 2023 keynotes — papers worth reading
- Sampling-Bias-Corrected Neural Modeling (Yi et al., 2019)
- Deep Cross Network V2 (Wang et al., 2020)
- DLRM (Meta, 2019)
Video reference
▶︎ How YouTube Recommendations Actually Work (Stanford)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Twitter
- Link: https://leetcode.com/problems/design-twitter/
- Difficulty: Medium
- Why this problem: Pulling top-K most-recent items across followed users — same shape as merging candidate sources in recsys retrieval.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Draw the 4-stage recsys funnel and explain the latency/cost tradeoff at each stage.
- Describe the two-tower architecture and why dot-product separation enables ANN.
- Explain sampled softmax + log-Q correction.
- Pick between pointwise / pairwise / listwise loss for a given scenario.
- List 3 cold-start strategies for new items.
- Solve design-twitter — k-way merge of per-user feeds, the recsys retrieval primitive.