ai ml · intermediate · 12m · 2026-06-09

Recommender Systems — Two-Tower, Multi-Stage Ranking

Session 27 of the 48-session learning series.

Date: Tue, 2026-06-30 · Time: 18:00–20:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 18 · Est. read: 2 h

Why this session matters

This is Session 27 of 48 in the ML track. Modern recsys is a 4-stage funnel — candidate retrieval, filtering, ranking, re-ranking — and almost every consumer surface (YouTube, TikTok, Pinterest, Spotify) is some flavour of this pipeline. Understanding the shape is more valuable than memorising any single model.

Agenda

Why CF and matrix factorisation aren't enough at scale
The two-tower architecture — query encoder + item encoder + ANN
Multi-stage funnel — retrieve → filter → rank → re-rank
Loss functions — pointwise, pairwise, listwise; sampled softmax
Evaluation — offline (NDCG, recall@K) and online (CTR, dwell, A/B)

Deep dive

1. The recsys problem

Given a user + context (time, location, recent activity), produce an ordered list of N items that maximises some business objective. The catalog holds millions to billions of items, the latency budget is under 200 ms, and the system may need to score millions of candidates per second.

You cannot score every (user, item) pair. You need a funnel.

2. The 4-stage funnel

[ ~1B items ]
    │  retrieve (recall focus, very cheap)
    ▼
[ ~1000 candidates ]
    │  filter (eligibility, business rules, blocked content)
    ▼
[ ~200 candidates ]
    │  rank (deep model, expensive)
    ▼
[ ~50 ranked ]
    │  re-rank (diversity, freshness, exploration, business)
    ▼
[ Top 10 shown ]

Each stage is a different optimisation problem with different latency/cost/quality budget. Don't conflate them.
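
As a rough illustration, the funnel can be written as four composed functions, each narrowing the candidate set. This is only a sketch: the item fields (cheap_score, rank_score, blocked, creator), the function names and the cut-offs are hypothetical, and in a real system retrieval would be an ANN lookup and ranking a learned model.

```python
# Hypothetical item records: each dict carries a cheap retrieval score, a
# heavier ranking score, a blocked flag and a creator id.

def retrieve(catalog, k):
    """Recall-focused and cheap: keep the k best items by the cheap score."""
    return sorted(catalog, key=lambda it: it["cheap_score"], reverse=True)[:k]

def filter_eligible(candidates):
    """Eligibility and business rules: drop blocked items."""
    return [it for it in candidates if not it["blocked"]]

def rank(candidates, k):
    """Expensive-model stand-in: sort by a precomputed ranking score."""
    return sorted(candidates, key=lambda it: it["rank_score"], reverse=True)[:k]

def rerank(candidates, k):
    """Simple diversity rule: at most one item per creator."""
    seen, out = set(), []
    for it in candidates:
        if it["creator"] in seen:
            continue
        seen.add(it["creator"])
        out.append(it)
        if len(out) == k:
            break
    return out

def recommend(catalog, n_retrieve=1000, n_rank=50, n_show=10):
    """Run the four stages in order; each stage sees fewer items than the last."""
    candidates = retrieve(catalog, n_retrieve)
    candidates = filter_eligible(candidates)
    candidates = rank(candidates, n_rank)
    return rerank(candidates, n_show)
```

Keeping the stages as separate functions mirrors the point above: each one can be tuned, tested and budgeted independently.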

3. Retrieval — the two-tower architecture

[ user features ] ──► [ Query Tower DNN ] ──► query embedding u
                                                  │
                                          cos sim │ → ANN over items
                                                  │
[ item features ] ──► [ Item Tower DNN ] ──► item embedding v

Train so that cos(u, v) is high for (user, item) pairs where the user engaged with the item, and low for pairs without engagement. At serve time:

Pre-compute all item embeddings, push to ANN index (Faiss, ScaNN, vector DB).
Online: encode user; ANN-search top-K items.
Latency ~5–20 ms for 1B items.

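The two towers don't share weights. The dot product (which equals cosine similarity once both embeddings are L2-normalised) is the only meeting point, and that is what makes it ANN-able: item embeddings can be computed and indexed without knowing the user.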
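
A minimal sketch of the serving path, assuming each tower is a single linear layer followed by L2 normalisation (a stand-in for the real DNNs) and using exact brute-force search where a production system would query an ANN index. The weights, feature vectors and item ids below are made up for illustration.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length so that dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def tower(features, weights):
    """Stand-in for a tower DNN: one linear layer, then L2 normalisation."""
    out = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return l2_normalise(out)

def top_k(query_emb, item_embs, k):
    """Exact top-K by dot product; an ANN index approximates this step."""
    scored = [
        (sum(q * v for q, v in zip(query_emb, emb)), item_id)
        for item_id, emb in item_embs.items()
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Hypothetical weights: the two towers have separate parameters.
QUERY_W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # 3 user features -> 2 dims
ITEM_W = [[1.0, 0.0], [0.0, 1.0]]             # 2 item features -> 2 dims

# Offline: embed every item once and store the result (the "index").
item_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
item_index = {i: tower(f, ITEM_W) for i, f in item_features.items()}

# Online: embed the user, then search the precomputed item embeddings.
user_emb = tower([1.0, 0.2, 0.0], QUERY_W)
nearest = top_k(user_emb, item_index, k=2)
```

Note that the item side never sees user features, which is why all of its work can be moved offline.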

4. Sampled softmax — making it tractable

Naive softmax over a billion items is impossible. Use:

In-batch negatives — for a batch of B (user, item) pairs, use the other B-1 items in the batch as negatives. Free, but popular items show up as negatives more often than a uniform sample would give, so they get over-penalised.
Negative mining — sample hard negatives (similar but unclicked). Better gradient, more compute.
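Log-Q correction — adjust for the in-batch popularity bias by correcting the training logit: score -= log(P(item)), where P(item) is the estimated probability that the item appears in a batch. YouTube uses this.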
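
The in-batch softmax with log-Q correction can be sketched as below. The assumptions: scores is a B×B matrix of user-item dot products whose diagonal holds the positives, and item_probs is an estimate of each item's batch-sampling probability obtained elsewhere (for example from item frequency counts). The function name and argument layout are illustrative only.

```python
import math

def in_batch_softmax_loss(scores, item_probs=None):
    """Mean softmax cross-entropy over a batch.

    scores[i][j] is dot(u_i, v_j); item j == i is the positive for user i,
    every other item in the row acts as a negative.
    item_probs[j], if given, is the estimated probability that item j is
    sampled into a batch; its log is subtracted from the logit (log-Q).
    """
    batch = len(scores)
    total = 0.0
    for i in range(batch):
        logits = list(scores[i])
        if item_probs is not None:
            logits = [s - math.log(p) for s, p in zip(logits, item_probs)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / batch
```

The correction is applied to training logits only; at serve time the raw dot product is used for ANN search.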

5. Multi-stage ranker

After retrieval and filtering, the ~1000 retrieved candidates are down to a few hundred. Now you can afford a heavier model. Inputs include rich features:

User features — age, country, language, tenure, recent watch history embedding.
Item features — topic, language, age, view count, creator, embedding.
Cross features — user-item interaction history, same-creator engagement.
Context — device, time, session length, what they just saw.

Common ranker architectures:

Wide & Deep (Google) — wide linear + deep DNN.
DCN / DCN-V2 (Google) — cross network for explicit feature interactions.
DLRM (Meta) — embedding tables + DNN, scales to TBs of params.
Transformer-based — recent shift; user history as sequence.

6. Loss functions

Pointwise — predict P(click | user, item). Treat each example independently. BCE loss. Simple but loses ranking signal.
Pairwise — for each (item_pos, item_neg) pair, train so that score(pos) > score(neg). Margin loss, RankNet. Good ranking signal.
Listwise — score the whole list; NDCG-aware loss (LambdaRank, ListNet). Hardest, often best.

In practice: train a multi-task model (CTR + dwell + share), use pointwise BCE per head, combine at serve with learned or hand-tuned weights.
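
To make the difference concrete, here is a small sketch of a pointwise BCE loss, a RankNet-style pairwise loss, and a serve-time weighted combination of per-head predictions. The head names and weights in combine_heads are hypothetical; real weights would be learned or tuned against online metrics.

```python
import math

def bce_loss(logit, label):
    """Pointwise binary cross-entropy on a raw logit (numerically stable form)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def ranknet_loss(score_pos, score_neg):
    """Pairwise logistic loss: -log(sigmoid(score_pos - score_neg))."""
    diff = score_pos - score_neg
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def combine_heads(preds, weights):
    """Serve-time score: weighted sum of per-head predictions."""
    return sum(weights[head] * preds[head] for head in weights)

# Hypothetical multi-task output for one (user, item) pair.
preds = {"ctr": 0.10, "dwell": 0.50, "share": 0.02}
weights = {"ctr": 1.0, "dwell": 0.5, "share": 2.0}
final_score = combine_heads(preds, weights)
```

The pointwise loss only looks at one item's logit and label, while the pairwise loss depends solely on the score difference, so it is indifferent to calibration but sensitive to order.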

7. Re-ranking — the often-skipped stage

The top-50 from the ranker is sorted by predicted relevance. But:

All 5 top picks might be from the same creator → boring feed.
All 5 might be 3 days old → stale.
All 5 might be ad-adjacent → ads tank.

Re-ranking applies:

Diversity (MMR, DPP) — penalise similarity to already-picked items.
Freshness boost — exponential decay on item age.
Exploration — ε-greedy inject of low-confidence items to gather data.
Business rules — slot ads, demote restricted content, surface promoted creator.
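
A minimal sketch of MMR (maximal marginal relevance) for the diversity step, assuming a relevance score per item from the ranker and a pairwise similarity function. Here similarity is simply "same creator", and the item ids, scores and the lam trade-off value are invented for illustration.

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: repeatedly pick the item with the best trade-off between
    relevance and similarity to the items already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical ranker output: two strong items from creator A, then B and C.
relevance = {"a1": 0.90, "a2": 0.85, "b1": 0.80, "c1": 0.50}
creator = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}

def same_creator(x, y):
    return 1.0 if creator[x] == creator[y] else 0.0

feed = mmr(list(relevance), relevance, same_creator, k=3)
```

With lam=1.0 the function falls back to a plain sort by relevance; lowering lam pushes the second item from creator A further down the feed.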

8. The cold-start nightmare

Three flavours:

New user — no history. Solve with: onboarding survey, demographic priors, item popularity, contextual bandits.
New item — no engagement. Solve with: content-based features (text, image), creator priors, exploration slots in the feed.
New platform — no anything. Solve with: editorial picks, popularity, manual ranking.

Two-tower handles new items well if the item tower uses only content features (no item ID), because a new item's embedding can be computed before it has any engagement. Pure ID-embedding towers can't generalise to unseen IDs.

9. Evaluation

Offline:

Recall@K for retrieval (did we get the clicked item in top-K?).
NDCG@K for ranking (positional discount).
MAP, MRR for binary relevance.
Hit rate — quick sanity.
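
The two main offline metrics are short enough to write out. This sketch assumes linear gains in DCG (some definitions use 2^rel - 1 instead) and a log2(position + 1) discount with positions starting at 1; the function names are illustrative.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at 0-based index pos is divided
    by log2(pos + 2), i.e. log2(rank + 1) for 1-based rank."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))

def ndcg_at_k(ranked, gains_by_item, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [gains_by_item.get(item, 0.0) for item in ranked]
    ideal = sorted(gains_by_item.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

Recall@K ignores order inside the top-K, which suits retrieval; NDCG@K rewards putting relevant items higher, which suits the ranker.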

Watch out: offline metrics often correlate poorly with online metrics. Always run the model in shadow mode and then A/B test it before shipping.

Online:

CTR — easy to game (clickbait).
Dwell time — better signal of satisfaction.
Long-term metrics — DAU, retention, time-to-second-action. Slow but real.
Surveys / explicit feedback — gold but small sample.

10. The popularity trap

Most metrics reward recommending popular items. You end up showing everyone the same 100 items, killing long-tail discovery, killing creator economy, killing future training signal.

Mitigations: IPS (inverse propensity scoring) in training, exploration bonuses in re-rank, diversity loss, explicit "discovery" slots.
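
As a sketch of the IPS idea: each logged example's loss is weighted by the inverse of the probability that the logging policy showed that item, so rarely shown items count for more. The propensity values are assumed to be logged or estimated elsewhere, and the clipping threshold is a hypothetical choice to limit variance.

```python
def ips_weighted_loss(losses, propensities, clip=10.0):
    """Mean per-example loss, weighted by clipped inverse propensity.

    losses[i] is the model's loss on logged example i.
    propensities[i] is the probability the logging policy showed that item.
    """
    if any(p <= 0.0 or p > 1.0 for p in propensities):
        raise ValueError("propensities must be in (0, 1]")
    weights = [min(1.0 / p, clip) for p in propensities]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

For example, with equal losses of 0.2 and propensities of 0.5 and 0.05, the second example gets a raw weight of 20, clipped to 10, so it contributes five times as much as the first.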

11. Feedback loops are everywhere

Train model → recommend → users engage with what's recommended → log → retrain. The model trains on data it caused. This causes:

Filter bubbles.
Stale catalog blind spots.
Self-reinforcing bias.

Mitigate with: held-out random slots, off-policy correction (IPS), occasional batch refresh from broad sources.

12. Reality check

A minimum viable recsys for a startup:

ALS or matrix factorisation as baseline.
Two-tower with item content features for retrieval.
LightGBM ranker with hand-crafted features for ranking.
Hand-tuned re-rank rules (recency, dedup by author).
A/B test framework with statistical power planning.

You can serve millions of users with this stack. Add neural rankers, sequential models, embedding tables when business need is clear.
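
For the power-planning item in the list above, a back-of-the-envelope sample-size estimate for a two-sided test on two proportions (for example CTR in control versus treatment) might look like this. It uses the standard normal-approximation formula; the function name and default alpha and power are illustrative choices, not something prescribed by this session.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect p_control -> p_treatment."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical plan: detect a CTR lift from 5.0% to 5.5%.
n_needed = samples_per_arm(0.05, 0.055)
```

Small relative lifts on low base rates need tens of thousands of users per arm, which is worth knowing before committing to an experiment.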

Link: https://leetcode.com/problems/design-twitter/
Difficulty: Medium
Why this problem: Pulling top-K most-recent items across followed users — same shape as merging candidate sources in recsys retrieval.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Draw the 4-stage recsys funnel and explain the latency/cost tradeoff at each stage.
Describe the two-tower architecture and why dot-product separation enables ANN.
Explain sampled softmax + log-Q correction.
Pick between pointwise / pairwise / listwise loss for a given scenario.
List 3 cold-start strategies for new items.
Solve design-twitter — k-way merge of per-user feeds, the recsys retrieval primitive.
