ai ml · intermediate · 12m · 2026-06-09

Recommender Systems — Two-Tower, Multi-Stage Ranking

Session 27 of the 48-session learning series.

Date: Tue, 2026-06-30 · Time: 18:00–20:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 18 · Est. read: 2 h

Why this session matters

This is Session 27 of 48 in the ML track. Modern recsys is a 4-stage funnel — candidate retrieval, filtering, ranking, re-ranking — and almost every consumer surface (YouTube, TikTok, Pinterest, Spotify) is some flavour of this pipeline. Understanding the shape is more valuable than memorising any single model.

Agenda

  • Why collaborative filtering (CF) and matrix factorisation aren't enough at scale
  • The two-tower architecture — query encoder + item encoder + ANN
  • Multi-stage funnel — retrieve → filter → rank → re-rank
  • Loss functions — pointwise, pairwise, listwise; sampled softmax
  • Evaluation — offline (NDCG, recall@K) and online (CTR, dwell, A/B)

Pre-read (skim before the session)

Deep dive

1. The recsys problem

Given a user + context (time, location, recent activity), produce an ordered list of N items that maximises some business objective. The catalog holds millions to billions of items, the latency budget is < 200 ms, and the system may need to score millions of candidates per second.

You cannot score every (user, item) pair. You need a funnel.

2. The 4-stage funnel

[ ~1B items ]
    │  retrieve (recall focus, very cheap)
    ▼
[ ~1000 candidates ]
    │  filter (eligibility, business rules, blocked content)
    ▼
[ ~200 candidates ]
    │  rank (deep model, expensive)
    ▼
[ ~50 ranked ]
    │  re-rank (diversity, freshness, exploration, business)
    ▼
[ Top 10 shown ]

Each stage is a different optimisation problem with its own latency/cost/quality budget. Don't conflate them.
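The funnel above can be sketched as four small functions chained together. This is a minimal illustration, not a production design: the item fields (`cheap_score`, `rank_score`, `blocked`, `creator`) and the toy catalog are hypothetical, the sorts stand in for an ANN lookup and a deep ranking model, and the re-rank step shows only one rule (dedup by creator).

```python
from typing import Dict, List

Item = Dict[str, object]  # hypothetical record: id, creator, blocked, cheap_score, rank_score


def retrieve(catalog: List[Item], k: int) -> List[Item]:
    # Stand-in for ANN retrieval: keep the k items with the best cheap score.
    return sorted(catalog, key=lambda it: it["cheap_score"], reverse=True)[:k]


def filter_eligible(candidates: List[Item]) -> List[Item]:
    # Eligibility and business rules; here only a "blocked" flag.
    return [it for it in candidates if not it["blocked"]]


def rank(candidates: List[Item], k: int) -> List[Item]:
    # Stand-in for the expensive model: sort by a precomputed score.
    return sorted(candidates, key=lambda it: it["rank_score"], reverse=True)[:k]


def rerank(candidates: List[Item], k: int) -> List[Item]:
    # One illustrative rule: at most one item per creator, in ranked order.
    seen_creators = set()
    feed = []
    for it in candidates:
        if it["creator"] in seen_creators:
            continue
        seen_creators.add(it["creator"])
        feed.append(it)
        if len(feed) == k:
            break
    return feed


def recommend(catalog: List[Item], n_retrieve: int = 1000,
              n_rank: int = 50, n_show: int = 10) -> List[Item]:
    candidates = retrieve(catalog, n_retrieve)
    candidates = filter_eligible(candidates)
    candidates = rank(candidates, n_rank)
    return rerank(candidates, n_show)


# Toy catalog with deterministic, made-up scores.
catalog = [
    {"id": i, "creator": i % 7, "blocked": i % 5 == 0,
     "cheap_score": (i * 37) % 101, "rank_score": (i * 53) % 97}
    for i in range(5000)
]
feed = recommend(catalog)
```

Each function can be swapped out independently, which is the practical payoff of keeping the stages separate.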

3. Retrieval — the two-tower architecture

[ user features ] ──► [ Query Tower DNN ] ──► query embedding u
                                                  │
                                          cos sim │ → ANN over items
                                                  │
[ item features ] ──► [ Item Tower DNN ] ──► item embedding v

Train so that cos(u, v) is high for (user, item) pairs where the user engaged with the item, and low for pairs with no engagement. At serve time:

  1. Pre-compute all item embeddings, push to ANN index (Faiss, ScaNN, vector DB).
  2. Online: encode user; ANN-search top-K items.
  3. Latency can be ~5–20 ms even for 1B items, depending on the index type, hardware, and recall target.

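The two towers don't share weights, and they meet only at the final dot product (which equals the cosine similarity when both embeddings are L2-normalised). Because the item side never depends on the user, item embeddings can be pre-computed and indexed, which is what makes retrieval ANN-able.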
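A minimal sketch of the serve-time flow, using only the standard library. The towers here are untrained, single-layer random projections (an assumption made purely to keep the example short), the feature dimensions are made up, and the exhaustive scan stands in for a real ANN index such as Faiss or ScaNN.

```python
import heapq
import math
import random


def make_tower(in_dim, out_dim, rng):
    """Toy one-layer tower: random linear map followed by L2 normalisation."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def tower(features):
        raw = [sum(w * x for w, x in zip(row, features)) for row in weights]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        return [v / norm for v in raw]

    return tower


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
query_tower = make_tower(in_dim=4, out_dim=8, rng=rng)  # separate weights per tower
item_tower = make_tower(in_dim=6, out_dim=8, rng=rng)

# Offline: embed every item once and store the vectors (the "index").
item_features = {item_id: [rng.random() for _ in range(6)] for item_id in range(1000)}
item_index = {item_id: item_tower(feats) for item_id, feats in item_features.items()}


def retrieve(user_features, k=10):
    # Online: encode the user, then find the k items with the largest dot product.
    u = query_tower(user_features)
    # Exhaustive scan; an ANN index would approximate this top-k much faster.
    return heapq.nlargest(k, item_index, key=lambda item_id: dot(u, item_index[item_id]))


top = retrieve([0.2, 0.9, 0.1, 0.5])
```

Note that `item_index` is built without ever seeing a user, which is the property that lets the item side be pre-computed.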

4. Sampled softmax — making it tractable

A full softmax over a billion items is computationally infeasible at every training step. Use:

  • In-batch negatives — for a batch of B (user, item) pairs, use the other B-1 items in the batch as negatives. Free, but popular items show up as negatives more often, so the model over-penalises them.
  • Negative mining — sample hard negatives (similar but unclicked). Better gradient, more compute.
  • Log-Q correction — adjust for the in-batch sampling bias: score -= log(P(item)), where P(item) is the estimated probability that the item appears in a batch. YouTube's published retrieval work uses this.
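The in-batch softmax with log-Q correction can be sketched as follows. This is an illustrative implementation under stated assumptions: `logits[i][j]` is the score of user i against item j, the positive for row i sits on the diagonal, and `sampling_prob` is a hypothetical list of estimated per-item batch-sampling probabilities (in practice these would be estimated from item frequency).

```python
import math


def in_batch_softmax_loss(logits, sampling_prob=None):
    """Mean cross-entropy over a batch where row i's positive is column i.

    If sampling_prob is given, log(sampling_prob[j]) is subtracted from every
    logit in column j before the softmax (the log-Q correction).
    """
    batch = len(logits)
    total = 0.0
    for i in range(batch):
        row = list(logits[i])
        if sampling_prob is not None:
            row = [s - math.log(sampling_prob[j]) for j, s in enumerate(row)]
        # Numerically stable log-sum-exp.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / batch


logits = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
]
plain = in_batch_softmax_loss(logits)
# Item 0 is assumed to be the popular one, so it is sampled more often.
corrected = in_batch_softmax_loss(logits, sampling_prob=[0.5, 0.25, 0.25])
```

With uniform sampling probabilities the correction shifts every logit by the same constant and the loss is unchanged; it only matters when some items are sampled more often than others.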

5. Multi-stage ranker

After retrieval and filtering, you are down to a few hundred candidates (~200 in the funnel above), so you can afford a heavier model. Inputs include rich features:

  • User features — age, country, language, tenure, recent watch history embedding.
  • Item features — topic, language, age, view count, creator, embedding.
  • Cross features — user-item interaction history, same-creator engagement.
  • Context — device, time, session length, what they just saw.

Common ranker architectures:

  • Wide & Deep (Google) — wide linear + deep DNN.
  • DCN / DCN-V2 (Google) — cross network for explicit feature interactions.
  • DLRM (Meta) — embedding tables + DNN, scales to TBs of params.
  • Transformer-based — recent shift; user history as sequence.

6. Loss functions

  • Pointwise — predict P(click | user, item). Treat each example independently. BCE loss. Simple but loses ranking signal.
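  • Pairwise — for each (item_pos, item_neg) pair, train so that score(pos) > score(neg). Margin loss, RankNet. Good ranking signal.
  • Listwise — score the whole list with a list-level loss: ListNet (cross-entropy over the list), or LambdaRank, which weights pairwise gradients by the NDCG change from swapping the pair. Hardest, often best.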

In practice: train a multi-task model (CTR + dwell + share), use a pointwise loss per head (BCE for binary targets such as click or share), and combine the head outputs at serve time with learned or hand-tuned weights.
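For concreteness, here is a small sketch of the pointwise and pairwise losses and of a serve-time blend of head outputs. The head names and weights in `combined_score` are hypothetical, and the dwell head is assumed to output a probability (for example, of a long dwell) so that all heads are on the same scale.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_loss(logit, label):
    """Pointwise: binary cross-entropy for one (user, item) example."""
    return softplus(logit) - label * logit


def ranknet_loss(score_pos, score_neg):
    """Pairwise (RankNet-style): log(1 + exp(-(score_pos - score_neg)))."""
    return softplus(-(score_pos - score_neg))


def combined_score(head_probs, weights):
    """Serve-time blend: weighted sum of per-head predictions."""
    return sum(weights[name] * p for name, p in head_probs.items())


# Hypothetical hand-tuned weights for three heads.
weights = {"ctr": 1.0, "dwell": 0.5, "share": 2.0}
score = combined_score({"ctr": 0.1, "dwell": 0.4, "share": 0.01}, weights)
```

The pairwise loss depends only on the score difference, which is why it carries ranking signal that the pointwise loss does not.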

7. Re-ranking — the often-skipped stage

The top-50 from the ranker is sorted by predicted relevance. But:

  • All 5 top picks might be from the same creator → boring feed.
  • All 5 might be 3 days old → stale.
  • All 5 might be ad-adjacent → ads tank.

Re-ranking applies:

  • Diversity (MMR, DPP) — penalise similarity to already-picked items.
  • Freshness boost — exponential decay on item age.
  • Exploration — ε-greedy injection of low-confidence items to gather data.
  • Business rules — slot ads, demote restricted content, surface promoted creators.
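Two of the techniques above are easy to sketch: greedy MMR selection and an exponential freshness decay. The trade-off weight `lam`, the half-life parameter, and the "same creator" similarity function are illustrative assumptions; a real system would likely use embedding similarity and tuned values.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: pick the item maximising
    lam * relevance - (1 - lam) * max similarity to already-picked items."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def freshness_boost(age_hours, half_life_hours):
    """Exponential decay on item age: halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


# Toy example: the first character of the id is the creator.
relevance = {"a1": 0.9, "a2": 0.85, "a3": 0.8, "b1": 0.7, "c1": 0.6}


def same_creator(x, y):
    return 1.0 if x[0] == y[0] else 0.0


picked = mmr_rerank(list(relevance), relevance, same_creator, k=3)
# -> ['a1', 'b1', 'c1']
```

Sorting by relevance alone would return three items from creator "a"; the similarity penalty trades a little relevance for a more varied top 3.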

8. The cold-start nightmare

Three flavours:

  • New user — no history. Solve with: onboarding survey, demographic priors, item popularity, contextual bandits.
  • New item — no engagement. Solve with: content-based features (text, image), creator priors, exploration slots in the feed.
  • New platform — no anything. Solve with: editorial picks, popularity, manual ranking.

A two-tower model handles new items well if the item tower uses only content features (no item ID). Pure ID-embedding towers can't generalise to items they never saw during training.

9. Evaluation

Offline:

  • Recall@K for retrieval (did we get the clicked item in top-K?).
  • NDCG@K for ranking (positional discount).
  • MAP, MRR for binary relevance.
  • Hit rate — quick sanity.
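The offline metrics above are short enough to write out directly. This sketch assumes one ranked list per user, a set of relevant (engaged) items, and for NDCG a hypothetical `gain_by_item` mapping from item to graded relevance; it uses the common log2 positional discount.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def dcg_at_k(gains, k):
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked, gain_by_item, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [gain_by_item.get(item, 0.0) for item in ranked]
    ideal = sorted(gain_by_item.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (0.0 if none)."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
r2 = recall_at_k(ranked, relevant, k=2)                # 0.5
n4 = ndcg_at_k(ranked, {"b": 1.0, "d": 1.0}, k=4)
first_hit = mrr(ranked, relevant)                      # 0.5
```

Averaging these per-user values over an evaluation set gives the reported Recall@K, NDCG@K, and MRR.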

Watch out: offline metrics often correlate poorly with online results. Always run in shadow mode and A/B test before shipping.

Online:

  • CTR — easy to game (clickbait).
  • Dwell time — better signal of satisfaction.
  • Long-term metrics — DAU, retention, time-to-second-action. Slow but real.
  • Surveys / explicit feedback — gold but small sample.

10. The popularity trap

Most metrics reward recommending popular items. You end up showing everyone the same 100 items, killing long-tail discovery, killing creator economy, killing future training signal.

Mitigations: IPS (inverse propensity scoring) in training, exploration bonuses in re-rank, a diversity loss, explicit "discovery" slots.
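As a rough illustration of IPS in training, each logged example's loss can be weighted by the inverse of the probability that the item was shown. The `propensities` values and the `clip` threshold below are assumptions for the sketch; estimating propensities well is the hard part and is not covered here.

```python
def ips_weighted_loss(losses, propensities, clip=10.0):
    """Mean per-example loss, each weighted by 1 / propensity.

    Weights are clipped to limit the variance caused by very small propensities.
    """
    weights = [min(1.0 / p, clip) for p in propensities]
    return sum(w * loss for w, loss in zip(weights, losses)) / len(losses)


# The second example was rarely shown (propensity 0.01), so it is up-weighted,
# but only up to the clip value.
value = ips_weighted_loss(losses=[1.0, 1.0], propensities=[0.5, 0.01])
# -> 6.0
```

Items that the old policy rarely showed count for more, which pushes back against the tendency to learn only from already-popular items.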

11. Feedback loops are everywhere

Train model → recommend → users engage with what's recommended → log → retrain. The model trains on data it caused. This causes:

  • Filter bubbles.
  • Stale catalog blind spots.
  • Self-reinforcing bias.

Mitigate with: held-out random slots, off-policy correction (IPS), occasional batch refresh from broad sources.
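Held-out random slots can be sketched as an ε-greedy pass over the ranked feed. The function name, the `explore_pool` argument, and the "explore"/"exploit" tags are hypothetical; the point of the tag is that explored impressions are logged separately, so they can later serve as less biased training and evaluation data.

```python
import random


def inject_exploration(ranked, explore_pool, epsilon, rng):
    """With probability epsilon per slot, show a random pool item instead.

    Returns (item, tag) pairs so that exploration impressions can be logged.
    """
    pool = [item for item in explore_pool if item not in ranked]
    feed = []
    for item in ranked:
        if pool and rng.random() < epsilon:
            pick = pool.pop(rng.randrange(len(pool)))
            feed.append((pick, "explore"))
        else:
            feed.append((item, "exploit"))
    return feed


rng = random.Random(0)
ranked = ["a", "b", "c"]
pool = ["x", "y", "z", "a"]
no_explore = inject_exploration(ranked, pool, epsilon=0.0, rng=rng)
all_explore = inject_exploration(ranked, pool, epsilon=1.0, rng=rng)
```

In practice ε is kept small, since every explored slot trades some short-term engagement for data the model would otherwise never see.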

12. Reality check

A minimum viable recsys for a startup:

  • ALS matrix factorisation as a baseline.
  • Two-tower with item content features for retrieval.
  • LightGBM ranker with hand-crafted features for ranking.
  • Hand-tuned re-rank rules (recency, dedup by author).
  • A/B test framework with statistical power planning.

You can serve millions of users with this stack. Add neural rankers, sequential models, and large embedding tables once the business need is clear.

Reading material

In-depth research material

Video reference

▶︎ How YouTube Recommendations Actually Work (Stanford)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Design Twitter

  • Link: https://leetcode.com/problems/design-twitter/
  • Difficulty: Medium
  • Why this problem: Pulling top-K most-recent items across followed users — same shape as merging candidate sources in recsys retrieval.
  • Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

  • Draw the 4-stage recsys funnel and explain the latency/cost tradeoff at each stage.
  • Describe the two-tower architecture and why dot-product separation enables ANN.
  • Explain sampled softmax + log-Q correction.
  • Pick between pointwise / pairwise / listwise loss for a given scenario.
  • List 3 cold-start strategies for new items.
  • Solve design-twitter — k-way merge of per-user feeds, the recsys retrieval primitive.
