ai · ml · intermediate · 12m · 2026-06-09

Online Learning, Bandits, Counterfactual Evaluation

Session 43 of the 48-session learning series.

Date: Sun, 2026-07-12 · Time: 14:30–16:30 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 18 · Est. read: 2 h

Why this session matters

Most ML systems retrain weekly and live with that staleness. Frontier applications (recsys, ranking, ad bidding, dynamic pricing) need to learn and adapt within minutes, or to estimate what would have happened had the system decided differently. Bandits and counterfactual evaluation are the toolkit for both.

Agenda

Online learning — incremental training; SGD on a stream
Multi-armed bandits — ε-greedy, UCB, Thompson sampling
Contextual bandits — when the right action depends on context
Counterfactual / off-policy evaluation — IPS, doubly robust
Practical deployment — exploration cost, safety nets, monitoring

Pre-read (skim before the session)

Deep dive

1. Why most ML is "offline" and what's wrong with that

Typical lifecycle: collect data → train batch model → deploy → repeat next week.

Problems:

Stale to new content / new users.
Cold-start for new items takes a full cycle.
Slow to react to distribution shift (trend, season, breaking news).

Online learning: the model updates as data flows in. It is more adaptable and responsive, but harder to operate.

2. Online vs incremental vs full retrain

Full retrain — train from scratch on full historical data. Most accurate; slowest.
Incremental / warm-start — start from last model; train on new data only. Faster; may drift over time.
Streaming / online — update parameters per-example or mini-batch as data arrives. Real-time; risks instability.

Most production ML is full + incremental. Streaming is for the cases where minutes-of-freshness actually move the needle.
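As a minimal sketch of the streaming case, the following updates a logistic regression one example at a time with plain SGD. The class name, the synthetic stream, and the learning rate are illustrative assumptions, not part of any particular library.

```python
import math
import random


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with plain SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Single SGD step on the log-loss for one (x, y) pair, y in {0, 1}."""
        error = self.predict_proba(x) - y  # gradient of log-loss w.r.t. z
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * error * xi
        self.b -= self.lr * error


def synthetic_stream(n, seed=0):
    """Hypothetical stream: label is 1 when x0 > x1, features in [-1, 1]."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 1 if x[0] > x[1] else 0


model = OnlineLogisticRegression(n_features=2)
correct = 0
for t, (x, y) in enumerate(synthetic_stream(5000)):
    # Test-then-train: score the example first, then learn from it.
    if t >= 4000:
        correct += int((model.predict_proba(x) > 0.5) == bool(y))
    model.learn_one(x, y)

late_accuracy = correct / 1000  # accuracy over the last 1000 examples
```

Scoring each example before learning from it gives an honest running accuracy without a separate hold-out set, which is one common way to monitor a streaming learner.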

3. Tools for online learning

Vowpal Wabbit (VW): the canonical streaming learner; built for very high throughput on large, sparse datasets.
River: Python streaming-ML library; clean API, slower than VW.
TensorFlow Extended (TFX): supports incremental (warm-start) training inside a pipeline.
Custom: SGD with checkpoint + warm-start; works fine for most cases.
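The custom option can be sketched roughly as follows: load the last checkpoint, run SGD over the new data only, and save again. The JSON checkpoint layout, the function names, and the toy least-squares model are assumptions made for illustration.

```python
import json
import os
import tempfile


def sgd_epoch(weights, batch, lr=0.05):
    """One pass of SGD for least-squares linear regression (no intercept)."""
    w = list(weights)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def save_checkpoint(path, weights, n_seen):
    with open(path, "w") as f:
        json.dump({"weights": weights, "n_seen": n_seen}, f)


def load_checkpoint(path, n_features):
    """Return saved state, or a cold start if no checkpoint exists yet."""
    if not os.path.exists(path):
        return [0.0] * n_features, 0
    with open(path) as f:
        state = json.load(f)
    return state["weights"], state["n_seen"]


def incremental_update(path, new_batch, n_features):
    weights, n_seen = load_checkpoint(path, n_features)  # warm start
    weights = sgd_epoch(weights, new_batch)              # new data only
    save_checkpoint(path, weights, n_seen + len(new_batch))
    return weights


# Two hypothetical "days" of data, both drawn from y = 2*x0 - 1*x1.
day1 = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)] * 50
day2 = [([1.0, 1.0], 1.0), ([0.5, -0.5], 1.5)] * 50

ckpt = os.path.join(tempfile.mkdtemp(), "model.json")
w_day1 = incremental_update(ckpt, day1, n_features=2)
w_day2 = incremental_update(ckpt, day2, n_features=2)
```

Because day 2 starts from the day 1 weights, it refines the estimate instead of relearning it. The same structure is where drift can creep in if the new data is unrepresentative.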

4. The multi-armed bandit problem

You have K arms (actions, items, recommendations). Each has unknown reward distribution. Pull arms over time to maximise total reward. Exploration (try arms you're uncertain about) vs exploitation (pick the best one so far).

Algorithm	How it works	Pros / Cons
ε-greedy	pick the best arm so far with probability 1 − ε; a random arm with probability ε	simple; keeps wasting pulls on clearly bad arms
UCB	upper confidence bound: pull the arm with the highest "optimistic" estimate	provable logarithmic regret bounds; assumes stationary rewards
Thompson Sampling	sample a reward estimate per arm from its posterior; pull the arg-max	strong empirical performance; Bayesian; easy to implement

For real systems: Thompson Sampling is the modern default. Easy to extend to contextual.
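The three policies in the table can be sketched side by side on simulated Bernoulli arms. This is a toy comparison under assumed success rates; the UCB variant shown is UCB1, and the Thompson variant uses a Beta(1, 1) prior, both chosen here for simplicity.

```python
import math
import random


def epsilon_greedy(counts, sums, rng, epsilon=0.1):
    """Explore uniformly with probability epsilon, else pick the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return max(range(len(counts)), key=means.__getitem__)


def ucb1(counts, sums, rng):
    """UCB1: empirical mean plus a bonus that shrinks as an arm is pulled."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # pull every arm once first
    total = sum(counts)
    scores = [s / c + math.sqrt(2 * math.log(total) / c)
              for s, c in zip(sums, counts)]
    return max(range(len(counts)), key=scores.__getitem__)


def thompson(counts, sums, rng):
    """Beta-Bernoulli Thompson Sampling with a uniform Beta(1, 1) prior."""
    samples = [rng.betavariate(1 + s, 1 + c - s) for s, c in zip(sums, counts)]
    return max(range(len(counts)), key=samples.__getitem__)


def run(policy, true_rates, n_rounds=5000, seed=0):
    """Simulate Bernoulli arms and return (total_reward, pulls per arm)."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)  # pulls per arm
    sums = [0] * len(true_rates)    # successes per arm
    total = 0
    for _ in range(n_rounds):
        arm = policy(counts, sums, rng)
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts


true_rates = [0.2, 0.5, 0.8]  # hypothetical success rates, unknown to the policies
results = {p.__name__: run(p, true_rates)
           for p in (epsilon_greedy, ucb1, thompson)}
```

Inspecting `results` shows how each policy splits its pulls: ε-greedy keeps spending a fixed share on random arms, while UCB1 and Thompson Sampling reduce exploration as evidence accumulates.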

5. Contextual bandits

Instead of K fixed arms with fixed rewards, you observe a context x and choose an action a; reward r ~ p(r | x, a).

Examples:

News headline ranking: context = user features; action = headline; reward = click.
Email subject-line selection: context = recipient features; action = subject variant; reward = open.
Ad selection: context = page + user; action = creative; reward = revenue.

Algorithms:

LinUCB — linear model + UCB.
Thompson Sampling with linear regression — Bayesian linear with posterior sampling.
Neural bandits — DNN with output uncertainty estimate.

VW reduces most of this to a one-liner; do not implement it yourself unless you must.
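For understanding rather than production use, here is a small sketch of disjoint LinUCB (one ridge regression per arm plus a confidence bonus). The two-segment environment, the click rates, and alpha = 1.0 are assumptions; the inverse matrix is maintained with a Sherman-Morrison rank-one update to avoid needing a linear-algebra library.

```python
import math
import random


class LinUCBArm:
    """One arm of disjoint LinUCB: ridge regression plus a confidence bonus."""

    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        # A starts as the identity, so its inverse is the identity too.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(d)]
                      for i in range(d)]
        self.b = [0.0] * d

    def _A_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def score(self, x):
        theta = self._A_inv_times(self.b)  # ridge estimate of the weights
        mean = sum(t * xi for t, xi in zip(theta, x))
        Ax = self._A_inv_times(x)
        width = math.sqrt(sum(xi * a for xi, a in zip(x, Ax)))
        return mean + self.alpha * width   # optimistic estimate

    def update(self, x, reward):
        # Sherman-Morrison update of A_inv after A += x x^T.
        Ax = self._A_inv_times(x)
        denom = 1.0 + sum(xi * a for xi, a in zip(x, Ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def choose(arms, x):
    scores = [arm.score(x) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


rng = random.Random(0)
arms = [LinUCBArm(d=2), LinUCBArm(d=2)]
click_rate = [[0.8, 0.2], [0.2, 0.8]]  # hypothetical click_rate[arm][segment]
best_picks = 0
n_rounds = 2000
for _ in range(n_rounds):
    segment = rng.randrange(2)
    x = [1.0, 0.0] if segment == 0 else [0.0, 1.0]  # one-hot user segment
    a = choose(arms, x)
    reward = 1.0 if rng.random() < click_rate[a][segment] else 0.0
    arms[a].update(x, reward)
    best_picks += int(a == segment)  # arm i is the better arm for segment i

best_share = best_picks / n_rounds
```

The point of the example is that the best arm differs by context: a non-contextual bandit would average the two segments and see both arms as equally good.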

6. The cost of exploration

Every exploration shows a user something the model thinks isn't best. Costs:

Lost short-term revenue / engagement.
Bad UX for the user shown the "exploration" pick.
Bias in your collected data toward exploration arms.

Mitigations:

Cap exploration rate (e.g. 5% of traffic).
Apply only to suitable surfaces (not the lead slot).
Explore more for new items (Bayesian uncertainty), less for known.
Constrain exploration to a "safe" candidate set.
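Two of the mitigations above, a capped exploration rate and a "safe" candidate set, can be combined in a few lines. This sketch also returns the probability with which the action was chosen, since that value is what off-policy evaluation needs later. The function name, scores, and the 5% cap are illustrative assumptions.

```python
import random


def choose_with_capped_exploration(scores, safe_set, rng, explore_rate=0.05):
    """Return (action, propensity).

    scores:   dict mapping action -> model score (higher is better)
    safe_set: actions that are acceptable to show when exploring
    """
    best = max(scores, key=scores.get)
    candidates = sorted(safe_set)
    if candidates and rng.random() < explore_rate:
        action = rng.choice(candidates)  # explore, but only inside the safe set
    else:
        action = best                    # exploit
    # Probability this policy had of picking `action`.
    propensity = 0.0
    if candidates and action in safe_set:
        propensity += explore_rate / len(candidates)
    if action == best:
        propensity += 1.0 - (explore_rate if candidates else 0.0)
    return action, propensity


rng = random.Random(0)
scores = {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.1}  # hypothetical model scores
safe = {"b", "c"}                                   # "d" is never explored
draws = [choose_with_capped_exploration(scores, safe, rng)
         for _ in range(20_000)]
explored_share = sum(1 for action, _ in draws if action != "a") / len(draws)
```

Note the trade-off: any action outside the safe set that is not the current best (here "d") has propensity zero, so logged data from this policy cannot be used to evaluate a new policy that would choose it.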

7. Counterfactual / off-policy evaluation

You logged: state → action → reward, using policy π_0 (current production). You want: estimate the reward of policy π_1 (a new model), without rolling it out.

Inverse Propensity Scoring (IPS):

estimate(π_1) = mean( r * π_1(a|x) / π_0(a|x) )

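This weights each historical reward by how much more (or less) likely the new policy is to choose the logged action than the logging policy was. The estimate is unbiased if π_0 has support everywhere π_1 does, that is, π_0(a|x) > 0 whenever π_1(a|x) > 0, which in practice rules out deterministic logging policies.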
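The formula above translates almost directly into code. The sketch below assumes each log record is a tuple (context, action, reward, logging probability) and that the target policy is a function returning π_1(a|x); the four-record log is a made-up toy example with no context.

```python
def ips_estimate(logs, target_policy):
    """Mean of r * pi_1(a|x) / pi_0(a|x) over logged (x, a, r, p0) tuples."""
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_policy(context, action) / logging_prob
        total += reward * weight
    return total / len(logs)


# Toy logs: pi_0 picked "A" or "B" uniformly (propensity 0.5 each).
logs = [
    (None, "A", 1.0, 0.5),
    (None, "A", 0.0, 0.5),
    (None, "B", 1.0, 0.5),
    (None, "B", 1.0, 0.5),
]


def pi_1(context, action):
    """Hypothetical new policy: 20% "A", 80% "B", ignoring context."""
    return {"A": 0.2, "B": 0.8}[action]


estimate = ips_estimate(logs, pi_1)
```

In these logs "A" pays off half the time and "B" every time, so a policy playing them 20/80 should earn about 0.2 × 0.5 + 0.8 × 1.0 = 0.9 per decision, which is what the estimator returns up to floating-point error.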

Problems:

High variance for rare actions (large weights blow up).
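Brittle if the logged policy was deterministic (π_0(a|x) = 1 for one action, 0 for others): any action π_1 prefers that π_0 never took has no logged rewards, so the support condition fails and the estimate is biased.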

Fixes:

Clip weights (cap at some max).
Doubly Robust: combine IPS with a regression-based reward model; the estimate stays consistent if either the propensities or the reward model is correct.
Direct method: pure regression on logged rewards; biased when the model is misspecified, but low-variance.
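The three fixes can be compared on a toy log. This sketch assumes log records of the form (context, action, reward, logging probability), a deliberately poor reward model that always predicts 0.5, and an aggressive clip at 1.0 so the effect is visible; all of these are illustrative choices.

```python
def clipped_ips(logs, target_policy, max_weight=10.0):
    """IPS with importance weights capped at max_weight."""
    total = 0.0
    for x, a, r, p0 in logs:
        w = min(target_policy(x, a) / p0, max_weight)
        total += r * w
    return total / len(logs)


def direct_method(logs, target_policy, reward_model, actions):
    """Average the reward model's prediction under the target policy."""
    total = 0.0
    for x, _a, _r, _p0 in logs:
        total += sum(target_policy(x, b) * reward_model(x, b) for b in actions)
    return total / len(logs)


def doubly_robust(logs, target_policy, reward_model, actions):
    """Direct-method baseline plus an IPS correction on the model residual."""
    total = 0.0
    for x, a, r, p0 in logs:
        baseline = sum(target_policy(x, b) * reward_model(x, b)
                       for b in actions)
        correction = target_policy(x, a) / p0 * (r - reward_model(x, a))
        total += baseline + correction
    return total / len(logs)


# Toy logs from a uniform logging policy over two actions.
logs = [(None, "A", 1.0, 0.5), (None, "A", 0.0, 0.5),
        (None, "B", 1.0, 0.5), (None, "B", 1.0, 0.5)]
actions = ["A", "B"]


def pi_1(x, a):
    return {"A": 0.2, "B": 0.8}[a]  # hypothetical new policy


def bad_model(x, a):
    return 0.5                      # deliberately uninformative reward model


clipped = clipped_ips(logs, pi_1, max_weight=1.0)
dm = direct_method(logs, pi_1, bad_model, actions)
dr = doubly_robust(logs, pi_1, bad_model, actions)
```

With correct propensities, the doubly robust estimate recovers roughly 0.9 (the value plain IPS gives on these logs) even though the reward model is wrong. The direct method inherits the model's error and returns 0.5, and the clipped estimate comes out lower at about 0.6, showing the bias that clipping trades for lower variance.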

8. Logging requirements

To do off-policy evaluation, your production system must log:

The features / context.
The action taken.
The probability the production policy chose that action (propensity).
The reward observed.

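Without logged propensities you cannot compute IPS directly; you would have to estimate them after the fact, which adds error, so your historical data is far less useful for counterfactual evaluation. Add propensity logging before you need it.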
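One possible shape for such a log record is sketched below. The field names and the JSON-lines serialisation are assumptions for illustration, not a standard schema; the one non-negotiable part is that the propensity is recorded at decision time.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DecisionLog:
    """One logged decision; field names here are illustrative."""
    request_id: str
    context: dict
    action: str
    propensity: float               # probability the serving policy gave this action
    reward: Optional[float] = None  # filled in later, once the outcome is known

    def __post_init__(self):
        # A logged action must have had a non-zero chance of being chosen.
        if not 0.0 < self.propensity <= 1.0:
            raise ValueError("propensity must be in (0, 1]")


record = DecisionLog(
    request_id="req-1",
    context={"country": "IN", "hour": 14},
    action="headline_7",
    propensity=0.25,
)
record.reward = 1.0  # joined in when the click arrives
line = json.dumps(asdict(record), sort_keys=True)
```

Validating the propensity at write time catches the most damaging logging bug early: a record with a missing or zero propensity cannot be reweighted later.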

9. Safety and guardrails

When deploying online learners:

Action whitelist — model can only choose from approved items.
Reward sanity check — clip extreme rewards (one viral post outlier can dominate).
Rate limit changes — parameters can only move X% per hour.
Fall-back model — if online model goes haywire, switch back to last stable batch.
Shadow mode — run new online learner in shadow; compare predictions before serving.

Many publicised "AI gone wrong" incidents trace back to missing safety nets like these.
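Several of these guardrails are small enough to sketch directly. The function names, the 5% step limit, and the health flag are assumptions; how "healthy" is determined would depend on your own monitoring.

```python
def clip_reward(reward, low=0.0, high=1.0):
    """Bound a reward so one outlier cannot dominate an update."""
    return max(low, min(high, reward))


def rate_limited_update(current, proposed, max_rel_change=0.05, eps=1e-3):
    """Move each parameter toward its proposed value by at most
    max_rel_change of its current magnitude (eps lets zeros still move)."""
    limited = []
    for c, p in zip(current, proposed):
        max_step = max_rel_change * max(abs(c), eps)
        step = max(-max_step, min(max_step, p - c))
        limited.append(c + step)
    return limited


def choose_action(online_scores, fallback_scores, approved_actions,
                  online_healthy):
    """Pick the top approved action, using the fallback model if unhealthy."""
    scores = online_scores if online_healthy else fallback_scores
    allowed = {a: s for a, s in scores.items() if a in approved_actions}
    if not allowed:
        raise ValueError("no approved action available")
    return max(allowed, key=allowed.get)


clipped = clip_reward(250.0)  # a viral outlier is capped at 1.0
new_params = rate_limited_update([1.0, -2.0], [5.0, -2.01])
picked = choose_action(
    online_scores={"x": 0.9, "y": 0.4, "z": 0.7},
    fallback_scores={"x": 0.1, "y": 0.6, "z": 0.5},
    approved_actions={"y", "z"},
    online_healthy=False,
)
```

In the example the first parameter wanted to jump from 1.0 to 5.0 but is held to a 5% move, while the second proposed change was already within the limit and goes through unchanged.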

10. Non-stationarity

The real world drifts. Trends change, fashion changes, users grow up.

Handling:

Time-decay — weight recent examples more.
Sliding window — only train on last X days.
Concept drift detection — track residual; retrain when distribution shifts.
Cold start re-exploration — periodically re-explore "dead" arms; user preferences change.
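Time-decay and re-exploration can be combined in one mechanism: a Thompson sampler whose evidence shrinks every round. The decay factor of 0.99 and the environment that flips halfway through are assumptions chosen to make the effect visible.

```python
import random


class DiscountedThompson:
    """Beta-Bernoulli Thompson Sampling whose evidence decays each round."""

    def __init__(self, n_arms, decay=0.99, seed=0):
        self.decay = decay
        self.successes = [0.0] * n_arms
        self.failures = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        samples = [self.rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Shrink all evidence so old observations fade; arms that have not
        # been pulled recently regain uncertainty and get re-explored.
        self.successes = [s * self.decay for s in self.successes]
        self.failures = [f * self.decay for f in self.failures]
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward


def simulate(n_rounds=4000, switch_at=2000, seed=1):
    """Two arms whose success rates swap at round `switch_at`."""
    env = random.Random(seed)
    bandit = DiscountedThompson(n_arms=2, decay=0.99)
    picks = []
    for t in range(n_rounds):
        rates = [0.8, 0.2] if t < switch_at else [0.2, 0.8]
        arm = bandit.select()
        reward = 1 if env.random() < rates[arm] else 0
        bandit.update(arm, reward)
        picks.append(arm)
    return picks


picks = simulate()
share_arm0_before = picks[1500:2000].count(0) / 500  # before the swap
share_arm1_after = picks[3500:4000].count(1) / 500   # well after the swap
```

A decay of 0.99 gives an effective memory of roughly 100 observations. Smaller values react faster to drift but pay for it with more ongoing exploration.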

11. A/B vs bandit

When you have 2 fixed options and want statistical significance → A/B test. When you have many options and want to maximise reward while learning → bandit.

A/B gives you a confidence interval on the difference. A bandit shifts traffic toward the best-performing option as it learns, but does not give you a significance statement about which option is better.
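As a worked example of the A/B side, the sketch below computes a normal-approximation (Wald) confidence interval for the difference between two conversion rates. The traffic numbers are made up, and the Wald interval is only one of several options; it is reasonable when both samples are large.

```python
import math


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Wald interval for rate_b - rate_a; z = 1.96 gives roughly 95%."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical test: 5.0% vs 5.8% conversion, 10,000 users per variant.
low, high = diff_confidence_interval(conv_a=500, n_a=10_000,
                                     conv_b=580, n_b=10_000)
significant = low > 0 or high < 0  # the interval excludes zero
```

Here the interval runs from about 0.0017 to about 0.0143, so it excludes zero and variant B's lift is significant at the 95% level under this approximation.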

12. Reality check

A pragmatic online-learning stack:

Batch-trained baseline (LightGBM, neural ranker) — gold standard.
Contextual bandit (VW or Thompson Sampling) on top of baseline predictions — explore variants.
Always log propensities.
Off-policy eval with doubly-robust before any new policy ships.
Guardrails: whitelists, clipping, fallbacks.

Bandits are not a replacement for ML; they are a smart exploration policy on top. Teams that miss this distinction often end up with a worse system than plain offline ML.

Link: https://leetcode.com/problems/random-pick-with-weight/
Difficulty: Medium
Why this problem: Weighted sampling is the primitive Thompson Sampling and probability-matching bandit policies rely on.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Pick between batch retrain, incremental, and streaming for a given freshness need.
Explain ε-greedy, UCB, Thompson Sampling and pick one per scenario.
Define contextual bandits and name 3 production use-cases.
Compute IPS estimate of a new policy from logged data with propensities.
List 4 safety guardrails for an online-learning system.
Solve random-pick-with-weight — prefix-sum + binary search, the weighted-sampling primitive.
