ai · ml · intermediate · 12m · 2026-06-09

MLOps — Experiment Tracking, Model Registry, CI/CD for Models

Session 24 of the 48-session learning series.

Date: Sun, 2026-06-28 · Time: 09:00–11:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 13 · Est. read: 2 h

Why this session matters

This is Session 24 of 48 in the Machine Learning track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

What MLOps actually means — the lifecycle map
Experiment tracking — MLflow, W&B, Neptune; what to log
Model registry — versions, stages, lineage, governance
CI/CD for models — tests, validation gates, canary, shadow
Monitoring in production — drift, performance, alerting, retraining

Pre-read (skim before the session)

Deep dive

1. The MLOps lifecycle

[Data] → [Features] → [Train] → [Eval] → [Register] → [Deploy] → [Monitor] → [Retrain]
   ▲                                                                                  │
   └──────────────────────────────────────────────────────────────────────────────────┘

Each arrow is a place to fail. MLOps is the practice of automating + observing + auditing every arrow, so the loop is reliable enough to put a model in front of paying users.

2. Why this is harder than software DevOps

Software CI/CD: deterministic code → deterministic build → deterministic deploy. Bugs are reproducible.

ML CI/CD: stochastic training → metric-bounded model → deploy with regression risk. Bugs may only appear on a specific data slice 3 weeks later. You need:

Data versioning (DVC, lakeFS, Delta time travel).
Code versioning (git).
Hyperparameter + metric tracking.
Model artefact versioning.
Statistical evaluation gates.
Live monitoring of model behaviour, not just system health.

3. Experiment tracking — what to log

Every training run, log:

Item	Why
Git commit hash	Reproducibility of code
Data version (Delta version / DVC hash)	Reproducibility of data
Hyperparameters	Find best config; rerun
Library versions (`pip freeze`)	Reproducibility of env
Metrics (loss, AUC, F1, per-slice)	Compare runs
Confusion matrix / ROC curves	Debug failures
Sample predictions	Sanity check
Hardware (GPU type, count)	Compare cost/perf
Wall-clock + GPU hours	Cost tracking
Model artefact + signature	Downstream deploy

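import subprocess
import mlflow

# Placeholder values; in practice these come from your training code.
params = {"learning_rate": 0.05, "max_depth": 6}
auc = 0.91
git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
with mlflow.start_run():
    mlflow.log_params(params)           # hyperparameters
    mlflow.log_metric("auc", auc)       # headline metric
    mlflow.log_artifact("model.pkl")    # the file must already exist on disk
    mlflow.set_tag("git_sha", git_sha)  # ties the run to the exact code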
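
Several rows in the table above (git commit, library versions, environment) can be gathered automatically before a run starts. The following is a minimal sketch using only the standard library; the function names and the `data_version` argument are hypothetical, and the caller is assumed to supply the data version (for example a DVC hash or Delta version).

```python
import platform
import subprocess
import sys
from importlib import metadata


def git_sha() -> str:
    """Return the current commit hash, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or "unknown"
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def installed_packages() -> dict:
    """Map installed distribution names to versions (similar in spirit to `pip freeze`)."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:  # skip distributions with unreadable metadata
            continue
        if name:
            packages[name] = version
    return packages


def collect_run_metadata(data_version: str) -> dict:
    """Bundle the reproducibility fields into one dict that a tracker can log."""
    return {
        "git_sha": git_sha(),
        "data_version": data_version,  # supplied by the caller, not discovered here
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": installed_packages(),
    }
```

The resulting dict could be logged as tags or as a JSON artefact with whichever tracker you use; how you attach it is a design choice, not something this sketch prescribes.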

4. Tools

MLflow — open source, runs in any cloud, model registry built-in. Standard choice for self-host.
Weights & Biases (W&B) — SaaS, richer UI, great for deep learning, $$.
Neptune — lightweight, good metadata model.
Vertex AI / SageMaker / Azure ML experiments — cloud-native; bundled with their pipelines.

For a startup: MLflow + Postgres + S3, runs anywhere. Migrate later if needed.

5. Model registry

Centralised catalogue of trained models. Each model has:

Versions (1, 2, 3, ...).
Stages (None → Staging → Production → Archived).
Lineage (which run produced it, on which data).
Metrics snapshot.
Approval (who promoted to Production, when).

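The registry is the single source of truth for "what is in production". Your serving stack pulls the Production version from it, for example via MLflow's MlflowClient().get_latest_versions(name='ranker', stages=['Production']). Recent MLflow releases deprecate stages in favour of model version aliases, so check which mechanism your version supports.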
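
To make the fields listed above concrete (versions, stages, lineage, metrics snapshot, approval), here is a toy in-memory registry. It is a sketch for intuition only: `ToyRegistry`, `ModelVersion` and their methods are hypothetical names, not the API of MLflow or any other product, and the "one Production version at a time" rule is an assumption of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str          # lineage: which training run produced it
    data_version: str    # lineage: which data it was trained on
    metrics: dict        # metrics snapshot at registration time
    stage: str = "None"
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None


class ToyRegistry:
    def __init__(self):
        self._models = {}  # model name -> list of ModelVersion

    def register(self, name, run_id, data_version, metrics):
        versions = self._models.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, run_id, data_version, dict(metrics))
        versions.append(mv)
        return mv

    def promote(self, name, version, stage, approver):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._models[name][version - 1]
        if stage == "Production":
            # Assumption: only one live Production version, so archive the old one.
            for mv in self._models[name]:
                if mv.stage == "Production":
                    mv.stage = "Archived"
        target.stage = stage
        target.approved_by = approver
        target.approved_at = datetime.now(timezone.utc)
        return target

    def production(self, name):
        for mv in self._models.get(name, []):
            if mv.stage == "Production":
                return mv
        return None
```

A serving stack would call something like `production("ranker")` at start-up and on refresh; recording `approved_by` and `approved_at` at promotion time is what makes the later audit trail possible.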

6. CI/CD pipeline

git push (training code)
   ↓
[ CI: lint + unit tests + smoke train ]
   ↓
[ Trigger full training job ]
   ↓
[ Eval gates: AUC > 0.85, latency < 50ms, slice perf OK ]
   ↓
[ Register new model version ]
   ↓
[ Deploy to Staging endpoint ]
   ↓
[ Shadow test against Production for N hours ]
   ↓
[ Promote to Production: canary 1% → 10% → 100% ]
   ↓
[ Monitor; auto-rollback if metrics regress ]

Eval gates are non-negotiable. The gate is what stops a worse model from shipping. Common gates:

Headline metric ≥ current production - epsilon.
No slice regresses by > X%.
Inference latency p99 within budget.
Bias / fairness metrics within bounds.
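
The four common gates above can be expressed as one function that returns a pass/fail result per gate. This is a sketch under stated assumptions: the dictionary keys (`auc`, `slice_auc`, `p99_ms`, `fairness_gap`) and the default thresholds are illustrative, the slice check uses an absolute drop in AUC rather than a percentage, and production is assumed to report at least one slice.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def run_gates(candidate, production, *, epsilon=0.002, max_slice_drop=0.01,
              p99_budget_ms=50.0, max_fairness_gap=0.05):
    """Return one GateResult per gate; the candidate ships only if all pass."""
    results = []

    # Gate 1: headline metric >= current production - epsilon.
    ok = candidate["auc"] >= production["auc"] - epsilon
    results.append(GateResult(
        "headline", ok, f"{candidate['auc']:.3f} vs {production['auc']:.3f}"))

    # Gate 2: no slice regresses by more than max_slice_drop (absolute AUC).
    # A slice missing from the candidate counts as a full regression.
    drops = {
        s: production["slice_auc"][s] - candidate["slice_auc"].get(s, 0.0)
        for s in production["slice_auc"]
    }
    worst = max(drops, key=drops.get)
    results.append(GateResult(
        "slices", drops[worst] <= max_slice_drop,
        f"worst slice {worst}: drop {drops[worst]:.3f}"))

    # Gate 3: inference latency p99 within budget.
    results.append(GateResult(
        "latency", candidate["p99_ms"] <= p99_budget_ms,
        f"p99 {candidate['p99_ms']:.1f} ms"))

    # Gate 4: fairness metric within bounds.
    results.append(GateResult(
        "fairness", candidate["fairness_gap"] <= max_fairness_gap,
        f"gap {candidate['fairness_gap']:.3f}"))

    return results
```

In a pipeline, the register step would run only when `all(r.passed for r in results)` holds, and the `detail` strings would go into the run log so a rejected candidate can be diagnosed later.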

7. Validation gates — the hard part

The naïve gate (new_auc > old_auc) misses:

Slice regression — overall AUC up 0.5%, but it tanked on the high-value user segment by 5%.
Stratified eval — perf by country, device, age group; not just global.
Counterfactual — does the new model behave reasonably on edge cases (zero history, fresh signup, abusive content)?
Calibration — predicted probabilities match actual rates?

Maintain a fixed eval suite (think "test set with personality"), version it, run every candidate.
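
Of the checks above, calibration is the one most easily reduced to a single number. One common summary is expected calibration error (ECE): bucket predictions by predicted probability, then take the weighted average gap between mean predicted probability and observed positive rate in each bucket. The sketch below assumes binary labels in {0, 1}, probabilities in [0, 1], and equal-width bins; the function name and the default of 10 bins are choices made for this example.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean |mean predicted prob - observed rate| over equal-width bins."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_p - rate)
    return ece
```

A candidate whose AUC improves while its ECE grows may rank well but produce probabilities that downstream thresholds can no longer trust, which is why calibration belongs in the fixed eval suite next to the per-slice metrics.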

8. Deployment patterns

Blue/green — deploy new alongside old; instantly switch traffic; instant rollback.
Canary — small % to new, monitor, ramp up. Catches issues without full blast radius.
Shadow — mirror 100% of traffic to both models; compare predictions; serve only the old model's output. No user-facing risk, but unlike an A/B test it cannot measure business outcomes.
A/B test — split users; measure business outcome (CTR, revenue, retention). The only honest answer to "is this model better?".

Most teams: canary for tech metrics → A/B for business metrics.
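
A canary ramp needs a routing rule that is deterministic per user, so that someone placed in the 1% group stays on the new model as the ramp widens to 10% and 100%. A minimal sketch using a salted hash follows; the salt string, the basis-point resolution, and the function names are assumptions for illustration.

```python
import hashlib

BUCKETS = 10_000  # resolution of the split: one bucket = 0.01% of users


def bucket(user_id: str, salt: str = "ranker-canary") -> int:
    """Map a user id to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(user_id: str, canary_fraction: float) -> str:
    """Send the lowest buckets to the candidate; everyone else stays on production."""
    cutoff = round(canary_fraction * BUCKETS)
    return "candidate" if bucket(user_id) < cutoff else "production"
```

Because the cutoff only ever moves upward during a ramp, the set of users on the candidate at 1% is a subset of the set at 10%. Changing the salt reshuffles the assignment, which is useful when a new rollout should not reuse the previous canary cohort.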

9. Monitoring in production

Four layers:

System — latency, error rate, saturation. Same as any service.
Model inputs — feature distributions vs training. PSI, KS test. Drift alerts.
Model outputs — prediction distribution, confidence, calibration.
Outcomes (when label arrives) — actual accuracy, business KPIs.

Latency from prediction to label is the hardest part. Some labels are instant (CTR); some take weeks (fraud, refund). You need an eventual eval loop, separate from the immediate monitoring.
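
For the model-inputs layer, the population stability index (PSI) compares the binned distribution of a feature in production against the training distribution: PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the expected (training) and actual (live) fractions in bin i. The sketch below is one simple way to compute it, assuming equal-width bins derived from the training data and a small floor `eps` to avoid division by zero; both are choices made for this example.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(idx, n_bins - 1))  # clip values outside the training range
            counts[idx] += 1
        # Floor each fraction at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical distributions give a PSI of 0, and larger values mean a larger shift. Alert thresholds such as 0.1 or 0.2 are rules of thumb rather than standards, so tune them per feature against historical data before wiring them to a pager.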

10. Retraining triggers

Scheduled — every week / day / hour. Simplest, and often sufficient.
Drift-triggered — PSI > threshold → trigger retrain.
Performance-triggered — labelled-batch metric drops below SLA → retrain.
Manual — for major changes.

Always have a rollback path before automating retraining-and-deploy. The first auto-retrain bug ships a worse model; you want to undo in one click.
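
The automatic triggers above, plus the rollback precondition, fit in a few lines. The following sketch assumes a weekly schedule, a PSI threshold of 0.2 and an AUC SLA of 0.85; all three defaults and both function names are illustrative, not recommendations.

```python
from datetime import datetime, timedelta


def retrain_reasons(now, last_trained, psi_value, labelled_auc, *,
                    max_age=timedelta(days=7), psi_threshold=0.2, auc_sla=0.85):
    """Return the list of triggers that currently fire (empty list = do nothing)."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("scheduled")
    if psi_value > psi_threshold:
        reasons.append("drift")
    # Labels may not have arrived yet; skip the check when no metric is available.
    if labelled_auc is not None and labelled_auc < auc_sla:
        reasons.append("performance")
    return reasons


def may_auto_deploy(reasons, rollback_version):
    """Only allow retrain-and-deploy when a known-good version exists to roll back to."""
    return bool(reasons) and rollback_version is not None
```

Returning the reasons rather than a bare boolean keeps an audit trail of why each retrain happened, which helps when a drift-triggered retrain turns out to have chased noise.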

11. Feature store consistency (preview of S38)

The classic ML production bug: training/serving skew. Feature computed differently online vs offline → model trained on one distribution, sees another. Mitigate with:

Single source of truth for feature logic (feature store).
Logging online features → reuse for retraining.
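
Both mitigations can be illustrated together: define the feature once in a function that training and serving both import, and periodically compare the values logged online with an offline recomputation. The feature `days_since_signup`, the request-id keyed dictionaries, and the tolerance are hypothetical choices for this sketch.

```python
import math

SECONDS_PER_DAY = 86_400


def days_since_signup(signup_ts: float, event_ts: float) -> float:
    """Single definition of the feature, shared by the training job and the server."""
    return max(0.0, (event_ts - signup_ts) / SECONDS_PER_DAY)


def skew_report(online_rows, offline_rows, tol=1e-6):
    """List (request_id, feature, online_value, offline_value) for every mismatch.

    Both arguments map request id -> {feature name: value}.
    """
    mismatches = []
    for request_id, online in online_rows.items():
        offline = offline_rows.get(request_id)
        if offline is None:
            continue  # no offline recomputation for this request yet
        for feature, value in online.items():
            if feature in offline and not math.isclose(value, offline[feature], abs_tol=tol):
                mismatches.append((request_id, feature, value, offline[feature]))
    return mismatches
```

A non-empty report usually points at a timezone, null-handling or join-timing difference between the two code paths, and it is much cheaper to catch here than through a drop in live metrics.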

12. Governance and audit

For regulated industries:

Every prediction traceable to a model version + input features + score.
Every model traceable to training data + code + approver.
Right to explanation (GDPR, AI Act).
Periodic bias audits.

Bundle with the registry; this isn't optional in finance, healthcare, lending.
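
Prediction-level traceability usually comes down to writing one structured record per prediction. A minimal sketch of such a record follows; the field names and the choice of JSON lines as the storage format are assumptions, and a real system would also need retention and access controls that are out of scope here.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(model_name, model_version, features, score, request_id):
    """Build one audit entry tying a prediction to its model version and inputs."""
    # Canonical JSON (sorted keys) so the hash does not depend on dict ordering.
    canonical = json.dumps(features, sort_keys=True)
    return {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": model_name,
        "model_version": model_version,
        "features": features,
        "features_sha256": hashlib.sha256(canonical.encode()).hexdigest(),
        "score": score,
    }


def to_json_line(record) -> str:
    """Serialise a record as one line, suitable for an append-only log."""
    return json.dumps(record, sort_keys=True)
```

Joining these records with the registry entry for `model_version` gives the full chain the list above asks for: prediction to model, and model to training data, code and approver.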

13. Reality check

Most teams don't need a 10-tool platform. A minimal viable MLOps:

Git for code.
MLflow for tracking + registry.
A pipeline orchestrator (Airflow, Prefect, Dagster).
A serving stack (BentoML, Triton, FastAPI).
Prometheus + Grafana for monitoring.
A dashboard / notebook for drift + perf.

You can ship that in 2 weeks. Scale the platform when team size or model count demands it, not before.

Link: https://leetcode.com/problems/design-in-memory-file-system/
Difficulty: Hard
Why this problem: Tree of inodes with create/read/ls/cat — same vocabulary as a model artefact registry.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

List 8 items every training run should log.
Describe the model registry lifecycle (None → Staging → Production → Archived).
Design 4 validation gates that block bad models from production.
Compare canary, shadow, and A/B test deployments.
List 3 retraining triggers and the risks of each.
Solve design-in-memory-file-system — directory tree + file blobs; mirrors a model artefact registry.

