MLOps — Experiment Tracking, Model Registry, CI/CD for Models
Session 24 of the 48-session learning series.
Date: Sun, 2026-06-28 · Time: 09:00–11:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 13 · Est. read: 2 h
Why this session matters
This is Session 24 of 48 in the Machine Learning track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- What MLOps actually means — the lifecycle map
- Experiment tracking — MLflow, W&B, Neptune; what to log
- Model registry — versions, stages, lineage, governance
- CI/CD for models — tests, validation gates, canary, shadow
- Monitoring in production — drift, performance, alerting, retraining
Pre-read (skim before the session)
- Hidden Technical Debt in ML Systems (Sculley et al., 2015)
- MLflow — Concepts overview
- ML Test Score (Breck et al., 2017)
- Chip Huyen — Designing ML Systems (book)
Deep dive
1. The MLOps lifecycle
[Data] → [Features] → [Train] → [Eval] → [Register] → [Deploy] → [Monitor] → [Retrain]
  ▲                                                                              │
  └──────────────────────────────────────────────────────────────────────────────┘
Each arrow is a place to fail. MLOps is the practice of automating + observing + auditing every arrow, so the loop is reliable enough to put a model in front of paying users.
2. Why this is harder than software DevOps
Software CI/CD: deterministic code → deterministic build → deterministic deploy. Bugs are reproducible.
ML CI/CD: stochastic training → metric-bounded model → deploy with regression risk. Bugs may only appear on a specific data slice 3 weeks later. You need:
- Data versioning (DVC, lakeFS, Delta time travel).
- Code versioning (git).
- Hyperparameter + metric tracking.
- Model artefacts versioning.
- Statistical evaluation gates.
- Live monitoring of model behaviour, not just system health.
3. Experiment tracking — what to log
Every training run, log:
| Item | Why |
|---|---|
| Git commit hash | Reproducibility of code |
| Data version (Delta version / DVC hash) | Reproducibility of data |
| Hyperparameters | Find best config; rerun |
| Library versions (pip freeze) | Reproducibility of env |
| Metrics (loss, AUC, F1, per-slice) | Compare runs |
| Confusion matrix / ROC curves | Debug failures |
| Sample predictions | Sanity check |
| Hardware (GPU type, count) | Compare cost/perf |
| Wall-clock + GPU hours | Cost tracking |
| Model artefact + signature | Downstream deploy |
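import mlflow

# params, auc, and git_sha are assumed to come from your training script.
with mlflow.start_run():
    mlflow.log_params(params)            # hyperparameters as a dict
    mlflow.log_metric("auc", auc)        # one scalar metric per call
    mlflow.log_artifact("model.pkl")     # path to a local file
    mlflow.set_tag("git_sha", git_sha)   # ties the run to a commit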
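Several rows of the table above (commit hash, library versions, environment) can be collected automatically at the start of a run. The following is a minimal standard-library sketch, not part of MLflow; the function names `git_sha`, `library_versions`, and `run_context` are hypothetical, and it assumes the training script runs inside a git checkout (it falls back to "unknown" otherwise).

```python
import platform
import subprocess
import sys
from importlib import metadata


def git_sha() -> str:
    """Return the current commit hash, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or "unknown"
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def library_versions() -> dict:
    """Installed distributions and their versions, similar in spirit to pip freeze."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            versions[name] = dist.metadata.get("Version")
    return versions


def run_context(params: dict, data_version: str) -> dict:
    """Bundle the reproducibility items for one training run into a plain dict."""
    return {
        "git_sha": git_sha(),
        "data_version": data_version,   # e.g. a Delta version or DVC hash
        "params": params,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "libraries": library_versions(),
    }
```

The resulting dict is JSON-serialisable, so it could be attached to a run as a single artefact (for example with `mlflow.log_dict`) rather than logged item by item.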
4. Tools
- MLflow — open source, runs in any cloud, model registry built-in. Standard choice for self-host.
- Weights & Biases (W&B) — SaaS, richer UI, great for deep learning, $$.
- Neptune — lightweight, good metadata model.
- Vertex AI / SageMaker / Azure ML experiments — cloud-native; bundled with their pipelines.
For a startup: MLflow + Postgres + S3, runs anywhere. Migrate later if needed.
5. Model registry
Centralised catalogue of trained models. Each model has:
- Versions (1, 2, 3, ...).
- Stages (None → Staging → Production → Archived).
- Lineage (which run produced it, on which data).
- Metrics snapshot.
- Approval (who promoted to Production, when).
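The registry is the single source of truth for "what is in production". Your serving stack loads whichever version the registry marks as Production, for example via registry.get_latest_versions(name='ranker', stages=['Production']), where registry stands for a registry client such as MLflow's MlflowClient. Recent MLflow releases deprecate stages in favour of model version aliases, so check which mechanism your version supports.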
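To make the registry concepts concrete (versions, stages, lineage, approval), here is a toy in-memory sketch. It is not the MLflow API: `ToyRegistry`, `ModelVersion`, and all method names are hypothetical, and the rule that promoting a version archives the previous Production version is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str          # lineage: which training run produced it
    data_version: str    # lineage: which data snapshot it was trained on
    metrics: dict        # metrics snapshot at registration time
    stage: str = "None"
    approvals: list = field(default_factory=list)  # (stage, approver, timestamp)


class ToyRegistry:
    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion

    def register(self, name, run_id, data_version, metrics):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, run_id, data_version, dict(metrics))
        versions.append(mv)
        return mv

    def transition(self, name, version, stage, approver):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._versions[name][version - 1]
        if stage == "Production":
            # Assumption: one Production version per model, so archive the incumbent.
            for mv in self._versions[name]:
                if mv.stage == "Production" and mv is not target:
                    mv.stage = "Archived"
        target.stage = stage
        stamp = datetime.now(timezone.utc).isoformat()
        target.approvals.append((stage, approver, stamp))
        return target

    def latest(self, name, stage):
        """Newest version of a model in the given stage, or None."""
        matches = [mv for mv in self._versions.get(name, []) if mv.stage == stage]
        return matches[-1] if matches else None
```

A serving process would call something like `latest("ranker", "Production")` at startup and load the artefact that version points to.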
6. CI/CD pipeline
git push (training code)
↓
[ CI: lint + unit tests + smoke train ]
↓
[ Trigger full training job ]
↓
[ Eval gates: AUC > 0.85, latency < 50ms, slice perf OK ]
↓
[ Register new model version ]
↓
[ Deploy to Staging endpoint ]
↓
[ Shadow test against Production for N hours ]
↓
[ Promote to Production: canary 1% → 10% → 100% ]
↓
[ Monitor; auto-rollback if metrics regress ]
Eval gates are non-negotiable. The gate is what stops a worse model from shipping. Common gates:
- Headline metric ≥ current production - epsilon.
- No slice regresses by > X%.
- Inference latency p99 within budget.
- Bias / fairness metrics within bounds.
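The first three gates in the list above can be expressed as one function that returns the reasons a candidate is blocked. This is a sketch under stated assumptions: the metric dictionaries, the key names (`auc`, `slices`, `latency_p99_ms`), and the default thresholds are all hypothetical, slice regression is measured as an absolute drop, and the fairness gate is left out because it depends on which fairness metric you adopt.

```python
def evaluate_gates(candidate, production, *, epsilon=0.002,
                   max_slice_drop=0.01, p99_budget_ms=50.0):
    """Return a list of gate failures; an empty list means the candidate may ship."""
    failures = []

    # Gate 1: headline metric must not fall more than epsilon below production.
    if candidate["auc"] < production["auc"] - epsilon:
        failures.append(
            f"headline AUC {candidate['auc']:.3f} below production "
            f"{production['auc']:.3f} minus epsilon"
        )

    # Gate 2: no slice may regress by more than max_slice_drop (absolute).
    for slice_name, prod_score in production["slices"].items():
        cand_score = candidate["slices"].get(slice_name)
        if cand_score is None:
            failures.append(f"slice {slice_name}: no candidate score")
        elif prod_score - cand_score > max_slice_drop:
            failures.append(
                f"slice {slice_name}: {prod_score:.3f} -> {cand_score:.3f}"
            )

    # Gate 3: p99 inference latency must stay within budget.
    if candidate["latency_p99_ms"] > p99_budget_ms:
        failures.append(f"p99 latency {candidate['latency_p99_ms']} ms over budget")

    return failures
```

Returning reasons rather than a bare boolean makes the CI log explain why a model was blocked, which matters when the headline metric improved but a slice did not.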
7. Validation gates — the hard part
The naïve gate (new_auc > old_auc) misses:
- Slice regression — overall AUC is up 0.5%, but down 5% on the high-value user segment.
- Stratified performance — results by country, device, and age group, not just the global number.
- Counterfactual behaviour — does the new model behave reasonably on edge cases (zero history, fresh signup, abusive content)?
- Calibration — do predicted probabilities match observed rates?
Maintain a fixed eval suite (think "test set with personality"), version it, run every candidate.
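The calibration check mentioned above is often summarised as expected calibration error (ECE): bin the predictions by predicted probability, then take the weighted average gap between mean predicted probability and observed positive rate per bin. The sketch below assumes binary labels, equal-width bins, and a hypothetical function name; other binning schemes give different numbers.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean |mean predicted prob - observed rate| over equal-width bins."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_prob - observed_rate)
    return ece
```

A model that predicts 0.25 for a group in which one in four examples is positive scores 0 on that group; a gate could require the candidate's ECE on the fixed eval suite to stay under an agreed bound.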
8. Deployment patterns
- Blue/green — deploy new alongside old; instantly switch traffic; instant rollback.
- Canary — small % to new, monitor, ramp up. Catches issues without full blast radius.
- Shadow — send 100% of traffic to both; compare predictions; serve only the old model's output. No user-facing risk, but no outcome data either, since users never see the new predictions.
- A/B test — split users; measure business outcome (CTR, revenue, retention). The only honest answer to "is this model better?".
Most teams: canary for tech metrics → A/B for business metrics.
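A canary ramp needs a stable way to decide which users see the new model. One common approach, sketched here with hypothetical names (`bucket`, `route`, the salt string), is to hash the user ID into a fixed number of buckets and route the lowest buckets to the candidate. Because the hash is deterministic, a user who is in the 1% canary stays in it when the ramp moves to 10%.

```python
import hashlib

N_BUCKETS = 10_000


def bucket(user_id: str, salt: str = "ranker-canary") -> int:
    """Map a user ID to a stable bucket in [0, N_BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_BUCKETS


def route(user_id: str, canary_fraction: float) -> str:
    """Return "candidate" for roughly canary_fraction of users, else "production"."""
    threshold = round(canary_fraction * N_BUCKETS)
    return "candidate" if bucket(user_id) < threshold else "production"
```

Changing the salt reshuffles the assignment, which is useful when a later experiment should not reuse the same early-exposure group.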
9. Monitoring in production
Four layers:
- System — latency, error rate, saturation. Same as any service.
- Model inputs — feature distributions vs training, compared with the Population Stability Index (PSI) or a Kolmogorov-Smirnov (KS) test. Drift alerts.
- Model outputs — prediction distribution, confidence, calibration.
- Outcomes (when label arrives) — actual accuracy, business KPIs.
Latency from prediction to label is the hardest part. Some labels are instant (CTR); some take weeks (fraud, refund). You need an eventual eval loop, separate from the immediate monitoring.
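For the input-drift layer, PSI compares the binned distribution of a feature in production against its training distribution: sum over bins of (actual share - expected share) × ln(actual share / expected share). The sketch below is a plain-Python version under a few assumptions: equal-width bins derived from the training sample, out-of-range production values clamped into the edge bins, and a small floor `eps` to avoid log(0). The function name and defaults are hypothetical.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # constant feature: avoid zero width

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    exp_shares = shares(expected)
    act_shares = shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_shares, act_shares))
```

A widely quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions, so calibrate alert thresholds against your own features.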
10. Retraining triggers
- Scheduled — every week / day / hour. Simplest, often enough.
- Drift-triggered — PSI > threshold → trigger retrain.
- Performance-triggered — labelled-batch metric drops below SLA → retrain.
- Manual — for major changes.
Always have a rollback path before automating retraining-and-deploy. The first bug in an automated retrain will ship a worse model; you want to undo it in one click.
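The three automatic triggers above can be combined into a single check that a scheduler runs periodically. This is an illustrative sketch: the function name, argument names, and the idea of returning a list of reasons are assumptions, and the thresholds are placeholders to be set per model.

```python
from datetime import datetime, timedelta


def retrain_reasons(now, last_trained, max_age, feature_psi, psi_threshold,
                    live_metric, metric_sla):
    """Return the list of triggers that currently fire; empty means no retrain."""
    reasons = []

    # Scheduled: the model is older than the allowed age.
    if now - last_trained >= max_age:
        reasons.append("scheduled")

    # Drift-triggered: any monitored feature exceeds the PSI threshold.
    drifted = sorted(name for name, value in feature_psi.items() if value > psi_threshold)
    if drifted:
        reasons.append("drift: " + ", ".join(drifted))

    # Performance-triggered: only checked once labels have arrived.
    if live_metric is not None and live_metric < metric_sla:
        reasons.append("performance")

    return reasons
```

Passing `live_metric=None` models the period before labels arrive, when only the schedule and drift triggers can fire.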
11. Feature store consistency (preview of S38)
The classic ML production bug: training/serving skew. Feature computed differently online vs offline → model trained on one distribution, sees another. Mitigate with:
- Single source of truth for feature logic (feature store).
- Logging online features → reuse for retraining.
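One way to detect training/serving skew is to take the feature values logged online and compare them with the values the offline pipeline computes for the same entities. The sketch below assumes both sides are available as nested dicts keyed by entity ID and feature name, with numeric values; `skew_report` and the tolerance defaults are hypothetical.

```python
import math


def skew_report(online, offline, rel_tol=1e-6, abs_tol=1e-9):
    """Map each mismatching feature to the sorted entity IDs where it differs.

    online / offline: dict of entity_id -> dict of feature name -> float.
    Only entities present on both sides are compared.
    """
    mismatches = {}
    for entity_id in online.keys() & offline.keys():
        for feature, online_value in online[entity_id].items():
            offline_value = offline[entity_id].get(feature)
            same = offline_value is not None and math.isclose(
                online_value, offline_value, rel_tol=rel_tol, abs_tol=abs_tol
            )
            if not same:
                mismatches.setdefault(feature, []).append(entity_id)
    return {feature: sorted(ids) for feature, ids in mismatches.items()}
```

An empty report on a daily sample is weak evidence that the two code paths agree; a feature that shows up repeatedly points at the logic to move into the shared feature definition.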
12. Governance and audit
For regulated industries:
- Every prediction traceable to a model version + input features + score.
- Every model traceable to training data + code + approver.
- Right to explanation (GDPR, AI Act).
- Periodic bias audits.
Bundle with the registry; this isn't optional in finance, healthcare, lending.
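Prediction-level traceability usually comes down to writing one structured record per prediction. The sketch below shows one possible shape; the field names and the `audit_record` function are hypothetical, and it assumes the features are JSON-serialisable. Hashing a canonical JSON form of the features gives a compact fingerprint that is independent of key order.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(model_name, model_version, features, score, request_id):
    """Build a log record tying a prediction to its model version and inputs."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return {
        "request_id": request_id,
        "model": f"{model_name}:{model_version}",
        "features": features,
        "features_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "score": score,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
```

Together with the registry's lineage fields, a record like this lets an auditor walk from a single score back to the model version, and from there to the training data and approver.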
13. Reality check
Most teams don't need a 10-tool platform. A minimal viable MLOps:
- Git for code.
- MLflow for tracking + registry.
- A pipeline orchestrator (Airflow, Prefect, Dagster).
- A serving stack (BentoML, Triton, FastAPI).
- Prometheus + Grafana for monitoring.
- A dashboard / notebook for drift + perf.
You can ship that in 2 weeks. Scale the platform when team size or model count demands it, not before.
Reading material
- Hidden Technical Debt in ML Systems
- Designing ML Systems (Chip Huyen)
- MLflow docs
- ML Test Score paper
In-depth research material
- Eugene Yan — Real-time ML systems
- Google — Rules of ML
- Made With ML — MLOps
- Continuous Delivery for ML (Sato et al.)
Video reference
▶︎ MLflow — End-to-End ML Lifecycle (Databricks)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design In Memory File System
- Link: https://leetcode.com/problems/design-in-memory-file-system/
- Difficulty: Hard
- Why this problem: Tree of inodes with create/read/ls/cat — same vocabulary as a model artifact registry.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- List 8 items every training run should log.
- Describe the model registry lifecycle (None → Staging → Production → Archived).
- Design 4 validation gates that block bad models from production.
- Compare canary, shadow, and A/B test deployments.
- List 3 retraining triggers and the risks of each.
- Solve design-in-memory-file-system: directory tree + file blobs; mirrors a model artefact registry.