Day 13 — MLOps — Experiment Tracking, Model Registry, CI/CD for Models
Models that don't ship don't matter. MLOps is the engineering wrapper that turns notebook experiments into versioned, monitored, retrainable production assets.
MLOps takes the DevOps playbook (version control, CI/CD, monitoring, IaC) and adapts it to the two extra moving parts: data and models. Without that adaptation, "it works in the notebook" becomes "it broke in production and nobody knows why".
🧠 Concept
Why it matters & the mental model.
1. The MLOps stack
2. Experiment tracking — log everything, automatically
For every run capture: code commit, data hash, hyperparams, environment (requirements.txt / Docker image), metrics, artifacts (model file + sample predictions), system info (GPU, memory). Tools: MLflow, W&B, Comet, Neptune. Pick one and standardise; the worst tracker is "five different ones".
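As a rough illustration of the fields listed above, the sketch below assembles one run record using only the standard library. It is not the API of MLflow, W&B or any other tracker; `git_commit` and `build_run_record` are hypothetical helper names, and a real tracker would also store artifacts and system info.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_record(params: dict, metrics: dict, data_hash: str) -> dict:
    """Collect the fields a tracker would store for one training run."""
    return {
        "timestamp": time.time(),
        "code_commit": git_commit(),
        "data_hash": data_hash,
        "params": params,
        "metrics": metrics,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


# Toy values, purely for illustration.
record = build_run_record(
    params={"lr": 0.01, "max_depth": 6},
    metrics={"auc": 0.91},
    data_hash=hashlib.sha256(b"toy training data").hexdigest(),
)
print(json.dumps(record, indent=2))
```

Calling a helper like this from the training entry point, rather than by hand in a notebook, is what makes the logging automatic.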
3. Data versioning — the missing half
Code alone doesn't reproduce a model; you need the exact training data. Options:
- DVC: git-like versioning for data; small pointer files live in git while the data itself sits in remote storage such as S3.
- lakeFS / Nessie: git branches over your lake.
- Delta/Iceberg time travel: pin the table version or snapshot id used for training.
- Plain hashes: cheap and works for small datasets.
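The "plain hashes" option can be sketched in a few lines. This is a minimal example assuming the dataset is a directory of files small enough to read end to end; `hash_dataset` is a hypothetical helper, not part of DVC or any other tool named above.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_dataset(root: Path) -> str:
    """SHA-256 over relative paths and file contents, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the hash too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.csv").write_text("x,y\n1,2\n")
    before = hash_dataset(root)
    (root / "train.csv").write_text("x,y\n1,3\n")  # change a single label
    after = hash_dataset(root)

print(before != after)
```

Logging this digest with each run lets you check later whether two runs saw the same data, although it does not let you recover an old version the way DVC or time travel does.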
🛠 Deep Dive
Internals, code, architecture.
4. Model registry & staged promotion
A model has a lineage (training run → artifact → versioned model). Promote through stages (None → Staging → Production → Archived). Each stage carries metadata (eval metrics, slice metrics, fairness checks). Promotion is a deliberate event, not a notebook commit.
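To make "promotion is a deliberate event" concrete, here is a toy in-memory sketch of staged promotion. It is not the MLflow registry API; `ModelVersion`, `ALLOWED` and `promote` are hypothetical names, and the rule that a promotion must carry evidence is an assumption made for illustration.

```python
from dataclasses import dataclass, field

# Allowed transitions, mirroring None -> Staging -> Production -> Archived.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str  # lineage: which training run produced this artifact
    stage: str = "None"
    metadata: dict = field(default_factory=dict)


def promote(mv: ModelVersion, target: str, evidence: dict) -> ModelVersion:
    """Move a version to `target`, refusing skipped stages or missing evidence."""
    if target not in ALLOWED[mv.stage]:
        raise ValueError(f"cannot move {mv.stage} -> {target}")
    if target in {"Staging", "Production"} and not evidence:
        raise ValueError("promotion requires eval evidence")
    mv.metadata[target] = evidence  # keep the metrics that justified the move
    mv.stage = target
    return mv


mv = ModelVersion("churn", 3, run_id="run-42")
promote(mv, "Staging", {"auc": 0.91})
promote(mv, "Production", {"auc": 0.91, "shadow_days": 7})
print(mv.stage)
```

Because a version cannot jump straight from None to Production, every production model leaves a record of what it was evaluated on at each step.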
5. CI/CD for ML — the test pyramid
- Unit: data validation (Great Expectations), feature transforms.
- Integration: train on 1% sample, assert metrics in a sane range.
- Eval gate: on PR, run held-out eval and compare to prod baseline; block if regression > threshold.
- Shadow deploy: new model serves alongside old, predictions logged but not used, compare offline.
- Canary: route 5% of live traffic to the new model and monitor it against the old one.
- Full rollout + auto-rollback on metric drop.
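Two of the steps above, the eval gate and the canary split, reduce to small functions. The sketch below assumes every metric is higher-is-better and that requests carry a stable id; `eval_gate`, `route_to_canary` and the 0.01 threshold are illustrative choices, not values from any particular CI system.

```python
import hashlib


def eval_gate(candidate: dict, baseline: dict, max_drop: float = 0.01) -> list:
    """Return the metrics where the candidate regresses by more than max_drop.

    Assumes every metric is higher-is-better; an empty list means "pass".
    """
    failures = []
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None or base_value - cand_value > max_drop:
            failures.append(name)
    return failures


def route_to_canary(request_id: str, fraction: float = 0.05) -> bool:
    """Deterministically send roughly `fraction` of request ids to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < fraction * 10_000


baseline = {"auc": 0.90, "recall": 0.70}
small_dip = eval_gate({"auc": 0.91, "recall": 0.695}, baseline)
regression = eval_gate({"auc": 0.85, "recall": 0.72}, baseline)
print(small_dip, regression)

share = sum(route_to_canary(f"req-{i}") for i in range(10_000)) / 10_000
print(share)
```

In CI the gate would fail the build whenever the returned list is non-empty. Hashing the request id keeps routing sticky, so the same id always lands on the same model during the canary.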
6. Serving patterns
- Online: low-latency, request/response (BentoML, FastAPI, Triton, KServe).
- Batch: scheduled scoring of millions of rows (Spark / Beam).
- Streaming: score events from Kafka with Flink / Faust.
- Edge: on-device (ONNX, CoreML, TFLite).
🚀 In Practice
Trade-offs, exercises, what to ship today.
7. Feature stores
For real-time models: an online store (low-latency KV, e.g. Redis/DynamoDB) plus an offline store (warehouse/lake). Both are populated by the same pipeline from the same feature definitions, which removes the main source of training/serving skew: two implementations of one feature drifting apart. Tools: Feast, Tecton, Databricks Feature Store.
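The core idea, one feature definition feeding both stores, can be shown without any feature-store library. This is a minimal sketch, not Feast or Tecton code: `days_since_signup` is a made-up feature, and a plain dict stands in for the online KV store.

```python
from datetime import datetime, timezone


def days_since_signup(signup_ts: datetime, as_of: datetime) -> int:
    """Single feature definition shared by the offline and online paths."""
    return (as_of - signup_ts).days


def build_offline_rows(events: list) -> list:
    """Offline path: compute the feature as of each historical event time."""
    return [
        {
            "user_id": e["user_id"],
            "days_since_signup": days_since_signup(e["signup_ts"], e["event_ts"]),
        }
        for e in events
    ]


def refresh_online_store(store: dict, users: list, now: datetime) -> None:
    """Online path: write the latest value per user into a key-value store."""
    for u in users:
        store[u["user_id"]] = {
            "days_since_signup": days_since_signup(u["signup_ts"], now)
        }


signup = datetime(2024, 1, 1, tzinfo=timezone.utc)
as_of = datetime(2024, 1, 31, tzinfo=timezone.utc)

offline = build_offline_rows(
    [{"user_id": "u1", "signup_ts": signup, "event_ts": as_of}]
)
online = {}  # stand-in for Redis/DynamoDB
refresh_online_store(online, [{"user_id": "u1", "signup_ts": signup}], now=as_of)

print(offline[0]["days_since_signup"] == online["u1"]["days_since_signup"])
```

Note that the offline path evaluates the feature at the event time rather than at "now"; computing training features with information from after the event is a common way to leak labels.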
8. Monitoring — the often-skipped part
- Drift: PSI (Population Stability Index) or a KS (Kolmogorov-Smirnov) test on input features; a drop in input quality often precedes a drop in output quality.
- Performance: rolling AUC / precision against labels (may lag if labels are delayed).
- Latency / cost: p50/p95/p99 per request, $/1k requests.
- Slices: monitor by segment (country, customer tier) — aggregate can hide failure.
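For the drift bullet, PSI over bins $i$ is $\sum_i (a_i - e_i)\ln(a_i / e_i)$, where $e_i$ and $a_i$ are the reference and live proportions in each bin. The sketch below is one straightforward way to compute it, assuming a numeric feature and quantile bins taken from the reference sample; the bin count and the `eps` floor are illustrative choices.

```python
import math
import random


def psi(expected: list, actual: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using quantile bins from `expected`."""
    ref = sorted(expected)
    # Interior cut points at the 1/bins, 2/bins, ... quantiles of the reference.
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def proportions(values: list) -> list:
        counts = [0] * bins
        for v in values:
            idx = sum(v >= edge for edge in edges)  # edges at or below v
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # same distribution
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean moved by one sigma

print(round(psi(train, same), 3), round(psi(train, shifted), 3))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions, not guarantees. Running the same check per slice (country, customer tier) catches drift that the aggregate hides.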
9. Reproducibility checklist
Code commit + data hash + environment + seed + hardware. Any missing piece = irreproducible.
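The checklist can be enforced mechanically before a run is accepted. This is a small sketch under the assumption that each run is described by a dict; the field names in `REQUIRED` and both helper functions are hypothetical.

```python
import random

REQUIRED = ("code_commit", "data_hash", "environment", "seed", "hardware")


def missing_pieces(record: dict) -> list:
    """List the checklist fields that are absent or empty in a run record."""
    # Compare against None and "" explicitly so that seed=0 still counts as set.
    return [key for key in REQUIRED if record.get(key) in (None, "")]


def sample_with_seed(seed: int, n: int = 3) -> list:
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


run = {"code_commit": "abc123", "data_hash": "d41d8c", "environment": "img:1", "seed": 0}
print(missing_pieces(run))
print(sample_with_seed(7) == sample_with_seed(7))
```

A fixed seed only pins Python's own generator here; numerical libraries and GPU kernels have their own sources of nondeterminism, which is why hardware is on the list.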
10. What to take away
A common interview prompt: "Walk me from notebook to production." Strong answers cover: tracker, registry, eval gate in CI, shadow/canary deploy, drift monitor, retraining trigger. Bonus: feature store + offline/online consistency.
Resources
- 🎥 Made With ML — MLOps Course (Goku Mohandas)
- 📖 MLflow docs — Tracking + Registry
- 📖 Google — MLOps maturity levels
- 📖 Chip Huyen — Designing Machine Learning Systems (notes)
Practice Problem: Longest Substring Without Repeating Characters (Medium)