ai ml · advanced · 12m · 2026-06-11

Day 13 — MLOps — Experiment Tracking, Model Registry, CI/CD for Models

Models that don't ship don't matter. MLOps is the engineering wrapper that turns notebook experiments into versioned, monitored, retrainable production assets.

MLOps takes the DevOps playbook (version control, CI/CD, monitoring, IaC) and adapts it to the two extra moving parts: data and models. Without that adaptation, "it works in the notebook" becomes "it broke in production and nobody knows why".

🧠 Concept

Why it matters & the mental model.

1. The MLOps stack

2. Experiment tracking — log everything, automatically

For every run capture: code commit, data hash, hyperparams, environment (requirements.txt / Docker image), metrics, artifacts (model file + sample predictions), system info (GPU, memory). Tools: MLflow, W&B, Comet, Neptune. Pick one and standardise; the worst tracker is "five different ones".
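As a rough illustration of the fields listed above, here is a minimal standard-library sketch that writes one JSON record per run. It is not how MLflow or W&B work internally; the function names `git_commit` and `log_run`, the `run.json` file name, and the record layout are all assumptions made for this example. A real tracker would also store artifacts and hardware details.

```python
import json
import platform
import subprocess
import sys
import time
from pathlib import Path


def git_commit() -> str:
    """Return the current commit hash, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def log_run(run_dir: Path, params: dict, metrics: dict, data_hash: str) -> Path:
    """Write a single JSON record holding what is needed to revisit a run."""
    record = {
        "timestamp": time.time(),
        "code_commit": git_commit(),
        "data_hash": data_hash,          # hash of the training data (hypothetical input)
        "params": params,                # hyperparameters
        "metrics": metrics,              # evaluation results
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "run.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path
```

Calling `log_run` from the training entry point, rather than by hand in a notebook cell, is what makes the logging automatic.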

3. Data versioning — the missing half

Code alone doesn't reproduce a model; you need the exact training data. Options:

  • DVC: git-like versioning for data, with the files themselves kept in remote storage such as S3.
  • lakeFS / Nessie: git branches over your lake.
  • Delta/Iceberg time travel: pin a table version (Delta) or snapshot id (Iceberg).
  • Plain hashes: cheap and works for small datasets.
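The "plain hashes" option can be sketched with nothing but the standard library. The function name `dataset_hash` and the choice to hash relative paths together with file contents are assumptions for this example; the idea is only that the same files produce the same digest and any change produces a different one.

```python
import hashlib
from pathlib import Path


def dataset_hash(root: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 over every file under root, in sorted order for stability."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the hash too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as fh:
            # Read in chunks so large files do not have to fit in memory.
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Storing this digest next to the run's parameters lets you check later whether the data on disk is still the data the model was trained on. It reads every byte, which is why it only suits small datasets.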

🛠 Deep Dive

Internals, code, architecture.

4. Model registry & staged promotion

A model has a lineage (training run → artifact → versioned model). Promote through stages (None → Staging → Production → Archived). Each stage carries metadata (eval metrics, slice metrics, fairness checks). Promotion is a deliberate event, not a notebook commit.
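To make "promotion is a deliberate event" concrete, here is a small in-memory sketch of staged promotion. It does not use any registry's API; `ModelVersion`, `promote`, and the transition table are hypothetical. Two assumptions are baked in: stages only move forward, and any stage can go straight to Archived.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Assumption: forward-only moves, plus archiving from any live stage.
ALLOWED_TRANSITIONS = {
    Stage.NONE: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str  # lineage: the training run that produced this artifact
    stage: Stage = Stage.NONE
    history: list = field(default_factory=list)


def promote(model: ModelVersion, target: Stage, approved_by: str, evidence: dict) -> None:
    """Move a model to a new stage and record who approved it and why."""
    if target not in ALLOWED_TRANSITIONS[model.stage]:
        raise ValueError(f"cannot move {model.stage.value} -> {target.value}")
    # Assumption: production requires eval metrics as evidence.
    if target is Stage.PRODUCTION and "eval_metrics" not in evidence:
        raise ValueError("production promotion requires eval_metrics")
    model.history.append({
        "from": model.stage.value,
        "to": target.value,
        "approved_by": approved_by,
        "evidence": evidence,
    })
    model.stage = target
```

The `history` list is the audit trail: each entry ties a stage change to a person and to the metrics that justified it.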

5. CI/CD for ML — the test pyramid

  • Unit: data validation (Great Expectations), feature transforms.
  • Integration: train on 1% sample, assert metrics in a sane range.
  • Eval gate: on PR, run held-out eval and compare to prod baseline; block if regression > threshold.
  • Shadow deploy: new model serves alongside old, predictions logged but not used, compare offline.
  • Canary: route 5% of traffic to the new model and monitor.
  • Full rollout + auto-rollback on metric drop.
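Two steps from the pyramid above lend themselves to a short sketch: the eval gate and the canary split. Both functions are hypothetical illustrations, not part of any CI system. The gate assumes every metric is "higher is better" and uses one shared threshold; the router assumes each request carries a stable string id.

```python
import hashlib


def eval_gate(candidate: dict, baseline: dict, max_regression: float = 0.01) -> list[str]:
    """Return a list of failures; an empty list means the gate passes."""
    failures = []
    for metric, base_value in baseline.items():
        if metric not in candidate:
            failures.append(f"{metric}: missing from candidate")
            continue
        drop = base_value - candidate[metric]  # positive drop = regression
        if drop > max_regression:
            failures.append(f"{metric}: dropped by {drop:.4f} (limit {max_regression})")
    return failures


def route_to_canary(request_id: str, fraction: float = 0.05) -> bool:
    """Deterministically send roughly `fraction` of ids to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < round(fraction * 10_000)
```

In CI the gate would fail the job whenever the returned list is non-empty. Hashing the id, instead of drawing a random number per request, keeps a given id on the same model for the whole canary period.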

6. Serving patterns

  • Online: low-latency, request/response (BentoML, FastAPI, Triton, KServe).
  • Batch: scheduled scoring of millions of rows (Spark / Beam).
  • Streaming: score events from Kafka with Flink / Faust.
  • Edge: on-device (ONNX, CoreML, TFLite).

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Feature stores

Real-time models need two stores: an online store (low-latency KV, e.g. Redis/DynamoDB) for serving and an offline store (warehouse/lake) for training. Both use the same feature definitions and are populated by the same pipeline, which removes the main source of training/serving skew. Tools: Feast, Tecton, Databricks Feature Store.
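The core idea, one feature definition feeding both stores, can be shown without any feature-store library. Everything here is hypothetical: the feature names, the 30-day window, and a plain dict standing in for the online KV store. It is a sketch of the principle, not of how Feast or Tecton are implemented.

```python
from datetime import datetime, timedelta


def customer_features(orders: list[dict], as_of: datetime) -> dict:
    """Single feature definition, used by both the offline and online paths."""
    window_start = as_of - timedelta(days=30)
    # Only orders known at `as_of` count, which avoids leaking future data.
    recent = [o for o in orders if window_start < o["ts"] <= as_of]
    total = sum(o["amount"] for o in recent)
    return {
        "orders_30d": len(recent),
        "avg_amount_30d": total / len(recent) if recent else 0.0,
    }


def build_training_row(orders: list[dict], label_time: datetime) -> dict:
    """Offline path: features as they looked when the label was observed."""
    return customer_features(orders, as_of=label_time)


def refresh_online_store(store: dict, customer_id: str,
                         orders: list[dict], now: datetime) -> None:
    """Online path: write the latest features to a KV store (a dict here)."""
    store[customer_id] = customer_features(orders, as_of=now)
```

Because both paths call `customer_features`, a change to the window or the aggregation reaches training and serving together.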

8. Monitoring — the often-skipped part

  • Drift: PSI (population stability index) / KS (Kolmogorov-Smirnov test) on input features; a drop in input quality often precedes a drop in output quality.
  • Performance: rolling AUC / precision against labels (may lag if labels are delayed).
  • Latency / cost: p50/p95/p99 per request, $/1k requests.
  • Slices: monitor by segment (country, customer tier) — aggregate can hide failure.
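For the drift check, PSI over one numeric feature can be sketched in pure Python. The equal-width binning on the reference sample, the `eps` floor for empty bins, and the function name `psi` are choices made for this example; production code often uses quantile bins instead.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Those cut-offs are conventions, not guarantees, so tune them per feature.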

9. Reproducibility checklist

Code commit + data hash + environment + seed + hardware. Any missing piece = irreproducible.
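The checklist can be enforced with a few lines of code. This sketch assumes a run manifest stored as a dict with the five keys below; the key names and both function names are hypothetical. The second function shows only the seed part, using Python's own `random` module.

```python
import random

REQUIRED_PIECES = ("code_commit", "data_hash", "environment", "seed", "hardware")


def missing_pieces(manifest: dict) -> list[str]:
    """Return checklist items that are absent or empty (a seed of 0 is valid)."""
    return [k for k in REQUIRED_PIECES if manifest.get(k) in (None, "")]


def seeded_sample(seed: int, n: int = 3) -> list[float]:
    """Draw from a private generator so the seed fully determines the output."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```

Failing the training job when `missing_pieces` returns anything is a cheap way to keep irreproducible runs out of the tracker. Note that a fixed seed alone does not guarantee identical results across hardware, which is why hardware is on the list.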

10. What to take away

The question to be ready for: "Walk me from notebook to production." Strong answers cover: tracker, registry, eval gate in CI, shadow/canary deploy, drift monitor, retraining trigger. Bonus: feature store + offline/online consistency.
