Observability for Data Pipelines — SLAs, SLOs, Freshness, Data Tests
Session 46 of the 48-session learning series.
Date: Wed, 2026-07-15 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 24 · Est. read: 2 h
Why this session matters
This is Session 46 of 48 in the DE track. Software observability matured a decade ago; data observability is catching up. The discipline that takes you from "I'll check the dashboard tomorrow" to "alert fired at 03:12, runbook executed, no humans woken" is what makes pipelines you can sleep through.
Agenda
- SLA vs SLO vs SLI — definitions that matter
- The 5 pillars revisited — freshness, volume, distribution, schema, lineage
- Instrumentation — what to emit from every pipeline run
- Alerting that doesn't burn out the team
- Incident postmortems for data systems
Pre-read (skim before the session)
- Google SRE Book — Service Level Objectives
- Monte Carlo — 5 Pillars of Data Observability
- Maxime Beauchemin — The rise of the data engineer
- Liz Fong-Jones — Observability
Deep dive
1. SLA, SLO, SLI
- SLI — Service Level Indicator. The metric: "% of orders_fact updates completed within 30 min of source commit".
- SLO — Service Level Objective. The target: "99.5% of updates within 30 min, monthly".
- SLA — Service Level Agreement. The contract with a customer (internal or external): "if we miss SLO by > X, we credit Y".
Most teams need SLOs (internal targets); few need real SLAs (formal commitments with consequences). Confusing the two is common.
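To make the SLI/SLO distinction concrete, here is a minimal sketch that computes the orders_fact SLI from a list of update lags and compares it to the 99.5% target. The function names and the sample month of 1000 updates are illustrative assumptions, not part of any library.

```python
from datetime import timedelta


def compute_sli(lags, threshold=timedelta(minutes=30)):
    """SLI: fraction of updates whose lag (source commit -> table update) is within threshold."""
    if not lags:
        raise ValueError("need at least one update to compute an SLI")
    good = sum(1 for lag in lags if lag <= threshold)
    return good / len(lags)


def meets_slo(sli, slo_target=0.995):
    """SLO: the target the SLI is compared against."""
    return sli >= slo_target


# Hypothetical month: 1000 updates, 4 of them slower than 30 minutes.
lags = [timedelta(minutes=10)] * 996 + [timedelta(minutes=45)] * 4
sli = compute_sli(lags)
print(f"SLI = {sli:.3%}, SLO met: {meets_slo(sli)}")
```

The SLI is what you measure, the SLO is the number you compare it to, and an SLA would add a consequence on top of a missed SLO.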
2. SLOs for data — what to measure
Common SLIs:
- Freshness: `now - max(event_ts)` for each dataset.
- Completeness: `count(seen_today) / count(expected_today)`.
- Accuracy: `count(rows passing quality test) / count(rows)`.
- Availability: uptime of dataset endpoint.
- Latency: time from source event → available in target.
Per dataset, declare 1–3 SLIs with concrete SLOs. Track them weekly.
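The first three SLIs above reduce to a subtraction and two ratios. A minimal sketch follows, with made-up timestamps and row counts; the helper names are assumptions for illustration, and treating an empty denominator as 0.0 is a design choice you may want to change.

```python
from datetime import datetime, timedelta, timezone


def freshness(event_timestamps, now):
    """now - max(event_ts): age of the newest event in the dataset."""
    return now - max(event_timestamps)


def ratio(numerator, denominator):
    """Shared helper for completeness and accuracy; an empty denominator yields 0.0."""
    return numerator / denominator if denominator else 0.0


# Fixed "now" so the example is reproducible.
now = datetime(2026, 7, 15, 12, 0, tzinfo=timezone.utc)
event_ts = [now - timedelta(minutes=m) for m in (95, 40, 12)]

lag = freshness(event_ts, now)          # age of the newest event
completeness = ratio(9_950, 10_000)     # seen_today / expected_today
accuracy = ratio(9_900, 9_950)          # rows passing quality test / rows

print(lag, completeness, round(accuracy, 4))
```

In a real pipeline the inputs would come from queries against the dataset rather than in-memory lists.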
3. Error budget
Software SRE concept that applies cleanly to data:
Error budget = (1 - SLO) per month
SLO 99.5% → 0.5% × 30 days × 24 h = 3.6 hours/month of allowed outage
If you've burned the budget, freeze feature work; focus on reliability. If you have budget left, take risks; ship faster.
This is counter-intuitive for data teams used to a 100%-perfection mindset. Once adopted, it transforms the team's relationship with risk.
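The arithmetic above is easy to script. This sketch assumes a time-based SLO over a 30-day month; the function names and the 2.7 hours of bad time are hypothetical.

```python
def error_budget_hours(slo, days=30):
    """Allowed bad time per period: (1 - SLO) x days x 24 h."""
    return (1 - slo) * days * 24


def budget_remaining(slo, bad_hours, days=30):
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_hours(slo, days)
    return (budget - bad_hours) / budget


budget = error_budget_hours(0.995)
left = budget_remaining(0.995, bad_hours=2.7)
print(f"budget: {budget:.1f} h/month, remaining: {left:.0%}")
```

A negative `budget_remaining` is the signal to freeze feature work; a comfortably positive one is the room to take risks.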
4. The 5 pillars instrumented
| Pillar | What to emit | Sample alert |
|---|---|---|
| Freshness | latest event ts, ingestion lag | lag > 30 min |
| Volume | rows in / rows out per run | rows out < 50% of 7-day median |
| Distribution | mean, p95, null %, distinct count | null % > 5% |
| Schema | columns + types per run | schema diff vs baseline |
| Lineage | upstream tables touched | (no alert; UI surface) |
Most pipelines emit only "success/fail". You need to emit metrics as well, via Prometheus, OpenLineage, or a proprietary tool.
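Two of the sample alerts in the table can be sketched as plain functions: the volume check against the 7-day median and the null-percentage check. The function names, default thresholds, and sample numbers are illustrative assumptions.

```python
from statistics import median


def volume_alert(rows_today, rows_last_7d, min_fraction=0.5):
    """Fire when today's row count drops below min_fraction of the 7-day median."""
    return rows_today < min_fraction * median(rows_last_7d)


def null_pct_alert(values, max_null_fraction=0.05):
    """Fire when the share of None values in a column exceeds max_null_fraction."""
    if not values:
        return False
    null_fraction = sum(v is None for v in values) / len(values)
    return null_fraction > max_null_fraction


history = [1000, 1100, 900, 1050, 980, 1020, 990]   # median is 1000
print(volume_alert(400, history))                    # 400 < 500
print(null_pct_alert([1, None, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Using the median rather than the mean keeps one unusually large or small day from shifting the baseline.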
5. Instrumentation patterns
Per pipeline run, emit:
- Start time, end time, duration.
- Source tables read (with version/snapshot ID).
- Destination tables written (with new version/snapshot ID).
- Row counts (in / out per stage).
- Schema fingerprint.
- Job-level success / failure / error class.
- Custom business metrics (revenue total, distinct users, ...).
OpenLineage emits the lineage and run metadata; pair it with StatsD or Prometheus for metrics.
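One way to picture the per-run payload is a small record type plus a schema fingerprint. This is a sketch using only the standard library; it is not the OpenLineage event format, and the field names, job name, table names, and snapshot IDs are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime


def schema_fingerprint(columns):
    """Short, stable hash of {column_name: type}; insensitive to column order."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class RunRecord:
    job: str
    started_at: str          # ISO-8601 timestamp
    ended_at: str            # ISO-8601 timestamp
    status: str              # "success", "failure", or an error class
    sources: dict            # table -> snapshot/version ID read
    destinations: dict       # table -> new snapshot/version ID written
    rows_in: int
    rows_out: int
    schema_fp: str
    business_metrics: dict = field(default_factory=dict)

    @property
    def duration_seconds(self):
        start = datetime.fromisoformat(self.started_at)
        end = datetime.fromisoformat(self.ended_at)
        return (end - start).total_seconds()

    def to_json(self):
        payload = asdict(self)
        payload["duration_seconds"] = self.duration_seconds
        return json.dumps(payload, sort_keys=True)


record = RunRecord(
    job="orders_fact_daily",
    started_at="2026-07-15T03:00:00+00:00",
    ended_at="2026-07-15T03:12:00+00:00",
    status="success",
    sources={"raw.orders": "snap-0041"},
    destinations={"analytics.orders_fact": "snap-0107"},
    rows_in=10_000,
    rows_out=9_950,
    schema_fp=schema_fingerprint({"order_id": "bigint", "amount": "decimal(12,2)"}),
    business_metrics={"revenue_total": 125000.50},
)
print(record.to_json())
```

Comparing `schema_fp` against the previous run's value gives the "schema diff vs baseline" alert without storing the full schema in every metric.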
6. Alerts — sustainable design
Alert characteristics:
- Actionable — someone can do something about it. Otherwise, dashboard, not alert.
- Symptom-based — alert on user impact (freshness SLO breach), not on every internal hiccup.
- De-duped — one alert per incident, not 50.
- Routable — to the team that owns the dataset.
- Escalation — if not acked in N minutes, page next-on-call.
Anti-pattern: 200 alerts per day, all "info-level"; team mutes the channel; real incident missed.
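The "de-duped" and "routable" properties can be sketched together: collapse raw alerts into one incident per (dataset, symptom) pair and attach the owning team. The `OWNERS` mapping, the fallback route, and the alert shape are assumptions for illustration.

```python
# Hypothetical ownership map; in practice this would come from a catalog.
OWNERS = {"orders_fact": "team-orders", "users_dim": "team-identity"}
DEFAULT_ROUTE = "data-platform-oncall"


def group_into_incidents(alerts):
    """Collapse raw alerts into one incident per (dataset, symptom), routed to the owner."""
    incidents = {}
    for alert in alerts:
        key = (alert["dataset"], alert["symptom"])
        incident = incidents.setdefault(key, {
            "dataset": alert["dataset"],
            "symptom": alert["symptom"],
            "route_to": OWNERS.get(alert["dataset"], DEFAULT_ROUTE),
            "count": 0,
        })
        incident["count"] += 1
    return list(incidents.values())


# 50 raw alerts for the same freshness breach, plus one for an unowned dataset.
raw = [{"dataset": "orders_fact", "symptom": "freshness_slo_breach"}] * 50
raw.append({"dataset": "clicks_raw", "symptom": "volume_drop"})

for incident in group_into_incidents(raw):
    print(incident)
```

The team sees two notifications instead of 51, and each one already names who should act on it.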
7. SLO-based alerting
Better than threshold alerts:
- Threshold: "alert if freshness > 30 min". Fires constantly during normal jitter.
- SLO-based: "alert if you're burning error budget faster than X% per hour". Fires only when a real incident is brewing.
Burn-rate alerts (Google SRE) produce fewer false positives and catch real incidents earlier. Modern monitoring tools (Datadog, Grafana SLO, Honeycomb) support them natively.
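A burn rate is the observed bad fraction divided by the allowed bad fraction (1 - SLO); a rate of 1 spends the budget exactly over the SLO period. The sketch below pages only when both a long and a short window are burning fast, which is a common multi-window pattern. The 14.4 threshold is an assumption taken as an example value (it corresponds to spending about 2% of a 30-day budget in one hour); tune it for your own SLO window.

```python
def burn_rate(bad_events, total_events, slo):
    """Observed bad fraction divided by the allowed bad fraction (1 - SLO)."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1 - slo)


def should_page(long_window, short_window, slo, threshold=14.4):
    """Page only if both windows exceed the threshold.

    Each window is a (bad_events, total_events) tuple. The short window
    makes the alert stop firing soon after the problem is fixed.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


slo = 0.995
# Hypothetical counts: last hour and last 5 minutes of orders_fact updates.
print(should_page(long_window=(100, 1000), short_window=(10, 100), slo=slo))
# Same hour, but the last 5 minutes are clean: the incident is over.
print(should_page(long_window=(100, 1000), short_window=(0, 100), slo=slo))
```

Normal jitter that pushes a few updates past 30 minutes yields a burn rate near or below 1, so it never reaches the paging threshold.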
8. Runbooks
For every alert, a runbook:
- Symptoms — what does the alert mean?
- Quick checks — what to look at first.
- Common causes — top 3 historical root causes.
- Remediation — copy-pasteable commands.
- Escalation — who to wake up if you can't fix.
Runbooks live in your wiki, linked from the alert payload. A good one can save around 30 minutes per page; sometimes it saves the data.
9. Postmortems for data incidents
Same shape as service postmortems:
- Timeline.
- Impact (which downstream consumers were affected, for how long).
- Root cause (5 whys).
- Action items with owners + dates.
- Distribute broadly; learn collectively.
Blameless postmortem culture: focus on systems, not people. "How did our system allow X?" not "who deployed Y?"
10. Data contracts (preview of S40 / recap of S32)
Each contract has an embedded SLA:
sla:
  freshness: 30m
  completeness: 99.9%
  on_breach: page producer
When the contract is broken, the producer is paged — not the consumer. Pushes ownership to the right team.
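A check against the contract fragment above might look like the following sketch. The contract is written here as a Python dict that mirrors the YAML, with the freshness value already parsed to a `timedelta`; the function name and return shape are assumptions.

```python
from datetime import timedelta

# Mirrors the embedded SLA above; values pre-parsed for the example.
CONTRACT = {
    "sla": {
        "freshness": timedelta(minutes=30),
        "completeness": 0.999,
        "on_breach": "page producer",
    }
}


def check_contract(contract, observed_lag, observed_completeness):
    """Return which SLA terms are breached and the action the contract prescribes."""
    sla = contract["sla"]
    breaches = []
    if observed_lag > sla["freshness"]:
        breaches.append("freshness")
    if observed_completeness < sla["completeness"]:
        breaches.append("completeness")
    return {"breaches": breaches, "action": sla["on_breach"] if breaches else None}


print(check_contract(CONTRACT, timedelta(minutes=45), 0.9995))
print(check_contract(CONTRACT, timedelta(minutes=10), 0.9995))
```

Because the action comes from the contract itself, the check never has to know which team to page; the producer declared that when publishing the dataset.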
11. Tooling
- dbt — tests + freshness checks in pipelines.
- Great Expectations / Soda — rich expectations, scheduled or in-pipeline.
- Monte Carlo / Bigeye / Anomalo — auto-anomaly + lineage + impact graph; SaaS.
- OpenLineage — emit lineage from any orchestrator (Airflow, Dagster, Spark).
- DataHub / OpenMetadata — catalog + freshness display.
- Prometheus + Grafana — generic metrics; SLO-aware add-ons.
Start with dbt tests + Prometheus + Grafana + a scheduled freshness check. Buy a SaaS tool such as Monte Carlo when scale and cross-system complexity warrant it.
12. On-call for data
Yes, data teams should have an on-call rotation:
- Top-10 critical datasets get the page.
- Tier-2 datasets: business-hours response only.
- Tier-3: investigated weekly.
- Quarterly review of incident frequency, action-item completion.
Without on-call: a broken Sunday-morning pipeline blocks Monday's exec meeting. With: a sleepy DE acks the page, kicks the job, goes back to bed.
13. Reality check
A 6-week observability rollout for an existing pipeline stack:
- Week 1–2: catalog top-10 datasets; assign owners.
- Week 3: define SLIs + SLOs per dataset; baseline current performance.
- Week 4: emit OpenLineage from orchestrator; ingest into DataHub.
- Week 5: build SLO-based burn-rate alerts; runbooks per alert.
- Week 6: stand up on-call rotation; first round of postmortems.
After this: data outages have a process. Teams stop discovering brokenness via Slack from execs.
Reading material
- Google SRE Book — SLOs chapter
- Site Reliability Engineering — full book free online
- Implementing Service Level Objectives (Alex Hidalgo)
- Monte Carlo — Data Reliability Engineering
In-depth research material
Video reference
▶︎ Data Observability in Practice (Barr Moses)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Logger Rate Limiter
- Link: https://leetcode.com/problems/design-logger-rate-limiter/
- Difficulty: Easy
- Why this problem: Suppress duplicate messages within a window — the exact primitive behind sane alert de-duplication.
- Time-box: 20 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Define SLI, SLO, SLA and write one of each for a real dataset.
- Compute and explain an error budget; argue what to do when burned.
- Wire OpenLineage + Prometheus to a typical dbt + Airflow stack.
- Design burn-rate alerts that don't page on noise.
- Run a blameless postmortem for a data incident.
- Solve `design-logger-rate-limiter`: sliding-window dedup, the alert hygiene primitive.