data engineering · intermediate · 12m · 2026-06-09

Observability for Data Pipelines — SLAs, SLOs, Freshness, Data Tests

Session 46 of the 48-session learning series.

Date: Wed, 2026-07-15 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 24 · Est. read: 2 h

Why this session matters

This is Session 46 of 48 in the DE track. Software observability matured a decade ago; data observability is catching up. The discipline that takes you from "I'll check the dashboard tomorrow" to "alert fired at 03:12, runbook executed, no humans woken" is what gives you pipelines you can sleep through.

Agenda

SLA vs SLO vs SLI — definitions that matter
The 5 pillars revisited — freshness, volume, distribution, schema, lineage
Instrumentation — what to emit from every pipeline run
Alerting that doesn't burn out the team
Incident postmortems for data systems

Pre-read (skim before the session)

Deep dive

1. SLA, SLO, SLI

SLI — Service Level Indicator. The metric: "% of orders_fact updates completed within 30 min of source commit".
SLO — Service Level Objective. The target: "99.5% of updates within 30 min, monthly".
SLA — Service Level Agreement. The contract with a customer (internal or external): "if we miss SLO by > X, we credit Y".

Most teams need SLOs (internal targets); few need real SLAs (formal commitments with consequences). The two are commonly confused.
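To make the SLI/SLO distinction concrete, here is a minimal sketch that computes the SLI from the example above and compares it with the SLO target. The sample delays and the names `update_delays` and `compute_sli` are made up for illustration; a real pipeline would read the delays from run metadata.

```python
from datetime import timedelta

# Hypothetical sample: delay between source commit and orders_fact update.
update_delays = [timedelta(minutes=m) for m in (5, 12, 28, 31, 9, 45, 14, 22, 7, 18)]

SLI_THRESHOLD = timedelta(minutes=30)  # "within 30 min of source commit"
SLO_TARGET = 0.995                     # "99.5% of updates within 30 min"


def compute_sli(delays, threshold):
    """Return the fraction of updates that completed within the threshold."""
    if not delays:
        raise ValueError("no updates observed; SLI is undefined")
    good = sum(1 for d in delays if d <= threshold)
    return good / len(delays)


sli = compute_sli(update_delays, SLI_THRESHOLD)  # the measurement (SLI)
slo_met = sli >= SLO_TARGET                      # the target check (SLO)
print(f"SLI = {sli:.1%}, SLO target = {SLO_TARGET:.1%}, met = {slo_met}")
```

The SLI is only the measured number; the SLO is the comparison against a target. An SLA would add a consequence for missing it, which lives in a contract, not in code.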

2. SLOs for data — what to measure

Common SLIs:

Freshness — now - max(event_ts) for each dataset.
Completeness — count(seen_today) / count(expected_today).
Accuracy — count(rows passing quality test) / count(rows).
Availability — uptime of dataset endpoint.
Latency — time from source event → available in target.

Per dataset, declare 1–3 SLIs with concrete SLOs. Track them weekly.
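The freshness and completeness formulas above can be sketched as two small functions. The timestamps, counts, and SLO values below are assumptions chosen for illustration; `now` is passed in explicitly so the calculation is reproducible.

```python
from datetime import datetime, timedelta, timezone


def freshness(event_timestamps, now):
    """Freshness SLI: now - max(event_ts)."""
    return now - max(event_timestamps)


def completeness(seen_today, expected_today):
    """Completeness SLI: count(seen_today) / count(expected_today)."""
    if expected_today <= 0:
        raise ValueError("expected_today must be positive")
    return seen_today / expected_today


# Hypothetical values for illustration.
now = datetime(2026, 7, 15, 12, 0, tzinfo=timezone.utc)
event_ts = [now - timedelta(minutes=m) for m in (95, 40, 12)]

lag = freshness(event_ts, now)
ratio = completeness(seen_today=9_990, expected_today=10_000)

print(f"freshness lag: {lag}, within 30 min: {lag <= timedelta(minutes=30)}")
print(f"completeness: {ratio:.2%}")
```

In practice `max(event_ts)` would usually come from a single aggregate query against the dataset, not from a Python list.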

3. Error budget

Software SRE concept that applies cleanly to data:

Error budget = (1 - SLO) per month
SLO 99.5% → 0.5% × 30 days × 24 h = 3.6 hours/month of allowed outage

If you've burned the budget, freeze feature work; focus on reliability. If you have budget left, take risks; ship faster.

This is counter-intuitive for data teams used to a 100%-perfection mindset. Once adopted, it transforms the team's relationship with risk.
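The arithmetic and the resulting policy can be sketched as follows. The function names and the 2.7 "bad hours so far" figure are assumptions for illustration only.

```python
def error_budget_hours(slo, days=30):
    """Allowed 'bad' hours in the window: (1 - SLO) x window length."""
    return (1 - slo) * days * 24


def budget_remaining(slo, bad_hours_so_far, days=30):
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = error_budget_hours(slo, days)
    return (budget - bad_hours_so_far) / budget


budget = error_budget_hours(0.995)         # about 3.6 hours per 30-day month
remaining = budget_remaining(0.995, 2.7)   # hypothetical: 2.7 bad hours so far

# Policy from the text: budget burned means freeze; budget left means ship.
action = "freeze feature work" if remaining <= 0 else "keep shipping"
print(f"budget = {budget:.1f} h, remaining = {remaining:.0%}, action: {action}")
```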

4. The 5 pillars instrumented

Pillar	What to emit	Sample alert
Freshness	latest event ts, ingestion lag	`lag > 30 min`
Volume	rows in / rows out per run	`< 50% of 7d median`
Distribution	mean, p95, null %, distinct count	`null % > 5%`
Schema	columns + types per run	`schema diff vs baseline`
Lineage	upstream tables touched	(no alert; UI surface)

Most pipelines emit only "success/fail". You need to emit metrics as well, via Prometheus, OpenLineage, or a proprietary tool.
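As a rough sketch of two of the sample alerts in the table (volume and distribution), assuming row counts and column values are already available in memory. The function names, defaults, and sample data are hypothetical.

```python
from statistics import median


def volume_alert(rows_today, last_7d_rows, min_ratio=0.5):
    """Fire when today's row count is below min_ratio of the 7-day median."""
    return rows_today < min_ratio * median(last_7d_rows)


def null_pct_alert(values, max_null_pct=5.0):
    """Fire when the share of None values exceeds max_null_pct percent."""
    if not values:
        return False  # nothing to judge here; the volume check covers empty loads
    null_pct = 100.0 * sum(v is None for v in values) / len(values)
    return null_pct > max_null_pct


# Hypothetical history (rows per day) and sample column.
history = [1000, 1100, 950, 1020, 980, 1050, 990]
column = [1, None, 3, 4, 5, 6, 7, 8, 9, 10]

print(volume_alert(400, history))   # True: 400 < 0.5 * 1000 (the median)
print(null_pct_alert(column))       # True: 10% nulls > 5%
```

Comparing against a rolling median instead of a fixed number keeps the volume check meaningful as the dataset grows.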

5. Instrumentation patterns

Per pipeline run, emit:

Start time, end time, duration.
Source tables read (with version/snapshot ID).
Destination tables written (with new version/snapshot ID).
Row counts (in / out per stage).
Schema fingerprint.
Job-level success / failure / error class.
Custom business metrics (revenue total, distinct users, ...).

OpenLineage emits the lineage + run metadata; pair it with StatsD or Prometheus for metrics.
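One way to picture the per-run payload is a plain record type. The sketch below is not the OpenLineage event format; `RunRecord`, `schema_fingerprint`, and every field value are hypothetical names used only to illustrate the list above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def schema_fingerprint(columns):
    """Stable hash of (name, type) pairs, independent of column order."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class RunRecord:
    job: str
    started_at: datetime
    ended_at: datetime
    sources: dict            # table name to version/snapshot ID read
    destinations: dict       # table name to new version/snapshot ID written
    rows_in: int
    rows_out: int
    schema_fp: str
    status: str              # "success" or "failure"
    error_class: Optional[str] = None
    business_metrics: dict = field(default_factory=dict)

    @property
    def duration_s(self):
        return (self.ended_at - self.started_at).total_seconds()

    def to_json(self):
        payload = asdict(self)
        payload["duration_s"] = self.duration_s
        # default=str serialises the datetime fields.
        return json.dumps(payload, default=str, sort_keys=True)


record = RunRecord(
    job="orders_fact_load",
    started_at=datetime(2026, 7, 15, 3, 0, tzinfo=timezone.utc),
    ended_at=datetime(2026, 7, 15, 3, 12, tzinfo=timezone.utc),
    sources={"raw.orders": "snap-41"},
    destinations={"analytics.orders_fact": "snap-42"},
    rows_in=10_000,
    rows_out=9_990,
    schema_fp=schema_fingerprint({"order_id": "bigint", "amount": "decimal(12,2)"}),
    status="success",
    business_metrics={"revenue_total": 12345.67},
)
print(record.to_json())
```

Sorting the columns before hashing means a reordered but otherwise identical schema yields the same fingerprint, so only real changes show up as a diff against the baseline.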

6. Alerts — sustainable design

Alert characteristics:

Actionable — someone can do something about it. Otherwise, dashboard, not alert.
Symptom-based — alert on user impact (freshness SLO breach), not on every internal hiccup.
De-duped — one alert per incident, not 50.
Routable — to the team that owns the dataset.
Escalation — if not acked in N minutes, page next-on-call.

Anti-pattern: 200 alerts per day, all "info-level"; team mutes the channel; real incident missed.
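A minimal sketch of the "de-duped" and "routable" properties, assuming each raw alert is a dict with `dataset` and `symptom` keys. The ownership map, the fallback team name, and the alert shape are all hypothetical.

```python
from collections import defaultdict

# Hypothetical ownership map: dataset name to owning team.
OWNERS = {"orders_fact": "team-orders", "users_dim": "team-identity"}


def group_into_incidents(alerts):
    """Collapse raw alerts into one incident per (dataset, symptom) and route it."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["dataset"], alert["symptom"])].append(alert)

    incidents = []
    for (dataset, symptom), members in grouped.items():
        incidents.append({
            "dataset": dataset,
            "symptom": symptom,
            "count": len(members),                              # raw alerts absorbed
            "route_to": OWNERS.get(dataset, "data-platform"),   # fallback owner
        })
    return incidents


# 50 raw freshness alerts for one dataset plus one schema alert for another.
raw = [{"dataset": "orders_fact", "symptom": "freshness_slo_breach"}] * 50
raw.append({"dataset": "users_dim", "symptom": "schema_diff"})

incidents = group_into_incidents(raw)
for incident in incidents:
    print(incident)
```

Grouping by dataset and symptom is only one possible incident key; a time window is often added so that a recurrence next week opens a fresh incident.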

7. SLO-based alerting

Better than threshold alerts:

Threshold: "alert if freshness > 30 min". Fires constantly during normal jitter.
SLO-based: "alert if you're burning error budget faster than X% per hour". Fires only when a real incident is brewing.

Burn-rate alerts (popularised by Google SRE) produce fewer false positives and catch real incidents earlier. Modern monitoring tools (Datadog, Grafana SLO, Honeycomb) support them natively.
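A minimal sketch of a two-window burn-rate check, assuming "bad" and "total" event counts are available per window (for example, late vs. all dataset updates). The 14.4 threshold is the commonly cited example of spending 2% of a 30-day budget in one hour; treat it, the window sizes, and the function names as assumptions to tune.

```python
def burn_rate(bad_events, total_events, slo):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    return error_rate / (1 - slo)


def should_page(long_window, short_window, slo, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    Each window is a (bad_events, total_events) tuple. The short window
    stops the alert from firing long after the problem has recovered.
    """
    return (burn_rate(*long_window, slo) > threshold
            and burn_rate(*short_window, slo) > threshold)


SLO = 0.995

# Normal jitter: 2 late updates out of 400 in the last hour.
print(should_page((2, 400), (0, 30), SLO))
# Active incident: 10% late in the last hour and still late in the last 5 min.
print(should_page((40, 400), (5, 30), SLO))
# Recovered: the last hour looks bad, but the last 5 min are clean.
print(should_page((40, 400), (0, 30), SLO))
```

Only the second case pages: ordinary jitter burns at roughly 1x, and a recovered incident fails the short-window check.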

8. Runbooks

For every alert, a runbook:

Symptoms — what does the alert mean?
Quick checks — what to look at first.
Common causes — top 3 historical root causes.
Remediation — copy-pasteable commands.
Escalation — who to wake up if you can't fix.

Runbooks live in your wiki, linked from the alert payload. Saves 30 min per page; sometimes saves the data.

9. Postmortems for data incidents

Same shape as service postmortems:

Timeline.
Impact (which downstream consumers were affected, for how long).
Root cause (5 whys).
Action items with owners + dates.
Distribute broadly; learn collectively.

Blameless postmortem culture: focus on systems, not people. "How did our system allow X?" not "who deployed Y?"

10. Data contracts (preview of S40 / recap of S32)

Each contract has an embedded SLA:

sla:
  freshness: 30m
  completeness: 99.9%
  on_breach: page producer

When the contract is broken, the producer is paged — not the consumer. Pushes ownership to the right team.
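As a sketch of how such a contract might be enforced, the snippet below evaluates the `sla` block above after it has been parsed into a dict. The parsing helpers, `check_contract`, and the observed values are hypothetical; completeness is kept in percent to avoid floating-point surprises at the boundary.

```python
from datetime import timedelta

# The contract from the text, as it might look after YAML parsing.
contract = {
    "sla": {"freshness": "30m", "completeness": "99.9%", "on_breach": "page producer"}
}

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text):
    """Parse strings like '30m' or '2h' into a timedelta."""
    return timedelta(**{_UNITS[text[-1]]: int(text[:-1])})


def parse_percent(text):
    """Parse strings like '99.9%' into a percentage value (99.9)."""
    return float(text.rstrip("%"))


def check_contract(contract, observed_lag, observed_completeness_pct):
    """Return the list of breached SLA terms (an empty list means healthy)."""
    sla = contract["sla"]
    breaches = []
    if observed_lag > parse_duration(sla["freshness"]):
        breaches.append("freshness")
    if observed_completeness_pct < parse_percent(sla["completeness"]):
        breaches.append("completeness")
    return breaches


# Hypothetical observation: data is 45 minutes stale but complete.
breaches = check_contract(contract, timedelta(minutes=45), 99.95)
if breaches:
    print(f"breached {breaches}: {contract['sla']['on_breach']}")
```

Because `on_breach` lives in the contract itself, the check needs no knowledge of who the producer is; routing follows from the contract owner.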

11. Tooling

dbt — tests + freshness checks in pipelines.
Great Expectations / Soda — rich expectations, scheduled or in-pipeline.
Monte Carlo / Bigeye / Anomalo — auto-anomaly + lineage + impact graph; SaaS.
OpenLineage — emit lineage from any orchestrator (Airflow, Dagster, Spark).
DataHub / OpenMetadata — catalog + freshness display.
Prometheus + Grafana — generic metrics; SLO-aware add-ons.

Start with dbt tests + Prometheus + Grafana + a scheduled freshness check. Buy a SaaS platform such as Monte Carlo when scale and cross-system complexity warrant it.

12. On-call for data

Yes, data teams should have an on-call rotation:

Top-10 critical datasets get the page.
Tier-2 datasets: business-hours response only.
Tier-3: investigated weekly.
Quarterly review of incident frequency, action-item completion.

Without on-call: a broken Sunday-morning pipeline blocks Monday's exec meeting. With: a sleepy DE acks the page, kicks the job, goes back to bed.

13. Reality check

A 6-week observability rollout for an existing pipeline stack:

Week 1–2: catalog top-10 datasets; assign owners.
Week 3: define SLIs + SLOs per dataset; baseline current performance.
Week 4: emit OpenLineage from orchestrator; ingest into DataHub.
Week 5: build SLO-based burn-rate alerts; runbooks per alert.
Week 6: stand up on-call rotation; first round of postmortems.

After this: data outages have a process. Teams stop discovering brokenness via Slack from execs.

Link: https://leetcode.com/problems/design-logger-rate-limiter/
Difficulty: Easy
Why this problem: Suppress duplicate messages within a window — the exact primitive behind sane alert de-duplication.
Time-box: 20 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Define SLI, SLO, SLA and write one of each for a real dataset.
Compute and explain an error budget; argue what to do when burned.
Wire OpenLineage + Prometheus to a typical dbt + Airflow stack.
Design burn-rate alerts that don't page on noise.
Run a blameless postmortem for a data incident.
Solve design-logger-rate-limiter — sliding-window dedup, the alert hygiene primitive.
