data engineering · intermediate · 12m · 2026-06-09

Observability for Data Pipelines — SLAs, SLOs, Freshness, Data Tests

Session 46 of the 48-session learning series.

Date: Wed, 2026-07-15 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 24 · Est. read: 2 h

Why this session matters

This is Session 46 of 48 in the DE track. Software observability matured a decade ago; data observability is catching up. The discipline that takes you from "I'll check the dashboard tomorrow" to "alert fired at 03:12, runbook executed, no humans woken" is what makes pipelines you can sleep through.

Agenda

  • SLA vs SLO vs SLI — definitions that matter
  • The 5 pillars revisited — freshness, volume, distribution, schema, lineage
  • Instrumentation — what to emit from every pipeline run
  • Alerting that doesn't burn out the team
  • Incident postmortems for data systems

Deep dive

1. SLA, SLO, SLI

  • SLI — Service Level Indicator. The metric: "% of orders_fact updates completed within 30 min of source commit".
  • SLO — Service Level Objective. The target: "99.5% of updates within 30 min, monthly".
  • SLA — Service Level Agreement. The contract with a customer (internal or external): "if we miss SLO by > X, we credit Y".

Most teams need SLOs (internal targets); few need real SLAs (formal commitments). Confusing the two is common.
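To make the SLI/SLO split concrete, here is a minimal sketch that computes the SLI from the definitions above (share of updates landing within 30 minutes) and compares it to the 99.5% SLO. The lag values and the function names `sli_within` and `meets_slo` are illustrative assumptions, not part of any library.

```python
from datetime import timedelta

def sli_within(lags, limit=timedelta(minutes=30)):
    """SLI: fraction of updates that completed within `limit` of source commit."""
    if not lags:
        raise ValueError("no updates observed in this period")
    return sum(1 for lag in lags if lag <= limit) / len(lags)

def meets_slo(sli, slo=0.995):
    """SLO: the target the SLI is compared against."""
    return sli >= slo

# Hypothetical commit-to-update lags for orders_fact, in minutes.
lags = [timedelta(minutes=m) for m in (5, 12, 29, 31, 8, 45, 10, 3)]
sli = sli_within(lags)
print(f"SLI = {sli:.1%}, SLO met: {meets_slo(sli)}")
```

The SLA does not appear in the code at all: it is the contractual consequence attached to missing the SLO, not a measurement.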

2. SLOs for data — what to measure

Common SLIs:

  • Freshness: now - max(event_ts) for each dataset.
  • Completeness: count(seen_today) / count(expected_today).
  • Accuracy: count(rows passing quality test) / count(rows).
  • Availability: uptime of dataset endpoint.
  • Latency: time from source event → available in target.

Per dataset, declare 1–3 SLIs with concrete SLOs. Track them weekly.
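The first three SLIs in the list reduce to a few lines each. The sketch below assumes event timestamps and row counts have already been fetched from the warehouse; the sample numbers and the helper names `freshness` and `ratio` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def freshness(event_timestamps, now):
    """now - max(event_ts): age of the newest event in the dataset."""
    return now - max(event_timestamps)

def ratio(numerator, denominator):
    """Shared helper for completeness and accuracy; None when undefined."""
    return None if denominator == 0 else numerator / denominator

now = datetime(2026, 7, 15, 12, 0, tzinfo=timezone.utc)
event_ts = [now - timedelta(minutes=m) for m in (95, 40, 12)]

fresh = freshness(event_ts, now)      # age of the newest event
completeness = ratio(9_990, 10_000)   # seen_today / expected_today
accuracy = ratio(9_950, 9_990)        # rows passing test / rows

print(fresh, completeness, round(accuracy, 4))
```

Returning None for an undefined ratio (nothing expected, nothing loaded) keeps "no data" distinct from "0% complete", which usually deserve different alerts.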

3. Error budget

Software SRE concept that applies cleanly to data:

Error budget = (1 - SLO) per month
SLO 99.5% → 0.5% × 30 days × 24 h = 3.6 hours/month of allowed outage

If you've burned the budget, freeze feature work; focus on reliability. If you have budget left, take risks; ship faster.

This is counter-intuitive for data teams used to a 100%-perfection mindset. Once adopted, it transforms the team's relationship with risk.
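The budget arithmetic above is simple enough to keep next to the SLO definition. This sketch assumes a 30-day month and measures outage in hours; `error_budget_hours`, `budget_remaining`, and `policy` are illustrative names.

```python
def error_budget_hours(slo, days=30):
    """Allowed out-of-SLO time per period: (1 - SLO) x days x 24 h."""
    return (1 - slo) * days * 24

def budget_remaining(slo, hours_out_of_slo, days=30):
    """Fraction of the budget still unspent (negative once overspent)."""
    budget = error_budget_hours(slo, days)
    return (budget - hours_out_of_slo) / budget

def policy(remaining):
    """The rule from the text: burned budget means freeze feature work."""
    return "freeze feature work" if remaining <= 0 else "ship"

remaining = budget_remaining(0.995, hours_out_of_slo=0.9)
print(f"{error_budget_hours(0.995):.1f} h budget, "
      f"{remaining:.0%} left -> {policy(remaining)}")
```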

4. The 5 pillars instrumented

| Pillar | What to emit | Sample alert |
|---|---|---|
| Freshness | latest event ts, ingestion lag | lag > 30 min |
| Volume | rows in / rows out per run | < 50% of 7d median |
| Distribution | mean, p95, null %, distinct count | null % > 5% |
| Schema | columns + types per run | schema diff vs baseline |
| Lineage | upstream tables touched | (no alert; UI surface) |

Most pipelines emit only "success/fail". You need to emit metrics — Prometheus, OpenLineage, or proprietary.
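Two of the sample alerts in the table, sketched as plain functions. The thresholds (50% of the 7-day median, 5% nulls) come from the table; the function names and sample data are assumptions for illustration.

```python
from statistics import median

def volume_alert(rows_today, rows_last_7_days, floor=0.5):
    """Fire when today's row count falls below `floor` x the 7-day median."""
    return rows_today < floor * median(rows_last_7_days)

def null_rate_alert(values, max_null_fraction=0.05):
    """Fire when the share of NULLs in a column exceeds the threshold."""
    if not values:
        return False  # an empty column is a volume problem, not a null problem
    nulls = sum(1 for v in values if v is None)
    return nulls / len(values) > max_null_fraction

history = [1000, 1100, 900, 1050, 980, 1020, 990]  # hypothetical daily counts
print(volume_alert(400, history))
print(null_rate_alert([1, None, 3, 4, 5, 6, 7, 8, 9, 10]))
```

The median is used rather than the mean so that one unusually large backfill day does not drag the baseline up and trigger false volume alerts for the rest of the week.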

5. Instrumentation patterns

Per pipeline run, emit:

  • Start time, end time, duration.
  • Source tables read (with version/snapshot ID).
  • Destination tables written (with new version/snapshot ID).
  • Row counts (in / out per stage).
  • Schema fingerprint.
  • Job-level success / failure / error class.
  • Custom business metrics (revenue total, distinct users, ...).

OpenLineage emits the lineage and run metadata; pair it with StatsD or Prometheus for metrics.
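One way to hold the per-run fields listed above is a single record that is serialized at the end of the run. This is a minimal sketch using only the standard library; the `RunRecord` shape and `schema_fingerprint` helper are hypothetical and are not the OpenLineage event format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

def schema_fingerprint(columns):
    """Stable hash of (column, type) pairs; column order does not matter."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

@dataclass
class RunRecord:
    job: str
    started_at: datetime
    ended_at: datetime
    sources: dict            # table name -> snapshot ID read
    destinations: dict       # table name -> new snapshot ID written
    rows_in: int
    rows_out: int
    schema_fp: str
    status: str = "success"
    error_class: Optional[str] = None
    business_metrics: dict = field(default_factory=dict)

    def to_event(self):
        """Flatten to a JSON-serializable dict, adding the run duration."""
        event = asdict(self)
        event["duration_s"] = (self.ended_at - self.started_at).total_seconds()
        event["started_at"] = self.started_at.isoformat()
        event["ended_at"] = self.ended_at.isoformat()
        return event

run = RunRecord(
    job="orders_fact_hourly",
    started_at=datetime(2026, 7, 15, 3, 0, 0, tzinfo=timezone.utc),
    ended_at=datetime(2026, 7, 15, 3, 1, 30, tzinfo=timezone.utc),
    sources={"raw.orders": "snap-101"},
    destinations={"mart.orders_fact": "snap-202"},
    rows_in=10_000,
    rows_out=9_990,
    schema_fp=schema_fingerprint({"order_id": "bigint", "amount": "decimal"}),
    business_metrics={"revenue_total": 123456.78},
)
print(json.dumps(run.to_event(), indent=2))
```

Comparing `schema_fp` between consecutive runs gives the "schema diff vs baseline" signal without storing the full column list on every run.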

6. Alerts — sustainable design

Alert characteristics:

  • Actionable — someone can do something about it. Otherwise, dashboard, not alert.
  • Symptom-based — alert on user impact (freshness SLO breach), not on every internal hiccup.
  • De-duped — one alert per incident, not 50.
  • Routable — to the team that owns the dataset.
  • Escalation — if not acked in N minutes, page next-on-call.

Anti-pattern: 200 alerts per day, all "info-level"; team mutes the channel; real incident missed.
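De-duplication is the easiest of these properties to sketch. The class below suppresses repeats of the same alert key inside a time window, the same idea as the Design Logger Rate Limiter exercise for this session. `AlertDeduper`, the key format, and the window length are assumptions for illustration.

```python
class AlertDeduper:
    """Send an alert key at most once per `window_s` seconds."""

    def __init__(self, window_s=3600):
        self.window_s = window_s
        self._last_sent = {}  # alert key -> timestamp of last delivery

    def should_send(self, key, now_s):
        last = self._last_sent.get(key)
        if last is not None and now_s - last < self.window_s:
            return False  # same incident, still inside the window
        self._last_sent[key] = now_s
        return True

dedup = AlertDeduper(window_s=600)
for t in (0, 120, 300, 600):
    print(t, dedup.should_send("orders_fact:freshness", t))
```

Keying on dataset plus symptom (rather than on the failing task) is what collapses 50 downstream failures into one page for the owning team.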

7. SLO-based alerting

Better than threshold alerts:

  • Threshold: "alert if freshness > 30 min". Fires constantly during normal jitter.
  • SLO-based: "alert if you're burning error budget faster than X% per hour". Fires only when a real incident is brewing.

Burn-rate alerts (from Google's SRE practice) produce fewer false positives and catch real incidents earlier. Modern monitoring tools (Datadog, Grafana SLO, Honeycomb) support them natively.
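A burn rate is the observed failure rate divided by the failure rate the SLO allows; a burn rate of 1 spends exactly the whole budget over the SLO period. The sketch below pages only when both a long and a short window are burning fast, so a recovered incident stops paging. The 14.4 threshold is the value commonly quoted from the Google SRE Workbook for a 1-hour window against a 30-day budget (about 2% of the budget spent in one hour); treat it, and the function names, as starting assumptions.

```python
def burn_rate(bad_events, total_events, slo):
    """Observed failure rate divided by the failure rate the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1 - slo)

def should_page(long_window, short_window, slo, threshold=14.4):
    """Page only if both windows exceed the threshold.

    Each window is a (bad_events, total_events) tuple, e.g. the last
    hour and the last five minutes of pipeline updates.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)

# 8% of updates late over the last hour, 9% over the last five minutes.
print(should_page((80, 1000), (9, 100), slo=0.995))
# Same hour, but the last five minutes are clean: incident is over.
print(should_page((80, 1000), (0, 100), slo=0.995))
```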

8. Runbooks

For every alert, a runbook:

  • Symptoms — what does the alert mean?
  • Quick checks — what to look at first.
  • Common causes — top 3 historical root causes.
  • Remediation — copy-pasteable commands.
  • Escalation — who to wake up if you can't fix.

Runbooks live in your wiki, linked from the alert payload. Saves 30 min per page; sometimes saves the data.

9. Postmortems for data incidents

Same shape as service postmortems:

  • Timeline.
  • Impact (which downstream consumers were affected, for how long).
  • Root cause (5 whys).
  • Action items with owners + dates.
  • Distribute broadly; learn collectively.

Blameless postmortem culture: focus on systems, not people. "How did our system allow X?" not "who deployed Y?"

10. Data contracts (preview of S40 / recap of S32)

Each contract has an embedded SLA:

sla:
  freshness: 30m
  completeness: 99.9%
  on_breach: page producer

When the contract is broken, the producer is paged — not the consumer. Pushes ownership to the right team.
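A contract block like the one above can be checked mechanically after each run. This sketch mirrors the YAML as a Python dict to stay dependency-free; the parsing helpers and the shape of the result are hypothetical, not a specific contract tool's API.

```python
from datetime import timedelta

# Same fields as the contract's sla block above.
contract_sla = {
    "freshness": "30m",
    "completeness": "99.9%",
    "on_breach": "page producer",
}

def parse_duration(text):
    """Parse strings such as '30m' or '2h' into a timedelta."""
    units = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
    return timedelta(**{units[text[-1]]: int(text[:-1])})

def parse_percent(text):
    """Parse '99.9%' into the fraction 0.999."""
    return float(text.rstrip("%")) / 100

def check_contract(sla, observed_lag, observed_completeness):
    """Return which SLA terms are breached and the action to take."""
    breaches = []
    if observed_lag > parse_duration(sla["freshness"]):
        breaches.append("freshness")
    if observed_completeness < parse_percent(sla["completeness"]):
        breaches.append("completeness")
    return {"breaches": breaches,
            "action": sla["on_breach"] if breaches else None}

print(check_contract(contract_sla, timedelta(minutes=45), 0.9995))
```

Because the action comes from the contract rather than from the consumer's monitoring, the page lands with the producing team by construction.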

11. Tooling

  • dbt — tests + freshness checks in pipelines.
  • Great Expectations / Soda — rich expectations, scheduled or in-pipeline.
  • Monte Carlo / Bigeye / Anomalo — auto-anomaly + lineage + impact graph; SaaS.
  • OpenLineage — emit lineage from any orchestrator (Airflow, Dagster, Spark).
  • DataHub / OpenMetadata — catalog + freshness display.
  • Prometheus + Grafana — generic metrics; SLO-aware add-ons.

Start with dbt tests + Prometheus + Grafana + a scheduled freshness check. Buy a SaaS platform such as Monte Carlo when scale and cross-system complexity warrant it.

12. On-call for data

Yes, data teams should have an on-call rotation:

  • Top-10 critical datasets get the page.
  • Tier-2 datasets: business-hours response only.
  • Tier-3: investigated weekly.
  • Quarterly review of incident frequency, action-item completion.

Without on-call: a broken Sunday-morning pipeline blocks Monday's exec meeting. With: a sleepy DE acks the page, kicks the job, goes back to bed.
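The tiering above translates directly into a routing rule. In this sketch the business-hours definition (Monday to Friday, 09:00 to 18:00 local time) and the returned action names are assumptions; substitute your team's own.

```python
from datetime import datetime

def route(tier, when):
    """Decide how an alert for a dataset of the given tier is handled."""
    if tier == 1:
        return "page"  # critical datasets page on-call at any hour
    if tier == 2:
        business_hours = when.weekday() < 5 and 9 <= when.hour < 18
        return "notify" if business_hours else "queue_next_business_day"
    return "weekly_review"  # tier 3 and below

sunday_night = datetime(2026, 7, 12, 3, 12)
wednesday_morning = datetime(2026, 7, 15, 10, 0)
print(route(1, sunday_night))
print(route(2, sunday_night))
print(route(2, wednesday_morning))
```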

13. Reality check

A 6-week observability rollout for an existing pipeline stack:

  • Week 1–2: catalog top-10 datasets; assign owners.
  • Week 3: define SLIs + SLOs per dataset; baseline current performance.
  • Week 4: emit OpenLineage from orchestrator; ingest into DataHub.
  • Week 5: build SLO-based burn-rate alerts; runbooks per alert.
  • Week 6: stand up on-call rotation; first round of postmortems.

After this: data outages have a process. Teams stop discovering brokenness via Slack from execs.

Video reference

▶︎ Data Observability in Practice (Barr Moses)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Design Logger Rate Limiter

Post-session checklist

By the end of this session you should be able to:

  • Define SLI, SLO, SLA and write one of each for a real dataset.
  • Compute and explain an error budget; argue what to do when burned.
  • Wire OpenLineage + Prometheus to a typical dbt + Airflow stack.
  • Design burn-rate alerts that don't page on noise.
  • Run a blameless postmortem for a data incident.
  • Solve design-logger-rate-limiter — sliding-window dedup, the alert hygiene primitive.
