Data Governance — Lineage, Quality, Catalogs, Contracts, Observability
Session 32 of the 48-session learning series.
Why this session matters
This is Session 32 of 48 in the DE track. Pipelines that run are easy. Pipelines that you trust six months later, that other teams trust, that regulators trust — that's governance. The data-mesh / lakehouse era turned this from a back-office checkbox into a first-class engineering discipline.
Agenda
- Lineage — how, why, and what column-level lineage gives you
- Data quality — tests, expectations, anomaly detection
- Catalogs — DataHub, OpenMetadata, Unity, Glue
- Data contracts — the producer-side schema agreement
- Data observability — SLAs, SLOs, freshness, the 5 pillars
Pre-read (skim before the session)
- Monte Carlo — 5 Pillars of Data Observability
- Andrew Jones — Data Contracts: A Practical Guide
- DataHub — Why metadata matters
- Great Expectations — Core concepts
Deep dive
1. Why governance, why now
The volume of data in a typical company has tripled in 3 years. The number of producers and consumers has 10x'd. The result: silent breakage everywhere.
Symptoms:
- "Why are revenue numbers different in Dashboard A and B?"
- "Production ran a query against users.email; the column was dropped 6 months ago."
- "This ML model trained on a buggy join — we shipped predictions for a quarter."
- "Auditor asks: which models touched PII X in 2025?"
Governance = answer those questions before they're asked.
2. The five pillars of data observability
| Pillar | What | Why |
|---|---|---|
| Freshness | Is data arriving on time? | Stale data ≠ data |
| Volume | Is row count in expected range? | Massive drop = upstream broke |
| Distribution | Are values in expected range? | NULL spike, type drift |
| Schema | Did columns/types change? | Silent break |
| Lineage | What's upstream and downstream? | Blast radius |
Tools (Monte Carlo, Bigeye, Soda, Anomalo) instrument these automatically. You can DIY 80% with dbt tests + alerts.
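As a rough illustration of the DIY route, the freshness and volume pillars reduce to two small comparisons. The sketch below is plain Python, not any vendor's API; the function names, the 50% tolerance, and the sample values are assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_loaded_at, max_age, now=None):
    """Return True if the newest load is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age


def check_volume(row_count, expected, tolerance=0.5):
    """Return True if row_count is within +/- tolerance (a fraction) of expected."""
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    return lower <= row_count <= upper


# Hypothetical metadata for one table: last load 3 hours ago, 400 rows today.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_load = now - timedelta(hours=3)

fresh = check_freshness(last_load, max_age=timedelta(hours=1), now=now)
volume_ok = check_volume(row_count=400, expected=1000, tolerance=0.5)
print(f"fresh={fresh} volume_ok={volume_ok}")
```

In practice the inputs would come from warehouse metadata (a max load timestamp and a row count per partition), and a failing check would feed whatever alerting you already have.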
3. Lineage — the spine of governance
Lineage = the graph of "this column came from those columns came from that source".
Levels:
- Dataset-level — orders_fact came from orders_raw + dim_customer.
- Column-level — total_revenue in orders_fact = SUM(unit_price * qty) from orders_raw.
- Transformation-level — the SQL statement that produced each column.
Use cases:
- Impact analysis — "I want to drop this column; what breaks?"
- Root cause — "Dashboard is wrong; trace back to source."
- Audit — "Show me all data lineage touching PII."
Generate from: SQL parsing (sqlglot, dbt graph), runtime hooks (OpenLineage), or job orchestrator (Airflow, Dagster). Don't try to maintain by hand.
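Once lineage exists as a set of edges, impact analysis is a graph traversal. The sketch below assumes the edges have already been extracted by one of the methods above; the edge list is hand-written for illustration, and revenue_dashboard.daily_revenue is a hypothetical downstream column.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (upstream, downstream).
EDGES = [
    ("orders_raw.unit_price", "orders_fact.total_revenue"),
    ("orders_raw.qty", "orders_fact.total_revenue"),
    ("orders_fact.total_revenue", "revenue_dashboard.daily_revenue"),
    ("dim_customer.customer_id", "orders_fact.customer_id"),
]


def downstream(edges, start):
    """Return every node reachable from start, i.e. its blast radius."""
    children = defaultdict(list)
    for up, down in edges:
        children[up].append(down)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# "I want to drop orders_raw.qty; what breaks?"
impacted = downstream(EDGES, "orders_raw.qty")
print(sorted(impacted))
```

Reversing each edge and running the same traversal gives the root-cause direction: everything upstream of a broken dashboard column.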
4. Data quality testing
The dbt-style minimum:
    models:
      - name: orders_fact
        columns:
          - name: order_id
            tests: [not_null, unique]
          - name: customer_id
            tests:
              - not_null
              - relationships:
                  to: ref('dim_customer')
                  field: customer_id
          - name: total
            tests:
              - dbt_utils.accepted_range:
                  min_value: 0
                  max_value: 1000000
Tiers of tests:
- Existence — column present, not null.
- Uniqueness — primary keys.
- Referential integrity — foreign keys exist in dim.
- Range / value — accepted values, min/max.
- Distribution — mean, p95, cardinality within X% of last week.
- Business invariant — revenue = SUM(item.price * item.qty).
Wire to alerting. A failing test that nobody sees is theatre.
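The last two tiers are the ones the built-in dbt tests do not cover, so a small sketch may help. This is plain Python over in-memory values, not a dbt or Great Expectations API; the function names, the 10% threshold, and the sample order are assumptions. Amounts are integers (for example cents) so the equality check is exact.

```python
def within_pct(current, baseline, pct):
    """Distribution check: True if current is within pct percent of baseline."""
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) * 100 <= pct


def revenue_matches_items(order):
    """Business invariant: revenue equals the sum of price * qty over items."""
    expected = sum(item["price"] * item["qty"] for item in order["items"])
    return order["revenue"] == expected


# Hypothetical values: this week's mean order total vs last week's.
mean_ok = within_pct(current=118.0, baseline=100.0, pct=10)

# Hypothetical order with integer amounts.
order = {
    "revenue": 2500,
    "items": [{"price": 1000, "qty": 2}, {"price": 500, "qty": 1}],
}
invariant_ok = revenue_matches_items(order)
print(f"mean_ok={mean_ok} invariant_ok={invariant_ok}")
```

In a warehouse the same two checks would normally be written as SQL (a singular dbt test, for instance) rather than run row by row in Python.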
5. Great Expectations / Soda / dbt tests
- dbt tests — run during transformations; SQL-based. Free, in-pipeline. Best for shape/value checks.
- Great Expectations — Python-based; richer expectations; can run on any data source; heavier setup.
- Soda — YAML-driven; cloud + OSS; nice UI for tracking violations.
- Anomalo / Monte Carlo — ML-driven anomaly detection on top of metadata.
Start with dbt tests. Add GE/Soda if you need cross-source tests or need to test outside the warehouse.
6. Data contracts
The producer commits to the schema and semantics of the data it publishes; the consumer can then rely on them.
A contract specifies:
- Schema — column names, types, nullability.
- Semantics — "amount is always in USD cents, integer".
- SLAs — freshness, completeness.
- Owners — who to ping when broken.
- Change policy — additive only? deprecation window?
Tooling:
- Define in YAML or JSON.
- Producer's CI fails if a change violates the contract.
- Versioned in git; reviewed like API contracts.
    dataset: orders
    owner: orders-team
    schema:
      - name: order_id
        type: string
        description: uuid
    sla:
      freshness: 30m
      completeness: 99.9%
Push the validation upstream. Test the producer's output, not the consumer's input.
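A producer-side CI check can be as small as a diff between the contracted schema and the proposed one. The sketch below assumes an "additive only" change policy and represents each schema as a plain dict of column name to type; the function name and both sample schemas are hypothetical, and loading the YAML contract is left out.

```python
def contract_violations(contract, proposed):
    """Compare a proposed schema to the contract; added columns are allowed."""
    violations = []
    for name, col_type in contract.items():
        if name not in proposed:
            violations.append(f"removed column: {name}")
        elif proposed[name] != col_type:
            violations.append(
                f"type change on {name}: {col_type} -> {proposed[name]}"
            )
    return violations


# Hypothetical contracted schema and a proposed producer change.
CONTRACT = {"order_id": "string", "amount": "integer"}
PROPOSED = {"order_id": "string", "amount": "float", "currency": "string"}

violations = contract_violations(CONTRACT, PROPOSED)
for violation in violations:
    print(violation)
```

A CI job would exit non-zero whenever the list is non-empty, which blocks the producer's merge before any consumer sees the change. Here the new currency column passes, while the type change on amount is flagged.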
7. Catalogs — discoverability layer
DataHub, OpenMetadata, Unity Catalog, AWS Glue Data Catalog. Capabilities:
- Auto-crawl warehouses, lakes, dashboards.
- Show schemas, sample data, top users.
- Capture lineage (often via OpenLineage).
- Glossary — business definitions tied to physical tables.
- Tags (PII, PCI, restricted).
- Reviews / ratings — humans annotate trust.
For a startup: DataHub or OpenMetadata (OSS, self-hostable). On Databricks: Unity Catalog. On AWS: Glue.
8. Owners and stewards
Every dataset needs an owner (engineer who maintains) and a steward (business person who defines). Without ownership:
- No one fixes broken pipelines.
- No one approves schema changes.
- No one answers "what does customer_tier_v3 mean?"
Bake owner into the catalog. Page the owner when their dataset SLA breaks.
9. PII, classification, masking
Tag columns with sensitivity (Public, Internal, Confidential, PII, PCI). Enforce via:
- Row-level security — different users see different rows.
- Column-level security — analysts see hashed email, support sees plain.
- Dynamic data masking — same query, different output by role.
- Lineage propagation — if PII flows into a downstream table, downstream is automatically tagged.
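Modern warehouses (Snowflake, BigQuery, Databricks) support row-level security, column-level security, and dynamic masking natively; how far tags propagate automatically along lineage varies by platform. Use what your platform gives you.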
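To make the masking idea concrete, here is a minimal sketch of role-based masking applied to one row. It only illustrates the logic; in a real warehouse this would be a masking policy rather than application code. The tag map, the role set, and the sample row are assumptions, and a bare SHA-256 of an email is used for brevity, not as a recommended pseudonymisation scheme.

```python
import hashlib

# Hypothetical column tags and the roles allowed to see PII in the clear.
COLUMN_TAGS = {"email": "PII", "country": "Internal"}
ROLES_SEEING_PII = {"support"}


def mask_row(row, role):
    """Return a copy of row with PII columns hashed unless the role may see them."""
    masked = {}
    for column, value in row.items():
        if COLUMN_TAGS.get(column) == "PII" and role not in ROLES_SEEING_PII:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            masked[column] = digest
        else:
            masked[column] = value
    return masked


row = {"email": "ada@example.com", "country": "GB"}
analyst_view = mask_row(row, "analyst")
support_view = mask_row(row, "support")
print(analyst_view["email"] != row["email"], support_view == row)
```

The same query shape returns different output depending on the caller's role, which is the behaviour the dynamic-masking bullet describes.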
10. Data observability vs data quality
- Quality — a set of rules: "X must be > 0", "Y must not be NULL".
- Observability — the running picture: what is happening across freshness, volume, schema, distribution, lineage.
You need both. Quality catches what you can articulate. Observability surfaces the surprises you didn't think to test for.
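The difference shows up in code: a quality rule has a threshold someone wrote down, while an observability check learns its baseline from history. The sketch below flags a daily row count that sits far from the recent mean using a simple z-score; the function name, the threshold of 3 standard deviations, and the sample counts are assumptions, and commercial tools use considerably more elaborate models.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag latest if it is more than threshold sample std devs from the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily row counts for the past week.
daily_rows = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_anomalous(daily_rows, 400))
print(is_anomalous(daily_rows, 1015))
```

Nobody had to articulate "row count must exceed 900" in advance; the drop to 400 stands out against the table's own history.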
11. Incidents and postmortems for data
Treat data outages like service outages:
- Severity levels.
- On-call rotation for critical datasets.
- War room / Slack channel during incident.
- Postmortem with action items.
- Public dashboard of dataset health.
Cultural shift. Data engineers learn from ops; ops doesn't tolerate "the dashboard is wrong, we'll fix it Monday".
12. Reality check
A first-90-days governance plan:
- Top-10 critical datasets — explicit owners, contracts, SLAs.
- dbt tests on every Gold model — basic shape + value.
- OpenLineage emission from Airflow/Dagster → DataHub.
- PII tags on user-facing tables.
- Slack alerts on failed tests, missed SLAs.
- Weekly review of broken-test trends.
You don't need a Monte Carlo subscription on day 1. You do need the discipline.
Reading material
Books:
- Driving Data Quality with Data Contracts — Andrew Jones (2023; the canonical book on data contracts)
- Fundamentals of Data Engineering — Joe Reis & Matt Housley (the governance / DataOps chapter)
- Data Management at Scale, 2nd ed. — Piethein Strengholt (the enterprise data-mesh + governance book)
- The DAMA Guide to the Data Management Body of Knowledge (DMBOK), 2nd ed. — DAMA International (the encyclopaedia; reference, not cover-to-cover)
Papers:
- Goods: Organizing Google's Datasets — Halevy et al. 2016 (SIGMOD) — Google's internal catalog paper; the origin of modern lineage thinking.
- Apache Atlas: Data Governance Made Possible — Hortonworks — the Hadoop-era catalog; still influential.
Official docs:
- OpenLineage spec — the open standard for emitting lineage from any tool.
- dbt — Data tests — unique, not_null, relationships, accepted_values; the starter kit.
- Great Expectations docs — the OG Python data-validation library.
- Soda Core docs — the YAML-driven alternative.
- DataHub documentation — LinkedIn's open-source catalog.
- Databricks Unity Catalog — the lakehouse-native catalog + access control.
Blog posts:
- Chad Sanderson — Data Contracts: The Mesh in Practice — the canonical practitioner essay series.
- Monte Carlo — What is Data Observability? The 5 Pillars — freshness, distribution, volume, schema, lineage.
- Airbnb Engineering — Data Quality at Airbnb (Midas) — the certification programme that defined a generation of DQ.
- Lyft Engineering — Amundsen — the discovery / catalog system that inspired DataHub.
In-depth research material
- OpenLineage — github.com/OpenLineage/OpenLineage — ~1.9k ★, the cross-tool lineage standard supported by Airflow, dbt, Spark.
- DataHub — github.com/datahub-project/datahub — ~10k ★, LinkedIn's metadata + lineage + quality platform.
- Amundsen — github.com/amundsen-io/amundsen — ~4.5k ★, Lyft's discovery engine.
- OpenMetadata — github.com/open-metadata/OpenMetadata — ~6k ★, modern unified metadata.
- Marquez — github.com/MarquezProject/marquez — ~1.9k ★, the OpenLineage reference store.
- Great Expectations — github.com/great-expectations/great_expectations — ~10k ★, the canonical Python data validation framework.
- Soda Core — github.com/sodadata/soda-core — ~1.9k ★, declarative YAML data checks.
- Uber Engineering — DataMesh: How Uber transformed data governance — the petabyte-scale data-mesh implementation.
- Netflix Tech Blog — Metacat: Making big data discoverable and meaningful — Netflix's federated metadata service.
- Convoy / Chad Sanderson — Shifting Left on Data Quality (Data Council talk) — the contract-driven approach at Convoy.
Videos
- Data Contracts: From Theory to Implementation — Chad Sanderson · 41 min — the canonical talk on data contracts.
- Data Mesh: Principles and Logical Architecture — Zhamak Dehghani · 44 min — the creator of the data mesh paradigm explaining it.
- Data Observability: The Next Frontier of Data Engineering — Barr Moses — Monte Carlo CEO · 27 min — the 5 pillars in a single talk.
- Apache Atlas: Data Governance for Hadoop — Hortonworks — 32 min — the Hadoop-era origin of modern catalog tools.
- Building a Self-Service Data Platform at Netflix — Joey Lynch — 47 min — how Netflix scaled discovery + quality across thousands of users.
LeetCode — Graph Valid Tree
- Link: https://leetcode.com/problems/graph-valid-tree/
- Difficulty: Medium
- Why this problem: Verify acyclic + connected. Lineage graphs are directed acyclic graphs rather than trees, so the cycle-detection idea is the part that carries over.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Name the 5 pillars of data observability.
- Explain dataset- vs column-level lineage and one use-case for each.
- Write 4 dbt tests covering existence, uniqueness, referential integrity, range.
- Sketch a data contract YAML with schema, SLA, and owner.
- Choose between dbt tests, Great Expectations, and a SaaS observability tool.
- Solve graph-valid-tree — union-find or BFS for tree validation (lineage graph sanity check).