data engineering · intermediate · 12m · 2026-06-09

Data Governance — Lineage, Quality, Catalogs, Contracts, Observability

Session 32 of the 48-session learning series.

Why this session matters

Pipelines that run are easy. Pipelines that you still trust six months later, that other teams trust, and that regulators trust: that is governance. The data-mesh / lakehouse era turned this from a back-office checkbox into a first-class engineering discipline.

Agenda

Lineage — how, why, and what column-level lineage gives you
Data quality — tests, expectations, anomaly detection
Catalogs — DataHub, OpenMetadata, Unity, Glue
Data contracts — the producer-side schema agreement
Data observability — SLAs, SLOs, freshness, the 5 pillars

Deep dive

1. Why governance, why now

As a rough rule of thumb, the volume of data in a typical company has tripled in 3 years, and the number of producers and consumers has grown about 10x. The result: silent breakage everywhere.

Symptoms:

"Why are revenue numbers different in Dashboard A and B?"
"Production ran a query against users.email; the column was dropped 6 months ago."
"This ML model trained on a buggy join — we shipped predictions for a quarter."
"Auditor asks: which models touched PII X in 2025?"

Governance = being able to answer those questions before they're asked.

2. The five pillars of data observability

Pillar	What it checks	Why it matters
Freshness	Is data arriving on time?	Stale data ≠ usable data
Volume	Is row count in expected range?	Massive drop = upstream broke
Distribution	Are values in expected range?	NULL spike, type drift
Schema	Did columns/types change?	Silent break
Lineage	What's upstream and downstream?	Blast radius

Tools (Monte Carlo, Bigeye, Soda, Anomalo) instrument these automatically. You can cover most of it yourself with dbt tests + alerts.
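As a minimal sketch of the DIY route, the freshness and volume pillars reduce to two comparisons. The function names, the 30-minute lag, and the 20% tolerance below are illustrative assumptions, not defaults from any of the tools named above.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_loaded_at: datetime, max_lag: timedelta, now: datetime) -> bool:
    """Return True when the newest load is no older than max_lag."""
    return now - last_loaded_at <= max_lag


def check_volume(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Return True when row_count is within +/- tolerance of the expected count."""
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    return lower <= row_count <= upper


# Hypothetical observations for one dataset.
now = datetime(2026, 6, 9, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2026, 6, 9, 11, 45, tzinfo=timezone.utc), timedelta(minutes=30), now)
stale = check_freshness(datetime(2026, 6, 9, 9, 0, tzinfo=timezone.utc), timedelta(minutes=30), now)
volume_ok = check_volume(row_count=9_500, expected=10_000)
volume_drop = check_volume(row_count=4_000, expected=10_000)
```

In practice `last_loaded_at` and `row_count` would come from a warehouse query or table metadata; the checks themselves stay this simple.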

3. Lineage — the spine of governance

Lineage = the graph of "this column came from those columns, which came from that source".

Levels:

Dataset-level — orders_fact came from orders_raw + dim_customer.
Column-level — total_revenue in orders_fact = SUM(unit_price * qty) from orders_raw.
Transformation-level — the SQL statement that produced each column.

Use cases:

Impact analysis — "I want to drop this column; what breaks?"
Root cause — "Dashboard is wrong; trace back to source."
Audit — "Show me all data lineage touching PII."

Generate from: SQL parsing (sqlglot, dbt graph), runtime hooks (OpenLineage), or job orchestrator (Airflow, Dagster). Don't try to maintain by hand.
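Once lineage exists as a graph, impact analysis is a graph traversal. The sketch below assumes a hand-written adjacency map purely for illustration (the column names beyond orders_raw and orders_fact are hypothetical); in a real setup the edges would be generated by one of the sources listed above.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns in its value list.
DOWNSTREAM = {
    "orders_raw.unit_price": ["orders_fact.total_revenue"],
    "orders_raw.qty": ["orders_fact.total_revenue"],
    "orders_fact.total_revenue": ["finance_dashboard.revenue", "ml_features.customer_ltv"],
    "ml_features.customer_ltv": ["churn_model.input"],
}


def blast_radius(column: str, graph: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every column reachable downstream of `column`."""
    seen: set[str] = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# "I want to drop orders_raw.qty; what breaks?"
affected = blast_radius("orders_raw.qty", DOWNSTREAM)
```

Root-cause analysis is the same walk over the reversed edges, starting from the broken dashboard column.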

4. Data quality testing

The dbt-style minimum:

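models:
  - name: orders_fact
    columns:
      - name: order_id
        # primary key: must exist and be unique
        # (dbt 1.8+ also accepts this key as data_tests:)
        tests: [not_null, unique]
      - name: customer_id
        tests:
          - not_null
          # referential integrity: every customer_id must exist in dim_customer
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: total
        tests:
          # range check; requires the dbt_utils package
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000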

Tiers of tests:

Existence — column present, not null.
Uniqueness — primary keys.
Referential integrity — foreign keys exist in dim.
Range / value — accepted values, min/max.
Distribution — mean, p95, cardinality within X% of last week.
Business invariant — revenue = SUM(item.price * item.qty).

Wire to alerting. A failing test that nobody sees is theatre.
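The last two tiers are harder to express as built-in tests, so here is a rough Python sketch of each. The function names, sample values, and the 10% threshold are assumptions for illustration; amounts are integers in cents to keep the invariant check exact.

```python
from statistics import mean


def within_pct(current: float, baseline: float, pct: float) -> bool:
    """True when current deviates from baseline by at most pct percent."""
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) * 100 <= pct


def revenue_invariant_holds(revenue_cents: int, items: list[tuple[int, int]]) -> bool:
    """Business invariant: revenue = SUM(item.price * item.qty)."""
    return revenue_cents == sum(price_cents * qty for price_cents, qty in items)


# Distribution tier: compare this week's mean order value with last week's.
last_week = [100, 110, 90, 105]
this_week = [102, 98, 107, 101]
spiked_week = [150, 160, 155, 170]
steady = within_pct(mean(this_week), mean(last_week), pct=10)
drifted = within_pct(mean(spiked_week), mean(last_week), pct=10)

# Business-invariant tier: two items at 19.99 and one at 5.00.
items = [(1999, 2), (500, 1)]
consistent = revenue_invariant_holds(4498, items)
inconsistent = revenue_invariant_holds(4500, items)
```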

5. Great Expectations / Soda / dbt tests

dbt tests — run during transformations; SQL-based. Free, in-pipeline. Best for shape/value checks.
Great Expectations — Python-based; richer expectations; can run on any data source; heavier setup.
Soda — YAML-driven; cloud + OSS; nice UI for tracking violations.
Anomalo / Monte Carlo — ML-driven anomaly detection on top of metadata.

Start with dbt tests. Add GE/Soda if you need cross-source tests or need to test outside the warehouse.

6. Data contracts

The producer commits to what the schema and semantics look like. The consumer can then rely on them.

A contract specifies:

Schema — column names, types, nullability.
Semantics — "amount is always in USD cents, integer".
SLAs — freshness, completeness.
Owners — who to ping when broken.
Change policy — additive only? deprecation window?

Tooling:

Define in YAML or JSON.
Producer's CI fails if a change violates the contract.
Versioned in git; reviewed like API contracts.

dataset: orders
owner: orders-team
schema:
  - name: order_id
    type: string
    description: uuid
sla:
  freshness: 30m
  completeness: 99.9%

Push the validation upstream. Test the producer's output, not the consumer's input.
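One way a producer's CI could enforce an additive-only change policy is to diff the proposed schema against the contract. This is a sketch under assumptions: the schema is represented as a list of dicts mirroring the YAML above, and `contract_violations` and the extra `channel` column are hypothetical names, not part of any contract tool.

```python
def contract_violations(contract_schema: list[dict], proposed_schema: list[dict]) -> list[str]:
    """Compare proposed columns with the contract and describe breaking changes.

    Added columns are allowed (additive-only policy); removed columns and
    changed types are reported.
    """
    proposed = {col["name"]: col for col in proposed_schema}
    problems = []
    for col in contract_schema:
        name = col["name"]
        if name not in proposed:
            problems.append(f"column removed: {name}")
        elif proposed[name]["type"] != col["type"]:
            problems.append(f"type changed: {name} {col['type']} -> {proposed[name]['type']}")
    return problems


contract = [{"name": "order_id", "type": "string"}]

additive_change = contract + [{"name": "channel", "type": "string"}]
breaking_change = [{"name": "order_id", "type": "int"}]
dropped_column = [{"name": "channel", "type": "string"}]

ok = contract_violations(contract, additive_change)
type_break = contract_violations(contract, breaking_change)
removal = contract_violations(contract, dropped_column)
```

A CI job would fail the build whenever the returned list is non-empty, which keeps the check on the producer's side of the boundary.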

7. Catalogs — discoverability layer

DataHub, OpenMetadata, Unity Catalog, AWS Glue Data Catalog. Capabilities:

Auto-crawl warehouses, lakes, dashboards.
Show schemas, sample data, top users.
Capture lineage (often via OpenLineage).
Glossary — business definitions tied to physical tables.
Tags (PII, PCI, restricted).
Reviews / ratings — humans annotate trust.

For a startup: DataHub or OpenMetadata (OSS, self-hostable). On Databricks: Unity Catalog. On AWS: Glue.

8. Owners and stewards

Every dataset needs an owner (the engineer who maintains it) and a steward (the business person who defines what it means). Without ownership:

No one fixes broken pipelines.
No one approves schema changes.
No one answers "what does customer_tier_v3 mean?"

Bake owner into the catalog. Page the owner when their dataset SLA breaks.
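A rough sketch of what "owner in the catalog" can mean in practice: each entry carries an owner and a steward, and alert routing is a lookup. The `CatalogEntry` shape, the steward name, and the fallback on-call name are assumptions for illustration, not the data model of any catalog named above.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    dataset: str
    owner: str    # engineer or team that maintains the pipeline
    steward: str  # business person who defines what the data means
    tags: set[str] = field(default_factory=set)


def who_to_page(dataset: str, catalog: dict[str, CatalogEntry]) -> str:
    """Return the owner to page for a broken SLA, or a fallback when unowned."""
    entry = catalog.get(dataset)
    # "data-platform-oncall" is a hypothetical catch-all rotation.
    return entry.owner if entry else "data-platform-oncall"


catalog = {
    "orders": CatalogEntry(dataset="orders", owner="orders-team", steward="finance-ops", tags={"PII"}),
}

paged = who_to_page("orders", catalog)
unowned = who_to_page("customer_tier_v3", catalog)
```

Datasets that keep landing on the fallback are a useful signal in their own right: they are the ones with no owner yet.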

9. PII, classification, masking

Tag columns with sensitivity (Public, Internal, Confidential, PII, PCI). Enforce via:

Row-level security — different users see different rows.
Column-level security — analysts see hashed email, support sees plain.
Dynamic data masking — same query, different output by role.
Lineage propagation — if PII flows into a downstream table, downstream is automatically tagged.

Modern warehouses (Snowflake, BigQuery, Databricks) support most of this natively, though the exact feature set differs by platform. Use it.
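To make two of these mechanisms concrete, here is a plain-Python sketch of role-based masking and of tag propagation along lineage edges. It is not how any warehouse implements them; the role names, column names, and edge list are assumptions, and an unsalted hash is used only to keep the example short.

```python
import hashlib


def mask_email(email: str, role: str) -> str:
    """Return the plain value for support; a SHA-256 hex digest for every other role."""
    if role == "support":
        return email
    # Unsalted hashing is for illustration only; it is weak protection for emails.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def propagate_tags(tags: dict[str, set[str]], edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Copy each source column's tags to its downstream columns until nothing changes."""
    result = {col: set(col_tags) for col, col_tags in tags.items()}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            before = len(result.setdefault(dst, set()))
            result[dst] |= result.get(src, set())
            if len(result[dst]) != before:
                changed = True
    return result


plain = mask_email("a@example.com", role="support")
masked = mask_email("a@example.com", role="analyst")

# Edges are (upstream, downstream); order does not matter because we iterate to a fixed point.
edges = [("stg_users.email", "dim_customer.email"), ("users.email", "stg_users.email")]
tagged = propagate_tags({"users.email": {"PII"}}, edges)
```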

10. Data observability vs data quality

Quality — a set of rules: "X must be > 0", "Y must not be NULL".
Observability — the running picture: what is happening across freshness, volume, schema, distribution, lineage.

You need both. Quality catches what you can articulate. Observability surfaces the surprises you didn't think to test for.
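The difference shows up in code: a quality rule is a fixed predicate, while an observability check learns what "normal" looks like from history. The sketch below uses a simple z-score on daily row counts; it is a stand-in for illustration (with an assumed threshold of 3 standard deviations), not the method used by any vendor named earlier.

```python
from statistics import mean, stdev


def is_anomalous(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's value when it sits more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a surprise.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily row counts for the past week.
history = [1000, 1020, 980, 1010, 990, 1005, 995]

normal_day = is_anomalous(history, today=1008)
broken_day = is_anomalous(history, today=400)
```

Nobody wrote a rule saying "row count must exceed 400", yet the drop is still caught.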

11. Incidents and postmortems for data

Treat data outages like service outages:

Severity levels.
On-call rotation for critical datasets.
War room / Slack channel during incident.
Postmortem with action items.
Public dashboard of dataset health.

This is a cultural shift. Data engineers can learn from ops here: ops teams don't tolerate "the dashboard is wrong, we'll fix it Monday".

12. Reality check

A first-90-days governance plan:

Top-10 critical datasets — explicit owners, contracts, SLAs.
dbt tests on every Gold model — basic shape + value.
OpenLineage emission from Airflow/Dagster → DataHub.
PII tags on user-facing tables.
Slack alerts on failed tests, missed SLAs.
Weekly review of broken-test trends.

You don't need a Monte Carlo subscription on day 1. You do need the discipline.

Reading material

Books:

Driving Data Quality with Data Contracts — Andrew Jones (2023; the canonical book on data contracts)
Fundamentals of Data Engineering — Joe Reis & Matt Housley (the governance / DataOps chapter)
Data Management at Scale, 2nd ed. — Piethein Strengholt (the enterprise data-mesh + governance book)
DAMA-DMBOK: Data Management Body of Knowledge, 2nd ed. — DAMA International (the encyclopaedia; a reference, not a cover-to-cover read)

Papers:

Goods: Organizing Google's Datasets — Halevy et al. 2016 (SIGMOD) — Google's internal catalog paper; the origin of modern lineage thinking.
Apache Atlas: Data Governance Made Possible — Hortonworks — the Hadoop-era catalog; still influential.

Official docs:

OpenLineage spec — the open standard for emitting lineage from any tool.
dbt — Data tests — unique, not_null, relationships, accepted_values; the starter kit.
Great Expectations docs — the OG Python data-validation library.
Soda Core docs — the YAML-driven alternative.
DataHub documentation — LinkedIn's open-source catalog.
Databricks Unity Catalog — the lakehouse-native catalog + access control.

Blog posts:

Chad Sanderson — Data Contracts: The Mesh in Practice — the canonical practitioner essay series.
Monte Carlo — What is Data Observability? The 5 Pillars — freshness, distribution, volume, schema, lineage.
Airbnb Engineering — Data Quality at Airbnb (Midas) — the certification programme that defined a generation of DQ.
Lyft Engineering — Amundsen — Lyft's open-source data discovery / catalog system, an influential early entrant in the space.

In-depth research material

OpenLineage — github.com/OpenLineage/OpenLineage — ~1.9k ★, the cross-tool lineage standard supported by Airflow, dbt, Spark.
DataHub — github.com/datahub-project/datahub — ~10k ★, LinkedIn's metadata + lineage + quality platform.
Amundsen — github.com/amundsen-io/amundsen — ~4.5k ★, Lyft's discovery engine.
OpenMetadata — github.com/open-metadata/OpenMetadata — ~6k ★, modern unified metadata.
Marquez — github.com/MarquezProject/marquez — ~1.9k ★, the OpenLineage reference store.
Great Expectations — github.com/great-expectations/great_expectations — ~10k ★, the canonical Python data validation framework.
Soda Core — github.com/sodadata/soda-core — ~1.9k ★, declarative YAML data checks.
Uber Engineering — DataMesh: How Uber transformed data governance — the petabyte-scale data-mesh implementation.
Netflix Tech Blog — Metacat: Making big data discoverable and meaningful — Netflix's federated metadata service.
Convoy / Chad Sanderson — Shifting Left on Data Quality (Data Council talk) — the contract-driven approach at Convoy.

Videos

Data Contracts: From Theory to Implementation — Chad Sanderson · 41 min — the canonical talk on data contracts.
Data Mesh: Principles and Logical Architecture — Zhamak Dehghani · 44 min — the creator of the data mesh paradigm explaining it.
Data Observability: The Next Frontier of Data Engineering — Barr Moses — Monte Carlo CEO · 27 min — the 5 pillars in a single talk.
Apache Atlas: Data Governance for Hadoop — Hortonworks — 32 min — the Hadoop-era origin of modern catalog tools.
Building a Self-Service Data Platform at Netflix — Joey Lynch — 47 min — how Netflix scaled discovery + quality across thousands of users.

LeetCode — Graph Valid Tree

Link: https://leetcode.com/problems/graph-valid-tree/
Difficulty: Medium
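Why this problem: Verify acyclic + connected. Lineage graphs are directed acyclic graphs rather than trees, but cycle detection and connectivity checks are the same building blocks you need to sanity-check them.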
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Name the 5 pillars of data observability.
Explain dataset- vs column-level lineage and one use-case for each.
Write 4 dbt tests covering existence, uniqueness, referential integrity, range.
Sketch a data contract YAML with schema, SLA, and owner.
Choose between dbt tests, Great Expectations, and a SaaS observability tool.
Solve graph-valid-tree using union-find or BFS; the same cycle and connectivity checks help sanity-check a lineage graph.
