data engineering · intermediate · 12m · 2026-06-09

Data Governance — Lineage, Quality, Catalogs, Contracts, Observability

Session 32 of the 48-session learning series.

Why this session matters

This is Session 32 of 48 in the DE track. Pipelines that run are easy. Pipelines that you trust six months later, that other teams trust, that regulators trust — that's governance. The data-mesh / lakehouse era turned this from a back-office checkbox into a first-class engineering discipline.

Agenda

  • Lineage — how, why, and what column-level lineage gives you
  • Data quality — tests, expectations, anomaly detection
  • Catalogs — DataHub, OpenMetadata, Unity, Glue
  • Data contracts — the producer-side schema agreement
  • Data observability — SLAs, SLOs, freshness, the 5 pillars

Deep dive

1. Why governance, why now

As a rough illustration: the volume of data in a typical company has tripled in 3 years, and the number of producers and consumers has grown about tenfold. The result is silent breakage everywhere.

Symptoms:

  • "Why are revenue numbers different in Dashboard A and B?"
  • "Production ran a query against users.email; the column was dropped 6 months ago."
  • "This ML model trained on a buggy join — we shipped predictions for a quarter."
  • "Auditor asks: which models touched PII X in 2025?"

Governance = answer those questions before they're asked.

2. The five pillars of data observability

| Pillar | What | Why |
|---|---|---|
| Freshness | Is data arriving on time? | Stale data ≠ data |
| Volume | Is row count in expected range? | Massive drop = upstream broke |
| Distribution | Are values in expected range? | NULL spike, type drift |
| Schema | Did columns/types change? | Silent break |
| Lineage | What's upstream and downstream? | Blast radius |

Tools (Monte Carlo, Bigeye, Soda, Anomalo) instrument these automatically. You can DIY 80% with dbt tests + alerts.
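To make the DIY route concrete, here is a minimal sketch of the freshness and volume pillars in plain Python. The function names, the 30-minute lag, and the 50% tolerance are illustrative assumptions, not taken from any of the tools above; in practice the inputs would come from warehouse metadata.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at, max_lag, now=None):
    """Freshness: True if the newest load is no older than max_lag."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_lag


def volume_in_range(row_count, recent_counts, tolerance=0.5):
    """Volume: True if row_count is within +/- tolerance of the recent mean."""
    baseline = sum(recent_counts) / len(recent_counts)
    return abs(row_count - baseline) <= tolerance * baseline


# Hypothetical readings for one dataset.
now = datetime(2026, 6, 9, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(datetime(2026, 6, 9, 11, 45, tzinfo=timezone.utc), timedelta(minutes=30), now)
stale = is_fresh(datetime(2026, 6, 9, 9, 0, tzinfo=timezone.utc), timedelta(minutes=30), now)
volume_ok = volume_in_range(9_800, [10_000, 10_200, 9_900])
volume_dropped = volume_in_range(1_200, [10_000, 10_200, 9_900])
```

Here `stale` and `volume_dropped` both come back `False`, which is the signal you would route to an alert.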

3. Lineage — the spine of governance

Lineage = the graph of "this column came from those columns, which came from that source".

Levels:

  • Dataset-level: orders_fact came from orders_raw + dim_customer.
  • Column-level: total_revenue in orders_fact = SUM(unit_price * qty) from orders_raw.
  • Transformation-level: the SQL statement that produced each column.

Use cases:

  • Impact analysis — "I want to drop this column; what breaks?"
  • Root cause — "Dashboard is wrong; trace back to source."
  • Audit — "Show me all data lineage touching PII."

Generate it from SQL parsing (sqlglot, the dbt graph), runtime hooks (OpenLineage), or the job orchestrator (Airflow, Dagster). Don't try to maintain it by hand.
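Once lineage exists as a graph, impact analysis is a plain graph traversal. The sketch below assumes a hypothetical in-memory edge map (upstream column to downstream columns); the dashboard and model names are invented for illustration, and a real setup would read the edges from a catalog or OpenLineage events.

```python
from collections import deque

# Hypothetical column-level lineage: upstream column -> downstream columns.
LINEAGE = {
    "orders_raw.unit_price": ["orders_fact.total_revenue"],
    "orders_raw.qty": ["orders_fact.total_revenue"],
    "orders_fact.total_revenue": [
        "revenue_dashboard.daily_revenue",
        "ml_features.customer_ltv",
    ],
    "ml_features.customer_ltv": ["churn_model.input"],
}


def downstream(column, lineage):
    """Breadth-first walk returning everything that depends on `column`."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# "I want to drop orders_raw.qty; what breaks?"
blast_radius = downstream("orders_raw.qty", LINEAGE)
```

Root-cause analysis is the same walk over the reversed edges.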

4. Data quality testing

The dbt-style minimum:

models:
  - name: orders_fact
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: total
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000

Tiers of tests:

  • Existence — column present, not null.
  • Uniqueness — primary keys.
  • Referential integrity — foreign keys exist in dim.
  • Range / value — accepted values, min/max.
  • Distribution — mean, p95, cardinality within X% of last week.
  • Business invariant: revenue = SUM(item.price * item.qty).

Wire to alerting. A failing test that nobody sees is theatre.
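The distribution tier is the one that built-in dbt tests do not cover out of the box, so here is a rough sketch of "within X% of last week" in Python. The helper names and the 10% threshold are assumptions for illustration; a production check would pull both samples from the warehouse rather than from lists.

```python
from statistics import mean


def within_pct(current, baseline, pct):
    """True if current is within pct percent of baseline."""
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) <= pct / 100


def distribution_check(this_week, last_week, pct=10):
    """Compare mean and cardinality of a column against last week's values."""
    return {
        "mean": within_pct(mean(this_week), mean(last_week), pct),
        "cardinality": within_pct(len(set(this_week)), len(set(last_week)), pct),
    }


last_week = [20, 25, 30, 35, 40]
steady = distribution_check([21, 24, 31, 36, 41], last_week)
drifted = distribution_check([20, 25, 30, 35, 400], last_week)
```

In the `drifted` case a single outlier pushes the mean well past the threshold while cardinality stays flat, which is why it helps to track more than one statistic.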

5. Great Expectations / Soda / dbt tests

  • dbt tests — run during transformations; SQL-based. Free, in-pipeline. Best for shape/value checks.
  • Great Expectations — Python-based; richer expectations; can run on any data source; heavier setup.
  • Soda — YAML-driven; cloud + OSS; nice UI for tracking violations.
  • Anomalo / Monte Carlo — ML-driven anomaly detection on top of metadata.

Start with dbt tests. Add GE/Soda if you need cross-source tests or need to test outside the warehouse.

6. Data contracts

The producer commits to the schema and semantics of the data it publishes. The consumer relies on that commitment.

A contract specifies:

  • Schema — column names, types, nullability.
  • Semantics — "amount is always in USD cents, integer".
  • SLAs — freshness, completeness.
  • Owners — who to ping when broken.
  • Change policy — additive only? deprecation window?

Tooling:

  • Define in YAML or JSON.
  • Producer's CI fails if a change violates the contract.
  • Versioned in git; reviewed like API contracts.

dataset: orders
owner: orders-team
schema:
  - name: order_id
    type: string
    description: uuid
sla:
  freshness: 30m
  completeness: 99.9%

Push the validation upstream. Test the producer's output, not the consumer's input.
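One way the producer's CI could enforce an additive-only change policy is to diff the old and new contract schemas. This is a minimal sketch under that assumption: the function name is made up, the `amount` and `currency` columns are hypothetical, and the schemas are shown as already-parsed lists rather than loaded from YAML.

```python
def contract_violations(old_schema, new_schema):
    """Return breaking changes under an additive-only change policy."""
    new_by_name = {col["name"]: col for col in new_schema}
    violations = []
    for col in old_schema:
        name = col["name"]
        if name not in new_by_name:
            violations.append(f"column removed: {name}")
        elif new_by_name[name]["type"] != col["type"]:
            new_type = new_by_name[name]["type"]
            violations.append(f"type changed: {name} {col['type']} -> {new_type}")
    return violations


current = [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "integer"},
]
# Adding a column is allowed.
additive = current + [{"name": "currency", "type": "string"}]
# Retyping one column and dropping another is not.
breaking = [{"name": "order_id", "type": "integer"}]

ok = contract_violations(current, additive)
problems = contract_violations(current, breaking)
```

A CI job would fail the build whenever the returned list is non-empty.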

7. Catalogs — discoverability layer

DataHub, OpenMetadata, Unity Catalog, AWS Glue Data Catalog. Capabilities:

  • Auto-crawl warehouses, lakes, dashboards.
  • Show schemas, sample data, top users.
  • Capture lineage (often via OpenLineage).
  • Glossary — business definitions tied to physical tables.
  • Tags (PII, PCI, restricted).
  • Reviews / ratings — humans annotate trust.

For a startup: DataHub or OpenMetadata (OSS, self-hostable). On Databricks: Unity Catalog. On AWS: Glue.

8. Owners and stewards

Every dataset needs an owner (the engineer who maintains it) and a steward (the business person who defines what it means). Without ownership:

  • No one fixes broken pipelines.
  • No one approves schema changes.
  • No one answers "what does customer_tier_v3 mean?"

Bake owner into the catalog. Page the owner when their dataset SLA breaks.

9. PII, classification, masking

Tag columns with sensitivity (Public, Internal, Confidential, PII, PCI). Enforce via:

  • Row-level security — different users see different rows.
  • Column-level security — analysts see hashed email, support sees plain.
  • Dynamic data masking — same query, different output by role.
  • Lineage propagation — if PII flows into a downstream table, downstream is automatically tagged.

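Modern warehouses (Snowflake, BigQuery, Databricks) support row-level security, column-level security, and dynamic masking natively; automatic tag propagation varies by platform. Use what your platform offers.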
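Where the platform does not propagate tags for you, the idea behind lineage propagation can be sketched in a few lines. The edge map and the column names other than users.email are hypothetical, and the loop simply repeats until no new tag is added, which is fine for small graphs.

```python
def propagate_tags(tags, lineage):
    """Push sensitivity tags downstream until nothing changes."""
    result = {col: set(col_tags) for col, col_tags in tags.items()}
    changed = True
    while changed:
        changed = False
        for upstream, children in lineage.items():
            for child in children:
                child_tags = result.setdefault(child, set())
                missing = result.get(upstream, set()) - child_tags
                if missing:
                    child_tags |= missing
                    changed = True
    return result


# Hypothetical lineage: upstream column -> downstream columns.
lineage = {
    "users.email": ["crm.contact_email"],
    "crm.contact_email": ["marketing.audience_email"],
    "orders_raw.qty": ["orders_fact.total_revenue"],
}
tagged = propagate_tags({"users.email": {"PII"}}, lineage)
```

After the call, both downstream email columns carry the PII tag and `orders_fact.total_revenue` carries none.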

10. Data observability vs data quality

  • Quality — a set of rules: "X must be > 0", "Y must not be NULL".
  • Observability — the running picture: what is happening across freshness, volume, schema, distribution, lineage.

You need both. Quality catches what you can articulate. Observability surfaces the surprises you didn't think to test for.

11. Incidents and postmortems for data

Treat data outages like service outages:

  • Severity levels.
  • On-call rotation for critical datasets.
  • War room / Slack channel during incident.
  • Postmortem with action items.
  • Public dashboard of dataset health.

Cultural shift. Data engineers learn from ops; ops doesn't tolerate "the dashboard is wrong, we'll fix it Monday".

12. Reality check

A first-90-days governance plan:

  1. Top-10 critical datasets — explicit owners, contracts, SLAs.
  2. dbt tests on every Gold model — basic shape + value.
  3. OpenLineage emission from Airflow/Dagster → DataHub.
  4. PII tags on user-facing tables.
  5. Slack alerts on failed tests, missed SLAs.
  6. Weekly review of broken-test trends.

You don't need a Monte Carlo subscription on day 1. You do need the discipline.

Reading material

Books:

  • Driving Data Quality with Data Contracts — Andrew Jones (2023; the canonical book on data contracts)
  • Fundamentals of Data Engineering — Joe Reis & Matt Housley (the governance / DataOps chapter)
  • Data Management at Scale, 2nd ed. — Piethein Strengholt (the enterprise data-mesh + governance book)
  • The DAMA Guide to the Data Management Body of Knowledge (DMBOK), 2nd ed. — DAMA International (the encyclopaedia; reference, not cover-to-cover)

LeetCode — Graph Valid Tree

Post-session checklist

By the end of this session you should be able to:

  • Name the 5 pillars of data observability.
  • Explain dataset- vs column-level lineage and one use-case for each.
  • Write 4 dbt tests covering existence, uniqueness, referential integrity, range.
  • Sketch a data contract YAML with schema, SLA, and owner.
  • Choose between dbt tests, Great Expectations, and a SaaS observability tool.
  • Solve graph-valid-tree — union-find or BFS for tree validation (lineage graph sanity check).
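For the last item, one possible union-find solution is sketched below. It treats the input the way the LeetCode problem does (n nodes labelled 0..n-1 and a list of undirected edges); the function name is an arbitrary choice.

```python
def valid_tree(n, edges):
    """True if the undirected graph is connected and has no cycle."""
    # A tree on n nodes has exactly n - 1 edges.
    if len(edges) != n - 1:
        return False

    parent = list(range(n))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # a and b were already connected: cycle
        parent[root_a] = root_b
    return True


is_tree = valid_tree(5, [[0, 1], [0, 2], [0, 3], [1, 4]])
has_cycle = valid_tree(5, [[0, 1], [1, 2], [2, 3], [1, 3], [1, 4]])
```

With exactly n - 1 edges and no cycle the graph must be connected, so no separate connectivity pass is needed.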
