data engineering · advanced · 12m · 2026-06-22

Day 24 — Data Governance, Lineage, Quality — Catalogs, Contracts, Observability

At scale, governance isn't bureaucracy; it's how you keep trust in your data. Lineage, quality contracts, and observability tools are now first-class platform c…

Data quality issues compound silently. A small change upstream becomes a fire downstream weeks later, in a board deck. Governance is the discipline of catching these issues early and being able to explain what broke and why.

🧠 Concept

Why it matters & the mental model.

1. The four pillars

  • Catalog: what data exists, who owns it, where it lives. (DataHub, Atlan, OpenMetadata, Unity Catalog, Glue.)
  • Lineage: column-level "where did this come from / what does it feed?". (OpenLineage, Marquez, dbt docs, Unity Catalog.)
  • Quality: assertions that data meets expectations. (Great Expectations, Soda, dbt tests, Deequ.)
  • Observability: continuous monitoring of freshness, volume, schema, distribution. (Monte Carlo, Bigeye, Acceldata, Sifflet.)

2. Data contracts

A contract is a machine-checkable schema + SLO that producers commit to. Includes:

  • Column names, types, constraints (PK, FK, not-null, accepted values).
  • Freshness SLO (e.g. data no more than 30 min stale).
  • Volume SLO (e.g. row count within ± 20% of the 7-day rolling average).
  • Owner + on-call.

Implemented via dbt model contracts (in YAML), Avro/Protobuf schemas (for streaming), or platform-native (Unity Catalog).
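
To make "machine-checkable" concrete, here is a minimal sketch of a contract check in plain Python. It is not how dbt or Unity Catalog implement contracts; the `Contract` class, `check_contract` function, and the example columns are hypothetical names chosen for illustration, and the defaults simply mirror the 30 min and ± 20% figures above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Contract:
    """Hypothetical contract: expected columns, not-null columns, and two SLOs."""
    columns: dict                                  # column name -> expected Python type
    not_null: set = field(default_factory=set)
    max_staleness: timedelta = timedelta(minutes=30)
    volume_tolerance: float = 0.20                 # +/- 20% vs the rolling average


def check_contract(contract, rows, last_loaded_at, rolling_avg_rows, now=None):
    """Return a list of violations; an empty list means the batch passes."""
    now = now or datetime.now(timezone.utc)
    violations = []

    # Schema and constraint checks, row by row.
    for i, row in enumerate(rows):
        for name, expected_type in contract.columns.items():
            if name not in row:
                violations.append(f"row {i}: missing column {name}")
                continue
            value = row[name]
            if value is None:
                if name in contract.not_null:
                    violations.append(f"row {i}: {name} is null")
            elif not isinstance(value, expected_type):
                violations.append(f"row {i}: {name} is not {expected_type.__name__}")

    # Freshness SLO: how long ago was the table last loaded?
    if now - last_loaded_at > contract.max_staleness:
        violations.append("freshness SLO breached")

    # Volume SLO: relative deviation from the rolling average row count.
    if rolling_avg_rows > 0:
        deviation = abs(len(rows) - rolling_avg_rows) / rolling_avg_rows
        if deviation > contract.volume_tolerance:
            violations.append("volume SLO breached")

    return violations
```

A producer could run a check like this before publishing a batch and refuse to publish when the returned list is non-empty.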

3. Lineage — table vs column

  • Table-level: "fact_orders feeds mart_finance, dashboard_ceo". Useful but coarse.
  • Column-level: "fact_orders.amount_usd flows from raw.stripe_charges.amount × dim_fx.rate". Game-changing for impact analysis: change one source field, see every dashboard affected.

Generate from compiled SQL (sqllineage, dbt), Spark plans (OpenLineage Spark integration), or query history (Bigeye, Atlan).
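
Once lineage edges exist, impact analysis is a graph walk. The sketch below assumes the edges have already been extracted by one of the tools above and are held in a plain dictionary; the `LINEAGE` mapping and the `mart_finance.revenue_usd` and `dashboard_ceo.revenue` column names are made up for illustration.

```python
from collections import deque

# Hypothetical column-level lineage edges: upstream column -> columns it feeds.
LINEAGE = {
    "raw.stripe_charges.amount": ["fact_orders.amount_usd"],
    "dim_fx.rate": ["fact_orders.amount_usd"],
    "fact_orders.amount_usd": ["mart_finance.revenue_usd"],
    "mart_finance.revenue_usd": ["dashboard_ceo.revenue"],
}


def downstream(column, edges=LINEAGE):
    """Breadth-first walk returning every column reachable from `column`."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:      # guard against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return seen
```

Calling `downstream("dim_fx.rate")` on this toy graph returns the fact column, the mart column, and the dashboard field, which is the "see every dashboard affected" question answered in a few lines.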

🛠 Deep Dive

Internals, code, architecture.

4. Quality checks — the test pyramid

  • Unit (model-level): dbt tests on each model (unique, not_null, relationships).
  • Cross-model (assertions): revenue = sum(orders.amount).
  • Volume / freshness / schema drift: per-table, automated.
  • Statistical (distribution drift): PSI / KS on key columns over time.
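
For the statistical layer, a rough sketch of PSI (Population Stability Index) in plain Python may help. It assumes equal-width bins derived from the baseline sample and a small floor `eps` to avoid taking the log of zero; both are illustrative choices, and observability tools may bin differently.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0    # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1    # clamp out-of-range values
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples score 0, and the score grows as the current distribution moves away from the baseline. The alerting threshold is left to the caller.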

5. Where to put the checks

  • Pre-merge (CI): contract validation, schema diff vs prod.
  • Pre-publish (in the pipeline, before exposing to BI): blocking checks.
  • Post-publish (observability): non-blocking, alerts only.
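
The difference between blocking and non-blocking checks can be sketched as follows. The `run_checks` function and `BlockingCheckFailed` exception are hypothetical names, not part of any orchestrator; checks are assumed to be simple callables that return True on success.

```python
class BlockingCheckFailed(Exception):
    """Raised when a pre-publish check fails, so the pipeline stops."""


def run_checks(rows, blocking, non_blocking, alert):
    """Run blocking checks first (raise on failure), then non-blocking ones (alert only)."""
    for name, check in blocking:
        if not check(rows):
            raise BlockingCheckFailed(name)    # nothing is published
    failed = [name for name, check in non_blocking if not check(rows)]
    for name in failed:
        alert(f"non-blocking check failed: {name}")    # data still ships
    return failed
```

A uniqueness check on the primary key is a typical candidate for the blocking list, while a soft volume expectation can sit in the non-blocking list and only raise an alert.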

6. Ownership & on-call

Every dataset has a single accountable team, and the catalog enforces this. Alerts route to that team's on-call channel, not to a shared #data-alerts firehose nobody reads.
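
A small sketch of owner-based routing, assuming ownership metadata has been exported from the catalog into two dictionaries. The team names, channel names, and `route_alert` function are invented for illustration.

```python
# Hypothetical catalog metadata: dataset -> owning team, team -> on-call channel.
OWNERS = {"fact_orders": "payments", "mart_finance": "finance-analytics"}
ONCALL_CHANNEL = {
    "payments": "#payments-oncall",
    "finance-analytics": "#finance-oncall",
}


def route_alert(dataset, owners=OWNERS, channels=ONCALL_CHANNEL):
    """Return the on-call channel for a dataset; fail loudly if ownership is missing."""
    team = owners.get(dataset)
    if team is None:
        # Deliberately no fallback to a shared channel: an unowned dataset is itself a bug.
        raise LookupError(f"{dataset} has no owner registered in the catalog")
    return channels[team]
```

The design choice worth noting is the missing fallback: raising on an unowned dataset surfaces the ownership gap instead of hiding it in a shared channel.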

7. Privacy & PII

  • Classify columns (PII, sensitive PII, confidential, public) at ingest.
  • Row-level + column-level security in the lake (Unity Catalog, Lake Formation, BigQuery policy tags).
  • Pseudonymisation / tokenisation for analytics on PII.
  • GDPR-style right-to-be-forgotten requires deletes that propagate to every downstream copy. This is easier in a lakehouse (for example, DELETE followed by VACUUM in Delta Lake) than in legacy warehouses.
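
One common way to pseudonymise a PII column is a keyed hash, sketched below with Python's standard `hmac` module. This is an illustrative approach rather than what any particular platform does; `pseudonymise` and `mask_row` are hypothetical helpers, and in practice the key would come from a secrets manager.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Deterministic keyed token: the same input and key always give the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_row(row: dict, pii_columns: set, secret_key: bytes) -> dict:
    """Replace PII columns with tokens; leave other columns and nulls untouched."""
    return {
        col: pseudonymise(str(val), secret_key)
        if col in pii_columns and val is not None
        else val
        for col, val in row.items()
    }
```

Because the token is deterministic, joins and distinct counts on the masked column still work, while the raw value cannot be recovered without the key.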

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Cost governance

Often forgotten. Tag every query and job with an owner, run cost reports, and set budgets per team. The biggest costs usually come from careless full scans of huge tables.
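
A per-team cost report can start as a simple aggregation over tagged job records. The sketch below assumes each job is a dictionary with an optional `owner` tag and a `cost_usd` field; both field names and the `over_budget` function are hypothetical.

```python
from collections import defaultdict


def over_budget(jobs, budgets):
    """Sum cost per owner tag and return the owners whose spend exceeds their budget."""
    spend = defaultdict(float)
    for job in jobs:
        # Untagged jobs are grouped under "untagged" so they stay visible in the report.
        spend[job.get("owner") or "untagged"] += job["cost_usd"]
    return {
        owner: total
        for owner, total in spend.items()
        if total > budgets.get(owner, 0.0)    # owners without a budget count as budget 0
    }
```

Treating a missing budget as zero means any untagged spend shows up immediately, which nudges teams to tag their jobs.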

9. Metrics layer / semantic layer

dbt Semantic Layer / Cube.dev / MetricFlow / LookML — define metrics once, expose to all BI tools with consistent definitions. Solves the "every dashboard disagrees on active_user" problem.
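
The "define once" idea can be shown without any of these tools. The sketch below is plain Python, not the API of dbt, Cube, or LookML; the 30-day window is an assumed definition of an active user, and `METRICS` and `query_metric` are hypothetical names.

```python
from datetime import date, timedelta


def active_users(events, as_of, window_days=30):
    """Assumed definition: users with at least one event in the last `window_days` days."""
    cutoff = as_of - timedelta(days=window_days)
    return {e["user_id"] for e in events if cutoff < e["day"] <= as_of}


# Single registry of metric definitions shared by every consumer.
METRICS = {
    "active_user_count": lambda events, as_of: len(active_users(events, as_of)),
}


def query_metric(name, events, as_of):
    """Dashboards, notebooks, and exports all resolve metrics through this lookup."""
    return METRICS[name](events, as_of)
```

If the definition of an active user changes, it changes in one place and every consumer picks it up.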

10. The maturity ladder

  1. Reactive: someone notices the dashboard is wrong.
  2. Proactive tests: dbt tests block bad data.
  3. Observability: anomaly detection catches surprises.
  4. Contracts + lineage: upstream changes are negotiated and traced.
  5. Data product mindset: each dataset is a product with SLOs, owners, docs.

11. What to take away

"How would you stop bad data from reaching a CEO dashboard?" Strong answers: contracts on producers, dbt tests in CI, observability on every model, blocking quality gates pre-publish, lineage to identify owners. Bonus: SLOs and a paging policy.
