Day 24 — Data Governance, Lineage, Quality — Catalogs, Contracts, Observability
At scale, governance isn't bureaucracy; it's how you keep trust in your data. Lineage, quality contracts, and observability tools are now first-class platform c…
Data quality issues compound silently. A nudge upstream becomes a fire downstream, surfacing weeks later as a wrong number in a board deck. Governance is the discipline of catching these issues early and explaining what broke and why.
🧠 Concept
Why it matters & the mental model.
1. The four pillars
- Catalog: what data exists, who owns it, where it lives. (DataHub, Atlan, OpenMetadata, Unity Catalog, Glue.)
- Lineage: column-level "where did this come from / what does it feed?". (OpenLineage, Marquez, dbt docs, Unity Catalog.)
- Quality: assertions that data meets expectations. (Great Expectations, Soda, dbt tests, Deequ.)
- Observability: continuous monitoring of freshness, volume, schema, distribution. (Monte Carlo, Bigeye, Acceldata, Sifflet.)
2. Data contracts
A contract is a machine-checkable schema + SLO that producers commit to. Includes:
- Column names, types, constraints (PK, FK, not-null, accepted values).
- Freshness SLO (e.g. data is never more than 30 min stale).
- Volume SLO (e.g. row count stays within ±20% of the 7-day rolling baseline).
- Owner + on-call.
Contracts can be implemented via dbt model contracts (in YAML), Avro/Protobuf schemas (for streaming), or platform-native features (Unity Catalog).
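To make "machine-checkable" concrete, here is a minimal sketch in plain Python rather than any of the tools above. The `Contract` class, the `check_contract` function, and the sample rows are all hypothetical names invented for illustration; a real setup would express the same rules in dbt YAML or a schema registry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Contract:
    """Hypothetical contract: schema plus freshness and volume SLOs."""
    columns: dict             # column name -> expected Python type
    not_null: tuple           # columns that must never be None
    max_staleness: timedelta  # freshness SLO
    volume_tolerance: float   # allowed relative deviation from the baseline
    owner: str


def check_contract(contract, rows, loaded_at, baseline_rows, now):
    """Return a list of human-readable violations (an empty list means pass)."""
    violations = []
    for i, row in enumerate(rows):
        missing = set(contract.columns) - set(row)
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for name, expected in contract.columns.items():
            value = row[name]
            if value is None:
                if name in contract.not_null:
                    violations.append(f"row {i}: {name} is null")
            elif not isinstance(value, expected):
                violations.append(f"row {i}: {name} is not {expected.__name__}")
    if now - loaded_at > contract.max_staleness:
        violations.append("freshness SLO breached")
    if baseline_rows:
        deviation = abs(len(rows) - baseline_rows) / baseline_rows
        if deviation > contract.volume_tolerance:
            violations.append("volume SLO breached")
    return violations


orders_contract = Contract(
    columns={"order_id": int, "amount_usd": float},
    not_null=("order_id",),
    max_staleness=timedelta(minutes=30),
    volume_tolerance=0.20,
    owner="team-orders",
)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
good = [{"order_id": 1, "amount_usd": 9.5}, {"order_id": 2, "amount_usd": None}]
bad = [{"order_id": None, "amount_usd": "9.5"}]

ok = check_contract(orders_contract, good, now - timedelta(minutes=5), 2, now)
failed = check_contract(orders_contract, bad, now - timedelta(hours=2), 2, now)
```

Here `ok` is empty, while `failed` collects one violation per broken rule (null key, wrong type, stale load, volume drop), which is the kind of output a producer can act on.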
3. Lineage — table vs column
- Table-level: "fact_orders feeds mart_finance, dashboard_ceo". Useful but coarse.
- Column-level: "fact_orders.amount_usd flows from raw.stripe_charges.amount × dim_fx.rate". Game-changing for impact analysis: change one source field, see every dashboard affected.
Generate lineage from compiled SQL (sqllineage, dbt), Spark plans (OpenLineage Spark integration), or query history (Bigeye, Atlan).
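Once column-level edges exist, impact analysis is a graph walk. The sketch below assumes a hand-written edge map; the `mart_finance.revenue_usd` and `dashboard_ceo.revenue` column names are made up for illustration, and a real system would load edges from a lineage store rather than a dict.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> columns derived from it.
EDGES = {
    "raw.stripe_charges.amount": ["fact_orders.amount_usd"],
    "dim_fx.rate": ["fact_orders.amount_usd"],
    "fact_orders.amount_usd": ["mart_finance.revenue_usd"],
    "mart_finance.revenue_usd": ["dashboard_ceo.revenue"],
}


def downstream(column, edges):
    """Breadth-first walk returning every column affected by a change to `column`."""
    seen, queue = set(), deque([column])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream("dim_fx.rate", EDGES)
```

Changing `dim_fx.rate` reaches the fact column, the finance mart, and the CEO dashboard, which is exactly the list of owners to notify before the change ships.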
🛠 Deep Dive
Internals, code, architecture.
4. Quality checks — the test pyramid
- Unit (model-level): dbt tests on each model (unique, not_null, relationships).
- Cross-model (assertions): revenue = sum(orders.amount).
- Volume / freshness / schema drift: per-table, automated.
- Statistical (distribution drift): Population Stability Index (PSI) or Kolmogorov-Smirnov (KS) test on key columns over time.
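As an illustration of the statistical tier, here is a small standard-library sketch of PSI over fixed bins. The bin edges, the sample values, and the floor of `1e-6` for empty bins are assumptions chosen for the example, not recommended settings.

```python
import math


def psi(expected, actual, edges):
    """Population Stability Index of `actual` vs `expected` over fixed bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [10, 20, 30, 40, 50, 60, 70, 80]
shifted = [v + 40 for v in baseline]   # simulated drift
edges = [25, 50, 75]

no_drift = psi(baseline, baseline, edges)
drift = psi(baseline, shifted, edges)
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift; the right threshold for a given column is a judgment call.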
5. Where to put the checks
- Pre-merge (CI): contract validation, schema diff vs prod.
- Pre-publish (in the pipeline, before exposing to BI): blocking checks.
- Post-publish (observability): non-blocking, alerts only.
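The difference between blocking and non-blocking checks can be sketched as follows. The `run_checks` helper, the exception name, and the two sample checks are hypothetical; the point is only that a failed blocking check stops the publish step, while a failed non-blocking check just raises an alert.

```python
class BlockingCheckFailed(Exception):
    """Raised when a pre-publish check fails, so the pipeline stops before BI sees the data."""


def run_checks(rows, blocking, non_blocking, alert):
    """Run blocking checks first (raise on failure), then non-blocking ones (alert only)."""
    for name, check in blocking:
        if not check(rows):
            raise BlockingCheckFailed(name)
    for name, check in non_blocking:
        if not check(rows):
            alert(f"warning: {name}")
    return rows  # safe to publish


rows = [{"order_id": 1, "amount_usd": 10.0}, {"order_id": 2, "amount_usd": 12.0}]
blocking = [("order_id is unique", lambda r: len({x["order_id"] for x in r}) == len(r))]
non_blocking = [("at least 100 rows", lambda r: len(r) >= 100)]

alerts = []
published = run_checks(rows, blocking, non_blocking, alerts.append)

# A duplicated key trips the blocking check and nothing is published.
try:
    run_checks(rows + rows[:1], blocking, non_blocking, alerts.append)
    blocked = False
except BlockingCheckFailed:
    blocked = True
```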
6. Ownership & on-call
Every dataset has a single accountable team. Catalog enforces this. Alerts route to the team's on-call channel, not a shared #data-alerts firehose nobody reads.
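A minimal sketch of ownership-based routing, assuming the catalog can be read as two lookups. The dataset, team, and channel names below are invented for illustration.

```python
# Hypothetical catalog entries: dataset -> owning team, and team -> on-call channel.
OWNERS = {"fact_orders": "team-orders", "mart_finance": "team-finance"}
ONCALL_CHANNEL = {"team-orders": "#orders-oncall", "team-finance": "#finance-oncall"}


def route_alert(dataset):
    """Return the owning team's on-call channel; fail loudly if a dataset has no owner."""
    team = OWNERS.get(dataset)
    if team is None:
        raise LookupError(f"{dataset} has no registered owner")
    return ONCALL_CHANNEL[team]


channel = route_alert("fact_orders")
```

Raising on an unowned dataset, instead of falling back to a shared channel, keeps the "single accountable team" rule enforceable.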
7. Privacy & PII
- Classify columns (PII, sensitive PII, confidential, public) at ingest.
- Row-level + column-level security in the lake (Unity Catalog, Lake Formation, BigQuery policy tags).
- Pseudonymisation / tokenisation for analytics on PII.
- GDPR-style right-to-be-forgotten requires deletes that propagate to every copy of the data. This is easier in a lakehouse (DELETE, then VACUUM to physically remove the old files) than in legacy warehouses.
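One common way to pseudonymise an identifier is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the key shown is a placeholder and the email addresses are made up. This illustrates the idea only and is not a compliance recipe.

```python
import hashlib
import hmac


def pseudonymise(value, key):
    """Keyed hash: the same input always maps to the same token, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


SECRET_KEY = b"example-key-load-from-a-secret-manager"  # placeholder, not a real key

token_a = pseudonymise("alice@example.com", SECRET_KEY)
token_b = pseudonymise("alice@example.com", SECRET_KEY)
token_c = pseudonymise("bob@example.com", SECRET_KEY)
```

Because the mapping is deterministic, analysts can still count and join on the token, while anyone without the key cannot recompute tokens from candidate inputs.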
🚀 In Practice
Trade-offs, exercises, what to ship today.
8. Cost governance
Often forgotten. Tag every query and job with an owner, run cost reports, and set budgets per team. The biggest cost monsters are usually careless full scans on huge tables.
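A cost report of this kind can be as simple as a group-by over tagged query logs. The log entries, team names, and budget figures below are invented for illustration; real numbers would come from the warehouse's billing or query-history tables.

```python
from collections import defaultdict

# Hypothetical query log: each entry carries an owner tag and a cost in USD.
QUERY_LOG = [
    {"owner": "finance", "cost_usd": 120.0},
    {"owner": "growth", "cost_usd": 40.0},
    {"owner": "finance", "cost_usd": 95.5},
    {"owner": None, "cost_usd": 300.0},  # untagged job
]
BUDGETS_USD = {"finance": 200.0, "growth": 100.0}


def cost_report(log, budgets):
    """Sum cost per owner and list owners over budget; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for entry in log:
        totals[entry["owner"] or "UNTAGGED"] += entry["cost_usd"]
    # Owners without a budget (including UNTAGGED) are treated as having a budget of zero.
    over = sorted(o for o, total in totals.items() if total > budgets.get(o, 0.0))
    return dict(totals), over


totals, over_budget = cost_report(QUERY_LOG, BUDGETS_USD)
```

Treating untagged spend as over budget by default makes missing owner tags visible instead of letting them hide in the total.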
9. Metrics layer / semantic layer
dbt Semantic Layer / Cube.dev / MetricFlow / LookML — define metrics once, expose to all BI tools with consistent definitions. Solves the "every dashboard disagrees on active_user" problem.
10. The maturity ladder
- Reactive: someone notices the dashboard is wrong.
- Proactive tests: dbt tests block bad data.
- Observability: anomaly detection catches surprises.
- Contracts + lineage: upstream changes are negotiated and traced.
- Data product mindset: each dataset is a product with SLOs, owners, docs.
11. What to take away
"How would you stop bad data from reaching a CEO dashboard?" Strong answers: contracts on producers, dbt tests in CI, observability on every model, blocking quality gates pre-publish, lineage to identify owners. Bonus: SLOs and a paging policy.
Resources
- 🎥 Monte Carlo — Data Observability 101
- 📖 dbt — Data contracts
- 📖 OpenLineage spec
- 📖 Great Expectations — Getting started
Practice Problem: Implement Trie (Medium)