data engineering · advanced · 12m · 2026-06-22

Day 24 — Data Governance, Lineage, Quality — Catalogs, Contracts, Observability

At scale, governance isn't bureaucracy; it's how you keep trust in your data. Lineage, quality contracts, and observability tools are now first-class platform c…

Data quality issues compound silently. A small change upstream surfaces weeks later as a wrong number in a board deck. Governance is the discipline of catching these issues early and being able to explain what broke and why.

1. The four pillars

Catalog: what data exists, who owns it, where it lives. (DataHub, Atlan, OpenMetadata, Unity Catalog, Glue.)
Lineage: "where did this come from / what does it feed?", ideally down to column level. (OpenLineage, Marquez, dbt docs, Unity Catalog.)
Quality: assertions that data meets expectations. (Great Expectations, Soda, dbt tests, Deequ.)
Observability: continuous monitoring of freshness, volume, schema, distribution. (Monte Carlo, Bigeye, Acceldata, Sifflet.)

2. Data contracts

A contract is a machine-checkable schema plus SLOs that producers commit to. It includes:

Column names, types, constraints (PK, FK, not-null, accepted values).
Freshness SLO (e.g. data is at most 30 min stale).
Volume SLO (e.g. row count within ± 20% of the 7-day rolling baseline).
Owner + on-call.

Implemented via dbt model contracts (in YAML), Avro/Protobuf schemas (for streaming), or platform-native (Unity Catalog).
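
As an illustration of what "machine-checkable" means here, the following is a minimal sketch of a contract check in plain Python. It is not how dbt, Avro, or Unity Catalog implement contracts; the `Contract` class, `check_contract` function, and the `orders_contract` example are hypothetical names chosen for this sketch, and only types, not-null, freshness, and volume are covered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Contract:
    """Hypothetical contract: schema plus freshness and volume SLOs."""
    columns: dict             # column name -> expected Python type
    not_null: frozenset       # columns that may never be None
    max_staleness: timedelta  # freshness SLO
    volume_tolerance: float   # allowed fractional deviation from the baseline


def check_contract(contract, rows, last_loaded_at, baseline_rows, now):
    """Return a list of human-readable violations (empty list means pass)."""
    violations = []
    for i, row in enumerate(rows):
        missing = set(contract.columns) - set(row)
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, expected in contract.columns.items():
            value = row[col]
            if value is None:
                if col in contract.not_null:
                    violations.append(f"row {i}: {col} is null")
            elif not isinstance(value, expected):
                violations.append(f"row {i}: {col} expected {expected.__name__}")
    # Freshness SLO: time since the last successful load.
    if now - last_loaded_at > contract.max_staleness:
        violations.append("freshness SLO breached")
    # Volume SLO: row count compared with a rolling baseline supplied by the caller.
    if baseline_rows > 0:
        deviation = abs(len(rows) - baseline_rows) / baseline_rows
        if deviation > contract.volume_tolerance:
            violations.append("volume SLO breached")
    return violations


# Example values below are made up for illustration.
orders_contract = Contract(
    columns={"order_id": int, "amount_usd": float},
    not_null=frozenset({"order_id"}),
    max_staleness=timedelta(minutes=30),
    volume_tolerance=0.20,
)
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
good_rows = [{"order_id": 1, "amount_usd": 9.5}, {"order_id": 2, "amount_usd": None}]
bad_rows = [{"order_id": None, "amount_usd": "x"}]
```

Returning a list of violations rather than raising on the first one lets the caller decide whether the result blocks a deploy, fails a pipeline step, or only pages the owner.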

3. Lineage — table vs column

Table-level: "fact_orders feeds mart_finance, dashboard_ceo". Useful but coarse.
Column-level: "fact_orders.amount_usd flows from raw.stripe_charges.amount × dim_fx.rate". Game-changing for impact analysis: change one source field, see every dashboard affected.

Generate from compiled SQL (sqllineage, dbt), Spark plans (OpenLineage Spark integration), or query history (Bigeye, Atlan).
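
Once column-level edges exist, impact analysis is a graph traversal. The sketch below assumes the lineage has already been extracted into a plain dictionary; the `LINEAGE` edges and the `mart_finance.revenue_usd` and `dashboard_ceo.revenue` column names are hypothetical extensions of the example above, and `downstream` is an illustrative helper, not an API of any lineage tool.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> columns derived from it.
LINEAGE = {
    "raw.stripe_charges.amount": ["fact_orders.amount_usd"],
    "dim_fx.rate": ["fact_orders.amount_usd"],
    "fact_orders.amount_usd": ["mart_finance.revenue_usd"],
    "mart_finance.revenue_usd": ["dashboard_ceo.revenue"],
}


def downstream(column, edges):
    """Breadth-first walk returning every column reachable from `column`."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # guards against cycles and duplicate paths
                seen.add(child)
                queue.append(child)
    return sorted(seen)


affected = downstream("raw.stripe_charges.amount", LINEAGE)
```

Here `affected` lists the fact, mart, and dashboard columns that a change to `raw.stripe_charges.amount` would touch, which is the list to review before merging that change.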

4. Quality checks — the test pyramid

Unit (model-level): dbt tests on each model (unique, not_null, relationships).
Cross-model (assertions): revenue = sum(orders.amount).
Volume / freshness / schema drift: per-table, automated.
Statistical (distribution drift): Population Stability Index (PSI) or a Kolmogorov-Smirnov (KS) test on key columns, comparing the current window against a baseline.
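
To make the statistical tier concrete, here is a minimal sketch of PSI in plain Python, assuming equal-width bins taken from the baseline sample. The `psi` function, its `bins` and `eps` parameters, and the sample data are illustrative choices; production tools typically use quantile bins and more careful handling of empty buckets.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant baselines

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in baseline]  # same shape, moved to the right
```

Identical samples score 0, and the score grows as the two histograms diverge. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but the alert threshold is best tuned per column.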

5. Where to put the checks

Pre-merge (CI): contract validation, schema diff vs prod.
Pre-publish (in the pipeline, before exposing to BI): blocking checks.
Post-publish (observability): non-blocking, alerts only.
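
The difference between a blocking gate and a non-blocking monitor can be sketched as one runner with a flag. Everything here (`BlockingCheckFailed`, `run_checks`, the `CHECKS` list, the `amount_usd` column) is a hypothetical illustration rather than the interface of any particular orchestrator or testing tool.

```python
class BlockingCheckFailed(Exception):
    """Raised when a pre-publish check fails and the publish must stop."""


def run_checks(checks, rows, blocking):
    """Run named checks; raise if blocking, otherwise return failed names."""
    failures = [name for name, check in checks if not check(rows)]
    if failures and blocking:
        # Pre-publish: stop the pipeline before BI consumers can see the data.
        raise BlockingCheckFailed(", ".join(failures))
    # Post-publish: hand the failures to alerting and let the data stay visible.
    return failures


CHECKS = [
    ("non_empty", lambda rows: len(rows) > 0),
    ("amount_non_negative", lambda rows: all(r["amount_usd"] >= 0 for r in rows)),
]

suspect_rows = [{"amount_usd": -1.0}]
alerts = run_checks(CHECKS, suspect_rows, blocking=False)
```

Sharing the check definitions across both stages keeps the pre-publish gate and the post-publish monitor from drifting apart; only the reaction to a failure differs.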

6. Ownership & on-call

Every dataset has a single accountable team. Catalog enforces this. Alerts route to the team's on-call channel, not a shared #data-alerts firehose nobody reads.

7. Privacy & PII

Classify columns (PII, sensitive PII, confidential, public) at ingest.
Row-level + column-level security in the lake (Unity Catalog, Lake Formation, BigQuery policy tags).
Pseudonymisation / tokenisation for analytics on PII.
GDPR-style right-to-be-forgotten requires deletes that propagate to every downstream copy. This is easier in a lakehouse (DELETE, then VACUUM to purge the old data files) than in legacy warehouses.
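
One common way to pseudonymise an identifier for analytics is a keyed hash, so the same input always maps to the same token and joins still work. The sketch below uses HMAC-SHA256 from the Python standard library; the `pseudonymise` function, the normalisation rule, and the example key are assumptions for illustration. In practice the key would come from a secrets manager, and a keyed hash is one option alongside token vaults and format-preserving encryption.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Map a PII value to a stable token using HMAC-SHA256."""
    # Normalise first so trivially different spellings get the same token.
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


# Hypothetical key for the example only; never hard-code a real key.
EXAMPLE_KEY = b"example-key-do-not-use"
token = pseudonymise(" Alice@Example.com ", EXAMPLE_KEY)
```

Because the hash is keyed, someone without the key cannot rebuild the mapping by hashing a list of known email addresses, which is the main weakness of a plain unsalted hash.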

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Cost governance

Cost governance is often forgotten. Tag every query and job with an owner, run cost reports, and set budgets per team. The biggest costs usually come from careless full scans of huge tables.
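
A per-team cost report can start as a simple aggregation over tagged job records. The sketch below assumes each job record already carries an `owner` tag and a `cost_usd` figure exported from the warehouse's billing or query-history data; the field names, the `over_budget` helper, and the sample numbers are all hypothetical.

```python
from collections import defaultdict


def over_budget(jobs, budgets):
    """Sum cost per owner and return the owners whose spend exceeds budget."""
    spend = defaultdict(float)
    for job in jobs:
        # Untagged jobs are grouped together so they show up in the report.
        spend[job.get("owner", "untagged")] += job["cost_usd"]
    # Owners without a budget entry are treated as having a budget of zero.
    return {
        owner: total
        for owner, total in spend.items()
        if total > budgets.get(owner, 0.0)
    }


jobs = [
    {"owner": "growth", "cost_usd": 120.0},
    {"owner": "growth", "cost_usd": 45.5},
    {"owner": "finance", "cost_usd": 30.0},
    {"cost_usd": 12.0},  # missing owner tag
]
budgets = {"growth": 150.0, "finance": 100.0}
flagged = over_budget(jobs, budgets)
```

Treating untagged spend as over budget by default is a deliberate choice in this sketch: it makes missing tags visible instead of letting them hide in a shared bucket.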

9. Metrics layer / semantic layer

dbt Semantic Layer / Cube.dev / MetricFlow / LookML — define metrics once, expose to all BI tools with consistent definitions. Solves the "every dashboard disagrees on active_user" problem.

10. The maturity ladder

Reactive: someone notices the dashboard is wrong.
Proactive tests: dbt tests block bad data.
Observability: anomaly detection catches surprises.
Contracts + lineage: upstream changes are negotiated and traced.
Data product mindset: each dataset is a product with SLOs, owners, docs.

"How would you stop bad data from reaching a CEO dashboard?" Strong answers: contracts on producers, dbt tests in CI, observability on every model, blocking quality gates pre-publish, lineage to identify owners. Bonus: SLOs and a paging policy.
