Search Tech Journey

Find topics, journeys and posts

back to blog
data engineeringadvanced 12m2026-06-20

Day 22 — Data Modelling — Dimensional, Data Vault, OBT for the Lakehouse Era

Storage is cheap, but a bad model rots a platform from inside. Knowing when to dimensional-model, when to use Data Vault, and when to flat-OBT determines whethe…

Data modelling decides whether a warehouse evolves or calcifies. Three dominant patterns, each optimised for a different stakeholder set.

🧠 Concept

Why it matters & the mental model.

1. Star schema (Kimball)

  • Fact table: events / measurements (one row per order line).
  • Dimensions: descriptive context (customer, product, store, date).
  • Joins: fact ↔ dim on surrogate keys.
  • Pros: intuitive for BI tools, slowly-changing-dim (SCD2) handles history cleanly, well-understood.
  • Cons: refactoring requires rebuilds, joins on huge facts can be expensive (mitigated by lakehouse engines + columnar formats).

2. SCD types — handling change in dimensions

  • Type 0: never changes (date_of_birth).
  • Type 1: overwrite (typo fix). History lost.
  • Type 2: new row with valid_from / valid_to / is_current. Standard for "what did we know about customer X at the time of order Y?".
  • Type 3: previous value column. Rarely used.
  • Type 6: hybrid (current + history). Powerful, expensive.

3. Data Vault 2.0

Hub/Link/Satellite model designed for agile, source-driven, fully auditable warehouses:

  • Hub: business keys + load metadata (customer_id, source, loaded_at).
  • Link: relationships between hubs (order links customer, product, store).
  • Satellite: descriptive attributes that change over time (separate sat per source system). Pros: schema evolves additively (just add a satellite), parallel loads, full audit trail. Cons: more joins, BI layer needs star-schema views on top.

Best when: many source systems, regulatory requirements, frequent schema change.

🛠 Deep Dive

Internals, code, architecture.

4. One Big Table (OBT) / wide tables

  • Pre-join everything into wide tables per consumption pattern.
  • Pros: zero joins at query time, perfect for column-store engines (BigQuery / Snowflake / Iceberg + Trino), great for ML training tables.
  • Cons: redundancy, harder updates, taxonomy drift across multiple OBTs.

Best when: read-heavy analytics, ML feature tables, dashboards with predictable shape.

5. Mart layer / consumption tier

Regardless of the underlying model, a consumption layer of wide tables / cubes for BI is almost always worth it. dbt patterns: staging → intermediate → marts.

6. Slowly-changing relationships

Customer changes store; how do you attribute past orders? SCD2 on the link captures it. Often skipped, leading to silent revisionist history.

7. Surrogate keys vs natural keys

  • Natural (email, sku): readable, but changes break references.
  • Surrogate (auto int / hash of natural): stable, opaque, joins are tighter.
  • Hash key: deterministic surrogate (hash of natural + source) → idempotent loads across runs. Standard in Data Vault and modern dbt.

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Time, the universal dimension

Always have a dim_date (and dim_time if intraday). Carries fiscal year, quarter, weekday, holiday flags. Joins make calendar logic trivial in SQL.

9. dbt patterns

  • staging/ raw → renamed, casted, deduped.
  • intermediate/ business logic, light joins.
  • marts/ star or OBT, exposed to BI.
  • Tests: unique, not_null, relationships, accepted_values. Add custom singular tests for invariants ("daily revenue can never be negative").

10. Modern hybrid

Most serious shops in 2025 run:

  • Raw → Bronze (lakehouse, schema-on-read).
  • Cleaned → Silver (typed Delta/Iceberg).
  • Curated → Gold (star + OBT for consumption).
  • Feature store as a parallel branch of Gold for ML.

11. What to take away

"How do you model X?" Strong answers state the consumption pattern first, choose the model, name SCD strategy, and explain how the model evolves when a new source arrives. Bonus: cite dbt or DBT-like patterns.

Key points

    Resources

    Practice Problem: Course Schedule (Medium)