Search Tech Journey

Find topics, journeys and posts

back to blog
data engineeringintermediate 12m2026-06-09

Data Modelling — Dimensional, Data Vault, OBT for the Lakehouse

Session 28 of the 48-session learning series.

Date: Wed, 2026-07-01 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 22 · Est. read: 2 h

Why this session matters

This is Session 28 of 48 in the Data Engineering track. The schema you pick will outlive your job — fact tables get queried for years. Choosing between dimensional, data vault, and one-big-table is one of the highest-leverage decisions a DE makes. Get it wrong and every downstream team pays.

Agenda

  • Kimball dimensional modelling — facts, dimensions, star/snowflake
  • Inmon vs Kimball vs Data Vault — when each wins
  • Data Vault 2.0 — hubs, links, satellites, the audit story
  • One Big Table (OBT) — when denormalisation beats joins
  • Modern lakehouse hybrid — Bronze/Silver/Gold + dbt

Pre-read (skim before the session)

Deep dive

1. Why modelling matters

A bad model causes:

  • Queries that scan TBs because they can't push down filters.
  • Reports that disagree because the definition lives in 5 places.
  • New analysts who can't find anything for weeks.
  • Cost overruns because every join shuffles.

A good model is queryable (fewer tables, intuitive joins), correct (one source of truth per concept), and evolvable (you can change without rewriting downstream).

2. Kimball dimensional (star schema)

The classic. Big fact table at centre, small dimension tables around it.

              [ dim_customer ]
                     │
[ dim_product ] ─ [ fact_orders ] ─ [ dim_date ]
                     │
              [ dim_store ]
  • Fact — measurements (revenue, qty, click). One row per event. Foreign keys to dimensions. Mostly numeric.
  • Dimension — descriptive context (customer attributes, product hierarchy, date breakdown). Joined to filter/group facts.

Star (one dimension table per axis) vs snowflake (dimensions normalised). Default to star — joins are cheaper in MPP/columnar engines than the storage savings of snowflake.

3. Slowly Changing Dimensions (SCD)

Customer attributes change. How do you preserve history?

TypeBehaviourWhen
SCD 0Never changeBirth date
SCD 1OverwriteFixing typos
SCD 2New row + valid_from/valid_toNeed historical accuracy (most analytical needs)
SCD 3Add previous_value columnComparing then-vs-now once
SCD 4History in separate tableMassive dimensions
SCD 6Hybrid 1+2+3Both current and historical

SCD 2 is the workhorse. Surrogate key (system-generated id) + business key + validity range.

4. Inmon vs Kimball

  • Inmon — top-down. Build a normalised 3NF enterprise data warehouse. Then build data marts (often dimensional) for departments.
  • Kimball — bottom-up. Build dimensional data marts directly, conformed via shared dimensions ("conformed bus").

Inmon: heavier upfront, more rigorous, slower. Kimball: faster delivery, occasional schema drift across marts.

Modern reality: most teams are Kimball-ish. The lakehouse Bronze layer is kind of Inmon (3NF-ish raw). Gold is dimensional.

5. Data Vault 2.0

Built for audit-heavy, regulated environments (banks, healthcare, government). Three artefact types:

  • Hub — business keys + load metadata. hub_customer(customer_bk, load_dts, source).
  • Link — associative; relates hubs. link_customer_order(customer_hk, order_hk, ...).
  • Satellite — descriptive attributes + history. sat_customer_demo(customer_hk, load_dts, name, city, ...).

Pros:

  • Insert-only — no updates → no destructive history loss.
  • Source-traceable — every record carries source + timestamp.
  • Highly parallel loads.
  • Schema-evolvable — new attribute = new satellite, no impact on existing.

Cons:

  • Verbose; 3–4x more tables than dimensional.
  • Queries need many joins → BI layer on top usually compiles down to dimensional.

Use when: regulator audits, many sources of same entity, frequent schema evolution. Skip for: small analytics shops, fast-moving startup.

6. One Big Table (OBT)

Single denormalised wide table. All attributes joined and flattened.

orders_wide(order_id, ts, customer_id, customer_name, customer_country,
            product_id, product_name, product_category, ...)

Pros:

  • Zero joins — fastest possible query.
  • Simplest mental model.
  • Plays nice with columnar pruning (no joins to break dim filters).

Cons:

  • Duplication — customer name in every row of every order.
  • Update nightmare — name change requires updating every row.
  • Storage cost.

Modern OBT thrives because:

  • Compression on columnar (Parquet) makes duplication cheap.
  • Compute is much more expensive than storage on cloud.
  • Many BI tools (Looker, Mode) flatten anyway.

Increasingly popular in lakehouse Gold layer. dbt + BigQuery / Snowflake users love it.

7. Bronze / Silver / Gold (lakehouse pattern)

Modern lakehouse layout:

LayerContentModelling
BronzeRaw landed data, immutableSource schema; minimal transform
SilverCleaned, deduplicated, enrichedNormalised-ish; one row per business event
GoldBusiness-ready, denormalisedDimensional star OR OBT; per use-case

Decoupling layers lets you fix bugs in Gold without re-ingesting; rebuild Gold from Silver in hours.

8. dbt as the connective tissue

dbt = SQL + Jinja + DAG + tests. Models live in .sql files; dependencies inferred from {{ ref('upstream') }}.

Conventions that scale:

  • models/staging/ — Bronze → Silver, 1:1 with source.
  • models/intermediate/ — joins, mid-level transformations.
  • models/marts/ — Gold; per business area (finance, marketing, product).
  • Tests in schema.yml: not_null, unique, accepted_values, custom.
  • Snapshots for SCD 2.

9. Surrogate vs natural keys

  • Natural key (BK) — value with meaning (order_id, email). Can change, can be reused.
  • Surrogate key (SK) — system-generated id (UUID, hash, sequence). Stable.

Always carry both in dimensions. Facts reference SK (stable join), surface BK for human users.

10. Modelling time

Two date attributes per event:

  • Event time — when it actually happened (user clicked).
  • Processing time — when it landed in your warehouse.

Always log both. Watermarks (S23), late-arriving facts, replay correctness all depend on it.

dim_date table with pre-computed attributes (day_of_week, fiscal_quarter, is_holiday) saves countless joins later.

11. Anti-patterns

  • 100-column dimension that mixes 3 entities → split.
  • Fact table with no FK to time → can't analyse trends.
  • Mart per dashboard → duplicate logic everywhere.
  • Materialised view as load mechanism → recomputation cost explodes.
  • Storing audit history in JSON columns → ungroupable.

12. Reality check

For a new project:

  1. Bronze: raw landings, no transforms.
  2. Silver: dedup, cast, enrich; one canonical event per concept.
  3. Gold: dimensional star or OBT depending on consumers (BI = star; ML = OBT).
  4. dbt for transforms; great_expectations / dbt tests for quality.
  5. Catalog (DataHub, OpenMetadata) for discoverability.

You don't need Data Vault unless an auditor is asking. You don't need pure Inmon unless you have decades of legacy systems to harmonise.

Reading material

In-depth research material

Video reference

▶︎ Kimball Dimensional Modelling Crash Course (Joe Reis)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Relative Ranks

  • Link: https://leetcode.com/problems/relative-ranks/
  • Difficulty: Easy
  • Why this problem: Sort + assign positional labels — exactly the shape of building a slowly-changing dimension with current rank tracked.
  • Time-box: 20 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

  • Define fact, dimension, conformed dimension, surrogate key.
  • Choose between Kimball, Inmon, Data Vault, OBT for a given scenario.
  • Implement SCD Type 2 in SQL (valid_from / valid_to / surrogate key).
  • Lay out Bronze/Silver/Gold for a lakehouse with dbt.
  • Explain why OBT thrives on columnar storage despite the duplication.
  • Solve relative-ranks — sort + label assignment, a mini surrogate-key generator.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.