data engineeringintermediate 12m2026-06-09

Data Modelling — Dimensional, Data Vault, OBT for the Lakehouse

Session 28 of the 48-session learning series.

Date: Wed, 2026-07-01 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 22 · Est. read: 2 h

Why this session matters

This is Session 28 of 48 in the Data Engineering track. The schema you pick will outlive your job — fact tables get queried for years. Choosing between dimensional, data vault, and one-big-table is one of the highest-leverage decisions a DE makes. Get it wrong and every downstream team pays.

Agenda

Kimball dimensional modelling — facts, dimensions, star/snowflake
Inmon vs Kimball vs Data Vault — when each wins
Data Vault 2.0 — hubs, links, satellites, the audit story
One Big Table (OBT) — when denormalisation beats joins
Modern lakehouse hybrid — Bronze/Silver/Gold + dbt

Pre-read (skim before the session)

Deep dive

1. Why modelling matters

A bad model causes:

Queries that scan TBs because they can't push down filters.
Reports that disagree because the definition lives in 5 places.
New analysts who can't find anything for weeks.
Cost overruns because every join shuffles.

A good model is queryable (fewer tables, intuitive joins), correct (one source of truth per concept), and evolvable (you can change without rewriting downstream).

2. Kimball dimensional (star schema)

The classic. Big fact table at centre, small dimension tables around it.

              [ dim_customer ]
                     │
[ dim_product ] ─ [ fact_orders ] ─ [ dim_date ]
                     │
              [ dim_store ]

Fact — measurements (revenue, qty, click). One row per event. Foreign keys to dimensions. Mostly numeric.
Dimension — descriptive context (customer attributes, product hierarchy, date breakdown). Joined to filter/group facts.

Star (one dimension table per axis) vs snowflake (dimensions normalised). Default to star — joins are cheaper in MPP/columnar engines than the storage savings of snowflake.

3. Slowly Changing Dimensions (SCD)

Customer attributes change. How do you preserve history?

Type	Behaviour	When
SCD 0	Never change	Birth date
SCD 1	Overwrite	Fixing typos
SCD 2	New row + valid_from/valid_to	Need historical accuracy (most analytical needs)
SCD 3	Add `previous_value` column	Comparing then-vs-now once
SCD 4	History in separate table	Massive dimensions
SCD 6	Hybrid 1+2+3	Both current and historical

SCD 2 is the workhorse. Surrogate key (system-generated id) + business key + validity range.

4. Inmon vs Kimball

Inmon — top-down. Build a normalised 3NF enterprise data warehouse. Then build data marts (often dimensional) for departments.
Kimball — bottom-up. Build dimensional data marts directly, conformed via shared dimensions ("conformed bus").

Inmon: heavier upfront, more rigorous, slower. Kimball: faster delivery, occasional schema drift across marts.

Modern reality: most teams are Kimball-ish. The lakehouse Bronze layer is kind of Inmon (3NF-ish raw). Gold is dimensional.

5. Data Vault 2.0

Built for audit-heavy, regulated environments (banks, healthcare, government). Three artefact types:

Hub — business keys + load metadata. hub_customer(customer_bk, load_dts, source).
Link — associative; relates hubs. link_customer_order(customer_hk, order_hk, ...).
Satellite — descriptive attributes + history. sat_customer_demo(customer_hk, load_dts, name, city, ...).

Pros:

Insert-only — no updates → no destructive history loss.
Source-traceable — every record carries source + timestamp.
Highly parallel loads.
Schema-evolvable — new attribute = new satellite, no impact on existing.

Cons:

Verbose; 3–4x more tables than dimensional.
Queries need many joins → BI layer on top usually compiles down to dimensional.

Use when: regulator audits, many sources of same entity, frequent schema evolution. Skip for: small analytics shops, fast-moving startup.

6. One Big Table (OBT)

Single denormalised wide table. All attributes joined and flattened.

orders_wide(order_id, ts, customer_id, customer_name, customer_country,
            product_id, product_name, product_category, ...)

Pros:

Zero joins — fastest possible query.
Simplest mental model.
Plays nice with columnar pruning (no joins to break dim filters).

Cons:

Duplication — customer name in every row of every order.
Update nightmare — name change requires updating every row.
Storage cost.

Modern OBT thrives because:

Compression on columnar (Parquet) makes duplication cheap.
Compute is much more expensive than storage on cloud.
Many BI tools (Looker, Mode) flatten anyway.

Increasingly popular in lakehouse Gold layer. dbt + BigQuery / Snowflake users love it.

7. Bronze / Silver / Gold (lakehouse pattern)

Modern lakehouse layout:

Layer	Content	Modelling
Bronze	Raw landed data, immutable	Source schema; minimal transform
Silver	Cleaned, deduplicated, enriched	Normalised-ish; one row per business event
Gold	Business-ready, denormalised	Dimensional star OR OBT; per use-case

Decoupling layers lets you fix bugs in Gold without re-ingesting; rebuild Gold from Silver in hours.

8. dbt as the connective tissue

dbt = SQL + Jinja + DAG + tests. Models live in .sql files; dependencies inferred from {{ ref('upstream') }}.

Conventions that scale:

models/staging/ — Bronze → Silver, 1:1 with source.
models/intermediate/ — joins, mid-level transformations.
models/marts/ — Gold; per business area (finance, marketing, product).
Tests in schema.yml: not_null, unique, accepted_values, custom.
Snapshots for SCD 2.

9. Surrogate vs natural keys

Natural key (BK) — value with meaning (order_id, email). Can change, can be reused.
Surrogate key (SK) — system-generated id (UUID, hash, sequence). Stable.

Always carry both in dimensions. Facts reference SK (stable join), surface BK for human users.

10. Modelling time

Two date attributes per event:

Event time — when it actually happened (user clicked).
Processing time — when it landed in your warehouse.

Always log both. Watermarks (S23), late-arriving facts, replay correctness all depend on it.

dim_date table with pre-computed attributes (day_of_week, fiscal_quarter, is_holiday) saves countless joins later.

11. Anti-patterns

100-column dimension that mixes 3 entities → split.
Fact table with no FK to time → can't analyse trends.
Mart per dashboard → duplicate logic everywhere.
Materialised view as load mechanism → recomputation cost explodes.
Storing audit history in JSON columns → ungroupable.

12. Reality check

For a new project:

Bronze: raw landings, no transforms.
Silver: dedup, cast, enrich; one canonical event per concept.
Gold: dimensional star or OBT depending on consumers (BI = star; ML = OBT).
dbt for transforms; great_expectations / dbt tests for quality.
Catalog (DataHub, OpenMetadata) for discoverability.

You don't need Data Vault unless an auditor is asking. You don't need pure Inmon unless you have decades of legacy systems to harmonise.

Link: https://leetcode.com/problems/relative-ranks/
Difficulty: Easy
Why this problem: Sort + assign positional labels — exactly the shape of building a slowly-changing dimension with current rank tracked.
Time-box: 20 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Define fact, dimension, conformed dimension, surrogate key.
Choose between Kimball, Inmon, Data Vault, OBT for a given scenario.
Implement SCD Type 2 in SQL (valid_from / valid_to / surrogate key).
Lay out Bronze/Silver/Gold for a lakehouse with dbt.
Explain why OBT thrives on columnar storage despite the duplication.
Solve relative-ranks — sort + label assignment, a mini surrogate-key generator.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Recommender Systems — Two-Tower, Multi-Stage Ranking

Idiomatic Python (and a Touch of C++) — Type Hints, Protocols, Dataclasses