Day 12 — Lakehouse Architecture — Delta Lake / Iceberg / Hudi, ACID on Object Storage
The lakehouse is now the default analytics substrate (Databricks, Snowflake Iceberg, Microsoft Fabric, AWS Glue Iceberg). ACID + time travel + schema evolution…
The lakehouse pattern says: keep data in open columnar files on object storage (Parquet on S3/ADLS/GCS), but add a transaction layer on top to give you ACID, MERGE/UPDATE, time travel, and schema evolution. Three formats compete: Delta Lake (Databricks), Apache Iceberg (Netflix/Apple/many), Apache Hudi (Uber).
🧠 Concept
Why it matters & the mental model.
1. Why this matters
Old world:
- Data lake: cheap, scalable, schema-on-read — but no ACID, no UPDATE, hard to evolve schema.
- Data warehouse: ACID, fast, expensive, vendor-locked.
Lakehouse = both. One copy of data, BI tools query directly, ML jobs read same files, no ETL into a warehouse.
2. The architectural insight
The lakehouse is a layered stack: cheap parquet files at the bottom, an atomic metadata pointer in the middle, and pluggable engines on top. A query is "snapshot N = these 47 files"; an UPDATE writes new files and atomically swaps the pointer. The parquet files are still there → time travel is free.
The lakehouse stack
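The pointer-swap idea can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: `ToyTable`, `commit`, and `files` are hypothetical names, and no real format stores snapshots this way.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model: immutable data files plus a pointer to the current snapshot."""
    snapshots: list = field(default_factory=lambda: [frozenset()])  # snapshot 0 is empty
    current: int = 0  # stands in for the atomic metadata pointer

    def commit(self, add=(), remove=()):
        # Build the next snapshot from the current one; old snapshots stay untouched.
        new_files = (self.snapshots[self.current] - frozenset(remove)) | frozenset(add)
        self.snapshots.append(new_files)
        self.current = len(self.snapshots) - 1  # the pointer swap
        return self.current

    def files(self, version=None):
        # Reading "as of" an older version just follows an older pointer.
        return self.snapshots[self.current if version is None else version]


table = ToyTable()
v1 = table.commit(add=["part-000.parquet", "part-001.parquet"])
# An UPDATE touching rows in part-001 rewrites that file under a new name.
v2 = table.commit(add=["part-002.parquet"], remove=["part-001.parquet"])

print(sorted(table.files()))    # current snapshot
print(sorted(table.files(v1)))  # time travel to v1
```

Because `part-001.parquet` is only dropped from the newest snapshot and never deleted, reading version `v1` still works; that is the sense in which time travel is "free" until old files are vacuumed.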
3. Delta Lake
- Transaction log = ordered JSON files in _delta_log/ (00000.json, 00001.json, ...; real file names are zero-padded to 20 digits). By default, every 10 commits are compacted into a Parquet checkpoint.
- Optimistic concurrency: writers compute "I want to add A and remove B"; commit succeeds if no conflicting commit slipped in.
- OPTIMIZE + Z-ORDER: compact small files, multi-dimensional clustering for skipping.
- CHANGE DATA FEED: emit row-level CDC for downstream pipelines.
- Heavy Databricks gravity but spec is open (Delta Lake 3.x + UniForm reads as Iceberg too).
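To make the transaction log concrete, here is a minimal sketch of log replay. It assumes heavily simplified commit files that contain only `add` and `remove` actions with a `path`; real Delta actions carry more fields, and the `commits` dict and `live_files` function are hypothetical stand-ins for the files in _delta_log/.

```python
import json

# Hypothetical, simplified commit files: one JSON action per line.
commits = {
    "00000.json": '{"add": {"path": "a.parquet"}}\n{"add": {"path": "b.parquet"}}',
    "00001.json": '{"remove": {"path": "b.parquet"}}\n{"add": {"path": "c.parquet"}}',
    "00002.json": '{"add": {"path": "d.parquet"}}',
}


def live_files(commits, as_of=None):
    """Replay commits in version order and return the set of live data files."""
    files = set()
    for version, name in enumerate(sorted(commits)):
        if as_of is not None and version > as_of:
            break  # stop early: this is how a versioned read is resolved
        for line in commits[name].splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(live_files(commits)))           # latest version
print(sorted(live_files(commits, as_of=0)))  # table as of version 0
```

A checkpoint plays the role of a cached result of this replay, so readers do not have to start from commit 0 every time.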
4. Iceberg
- Three-level metadata: metadata.json → manifest list → manifest files → data files.
- Hidden partitioning: you write
event_ts, Iceberg tracks daily partition automatically; partition evolution doesn't break old queries. - Catalog-driven: AWS Glue, Hive, Nessie, Polaris, REST catalog. The REST catalog spec is winning interoperability.
- Best multi-engine support today: Spark, Trino, Flink, Snowflake, DuckDB, ClickHouse.
- Branching / tagging (Nessie / Iceberg v2): git-like dev branches for data.
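Hidden partitioning can be illustrated with a small sketch. It assumes a day-granularity transform expressed as days since 1970-01-01 UTC, which mirrors the idea behind Iceberg's `day` transform; `files_by_day`, `write`, and `files_for_range` are hypothetical helpers, not Iceberg APIs.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Days since 1970-01-01 UTC for a timezone-aware timestamp."""
    return (ts - EPOCH).days


# Hypothetical file index: derived partition value -> data files.
files_by_day = {}


def write(ts, path):
    # The writer only supplies event_ts; the partition value is derived from it.
    files_by_day.setdefault(day_transform(ts), []).append(path)


def files_for_range(start, end):
    """Prune to files whose derived day falls inside [start, end]."""
    lo, hi = day_transform(start), day_transform(end)
    return sorted(p for day, paths in files_by_day.items() if lo <= day <= hi for p in paths)


write(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc), "f1.parquet")
write(datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc), "f2.parquet")
write(datetime(2024, 1, 2, 1, 0, tzinfo=timezone.utc), "f3.parquet")

jan1 = files_for_range(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc),
)
print(jan1)
```

The query filters on the timestamp itself and never names a partition column, which is why the partitioning scheme can change later without rewriting queries.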
🛠 Deep Dive
Internals, code, architecture.
5. Hudi
- Original use case: streaming upserts (Uber rides). Two table types:
- Copy-on-write: rewrite Parquet on update (read-fast).
- Merge-on-read: keep delta log of updates, merge on read (write-fast). Best for high-velocity upserts.
- Strong CDC / incremental query story (hudi_table_changes()).
- Smaller ecosystem than Iceberg today but unmatched for streaming-heavy lakes.
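The copy-on-write versus merge-on-read trade-off can be sketched with plain dictionaries standing in for files. This is a toy model under stated assumptions: `cow_upsert`, `mor_upsert`, and `mor_read` are hypothetical names, and real Hudi tables add file groups, indexes, and compaction on top.

```python
def cow_upsert(base, updates):
    """Copy-on-write: produce a rewritten base file with the updates applied."""
    return {**base, **updates}  # expensive write, but readers need no merge


def mor_upsert(log, updates):
    """Merge-on-read: append the updates to a log; the base file is untouched."""
    return log + [updates]  # cheap write


def mor_read(base, log):
    """Readers pay the merge cost: apply log entries in order over the base."""
    merged = dict(base)
    for entry in log:
        merged.update(entry)
    return merged


base = {"ride-1": "requested", "ride-2": "requested"}

# Copy-on-write: one upsert rewrites the whole base file.
cow_base = cow_upsert(base, {"ride-1": "completed"})

# Merge-on-read: two quick appends, merged only when someone reads.
log = []
log = mor_upsert(log, {"ride-1": "accepted"})
log = mor_upsert(log, {"ride-1": "completed"})

print(cow_base)
print(mor_read(base, log))
```

Both paths end at the same logical table state; they differ in whether the writer or the reader does the merging work.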
6. The common features
All three give you:
- ACID via optimistic concurrency control on a manifest commit.
- MERGE INTO (upsert), DELETE, UPDATE.
- Schema evolution: add column (any), drop / rename (with care).
- Time travel: VERSION AS OF or TIMESTAMP AS OF.
- Hidden partitioning (Iceberg most cleanly).
- Data skipping via column min/max stats in manifests.
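Data skipping with min/max statistics comes down to an interval-overlap test per file. The sketch below assumes one integer column and a hypothetical `FileStats` record; real manifests keep per-column bounds alongside other metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int
    max_value: int


def prune(files, lo, hi):
    """Keep only files whose [min, max] range can overlap the filter [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


manifest = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 150, 300),
]

# Equivalent of: WHERE user_id BETWEEN 120 AND 160
to_scan = prune(manifest, 120, 160)
print(to_scan)
```

Skipping only helps when file ranges are narrow and mostly disjoint, which is what sorting or clustering on the filter column is meant to achieve.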
7. File layout matters
- Aim for 256 MB - 1 GB Parquet files post-compaction.
- Avoid the "small file problem": streaming jobs that write tiny (10 KB) files every hour will tank query performance. Schedule a daily OPTIMIZE / compaction job.
- Partition on low cardinality + frequent filter (date, country). Cluster (Z-order / sort) on high-cardinality filter (user_id, item_id).
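A compaction job first has to decide which small files to rewrite together. Here is a minimal greedy sketch, assuming a 512 MB target (an arbitrary pick inside the range above) and a hypothetical `plan_compaction` helper; the engines' own OPTIMIZE and compaction planners are more sophisticated.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target=512 * MB):
    """Greedily group small files into bins of at most `target` bytes.

    Files already at or above `target` are left alone, and a bin with a
    single file is dropped because rewriting it alone gains nothing.
    """
    bins, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if size >= target:
            continue
        if current and current_size + size > target:
            if len(current) > 1:
                bins.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if len(current) > 1:
        bins.append(current)
    return bins


sizes = {"a": 200 * MB, "b": 200 * MB, "c": 200 * MB, "d": 600 * MB, "e": 10 * MB}
plan = plan_compaction(sizes)
print(plan)
```

Each inner list is one rewrite task: read those files, write one larger file, then commit a snapshot that swaps them.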
8. ACID on object storage — the trick
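S3 has no atomic rename, and before December 2020 its list operations were only eventually consistent (S3 is strongly consistent today). The formats solve this with:
- Delta: atomic put-if-absent of the next 00000N.json; a single key write is atomic. S3 only gained conditional writes in 2024, so multi-cluster writers on S3 previously needed an external coordinator (for example a DynamoDB-backed log store).
- Iceberg: catalog (Glue/Hive/Nessie) does atomic compare-and-swap on the table's metadata pointer.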
- Hudi: timeline server + atomic commit file.
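The put-if-absent primitive behind a log-file commit can be imitated on a local filesystem. This sketch is only an analogy: it uses `O_CREAT | O_EXCL` in place of an object store's conditional write, and `try_commit` and the five-digit file name are illustrative assumptions.

```python
import os
import tempfile


def try_commit(log_dir, version, payload):
    """Create the commit file for `version` only if it does not exist yet.

    O_CREAT | O_EXCL makes creation fail when another writer got there first,
    which is the local analogue of a put-if-absent on an object store.
    """
    path = os.path.join(log_dir, f"{version:05d}.json")
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False  # lost the race; re-read the log and retry at a later version
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 1, '{"add": {"path": "a.parquet"}}')
second = try_commit(log_dir, 1, '{"add": {"path": "b.parquet"}}')  # same version: rejected
print(first, second)
```

Exactly one writer can own a given version number, so the sequence of commit files forms a total order without any rename. Unlike an object-store put, the local file is visible before its content is fully written, so this shows the mutual exclusion only.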
🚀 In Practice
Trade-offs, exercises, what to ship today.
9. Concurrency model
Optimistic: read snapshot N → compute writes → try commit at snapshot N+1 → if conflicting writes detected, replay with new snapshot or fail. Works great for ELT (append-mostly); painful for high-write OLTP-style workloads.
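The read, compute, try-commit, retry loop can be sketched as follows. It assumes a deliberately crude conflict rule (two commits conflict if they touch the same file); `ToyLog`, `Conflict`, and `commit_with_retry` are hypothetical names, and real formats apply finer-grained checks.

```python
class Conflict(Exception):
    pass


class ToyLog:
    """Hypothetical in-memory commit log: entry i holds the files commit i+1 touched."""

    def __init__(self):
        self.commits = []

    def latest(self):
        return len(self.commits)

    def commit(self, read_version, touched):
        # Conflict check: did any commit after our read snapshot touch the same files?
        for other in self.commits[read_version:]:
            if other & touched:
                raise Conflict(sorted(other & touched))
        self.commits.append(set(touched))
        return self.latest()


def commit_with_retry(log, compute_touched, max_attempts=3):
    for _ in range(max_attempts):
        snapshot = log.latest()              # read snapshot N
        touched = compute_touched(snapshot)  # compute writes against it
        try:
            return log.commit(snapshot, touched)  # try to land as N+1
        except Conflict:
            continue                         # replay against the new snapshot
    raise Conflict("gave up after retries")


log = ToyLog()
log.commit(0, {"a.parquet"})  # version 1

attempts = []


def rewrite_b(snapshot):
    attempts.append(snapshot)
    if len(attempts) == 1:
        # Simulate another writer landing a conflicting commit mid-computation.
        log.commit(snapshot, {"b.parquet"})
    return {"b.parquet"}


final_version = commit_with_retry(log, rewrite_b)
print(final_version, attempts)
```

Append-only writers touch disjoint new files and almost never hit the retry path, while many writers rewriting the same files would spend most of their time in it.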
10. Picking one
- Databricks-first shop: Delta (still best-supported there) — increasingly write Delta + UniForm so Iceberg readers work.
- Multi-engine, open ecosystem: Iceberg with REST catalog (Polaris / Nessie). The 2025 momentum.
- Streaming upserts dominate: Hudi.
11. Common pitfalls
12. What to take away
"How would you build the data platform?" Strong answers name the format, the catalog, partition + cluster strategy, compaction schedule, and one operational concern (vacuum / small files / concurrency).
Resources
- 🎥 Databricks — Delta Lake Internals
- 📖 Iceberg spec — table format v2
- 📖 Onehouse — Delta vs Iceberg vs Hudi comparison
- 📖 Databricks — Delta Lake whitepaper
Practice Problem: Merge Intervals (Medium)