data engineering · advanced · 12m · 2026-06-10

Day 12 — Lakehouse Architecture — Delta Lake / Iceberg / Hudi, ACID on Object Storage

The lakehouse is now the default analytics substrate (Databricks, Snowflake Iceberg, Microsoft Fabric, AWS Glue Iceberg). ACID + time travel + schema evolution…

The lakehouse pattern says: keep data in open columnar files on object storage (Parquet on S3/ADLS/GCS), but add a transaction layer on top to give you ACID, MERGE/UPDATE, time travel, and schema evolution. Three formats compete: Delta Lake (Databricks), Apache Iceberg (Netflix/Apple/many), Apache Hudi (Uber).

🧠 Concept

Why it matters & the mental model.

1. Why this matters

Old world:

  • Data lake: cheap, scalable, schema-on-read — but no ACID, no UPDATE, hard to evolve schema.
  • Data warehouse: ACID, fast, expensive, vendor-locked.

Lakehouse = both. One copy of the data: BI tools query it directly, ML jobs read the same files, and there is no ETL into a separate warehouse.

2. The architectural insight

The lakehouse is a layered stack: cheap Parquet files at the bottom, an atomic metadata pointer in the middle, and pluggable engines on top. A snapshot is just a list of files ("snapshot N = these 47 files"), and a query reads exactly that list; an UPDATE writes new files and atomically swaps the pointer to snapshot N+1. The old Parquet files are still there, so time travel costs nothing extra until a vacuum or snapshot-expiry job deletes them.

The lakehouse stack
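
To make the pointer-swap idea concrete, here is a minimal in-memory sketch. ToyTable, its methods, and the file names are hypothetical illustrations, not any format's real API; the only point is that a commit creates a new file list and moves a pointer, while old lists stay readable.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a lakehouse table."""

    # Snapshot 0 is the empty table; each later entry is an immutable file set.
    snapshots: list[frozenset[str]] = field(default_factory=lambda: [frozenset()])
    current: int = 0  # the "metadata pointer": index of the live snapshot

    def commit(self, add: set[str], remove: frozenset[str] | set[str] = frozenset()) -> int:
        # New snapshot = old file set minus removed files plus new files.
        # Existing data files are never modified in place.
        new_files = (self.snapshots[self.current] - remove) | add
        self.snapshots.append(frozenset(new_files))
        self.current = len(self.snapshots) - 1  # the pointer swap
        return self.current

    def files(self, version: int | None = None) -> frozenset[str]:
        # Time travel is just reading an older entry in the snapshot list.
        return self.snapshots[self.current if version is None else version]


table = ToyTable()
v1 = table.commit(add={"part-000.parquet", "part-001.parquet"})
# An UPDATE touching rows in part-001 rewrites that file as part-002.
v2 = table.commit(add={"part-002.parquet"}, remove={"part-001.parquet"})
```

Reading `table.files(v1)` still returns the pre-update file list, which is all that time travel needs. A real format also has to make the pointer swap atomic across machines, which this single-process sketch does not attempt.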

3. Delta Lake

  • Transaction log = ordered JSON files in _delta_log/, named by a 20-digit zero-padded version: 00000000000000000000.json, 00000000000000000001.json, .... By default, every 10 commits are compacted into a Parquet checkpoint.
  • Optimistic concurrency: writers compute "I want to add A and remove B"; commit succeeds if no conflicting commit slipped in.
  • OPTIMIZE + Z-ORDER: compact small files, multi-dimensional clustering for skipping.
  • CHANGE DATA FEED: emit row-level CDC for downstream pipelines.
  • Heavy Databricks gravity but spec is open (Delta Lake 3.x + UniForm reads as Iceberg too).
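
The log-replay idea can be sketched in a few lines. The commit bodies below are hypothetical and heavily simplified (real Delta actions carry many more fields, and readers start from the latest checkpoint instead of replaying everything); the sketch only shows how "add" and "remove" actions fold into a list of live files.

```python
from __future__ import annotations

import json

# Hypothetical commits, one JSON action per line, in the style of _delta_log entries.
COMMITS = [
    '{"add": {"path": "part-000.parquet"}}\n{"add": {"path": "part-001.parquet"}}',
    '{"add": {"path": "part-002.parquet"}}',
    '{"remove": {"path": "part-001.parquet"}}\n{"add": {"path": "part-003.parquet"}}',
]


def commit_file_name(version: int) -> str:
    # Delta names commit files with a 20-digit zero-padded version number.
    return f"{version:020d}.json"


def live_files(commits: list[str], as_of: int | None = None) -> set[str]:
    """Replay add/remove actions up to and including version `as_of`."""
    last = len(commits) - 1 if as_of is None else as_of
    files: set[str] = set()
    for body in commits[: last + 1]:
        for line in body.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files
```

Calling `live_files(COMMITS, as_of=0)` gives the table as of the first commit, which is the same mechanism a time-travel read relies on.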

4. Iceberg

  • Three-level metadata: metadata.json → manifest list → manifest files → data files.
  • Hidden partitioning: you write event_ts, Iceberg tracks the daily partition automatically; partition evolution doesn't break old queries.
  • Catalog-driven: AWS Glue, Hive, Nessie, Polaris, REST catalog. The REST catalog spec is winning interoperability.
  • Best multi-engine support today: Spark, Trino, Flink, Snowflake, DuckDB, ClickHouse.
  • Branching / tagging (Nessie / Iceberg v2): git-like dev branches for data.
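
Hidden partitioning is easier to see in a toy model. The sketch below assumes a day-granularity transform (days since 1970-01-01 in UTC, in the spirit of Iceberg's day transform) and a hypothetical manifest that maps each data file to its partition value; the function names are illustrative and are not Iceberg's API.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Days since 1970-01-01 in UTC for a timezone-aware timestamp."""
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


def record_file(manifest: dict, path: str, event_ts: datetime) -> None:
    # The writer only supplies event_ts; the partition value is derived.
    manifest[path] = day_transform(event_ts)


def plan_scan(manifest: dict, start: datetime, end: datetime) -> list:
    # The reader filters on event_ts; the planner maps that to partition values.
    lo, hi = day_transform(start), day_transform(end)
    return sorted(path for path, day in manifest.items() if lo <= day <= hi)


utc = timezone.utc
manifest = {}
record_file(manifest, "f1.parquet", datetime(2024, 1, 1, 10, 0, tzinfo=utc))
record_file(manifest, "f2.parquet", datetime(2024, 1, 2, 23, 30, tzinfo=utc))
record_file(manifest, "f3.parquet", datetime(2024, 1, 5, 8, 0, tzinfo=utc))

to_read = plan_scan(
    manifest,
    datetime(2024, 1, 2, tzinfo=utc),
    datetime(2024, 1, 2, 23, 59, 59, tzinfo=utc),
)
```

Neither the writer nor the query ever mentions a partition column, which is why changing the transform later (say, from daily to hourly) does not require rewriting queries.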

🛠 Deep Dive

Internals, code, architecture.

5. Hudi

  • Original use case: streaming upserts (Uber rides). Two table types:
    • Copy-on-write: rewrite Parquet on update (read-fast).
    • Merge-on-read: keep delta log of updates, merge on read (write-fast). Best for high-velocity upserts.
  • Strong CDC / incremental query story (hudi_table_changes()).
  • Smaller ecosystem than Iceberg today but unmatched for streaming-heavy lakes.
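
The two table types trade write cost against read cost, which a toy model can show. Both classes below are hypothetical sketches that use dicts in place of Parquet base files and a list in place of the delta log; they are not Hudi's API.

```python
class CopyOnWriteTable:
    """Every upsert rewrites the base file; reads need no merging."""

    def __init__(self, rows: dict):
        self.base = dict(rows)  # stands in for a Parquet base file
        self.rewrites = 0

    def upsert(self, key: str, row: dict) -> None:
        new_base = dict(self.base)  # copy the whole file...
        new_base[key] = row         # ...to change a single row
        self.base = new_base
        self.rewrites += 1

    def read(self) -> dict:
        return dict(self.base)


class MergeOnReadTable:
    """Upserts append to a log; reads merge base + log until compaction."""

    def __init__(self, rows: dict):
        self.base = dict(rows)
        self.log = []  # stands in for the delta log of updates

    def upsert(self, key: str, row: dict) -> None:
        self.log.append((key, row))  # cheap append, base file untouched

    def read(self) -> dict:
        merged = dict(self.base)
        for key, row in self.log:  # later log entries win
            merged[key] = row
        return merged

    def compact(self) -> None:
        self.base = self.read()
        self.log = []


rides = {"r1": {"status": "requested"}, "r2": {"status": "requested"}}
cow, mor = CopyOnWriteTable(rides), MergeOnReadTable(rides)
for table in (cow, mor):
    table.upsert("r1", {"status": "picked_up"})
    table.upsert("r1", {"status": "completed"})
```

Both tables return the same rows, but the copy-on-write table has rewritten its base file twice while the merge-on-read table has only two log entries waiting for the next compaction.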

6. The common features

All three give you:

  • ACID via optimistic concurrency control on a manifest commit.
  • MERGE INTO (upsert), DELETE, UPDATE.
  • Schema evolution: add column (any), drop / rename (with care).
  • Time travel: VERSION AS OF or TIMESTAMP AS OF.
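  • Partitioning that queries need not spell out: native hidden partitioning in Iceberg; Delta and Hudi approximate it (for example with generated columns or key generators).
  • Data skipping via column min/max stats kept in table metadata.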
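
Data skipping is the feature that most directly affects query cost, so here is a minimal sketch of it. FileStats and the user_id column are hypothetical; real formats keep per-column statistics for many columns, but the pruning test is the same interval-overlap check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column."""

    path: str
    min_user_id: int
    max_user_id: int


def files_for_equals(stats: list, user_id: int) -> list:
    # A file can contain user_id only if it lies within [min, max].
    return [f.path for f in stats if f.min_user_id <= user_id <= f.max_user_id]


def files_for_range(stats: list, lo: int, hi: int) -> list:
    # Keep files whose [min, max] interval overlaps the query interval [lo, hi].
    return [f.path for f in stats if f.max_user_id >= lo and f.min_user_id <= hi]


stats = [
    FileStats("a.parquet", 1, 100),
    FileStats("b.parquet", 101, 200),
    FileStats("c.parquet", 150, 300),
]

point_scan = files_for_equals(stats, 170)
range_scan = files_for_range(stats, 1, 120)
```

The point lookup still has to open two files because b.parquet and c.parquet have overlapping ranges. Sorting or clustering on user_id narrows those ranges, which is what makes min/max skipping effective.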

7. File layout matters

  • Aim for 256 MB - 1 GB Parquet files post-compaction.
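  • Avoid the "small file problem": a streaming job that writes thousands of 10 KB files every hour will tank query performance, because each file costs a metadata entry and an object-store request. Schedule a daily OPTIMIZE / compaction job.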
  • Partition on low cardinality + frequent filter (date, country). Cluster (Z-order / sort) on high-cardinality filter (user_id, item_id).
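
A compaction job is essentially bin-packing small files into target-sized outputs. The sketch below is a simple greedy planner under assumed thresholds (files under 128 MB count as small, 512 MB target); the function name, thresholds, and file sizes are hypothetical, and real OPTIMIZE implementations use their own strategies.

```python
def plan_compaction(sizes_mb: dict, target_mb: float = 512.0, small_mb: float = 128.0) -> list:
    """Group small files into bins of at most target_mb; return bins worth rewriting."""
    # Largest first, ties broken by path so the plan is deterministic.
    candidates = sorted(
        ((size, path) for path, size in sizes_mb.items() if size < small_mb),
        key=lambda item: (-item[0], item[1]),
    )
    bins, current, current_size = [], [], 0.0
    for size, path in candidates:
        if current and current_size + size > target_mb:
            bins.append(current)  # bin is full, start a new one
            current, current_size = [], 0.0
        current.append(path)
        current_size += size
    if current:
        bins.append(current)
    # Rewriting a single file on its own gains nothing, so drop one-file bins.
    return [b for b in bins if len(b) > 1]


sizes = {"a": 100, "b": 100, "c": 100, "d": 100, "e": 100, "f": 100, "big": 700}
plan = plan_compaction(sizes)
```

Here five 100 MB files are merged into one roughly 500 MB output, the sixth waits for the next run, and the 700 MB file is left alone because it is already large enough.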

8. ACID on object storage — the trick

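Object stores have no atomic rename, and S3 list operations were only eventually consistent until December 2020 (S3 is strongly consistent now). The formats solve this with:

  • Delta: atomic put-if-absent of the next commit JSON in _delta_log/; a single key write is atomic. On S3 this historically needed an external coordinator such as a DynamoDB-backed log store, because S3 only gained conditional writes (If-None-Match) in 2024.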
  • Iceberg: catalog (Glue/Hive/Nessie) does atomic compare-and-swap on the table's metadata pointer.
  • Hudi: timeline server + atomic commit file.
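
The put-if-absent commit can be sketched on a local filesystem, where exclusive file creation (O_CREAT | O_EXCL) plays the role of an object store's conditional write. The try_commit helper and the file contents are hypothetical. Note one assumption that does not carry over: an object store publishes the whole object at once, while a local file is visible before its contents are fully written.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Create the commit file for `version` only if it does not exist yet."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # lost the race; caller must re-read the log and retry
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


log_dir = tempfile.mkdtemp()
# Two writers race for version 0; only the first create succeeds.
first = try_commit(log_dir, 0, [{"add": {"path": "part-000.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "part-999.parquet"}}])
```

The losing writer never overwrites the winner's commit. It has to reload the log, check whether its changes still apply, and try again at version 1.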

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Concurrency model

Optimistic: read snapshot N → compute writes → try commit at snapshot N+1 → if conflicting writes detected, replay with new snapshot or fail. Works great for ELT (append-mostly); painful for high-write OLTP-style workloads.
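
The read, compute, commit, retry loop can be sketched as follows. ToyCatalog, CommitConflict, and the conflict rule (two commits conflict if both remove the same file) are hypothetical simplifications; real formats apply richer conflict checks, and the lock here merely stands in for the catalog's atomic compare-and-swap.

```python
import threading


class CommitConflict(Exception):
    pass


class ToyCatalog:
    def __init__(self):
        self.commits = []  # list of (added, removed) frozensets
        self._lock = threading.Lock()

    def version(self) -> int:
        return len(self.commits)

    def live_files(self) -> set:
        files = set()
        for added, removed in self.commits:
            files = (files - removed) | added
        return files

    def commit(self, read_version: int, added, removed) -> int:
        added, removed = frozenset(added), frozenset(removed)
        with self._lock:
            # Conflict if a commit made after our read removed a file we also remove.
            for _, later_removed in self.commits[read_version:]:
                if later_removed & removed:
                    raise CommitConflict("file already rewritten by a newer commit")
            self.commits.append((added, removed))
            return len(self.commits)


def commit_with_retry(catalog: ToyCatalog, plan, attempts: int = 3) -> int:
    for _ in range(attempts):
        read_version = catalog.version()
        added, removed = plan(catalog.live_files())  # recompute on the fresh snapshot
        try:
            return catalog.commit(read_version, added, removed)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
catalog.commit(0, {"a.parquet", "b.parquet"}, set())    # version 1
stale = catalog.version()                               # several writers read version 1
catalog.commit(stale, {"b2.parquet"}, {"b.parquet"})    # writer A rewrites b -> version 2
appended = catalog.commit(stale, {"c.parquet"}, set())  # blind append still succeeds
try:
    catalog.commit(stale, {"b3.parquet"}, {"b.parquet"})  # writer B also rewrote b
    conflicted = False
except CommitConflict:
    conflicted = True

final = commit_with_retry(catalog, lambda files: ({"compacted.parquet"}, set(files)))
```

The blind append goes through even from a stale snapshot, while the second rewrite of b.parquet is rejected. That asymmetry is why append-mostly ELT rarely notices the concurrency model and update-heavy workloads do.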

10. Picking one

  • Databricks-first shop: Delta (still best-supported there) — increasingly write Delta + UniForm so Iceberg readers work.
  • Multi-engine, open ecosystem: Iceberg with REST catalog (Polaris / Nessie). The 2025 momentum.
  • Streaming upserts dominate: Hudi.

11. Common pitfalls

12. What to take away

"How would you build the data platform?" Strong answers name the format, the catalog, partition + cluster strategy, compaction schedule, and one operational concern (vacuum / small files / concurrency).
