data engineering · advanced · 12m · 2026-06-10

Day 12 — Lakehouse Architecture — Delta Lake / Iceberg / Hudi, ACID on Object Storage

The lakehouse is now the default analytics substrate (Databricks, Snowflake Iceberg, Microsoft Fabric, AWS Glue Iceberg). ACID + time travel + schema evolution…

The lakehouse pattern says: keep data in open columnar files on object storage (Parquet on S3/ADLS/GCS), but add a transaction layer on top to give you ACID, MERGE/UPDATE, time travel, and schema evolution. Three formats compete: Delta Lake (Databricks), Apache Iceberg (Netflix/Apple/many), Apache Hudi (Uber).

🧠 Concept

Why it matters & the mental model.

1. Why this matters

Old world:

Data lake: cheap, scalable, schema-on-read — but no ACID, no UPDATE, hard to evolve schema.
Data warehouse: ACID, fast, expensive, vendor-locked.

Lakehouse = both. One copy of the data: BI tools query it directly, ML jobs read the same files, and there is no ETL into a separate warehouse.

2. The architectural insight

The lakehouse is a layered stack: cheap Parquet files at the bottom, an atomic metadata pointer in the middle, and pluggable engines on top. A query resolves to "snapshot N = these 47 files"; an UPDATE writes new files and atomically swaps the pointer. The old Parquet files are still there, so time travel comes almost for free (until old snapshots are expired and their files are cleaned up).

[Figure: The lakehouse stack]
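
The pointer swap described above can be modelled in a few lines. This is a minimal in-memory sketch, not any format's real metadata layout; `ToyTable`, `commit`, and `files_at` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a lakehouse table (hypothetical, for illustration)."""
    # snapshots[i] is the immutable set of data files visible at snapshot i.
    snapshots: list = field(default_factory=lambda: [frozenset()])
    current: int = 0  # the "metadata pointer"

    def commit(self, add=(), remove=()):
        """Record a new snapshot, then move the pointer in one step."""
        base = self.snapshots[self.current]
        new_snapshot = (base - frozenset(remove)) | frozenset(add)
        self.snapshots.append(new_snapshot)
        self.current = len(self.snapshots) - 1  # the atomic swap
        return self.current

    def files_at(self, version=None):
        """Files a reader sees; passing an older version is time travel."""
        v = self.current if version is None else version
        return self.snapshots[v]


table = ToyTable()
v1 = table.commit(add=["part-0.parquet", "part-1.parquet"])
# An UPDATE rewrites part-1 into part-2; part-1 is not deleted from storage.
v2 = table.commit(add=["part-2.parquet"], remove=["part-1.parquet"])

latest = table.files_at()
as_of_v1 = table.files_at(v1)
```

Because snapshots are never mutated, a reader pinned to `v1` keeps seeing `part-1.parquet` even after the later commit.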

3. Delta Lake

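Transaction log = ordered JSON commit files in _delta_log/, named by zero-padded version number (00000000000000000000.json, 00000000000000000001.json, ...). The log is periodically compacted into a Parquet checkpoint (every 10 commits by default; the interval is configurable).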
Optimistic concurrency: writers compute "I want to add A and remove B"; commit succeeds if no conflicting commit slipped in.
OPTIMIZE + Z-ORDER: compact small files, multi-dimensional clustering for skipping.
Change Data Feed: emits row-level changes (inserts, updates, deletes) for downstream pipelines.
Heavy Databricks gravity, but the spec is open; with Delta Lake 3.x, UniForm lets Iceberg clients read the same table.
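
To make the log idea concrete, here is a simplified sketch of replaying `add` / `remove` actions to rebuild the live file set. It assumes a heavily reduced commit format: real Delta commits carry more fields (sizes, stats, protocol and metadata actions) and readers also use checkpoints. `write_commit` and `replay` are hypothetical helpers, not part of any Delta library.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir: Path, version: int, actions: list) -> Path:
    """Write one commit as newline-delimited JSON actions (simplified)."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions))
    return path


def replay(log_dir: Path, up_to=None) -> set:
    """Rebuild the live file set by applying commits in version order."""
    live = set()
    # Zero-padded names sort lexicographically in version order.
    for commit in sorted(log_dir.glob("*.json")):
        if up_to is not None and int(commit.stem) > up_to:
            break
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


log_dir = Path(tempfile.mkdtemp()) / "_delta_log"
log_dir.mkdir()
write_commit(log_dir, 0, [{"add": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}])
write_commit(log_dir, 1, [{"remove": {"path": "b.parquet"}}, {"add": {"path": "c.parquet"}}])

current_files = replay(log_dir)
files_at_v0 = replay(log_dir, up_to=0)  # time travel: stop replaying early
```

A checkpoint is, in effect, the result of `replay` saved to disk so that readers do not have to start from version 0 every time.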

4. Iceberg

Three-level metadata: metadata.json → manifest list → manifest files → data files.
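Hidden partitioning: you declare a transform such as days(event_ts) once; writers just write event_ts and queries just filter on it, while Iceberg derives and tracks the daily partition automatically. Partition evolution doesn't break old queries.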
Catalog-driven: AWS Glue, Hive, Nessie, Polaris, REST catalog. The REST catalog spec is winning interoperability.
Best multi-engine support today: Spark, Trino, Flink, Snowflake, DuckDB, ClickHouse.
Branching / tagging (Nessie at the catalog level, Iceberg snapshot refs at the table level): git-like dev branches for data.
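
Hidden partitioning is easier to see in a toy model. The sketch below is plain Python, not the Iceberg API: `day_transform`, `write_rows`, and `prune` are hypothetical names. It assumes a day-granularity transform expressed as days since 1970-01-01 in UTC.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Days since 1970-01-01 (UTC): the derived partition value."""
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


def write_rows(rows):
    """Group rows by the derived day; writers never supply a partition column."""
    partitions = {}
    for row in rows:
        partitions.setdefault(day_transform(row["event_ts"]), []).append(row)
    return partitions


def prune(partitions, start: datetime, end: datetime):
    """Turn a filter on event_ts into a filter on derived partition values."""
    lo, hi = day_transform(start), day_transform(end)
    return {day: rows for day, rows in partitions.items() if lo <= day <= hi}


rows = [
    {"id": 1, "event_ts": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)},
    {"id": 2, "event_ts": datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)},
    {"id": 3, "event_ts": datetime(2024, 1, 3, 0, 5, tzinfo=timezone.utc)},
]
partitions = write_rows(rows)
hits = prune(
    partitions,
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc),
)
```

The query only mentions `event_ts`; the mapping to partitions lives in table metadata, which is why the partition scheme can change later without rewriting queries.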

🛠 Deep Dive

Internals, code, architecture.

5. Hudi

Original use case: streaming upserts (Uber rides). Two table types:
- Copy-on-write: rewrite Parquet on update (read-fast).
- Merge-on-read: keep delta log of updates, merge on read (write-fast). Best for high-velocity upserts.
Strong CDC / incremental query story (hudi_table_changes()).
Smaller ecosystem than Iceberg today but unmatched for streaming-heavy lakes.
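
The two table types trade write cost against read cost. The following is a toy illustration using dictionaries as stand-ins for Parquet base files and update logs; it is not the Hudi API, and `cow_upsert` and `MorFileGroup` are hypothetical names.

```python
def cow_upsert(base_file: dict, updates: dict) -> dict:
    """Copy-on-write: produce a fully rewritten base file with updates applied."""
    return {**base_file, **updates}


class MorFileGroup:
    """Merge-on-read: append updates to a log and merge them when reading."""

    def __init__(self, base_file: dict):
        self.base_file = dict(base_file)
        self.log = []  # cheap appends; the base file is not rewritten on write

    def upsert(self, updates: dict) -> None:
        self.log.append(dict(updates))

    def read(self) -> dict:
        merged = dict(self.base_file)
        for entry in self.log:  # later log entries win
            merged.update(entry)
        return merged

    def compact(self) -> None:
        """Fold the log into a new base file so reads become cheap again."""
        self.base_file = self.read()
        self.log.clear()


base = {"ride-1": "requested", "ride-2": "requested"}

cow_result = cow_upsert(base, {"ride-1": "completed"})

mor = MorFileGroup(base)
mor.upsert({"ride-1": "accepted"})
mor.upsert({"ride-1": "completed", "ride-3": "requested"})
mor_result = mor.read()
```

In this model every copy-on-write upsert pays for a full rewrite, while merge-on-read defers that cost to readers until `compact` runs.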

6. The common features

All three give you:

ACID via optimistic concurrency control on a single metadata commit.
MERGE INTO (upsert), DELETE, UPDATE.
Schema evolution: adding a column is safe; dropping or renaming needs care, because it relies on columns being tracked by ID rather than by name.
Time travel: VERSION AS OF or TIMESTAMP AS OF.
Partitioning that queries don't have to know about (Iceberg does this most cleanly, via hidden partitioning).
Data skipping via column min/max stats in manifests.
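
Data skipping boils down to a range-overlap test against per-file statistics. A minimal sketch, assuming one integer column and a hypothetical `FileStats` record rather than any format's real manifest schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int
    max_value: int


def files_to_scan(manifest, lo: int, hi: int):
    """Keep files whose [min, max] range overlaps the filter range [lo, hi]."""
    return [f.path for f in manifest if f.max_value >= lo and f.min_value <= hi]


manifest = [
    FileStats("part-0.parquet", min_value=1, max_value=100),
    FileStats("part-1.parquet", min_value=101, max_value=200),
    FileStats("part-2.parquet", min_value=201, max_value=300),
]

# Equivalent of: WHERE user_id BETWEEN 150 AND 220
selected = files_to_scan(manifest, 150, 220)
```

Skipping only helps when files have narrow, mostly non-overlapping ranges, which is what sorting or clustering on the filter column is meant to achieve.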

7. File layout matters

Aim for 256 MB to 1 GB Parquet files after compaction.
Avoid the "small file problem": streaming jobs that keep emitting tiny files (around 10 KB each) will tank query performance, because every file adds listing, open, and metadata overhead. Schedule a daily OPTIMIZE / compaction job.
Partition on low cardinality + frequent filter (date, country). Cluster (Z-order / sort) on high-cardinality filter (user_id, item_id).
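
A compaction job first has to decide which small files to rewrite together. Below is a rough sketch of one greedy planning strategy; it is not how any particular engine's OPTIMIZE is implemented. `plan_compaction` and the 512 MB default target are assumptions chosen to sit inside the range suggested above.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes: dict, target_bytes: int = 512 * MB):
    """Greedily pack small files into groups of at most target_bytes each."""
    groups, current, current_size = [], [], 0
    # Files already at or above the target are left alone.
    small = {p: s for p, s in file_sizes.items() if s < target_bytes}
    for path, size in sorted(small.items(), key=lambda kv: kv[1], reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a group of one file gains nothing, so drop those.
    return [g for g in groups if len(g) > 1]


files = {
    "big.parquet": 600 * MB,
    "a.parquet": 300 * MB,
    "b.parquet": 200 * MB,
    "c.parquet": 100 * MB,
    "d.parquet": 50 * MB,
}
plan = plan_compaction(files)
```

Each group in `plan` would become one rewritten file, committed as a single transaction that adds the new file and removes the old ones.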

8. ACID on object storage — the trick

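Object stores lack an atomic rename, and S3 was eventually consistent for list operations until December 2020 (it is strongly consistent now). The formats get atomic commits anyway by reducing each commit to one atomic operation:

Delta: atomic put-if-absent of the next commit file in _delta_log/; a single key write is atomic. S3 only gained conditional writes (If-None-Match) in 2024; before that, multi-writer Delta on S3 needed an external coordinator such as the DynamoDB-backed LogStore.
Iceberg: catalog (Glue/Hive/Nessie) does atomic compare-and-swap on the table's metadata pointer.
Hudi: a timeline of commit instants; a commit becomes visible when its completed-instant file is created atomically.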
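
The put-if-absent idea can be tried on a local filesystem, using exclusive file creation as a stand-in for an object store's conditional write. This is an analogy under that assumption, not how any Delta LogStore is implemented; `try_commit` is a hypothetical helper.

```python
import os
import tempfile


def try_commit(log_dir: str, version: int, payload: bytes) -> bool:
    """Create the commit file only if it does not exist yet (put-if-absent)."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return True


log_dir = tempfile.mkdtemp()
writer_a = try_commit(log_dir, 1, b'{"add": {"path": "a.parquet"}}')
writer_b = try_commit(log_dir, 1, b'{"add": {"path": "b.parquet"}}')  # loses the race
writer_b_retry = try_commit(log_dir, 2, b'{"add": {"path": "b.parquet"}}')
```

One difference from a real object store: here the file exists before its content is fully written, whereas a conditional PUT makes the whole object appear at once.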

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Concurrency model

Optimistic: read snapshot N → compute writes → try commit at snapshot N+1 → if conflicting writes detected, replay with new snapshot or fail. Works great for ELT (append-mostly); painful for high-write OLTP-style workloads.
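
The read, compute, commit, retry loop above can be sketched with a toy catalog that offers compare-and-swap on a snapshot id. `ToyCatalog` and `append_with_retry` are hypothetical; a real catalog would also check whether the concurrent commit actually conflicts with the files being written.

```python
import threading


class ToyCatalog:
    """Holds one table's current snapshot id; a swap succeeds only if unchanged."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0
        self.files = {0: frozenset()}

    def compare_and_swap(self, expected_id: int, new_files: frozenset) -> bool:
        with self._lock:
            if self.snapshot_id != expected_id:
                return False  # someone else committed since we read
            self.snapshot_id = expected_id + 1
            self.files[self.snapshot_id] = new_files
            return True


def append_with_retry(catalog: ToyCatalog, new_file: str, max_attempts: int = 5) -> int:
    """Read snapshot N, compute the write, try to commit N+1, retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        base_id = catalog.snapshot_id
        proposed = catalog.files[base_id] | {new_file}
        if catalog.compare_and_swap(base_id, proposed):
            return attempt
    raise RuntimeError("gave up after repeated commit conflicts")


catalog = ToyCatalog()
# Two writers read the same base snapshot.
base_a = base_b = catalog.snapshot_id
won = catalog.compare_and_swap(base_a, catalog.files[base_a] | {"a.parquet"})
lost = catalog.compare_and_swap(base_b, catalog.files[base_b] | {"b.parquet"})
# Writer B rebases onto the new snapshot and retries.
attempts = append_with_retry(catalog, "b.parquet")
final_files = catalog.files[catalog.snapshot_id]
```

Appends rebase cheaply, which is why this model suits ELT. Under heavy concurrent updates to the same rows, most attempts lose the swap and the retry cost dominates.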

10. Picking one

Databricks-first shop: Delta (still best supported there); increasingly, teams write Delta with UniForm enabled so Iceberg readers work too.
Multi-engine, open ecosystem: Iceberg with a REST catalog (Polaris / Nessie). This is where the 2025 momentum is.
Streaming upserts dominate: Hudi.
