data engineering · intermediate · 12 min · 2026-06-09

Lakehouse — Delta Lake, Iceberg, Hudi, ACID on Object Storage

Session 19 of the 48-session learning series.

Date: Wed, 2026-06-24 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 12 · Est. read: 2 h

Why this session matters

This is Session 19 of 48 in the Data Engineering track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

  • Why lakehouse — bridging warehouse SQL and data-lake scale
  • The three formats — Delta Lake, Apache Iceberg, Apache Hudi
  • How ACID on S3 works — manifests, snapshots, optimistic concurrency
  • Time travel, schema evolution, partition evolution
  • Choosing — vendor lock-in, ecosystem, performance trade-offs

Pre-read (skim before the session)

Deep dive

1. The lakehouse pitch

Old world: data lake (cheap S3 storage, schema-on-read, no ACID) and data warehouse (expensive proprietary storage, full ACID, fast queries). Engineers shuttled data between them.

Lakehouse: keep data in open columnar files (Parquet) on cheap object storage, add a transactional metadata layer on top. Now you get ACID, time travel, schema evolution, and warehouse-grade queries at lake prices. The format owns the metadata; any engine that speaks the spec can read/write.

2. The three formats — at a glance

| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Origin | Databricks (2017) | Netflix (2017, donated to Apache) | Uber (2017) |
| Metadata | JSON log + checkpoints | JSON table metadata + Avro manifest lists and manifests | Timeline (instants) + metadata table |
| Concurrency | Optimistic, log-based | Optimistic, snapshot-based | OCC, MVCC |
| Time travel | Yes (version or timestamp) | Yes (snapshot id or timestamp) | Yes (instant) |
| Schema evolution | Yes (add/drop/rename) | Yes (add/drop/rename/reorder) | Yes (limited reorder) |
| Partition evolution | No | Yes, with hidden partitioning | Limited (relies on clustering instead) |
| Upsert / merge | Yes (MERGE INTO) | Yes (MERGE INTO) | Yes (native focus) |
| Streaming | Yes (Spark Structured Streaming) | Yes (Flink + Spark) | Yes (built around streaming) |
| Engines | Spark, Flink, Trino, Athena, DuckDB | Spark, Flink, Trino, Snowflake, BigQuery, DuckDB | Spark, Flink, Trino |
| Vendor pull | Databricks-heavy (but OSS) | Truly multi-vendor | Onehouse-pushed |

3. How Delta Lake works

Each Delta table is a directory:

my_table/
├── _delta_log/
│   ├── 00000000000000000000.json     ← commit 0: add file A
│   ├── 00000000000000000001.json     ← commit 1: add file B
│   ├── ...
│   ├── 00000000000000000010.checkpoint.parquet  ← every 10 commits
│   └── _last_checkpoint
├── part-00000-...-snappy.parquet    ← actual data
├── part-00001-...-snappy.parquet
└── ...

Each commit is a JSON file: a list of add and remove actions (file paths, stats). To read the table, replay the commits in order, or load the latest checkpoint and replay only the commits written after it.
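
A minimal sketch of that replay, assuming each commit is a list of JSON lines carrying only add and remove actions. The file names are hypothetical, and real commits also hold protocol, metadata, and stats entries that are skipped here.

```python
import json

def replay(commits):
    """Return the set of live data files after applying commits in order."""
    live = set()
    for commit in commits:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

# Three hypothetical commits: add A, add B, then replace A with C.
commits = [
    ['{"add": {"path": "part-A.parquet"}}'],
    ['{"add": {"path": "part-B.parquet"}}'],
    ['{"remove": {"path": "part-A.parquet"}}',
     '{"add": {"path": "part-C.parquet"}}'],
]
live_files = replay(commits)
```

A checkpoint is just the result of this replay saved to disk, so a reader can start from it instead of from commit 0.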

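Writers use optimistic concurrency: read the latest version N, prepare commit N+1, and attempt to create N+1.json with a put-if-absent write. If another writer raced and N+1 already exists, re-read the winning commit, check that it does not conflict with your changes (for example, it did not remove a file you read), and retry as N+2.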
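
The retry loop can be sketched on a local filesystem, where O_EXCL file creation stands in for a put-if-absent write to object storage. The helper names, and the simple rule that two writers conflict when both remove the same file, are assumptions for illustration; real conflict detection is richer, and a real writer would also make the file contents visible atomically.

```python
import json
import os
import tempfile

def commit_path(log_dir, version):
    return os.path.join(log_dir, f"{version:020d}.json")

def try_commit(log_dir, version, actions):
    """Create the commit file only if that version does not exist yet."""
    try:
        fd = os.open(commit_path(log_dir, version),
                     os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False  # another writer already claimed this version
    with os.fdopen(fd, "w") as f:
        f.writelines(json.dumps(a) + "\n" for a in actions)
    return True

def removed_paths(actions):
    return {a["remove"]["path"] for a in actions if "remove" in a}

def commit_with_retry(log_dir, actions, read_version, max_attempts=5):
    """Commit actions prepared against read_version; return the version written."""
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # Lost the race: read the winner and check for a logical conflict.
        with open(commit_path(log_dir, version)) as f:
            winner = [json.loads(line) for line in f]
        if removed_paths(winner) & removed_paths(actions):
            raise RuntimeError(f"conflict with commit {version}")
        version += 1
    raise RuntimeError("gave up after too many retries")

log_dir = tempfile.mkdtemp()
v_first = commit_with_retry(log_dir, [{"add": {"path": "A.parquet"}}], read_version=-1)
# A second writer that also started from an empty table loses version 0
# and lands on version 1 after the conflict check passes.
v_second = commit_with_retry(log_dir, [{"add": {"path": "B.parquet"}}], read_version=-1)
```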

4. How Iceberg works

Iceberg's metadata is more layered:

table_metadata.json (root pointer)
   ↓
snapshot N
   ↓
manifest list (which manifests this snapshot includes)
   ↓
manifest files (which data files, with column stats)
   ↓
parquet files

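Each commit writes a new table metadata JSON file, and the catalog (Hive Metastore, Glue, a REST catalog, and so on) atomically swaps its pointer from the previous metadata file to the new one. If the pointer moved in the meantime, the commit fails and is retried against the new base. On HDFS, a file-system catalog can rely on atomic rename instead. Snapshot ids are opaque 64-bit integers; users time travel by snapshot id or by timestamp.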
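
The pointer swap behaves like a compare-and-swap. Below is an in-memory sketch; the InMemoryCatalog class and the metadata file names are hypothetical, and a real catalog provides the atomicity that the lock fakes here.

```python
import threading

class InMemoryCatalog:
    """Toy catalog: maps a table name to its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Move the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True

catalog = InMemoryCatalog()
created = catalog.compare_and_swap("events", None, "v1.metadata.json")

# Two writers both start from v1; only the first swap succeeds.
base = catalog.current("events")
writer_a_ok = catalog.compare_and_swap("events", base, "v2-a.metadata.json")
writer_b_ok = catalog.compare_and_swap("events", base, "v2-b.metadata.json")
```

The losing writer would re-read the current metadata, re-apply its changes on top, and try the swap again.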

Iceberg's hidden partitioning is a killer feature: partition by bucket(16, user_id) or month(ts); queries filter on user_id or ts, and Iceberg prunes partitions transparently. Queries never reference partition columns, so they keep working when the partitioning changes.
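
A rough sketch of how a month(ts) transform lets a planner prune files from a plain filter on ts. The file names and the plan_scan helper are hypothetical; the month value is encoded as months since 1970-01, which is how the Iceberg spec defines the transform.

```python
from datetime import datetime

def month_transform(ts):
    """Months elapsed since 1970-01 for the given timestamp."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

# Each data file records the partition value it was written under.
files = {
    "f1.parquet": month_transform(datetime(2026, 4, 3)),
    "f2.parquet": month_transform(datetime(2026, 5, 20)),
    "f3.parquet": month_transform(datetime(2026, 6, 8)),
}

def plan_scan(files, ts_from, ts_to):
    """Keep only files whose month can overlap the queried ts range."""
    lo, hi = month_transform(ts_from), month_transform(ts_to)
    return sorted(path for path, month in files.items() if lo <= month <= hi)

# The query only says: ts BETWEEN 2026-05-15 AND 2026-06-30.
selected = plan_scan(files, datetime(2026, 5, 15), datetime(2026, 6, 30))
```

The user never mentions a partition column; the planner derives the month range from the ts predicate.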

5. How Hudi works

Hudi was built around streaming upserts. Two table types:

  • Copy-on-Write (CoW) — rewrite affected Parquet files on each update. Slower writes, fast reads.
  • Merge-on-Read (MoR) — append delta logs (Avro), merge at read time or via compaction. Fast writes, slower reads (or run compaction).

Hudi exposes three query types:

  • Snapshot — latest, merged.
  • Incremental — give me only what changed since last instant (CDC-style — perfect for downstream ETL).
  • Read-optimised — only the latest compacted view (no merge cost).

That incremental query type is Hudi's superpower: streaming ingestion can feed downstream ETL without rescanning the whole table.
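
The three query types can be illustrated on a toy Merge-on-Read table: one compacted base file plus a log of later upserts. The record layout and the _commit field are assumptions for illustration, not Hudi's actual file format.

```python
# Base file written by commit 100, keyed by record key.
base_file = {
    "u1": {"name": "Ann", "_commit": 100},
    "u2": {"name": "Bob", "_commit": 100},
}
# Log records appended afterwards, in commit order.
log_records = [
    {"key": "u2", "name": "Bobby", "_commit": 101},
    {"key": "u3", "name": "Cy", "_commit": 102},
]

def read_optimised(base):
    """Base files only: no merge cost, but may lag behind the log."""
    return {key: rec["name"] for key, rec in base.items()}

def merged(base, log):
    """Apply log records over the base; later commits win per key."""
    out = {key: dict(rec) for key, rec in base.items()}
    for rec in log:
        out[rec["key"]] = {"name": rec["name"], "_commit": rec["_commit"]}
    return out

def snapshot(base, log):
    """Latest merged view of the table."""
    return {key: rec["name"] for key, rec in merged(base, log).items()}

def incremental(base, log, since):
    """Only records changed by commits after `since`."""
    return {key: rec["name"] for key, rec in merged(base, log).items()
            if rec["_commit"] > since}
```

Compaction would fold the log into a new base file, after which the read-optimised and snapshot views agree again.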

6. ACID on S3 — the hard parts

S3 has been strongly consistent since December 2020 (it was eventually consistent before that), but it still has no multi-object atomic writes. So the formats fake it:

  • Atomic commit = atomic write of a single metadata file (Delta's commit JSON, Iceberg's table-metadata pointer). All formats lean on this.
  • Listing race — before December 2020, an S3 LIST could miss freshly written files. LIST is now strongly consistent.
  • Compaction safety — when you compact 1000 small files into 100, you must atomically swap. All formats use "add new, remove old in one commit".
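
The compaction rule can be sketched as follows: write the merged file first, then publish one commit that adds it and removes its inputs. File names, row contents, and helper names are hypothetical.

```python
# Toy object store: file name -> rows it contains.
storage = {"small-1": [1, 2], "small-2": [3], "small-3": [4, 5]}

def compact(storage, targets, new_name):
    """Write the merged file, then return ONE commit that swaps it in."""
    storage[new_name] = [row for name in targets for row in storage[name]]
    return {"add": [new_name], "remove": list(targets)}

def apply_commit(live, commit):
    """Adds and removes take effect together, never separately."""
    return (live - set(commit["remove"])) | set(commit["add"])

def rows(storage, live):
    return sorted(row for name in live for row in storage[name])

live_before = {"small-1", "small-2", "small-3"}
commit = compact(storage, ["small-1", "small-2", "small-3"], "compacted-1")
live_after = apply_commit(live_before, commit)
```

A reader sees either the three small files or the compacted one, never a mix, and the old files stay in storage for time travel until they are vacuumed.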

7. Time travel

All three formats keep old data files (until vacuum/expire). Reads can request a past version:

SELECT * FROM events VERSION AS OF 1234;
SELECT * FROM events TIMESTAMP AS OF '2026-06-08 12:00:00';

Cost: storage keeps superseded data files until vacuum/expire removes them. Set retention based on your debugging horizon (7–30 days is typical).
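
Resolving TIMESTAMP AS OF comes down to finding the last commit at or before the requested time. A small sketch with a hypothetical commit history (timestamps are plain integers here):

```python
from bisect import bisect_right

# (commit_timestamp, version), in commit order.
history = [(1000, 0), (2000, 1), (3500, 2)]

def version_as_of(history, ts):
    """Return the newest version committed at or before ts."""
    timestamps = [commit_ts for commit_ts, _ in history]
    i = bisect_right(timestamps, ts)
    if i == 0:
        raise ValueError("timestamp is before the first commit")
    return history[i - 1][1]
```

If the resolved version has already been vacuumed, its data files are gone and the read fails, which is why retention bounds how far back you can travel.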

8. Schema evolution

Add, drop, rename columns without rewriting data:

ALTER TABLE events ADD COLUMN extra STRING;
ALTER TABLE events RENAME COLUMN old_name TO new_name;

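Iceberg tracks columns by field ID in the metadata (not by name), so renames are pure metadata changes. Delta matches columns by name by default; renaming or dropping a column requires enabling column mapping, which adds a similar ID-to-name indirection. Either way, old files stay readable under the new schema without a rewrite.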
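
A sketch of why field IDs make renames and added columns free: the reader resolves each column of the current schema by ID against whatever names the old file was written with. The dictionaries below are a simplified stand-in for real file and table metadata.

```python
# A file written before the rename: field id -> name at write time.
old_file = {
    "schema": {1: "id", 2: "old_name"},
    "rows": [{"id": 7, "old_name": "x"}],
}
# Current table schema: column 2 was renamed, column 3 was added later.
current_schema = {1: "id", 2: "new_name", 3: "extra"}

def read_with_schema(data_file, table_schema):
    """Project old rows into the current schema by matching field ids."""
    name_in_file = data_file["schema"]
    result = []
    for row in data_file["rows"]:
        result.append({
            name: row[name_in_file[field_id]] if field_id in name_in_file else None
            for field_id, name in table_schema.items()
        })
    return result

projected = read_with_schema(old_file, current_schema)
```

The renamed column picks up the old values, and the column added later reads as null for files that predate it.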

9. Partition evolution

Iceberg lets you change the partition spec without rewriting data:

ALTER TABLE events ADD PARTITION FIELD bucket(16, user_id);
ALTER TABLE events DROP PARTITION FIELD day(ts);

Old data keeps its old partition layout; new data uses the new one. Queries handle both. Delta doesn't really do this (partition columns are fixed at table creation).
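
One way to picture "queries handle both": each file remembers the spec it was written under, and the planner prunes each file using only the predicates its spec understands. This is a sketch with hypothetical file names; CRC32 is a stand-in for the Murmur3 hash that Iceberg's bucket transform actually uses.

```python
import zlib
from datetime import date

def bucket_value(user_id, buckets=16):
    # Stand-in hash: the Iceberg spec uses 32-bit Murmur3, not CRC32.
    return zlib.crc32(str(user_id).encode()) % buckets

target = bucket_value(42)
files = [
    {"path": "old-1.parquet", "spec": "day", "value": date(2026, 6, 7)},
    {"path": "old-2.parquet", "spec": "day", "value": date(2026, 6, 8)},
    {"path": "new-1.parquet", "spec": "bucket", "value": target},
    {"path": "new-2.parquet", "spec": "bucket", "value": (target + 1) % 16},
]

def plan_scan(files, user_id=None, day=None):
    """Prune each file with the predicate that matches its own spec."""
    selected = []
    for f in files:
        if f["spec"] == "day" and day is not None and f["value"] != day:
            continue
        if (f["spec"] == "bucket" and user_id is not None
                and f["value"] != bucket_value(user_id)):
            continue
        selected.append(f["path"])
    return selected
```

A filter on user_id cannot prune the day-partitioned files, so old data is scanned more broadly until it is rewritten under the new spec.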

10. Picking one

  • Databricks shop, mostly Spark → Delta. Best tooling, OPTIMIZE, Z-ORDER, MERGE INTO are mature.
  • Multi-vendor, Trino/Snowflake/BigQuery on same data → Iceberg. The format with truly broad engine support.
  • Streaming upserts core to your pipeline (CDC) → Hudi. Native incremental queries.

In 2026 the formats are converging: Iceberg is winning on cross-engine support; Delta UniForm exposes Delta tables through Iceberg metadata; Snowflake adopted Iceberg natively; BigQuery supports both Iceberg and Hudi external tables.

11. Operational concerns

  • Small files — every commit adds files; thousands of tiny files = slow queries + S3 LIST cost. Run OPTIMIZE / rewrite_data_files periodically.
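  • Manifest bloat — frequent small commits leave Iceberg with many small manifests; run rewrite_manifests, and expire_snapshots to trim old snapshots.
  • Tombstones — vacuum old files based on retention. Pitfall: setting retention too low breaks time-travel debugging and can fail long-running readers that still reference the removed files.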
  • Z-order / clustering — co-locate rows with similar values in the same files; massive query speedup for selective queries (covered in S37).
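
To make the small-files item concrete, here is a sketch of how a compaction job might group small files into rewrite batches. The greedy first-fit strategy, the 128 MB target, and the file sizes are assumptions for illustration, not what OPTIMIZE or rewrite_data_files do internally.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Group files smaller than the target into batches of at most target_mb."""
    small = sorted(
        ((size, name) for name, size in file_sizes_mb.items() if size < target_mb),
        reverse=True,
    )
    groups = []  # each entry: [total_size, [file names]]
    for size, name in small:
        for group in groups:
            if group[0] + size <= target_mb:
                group[0] += size
                group[1].append(name)
                break
        else:
            groups.append([size, [name]])
    # A group of one file would be rewritten for no benefit, so skip it.
    return [names for _, names in groups if len(names) > 1]

file_sizes_mb = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 200}
batches = plan_compaction(file_sizes_mb, target_mb=128)
```

Each batch would then be rewritten into one file and swapped in with a single commit.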

12. Cost model

Storage: $$ (S3 ~$0.023/GB-month standard). Metadata: tiny. Compute: $$$$ (the actual cost — Spark/Trino/etc.).

On storage, a lakehouse is roughly 10–50× cheaper than a warehouse. On compute it breaks even, or comes out slightly worse, for interactive queries; for batch ETL it is much cheaper.
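
A back-of-the-envelope version of the storage numbers, using the S3 price quoted above. The 50 TB table size is a hypothetical example, and the warehouse range simply applies the 10–50× ratio from the text rather than any vendor's price list.

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023  # figure quoted in this section

def monthly_storage_usd(terabytes, usd_per_gb=S3_STANDARD_USD_PER_GB_MONTH):
    """Monthly storage cost, using decimal terabytes (1 TB = 1000 GB)."""
    return terabytes * 1000 * usd_per_gb

lake_usd = monthly_storage_usd(50)   # roughly 1,150 USD per month
warehouse_low_usd = lake_usd * 10    # implied by the 10x end of the range
warehouse_high_usd = lake_usd * 50   # implied by the 50x end of the range
```

Files retained for time travel are billed at the same rate, so a long retention window shows up directly in this figure.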

Reading material

In-depth research material

Video reference

▶︎ Apache Iceberg — Ryan Blue, Architecture Deep Dive

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Design Snapshot Array

Post-session checklist

By the end of this session you should be able to:

  • Compare Delta, Iceberg, Hudi on metadata structure and engine support.
  • Explain optimistic concurrency on object storage in one paragraph.
  • Use time-travel SQL to read a table as-of yesterday.
  • Add and drop a column without rewriting Parquet data.
  • Schedule OPTIMIZE / rewrite_data_files for a high-ingest table.
  • Solve design-snapshot-array — per-index list of (snap, val) + binary search; mirrors time travel.
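
For reference, a sketch of the approach named in the last item: each index keeps a list of (snap_id, value) pairs, and get binary-searches that list. This is one common solution, written here as an illustration rather than the only accepted answer.

```python
from bisect import bisect_right

class SnapshotArray:
    def __init__(self, length):
        # Per-index history of (snap_id, value); every slot starts at 0.
        self._history = [[(0, 0)] for _ in range(length)]
        self._snap_id = 0

    def set(self, index, val):
        history = self._history[index]
        if history[-1][0] == self._snap_id:
            history[-1] = (self._snap_id, val)  # overwrite within the same snap
        else:
            history.append((self._snap_id, val))

    def snap(self):
        self._snap_id += 1
        return self._snap_id - 1

    def get(self, index, snap_id):
        history = self._history[index]
        # Last entry whose snap id is <= the requested snap id.
        i = bisect_right(history, (snap_id, float("inf"))) - 1
        return history[i][1]
```

Like a table format, it never copies the whole array per snapshot: it stores only what changed and resolves reads against the version requested.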
