Search Tech Journey

Find topics, journeys and posts

back to blog
data engineeringintermediate 12m2026-06-09

Petabyte Cost Optimisation — Compression, Partitioning, Z-Order, File Sizing

Session 37 of the 48-session learning series.

Date: Wed, 2026-07-08 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 12 · Est. read: 2 h

Why this session matters

This is Session 37 of 48 in the DE track. Petabyte-scale data is cheap to store and ruinously expensive to query badly. The difference between a 50k/monthand50k/month and 500k/month cloud bill is rarely the data volume — it's the file sizing, the partitioning, the compression choice. The unsexy decisions that compound.

Agenda

  • Storage cost vs query cost — the real money is in scans
  • Partitioning — how to pick a partition column without regret
  • File sizing — the small-file problem and how to keep it dead
  • Compression — Snappy, ZSTD, LZ4, GZIP; columnar layout matters
  • Z-Order, liquid clustering, sorted-by — the post-partition optimisations

Pre-read (skim before the session)

Deep dive

1. Where the money actually goes

In a typical cloud lakehouse for a 1 PB workload:

BucketApprox % of bill
Compute (query, transform)65–80 %
Storage (S3/ADLS/GCS)5–15 %
Egress5–15 %
Metadata services (Glue, Hive)< 5 %

The lesson: optimise compute first, storage second. Storage at 20/TB/monthfor1PBis20/TB/month for 1 PB is 20k. Compute can easily be $200k. Same workload.

2. Scan amount = cost

Every query reads N TB from storage. Per query cost ≈ scan-bytes × $ per byte. Reducing scan-bytes is your primary lever.

Three knobs:

  • Pruning — read only relevant files (partitioning + min/max statistics).
  • Projection pushdown — read only relevant columns (columnar format).
  • Predicate pushdown — apply filters at the file/row-group level.

All three need the storage layer to cooperate (Parquet/ORC + manifest stats).

3. Partitioning — get this right

A partition is a folder hierarchy on disk. s3://bucket/orders/event_date=2026-07-01/region=eu/. Files inside are filtered by the planner before any read.

Rules:

  • Partition by the column most filters use (almost always a date).
  • Cardinality 100–10,000 partitions total per table is the sweet spot.
  • Cardinality 1M+ partitions → metadata catastrophe (Hive metastore meltdown, planner overhead).
  • Don't over-partition — year/month/day/hour for cold data = millions of folders, terrible.

Bad: partition by user_id (cardinality millions). Good: partition by event_date (cardinality 365 per year). Better with Iceberg/Delta: partition by bucket(user_id, 32) — hash bucket; bounded cardinality.

4. The small-file problem

When ingestion writes one tiny file every few seconds, you get millions of 50 KB files. Symptoms:

  • Listing the partition takes minutes.
  • Scan throughput collapses (each file = one fixed cost + a tiny read).
  • Per-file metadata bloat.

Cure:

  • Compact periodicallyOPTIMIZE (Delta), rewrite-data-files (Iceberg), or a scheduled merge job.
  • Aim for 128–512 MB target file size.
  • Stream + batch writers — append small files now, compact within minutes.

5. File-size tuning

Too small: too many files, planner choked. Too big: parallelism collapses; one slow reader stalls the job.

Sweet spot for analytical queries: 128–512 MB per Parquet file. Spark default spark.sql.files.maxPartitionBytes=128M is a sane starting point.

For low-latency point lookups: smaller (32–64 MB) sometimes better.

6. Parquet — what's inside

Per file:

  • Magic header.
  • Row groups (default ~128 MB) — independently readable; have their own statistics.
  • Within row group: column chunks; within column chunks: pages.
  • Footer with metadata: schema, row group locations, statistics.

Reading is push-down friendly: planner skips entire row groups via stats, then skips pages within them. Wide schemas with hundreds of columns are cheap if you only select 3.

7. Compression choice

CodecCompression ratioDecompress speedUse
Snappy~2×very fastDefault; balanced
ZSTD~2.5×fastLong-lived warehouse data
GZIP~3×slowArchive
LZ4~2×fastestHot ingest path
LZO~1.8×very fastLegacy Hadoop
Brotli~3×slowWeb; rare in analytics

Modern default: ZSTD level 3. Excellent ratio, ~80% of Snappy speed on decompress. Saves real money on storage and egress.

8. Column ordering

Sort row groups by frequently-filtered columns. Min/max statistics become tight → pruning becomes effective.

-- Delta
OPTIMIZE orders ZORDER BY (customer_id, event_ts);

-- Iceberg
ALTER TABLE orders WRITE ORDERED BY customer_id, event_ts;

Z-Order: multi-dimensional clustering using a space-filling curve. Filters on either dimension still prune well. Useful when you have 2–4 filter columns of similar importance.

Linear sort: better for 1 dominant filter.

Liquid clustering (Databricks 2024+): automatic, learns from query patterns; less manual tuning.

9. Bloom filters and zone maps

  • Zone maps / min-max indices — built into Parquet footer. Free.
  • Bloom filters — opt-in per column; great for high-cardinality equality lookups (user_id = ?).
  • Bitmap indices — Iceberg / Hive variants; less common.

Adding bloom filters on user_id for a per-user analytics table can take a query from 30s scan to 200ms.

10. Cost monitoring

Instrument:

  • Query history with bytes-scanned per query.
  • Per-team / per-pipeline rollup.
  • Top-10 expensive queries weekly review.
  • Alert on bytes-scanned regression after a deploy.

Snowflake, Databricks, BigQuery all expose this. Surface it to query authors via comment in the SQL — -- This query scanned 4.2 TB ($2.10).

11. Tiering

  • Hot — last 30 days; standard storage; high query volume.
  • Warm — 30–365 days; lower-cost storage class (S3 IA, GCS Nearline); 1.5× retrieval cost.
  • Cold — > 1 year; archival (Glacier, Coldline); minutes-hours to thaw.

Auto-tier via lifecycle policies. Query layer should be tier-aware so warm/cold reads warn the user.

12. Vacuum and retention

Delta/Iceberg/Hudi keep old data files for time travel. Without VACUUM:

  • Storage grows monotonically forever.
  • Old files clog the manifest.

Default retention: 7 days for Delta. Set explicitly per-table based on time-travel needs (regulatory: months; ephemeral: days).

13. Reality check

A petabyte-scale cost-optimisation checklist:

  1. Partition by date with hash bucket for high-cardinality joins.
  2. Target 256 MB Parquet files; weekly OPTIMIZE.
  3. ZSTD level 3 compression.
  4. Z-Order or liquid clustering on the top-2 filter columns.
  5. Bloom filters on PK-style columns used in equality queries.
  6. Lifecycle policy: hot 30d, warm 90d, cold 1yr+.
  7. Weekly query-cost review; chargeback by team.
  8. VACUUM with 7-day retention (unless regulated).

Doing all of this on a fresh 1 PB pipeline routinely cuts compute spend 50–70%. The work pays back in the first month.

Reading material

In-depth research material

Video reference

▶︎ Petabyte-Scale Cost Optimization (Databricks)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Largest Rectangle in Histogram

Post-session checklist

By the end of this session you should be able to:

  • Explain why compute dominates a petabyte bill, not storage.
  • Pick a partition column that lands in the 100–10k cardinality sweet spot.
  • Diagnose and fix a small-file problem with OPTIMIZE / compaction.
  • Choose between Snappy / ZSTD / LZ4 by use-case.
  • Apply Z-Order / liquid clustering for multi-dimension filters.
  • Solve largest-rectangle-in-histogram — monotonic stack, same area-maximisation flavor.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.