data engineeringintermediate 12m2026-06-09

Petabyte Cost Optimisation — Compression, Partitioning, Z-Order, File Sizing

Session 37 of the 48-session learning series.

Date: Wed, 2026-07-08 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 12 · Est. read: 2 h

Why this session matters

This is Session 37 of 48 in the DE track. Petabyte-scale data is cheap to store and ruinously expensive to query badly. The difference between a $50k/month and$ 500k/month cloud bill is rarely the data volume — it's the file sizing, the partitioning, the compression choice. The unsexy decisions that compound.

Agenda

Storage cost vs query cost — the real money is in scans
Partitioning — how to pick a partition column without regret
File sizing — the small-file problem and how to keep it dead
Compression — Snappy, ZSTD, LZ4, GZIP; columnar layout matters
Z-Order, liquid clustering, sorted-by — the post-partition optimisations

Pre-read (skim before the session)

Deep dive

1. Where the money actually goes

In a typical cloud lakehouse for a 1 PB workload:

Bucket	Approx % of bill
Compute (query, transform)	65–80 %
Storage (S3/ADLS/GCS)	5–15 %
Egress	5–15 %
Metadata services (Glue, Hive)	< 5 %

The lesson: optimise compute first, storage second. Storage at $20/TB/month for 1 PB is$ 20k. Compute can easily be $200k. Same workload.

2. Scan amount = cost

Every query reads N TB from storage. Per query cost ≈ scan-bytes × $ per byte. Reducing scan-bytes is your primary lever.

Three knobs:

Pruning — read only relevant files (partitioning + min/max statistics).
Projection pushdown — read only relevant columns (columnar format).
Predicate pushdown — apply filters at the file/row-group level.

All three need the storage layer to cooperate (Parquet/ORC + manifest stats).

3. Partitioning — get this right

A partition is a folder hierarchy on disk. s3://bucket/orders/event_date=2026-07-01/region=eu/. Files inside are filtered by the planner before any read.

Rules:

Partition by the column most filters use (almost always a date).
Cardinality 100–10,000 partitions total per table is the sweet spot.
Cardinality 1M+ partitions → metadata catastrophe (Hive metastore meltdown, planner overhead).
Don't over-partition — year/month/day/hour for cold data = millions of folders, terrible.

Bad: partition by user_id (cardinality millions). Good: partition by event_date (cardinality 365 per year). Better with Iceberg/Delta: partition by bucket(user_id, 32) — hash bucket; bounded cardinality.

4. The small-file problem

When ingestion writes one tiny file every few seconds, you get millions of 50 KB files. Symptoms:

Listing the partition takes minutes.
Scan throughput collapses (each file = one fixed cost + a tiny read).
Per-file metadata bloat.

Cure:

Compact periodically — OPTIMIZE (Delta), rewrite-data-files (Iceberg), or a scheduled merge job.
Aim for 128–512 MB target file size.
Stream + batch writers — append small files now, compact within minutes.

5. File-size tuning

Too small: too many files, planner choked. Too big: parallelism collapses; one slow reader stalls the job.

Sweet spot for analytical queries: 128–512 MB per Parquet file. Spark default spark.sql.files.maxPartitionBytes=128M is a sane starting point.

For low-latency point lookups: smaller (32–64 MB) sometimes better.

6. Parquet — what's inside

Per file:

Magic header.
Row groups (default ~128 MB) — independently readable; have their own statistics.
Within row group: column chunks; within column chunks: pages.
Footer with metadata: schema, row group locations, statistics.

Reading is push-down friendly: planner skips entire row groups via stats, then skips pages within them. Wide schemas with hundreds of columns are cheap if you only select 3.

7. Compression choice

Codec	Compression ratio	Decompress speed	Use
Snappy	~2×	very fast	Default; balanced
ZSTD	~2.5×	fast	Long-lived warehouse data
GZIP	~3×	slow	Archive
LZ4	~2×	fastest	Hot ingest path
LZO	~1.8×	very fast	Legacy Hadoop
Brotli	~3×	slow	Web; rare in analytics

Modern default: ZSTD level 3. Excellent ratio, ~80% of Snappy speed on decompress. Saves real money on storage and egress.

8. Column ordering

Sort row groups by frequently-filtered columns. Min/max statistics become tight → pruning becomes effective.

-- Delta
OPTIMIZE orders ZORDER BY (customer_id, event_ts);

-- Iceberg
ALTER TABLE orders WRITE ORDERED BY customer_id, event_ts;

Z-Order: multi-dimensional clustering using a space-filling curve. Filters on either dimension still prune well. Useful when you have 2–4 filter columns of similar importance.

Linear sort: better for 1 dominant filter.

Liquid clustering (Databricks 2024+): automatic, learns from query patterns; less manual tuning.

9. Bloom filters and zone maps

Zone maps / min-max indices — built into Parquet footer. Free.
Bloom filters — opt-in per column; great for high-cardinality equality lookups (user_id = ?).
Bitmap indices — Iceberg / Hive variants; less common.

Adding bloom filters on user_id for a per-user analytics table can take a query from 30s scan to 200ms.

10. Cost monitoring

Instrument:

Query history with bytes-scanned per query.
Per-team / per-pipeline rollup.
Top-10 expensive queries weekly review.
Alert on bytes-scanned regression after a deploy.

Snowflake, Databricks, BigQuery all expose this. Surface it to query authors via comment in the SQL — -- This query scanned 4.2 TB ($2.10).

11. Tiering

Hot — last 30 days; standard storage; high query volume.
Warm — 30–365 days; lower-cost storage class (S3 IA, GCS Nearline); 1.5× retrieval cost.
Cold — > 1 year; archival (Glacier, Coldline); minutes-hours to thaw.

Auto-tier via lifecycle policies. Query layer should be tier-aware so warm/cold reads warn the user.

12. Vacuum and retention

Delta/Iceberg/Hudi keep old data files for time travel. Without VACUUM:

Storage grows monotonically forever.
Old files clog the manifest.

Default retention: 7 days for Delta. Set explicitly per-table based on time-travel needs (regulatory: months; ephemeral: days).

13. Reality check

A petabyte-scale cost-optimisation checklist:

Partition by date with hash bucket for high-cardinality joins.
Target 256 MB Parquet files; weekly OPTIMIZE.
ZSTD level 3 compression.
Z-Order or liquid clustering on the top-2 filter columns.
Bloom filters on PK-style columns used in equality queries.
Lifecycle policy: hot 30d, warm 90d, cold 1yr+.
Weekly query-cost review; chargeback by team.
VACUUM with 7-day retention (unless regulated).

Doing all of this on a fresh 1 PB pipeline routinely cuts compute spend 50–70%. The work pays back in the first month.

Link: https://leetcode.com/problems/largest-rectangle-in-histogram/
Difficulty: Hard
Why this problem: Find the maximum scan-window you can afford given file sizes — same shape as the rectangle area maximisation.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why compute dominates a petabyte bill, not storage.
Pick a partition column that lands in the 100–10k cardinality sweet spot.
Diagnose and fix a small-file problem with OPTIMIZE / compaction.
Choose between Snappy / ZSTD / LZ4 by use-case.
Apply Z-Order / liquid clustering for multi-dimension filters.
Solve largest-rectangle-in-histogram — monotonic stack, same area-maximisation flavor.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Multimodal LLMs — Vision, Language, Audio, Tool Use Combined

Feature Engineering & Feature Stores at Scale