Petabyte Cost Optimisation — Compression, Partitioning, Z-Order, File Sizing
Session 37 of the 48-session learning series.
Date: Wed, 2026-07-08 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 12 · Est. read: 2 h
Why this session matters
This is Session 37 of 48 in the DE track. Petabyte-scale data is cheap to store and ruinously expensive to query badly. The difference between a 500k/month cloud bill is rarely the data volume — it's the file sizing, the partitioning, the compression choice. The unsexy decisions that compound.
Agenda
- Storage cost vs query cost — the real money is in scans
- Partitioning — how to pick a partition column without regret
- File sizing — the small-file problem and how to keep it dead
- Compression — Snappy, ZSTD, LZ4, GZIP; columnar layout matters
- Z-Order, liquid clustering, sorted-by — the post-partition optimisations
Pre-read (skim before the session)
- Databricks — Best practices for Delta Lake
- Apache Parquet — File format spec
- Iceberg — Performance tuning
- Brooklyn Data — Lakehouse cost optimization
Deep dive
1. Where the money actually goes
In a typical cloud lakehouse for a 1 PB workload:
| Bucket | Approx % of bill |
|---|---|
| Compute (query, transform) | 65–80 % |
| Storage (S3/ADLS/GCS) | 5–15 % |
| Egress | 5–15 % |
| Metadata services (Glue, Hive) | < 5 % |
The lesson: optimise compute first, storage second. Storage at 20k. Compute can easily be $200k. Same workload.
2. Scan amount = cost
Every query reads N TB from storage. Per query cost ≈ scan-bytes × $ per byte. Reducing scan-bytes is your primary lever.
Three knobs:
- Pruning — read only relevant files (partitioning + min/max statistics).
- Projection pushdown — read only relevant columns (columnar format).
- Predicate pushdown — apply filters at the file/row-group level.
All three need the storage layer to cooperate (Parquet/ORC + manifest stats).
3. Partitioning — get this right
A partition is a folder hierarchy on disk. s3://bucket/orders/event_date=2026-07-01/region=eu/. Files inside are filtered by the planner before any read.
Rules:
- Partition by the column most filters use (almost always a date).
- Cardinality 100–10,000 partitions total per table is the sweet spot.
- Cardinality 1M+ partitions → metadata catastrophe (Hive metastore meltdown, planner overhead).
- Don't over-partition —
year/month/day/hourfor cold data = millions of folders, terrible.
Bad: partition by user_id (cardinality millions).
Good: partition by event_date (cardinality 365 per year).
Better with Iceberg/Delta: partition by bucket(user_id, 32) — hash bucket; bounded cardinality.
4. The small-file problem
When ingestion writes one tiny file every few seconds, you get millions of 50 KB files. Symptoms:
- Listing the partition takes minutes.
- Scan throughput collapses (each file = one fixed cost + a tiny read).
- Per-file metadata bloat.
Cure:
- Compact periodically —
OPTIMIZE(Delta),rewrite-data-files(Iceberg), or a scheduled merge job. - Aim for 128–512 MB target file size.
- Stream + batch writers — append small files now, compact within minutes.
5. File-size tuning
Too small: too many files, planner choked. Too big: parallelism collapses; one slow reader stalls the job.
Sweet spot for analytical queries: 128–512 MB per Parquet file. Spark default spark.sql.files.maxPartitionBytes=128M is a sane starting point.
For low-latency point lookups: smaller (32–64 MB) sometimes better.
6. Parquet — what's inside
Per file:
- Magic header.
- Row groups (default ~128 MB) — independently readable; have their own statistics.
- Within row group: column chunks; within column chunks: pages.
- Footer with metadata: schema, row group locations, statistics.
Reading is push-down friendly: planner skips entire row groups via stats, then skips pages within them. Wide schemas with hundreds of columns are cheap if you only select 3.
7. Compression choice
| Codec | Compression ratio | Decompress speed | Use |
|---|---|---|---|
| Snappy | ~2× | very fast | Default; balanced |
| ZSTD | ~2.5× | fast | Long-lived warehouse data |
| GZIP | ~3× | slow | Archive |
| LZ4 | ~2× | fastest | Hot ingest path |
| LZO | ~1.8× | very fast | Legacy Hadoop |
| Brotli | ~3× | slow | Web; rare in analytics |
Modern default: ZSTD level 3. Excellent ratio, ~80% of Snappy speed on decompress. Saves real money on storage and egress.
8. Column ordering
Sort row groups by frequently-filtered columns. Min/max statistics become tight → pruning becomes effective.
-- Delta
OPTIMIZE orders ZORDER BY (customer_id, event_ts);
-- Iceberg
ALTER TABLE orders WRITE ORDERED BY customer_id, event_ts;
Z-Order: multi-dimensional clustering using a space-filling curve. Filters on either dimension still prune well. Useful when you have 2–4 filter columns of similar importance.
Linear sort: better for 1 dominant filter.
Liquid clustering (Databricks 2024+): automatic, learns from query patterns; less manual tuning.
9. Bloom filters and zone maps
- Zone maps / min-max indices — built into Parquet footer. Free.
- Bloom filters — opt-in per column; great for high-cardinality equality lookups (
user_id = ?). - Bitmap indices — Iceberg / Hive variants; less common.
Adding bloom filters on user_id for a per-user analytics table can take a query from 30s scan to 200ms.
10. Cost monitoring
Instrument:
- Query history with bytes-scanned per query.
- Per-team / per-pipeline rollup.
- Top-10 expensive queries weekly review.
- Alert on bytes-scanned regression after a deploy.
Snowflake, Databricks, BigQuery all expose this. Surface it to query authors via comment in the SQL — -- This query scanned 4.2 TB ($2.10).
11. Tiering
- Hot — last 30 days; standard storage; high query volume.
- Warm — 30–365 days; lower-cost storage class (S3 IA, GCS Nearline); 1.5× retrieval cost.
- Cold — > 1 year; archival (Glacier, Coldline); minutes-hours to thaw.
Auto-tier via lifecycle policies. Query layer should be tier-aware so warm/cold reads warn the user.
12. Vacuum and retention
Delta/Iceberg/Hudi keep old data files for time travel. Without VACUUM:
- Storage grows monotonically forever.
- Old files clog the manifest.
Default retention: 7 days for Delta. Set explicitly per-table based on time-travel needs (regulatory: months; ephemeral: days).
13. Reality check
A petabyte-scale cost-optimisation checklist:
- Partition by date with hash bucket for high-cardinality joins.
- Target 256 MB Parquet files; weekly OPTIMIZE.
- ZSTD level 3 compression.
- Z-Order or liquid clustering on the top-2 filter columns.
- Bloom filters on PK-style columns used in equality queries.
- Lifecycle policy: hot 30d, warm 90d, cold 1yr+.
- Weekly query-cost review; chargeback by team.
- VACUUM with 7-day retention (unless regulated).
Doing all of this on a fresh 1 PB pipeline routinely cuts compute spend 50–70%. The work pays back in the first month.
Reading material
- The Parquet file format spec
- Designing Data-Intensive Applications — storage chapter
- Databricks — Best practices for Delta Lake
- Iceberg — performance tuning
In-depth research material
- Databricks — Liquid clustering
- Snowflake — Micro-partitioning whitepaper
- Z-Order curve explained
- BigQuery — query optimization patterns
Video reference
▶︎ Petabyte-Scale Cost Optimization (Databricks)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Largest Rectangle in Histogram
- Link: https://leetcode.com/problems/largest-rectangle-in-histogram/
- Difficulty: Hard
- Why this problem: Find the maximum scan-window you can afford given file sizes — same shape as the rectangle area maximisation.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why compute dominates a petabyte bill, not storage.
- Pick a partition column that lands in the 100–10k cardinality sweet spot.
- Diagnose and fix a small-file problem with OPTIMIZE / compaction.
- Choose between Snappy / ZSTD / LZ4 by use-case.
- Apply Z-Order / liquid clustering for multi-dimension filters.
- Solve
largest-rectangle-in-histogram— monotonic stack, same area-maximisation flavor.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.