data engineering · advanced · 12m · 2026-05-31

Day 02 — Apache Spark Architecture — Driver, Executors, Shuffles, Catalyst

Spark is still the workhorse for petabyte ETL and feature engineering. Understanding the execution model is the difference between a 9-minute job and a 9-hour j…

Spark looks magical from the DataFrame API but underneath it is just a scheduler over a DAG of stages, where each stage is a pipeline of narrow transformations and stage boundaries are shuffles. Knowing this map by heart is what separates "uses Spark" from "is dangerous in Spark".

🧠 Concept

Why it matters & the mental model.

1. The cluster topology

The driver holds the SparkContext, builds the logical and physical plans, and schedules and tracks tasks. Executors are long-lived JVMs that hold cached partitions and run tasks on their cores. The cluster manager allocates executors. If the driver dies, the job dies; if an executor dies, its tasks re-run elsewhere and any lost partitions are recomputed from lineage.

2. RDD lineage → logical → physical → tasks

Catalyst takes your DataFrame ops through four planning phases (not to be confused with the execution stages they eventually produce):

Parsed logical plan (just a tree, possibly invalid).
Analyzed logical plan (resolved against catalog).
Optimized logical plan (rule-based: predicate pushdown, column pruning, constant folding, join reorder).
Physical plans (cost-based via CBO if stats exist) → one chosen → split into stages at shuffle boundaries → each stage = N tasks (one per output partition).

df.explain('extended') prints all four; df.explain('formatted') prints only the physical plan, as an outline with per-node details. Read the plan for every job before you tune.
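To make "rule-based optimization" concrete, here is a minimal sketch of constant folding on a toy expression tree. It is not Catalyst's code: the Lit, Col, and Add classes and the fold_constants function are hypothetical names invented for this illustration, and the only thing it shares with Catalyst is the general idea of rewriting a tree bottom-up with a pattern-matching rule.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    """A literal value in the expression tree (hypothetical toy node)."""
    value: int


@dataclass(frozen=True)
class Col:
    """A column reference; its value is unknown at planning time."""
    name: str


@dataclass(frozen=True)
class Add:
    """Addition of two sub-expressions."""
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(expr: Expr) -> Expr:
    """Bottom-up rewrite: replace Add(Lit, Lit) with a single Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    # Literals and column references are already as simple as they get.
    return expr


# price + (2 + 3) is rewritten to price + 5 before any row is read.
plan = Add(Col("price"), Add(Lit(2), Lit(3)))
optimized = fold_constants(plan)
print(optimized)
```

The point of the sketch is that the rewrite happens once, on the plan, instead of once per row at execution time. Predicate pushdown and column pruning follow the same pattern with different match conditions.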

3. Narrow vs wide transformations

Narrow (map, filter, select, union): each input partition contributes to one output partition. No shuffle. Pipelined into one stage.
Wide (groupBy, join on non-co-partitioned keys, distinct, window): output partition depends on many input partitions → shuffle: data is hash-partitioned, written to local disk, fetched over the network. This is the single biggest cost driver.
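The difference between the two kinds of transformation can be simulated in plain Python with lists standing in for partitions. This is a toy model under stated assumptions: narrow_map, shuffle_by_key, and sum_by_key are hypothetical helpers, and zlib.crc32 is used only because it is deterministic; it is not the hash Spark uses.

```python
import zlib
from collections import defaultdict


def stable_hash(key: str) -> int:
    # Stand-in for the partitioner's hash; chosen only for determinism.
    return zlib.crc32(key.encode("utf-8"))


def narrow_map(partitions, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(row) for row in part] for part in partitions]


def shuffle_by_key(partitions, num_out):
    """Wide: every (key, value) row is routed to partition hash(key) % num_out."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[stable_hash(key) % num_out].append((key, value))
    return out


def sum_by_key(partition):
    """Per-partition aggregation, safe only after all rows for a key are co-located."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

# Narrow step: partition boundaries are untouched, so no data moves.
doubled = narrow_map(input_partitions, lambda kv: (kv[0], kv[1] * 2))

# Wide step: rows for key "a" live in all three inputs and must be brought together.
shuffled = shuffle_by_key(doubled, num_out=2)

totals = {}
for part in shuffled:
    totals.update(sum_by_key(part))
print(totals)
```

After the shuffle, each key lives in exactly one partition, which is what makes the per-partition sum correct. In a real cluster the routing loop in shuffle_by_key is the part that hits disk and network.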

🛠 Deep Dive

Internals, code, architecture.

4. The shuffle in detail

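Each map task splits its output rows into spark.sql.shuffle.partitions (default 200) groups by key and writes them to local disk as a single data file ordered by partition id, plus an index file recording where each partition's segment starts. Each reduce task fetches its segment from every map task. Two failure modes: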

Too few partitions → giant tasks, OOM, stragglers.
Too many partitions → metadata overhead, slow scheduling.

Rule of thumb: aim for 128 MB to 256 MB per shuffle partition after compression.
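The rule of thumb turns into simple arithmetic once you know the total shuffle size (the Spark UI reports shuffle write per stage). A minimal sketch, where shuffle_partitions_for is a hypothetical helper and the sizes are illustrative rather than measured:

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def shuffle_partitions_for(shuffle_bytes: int, target_bytes: int) -> int:
    """Smallest partition count that keeps every partition at or under target_bytes,
    assuming the data is spread evenly (skew breaks this assumption)."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Integer ceiling division avoids float rounding on large byte counts.
    return max(1, -(-shuffle_bytes // target_bytes))


# A 1 TiB shuffle, sized for the two ends of the 128 MB to 256 MB range.
upper = shuffle_partitions_for(1 * TIB, 128 * MIB)
lower = shuffle_partitions_for(1 * TIB, 256 * MIB)
print(f"set spark.sql.shuffle.partitions between {lower} and {upper}")
```

For a 1 TiB shuffle this lands between 4096 and 8192 partitions, far from the default of 200. The estimate assumes an even spread, so a skewed key can still produce one oversized partition.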

5. Killing the shuffle

Broadcast hash join: when one side is smaller than spark.sql.autoBroadcastJoinThreshold (default 10 MB), Spark ships it to every executor, so the large side is joined in place without being shuffled.
Bucketing: pre-shuffle both tables into N buckets on the join key at write time; subsequent joins on that key are bucket-aligned and need no exchange.
AQE (Adaptive Query Execution) in Spark 3+: coalesces small shuffle partitions at runtime, dynamically switches sort-merge → broadcast, and handles skew by splitting oversized partitions and replicating the matching partition from the other side.
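Why a broadcast join needs no shuffle is easiest to see in a toy version. In this sketch, broadcast_hash_join is a hypothetical function, lists stand in for partitions, and the sample rows are made up: the small side becomes a hash table that every partition of the large side probes locally.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner join where the small side is copied to every partition of the large side."""
    # Build phase: one hash table from the small side (this is what gets broadcast).
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    # Probe phase: each large partition is processed independently, in place.
    joined = []
    for part in large_partitions:
        out = []
        for key, left_value in part:
            for right_value in lookup.get(key, []):
                out.append((key, left_value, right_value))
        joined.append(out)
    return joined


orders = [
    [("US", 10), ("DE", 20)],
    [("US", 30), ("FR", 40)],
]
countries = [("US", "United States"), ("DE", "Germany")]

result = broadcast_hash_join(orders, countries)
for i, part in enumerate(result):
    print(i, part)
```

The output has the same number of partitions as the large input and no row ever changes partition, which is the "narrow" property. The cost is memory: the lookup table is held in full on every executor, which is why the threshold is small by default.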

6. Memory model

Each executor splits its heap into:

Reserved (300 MB system).
User memory (your UDF state).
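Spark unified memory (spark.memory.fraction, default 0.6 of the heap left after the reserved 300 MB): execution (shuffle, joins, sorts) and storage (cached DataFrames) share a pool with dynamic borrowing.

Most OOMs trace back to one of three things: a single huge partition (skew), a UDF that materializes a large object in memory on an executor, or a collect() that pulls too much data into the driver.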
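The split above is plain arithmetic, so it is worth computing for your own executor size. A sketch under stated assumptions: executor_memory_regions is a hypothetical helper, and the storage_fraction parameter reflects spark.memory.storageFraction (default 0.5), which is not mentioned above and marks the share of the unified pool that cached blocks can hold without being evicted by execution.

```python
def executor_memory_regions(heap_mb: float,
                            memory_fraction: float = 0.6,
                            storage_fraction: float = 0.5,
                            reserved_mb: float = 300.0) -> dict:
    """Approximate on-heap regions for one executor, in MB."""
    usable = heap_mb - reserved_mb
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved region")
    unified = usable * memory_fraction
    return {
        "reserved_mb": reserved_mb,
        "user_mb": usable - unified,          # UDF state, user data structures
        "unified_mb": unified,                # execution + storage, shared
        "storage_protected_mb": unified * storage_fraction,
    }


# Example: an executor started with an 8 GiB heap.
regions = executor_memory_regions(8 * 1024)
for name, size in regions.items():
    print(f"{name:>22}: {size:8.1f}")
```

An 8 GiB heap leaves roughly 4.7 GB for execution and storage combined, shared by every task running concurrently on that executor. Dividing that by the number of cores gives a rough ceiling on what a single task's partition can need before it spills or fails.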

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Skew, the silent killer

A few keys dominate (e.g. a "null" or "anonymous" user). One task processes 90% of the data while the rest sit idle. Detect it in the Spark UI: on the stage page, the maximum task duration sits far above the median. Fixes: salting (append a random salt in 1..N to the key on the skewed side and replicate the smaller side once per salt value), AQE skew join, or repartition(key).sortWithinPartitions(key) then iterate.
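Salting is easier to trust once you have watched it flatten a distribution. The simulation below is a sketch with made-up data: toy_partition is a deliberately simple deterministic hash (sum of bytes), not Spark's, salts run from 0 to N-1 rather than 1 to N, and the key names and row counts are hypothetical.

```python
import random
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8


def toy_partition(key: str) -> int:
    # Toy deterministic hash so the run is reproducible; Spark uses a different function.
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


def salt_key(key: str, salt: int) -> str:
    return f"{key}#{salt}"


rng = random.Random(0)

# Skewed side: one hot key with 9,000 rows plus 100 ordinary keys with 10 rows each.
fact_keys = ["anonymous"] * 9000 + [f"user{i}" for i in range(100) for _ in range(10)]

# Without salting, every "anonymous" row hashes to the same partition.
before = Counter(toy_partition(k) for k in fact_keys)

# With salting, each row picks a random salt, so the hot key becomes NUM_SALTS keys.
after = Counter(toy_partition(salt_key(k, rng.randrange(NUM_SALTS))) for k in fact_keys)

# Smaller side: replicate each row once per salt so every salted key still finds its match.
dim_rows = [("anonymous", "guest"), ("user1", "member")]
dim_salted = [(salt_key(k, s), v) for k, v in dim_rows for s in range(NUM_SALTS)]

print("largest partition before salting:", max(before.values()))
print("largest partition after salting: ", max(after.values()))
print("small side grew from", len(dim_rows), "to", len(dim_salted), "rows")
```

The trade-off is visible in the last line: the smaller side grows by a factor of N, so N should be just large enough to bring the hot partition down to the size of its neighbours.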

8. Catalyst CBO and stats

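ANALYZE TABLE … COMPUTE STATISTICS FOR COLUMNS … gives the optimizer cardinality and NDV (number of distinct values). With CBO switched on (spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled, both off by default) it can re-order multi-way joins for you. Without stats it falls back to estimates that are often wrong by orders of magnitude.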

9. Practical tuning checklist

Set spark.sql.shuffle.partitions to ~ 2-4 × total_cores.
Turn on AQE: spark.sql.adaptive.enabled = true (already the default since Spark 3.2).
Cache only what is reused; prefer columnar formats (Parquet / Delta).
Avoid Python UDFs on the hot path; use built-in expressions or pandas UDFs with Arrow.
For huge joins: try broadcast → bucket → skew salting in that order.
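The first two checklist items can be captured as a small starting configuration. This is a sketch, not a recommended production setup: baseline_conf is a hypothetical helper, the core count is an example, and the output uses the --conf key=value form that spark-submit accepts.

```python
def baseline_conf(total_cores: int, tasks_per_core: int = 3) -> dict:
    """Starting-point settings from the checklist; tune from here using the Spark UI."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 2 <= tasks_per_core <= 4:
        raise ValueError("the checklist suggests 2 to 4 partitions per core")
    return {
        "spark.sql.shuffle.partitions": str(total_cores * tasks_per_core),
        "spark.sql.adaptive.enabled": "true",
    }


# Example: 16 executors with 4 cores each.
conf = baseline_conf(total_cores=16 * 4)
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

With AQE on, the shuffle partition count acts as an upper bound that small partitions get coalesced down from, so erring on the high side of the 2-4 × range is usually the cheaper mistake.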

10. What to take away

Senior engineers should be able to read a physical plan, name the shuffle, predict where it'll OOM, and pick the right mitigation. Practice on a real plan today.

Practice Problem: Group Anagrams (Medium)

Part of the 28-day 2026-05-30 prep program.
