data engineeringintermediate 12m2026-06-09

Spark Part 1 — Driver, Executors, RDDs, Lazy Evaluation

Session 2 of the 48-session learning series.

Date: Thu, 2026-06-11 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 02 · Est. read: 2 h

Why this session matters

This is Session 02 of 48 in the Data Engineering track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

Cluster topology — driver, executors, cluster manager, shuffle service
RDD lineage and why Spark is lazy by default
DataFrame vs Dataset vs RDD — when each one is the right answer
Tasks, stages, jobs — the unit of work and where they get scheduled
Local hands-on — read 5 GB Parquet, inspect a physical plan

Pre-read (skim before the session)

Deep dive

1. Where Spark still wins

Spark is the workhorse for petabyte ETL and feature engineering because:

One API for batch + streaming + ML.
Runs on YARN, Kubernetes, or standalone — portable.
Catalyst (the query optimiser) gets you most of the way without hand-tuning.
It survived 10+ years and absorbed every lesson from MapReduce, Hive, and Tez.

Understanding the execution model is the difference between a 9-minute job and a 9-hour job. It's also the most common deep-dive in senior DE interviews.

2. Cluster topology

┌──────────────┐
│  Driver      │  holds SparkContext, plans queries,
│  (your code) │  schedules tasks, tracks state
└──────┬───────┘
       │
┌──────▼────────────────┐
│  Cluster Manager      │  YARN / Kubernetes / standalone
│  (allocates           │  hands out containers to the
│   executor JVMs)      │  driver on request
└──────┬────────────────┘
       │
   ┌───┴────┬────────┬────────┐
   ▼        ▼        ▼        ▼
 Executor Executor Executor Executor   (long-lived JVMs,
   1        2        3        N         hold cached partitions,
                                        run tasks on cores)
   │         │         │         │
   └─────────┴────┬────┴─────────┘
                  ▼
            shuffle service
            (writes/reads shuffle files between stages)

Driver — your main(). Death of the driver = job dies. Keep its memory tight; avoid collect() of giant DataFrames.
Executor — JVM with cores and memory. Holds cached partitions and runs tasks. Death of an executor = its tasks re-run elsewhere thanks to lineage.
Cluster manager — schedules containers. Most production shops run on YARN or k8s.
Shuffle service — external process per node that lets executors come and go without losing shuffle files (essential for dynamic allocation).

3. The RDD — lineage and laziness

An RDD (Resilient Distributed Dataset) is a plan for how to compute a partitioned collection, plus the lineage to recompute lost partitions. Two operation types:

Transformations (map, filter, join, groupByKey) — lazy. They build the DAG.
Actions (count, collect, save) — eager. They trigger execution.

This is why df.filter(...).filter(...).count() only runs once across the data — Spark fuses the filters into one pass at execution time.

4. Tasks → stages → jobs

Unit	What it is
Job	One action (e.g. `df.write.parquet(...)`).
Stage	A boundary between shuffles. Within a stage, ops are pipelined.
Task	One stage × one partition. Sent to an executor core to run.

A groupBy().count() produces 2 stages: stage 0 reads + partial aggregates locally; stage 1 (after shuffle) does the final aggregate.

5. Narrow vs wide transformations

Narrow — output partition depends on exactly one input partition. map, filter, mapPartitions, narrow select. No shuffle, no network.
Wide — output depends on multiple input partitions. groupByKey, join on un-bucketed keys, reduceByKey. Forces a shuffle — write to disk, network transfer, read back, merge.

Shuffles are 10–100× more expensive than narrow ops. Two reliable ways to flip a wide into a narrow:

Co-partition the inputs (repartition(N, key) on both sides ahead of a join — same N and same hash function).
Bucket at write time (df.write.bucketBy(N, "key")) so future joins are shuffle-free.

6. DataFrame vs Dataset vs RDD

API	Type-safe	Catalyst	Codegen	When to use
RDD	yes (in Scala)	❌	❌	only when you need fine partition control
DataFrame	no (Row)	✅	✅	default for everything in PySpark
Dataset	yes (Scala only)	✅	✅	Scala-only; lost in PySpark

In Python: just use DataFrame + SQL. RDD is now an escape hatch.

7. Catalyst — the four stages

When you write spark.sql("SELECT ...") or DataFrame ops, Catalyst walks:

Parsed logical plan — a tree, possibly invalid.
Analyzed logical plan — resolved against the catalog. Column names known.
Optimized logical plan — predicate pushdown, projection pruning, constant folding, join reordering.
Physical plan — choose the actual operator (SortMergeJoin vs BroadcastHashJoin), apply codegen.

Always check with:

df.explain("formatted")          # readable physical plan
df.explain("cost")               # with cost estimates (Spark 3+)

The "Exchange" operators in explain() are your shuffles. Count them; minimise them.

8. Local hands-on (the actual deliverable for this session)

docker run -it --rm -p 4040:4040 jupyter/pyspark-notebook

In a notebook:

from pyspark.sql import SparkSession, functions as F
spark = (SparkSession.builder
         .appName("s02").config("spark.sql.shuffle.partitions", 16)
         .getOrCreate())

# Use NYC taxi green-tripdata (free, ~150 MB/month) — fetch a year ≈ 2 GB.
df = spark.read.parquet("s3a://nyc-tlc/trip data/yellow_tripdata_2024-*.parquet")
df.printSchema()
print(df.count())

# Pretend join: aggregate by pickup zone, join to a zone lookup
zones = spark.read.csv("taxi_zone_lookup.csv", header=True)
agg = df.groupBy("PULocationID").agg(F.count("*").alias("trips"))
result = agg.join(zones, F.col("PULocationID") == F.col("LocationID"))
result.explain("formatted")
result.write.mode("overwrite").parquet("/tmp/out")

Open the Spark UI at http://localhost:4040. Look at:

The stages tab — count of shuffles.
The SQL tab — the physical plan with row counts at each operator.
The executors tab — task time vs GC time vs shuffle read/write.

9. Common gotchas

collect() on a big DataFrame → OOM on the driver. Use take(n) or write to disk.
groupByKey (RDD) — never; use reduceByKey so partial aggregation happens in the map stage.
UDF in Python — serialises every row to Python, breaks codegen. Use built-in functions or pandas UDFs.
spark.sql.shuffle.partitions = 200 (default) for a 5 GB table = too many. Tune to ~(input GB × 2) for a starting point. With AQE (Spark 3+), Spark coalesces post-shuffle for you.

10. What's next (Session 7 — Spark Part 2)

Shuffle anatomy — sort vs hash-based, bypass merge sort
Catalyst rule examples
AQE — adaptive coalescing, skew handling, join strategy switch
Tuning recipes (memory, parallelism, broadcast threshold)

Link: https://leetcode.com/problems/group-anagrams/
Difficulty: Medium
Why this problem: Sort the string or use a 26-count signature as the hash-map key.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Draw the cluster topology and name what each box does.
Define narrow vs wide transformation; give 3 examples of each.
Walk through the 4 Catalyst stages on a small SQL example.
Read a .explain('formatted') output and identify shuffle boundaries.
Tune spark.sql.shuffle.partitions for a known input size.
Pick the right API (DataFrame vs RDD) for a given workload.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Transformers Part 1 — Attention, Q/K/V, Multi-Head

URL Shortener Part 1 — Numbers, IDs, Storage