Spark Part 1 — Driver, Executors, RDDs, Lazy Evaluation
Session 2 of the 48-session learning series.
Date: Thu, 2026-06-11 · Time: 18:00–20:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 02 · Est. read: 2 h
Why this session matters
This is Session 02 of 48 in the Data Engineering track. Each session covers one focused topic, paced so you have time to absorb it rather than rush.
Agenda
- Cluster topology — driver, executors, cluster manager, shuffle service
- RDD lineage and why Spark is lazy by default
- DataFrame vs Dataset vs RDD — when each one is the right answer
- Tasks, stages, jobs — the unit of work and where they get scheduled
- Local hands-on — read 5 GB Parquet, inspect a physical plan
Pre-read (skim before the session)
- Learning Spark, 2nd ed. — Chs. 1-4 (free PDF)
- Spark Tuning Guide (official)
- Catalyst Optimizer deep dive — Databricks blog
- Anatomy of a Spark Job
Deep dive
1. Where Spark still wins
Spark is the workhorse for petabyte ETL and feature engineering because:
- One API for batch + streaming + ML.
- Runs on YARN, Kubernetes, or standalone — portable.
- Catalyst (the query optimiser) gets you most of the way without hand-tuning.
- It survived 10+ years and absorbed every lesson from MapReduce, Hive, and Tez.
Understanding the execution model is often the difference between a 9-minute job and a 9-hour job. It's also one of the most common deep-dives in senior DE interviews.
2. Cluster topology
┌──────────────┐
│    Driver    │  holds SparkContext, plans queries,
│ (your code)  │  schedules tasks, tracks state
└──────┬───────┘
       │
┌──────▼────────────────┐
│ Cluster Manager       │  YARN / Kubernetes / standalone
│ (allocates            │  hands out containers to the
│  executor JVMs)       │  driver on request
└──────┬────────────────┘
       │
   ┌───┴────┬────────┬────────┐
   ▼        ▼        ▼        ▼
Executor Executor Executor Executor   (long-lived JVMs,
   1        2        3        N        hold cached partitions,
                                       run tasks on cores)
   │        │        │        │
   └────────┴───┬────┴────────┘
                ▼
         shuffle service
(writes/reads shuffle files between stages)
- Driver — your main(). Death of the driver = job dies. Keep its memory tight; avoid collect() of giant DataFrames.
- Executor — JVM with cores and memory. Holds cached partitions and runs tasks. Death of an executor = its tasks re-run elsewhere thanks to lineage.
- Cluster manager — schedules containers. Most production shops run on YARN or k8s.
- Shuffle service — external process per node that lets executors come and go without losing shuffle files (essential for dynamic allocation).
3. The RDD — lineage and laziness
An RDD (Resilient Distributed Dataset) is a plan for how to compute a partitioned collection, plus the lineage to recompute lost partitions. Two operation types:
- Transformations (map, filter, join, groupByKey) — lazy. They build the DAG.
- Actions (count, collect, save) — eager. They trigger execution.
This is why df.filter(...).filter(...).count() scans the data only once: nothing runs until count() is called, and Spark then pipelines both filters into a single pass.
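The lazy behaviour described above can be sketched with a toy model in plain Python. This is not Spark's implementation: the class names `CountingSource` and `LazyDataset` are hypothetical, invented here for illustration. The sketch only shows that transformations record lineage without touching the data, and that one action then makes a single pass through it.

```python
class CountingSource:
    """Hypothetical data source that counts how many rows have been read."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.reads = 0

    def scan(self):
        for row in self.rows:
            self.reads += 1
            yield row


class LazyDataset:
    """Toy stand-in for an RDD: a source plus a recorded lineage of steps."""

    def __init__(self, source, lineage=()):
        self.source = source
        self.lineage = tuple(lineage)  # recorded, not executed

    def filter(self, pred):
        # Transformation: returns a new plan, reads nothing.
        return LazyDataset(self.source, self.lineage + (("filter", pred),))

    def map(self, fn):
        # Transformation: returns a new plan, reads nothing.
        return LazyDataset(self.source, self.lineage + (("map", fn),))

    def _run(self):
        # One pass over the source; every recorded step is applied per row.
        for row in self.source.scan():
            keep = True
            for kind, fn in self.lineage:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                yield row

    def count(self):
        # Action: this is the only place where data is actually read.
        return sum(1 for _ in self._run())


source = CountingSource(range(10))
plan = LazyDataset(source).filter(lambda x: x % 2 == 0).filter(lambda x: x > 3)
reads_before_action = source.reads   # still zero: only lineage was built
result = plan.count()                # rows 4, 6, 8 survive both filters
reads_after_action = source.reads    # each of the 10 rows was read once
```

Because the lineage is kept alongside the source, the same `_run()` could be replayed to rebuild a lost result, which is the idea behind recomputing lost partitions.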
4. Tasks → stages → jobs
| Unit | What it is |
|---|---|
| Job | One action (e.g. df.write.parquet(...)). |
| Stage | A set of tasks that runs between two shuffle boundaries. Within a stage, ops are pipelined. |
| Task | One stage × one partition. Sent to an executor core to run. |
A groupBy().count() produces 2 stages: stage 0 reads + partial aggregates locally; stage 1 (after shuffle) does the final aggregate.
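The two-stage shape of a grouped count can be imitated in plain Python. This is a simplified sketch, not Spark's scheduler: the function names and the sample partitions are hypothetical, and the "shuffle" is an in-memory regrouping instead of disk and network I/O.

```python
from collections import Counter, defaultdict


def stage0_partial_counts(partition):
    """One stage-0 task: count keys inside a single input partition."""
    return Counter(partition)


def shuffle(partials, num_reducers):
    """Route every (key, partial count) to a reducer chosen by hash(key)."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, n in partial.items():
            buckets[hash(key) % num_reducers][key].append(n)
    return buckets


def stage1_final_counts(bucket):
    """One stage-1 task: sum the partial counts for the keys it owns."""
    return {key: sum(ns) for key, ns in bucket.items()}


# Three input partitions of pickup-zone ids (made-up sample data).
input_partitions = [[1, 1, 2], [2, 3, 3, 3], [1, 3]]
num_reducers = 2

partials = [stage0_partial_counts(p) for p in input_partitions]  # 3 tasks
buckets = shuffle(partials, num_reducers)                        # stage boundary
final = {}
for bucket in buckets:                                           # 2 tasks
    final.update(stage1_final_counts(bucket))

# One task per partition per stage.
total_tasks = len(input_partitions) + num_reducers
```

Only the small per-partition counters cross the stage boundary, not the raw rows, which is why the partial aggregate in stage 0 keeps the shuffle cheap.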
5. Narrow vs wide transformations
- Narrow — output partition depends on exactly one input partition. map, filter, mapPartitions, narrow select. No shuffle, no network.
- Wide — output depends on multiple input partitions. groupByKey, join on un-bucketed keys, reduceByKey. Forces a shuffle — write to disk, network transfer, read back, merge.
Shuffles can be 10–100× more expensive than narrow ops. Two reliable ways to turn a wide operation into a narrow one:
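- Co-partition the inputs (repartition(N, key) on both sides ahead of a join — same N and same hash function).
- Bucket at write time (df.write.bucketBy(N, "key").saveAsTable(...)) so future joins on that key are shuffle-free. bucketBy only works together with saveAsTable, not with a plain path write.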
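Why co-partitioning removes the shuffle from a join can be seen in a small pure-Python sketch. It assumes both sides use the same partition count and the same hash function, as the first bullet requires. The helper names and the sample rows are hypothetical, and nothing here calls Spark.

```python
def hash_partition(rows, num_partitions):
    """Place each (key, value) row in the partition chosen by hash(key)."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def partition_local_join(left_parts, right_parts):
    """Join partition i of the left side only with partition i of the right."""
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        lookup = {}
        for key, value in rpart:
            lookup.setdefault(key, []).append(value)
        for key, value in lpart:
            for other in lookup.get(key, []):
                out.append((key, value, other))
    return out


def full_join(left_rows, right_rows):
    """Reference join that compares every row with every row."""
    return [(lk, lv, rv) for lk, lv in left_rows for rk, rv in right_rows if lk == rk]


# Made-up sample data: (zone id, payload).
trips = [(1, "trip-a"), (2, "trip-b"), (1, "trip-c"), (5, "trip-d")]
zones = [(1, "zone-A"), (2, "zone-B"), (3, "zone-C")]

N = 4  # same N on both sides
joined = sorted(partition_local_join(hash_partition(trips, N), hash_partition(zones, N)))
reference = sorted(full_join(trips, zones))
```

Equal keys always hash to the same partition index on both sides, so each partition pair can be joined in isolation and no row has to move between partitions at join time.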
6. DataFrame vs Dataset vs RDD
| API | Type-safe | Catalyst | Codegen | When to use |
|---|---|---|---|---|
| RDD | yes (in Scala) | ❌ | ❌ | only when you need fine partition control |
| DataFrame | no (Row) | ✅ | ✅ | default for everything in PySpark |
| Dataset | yes (Scala/Java) | ✅ | ✅ | Scala/Java only; not available in PySpark |
In Python: just use DataFrame + SQL. RDD is now an escape hatch.
7. Catalyst — the four stages
When you write spark.sql("SELECT ...") or DataFrame ops, Catalyst walks:
- Parsed logical plan — a tree, possibly invalid.
- Analyzed logical plan — resolved against the catalog. Column names known.
- Optimized logical plan — predicate pushdown, projection pruning, constant folding, join reordering.
- Physical plan — choose the actual operator (SortMergeJoin vs BroadcastHashJoin), apply codegen.
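To make one of the optimizer rules concrete, here is a toy version of constant folding on a tiny expression tree. It is a sketch of the idea only, not Catalyst's rule code: the tuple encoding `(op, left, right)`, the operator names, and the `fare` column are all assumptions made up for this example.

```python
import operator

# Hypothetical mini expression tree: a node is (op, left, right); leaves are
# numeric literals or column names given as strings.
FOLDABLE_OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(expr):
    """Bottom-up rewrite: replace an op over two literals with its value."""
    if not isinstance(expr, tuple):
        return expr  # literal or column reference, nothing to fold
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    both_literal = isinstance(left, (int, float)) and isinstance(right, (int, float))
    if both_literal and op in FOLDABLE_OPS:
        return FOLDABLE_OPS[op](left, right)
    return (op, left, right)


# Represents: fare * (1 + 2) > 60 * 2
unoptimized = ("gt", ("mul", "fare", ("add", 1, 2)), ("mul", 60, 2))
optimized = fold_constants(unoptimized)
```

The literal-only subtrees are evaluated once at planning time, so the per-row work shrinks to a single multiplication and a comparison. The subtree involving `fare` is left alone because its value is not known until execution.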
Always check with:
df.explain("formatted") # readable physical plan
df.explain("cost") # with cost estimates (Spark 3+)
The "Exchange" operators in explain() are your shuffles. Count them; minimise them.
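Counting shuffles by eye gets tedious on larger plans, so a small helper can do it on the text that explain() prints. The sketch below is plain Python and makes two assumptions: the plan is pasted in as a string, and shuffle operators appear in the plan tree as lines whose operator name is exactly Exchange. The sample plan is hand-written for illustration and is not real Spark output.

```python
import re

# Matches tree lines such as "   :  +- Exchange hashpartitioning(...)".
# BroadcastExchange and ReusedExchange do not match, and neither do the
# "(3) Exchange" detail lines that the formatted mode prints below the tree.
_EXCHANGE_LINE = re.compile(r"^[\s:+\-]*Exchange\b", re.MULTILINE)


def count_shuffles(plan_text: str) -> int:
    """Count shuffle Exchange operators in the tree part of a plan string."""
    return len(_EXCHANGE_LINE.findall(plan_text))


# Hand-written, simplified plan text (an assumption, not captured output).
SAMPLE_PLAN = """\
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [PULocationID#1], [LocationID#9], Inner, BuildRight
   :- HashAggregate(keys=[PULocationID#1], functions=[count(1)])
   :  +- Exchange hashpartitioning(PULocationID#1, 16)
   :     +- HashAggregate(keys=[PULocationID#1], functions=[partial_count(1)])
   :        +- FileScan parquet [PULocationID#1]
   +- BroadcastExchange HashedRelationBroadcastMode(...)
      +- FileScan csv [LocationID#9,Zone#11]
"""

shuffle_count = count_shuffles(SAMPLE_PLAN)
```

Broadcast exchanges are excluded on purpose: they copy a small table to every executor and do not repartition the large side, so they are usually not the shuffles worth minimising.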
8. Local hands-on (the actual deliverable for this session)
docker run -it --rm -p 4040:4040 jupyter/pyspark-notebook
In a notebook:
from pyspark.sql import SparkSession, functions as F
spark = (SparkSession.builder
.appName("s02").config("spark.sql.shuffle.partitions", 16)
.getOrCreate())
# Use NYC taxi yellow-tripdata (free, ~150 MB/month) — fetch a year ≈ 2 GB.
df = spark.read.parquet("s3a://nyc-tlc/trip data/yellow_tripdata_2024-*.parquet")
df.printSchema()
print(df.count())
# Pretend join: aggregate by pickup zone, join to a zone lookup
zones = spark.read.csv("taxi_zone_lookup.csv", header=True)
agg = df.groupBy("PULocationID").agg(F.count("*").alias("trips"))
result = agg.join(zones, F.col("PULocationID") == F.col("LocationID"))
result.explain("formatted")
result.write.mode("overwrite").parquet("/tmp/out")
Open the Spark UI at http://localhost:4040. Look at:
- The stages tab — count of shuffles.
- The SQL tab — the physical plan with row counts at each operator.
- The executors tab — task time vs GC time vs shuffle read/write.
9. Common gotchas
- collect() on a big DataFrame → OOM on the driver. Use take(n) or write to disk.
- groupByKey (RDD) — never; use reduceByKey so partial aggregation happens in the map stage.
- UDF in Python — serialises every row to Python, breaks codegen. Use built-in functions or pandas UDFs.
- spark.sql.shuffle.partitions = 200 (default) for a 5 GB table = too many. Tune to ~(input GB × 2) for a starting point. With AQE (Spark 3+), Spark coalesces post-shuffle for you.
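The partition-count rule of thumb from the last gotcha is simple arithmetic, sketched here in plain Python. The function names are hypothetical, and the 2-partitions-per-GB factor is only the starting point quoted above, not a Spark default.

```python
import math


def suggested_shuffle_partitions(input_gb: float, partitions_per_gb: float = 2.0) -> int:
    """Starting point from the rule of thumb: about input GB x 2, at least 1."""
    return max(1, math.ceil(input_gb * partitions_per_gb))


def mb_per_partition(input_gb: float, num_partitions: int) -> float:
    """Average partition size in MB if the data were spread evenly."""
    return input_gb * 1024 / num_partitions


suggested = suggested_shuffle_partitions(5)    # the 5 GB table from the text
default_size = mb_per_partition(5, 200)        # with the default of 200
tuned_size = mb_per_partition(5, suggested)    # with the suggested count
```

In a live session the result could be applied with `spark.conf.set("spark.sql.shuffle.partitions", suggested)`. Treat it as a first guess: the data that reaches the shuffle is often smaller than the input after filters and partial aggregation, so check actual partition sizes in the Spark UI.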
10. What's next (Session 7 — Spark Part 2)
- Shuffle anatomy — sort vs hash-based, bypass merge sort
- Catalyst rule examples
- AQE — adaptive coalescing, skew handling, join strategy switch
- Tuning recipes (memory, parallelism, broadcast threshold)
Reading material
- Learning Spark, 2nd ed.
- Spark: The Definitive Guide (Bill Chambers, Matei Zaharia)
- Spark official tuning guide
In-depth research material
- Resilient Distributed Datasets (Zaharia et al., 2012)
- Spark SQL: Relational Data Processing in Spark (2015)
- Catalyst Optimizer deep dive (Databricks)
- AQE in Spark 3 — Databricks blog
Video reference
▶︎ Databricks — Apache Spark Internals (Reynold Xin)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Group Anagrams
- Link: https://leetcode.com/problems/group-anagrams/
- Difficulty: Medium
- Why this problem: Sort the string or use a 26-count signature as the hash-map key.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Draw the cluster topology and name what each box does.
- Define narrow vs wide transformation; give 3 examples of each.
- Walk through the 4 Catalyst stages on a small SQL example.
- Read a .explain('formatted') output and identify shuffle boundaries.
- Tune spark.sql.shuffle.partitions for a known input size.
- Pick the right API (DataFrame vs RDD) for a given workload.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.