Streaming with Flink/Spark — Watermarks, Windows, State
Session 23 of the 48-session learning series.
Date: Sat, 2026-06-27 · Time: 14:30–16:30 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 17 · Est. read: 2 h
Why this session matters
This is Session 23 of 48 in the Data Engineering track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- Event time vs processing time — and why everyone confuses them
- Watermarks — how systems decide "time has passed"
- Windows — tumbling, sliding, session, global
- State management — keyed state, broadcast state, checkpoints, savepoints
- Flink vs Spark Structured Streaming — choosing for your workload
Pre-read (skim before the session)
- Tyler Akidau — The world beyond batch (Parts I & II)
- Flink — Concepts: Stateful Stream Processing
- Spark Structured Streaming Programming Guide
- Dataflow Model paper (Akidau et al., 2015)
Deep dive
1. Why streaming is different
Batch: bounded dataset, you wait for all of it. Streaming: unbounded, you produce answers as data arrives. The question becomes: when do I know I've seen "enough" to produce an answer?
The fundamental insight (Dataflow paper, 2015): you cannot answer this perfectly. You can only trade between completeness, latency, and cost. Watermarks are how systems pick a point on that trade-off.
2. Event time vs processing time
- Event time — when the event happened in the real world (in the payload).
- Processing time — when the streaming system observed it.
Difference matters because events are delayed by network, batching, mobile-offline, etc. A "user clicked at 12:00:01" event might arrive at 12:00:05 — or 12:30:00.
Almost always, you want event-time semantics. Processing-time aggregates are non-deterministic across replays.
3. Watermarks — "time is X-ish"
A watermark W(t) is a claim: "I believe all events with event time ≤ t have arrived." When the watermark passes the end of a window, that window can fire.
Two strategies:
- Perfect watermark — only achievable if you can characterise sources perfectly.
- Heuristic watermark — based on observed max event time minus a slack. E.g.,
W = max_seen_event_time - 5 seconds.
Late events (after watermark) get handled by allowed lateness — extend the window's lifespan and update results; eventually drop.
events: 12:00:01, 12:00:03, 12:00:02 (late!), 12:00:05
max=01 max=03 still 03 max=05
watermark = max - 2s:
-1 01 01 03
Window [12:00:00, 12:00:05) fires when watermark crosses 12:00:05 → when we see an event with ts ≥ 12:00:07 (because of 2s slack).
4. Windows
- Tumbling — fixed-size, non-overlapping.
[0,5), [5,10), [10,15). The default for periodic reports. - Sliding — fixed-size, overlapping.
[0,5), [1,6), [2,7). Rolling stats. - Session — gap-based; groups events with
\< gaptime between. User sessions. - Global — one window for the whole stream; only fires on triggers.
Trigger = "when should this window emit?". Default: at watermark. Custom: every N elements, every N seconds, on watermark + early/late, etc.
5. State management
Streaming = batch with state. Every stateful operation (counts per key, joins, top-K) carries state across events.
Keyed state — partitioned by key. count_per_user["alice"] = 42.
Operator state — non-keyed; e.g., Kafka source offsets.
Broadcast state — small reference data sent to every operator instance.
State backends:
- Memory — fast, fits in heap. Lose it on crash unless checkpointed.
- RocksDB (Flink default for big state) — on-disk, LSM. Slower per op, but TBs are fine. Snapshots to S3/HDFS.
6. Checkpoints and savepoints (Flink)
Checkpoint — automatic, triggered every N seconds. Used for failure recovery. Cleared after retention. Cheap, frequent.
Savepoint — manual, used to upgrade code or migrate state. More expensive (full).
On failure: Flink restores from last checkpoint, replays Kafka from checkpointed offsets. End-to-end exactly-once requires:
- Source replay (Kafka with consumer offsets in checkpoint).
- State recovery (checkpoint).
- Sink idempotency or transactionality (Kafka transactional producer; database upsert with primary key).
7. Spark Structured Streaming
Spark's model: streaming = infinitely growing table. Each micro-batch is a SQL query over the new rows + state.
events = (spark.readStream
.format("kafka")
.option("subscribe", "events")
.load())
per_user = (events
.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "5 minutes"), "user_id")
.count())
per_user.writeStream.format("delta").outputMode("append").start("/data/out")
Micro-batch latency floor: ~hundreds of ms to seconds. Recently added "continuous processing" mode for ms-level latency at lower throughput.
8. Flink — true streaming
Flink processes record-by-record. Latency floor: single-digit ms. State is first-class; APIs are more low-level.
DataStream<Event> events = env.addSource(new FlinkKafkaConsumer<>(...));
events
.keyBy(Event::getUserId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.reduce((a, b) -> a.merge(b))
.addSink(new FlinkKafkaProducer<>(...));
9. Choosing Flink vs Spark Structured Streaming
| Dimension | Flink | Spark SS |
|---|---|---|
| Latency floor | ~10 ms | ~100 ms (micro-batch) |
| Stateful ops | First-class, advanced | Improving, simpler API |
| SQL | Yes (mature) | Yes (very mature) |
| Batch story | Reasonable | Best-in-class |
| Operations | Dedicated cluster | Reuse Spark infra |
| Community / ecosystem | Strong in streaming | Wider |
Pick Flink if streaming is core to your business (Uber, Netflix, payment processors).
Pick Spark Structured Streaming if you already run Spark for batch and need "good enough" streaming — saves the operational tax of a second engine.
10. Common bugs
- Wall-clock for timestamp —
now()in your sink, notevent_time. Replay non-determinism. - No watermark, default null — windows never fire. Set a watermark even for processing time.
- Allowed lateness too generous — state grows forever; OOM.
- State backend mismatch — RocksDB with tiny state has overhead; switch to heap.
- Sink not idempotent — exactly-once at the engine doesn't mean exactly-once at the DB.
11. Real production numbers (a Flink job we operated)
- 500 K events/sec from Kafka.
- 5-minute tumbling window, key = (user_id, event_type).
- 50 GB state in RocksDB.
- Checkpoint every 30 s; full size ~12 GB, incremental ~200 MB.
- p99 end-to-end latency: 1.2 s.
- Hardware: 24 task managers × 8 vCPU × 32 GB.
After tuning (parallelism = 96, RocksDB block cache = 8 GB per TM, async snapshots): p99 down to 600 ms, checkpoint time halved.
Reading material
- Streaming Systems book (Akidau, Chernyak, Lax)
- Dataflow Model paper
- Flink docs — Stateful processing
- Spark Structured Streaming guide
In-depth research material
- Tyler Akidau — Streaming 101 / 102
- Flink Forward talks (YouTube)
- Watermarks in Flink — deep dive
- Kafka Streams Topology Guide
Video reference
▶︎ Flink Forward — Watermarks in Detail
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Sliding Window Maximum
- Link: https://leetcode.com/problems/sliding-window-maximum/
- Difficulty: Hard
- Why this problem: Monotonic deque of indices — exact pattern a streaming window aggregator uses.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Distinguish event time and processing time; pick one for analytics.
- Explain watermarks and allowed lateness; trace a late event.
- Pick a window type for: top-K every 5 min, session lengths, daily aggregates.
- Configure end-to-end exactly-once across Kafka → Flink → Kafka.
- Choose Flink vs Spark SS for a given workload.
- Solve
sliding-window-maximum— monotonic deque, the textbook streaming window kernel.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.