data engineeringintermediate 12m2026-06-09

Streaming with Flink/Spark — Watermarks, Windows, State

Session 23 of the 48-session learning series.

Date: Sat, 2026-06-27 · Time: 14:30–16:30 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 17 · Est. read: 2 h

Why this session matters

This is Session 23 of 48 in the Data Engineering track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

Event time vs processing time — and why everyone confuses them
Watermarks — how systems decide "time has passed"
Windows — tumbling, sliding, session, global
State management — keyed state, broadcast state, checkpoints, savepoints
Flink vs Spark Structured Streaming — choosing for your workload

Pre-read (skim before the session)

Deep dive

1. Why streaming is different

Batch: bounded dataset, you wait for all of it. Streaming: unbounded, you produce answers as data arrives. The question becomes: when do I know I've seen "enough" to produce an answer?

The fundamental insight (Dataflow paper, 2015): you cannot answer this perfectly. You can only trade between completeness, latency, and cost. Watermarks are how systems pick a point on that trade-off.

2. Event time vs processing time

Event time — when the event happened in the real world (in the payload).
Processing time — when the streaming system observed it.

Difference matters because events are delayed by network, batching, mobile-offline, etc. A "user clicked at 12:00:01" event might arrive at 12:00:05 — or 12:30:00.

Almost always, you want event-time semantics. Processing-time aggregates are non-deterministic across replays.

3. Watermarks — "time is X-ish"

A watermark W(t) is a claim: "I believe all events with event time ≤ t have arrived." When the watermark passes the end of a window, that window can fire.

Two strategies:

Perfect watermark — only achievable if you can characterise sources perfectly.
Heuristic watermark — based on observed max event time minus a slack. E.g., W = max_seen_event_time - 5 seconds.

Late events (after watermark) get handled by allowed lateness — extend the window's lifespan and update results; eventually drop.

events: 12:00:01, 12:00:03, 12:00:02 (late!), 12:00:05
        max=01    max=03    still 03      max=05
watermark = max - 2s:
            -1     01        01           03

Window [12:00:00, 12:00:05) fires when watermark crosses 12:00:05 → when we see an event with ts ≥ 12:00:07 (because of 2s slack).

4. Windows

Tumbling — fixed-size, non-overlapping. [0,5), [5,10), [10,15). The default for periodic reports.
Sliding — fixed-size, overlapping. [0,5), [1,6), [2,7). Rolling stats.
Session — gap-based; groups events with \< gap time between. User sessions.
Global — one window for the whole stream; only fires on triggers.

Trigger = "when should this window emit?". Default: at watermark. Custom: every N elements, every N seconds, on watermark + early/late, etc.

5. State management

Streaming = batch with state. Every stateful operation (counts per key, joins, top-K) carries state across events.

Keyed state — partitioned by key. count_per_user["alice"] = 42.

Operator state — non-keyed; e.g., Kafka source offsets.

Broadcast state — small reference data sent to every operator instance.

State backends:

Memory — fast, fits in heap. Lose it on crash unless checkpointed.
RocksDB (Flink default for big state) — on-disk, LSM. Slower per op, but TBs are fine. Snapshots to S3/HDFS.

6. Checkpoints and savepoints (Flink)

Checkpoint — automatic, triggered every N seconds. Used for failure recovery. Cleared after retention. Cheap, frequent.

Savepoint — manual, used to upgrade code or migrate state. More expensive (full).

On failure: Flink restores from last checkpoint, replays Kafka from checkpointed offsets. End-to-end exactly-once requires:

Source replay (Kafka with consumer offsets in checkpoint).
State recovery (checkpoint).
Sink idempotency or transactionality (Kafka transactional producer; database upsert with primary key).

7. Spark Structured Streaming

Spark's model: streaming = infinitely growing table. Each micro-batch is a SQL query over the new rows + state.

events = (spark.readStream
    .format("kafka")
    .option("subscribe", "events")
    .load())

per_user = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "user_id")
    .count())

per_user.writeStream.format("delta").outputMode("append").start("/data/out")

Micro-batch latency floor: ~hundreds of ms to seconds. Recently added "continuous processing" mode for ms-level latency at lower throughput.

8. Flink — true streaming

Flink processes record-by-record. Latency floor: single-digit ms. State is first-class; APIs are more low-level.

DataStream<Event> events = env.addSource(new FlinkKafkaConsumer<>(...));
events
  .keyBy(Event::getUserId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .reduce((a, b) -> a.merge(b))
  .addSink(new FlinkKafkaProducer<>(...));

9. Choosing Flink vs Spark Structured Streaming

Dimension	Flink	Spark SS
Latency floor	~10 ms	~100 ms (micro-batch)
Stateful ops	First-class, advanced	Improving, simpler API
SQL	Yes (mature)	Yes (very mature)
Batch story	Reasonable	Best-in-class
Operations	Dedicated cluster	Reuse Spark infra
Community / ecosystem	Strong in streaming	Wider

Pick Flink if streaming is core to your business (Uber, Netflix, payment processors).

Pick Spark Structured Streaming if you already run Spark for batch and need "good enough" streaming — saves the operational tax of a second engine.

10. Common bugs

Wall-clock for timestamp — now() in your sink, not event_time. Replay non-determinism.
No watermark, default null — windows never fire. Set a watermark even for processing time.
Allowed lateness too generous — state grows forever; OOM.
State backend mismatch — RocksDB with tiny state has overhead; switch to heap.
Sink not idempotent — exactly-once at the engine doesn't mean exactly-once at the DB.

11. Real production numbers (a Flink job we operated)

500 K events/sec from Kafka.
5-minute tumbling window, key = (user_id, event_type).
50 GB state in RocksDB.
Checkpoint every 30 s; full size ~12 GB, incremental ~200 MB.
p99 end-to-end latency: 1.2 s.
Hardware: 24 task managers × 8 vCPU × 32 GB.

After tuning (parallelism = 96, RocksDB block cache = 8 GB per TM, async snapshots): p99 down to 600 ms, checkpoint time halved.

Link: https://leetcode.com/problems/sliding-window-maximum/
Difficulty: Hard
Why this problem: Monotonic deque of indices — exact pattern a streaming window aggregator uses.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Distinguish event time and processing time; pick one for analytics.
Explain watermarks and allowed lateness; trace a late event.
Pick a window type for: top-K every 5 min, session lengths, daily aggregates.
Configure end-to-end exactly-once across Kafka → Flink → Kafka.
Choose Flink vs Spark SS for a given workload.
Solve sliding-window-maximum — monotonic deque, the textbook streaming window kernel.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Designing a Chat System — Connections, Fanout, Storage, Delivery

MLOps — Experiment Tracking, Model Registry, CI/CD for Models