Search Tech Journey

Find topics, journeys and posts

back to blog
data engineeringintermediate 12m2026-06-09

Streaming with Flink/Spark — Watermarks, Windows, State

Session 23 of the 48-session learning series.

Date: Sat, 2026-06-27 · Time: 14:30–16:30 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 17 · Est. read: 2 h

Why this session matters

This is Session 23 of 48 in the Data Engineering track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

  • Event time vs processing time — and why everyone confuses them
  • Watermarks — how systems decide "time has passed"
  • Windows — tumbling, sliding, session, global
  • State management — keyed state, broadcast state, checkpoints, savepoints
  • Flink vs Spark Structured Streaming — choosing for your workload

Pre-read (skim before the session)

Deep dive

1. Why streaming is different

Batch: bounded dataset, you wait for all of it. Streaming: unbounded, you produce answers as data arrives. The question becomes: when do I know I've seen "enough" to produce an answer?

The fundamental insight (Dataflow paper, 2015): you cannot answer this perfectly. You can only trade between completeness, latency, and cost. Watermarks are how systems pick a point on that trade-off.

2. Event time vs processing time

  • Event time — when the event happened in the real world (in the payload).
  • Processing time — when the streaming system observed it.

Difference matters because events are delayed by network, batching, mobile-offline, etc. A "user clicked at 12:00:01" event might arrive at 12:00:05 — or 12:30:00.

Almost always, you want event-time semantics. Processing-time aggregates are non-deterministic across replays.

3. Watermarks — "time is X-ish"

A watermark W(t) is a claim: "I believe all events with event time ≤ t have arrived." When the watermark passes the end of a window, that window can fire.

Two strategies:

  • Perfect watermark — only achievable if you can characterise sources perfectly.
  • Heuristic watermark — based on observed max event time minus a slack. E.g., W = max_seen_event_time - 5 seconds.

Late events (after watermark) get handled by allowed lateness — extend the window's lifespan and update results; eventually drop.

events: 12:00:01, 12:00:03, 12:00:02 (late!), 12:00:05
        max=01    max=03    still 03      max=05
watermark = max - 2s:
            -1     01        01           03

Window [12:00:00, 12:00:05) fires when watermark crosses 12:00:05 → when we see an event with ts ≥ 12:00:07 (because of 2s slack).

4. Windows

  • Tumbling — fixed-size, non-overlapping. [0,5), [5,10), [10,15). The default for periodic reports.
  • Sliding — fixed-size, overlapping. [0,5), [1,6), [2,7). Rolling stats.
  • Session — gap-based; groups events with \< gap time between. User sessions.
  • Global — one window for the whole stream; only fires on triggers.

Trigger = "when should this window emit?". Default: at watermark. Custom: every N elements, every N seconds, on watermark + early/late, etc.

5. State management

Streaming = batch with state. Every stateful operation (counts per key, joins, top-K) carries state across events.

Keyed state — partitioned by key. count_per_user["alice"] = 42.

Operator state — non-keyed; e.g., Kafka source offsets.

Broadcast state — small reference data sent to every operator instance.

State backends:

  • Memory — fast, fits in heap. Lose it on crash unless checkpointed.
  • RocksDB (Flink default for big state) — on-disk, LSM. Slower per op, but TBs are fine. Snapshots to S3/HDFS.

Checkpoint — automatic, triggered every N seconds. Used for failure recovery. Cleared after retention. Cheap, frequent.

Savepoint — manual, used to upgrade code or migrate state. More expensive (full).

On failure: Flink restores from last checkpoint, replays Kafka from checkpointed offsets. End-to-end exactly-once requires:

  1. Source replay (Kafka with consumer offsets in checkpoint).
  2. State recovery (checkpoint).
  3. Sink idempotency or transactionality (Kafka transactional producer; database upsert with primary key).

7. Spark Structured Streaming

Spark's model: streaming = infinitely growing table. Each micro-batch is a SQL query over the new rows + state.

events = (spark.readStream
    .format("kafka")
    .option("subscribe", "events")
    .load())

per_user = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "user_id")
    .count())

per_user.writeStream.format("delta").outputMode("append").start("/data/out")

Micro-batch latency floor: ~hundreds of ms to seconds. Recently added "continuous processing" mode for ms-level latency at lower throughput.

Flink processes record-by-record. Latency floor: single-digit ms. State is first-class; APIs are more low-level.

DataStream<Event> events = env.addSource(new FlinkKafkaConsumer<>(...));
events
  .keyBy(Event::getUserId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .reduce((a, b) -> a.merge(b))
  .addSink(new FlinkKafkaProducer<>(...));
DimensionFlinkSpark SS
Latency floor~10 ms~100 ms (micro-batch)
Stateful opsFirst-class, advancedImproving, simpler API
SQLYes (mature)Yes (very mature)
Batch storyReasonableBest-in-class
OperationsDedicated clusterReuse Spark infra
Community / ecosystemStrong in streamingWider

Pick Flink if streaming is core to your business (Uber, Netflix, payment processors).

Pick Spark Structured Streaming if you already run Spark for batch and need "good enough" streaming — saves the operational tax of a second engine.

10. Common bugs

  1. Wall-clock for timestampnow() in your sink, not event_time. Replay non-determinism.
  2. No watermark, default null — windows never fire. Set a watermark even for processing time.
  3. Allowed lateness too generous — state grows forever; OOM.
  4. State backend mismatch — RocksDB with tiny state has overhead; switch to heap.
  5. Sink not idempotent — exactly-once at the engine doesn't mean exactly-once at the DB.
  • 500 K events/sec from Kafka.
  • 5-minute tumbling window, key = (user_id, event_type).
  • 50 GB state in RocksDB.
  • Checkpoint every 30 s; full size ~12 GB, incremental ~200 MB.
  • p99 end-to-end latency: 1.2 s.
  • Hardware: 24 task managers × 8 vCPU × 32 GB.

After tuning (parallelism = 96, RocksDB block cache = 8 GB per TM, async snapshots): p99 down to 600 ms, checkpoint time halved.

Reading material

In-depth research material

Video reference

▶︎ Flink Forward — Watermarks in Detail

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Sliding Window Maximum

Post-session checklist

By the end of this session you should be able to:

  • Distinguish event time and processing time; pick one for analytics.
  • Explain watermarks and allowed lateness; trace a late event.
  • Pick a window type for: top-K every 5 min, session lengths, daily aggregates.
  • Configure end-to-end exactly-once across Kafka → Flink → Kafka.
  • Choose Flink vs Spark SS for a given workload.
  • Solve sliding-window-maximum — monotonic deque, the textbook streaming window kernel.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.