Search Tech Journey

Find topics, journeys and posts

back to blog
data engineeringadvanced 12m2026-06-15

Day 17 — Streaming with Flink / Spark Structured Streaming — Watermarks & Windows

Real-time analytics, fraud, IOT, personalisation — all flow through stream processors. Watermarks, late data, and exactly-once semantics are the hard parts that…

Streaming compute differs from batch in one fundamental way: the data never ends, so you must decide when to emit results. Watermarks, windows, triggers and accumulation modes are the four knobs that answer "when".

🧠 Concept

Why it matters & the mental model.

1. Event time vs processing time

  • Event time: when the event happened (in the source system).
  • Processing time: when your engine sees it. Network delays, app retries, mobile being offline → events arrive minutes/hours late. Use event time for correctness, processing time only for ops dashboards.

2. Windows

  • Tumbling: fixed, non-overlapping (every 1 min).
  • Sliding: fixed size, overlapping (1 min size, 10 s slide → 6 reports per minute).
  • Session: dynamic, closes after idle gap (great for "browsing session" analytics).
  • Global with custom triggers: full control.

3. Watermarks — the "I think I've seen everything up to T" signal

Watermark = monotonically non-decreasing timestamp the system uses to decide when to close windows. Generated heuristically: watermark = max_event_time_seen - allowed_lateness. If a late event arrives after the watermark passes its window:

  • Drop (default cheap mode).
  • Re-trigger the window (Flink "late firings"; Spark "watermark + late tolerance").
  • Side-output to a "late data" stream for manual handling.

🛠 Deep Dive

Internals, code, architecture.

4. Triggers & accumulation

  • Trigger: when to emit (at end of window, every N elements, every M seconds, on watermark).
  • Accumulation: discard / accumulate / accumulate-retract. Determines whether downstream sees deltas or full restatements.

5. State management

Streaming operators are stateful: counts, joins, deduplication, session tracking. Flink uses keyed state + RocksDB-backed snapshots to disk; Spark uses state store with HDFS/S3 checkpoints. Tune state TTL to bound size; missing TTL is the #1 cause of slow degradation.

6. Exactly-once semantics

End-to-end EOS requires:

  • Replayable source (Kafka with offsets).
  • Checkpointed state.
  • Idempotent or transactional sink (Kafka with transactional.id, JDBC with upsert).

Flink's checkpoint barriers (Chandy-Lamport) snapshot the entire DAG consistently; on failure, restore + replay since the checkpoint. Spark uses commit logs per micro-batch.

7. Joins

  • Stream-stream join with windowed bounds (e.g. impression and click within 30 min).
  • Stream-table join (enrichment with slowly-changing dim, broadcast or temporal join).
  • Interval join (Flink): elegant for fraud / matching.

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Backpressure

When downstream can't keep up, pressure must propagate up to the source (slow consume from Kafka). Flink/Spark do this natively. Don't drop unless you've architecturally decided dropping is OK (best-effort metrics).

9. Operational concerns

  • Checkpoint interval: too short = overhead, too long = big replay. Start at 1-10 min.
  • Restart strategy: fixed-delay / exponential-backoff; alert on N restarts in 1h.
  • Schema evolution: stop-and-redeploy vs. backward-compatible Avro (preferred).
  • Skew: re-key or pre-aggregate.
  • Flink: true event-by-event, lower latency (10-100 ms), rich windowing, complex state — pick for sub-second pipelines and event-time-heavy work.
  • Spark Structured Streaming: micro-batch (continuous mode is experimental), seconds latency, easy if team already runs Spark.

11. What to take away

"Design a fraud detection pipeline." Strong answers: event time, watermark with allowed lateness, session windows or sliding, stateful key per card, exactly-once sink, late-data handling, restart story.

Key points

    Resources

    Practice Problem: Sliding Window Median (Hard)