data engineeringadvanced 12m2026-06-15

Day 17 — Streaming with Flink / Spark Structured Streaming — Watermarks & Windows

Real-time analytics, fraud, IOT, personalisation — all flow through stream processors. Watermarks, late data, and exactly-once semantics are the hard parts that…

Streaming compute differs from batch in one fundamental way: the data never ends, so you must decide when to emit results. Watermarks, windows, triggers and accumulation modes are the four knobs that answer "when".

🧠 Concept

Why it matters & the mental model.

1. Event time vs processing time

Event time: when the event happened (in the source system).
Processing time: when your engine sees it. Network delays, app retries, mobile being offline → events arrive minutes/hours late. Use event time for correctness, processing time only for ops dashboards.

2. Windows

Tumbling: fixed, non-overlapping (every 1 min).
Sliding: fixed size, overlapping (1 min size, 10 s slide → 6 reports per minute).
Session: dynamic, closes after idle gap (great for "browsing session" analytics).
Global with custom triggers: full control.

3. Watermarks — the "I think I've seen everything up to T" signal

Watermark = monotonically non-decreasing timestamp the system uses to decide when to close windows. Generated heuristically: watermark = max_event_time_seen - allowed_lateness. If a late event arrives after the watermark passes its window:

Drop (default cheap mode).
Re-trigger the window (Flink "late firings"; Spark "watermark + late tolerance").
Side-output to a "late data" stream for manual handling.

🛠 Deep Dive

Internals, code, architecture.

4. Triggers & accumulation

Trigger: when to emit (at end of window, every N elements, every M seconds, on watermark).
Accumulation: discard / accumulate / accumulate-retract. Determines whether downstream sees deltas or full restatements.

5. State management

Streaming operators are stateful: counts, joins, deduplication, session tracking. Flink uses keyed state + RocksDB-backed snapshots to disk; Spark uses state store with HDFS/S3 checkpoints. Tune state TTL to bound size; missing TTL is the #1 cause of slow degradation.

6. Exactly-once semantics

End-to-end EOS requires:

Replayable source (Kafka with offsets).
Checkpointed state.
Idempotent or transactional sink (Kafka with transactional.id, JDBC with upsert).

Flink's checkpoint barriers (Chandy-Lamport) snapshot the entire DAG consistently; on failure, restore + replay since the checkpoint. Spark uses commit logs per micro-batch.

7. Joins

Stream-stream join with windowed bounds (e.g. impression and click within 30 min).
Stream-table join (enrichment with slowly-changing dim, broadcast or temporal join).
Interval join (Flink): elegant for fraud / matching.

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Backpressure

When downstream can't keep up, pressure must propagate up to the source (slow consume from Kafka). Flink/Spark do this natively. Don't drop unless you've architecturally decided dropping is OK (best-effort metrics).

9. Operational concerns

Checkpoint interval: too short = overhead, too long = big replay. Start at 1-10 min.
Restart strategy: fixed-delay / exponential-backoff; alert on N restarts in 1h.
Schema evolution: stop-and-redeploy vs. backward-compatible Avro (preferred).
Skew: re-key or pre-aggregate.

10. Flink vs Spark Streaming today

Flink: true event-by-event, lower latency (10-100 ms), rich windowing, complex state — pick for sub-second pipelines and event-time-heavy work.
Spark Structured Streaming: micro-batch (continuous mode is experimental), seconds latency, easy if team already runs Spark.

11. What to take away

"Design a fraud detection pipeline." Strong answers: event time, watermark with allowed lateness, session windows or sliding, stateful key per card, exactly-once sink, late-data handling, restart story.

Key points

Resources

Practice Problem: Sliding Window Median (Hard)

← previous

Day 16 — LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect

Day 18 — Recommender Systems — Two-Tower, Multi-Stage Ranking