Day 17 — Streaming with Flink / Spark Structured Streaming — Watermarks & Windows
Real-time analytics, fraud, IOT, personalisation — all flow through stream processors. Watermarks, late data, and exactly-once semantics are the hard parts that…
Streaming compute differs from batch in one fundamental way: the data never ends, so you must decide when to emit results. Watermarks, windows, triggers and accumulation modes are the four knobs that answer "when".
🧠 Concept
Why it matters & the mental model.
1. Event time vs processing time
- Event time: when the event happened (in the source system).
- Processing time: when your engine sees it. Network delays, app retries, mobile being offline → events arrive minutes/hours late. Use event time for correctness, processing time only for ops dashboards.
2. Windows
- Tumbling: fixed, non-overlapping (every 1 min).
- Sliding: fixed size, overlapping (1 min size, 10 s slide → 6 reports per minute).
- Session: dynamic, closes after idle gap (great for "browsing session" analytics).
- Global with custom triggers: full control.
3. Watermarks — the "I think I've seen everything up to T" signal
Watermark = monotonically non-decreasing timestamp the system uses to decide when to close windows. Generated heuristically: watermark = max_event_time_seen - allowed_lateness. If a late event arrives after the watermark passes its window:
- Drop (default cheap mode).
- Re-trigger the window (Flink "late firings"; Spark "watermark + late tolerance").
- Side-output to a "late data" stream for manual handling.
🛠 Deep Dive
Internals, code, architecture.
4. Triggers & accumulation
- Trigger: when to emit (at end of window, every N elements, every M seconds, on watermark).
- Accumulation: discard / accumulate / accumulate-retract. Determines whether downstream sees deltas or full restatements.
5. State management
Streaming operators are stateful: counts, joins, deduplication, session tracking. Flink uses keyed state + RocksDB-backed snapshots to disk; Spark uses state store with HDFS/S3 checkpoints. Tune state TTL to bound size; missing TTL is the #1 cause of slow degradation.
6. Exactly-once semantics
End-to-end EOS requires:
- Replayable source (Kafka with offsets).
- Checkpointed state.
- Idempotent or transactional sink (Kafka with
transactional.id, JDBC with upsert).
Flink's checkpoint barriers (Chandy-Lamport) snapshot the entire DAG consistently; on failure, restore + replay since the checkpoint. Spark uses commit logs per micro-batch.
7. Joins
- Stream-stream join with windowed bounds (e.g. impression and click within 30 min).
- Stream-table join (enrichment with slowly-changing dim, broadcast or temporal join).
- Interval join (Flink): elegant for fraud / matching.
🚀 In Practice
Trade-offs, exercises, what to ship today.
8. Backpressure
When downstream can't keep up, pressure must propagate up to the source (slow consume from Kafka). Flink/Spark do this natively. Don't drop unless you've architecturally decided dropping is OK (best-effort metrics).
9. Operational concerns
- Checkpoint interval: too short = overhead, too long = big replay. Start at 1-10 min.
- Restart strategy: fixed-delay / exponential-backoff; alert on N restarts in 1h.
- Schema evolution: stop-and-redeploy vs. backward-compatible Avro (preferred).
- Skew: re-key or pre-aggregate.
10. Flink vs Spark Streaming today
- Flink: true event-by-event, lower latency (10-100 ms), rich windowing, complex state — pick for sub-second pipelines and event-time-heavy work.
- Spark Structured Streaming: micro-batch (continuous mode is experimental), seconds latency, easy if team already runs Spark.
11. What to take away
"Design a fraud detection pipeline." Strong answers: event time, watermark with allowed lateness, session windows or sliding, stateful key per card, exactly-once sink, late-data handling, restart story.
Resources
- 🎥 Confluent — Streaming 101 (Tyler Akidau)
- 📖 Apache Flink — Event time and watermarks
- 📖 Spark Structured Streaming programming guide
- 📖 Streaming Systems book — Tyler Akidau
Practice Problem: Sliding Window Median (Hard)