data engineeringadvanced 12m2026-06-05

Day 07 — Apache Kafka Deep Dive — Partitions, Replication, Consumer Groups, Exactly-Once

Kafka is the universal log of modern data infra. Mastery of partition keys, consumer-group rebalancing, and EOS is the difference between a streaming system tha…

Kafka is a distributed append-only log, partitioned for parallelism and replicated for durability. Almost every advanced behaviour you hit in practice falls out of those two design choices.

A topic is a logical stream, split into N partitions. Each partition is an immutable, ordered log on disk, replicated to R brokers. Producers append by key (key → partition via hash). Consumers in a group divide the partitions among themselves; each consumer commits its own offset.

2. The order guarantee

Kafka guarantees per-partition order, never global. If user_42 must be ordered, key by user_id and every event for that user lands on the same partition. Cross-key order is not guaranteed; if you need it you must collapse to one partition (and lose parallelism).

3. Replication & ISR

Each partition has a leader (handles reads/writes) and R-1 followers. ISR = in-sync replicas, those that have caught up within replica.lag.time.max.ms. Producers can set acks:

acks=0: fire-and-forget. Lose data on broker death.
acks=1: leader acked. Lose if leader dies before replication.
acks=all: all ISR acked. Survives min.insync.replicas - 1 failures. Production default: acks=all, min.insync.replicas=2, replication.factor=3.

4. Consumer groups & rebalancing

All consumers in the same group.id share partitions; each partition is read by exactly one consumer in the group. When members come/go, the group coordinator triggers a rebalance:

Eager rebalance (legacy): everyone stops, partitions reassigned, everyone starts → "stop the world".
Cooperative incremental rebalance (KIP-429, default 3.x): only impacted partitions move, others keep consuming. Much smoother. Set partition.assignment.strategy = CooperativeStickyAssignor.

🛠 Deep Dive

Internals, code, architecture.

5. Offset management — the source of duplicates

Offsets are stored in the internal __consumer_offsets topic. The classic pitfall: auto-commit + manual processing — commit may run before your side-effect completes, so on crash you skip events. Two safe patterns:

At-least-once + idempotent sink: process, then commit; on crash, replay a few events, downstream handles duplicates (idempotency key, upsert).
Exactly-once (EOS): producer uses transactional.id, wraps send + sendOffsetsToTransaction in a transaction; consumer sets isolation.level=read_committed. End-to-end exactly-once across Kafka→processing→Kafka.

6. Producer internals

Batching (linger.ms, batch.size): wait up to linger.ms to accumulate batch.size bytes per partition before sending. Throughput vs latency knob.
Compression (compression.type=zstd): 3-5× reduction, almost free CPU on modern brokers.
Idempotent producer (enable.idempotence=true, default in 3.x): each producer gets an ID; broker dedupes on retry within a session. Different from EOS but a prerequisite.

7. Storage & retention

Each partition is a series of segment files (e.g. 1 GB each). Retention can be time-based (retention.ms), size-based (retention.bytes), or log-compacted (keep only latest value per key — great for CDC/state).

8. Performance — sequential IO is king

Kafka's speed comes from:

Append-only sequential disk writes: SSDs and HDDs both love this.
Zero-copy via sendfile: data goes disk → NIC without touching JVM heap.
Page cache: hot data served from RAM by the OS; Kafka itself uses little JVM heap.

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Common ops nightmares

Under-replicated partitions: a broker is lagging; check disk, GC, network.
Consumer lag growing: scale consumers (≤ partitions), check downstream sink, profile UDFs.
Rebalance storm: misconfigured session.timeout.ms / max.poll.interval.ms causing consumers to be marked dead → frequent rebalances → no progress. Tune so processing time + buffer < max.poll.interval.ms.
Hot partition: bad key choice (e.g. country_code with US dominating). Re-key or use composite key.

10. Schema discipline

Use a Schema Registry (Confluent / Apicurio) with Avro / Protobuf / JSON Schema. Pick a compatibility mode (BACKWARD is the safe default). This stops the inevitable "producer added a field, consumer broke" outage.

11. Streaming on top

Kafka Streams (JVM library, exactly-once, stateful joins/aggregations).
Flink (separate cluster, richer semantics, watermarks, sessions).
Spark Structured Streaming (micro-batch, easy if you're already on Spark).
ksqlDB (SQL-on-Kafka for simple transforms).

12. What to take away

"Design a streaming pipeline for X." Strong answers always specify: partition key + rationale, RF + acks, idempotency strategy, lag monitoring, and one recovery scenario. Bonus points for naming min.insync.replicas correctly.

Key points

Resources

Practice Problem: Sliding Window Maximum (Hard)

← previous

Day 06 — Retrieval-Augmented Generation (RAG) End-to-End

Day 08 — Embeddings, Vector Spaces & Contrastive Learning