Kafka Part 2 — Replication, ISR, Consumer Groups, Exactly-Once
Session 15 of the 48-session learning series.
Date: Sun, 2026-06-21 · Time: 09:00–11:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 07 · Est. read: 2 h
Why this session matters
This is Session 15 of 48 in the Data Engineering track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- Replication — leader/follower, ISR, high watermark, acks
- Consumer groups — rebalance, sticky vs cooperative, offset management
- Delivery semantics — at-most-once, at-least-once, exactly-once
- Transactions and the idempotent producer — what they actually guarantee
- Operating Kafka — unclean leader election, min.insync.replicas, retention
Pre-read (skim before the session)
- Confluent — Kafka Internals
- Confluent — Exactly-Once Semantics Are Possible
- KIP-98 — Exactly Once Delivery and Transactional Messaging
- Cooperative Rebalancing (KIP-429)
Deep dive
1. Replication primer
Each partition has one leader and zero or more followers (replicas). Writes go to the leader; followers Fetch to copy. The set of followers caught up to the leader is the ISR (In-Sync Replicas).
producer ── write ──▶ leader ── replicate ──▶ follower1
└────────▶ follower2
└────────▶ follower3 (lagging — kicked out of ISR)
A follower stays in ISR as long as it has fetched within replica.lag.time.max.ms (default 30 s). Drop behind and you're out — and writes can succeed without you.
2. The high watermark
The high watermark (HW) is the offset up to which all ISR replicas have replicated. Consumers can only read up to the HW. Why? Because anything past the HW might be lost if the leader dies before followers catch up.
offsets: 0 1 2 3 4 5 6 7 8
leader: ■ ■ ■ ■ ■ ■ ■ ■ ■
isr1: ■ ■ ■ ■ ■ ■ ■
isr2: ■ ■ ■ ■ ■ ■ ■
HW = 7 (consumers can read 0..6)
LEO = 9 (Log End Offset on leader)
3. acks — producer durability knob
acks=0 → fire and forget. Lowest latency, can lose unbounded messages.
acks=1 → leader acks once written to its log. Can lose 1 message on leader failure.
acks=all → leader acks after all ISR replicas have written. No data loss IF
min.insync.replicas ≥ 2 and ISR ≥ min.insync.replicas.
The gotcha: acks=all alone is not durable. If your ISR shrinks to just the leader (followers all lag), acks=all is effectively acks=1. Always pair with min.insync.replicas=2 (for RF=3) so the leader rejects writes when it can't guarantee.
4. min.insync.replicas — the actual durability knob
| RF | min.insync.replicas | Tolerates | Notes |
|---|---|---|---|
| 3 | 2 | 1 broker loss | Standard for production |
| 3 | 3 | 0 broker loss | Higher latency, no resilience |
| 3 | 1 | 2 broker loss | But data can be lost on leader failure |
When ISR \< min.insync.replicas, producers get NotEnoughReplicasException and writes pause. That's the desired behaviour — pause rather than silently risk data loss.
5. Unclean leader election
If all ISR replicas die, what happens? Two options:
unclean.leader.election.enable=false(default since 2.x): partition unavailable until an ISR replica returns. Data preserved.unclean.leader.election.enable=true: an out-of-sync replica becomes leader. Service restored; data loss (everything past the lagger's offset is gone).
Don't enable unclean election unless you're 100% sure your data can be re-derived.
6. Consumer groups
Multiple consumers in the same group share a topic — each partition assigned to exactly one consumer. Add consumers → throughput scales linearly up to # partitions. Beyond that, extra consumers sit idle.
topic: 6 partitions
group: 3 consumers
→ each consumer gets 2 partitions
group: 6 consumers
→ each consumer gets 1 partition
group: 9 consumers
→ 6 active, 3 idle
7. Rebalance — the operational landmine
When a consumer joins or leaves, Kafka rebalances: all consumers stop, partitions are reassigned. Two modes:
- Eager (legacy): everyone gives up all partitions, then re-acquires. Full stop-the-world.
- Cooperative (KIP-429, default since 2.4): only the moving partitions pause. Most consumers keep processing.
Pre-cooperative, a rolling deploy could cause 30+ seconds of cumulative pause. Cooperative cut that to seconds. Always use cooperative now (partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor).
8. Offsets — where you are in the log
Consumers commit their offset to __consumer_offsets (an internal Kafka topic). Two modes:
- Auto-commit (
enable.auto.commit=true, every 5 s): convenient, but commits before you've processed the messages → at-least-once where you might skip on crash. Avoid for non-trivial work. - Manual commit (
consumer.commitSync()after processing): commit after you've durably written / acked downstream. Standard for production.
9. Delivery semantics
| Mode | Producer | Consumer | Use case |
|---|---|---|---|
| At-most-once | acks=0 | auto-commit before process | Metrics, logs (loss OK) |
| At-least-once | acks=all | commit after process | Most pipelines (must be idempotent) |
| Exactly-once | idempotent producer + transactions | read-process-write in transaction | Financial, audit |
10. Idempotent producer (KIP-98)
Setting enable.idempotence=true makes the producer attach a producer ID (PID) and sequence number to each record. The broker dedupes any retried record with the same (PID, seq). This eliminates the classic "retry caused duplicate" bug.
Default since 3.0. Always on. Free correctness improvement.
11. Transactions
producer = KafkaProducer(transactional_id="payments-tx-1", enable_idempotence=True)
producer.init_transactions()
producer.begin_transaction()
producer.send("orders", order_msg)
producer.send("payments", payment_msg)
producer.send_offsets_to_transaction(consumer.position(), consumer_group)
producer.commit_transaction() # or abort_transaction() on error
Transactions provide atomic multi-partition writes and read-process-write exactly-once when the consumer reads in read_committed mode (skips aborted records). They cost 5–20% throughput. Worth it for billing, ledgers, audit pipelines.
12. Retention
log.retention.hours=168 # 7 days
log.retention.bytes=-1 # unlimited
log.segment.bytes=1073741824 # 1 GB segments
log.cleanup.policy=delete # or 'compact' for keyed topics
Compacted topics keep only the latest value per key (key=customer_id, value=current profile) — a Kafka topic doubles as a changelog / KV store. State stores in Kafka Streams are built on compacted topics.
13. Operating in production
- Monitor: under-replicated partitions, ISR shrinks, consumer lag, request latency.
- Capacity: aim for <70% disk usage; partitions sized so retention fits.
- Partition count: 100–1000 per broker; too many = high controller overhead, slow leader election.
- JVM tuning: G1GC, ~25% heap, rest for OS page cache (Kafka leans on page cache hard).
- Network: 10 Gbps minimum for serious throughput; producers are usually network-bound.
Reading material
- Confluent — Kafka Internals docs
- Kafka — The Definitive Guide, 2nd ed. (Narkhede et al., O'Reilly)
- KIP-98 — Exactly-once
- KIP-429 — Cooperative rebalancing
In-depth research material
- Exactly-Once Semantics Are Possible (Confluent blog)
- Kafka KRaft mode (KIP-500)
- Jay Kreps — The Log: What every software engineer should know
- Jepsen analysis of Kafka
Video reference
▶︎ Confluent — Kafka Internals (Tim Berglund)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Hit Counter
- Link: https://leetcode.com/problems/design-hit-counter/
- Difficulty: Medium
- Why this problem: Bucketed queue by second; sum last 300 buckets for 5-min window.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain the ISR + high-watermark mechanism with a diagram.
- State the difference between
acks=allandacks=all + min.insync.replicas=2. - Describe cooperative rebalancing and why it beats eager.
- Implement at-least-once consumer logic with manual offset commits.
- Explain what transactions guarantee (and what they don't).
- Solve
design-hit-counter— bucketed time window, same primitive as Kafka's log segment trimming.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.