Change Data Capture — Debezium, Outbox Pattern, Snapshot+Stream
Session 40 of the 48-session learning series.
Date: Sat, 2026-07-11 · Time: 09:00–11:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 07 · Est. read: 2 h
Why this session matters
This is Session 40 of 48 in the DE track. CDC is how you keep two systems in sync without polling — the connective tissue between OLTP and OLAP, between microservices, between primary and search/cache. Get it right and your data lake stays seconds-fresh; get it wrong and you lose updates.
Agenda
- What CDC is and why polling is the wrong answer
- WAL/binlog mining — how Debezium really works
- Snapshot + stream — bootstrapping correctness
- Outbox pattern — atomic publish-with-write
- Schema evolution and watermarks for CDC streams
Pre-read (skim before the session)
- Debezium architecture overview
- Microservices.io — Transactional Outbox pattern
- Martin Kleppmann — Turning the database inside out
- Confluent — Streaming ETL with Kafka Connect
Deep dive
1. The problem CDC solves
You have a Postgres OLTP. You want:
- A Snowflake replica updated within seconds.
- A search index (Elastic / OpenSearch) updated within seconds.
- An event stream to drive ML features.
- An audit log of every change for compliance.
Bad options:
SELECT * WHERE updated_at > last_poll— misses deletes; misses non-updated_at columns; polling load on prod.- Application dual-write — write to DB and to Kafka in the same handler. Two-phase consistency problem.
CDC option: read the DB's write-ahead log (WAL). Every committed change is there, in order, with full row diffs.
2. How log mining works
Every transactional DB has a sequential log of every change for crash recovery:
- Postgres: WAL.
- MySQL: binary log.
- SQL Server: transaction log.
- Mongo: oplog.
A CDC tool reads the log:
- Decodes each entry into
{op: INSERT|UPDATE|DELETE, before: {...}, after: {...}}. - Publishes to a downstream sink (Kafka, S3, anywhere).
- Tracks position (LSN, GTID) for resume after crash.
It's exactly-once on the producer side (the DB committed it), at-least-once on the sink side (you must dedup).
3. Debezium — the open-source workhorse
Architecture:
- Kafka Connect framework + Debezium connectors.
- One connector per DB → reads log → emits per-table Kafka topic.
- Schema Registry holds Avro/JSON schema; auto-evolves.
[ Postgres WAL ]
↓
[ Debezium Postgres connector (Kafka Connect worker) ]
↓
[ Kafka: pg.public.users topic ]
↓
[ consumers: warehouse, search index, ML pipeline, ... ]
Each consumer is independent. Add new ones without touching the source.
4. Snapshot + stream
The cold-start problem: the WAL doesn't contain history before you turned CDC on.
Solution:
- Snapshot phase —
SELECT * FROM tablefor each table, publish each row as if it were an INSERT, in some boundary transaction. - Stream phase — switch to log mining from the LSN captured at snapshot start.
- Order matters: snapshot must finish before stream catches up, or you get out-of-order writes.
Modern variants:
- Incremental snapshots (Debezium 1.5+) — snapshot a chunk at a time, interleaved with stream. No long lock; resumable.
- Backfill from warehouse + stream from now — if the warehouse already has history, skip the snapshot.
5. The outbox pattern
Problem: when a service writes a business event to DB and needs to publish to Kafka, the two are not atomic. Crash between them = inconsistency.
Solution: write to an outbox table in the same DB transaction as the business write. A CDC poll/connector reads the outbox and publishes to Kafka, then marks as published (or deletes).
BEGIN;
INSERT INTO orders(...);
INSERT INTO outbox(event_type, payload, status) VALUES ('OrderCreated', '{...}', 'PENDING');
COMMIT;
Separate process:
for row in outbox.select_pending():
kafka.publish(topic=row.type, value=row.payload)
outbox.mark_published(row.id)
CDC the outbox table with Debezium → no need for the custom poller. Best of both.
Trade-off: outbox grows; needs periodic cleanup.
6. Schema evolution
The DB schema changes. The CDC stream must:
- Detect (Debezium does, via DDL events in some sources).
- Emit a new schema version (Avro forward/backward compat).
- Consumers handle the change (add new optional fields = no-op; drop field = consumers ignore or fail).
Rules:
- Only add columns, never drop, in CDC-fed schemas.
- Always nullable defaults.
- Versioned schema registry; deprecate slowly.
- Mark breaking changes; require consumer acknowledgement before producing.
7. Watermarks and lateness
CDC events have:
- DB commit time (source timestamp).
- WAL position (logical clock).
- Sink ingestion time.
If consumers do windowed aggregations (S23), pick a watermark policy:
event_time = source commit time— most correct for OLAP.- Late events arrive after watermark; either drop or trigger restate.
For dedup: use (source_table, source_pk, LSN) as the unique key.
8. Performance and back-pressure
A busy OLTP can produce 100k+ row changes/sec. Connector throughput must keep up or:
- WAL grows unbounded → DB out of disk.
- Lag grows → downstream stale.
Patterns:
- Keep connector workers near the DB (low network latency).
- Filter at the connector — don't ship every column / every table; ship what's needed.
- Multiple connectors per DB (per schema, per high-traffic table).
- Monitor
wal_lag_bytes— alert at 1 GB / N minutes behind.
9. Deletes are subtle
CDC delete events contain before row but after = null. Sinks must handle:
- Warehouse: soft delete with
deleted_atcolumn? hard delete? - Search index: explicit delete by ID.
- Aggregations: must subtract old contribution.
Hard deletes are easy to get wrong. Default to soft deletes in downstream stores for safety.
10. Consistency levels
- Eventually consistent — downstream lags source by seconds-minutes; ok for analytics.
- Read-your-writes — caller writes, then immediately reads from downstream; downstream might not have it yet. Solve with: read from source for that key; or sync wait until LSN reached.
- Strong consistency across sinks — impossible without 2PC across systems. Don't promise this.
Be honest about consistency in API docs.
11. Failure recovery
The CDC connector restarts:
- Reads its last committed offset (LSN).
- Resumes from there.
- DB must retain WAL beyond that LSN (configure retention!).
If you lose the offset state: re-snapshot. Cost: minutes to hours depending on data size.
Always store offset state durably (Kafka Connect uses a Kafka topic itself).
12. Reality check
CDC stack for a new project:
- Postgres (or MySQL) as OLTP.
- Debezium → Kafka.
- Outbox table for explicit business events.
- Per-domain consumers: warehouse loader, search index updater, ML feature stream.
- Schema Registry (Confluent or Apicurio).
- Lag dashboards + alerts.
CDC isn't the simplest pattern. It earns its complexity when you have 3+ downstream consumers of the same data. Below that, just dual-write carefully and move on.
Reading material
- Debezium docs
- Designing Data-Intensive Applications — chapter on logs and CDC
- Confluent — CDC patterns
- microservices.io — Outbox pattern
In-depth research material
- Martin Kleppmann — Turning the database inside out
- Materialize — Streaming CDC use cases
- Postgres logical replication docs
- Schema Registry — Avro compatibility
Video reference
▶︎ Change Data Capture Deep Dive (Tim Berglund)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Event Emitter
- Link: https://leetcode.com/problems/event-emitter/
- Difficulty: Medium
- Why this problem: Build a publish/subscribe primitive in code — the same shape Debezium implements over a DB log.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why polling is the wrong way to sync changes.
- Describe WAL/binlog log mining at a high level.
- Sketch the snapshot + stream bootstrap correctly.
- Implement the outbox pattern in a service.
- Handle schema evolution and deletes safely in CDC sinks.
- Solve
event-emitter— subscribe/publish/unsubscribe; CDC's interface in miniature.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.