data engineeringintermediate 12m2026-06-09

Change Data Capture — Debezium, Outbox Pattern, Snapshot+Stream

Session 40 of the 48-session learning series.

Date: Sat, 2026-07-11 · Time: 09:00–11:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 07 · Est. read: 2 h

Why this session matters

This is Session 40 of 48 in the DE track. CDC is how you keep two systems in sync without polling — the connective tissue between OLTP and OLAP, between microservices, between primary and search/cache. Get it right and your data lake stays seconds-fresh; get it wrong and you lose updates.

Agenda

What CDC is and why polling is the wrong answer
WAL/binlog mining — how Debezium really works
Snapshot + stream — bootstrapping correctness
Outbox pattern — atomic publish-with-write
Schema evolution and watermarks for CDC streams

Pre-read (skim before the session)

Deep dive

1. The problem CDC solves

You have a Postgres OLTP. You want:

A Snowflake replica updated within seconds.
A search index (Elastic / OpenSearch) updated within seconds.
An event stream to drive ML features.
An audit log of every change for compliance.

Bad options:

SELECT * WHERE updated_at > last_poll — misses deletes; misses non-updated_at columns; polling load on prod.
Application dual-write — write to DB and to Kafka in the same handler. Two-phase consistency problem.

CDC option: read the DB's write-ahead log (WAL). Every committed change is there, in order, with full row diffs.

2. How log mining works

Every transactional DB has a sequential log of every change for crash recovery:

Postgres: WAL.
MySQL: binary log.
SQL Server: transaction log.
Mongo: oplog.

A CDC tool reads the log:

Decodes each entry into {op: INSERT|UPDATE|DELETE, before: {...}, after: {...}}.
Publishes to a downstream sink (Kafka, S3, anywhere).
Tracks position (LSN, GTID) for resume after crash.

It's exactly-once on the producer side (the DB committed it), at-least-once on the sink side (you must dedup).

3. Debezium — the open-source workhorse

Architecture:

Kafka Connect framework + Debezium connectors.
One connector per DB → reads log → emits per-table Kafka topic.
Schema Registry holds Avro/JSON schema; auto-evolves.

[ Postgres WAL ]
      ↓
[ Debezium Postgres connector (Kafka Connect worker) ]
      ↓
[ Kafka: pg.public.users topic ]
      ↓
[ consumers: warehouse, search index, ML pipeline, ... ]

Each consumer is independent. Add new ones without touching the source.

4. Snapshot + stream

The cold-start problem: the WAL doesn't contain history before you turned CDC on.

Solution:

Snapshot phase — SELECT * FROM table for each table, publish each row as if it were an INSERT, in some boundary transaction.
Stream phase — switch to log mining from the LSN captured at snapshot start.
Order matters: snapshot must finish before stream catches up, or you get out-of-order writes.

Modern variants:

Incremental snapshots (Debezium 1.5+) — snapshot a chunk at a time, interleaved with stream. No long lock; resumable.
Backfill from warehouse + stream from now — if the warehouse already has history, skip the snapshot.

5. The outbox pattern

Problem: when a service writes a business event to DB and needs to publish to Kafka, the two are not atomic. Crash between them = inconsistency.

Solution: write to an outbox table in the same DB transaction as the business write. A CDC poll/connector reads the outbox and publishes to Kafka, then marks as published (or deletes).

BEGIN;
  INSERT INTO orders(...);
  INSERT INTO outbox(event_type, payload, status) VALUES ('OrderCreated', '{...}', 'PENDING');
COMMIT;

Separate process:

for row in outbox.select_pending():
    kafka.publish(topic=row.type, value=row.payload)
    outbox.mark_published(row.id)

CDC the outbox table with Debezium → no need for the custom poller. Best of both.

Trade-off: outbox grows; needs periodic cleanup.

6. Schema evolution

The DB schema changes. The CDC stream must:

Detect (Debezium does, via DDL events in some sources).
Emit a new schema version (Avro forward/backward compat).
Consumers handle the change (add new optional fields = no-op; drop field = consumers ignore or fail).

Rules:

Only add columns, never drop, in CDC-fed schemas.
Always nullable defaults.
Versioned schema registry; deprecate slowly.
Mark breaking changes; require consumer acknowledgement before producing.

7. Watermarks and lateness

CDC events have:

DB commit time (source timestamp).
WAL position (logical clock).
Sink ingestion time.

If consumers do windowed aggregations (S23), pick a watermark policy:

event_time = source commit time — most correct for OLAP.
Late events arrive after watermark; either drop or trigger restate.

For dedup: use (source_table, source_pk, LSN) as the unique key.

8. Performance and back-pressure

A busy OLTP can produce 100k+ row changes/sec. Connector throughput must keep up or:

WAL grows unbounded → DB out of disk.
Lag grows → downstream stale.

Patterns:

Keep connector workers near the DB (low network latency).
Filter at the connector — don't ship every column / every table; ship what's needed.
Multiple connectors per DB (per schema, per high-traffic table).
Monitor wal_lag_bytes — alert at 1 GB / N minutes behind.

9. Deletes are subtle

CDC delete events contain before row but after = null. Sinks must handle:

Warehouse: soft delete with deleted_at column? hard delete?
Search index: explicit delete by ID.
Aggregations: must subtract old contribution.

Hard deletes are easy to get wrong. Default to soft deletes in downstream stores for safety.

10. Consistency levels

Eventually consistent — downstream lags source by seconds-minutes; ok for analytics.
Read-your-writes — caller writes, then immediately reads from downstream; downstream might not have it yet. Solve with: read from source for that key; or sync wait until LSN reached.
Strong consistency across sinks — impossible without 2PC across systems. Don't promise this.

Be honest about consistency in API docs.

11. Failure recovery

The CDC connector restarts:

Reads its last committed offset (LSN).
Resumes from there.
DB must retain WAL beyond that LSN (configure retention!).

If you lose the offset state: re-snapshot. Cost: minutes to hours depending on data size.

Always store offset state durably (Kafka Connect uses a Kafka topic itself).

12. Reality check

CDC stack for a new project:

Postgres (or MySQL) as OLTP.
Debezium → Kafka.
Outbox table for explicit business events.
Per-domain consumers: warehouse loader, search index updater, ML feature stream.
Schema Registry (Confluent or Apicurio).
Lag dashboards + alerts.

CDC isn't the simplest pattern. It earns its complexity when you have 3+ downstream consumers of the same data. Below that, just dual-write carefully and move on.

Link: https://leetcode.com/problems/event-emitter/
Difficulty: Medium
Why this problem: Build a publish/subscribe primitive in code — the same shape Debezium implements over a DB log.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why polling is the wrong way to sync changes.
Describe WAL/binlog log mining at a high level.
Sketch the snapshot + stream bootstrap correctly.
Implement the outbox pattern in a service.
Handle schema evolution and deletes safely in CDC sinks.
Solve event-emitter — subscribe/publish/unsubscribe; CDC's interface in miniature.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Designing a Distributed Job Queue — Reliability, Backoff, Idempotency

Testing, Mocks, Property-Based Tests, Mutation Testing