Search Tech Journey

Find topics, journeys and posts

back to blog
data engineeringintermediate 12m2026-06-09

Change Data Capture — Debezium, Outbox Pattern, Snapshot+Stream

Session 40 of the 48-session learning series.

Date: Sat, 2026-07-11 · Time: 09:00–11:00 IST · Track: 🗂️ Data Engineering (DE) · Parent 28-day topic: Day 07 · Est. read: 2 h

Why this session matters

This is Session 40 of 48 in the DE track. CDC is how you keep two systems in sync without polling — the connective tissue between OLTP and OLAP, between microservices, between primary and search/cache. Get it right and your data lake stays seconds-fresh; get it wrong and you lose updates.

Agenda

  • What CDC is and why polling is the wrong answer
  • WAL/binlog mining — how Debezium really works
  • Snapshot + stream — bootstrapping correctness
  • Outbox pattern — atomic publish-with-write
  • Schema evolution and watermarks for CDC streams

Pre-read (skim before the session)

Deep dive

1. The problem CDC solves

You have a Postgres OLTP. You want:

  • A Snowflake replica updated within seconds.
  • A search index (Elastic / OpenSearch) updated within seconds.
  • An event stream to drive ML features.
  • An audit log of every change for compliance.

Bad options:

  • SELECT * WHERE updated_at > last_poll — misses deletes; misses non-updated_at columns; polling load on prod.
  • Application dual-write — write to DB and to Kafka in the same handler. Two-phase consistency problem.

CDC option: read the DB's write-ahead log (WAL). Every committed change is there, in order, with full row diffs.

2. How log mining works

Every transactional DB has a sequential log of every change for crash recovery:

  • Postgres: WAL.
  • MySQL: binary log.
  • SQL Server: transaction log.
  • Mongo: oplog.

A CDC tool reads the log:

  • Decodes each entry into {op: INSERT|UPDATE|DELETE, before: {...}, after: {...}}.
  • Publishes to a downstream sink (Kafka, S3, anywhere).
  • Tracks position (LSN, GTID) for resume after crash.

It's exactly-once on the producer side (the DB committed it), at-least-once on the sink side (you must dedup).

3. Debezium — the open-source workhorse

Architecture:

  • Kafka Connect framework + Debezium connectors.
  • One connector per DB → reads log → emits per-table Kafka topic.
  • Schema Registry holds Avro/JSON schema; auto-evolves.
[ Postgres WAL ]
      ↓
[ Debezium Postgres connector (Kafka Connect worker) ]
      ↓
[ Kafka: pg.public.users topic ]
      ↓
[ consumers: warehouse, search index, ML pipeline, ... ]

Each consumer is independent. Add new ones without touching the source.

4. Snapshot + stream

The cold-start problem: the WAL doesn't contain history before you turned CDC on.

Solution:

  1. Snapshot phaseSELECT * FROM table for each table, publish each row as if it were an INSERT, in some boundary transaction.
  2. Stream phase — switch to log mining from the LSN captured at snapshot start.
  3. Order matters: snapshot must finish before stream catches up, or you get out-of-order writes.

Modern variants:

  • Incremental snapshots (Debezium 1.5+) — snapshot a chunk at a time, interleaved with stream. No long lock; resumable.
  • Backfill from warehouse + stream from now — if the warehouse already has history, skip the snapshot.

5. The outbox pattern

Problem: when a service writes a business event to DB and needs to publish to Kafka, the two are not atomic. Crash between them = inconsistency.

Solution: write to an outbox table in the same DB transaction as the business write. A CDC poll/connector reads the outbox and publishes to Kafka, then marks as published (or deletes).

BEGIN;
  INSERT INTO orders(...);
  INSERT INTO outbox(event_type, payload, status) VALUES ('OrderCreated', '{...}', 'PENDING');
COMMIT;

Separate process:

for row in outbox.select_pending():
    kafka.publish(topic=row.type, value=row.payload)
    outbox.mark_published(row.id)

CDC the outbox table with Debezium → no need for the custom poller. Best of both.

Trade-off: outbox grows; needs periodic cleanup.

6. Schema evolution

The DB schema changes. The CDC stream must:

  • Detect (Debezium does, via DDL events in some sources).
  • Emit a new schema version (Avro forward/backward compat).
  • Consumers handle the change (add new optional fields = no-op; drop field = consumers ignore or fail).

Rules:

  • Only add columns, never drop, in CDC-fed schemas.
  • Always nullable defaults.
  • Versioned schema registry; deprecate slowly.
  • Mark breaking changes; require consumer acknowledgement before producing.

7. Watermarks and lateness

CDC events have:

  • DB commit time (source timestamp).
  • WAL position (logical clock).
  • Sink ingestion time.

If consumers do windowed aggregations (S23), pick a watermark policy:

  • event_time = source commit time — most correct for OLAP.
  • Late events arrive after watermark; either drop or trigger restate.

For dedup: use (source_table, source_pk, LSN) as the unique key.

8. Performance and back-pressure

A busy OLTP can produce 100k+ row changes/sec. Connector throughput must keep up or:

  • WAL grows unbounded → DB out of disk.
  • Lag grows → downstream stale.

Patterns:

  • Keep connector workers near the DB (low network latency).
  • Filter at the connector — don't ship every column / every table; ship what's needed.
  • Multiple connectors per DB (per schema, per high-traffic table).
  • Monitor wal_lag_bytes — alert at 1 GB / N minutes behind.

9. Deletes are subtle

CDC delete events contain before row but after = null. Sinks must handle:

  • Warehouse: soft delete with deleted_at column? hard delete?
  • Search index: explicit delete by ID.
  • Aggregations: must subtract old contribution.

Hard deletes are easy to get wrong. Default to soft deletes in downstream stores for safety.

10. Consistency levels

  • Eventually consistent — downstream lags source by seconds-minutes; ok for analytics.
  • Read-your-writes — caller writes, then immediately reads from downstream; downstream might not have it yet. Solve with: read from source for that key; or sync wait until LSN reached.
  • Strong consistency across sinks — impossible without 2PC across systems. Don't promise this.

Be honest about consistency in API docs.

11. Failure recovery

The CDC connector restarts:

  • Reads its last committed offset (LSN).
  • Resumes from there.
  • DB must retain WAL beyond that LSN (configure retention!).

If you lose the offset state: re-snapshot. Cost: minutes to hours depending on data size.

Always store offset state durably (Kafka Connect uses a Kafka topic itself).

12. Reality check

CDC stack for a new project:

  • Postgres (or MySQL) as OLTP.
  • Debezium → Kafka.
  • Outbox table for explicit business events.
  • Per-domain consumers: warehouse loader, search index updater, ML feature stream.
  • Schema Registry (Confluent or Apicurio).
  • Lag dashboards + alerts.

CDC isn't the simplest pattern. It earns its complexity when you have 3+ downstream consumers of the same data. Below that, just dual-write carefully and move on.

Reading material

In-depth research material

Video reference

▶︎ Change Data Capture Deep Dive (Tim Berglund)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Event Emitter

  • Link: https://leetcode.com/problems/event-emitter/
  • Difficulty: Medium
  • Why this problem: Build a publish/subscribe primitive in code — the same shape Debezium implements over a DB log.
  • Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

  • Explain why polling is the wrong way to sync changes.
  • Describe WAL/binlog log mining at a high level.
  • Sketch the snapshot + stream bootstrap correctly.
  • Implement the outbox pattern in a service.
  • Handle schema evolution and deletes safely in CDC sinks.
  • Solve event-emitter — subscribe/publish/unsubscribe; CDC's interface in miniature.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.