← all lessons
System Design · Consistency & Replication ·

Change Data Capture: Streaming Every Write Without Touching Your Application Code

Core concept
When multiple systems need to react to changes in a database — a search index, a cache, an analytics pipeline — the naïve approach is to write to all of them in your application code. This creates tight coupling and silent inconsistency whenever one write succeeds and another fails. Change Data Capture (CDC) takes a different route: it reads the database's own internal changelog (the same append-only log the database uses for crash recovery) and turns every insert, update, and delete into an event stream that downstream consumers can subscribe to independently. Because the source of truth is the database log itself — not your application — you can add new consumers without changing a single line of business logic.

Diagram

flowchart LR
    A[App] -->|writes| B[Primary DB]
    B -->|replication log| C[CDC Reader]
    C -->|event stream| D[Search Index]
    C -->|event stream| E[Cache Invalidator]
    C -->|event stream| F[Analytics Pipeline]

Concrete real-world example
Imagine an e-commerce platform where a product's price changes in the primary PostgreSQL (a popular open-source relational database) table. Without CDC, the application must also update Elasticsearch (a distributed full-text search engine) and invalidate a Redis (an in-memory key-value store) cache in the same request. If Elasticsearch is temporarily down, the price stays stale in search results indefinitely. With CDC — using a tool like Debezium (an open-source CDC connector that reads database logs) — the price change is captured from PostgreSQL's WAL (write-ahead log: append-only record of every DB change) and published to a Kafka (a distributed, high-throughput event streaming platform) topic. Elasticsearch and Redis each consume that topic at their own pace, and if either is down it simply replays from where it left off when it recovers.

One trade-off / gotcha
CDC gives you at-least-once delivery (each event arrives one or more times, never zero), not exactly-once. If your consumer crashes mid-processing, it will re-read and re-apply the same event on restart. Consumers must therefore be idempotent (safe to apply the same operation multiple times without changing the outcome beyond the first application) — updating a row to price = 49.99 twice is fine; incrementing a counter twice is not.

An interview-style question to ponder
Your team wants to populate a brand-new search index from an existing database that already has 50 million rows, then keep it in sync going forward using CDC. What ordering problem do you have to solve, and how would you approach it?

Stuck? Show a hint

Think about what happens in time between the moment you start the bulk snapshot and the moment CDC begins delivering live events — there's a gap where both are running simultaneously.

Show answer

You need to ensure the snapshot and the CDC stream stitch together without gaps or duplicates, which means capturing a log position before the snapshot starts.

  • The core problem is a race: if you snapshot first and then start CDC, any writes that happened during the snapshot are already baked into the database pages you copied but also appear as CDC events. If you start CDC after the snapshot finishes, you miss writes that happened during the copy entirely.
  • The solution is to record the exact log sequence number (LSN — a monotonically increasing pointer into the database's changelog) at the moment the snapshot begins, then start consuming CDC events from that exact LSN once the bulk load completes. Events whose changes are already reflected in the snapshot data are applied again (idempotently), and no events are skipped.
  • Concretely: begin a consistent read transaction in PostgreSQL, note its LSN (e.g., 0/16B3748), stream all 50 million rows into Elasticsearch, commit, then tell your CDC consumer "start from LSN 0/16B3748." You'll re-apply a few thousand updates redundantly, but nothing is lost.
  • But why not just truncate and re-snapshot periodically instead of running CDC at all? For 50 million rows, a full re-index takes hours and puts heavy read load on the primary database, meaning your search index is always hours stale; CDC keeps lag in the low seconds at a fraction of the I/O cost.
  • Watch out: if your database log is not retained long enough (PostgreSQL's wal_keep_size setting controls this), the LSN you need may be garbage-collected before your bulk load finishes — always size log retention to exceed your worst-case snapshot duration.