← System Design Data & Messaging Systems
System Design

Change Data Capture

Eliminate dual-write drift by using log-based CDC to capture database mutations asynchronously.

TL;DR
  • Eliminate dual-write drift by using log-based CDC to capture database mutations asynchronously.
  • Read directly from the database transaction log to avoid the performance penalties of trigger-based or query-polling approaches.
  • Decouple database schemas from downstream consumers by translating raw mutations into structured events using a schema registry.
  • Monitor replication lag and WAL disk usage to prevent database outages when downstream consumers stall.

The Problem

When applications attempt to update a primary database and write to a search index or cache in a single transaction block (dual-writes), network partitions or application crashes inevitably cause data drift. If the database write succeeds but the cache update fails, the systems become permanently inconsistent.

Conversely, polling the database with SELECT queries to find updated rows degrades database performance, misses intermediate updates (deletes and rapid updates), and introduces significant latency.

Core System Idea

Change Data Capture (CDC) solves this by treating the database's internal write-ahead log (WAL) or transaction log as a reliable event source. Because every database mutation (INSERT, UPDATE, DELETE) must be written to the log before being applied to the storage engine, a log-based CDC engine can read these mutations asynchronously without impacting query execution performance.

The CDC engine parses the raw binary log, extracts the before-and-after states of the mutated rows, and publishes these changes as structured events to a message broker.

Downstream systems (caches, search indexes, data warehouses) consume these events to update their state. This guarantees eventual consistency without dual-writes, as the transaction log acts as the single source of truth.

System Flow

flowchart TD A["Application Client"] -->|"Write Transaction"| B["Primary Database"] B -->|"Write-Ahead Log"| C["Database Engine"] D["CDC Engine"] -->|"Tail Binary Log"| C D -->|"Emit Mutation Event"| E["Schema Registry"] D -->|"Publish Event"| F["Message Broker"] F -->|"Consume and Update"| G["Elasticsearch / Cache"]

Log-based Change Data Capture pipeline extracting database mutations and streaming them to downstream search and cache systems.

Real-World Examples Indicative

Debezium at Shopify — binlog with 24hr WAL retention

Shopify runs Debezium connectors against MySQL binlog for their Order management system. binlog_retention_hours is set to 24 hours to guard against connector downtime. Each connector reads binlog via READ_COMMITTED snapshot isolation, capturing every INSERT/UPDATE/DELETE on the orders, line_items, and inventory_levels tables. Events are published to Kafka with schemas registered in Confluent Schema Registry using Avro. During Shopify's 2021 platform migration, the CDC pipeline allowed engineers to run old and new database schemas in parallel for 6 weeks by publishing to two separate consumer groups simultaneously — with zero application code changes.

Netflix DBLog — chunk-by-chunk snapshot without read locks

Netflix built DBLog to solve a specific problem: snapshotting a running database without acquiring a read lock. DBLog reads a window of primary keys (e.g., WHERE id BETWEEN 1 AND 10000), pauses, captures binlog events that arrived during that chunk window, merges them with the chunk scan results, and advances to the next chunk. This ensures snapshot consistency with the live binlog stream without blocking writes. Netflix uses DBLog to bootstrap new search index shards from their user_profile table (100M+ rows) without any downtime or query performance impact on the primary.

Stripe Outbox Pattern — exactly-once webhook delivery

Stripe's payment service uses the Outbox Pattern to guarantee that every charge.succeeded event triggers exactly one webhook. When a payment transitions state, the service atomically writes to the payments table AND inserts a row into an outbox table within the same ACID transaction. A Debezium connector tails the outbox table binlog and publishes events to SQS. The webhook dispatcher deduplicates using the event's idempotency_key stored in a separate Redis set, ensuring that even if SQS delivers the event twice, only one HTTP request reaches the merchant endpoint.

Anti-Patterns

Using Database Triggers for CDC

Implementing AFTER INSERT OR UPDATE triggers on high-write tables degrades transaction throughput, increases lock contention, and bloats the database.

Directly Exposing Raw DB Schemas Downstream

Streaming raw database column names and types directly to downstream consumers tightly couples them to the database schema, causing downstream failures during migrations.

Ignoring Delete Events

Failing to handle DELETE mutations in the CDC stream results in orphaned data and memory leaks in downstream caches and search indexes.

Unmonitored WAL Growth

If the CDC consumer stops reading or falls behind, the database must retain WAL files on disk, which can quickly exhaust disk space and crash the primary database.

Design Tradeoffs

DimensionLog-Based CDCQuery-Based Polling
DB query overheadZero performance impact on query execution; reads directly from the binary log, never touching the query engineHigh overhead; constant polling of indexed timestamp or status columns competes with application queries for I/O
Capture completenessCaptures all intermediate states, including hard deletes and rapid sequential updates between poll intervalsMisses intermediate updates and hard deletes entirely if they occur between polling windows
Operational complexityHigh; requires deep database permissions, connector deployment, schema registry integration, and WAL disk monitoringExtremely simple; standard SQL queries and a scheduler — no specialized tools or database permissions required

Best Practices

Use the Outbox PatternIf you need transactionally guaranteed application events, write them to an outbox table within the same database transaction, and have CDC tail that table specifically.
Integrate a Schema RegistryUse Avro or Protobuf with a schema registry to enforce backward-compatible schema changes, shielding downstream consumers from database migrations.
Monitor Replication LagSet up strict alerting on replication lag (the time difference between the database write and the CDC event publication) to detect stalled pipelines.
Configure WAL Retention LimitsSet maximum WAL size limits on the database to prevent disk exhaustion if the CDC connector goes offline permanently.
Implement Idempotent Downstream ConsumersEnsure downstream consumers use unique primary keys from the source database to perform upserts, preventing duplicate event processing issues.

When to Use / Avoid

Use WhenAvoid When
Synchronizing relational databases with search indexes (Elasticsearch) or caches (Redis) in real time.The source database does not expose a transaction log or binary log API.
Building real-time data pipelines for streaming transactional data into a data warehouse (Snowflake).You only need to run daily or weekly batch ETL jobs where latency is not a concern.
Implementing event-driven microservices based on database state changes without dual-writes.The application requires immediate, synchronous read-your-writes consistency across all systems.