Change Data Capture
Eliminate dual-write drift by using log-based CDC to capture database mutations asynchronously.
- Eliminate dual-write drift by using log-based CDC to capture database mutations asynchronously.
- Read directly from the database transaction log to avoid the performance penalties of trigger-based or query-polling approaches.
- Decouple database schemas from downstream consumers by translating raw mutations into structured events using a schema registry.
- Monitor replication lag and WAL disk usage to prevent database outages when downstream consumers stall.
The Problem
When applications attempt to update a primary database and write to a search index or cache in a single transaction block (dual-writes), network partitions or application crashes inevitably cause data drift. If the database write succeeds but the cache update fails, the systems become permanently inconsistent.
Conversely, polling the database with SELECT queries to find updated rows degrades database performance, misses intermediate updates (deletes and rapid updates), and introduces significant latency.
Core System Idea
Change Data Capture (CDC) solves this by treating the database's internal write-ahead log (WAL) or transaction log as a reliable event source. Because every database mutation (INSERT, UPDATE, DELETE) must be written to the log before being applied to the storage engine, a log-based CDC engine can read these mutations asynchronously without impacting query execution performance.
The CDC engine parses the raw binary log, extracts the before-and-after states of the mutated rows, and publishes these changes as structured events to a message broker.
Downstream systems (caches, search indexes, data warehouses) consume these events to update their state. This guarantees eventual consistency without dual-writes, as the transaction log acts as the single source of truth.
System Flow
Log-based Change Data Capture pipeline extracting database mutations and streaming them to downstream search and cache systems.
Real-World Examples Indicative
Shopify runs Debezium connectors against MySQL binlog for their Order management system. binlog_retention_hours is set to 24 hours to guard against connector downtime. Each connector reads binlog via READ_COMMITTED snapshot isolation, capturing every INSERT/UPDATE/DELETE on the orders, line_items, and inventory_levels tables. Events are published to Kafka with schemas registered in Confluent Schema Registry using Avro. During Shopify's 2021 platform migration, the CDC pipeline allowed engineers to run old and new database schemas in parallel for 6 weeks by publishing to two separate consumer groups simultaneously — with zero application code changes.
Netflix built DBLog to solve a specific problem: snapshotting a running database without acquiring a read lock. DBLog reads a window of primary keys (e.g., WHERE id BETWEEN 1 AND 10000), pauses, captures binlog events that arrived during that chunk window, merges them with the chunk scan results, and advances to the next chunk. This ensures snapshot consistency with the live binlog stream without blocking writes. Netflix uses DBLog to bootstrap new search index shards from their user_profile table (100M+ rows) without any downtime or query performance impact on the primary.
Stripe's payment service uses the Outbox Pattern to guarantee that every charge.succeeded event triggers exactly one webhook. When a payment transitions state, the service atomically writes to the payments table AND inserts a row into an outbox table within the same ACID transaction. A Debezium connector tails the outbox table binlog and publishes events to SQS. The webhook dispatcher deduplicates using the event's idempotency_key stored in a separate Redis set, ensuring that even if SQS delivers the event twice, only one HTTP request reaches the merchant endpoint.
Anti-Patterns
Implementing AFTER INSERT OR UPDATE triggers on high-write tables degrades transaction throughput, increases lock contention, and bloats the database.
Streaming raw database column names and types directly to downstream consumers tightly couples them to the database schema, causing downstream failures during migrations.
Failing to handle DELETE mutations in the CDC stream results in orphaned data and memory leaks in downstream caches and search indexes.
If the CDC consumer stops reading or falls behind, the database must retain WAL files on disk, which can quickly exhaust disk space and crash the primary database.
Design Tradeoffs
| Dimension | Log-Based CDC | Query-Based Polling |
|---|---|---|
| DB query overhead | Zero performance impact on query execution; reads directly from the binary log, never touching the query engine | High overhead; constant polling of indexed timestamp or status columns competes with application queries for I/O |
| Capture completeness | Captures all intermediate states, including hard deletes and rapid sequential updates between poll intervals | Misses intermediate updates and hard deletes entirely if they occur between polling windows |
| Operational complexity | High; requires deep database permissions, connector deployment, schema registry integration, and WAL disk monitoring | Extremely simple; standard SQL queries and a scheduler — no specialized tools or database permissions required |
Best Practices
outbox table within the same database transaction, and have CDC tail that table specifically.When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Synchronizing relational databases with search indexes (Elasticsearch) or caches (Redis) in real time. | The source database does not expose a transaction log or binary log API. |
| Building real-time data pipelines for streaming transactional data into a data warehouse (Snowflake). | You only need to run daily or weekly batch ETL jobs where latency is not a concern. |
| Implementing event-driven microservices based on database state changes without dual-writes. | The application requires immediate, synchronous read-your-writes consistency across all systems. |