This article is an introduction to and overview of change-data-capture (CDC) as a software tool. I cover the root problems that change-data-capture addresses, a rough model for how it works, and a series of options for using it to connect data between systems.

The Case For CDC

CDC is a method of monitoring a database for changes, selectively extracting modified records, and making those changes available to other systems. Most CDC implementations attempt to keep the time between a change happening and the delivery of that change to other systems as low as possible. The primary use cases for CDC are replicating data and building event-driven systems that are triggered by changes to data. The most common applications are in extract-transform-load (ETL), database migrations, event sourcing, and maintaining read-only replicas of a dataset.

ETL

Extract-transform-load[1] processes are usually described in the context of data analytics and data warehousing where the goal is to collect data from multiple parts of a system, fit all that data into a common structure that can be linked and queried, and finally to make the data available through a single database or query engine. This type of process is common to support data analytics or business analytics teams who are asked to report on and make predictions based on a global view of data from all relevant systems.

The use case for CDC in ETL workloads is reducing the time it takes for changes in source data to appear on the other side of the ETL. This is sometimes referred to as data "freshness" or "staleness". Classic ETL processes work on intervals and snapshots. For example, an ETL system may periodically copy all production data to some kind of warehouse[2] where reports and analytics are run. The reports, then, operate on a snapshot of the data that represents the state of the world at the time of last export.

Using CDC allows ETL processes to be real time rather than based on snapshots of data. This enables analytic reporting and business automation to also happen in real time rather than on intervals.

Dual Writer

Database migrations, read replica maintenance, caching, and populating event source streams are all variations of the same underlying concept: synchronizing multiple, disparate systems with the same content.

The classic approach to these tasks is to modify an application to write changes to multiple destinations. For example, a system in the middle of a database migration might write changes to two databases but only read from one until a configuration value is changed. Likewise, an application that needs to notify other systems of changes might first write to its own database and then write a notification to a queue or stream.

The process of writing a change to multiple destinations is referred to as the "dual writer" method. The dual writer approach comes with some inherent operational challenges that are difficult to manage at scale[3][4][5].

The core issue with the dual writer approach is that providing transactional guarantees for writes to multiple systems can be either impractical or impossible depending on the exact technologies in use. A two-phase commit[6] hypothetically provides this type of guarantee and there’s an open standard for implementations called X/Open XA[7]. However, using a two-phase commit requires that every write destination supports the protocol, the system can tolerate the added latency of coordination, and the transaction coordinator is highly available. The practical challenges to two-phase commits lead most dual writer implementations to operate without transactional guarantees. The end result is that the non-transactional approach to data replication always provides an opportunity for corruption where at least one of the write destinations is missing updates. The risk of corruption increases with time.

CDC allows for a single database to act as the authoritative source for committed changes to the data. Those committed changes are then propagated out to other systems that need them in a durable and recoverable way. It trades immediate consistency across systems for eventual consistency with a far stronger guarantee of correctness.

CDC Strategies

Change-data-capture isn’t a single implementation or strategy. It’s a descriptive title given to any process that can selectively extract committed changes from a database. For example, a system that periodically scans an entire database and extracts any records that have changed since the last scan could be described as CDC. However, the majority of contemporary implementations fit in two categories: trigger based and log based.

Trigger Based CDC

In a trigger based CDC strategy, the database is configured to run a trigger[8] on each committed change to a table. That trigger then copies the current and previous version of any changed record to an outbox table. The outbox table is then processed like a queue by some external system.

To illustrate, let’s consider a SQL database table that contains user data:

CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) NOT NULL
);

An outbox table for this would have a new and old variant of each field from the original along with some change related metadata:

CREATE TABLE users_cdc (
    change_id INT AUTO_INCREMENT PRIMARY KEY,
    change_time DATETIME DEFAULT CURRENT_TIMESTAMP,
    change_type VARCHAR(6),
    old_id INT,
    old_username VARCHAR(255),
    old_email VARCHAR(255),
    new_id INT,
    new_username VARCHAR(255),
    new_email VARCHAR(255)
);

Next, a set of triggers is installed for the users table to populate the outbox. For example:

CREATE TRIGGER after_insert_users
AFTER INSERT ON users
FOR EACH ROW
INSERT INTO users_cdc (change_type, old_id, old_username, old_email, new_id, new_username, new_email)
VALUES ('INSERT', NULL, NULL, NULL, NEW.id, NEW.username, NEW.email);

CREATE TRIGGER after_update_users
AFTER UPDATE ON users
FOR EACH ROW
INSERT INTO users_cdc (change_type, old_id, old_username, old_email, new_id, new_username, new_email)
VALUES ('UPDATE', OLD.id, OLD.username, OLD.email, NEW.id, NEW.username, NEW.email);

CREATE TRIGGER after_delete_users
AFTER DELETE ON users
FOR EACH ROW
INSERT INTO users_cdc (change_type, old_id, old_username, old_email, new_id, new_username, new_email)
VALUES ('DELETE', OLD.id, OLD.username, OLD.email, NULL, NULL, NULL);

From here, the change capture process is automated by the triggers and activated on any change to the users table. For example, inserting a record results in:

INSERT INTO users (id, username, email)
VALUES (1, 'example', 'user@example.com');

The users table now contains:

| id | username | email            |
|----|----------|------------------|
| 1  | example  | user@example.com |

And the insert trigger has recorded the change in users_cdc:

| change_id | change_time         | change_type | old_id | old_username | old_email | new_id | new_username | new_email        |
|-----------|---------------------|-------------|--------|--------------|-----------|--------|--------------|------------------|
| 1         | 1970-01-01 00:00:00 | INSERT      | NULL   | NULL         | NULL      | 1      | example      | user@example.com |

Modifying an existing record looks similar:

UPDATE users
SET username = 'example2'
WHERE id = 1;

The users table now contains:

| id | username | email            |
|----|----------|------------------|
| 1  | example2 | user@example.com |

And users_cdc now holds both changes:

| change_id | change_time         | change_type | old_id | old_username | old_email        | new_id | new_username | new_email        |
|-----------|---------------------|-------------|--------|--------------|------------------|--------|--------------|------------------|
| 1         | 1970-01-01 00:00:00 | INSERT      | NULL   | NULL         | NULL             | 1      | example      | user@example.com |
| 2         | 1970-01-01 00:00:01 | UPDATE      | 1      | example      | user@example.com | 1      | example2     | user@example.com |

The outbox table is cumulative for all changes that happen in the source table. A separate system then processes the records in the outbox to make them available to other systems. The typical outbox processor will select one or more changes in change_id order, format them into a common structure and encoding (such as JSON), and then send the changes to an event stream. Once the processor has successfully handled a change then it removes it from the outbox table so that the table does not grow without bounds.
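
A minimal outbox processor might look like the following sketch (Python, with SQLite standing in for the source database). The `publish` callback is a hypothetical hook for whatever stream the events are sent to, and the reduced column set is for brevity; a real processor would include the old values and change metadata as well:

```python
import json
import sqlite3


def process_outbox(conn, publish, batch_size=100):
    """Drain the users_cdc outbox in change_id order: format each change
    as JSON, publish it, then delete it so the table stays bounded."""
    rows = conn.execute(
        "SELECT change_id, change_type, new_id, new_username, new_email "
        "FROM users_cdc ORDER BY change_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for change_id, change_type, new_id, new_username, new_email in rows:
        event = json.dumps({
            "operation": change_type,
            "new": {"id": new_id, "username": new_username, "email": new_email},
        })
        publish(event)  # only delete the row after the publish succeeds
        conn.execute("DELETE FROM users_cdc WHERE change_id = ?", (change_id,))
    conn.commit()
    return len(rows)
```

Deleting only after a successful publish gives at-least-once delivery: a crash between publish and delete re-sends the event rather than losing it.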

Notable Downsides

Using triggers for CDC is likely the easiest way to get started if you are building a custom CDC implementation from scratch because it doesn’t require any additional tooling or systems other than the code to process the outbox tables. However, this approach does come with operational challenges.

One challenge is synchronizing the outbox table schema with changes to the source tables. A trigger based approach relies on creating outbox tables that can record all the values from a source table. If the source table’s fields change then either the outbox no longer contains a complete record or the triggers can fail because they target fields that no longer exist in the source. Some process or automation must be put in place to keep source tables, triggers, and outbox tables in sync as a table’s schema changes.

Another challenge is managing the performance impacts of using triggers as the database usage grows. If a database has an infrequent and low rate of change then it may never experience issues with triggers. As write traffic increases, though, so does disk usage and lock contention. Every update to a record can more than triple the original storage requirements because the outbox contains the old version, new version, and change metadata. This storage is in use until the outbox processor can handle the records. Additionally, the outbox processor and the triggers that populate the outbox can contend for locks when a system is under heavy write traffic, depending on what types of lock options are available for the database. Some databases have locks that are more optimized for this type of use case than others.

It’s certainly not impossible to manage a trigger based CDC approach at scale but it does require more advanced operational knowledge of the specific database being used.

Comparison To Outbox Pattern

Trigger based CDC and the transactional outbox pattern[9] are nearly identical in behavior. Both models populate an outbox table and process outbox items in FIFO order. The primary difference between the two is that CDC triggers execute within the database server while the transactional outbox pattern is typically managed as a client-side feature. Common implementations of the outbox pattern work by starting a transaction, executing queries, writing any outbox records, and then finalizing the commit.
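
A client-side transactional outbox write can be sketched as follows (Python with SQLite for illustration; the `users` and `outbox` table layouts are hypothetical). The point is that the business row and the outbox row share one transaction, so both commit or neither does:

```python
import json
import sqlite3


def create_user_with_outbox(conn, user_id, username, email):
    """Write the user row and its outbox record inside one transaction
    so that both are committed together or neither is."""
    with conn:  # sqlite3: commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO users (id, username, email) VALUES (?, ?, ?)",
            (user_id, username, email),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "user", "operation": "CREATE", "id": user_id}),),
        )
```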

The advantage of trigger based CDC over a transactional outbox is that triggers provide a greater guarantee of accuracy and success. Triggers target entire tables and run after all committed changes regardless of where the changes originate. A database could have multiple systems writing to it and triggers will still run for each modification without each client needing to be aware of the process. A transactional outbox, on the other hand, is managed in code and every client must implement the correct outbox population logic. In my experience, it’s far more likely for typical outbox implementations to have bugs that result in missed changes or for a developer to forget to add outbox population for one or more code paths.

Log Based CDC

The log approach to CDC leverages a database’s transaction log[10] or write-ahead-log[11] to get a stream of committed changes to the database. Most implementations of log based CDC appear to a database as a read-replica and often use replication related APIs to interact with the database.

This form of CDC is now the most common and is usually more desirable than the trigger based version. Log based CDC provides all the same information and value as a trigger based solution but without directly impacting normal database operations. Reading the log and using replication APIs still carries a system resource cost but it’s often negligible in comparison to having triggers that execute on every data change.

Generally, log based CDC should be preferred when it is available.

CDC Outputs

The end product of CDC is a stream of structured change records. The exact structure and encoding of those records depends on which implementation, or engine, is being used in addition to the type of database being targeted.

The most common choice across CDC engines is to emit change records as JSON objects. The exact contents of the objects vary across engines but there is a common subset that is included by most. The common values include an indication of the type of change (INSERT, UPDATE, DELETE), a timestamp of when the change occurred in the source database, an optional snapshot of the previous version of the record, the current version of the record, and a container of engine specific metadata. For example, a CDC engine output object might look like this:

{
    "operation": "UPDATE",
    "timestamp": "1970-01-01T00:00:00Z",
    "old": {
        "id": 1,
        "username": "example",
        "email": "user@example.com"
    },
    "new": {
        "id": 1,
        "username": "example2",
        "email": "user@example.com"
    },
    "metadata": {
        // Engine specific values.
    }
}

Certainly, the exact output structure will vary by implementation but the same conceptual values should be present.

These change records are then directed to some destination, or sink. Most engines support a variety of sinks but they can usually be categorized into two groups: databases and streams.

Engines that support sinking changes into another database do so by converting the structure and field types of the source schema into something compatible in the destination before writing. For example, sinking MySQL change events into another MySQL database is equivalent to creating an exact read replica but sinking the same events into MongoDB may result in a collection per table. Engines that support sinking to databases usually provide tools for defining customized schema mappings, or transformations, from source to destination.

The more common type of sink is a stream. This includes open source software such as Kafka[12], Pulsar[13], and Redis Streams[14]. This also includes managed cloud products like AWS Kinesis[15] and GCP Pub/Sub[16]. These types of systems enable any number of readers to receive a copy of the change events as they occur. Streams also support strict ordering of events such that the order in which readers receive changes matches the order in which they are written to the stream which, in turn, matches the order in which they happened within the source database. Order is not required for all use cases but it does uniquely enable certain options.

To summarize, the most common output of a CDC engine is an ordered stream of JSON objects that each contain the details of a single change to the database.

CDC Engines

The common model for CDC engines is to provide a "multi-tool" setup where the engine has a large collection of supported sources and sinks rather than being specialized to a single type of database. The most prominent open source CDC engine (at the time of writing) is Debezium[17] which supports the most commonly used SQL and NoSQL databases[18] and the most commonly used streams[19]. There are also proprietary CDC engines. For example, AWS DMS[20] and GCP Datastream[21] are managed cloud products that support CDC for a variety of sources and sinks.

Prominent exceptions to the multi-tool approach of CDC are related to proprietary databases which are commonly offered by cloud providers. For example, AWS DynamoDB[22] and GCP Bigtable[23] do not expose their infrastructure and handle replication internally. As such, the providers are the only people who can implement CDC for those databases. In most cases, proprietary database providers offer some form of CDC through an add-on or up-sell. For example, DynamoDB has DynamoDB Streams[24] and Bigtable offers a change stream feature[25].

CDC Usage Models

The purpose of CDC is to distribute change events and data to one or more other systems. The two major use cases are creating replicas of the source data and triggering automation on data changes. In both of those cases, some signal or event must be produced to other systems. The details of how a CDC event becomes a signal to other systems has a large effect on the overall system design and usually fits into one of four patterns:

  • Direct Application: Forwarding of unprocessed CDC events.

  • Change Notifier: Minimum information transfer that triggers a "phone home" mechanism.

  • Streaming Deltas: An ordered stream of field level changes.

  • Full State Transfer: Distribution of logical object snapshots.

Direct Application

Stream consumers receive events directly from the CDC engine.

Direct application makes the CDC events from the engine directly available to other systems. This setup doesn’t require any extra components other than the CDC engine and the destination stream.

Generally, I do not recommend this method of CDC usage unless there is a specific value achieved by it. For example, the best use case for direct application is creating an exact read-replica of a source database. However, this use case is usually best served by having the engine target the replica, rather than a stream, as the sink.

The CDC engine applies all changes directly to a read-replica database instead of a stream.

There are also some features built on top of streams that are designed specifically to handle CDC data. For example, ksqlDB[26] is a Kafka based tool designed, at least in part, to consume CDC events directly, arrange them into a replica, and make the content available for query outside the original database.

My primary contention with direct application is that it establishes a database’s schema as an API contract for other systems. The contents of CDC events from an engine are direct representations of the underlying database schema (see CDC Outputs) and any system that handles those events comes to rely on that schema. A common technique in contemporary systems is to encapsulate a database with an API such that the underlying schema can change as needed so long as the externally facing API is preserved. Using direct application expands the contract to include both the API and the schema.

Change Notifier

The CDC engine publishes raw events to a stream and a stream processor converts them into notification events.

The change notifier pattern allows systems that encapsulate their data with an API to use CDC events without exposing their internal database schema. To do this, the application that encapsulates the source database becomes the sole consumer of the raw CDC events. The schema aware application can then safely handle the contents of the CDC events and convert them into minimal change notifications.

A change notification is a structure that contains the minimum amount of information required to determine what has changed but not how it changed. To illustrate, let’s consider the CDC event example from CDC Outputs:

{
    "operation": "UPDATE",
    "timestamp": "1970-01-01T00:00:00Z",
    "old": {
        "id": 1,
        "username": "example",
        "email": "user@example.com"
    },
    "new": {
        "id": 1,
        "username": "example2",
        "email": "user@example.com"
    },
    "metadata": {
        // Engine specific values.
    }
}

This event contains the full change to a record from the source database including the previous version of the record. The old and new objects are structured according to the underlying schema. The source application would drop most of this information and produce the following change notification:

{"type": "user", "id": 1}

This structure contains only enough information to know which user changed but not what changed about the user. The application would then write this notification to a separate stream that consumers read from. Consumers, now, do not need to understand the underlying schema used by the application and, instead, receive a simplified structure that indicates what has changed. A consumer would then make the same API call it normally makes to fetch a user with a specified id value in order to get the full and current user data.
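
The reduction performed by the source application fits in a few lines. A sketch (Python; the event shapes follow the examples above, and the handling of DELETE via the old snapshot is an assumption about the engine's output):

```python
def to_notification(cdc_event, resource_type="user"):
    """Reduce a raw CDC event to a minimal change notification that
    reveals which resource changed without exposing the schema."""
    # DELETE events only carry an identifier in the old snapshot.
    record = cdc_event["old"] if cdc_event["operation"] == "DELETE" else cdc_event["new"]
    return {"type": resource_type, "id": record["id"]}
```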

Trade-Offs

This pattern can provide a manageable entry point into CDC usage for projects with more mature tools and practices dedicated to API contracts than event streaming contracts. It keeps event structures simple and re-uses existing code to fetch resources from other parts of the system. This is also the only pattern in the list that can operate over unordered streams without added complexity because the stream contains no state, only a notification of change that must be resolved using an API each time it is evaluated.

However, this pattern has some inherent scaling challenges. Notably, as the number of consumers of the change notification stream grows then so does the API traffic. This leads to some potential issues:

  • Read traffic scales as a multiple of the number of consumers. Every write triggers a read from every notification consumer.

  • Notifications tend to arrive at all consumers around the same time so the resulting API calls are often concurrent or closely clustered.

  • Bandwidth usage scales the same way, with a duplicate of the same response transmitted for each consumer's API call.

Change notifications can work well enough for small scale systems with a low frequency of change. However, the challenges of making change notifications work in large or growing systems are great enough that I do not recommend using this pattern when introducing CDC.

Streaming Deltas

The CDC engine publishes raw events to a stream and a stream processor converts them into deltas for other systems to consume.

The streaming delta pattern is an extension of change notifier that results in events that indicate both what has changed and exactly how that thing has changed. In this model, the source application remains the sole consumer of raw CDC data and it then produces records of change, or deltas. Consumers read deltas from the stream and apply them to their internal representations of a resource rather than re-loading the entire resource on each change notification.

Shared Data Model

A prerequisite for streaming deltas is a shared data model for the resource under change. Most systems already have a canonical data model for a resource but it may not be formally defined. For example, let’s consider a hypothetical user resource that is managed by the source application. The database containing user data is encapsulated by the source application and all other systems interact with users through the application’s API. When any system requests a user through the API it gets the following structure:

{
    "id": 1,
    "username": "example",
    "email": "user@example.com"
}

If this is the common structure all systems expect when requesting a user resource then it is the de-facto data model. I recommend formalizing any data model using tools like OpenAPI[27], JSON Schema[28], and Protobuf[29] which provide a programming language agnostic description of a resource. For illustration, here’s what a JSON Schema model for the above user might look like:

{
    "type": "object",
    "properties": {
        "id": {
            "type": "number",
        },
        "username": {
            "type": "string"
        },
        "email": {
            "type": "string",
        }
    },
    "required": [
        "id",
        "username",
        "email"
    ]
}

Using a schema language to formally define the data model for a resource is not strictly required but I do highly recommend it.

Generating Deltas

With an agreed upon model for the resource, the next step is to create a data model for the deltas themselves so that consumers can interpret and apply the changes. Deltas can be structured any way they are needed and there is some prior art with tools like JSON Patch[30]. For example, a delta event using JSON Patch might look like:

{
    "type": "user",
    "operation": "CREATE",
    "id": 1,
    "changes": [
        { "op": "replace", "path": "/username", "value": "example" },
        { "op": "replace", "path": "/email", "value": "user@example.com" }
    ]
}

From here, the source application would generate deltas based on the CDC events it consumes from the stream. To illustrate, the application might receive the following CDC event:

{
    "operation": "UPDATE",
    "timestamp": "1970-01-01T00:00:00Z",
    "old": {
        "id": 1,
        "username": "example",
        "email": "user@example.com"
    },
    "new": {
        "id": 1,
        "username": "example2",
        "email": "user@example.com"
    },
    "metadata": {
        // Engine specific values.
    }
}

This event indicates that the username changed in an existing record so the resulting delta would be:

{
    "type": "user",
    "operation": "UPDATE",
    "id": 1,
    "changes": [
        { "op": "replace", "path": "/username", "value": "example2" }
    ]
}

These deltas are then written to a stream as the final output. Other systems can then consume the stream and apply the deltas to a local representation of the resource to create a replica of the original data.
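
Both halves of the pattern can be sketched briefly (Python; the delta structure follows the examples above, and the flat-resource diffing is an assumption since nested structures would need recursive path handling):

```python
def make_delta(cdc_event, resource_type="user"):
    """Diff the old and new snapshots of a CDC event into JSON
    Patch-style replace operations."""
    old = cdc_event.get("old") or {}
    new = cdc_event.get("new") or {}
    return {
        "type": resource_type,
        "operation": cdc_event["operation"],
        "id": new.get("id", old.get("id")),
        "changes": [
            {"op": "replace", "path": "/" + field, "value": value}
            for field, value in new.items()
            if old.get(field) != value
        ],
    }


def apply_delta(local, delta):
    """Apply a delta, in order, to a consumer's local copy of the resource."""
    for change in delta["changes"]:
        local[change["path"].lstrip("/")] = change["value"]
    return local
```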

Trade-Offs

This pattern of using CDC comes with some significant benefits over the change notifier pattern. Notably, it removes the need for consumers to make API calls to resolve notifications or to be directly connected to the source application in any way. Consumers only interact with the delta stream and are able to materialize a view of the data locally. This resolves most of the drawbacks of the change notification pattern.

The downside of streaming deltas is that it adds operational complexity to error recovery and data bootstrapping operations. For example, each consumer must keep a record of which deltas they have already applied so that they do not double apply when replaying a stream during error recovery or when subject to the standard at-least-once delivery of stream events.

I have yet to see or recommend a streaming delta CDC implementation. In my opinion, the operational complexity of delta stream management outweighs the advantages of the model compared to full state transfer.

Full State Transfer

The CDC engine publishes raw events to a stream and a stream processor converts them into full state representations for other systems to consume.

Full state transfer is conceptually similar to streaming deltas. In this model, the source application remains the sole consumer of raw CDC data and it then produces the current and full representation of the modified resource. Consumers read from the stream and overwrite their internal representation with the contents of the event without needing to contact the source application.

Like streaming deltas, full state transfer relies on having a shared data model across systems as described in Shared Data Model.

Generating State Representations

In full state transfer, each event received by consumers contains the full and complete representation of the resource. This means events either match or include the shared data model. To illustrate, let’s use the following shared model for a user resource:

{
    "id": 1,
    "username": "example",
    "email": "user@example.com"
}

A full state representation of this resource would then look something like:

{
    "type": "user",
    "operation": "CREATE",
    "id": 1,
    "resource": {
        "id": 1,
        "username": "example",
        "email": "user@example.com"
    }
}

The source application is responsible for determining the current state of a resource for any given CDC event and then generating the full state representation. To illustrate, the application might receive the following CDC event:

{
    "operation": "UPDATE",
    "timestamp": "1970-01-01T00:00:00Z",
    "old": {
        "id": 1,
        "username": "example",
        "email": "user@example.com"
    },
    "new": {
        "id": 1,
        "username": "example2",
        "email": "user@example.com"
    },
    "metadata": {
        // Engine specific values.
    }
}

Regardless of which fields actually changed, the application must produce the full state in the resulting event such as:

{
    "type": "user",
    "operation": "UPDATE",
    "id": 1,
    "resource": {
        "id": 1,
        "username": "example2",
        "email": "user@example.com"
    }
}

In this example, the full state of the resource is actually contained within the CDC event so the resulting state representation can be constructed directly from it. This can be the case for denormalized SQL schemas, document database schemas, or any resource small enough to fit into a single record. My experience, though, is that most SQL databases normalize a resource across enough tables that a database query is required to construct a state representation for each CDC event.
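
For the simple case where the CDC event does carry the whole resource, generating the state representation is a direct projection. A sketch (Python; the INSERT-to-CREATE mapping and the null resource on DELETE are assumptions about the event vocabulary, not fixed conventions):

```python
# Map raw CDC operation names onto the event vocabulary used above.
_OPERATIONS = {"INSERT": "CREATE", "UPDATE": "UPDATE", "DELETE": "DELETE"}


def to_full_state_event(cdc_event, resource_type="user"):
    """Build a full state transfer event from a CDC event whose new
    snapshot already contains the complete resource. With a normalized
    schema this would instead query the database for the current state."""
    operation = _OPERATIONS[cdc_event["operation"]]
    if operation == "DELETE":
        # Deletions have no current state; signal removal instead.
        return {"type": resource_type, "operation": "DELETE",
                "id": cdc_event["old"]["id"], "resource": None}
    return {"type": resource_type, "operation": operation,
            "id": cdc_event["new"]["id"],
            "resource": dict(cdc_event["new"])}
```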

Trade-Offs

Full state transfer is a middle ground between the change notifier and streaming deltas patterns. Change notifier consumers would all request the full state representation for each notification where full state transfer puts that same content on the stream so consumers do not need to make further API calls. Streaming deltas requires that consumers handle every individual change in order to reconstruct the correct state while full state transfer only requires that they handle the most recent event for any given resource.

If the full state of a resource can be determined by individual CDC events then I believe this is the optimal model. However, if determining resource state requires additional queries then there are drawbacks to manage.

One drawback related to additional queries is that those queries often represent a new and higher usage of the underlying database. This can be especially true in cases of normalized schema where changes to multiple records may be related to a single resource but each results in an individual CDC event. These queries can significantly increase the read traffic and resource utilization of a database.

Another drawback is that the need to query a database in order to produce the current state makes the resulting stream imperfectly granular. For example, if two changes to a record happen in quick succession then it is possible that timing conditions cause the source application to query for the state of the first change only after both have occurred. The resulting stream would then contain two copies of the full state after the second change but no state representing the time immediately after the first change. This is usually inconsequential but can impact use cases such as security audit logs where perfect granularity is required.

Overall, this is the pattern that I recommend by default because its downsides are minimal compared to the other patterns.

Data Bootstrapping

The bulk of change-data-capture descriptions and overviews naturally focus on handling changes to data. Changes alone, though, don’t necessarily result in consumers getting a full copy of the source data unless every record is modified after starting the CDC engine. The process of exporting all existing records of a system is called initializing or bootstrapping and is something that will likely be needed multiple times throughout the lifetime of a CDC setup. Bootstrapping is generally needed when:

  • Setting up CDC for the first time.

  • Recovering from any fault that caused event processing to fail.

  • Adding a new consumer to the change events.

Most CDC engines include a bootstrap feature. The most common implementation of bootstrapping is for the engine to run the equivalent of a SELECT * from any targeted table or collection. Each record found is then emitted in the same CDC event format as a change and made to look like a creation or insertion. Most engines offer a combined bootstrap and CDC option that first exports all existing data and then begins streaming changes.
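
Conceptually, the bootstrap scan is a table read that emits synthetic INSERT-shaped events. A sketch of that idea (Python with SQLite as a stand-in source; the `publish` callback and event shape follow the earlier examples and are illustrative, not any particular engine's API):

```python
import sqlite3


def bootstrap_users(conn, publish):
    """Emit every existing row as a synthetic INSERT-shaped CDC event so
    consumers receive the full data set before live changes arrive."""
    count = 0
    for id_, username, email in conn.execute(
        "SELECT id, username, email FROM users ORDER BY id"
    ):
        publish({
            "operation": "INSERT",
            "old": None,
            "new": {"id": id_, "username": username, "email": email},
        })
        count += 1
    return count
```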

Bootstrap operations must read all data in a source database which means they can be slow to complete for large data sets. Some engines offer optimizations such as multi-threaded and partitioned scans of the database in order to make bootstrapping faster but this can still be a slow operation depending on the size of the data.

Note also that proprietary databases with their own internal engines, such as some managed cloud databases, may require a custom bootstrap process: most of them that offer a CDC feature only deliver changes, not existing data.

Conclusion

Change-data-capture is an effective tool for a wide array of data related challenges. The ability to stream data changes around a system in near real time allows for more accurate and safer implementations of database migrations, data analytics, cross-database read replicas, event sourcing, client side caching, and more.

There’s usually little need to implement CDC in terms of creating a new CDC engine. The open source and proprietary engines available cover a wide array of sources and sinks. "Implementing CDC" in a project will likely mean building systems or processes on top of the change stream emitted by an existing CDC engine. The choices for how and where to leverage change data have a larger impact than the choice of CDC engine.

For a deeper dive into integrating CDC into a system, I have also written an example system design for a project-wide CDC framework.