In the ever-evolving landscape of data integration, Change Data Capture (CDC) is redefining how businesses can synchronize their increasing data volumes. According to a research report, the global data integration market size is projected to grow from USD 11.6 billion in 2021 to USD 19.6 billion by 2026. The main driving factor for this growth is the need for real-time data analysis and insights.

CDC ETL enables the identification and extraction of the modified data and then subsequently moves those changes to a downstream process. CDC is very critical to maintaining data consistency, efficiency, and decision-making accuracy during the ETL process. This blog is your comprehensive guide to change data capture, including how to implement it in ETL.

Let’s get started!

What is Change Data Capture?

Change Data Capture (CDC) is a collection of software design patterns used to detect any data change in the database. It triggers the event associated with data so that a particular action will be taken for any Change Data Capture. Companies need access to real-time data streams for data analytics. It excludes the process of bulk data loading by implementing incremental loading of data in nearly real-time. It allows data warehouse or databases to stay active for some action to perform as soon as any Change Data Capture occurs.

CDC is a data integration approach that allows high-velocity data to achieve reliable, low latency, and scalable data replication using fewer computation resources. With the help of CDC, companies deliver new data changes to BI (Business Intelligence) tools and team members in real-time, keeping them up-to-date.

Looking for a platform that can perform CDC? Solve your data replication problems with Hevo’s reliable, no-code, automated pipelines with 150+ connectors.
Get your free trial right away!

Why Use Change Data Capture?

You can leverage CDC to stream data from your primary database to your data warehouse to solve the following problems:

  • Since the data is sent continuously in CDC and in much smaller batches, you don’t have to provision as much network to make it work, plus, you can save money on network costs.
  • CDC does not need you to perform high load queries on a periodic basis. Therefore, you don’t have to worry about spiky behaviors in the load. Although changefeeds aren’t free, they are much cheaper and are spread evenly throughout the day.
  • Since you are continuously streaming data from your database to your data warehouse, the data in your house needs to be up-to-date, allowing you to generate real-time insights, giving you a considerable edge over your competitors since you are making business decisions on fresher data.

Apart from this, you can also leverage CDC for the following supplementary applications:

  • Anomaly detection and streaming analytics.
  • Streaming updates to your online data stores for your production Machine Learning models.
  • Streaming updates of your search indexes.

How Does Change Data Capture Work?

When you update a source database- usually a relational database such as PostgreSQL, MySQL, Oracle, or Microsoft SQL Server- you might have to update various related resources such as a search index and a cache. With CDC, you can easily avoid issues like dual writes to, as opposed to updating resources concurrently and precisely.

It can execute this by tracking row-level changes in database-source tables categorized as update, insert, and delete events- and then making those change notifications available to any other services or systems that depend on the same data. The change notifications are sent in the same order as they were generated in the original database.

This way, CDC ensures that all the interested parties of a given dataset are precisely informed of the change and can react accordingly, by either refreshing their own version of the data or by triggering business processes.

There are two flavors of CDC: push and pull.

  • Push: The source database may push the updates to downstream services and applications.
  • Pull: The downstream services and applications poll the source database at fixed intervals to pull the updated data.

Steps to Perform Change Data Capture

Change Data Capture (CDC) can be implemented using the following 3 steps:

Step 1: Extract the Data

Extraction Process for any Change Data Capture
Image Source

Raw data is extracted from an array of sources and sometimes placed in a data lake. This data could be formatted in:

  • JSON – Social media (Facebook, etc.)
  • XML – Third-party sources
  • RDBMS – CRM

Step 2: Transform the Data

The transformation stage is where you apply any business rules and regulations to achieve.

Step 3: Load the Data

Load this extracted transformed data into a new home by executing a task (job) from a CLI or GUI interface.

Change Data Capture vs Change Tracking

Change Tracking is the lesser-known counterpart of Change Data Capture. It’s a feature that allows the net changes made to data to be returned from a query easily. Change Tracking will let you know that a given row has changed since your last query. But, it doesn’t have an idea about:

  • The different value changes that have been made to the row, or
  • The number of times it was changed.

Change Data Capture, as the name suggests, tracks when the complete data has changed. You can capture an entire table or subset of columns through Change Data Capture.

TL;DR- Change Tracking is a subset of Change Data Capture.

Change Data Capture in ETL

Change Data Capture Logo
Image Source

In the era of Big Data, data has become more important for Business Intelligence and Enterprise Data Analytics. Data plays an important role in nearly every business operation. For your data to be valuable, you need a way to gather data from an unlimited number of sources, organize it together, and centralize it to a single repository. This means you need ETL or Data Integration processes. To understand the difference between the two, take a look at this resource.

Traditionally, data warehouses do not contain up-to-date data. Up-to-date data often resides in operational systems and are then loaded into the data warehouse in a set frequency. 

To learn the best CDC tools for 2023, read this blog post.

Change Data Capture in Distributed System

Even though CDC captures the changes in a database, it still needs a messaging service to send those change notifications to the applicable systems. One way to accomplish this is by considering the changes as individual events and sending them asynchronously. The technical term for such an architecture would be event-driven architecture (EDA).

Apache Kafka would be the ideal way to provide this communication between the database and consumers of the data that need a replayable, high-volume consumption pattern. Kafka has been designed to handle data streams from many sources and deliver the data to many destinations, with high scalability and throughput.

Change Data Capture would ensure that the events transmitted by Kafka are aligned with the changes in the original database or source system. Events are decoupled from the consumers in Kafka (asynchronous communication), allowing for a more robust delivery of changes.

Business Benefits of Change Data Capture

CDC Generates More Savings

90% of the world’s data was generated in just the past 2 years. The infrastructure of the internet, built in a few cases, a few decades ago, does not possess the bandwidth to move massive volumes of data instantaneously. This can pose a serious problem for businesses that wish to undertake projects with high data volumes, for instance, database migrations. These data transfers can severely congest network traffic. This might lead to cloud migrations that are costlier and slower in nature.

Download the Whitepaper on Automating Data Integration
Download the Whitepaper on Automating Data Integration
Download the Whitepaper on Automating Data Integration
Learn the key benefits of automating Data Integration

As opposed to this approach, CDC loads the data incrementally instead of all at once. Every time a data point gets modified in the source system, it gets updated on the target, needing minimal bandwidth. So, with CDC, businesses never have to worry about large data transfers that impact the network bandwidths.

CDC Generates More Revenue

With Change Data Capture, teams get provided with updated data that allow them to make decisions and perform actions with the best data available. CDC plays an important role in improving the accuracy and speed of the data so that the data is both accurate and faster. With CDC, you can act on opportunities sooner. It allows you to beat your competitors to deals while tackling a higher volume of opportunities.

CDC Protects Business Assets

Change Data Capture plays a pivotal role in protecting data, from destruction and deletion. By tracking changes, to metadata and data, CDC offers companies that face data loss a chance to rejuvenate affected datasets. Once data is gone, it can’t be regenerated. However, by protecting the data, Change Data Capture ensures that companies can recover their integral data to propel business growth.

CDC Gets Rid of Opportunity Costs

Change Data Capture, specifically, the log-based type, never burdens a production data’s CPU. Log-based CDC is modified directly from the database logs and does not add any additional SQL loads to the system. Apart from this, incremental loading ensures that data transfers have minimal impact on performance. In business terms, CDC eliminates the opportunity costs that pop up when a business is forced to slow down vital tech infrastructure.

Change Data Capture Techniques

At a high level, there are several techniques and technologies for handling the Change Data Capture processes (CDC process).

The top 4 CDC implementation techniques are: 

1) Timestamp Based Technique

This technique depends on a timestamp field in the source to identify and extract the changed data sets.

Timestamp Based Technique

Image source

This example shows a snapshot of the table when a new record is created and a snapshot when the record with ID=101 is updated.

2) Triggers Based Technique

This technique requires the creation of database triggers to identify the changes that have occurred in the source system and then capture those changes into the target database.

The trigger functions are stored procedures that are automatically executed when a specific event, such as INSERT, UPDATE, or DELETE, occurs on a table. To capture any data changes, one trigger for each operation per table is needed. The data changes are stored in a separate table referred to as the event table or shadow table.

The implementation of this technique is specific to the database on which the triggers need to be created.

Log Based Technique

Image source

3) Snapshot Based Technique

This technique involves creating a complete extract of data from the source table in the target staging area.

Therefore, the next time the incremental data needs to be loaded, a second version or snapshot of the source table is compared to the original one for spotting the changes.

4) Log Based Technique

Almost all Database Management Systems have a transaction log file that records all changes and modifications in the database made by each transaction. 

In general, every DML operation such as CREATE, UPDATE, DELETE is captured in a log file in the database, along with the time-stamp or a database-specific unique identifier indicating when each of these operations was incurred.

This log-based technique depends on this log information to spot the changes and perform CDC operations.

When it comes to choosing the right CDC method, there is no clear-cut answer. It all boils down to the requirements for capturing changes and what the data in the target system will be used for.

If the use cases for the target system are based on the data being up-to-date at all times, then you should definitely implement a push-based CDC solution. If a push-based solution isn’t possible, you can always look to implement pull-based solutions. However, pull-based solutions are dependent on a large number of factors. First, if you can amend the source schema before adding update timestamps or row versions should be pretty simple by generating some database triggers. The overhead of managing an update timestamp system is far less than a row-versioning system, so using update timestamps is the way to go.

On the other hand, if modifying the source system isn’t possible then the only options at your disposal are: change scanning or leveraging any in-built change log capabilities of the source database.

Log based technique

Image source

When it comes to choosing the right CDC method, there is no clear-cut answer. It all boils down to the requirements for capturing changes and what the data in the target system will be used for.

Change Data Capture Real-World Use Cases

Following are the major use cases for implementing CDC in ETL:

1) Transactional Analysis

  • Fraud detection
    You want to analyze transactions in some sort of batch manner to see if credit cards are being used from multiple locations at the same time.
  • Kafka pipeline
    You want some sort of analysis done on a transaction level and not an aggregated level.

2) Data Duplication

  • Database mirroring
    Database mirroring is a strategy used in High Availability (HA) and Disaster Recovery (DR) database deployments. It involves two or three SQL Server instances where one acts as a primary instance (principal), the other as a mirrored instance (mirror), while the third instance acts as the witness.
  • Database replication
    Database replication is the process of copying data from a database in one server to a database in another server so that all users share the same level of information without any inconsistency. It can be a one-time operation or an ongoing process.

At this point, you might be wondering which is the better option, hand-coding the CDC infrastructure for ETL or investing in a tool that can handle this out of the box?

Hand coding comes with many challenges:

  • Managing, supporting, and reusing code is complex.
  • Having many coders onboard results in high maintenance costs.
  • Custom-code developers are scarce.

Therefore, the opportunities become endless when you empower yourself and your team with the right data pipeline platform.

Apart from these, here are a few more real-world use cases of Change Data Capture:

  • Microservices Integration: You can use CDC to sync microservices with traditional applications. This allows smooth transfer of data changes from legacy systems to microservices-based applications.
  • Compliance and Auditing: You need to save a history of data changes to follow the strict data compliance requirements. You can also use CDC to save data changes for archiving or auditing requirements.
  • Analytics Dashboards: You can use CDC to feed data changes to analytics dashboards to support effective decision-making for business intelligence.
  • Cache Invalidation: Capture Data Change can also be used to invalidate caches. This ensures that the outdated entries in a cache are removed or replaced to make way for the latest versions.
  • Full-text Search: You can use CDC Change Data Capture to automatically keep a full-text search index aligned with the database.
  • CQRS Model Updates: Change Data Capture can be used to keep Command Query Responsibility Separation (CQRS) read models aligned with the primary models.

Change Data Capture Tools

Here are a few popular open-source Change Data Capture tools that you can consider for your workforce:

  • Maxwell: Maxwell reads the MySQL binlogs and writes row updates as JSON to Kafka and Kinesis, to name a few platforms. Since Maxwell has low operational overhead, needing nothing other than MySQL and a space to write, it should be on the top of your ‘Change Data Capture Tools to Keep a Lookout For’.
  • Debezium: Debezium is an open-source CDC platform that’s built on top of Apache Kafka. It has connectors to pull a change stream from the databases like MySQL, PostgreSQL, and MongoDB, and send that to Kafka. Kafka Connect is used for change propagation and detection here. You can either use Debezium as a standalone server or you can embed it into your application code as a library.

Change Data Capture Best Practices

Here are a few best practices for Change Data Capture that ensures peak efficiency:

  • Make sure that it supports asynchronous pub/sub style change propagation to consumers.
  • The order of changes for change data capture needs to be preserved. This allows the data to be propagated to the target system in the same format.
  • Make sure that it supports light-weight message transformations as the event payload needs to match with the input format of target systems.

Change Data Capture FAQs

Now that we’ve highlighted all the major aspects of Change Data Capture, let’s round off the discussion with a bunch of frequently asked questions on the topic:

What is the Difference between CDC and SCD?

Change Data Capture (CDC) identifies and tackles only the data that has changed. It then makes this data available for further use. A Slowly Changing Dimension (SCD), on the other hand, is a dimension that manages and stores both historical and present data over time in a data warehouse. For instance, a customer dimension may hold attributes such as address, name, and phone number. With time, a customer’s details might change. They may change their phone numbers, move addresses, or change names (not that common, but it’s still there).

What is CDC Pattern?

In databases, change data capture is defined as a set of software design patterns. These patterns can then be used to find the data that has changed. Once the data gets tracked, an appropriate action can be taken on the changed data.

What is CDC Latency?

CDC Replication measures latency as the amount of time that passes between when data changes on a source table and when it changes on the target table.

Conclusion

This blog introduced you to Change Data Capture (CDC) and explained the steps to implement it. Moreover, the blog discussed the relationship between the CDC and ETL process and also listed the various use cases of both of them. Also, it mentioned the limitations that you will face if you will perform the CDC process manually using the ETL method. Here’s another blog that can help you understand the benefits of CDC.

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations like warehouses but also transform & enrich your data, & make it analysis-ready.

Visit our Website to Explore Hevo

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

What are your thoughts on change data capture in ETL? Let us know in the comments.

References

  1. Report on the global integration market
mm
Software Engineer, Hevo Data

With around a decade of experience, Sarad has designed and developed fundamental components of Hevo. His expertise lies in building lean solutions for various software problems, mentoring fellow engineers and exploring new technologies"

No-code Data Pipeline For Your Data-warehouse