Change Data Capture (CDC) for ETL: 3 Easy Steps

September 15th, 2021

This blog discusses how to implement Change Data Capture (CDC) in ETL. Before diving in, let us briefly look at why such a paradigm is needed.

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a collection of software design patterns used to detect and track data changes in a database. Each detected change raises an event so that a specific action can be taken in response. Companies need access to real-time data streams for Data Analytics, and CDC avoids bulk data loading by loading data incrementally in near real-time. This lets a Data Warehouse or Database act on a change as soon as it occurs.

CDC is a Data Integration approach that provides reliable, low-latency, and scalable replication of high-velocity data while using fewer compute resources than bulk loading. With CDC, companies deliver new data changes to BI (Business Intelligence) tools and team members in near real-time, keeping them up to date.

Introduction to Change Data Capture in ETL

In the Big Data era, data has become central to Business Intelligence and Enterprise Data Analytics, and it plays a role in nearly every business operation. For your data to be valuable, you need a way to gather it from many sources, organize it, and centralize it in a single repository. This means you need ETL or Data Integration processes.

Traditionally, data warehouses do not contain up-to-date data. Up-to-date data usually resides in operational systems and is loaded into the data warehouse at a set frequency.

Hevo Data for Change Data Capture in ETL

Using Hevo for your data pipelines lets you complete integration jobs much faster than hand-coding, and at a fraction of the cost. Hevo supports CDC out of the box and can bring data into your target data warehouse in real-time.

It is easy to set up and can be integrated with your data stack instantly. Hevo offers 100+ built-in connectors paired with enterprise-grade security and support. 

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management. It automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: With its simple and interactive UI, Hevo is easy for new customers to learn and operate.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grow, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo transfers only the data that has been modified, in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Hevo will allow your organization to scale and improve the speed with which you can ingest data in your data warehouse. 

Methods to Load Data from Source to Target Tables

The data can be loaded from source to target Tables using the following 2 methods:

Method 1: Database Dump

Taking a Database dump is an easy solution that might come to mind: export the Database and import it into your new Data Mart, Data Lake, or Data Warehouse. This works fine while the data size is small. However, this approach doesn’t scale.

Method 2: Change Data Capture (CDC)

You will get to a point where doing a SQL dump is not a viable solution to meet your data needs. That is where CDC comes in. As the name suggests, Change Data Capture will only capture the change in the data.

CDC, or Change Data Capture, is a mechanism for Data Integration. It efficiently reads the changes made to a source Database and applies them to a target Database. It records the modifications made to one or more Tables in a Database, namely insert, update, and delete events. Apart from an initial full copy of the selected Tables, only these changes need to move from the source Database into the target Database.

Types of Change Data Capture (CDC)

In a broader sense, CDC can be classified into 2 categories:

  • Query Based: In this CDC technique, SQL statements must be executed at the source in one way or another, so implementing CDC this way has a performance impact on the source from which the data is extracted. In practice, this can mean an I/O-heavy scan of an entire Table containing a large volume of records.
  • Log Based: This is a less intrusive approach that does not execute SQL statements at the source. Instead, it reads the log files of the source Database to identify the data that has been created, modified, or deleted, and applies those changes to the target Data Warehouse.

Implementation Techniques for Change Data Capture (CDC)

At a high level, there are several techniques and technologies for handling the Change Data Capture (CDC) process.

The top 4 change data capture implementation techniques are: 

1) Timestamp Based Technique

This technique depends on a timestamp field in the source to identify and extract the changed data sets.
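
As an illustration, here is a minimal sketch of timestamp-based extraction that uses Python's built-in sqlite3 module as a stand-in source. The customers table, its updated_at column, and the extract_changes_since helper are assumptions made for this example and do not come from any specific tool.

```python
import sqlite3


def extract_changes_since(conn, last_sync):
    """Return rows whose updated_at is later than the last sync watermark."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    )
    return cur.fetchall()


# Hypothetical source table with a last-modified timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2021-09-01T10:00:00"),
        (2, "Grace", "2021-09-10T08:30:00"),
        (3, "Linus", "2021-09-14T17:45:00"),
    ],
)

last_sync = "2021-09-05T00:00:00"  # watermark saved by the previous run
changed = extract_changes_since(conn, last_sync)

# Advance the watermark to the newest timestamp seen in this batch.
new_watermark = changed[-1][2] if changed else last_sync
```

Rows that were deleted at the source simply stop appearing in the query result, so this technique cannot detect hard deletes on its own.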

2) Triggers Based Technique

This technique requires the creation of database triggers to identify the changes that have occurred in the source system and then capture those changes into the target database.

The implementation of this technique is specific to the database on which the triggers need to be created.
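
A minimal sketch of the idea, using SQLite triggers through Python's sqlite3 module. The orders table, the orders_changes table, and the trigger names are hypothetical, and trigger syntax differs on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

    -- Change table that the triggers write to (hypothetical layout).
    CREATE TABLE orders_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT,
        order_id INTEGER,
        status TEXT
    );

    CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('I', NEW.id, NEW.status);
    END;

    CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('U', NEW.id, NEW.status);
    END;

    CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
        INSERT INTO orders_changes (op, order_id, status)
        VALUES ('D', OLD.id, OLD.status);
    END;
    """
)

# Ordinary DML on the source table; the triggers record each change.
conn.execute("INSERT INTO orders VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

captured = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
```

A downstream job can then read orders_changes in change_id order and apply the rows to the target. The cost is that every write to the source table now performs an extra insert.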

3) Snapshot Based Technique

This technique involves creating a complete extract of the data from the source table in the target staging area.

The next time incremental data needs to be loaded, a new snapshot of the source table is compared with the previous one to spot the changes.
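
The comparison can be sketched as follows, assuming each snapshot fits in memory as a dictionary keyed by primary key. The diff_snapshots helper and the sample rows are made up for illustration.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key and classify the changes."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted


# Hypothetical snapshots of the same table taken on two consecutive runs.
previous = {
    1: ("Ada", "London"),
    2: ("Grace", "New York"),
    3: ("Linus", "Helsinki"),
}
current = {
    1: ("Ada", "London"),
    2: ("Grace", "Arlington"),
    4: ("Margaret", "Boston"),
}

inserted, updated, deleted = diff_snapshots(previous, current)
```

Unlike the timestamp-based technique, this one detects deletes, but it has to read the full source table on every run.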

4) Log Based Technique

Almost all Database Management Systems have a transaction log file that records all changes and modifications in the database made by each transaction. 

In general, every DML operation such as INSERT, UPDATE, or DELETE is captured in a log file in the database, along with a timestamp or a database-specific unique identifier indicating when each operation occurred.

This log-based technique depends on this log information to spot the changes and perform CDC operations.
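
Real transaction logs (for example, the MySQL binary log or the PostgreSQL write-ahead log) are binary and database-specific, so the sketch below uses a simplified in-memory list as a stand-in. The entry layout, the lsn field name, and the apply_log helper are assumptions for illustration only.

```python
def apply_log(target, log_entries, last_applied_lsn):
    """Replay log entries newer than last_applied_lsn onto the target, in order."""
    for entry in sorted(log_entries, key=lambda e: e["lsn"]):
        if entry["lsn"] <= last_applied_lsn:
            continue  # already replicated in an earlier run
        if entry["op"] in ("INSERT", "UPDATE"):
            target[entry["key"]] = entry["row"]
        elif entry["op"] == "DELETE":
            target.pop(entry["key"], None)
        last_applied_lsn = entry["lsn"]
    return last_applied_lsn


# Simplified stand-in for a transaction log: one entry per change,
# ordered by a monotonically increasing sequence number.
log = [
    {"lsn": 101, "op": "INSERT", "key": 1, "row": {"name": "Ada"}},
    {"lsn": 102, "op": "INSERT", "key": 2, "row": {"name": "Grace"}},
    {"lsn": 103, "op": "UPDATE", "key": 1, "row": {"name": "Ada L."}},
    {"lsn": 104, "op": "DELETE", "key": 2, "row": None},
]

target = {1: {"name": "Ada"}}  # state after entry 101 was applied earlier
position = apply_log(target, log, last_applied_lsn=101)
```

Storing the returned position lets the next run resume from where this one stopped, without querying the source tables.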

ETL and Data Warehousing (DW)

ETL stands for Extract, Transform, Load. As the name implies, ETL tools extract data from a source, transform the data in transit, and then load it into the target storage of your choice. The first step in an ETL process is extracting data from the various source systems and storing it in staging tables.

CDC with ETL tools provides a new approach to moving information into a Data Warehouse. CDC delivers change data to a data pipeline tool either in batch or real-time. This approach drastically improves the efficiency of the entire data transfer process. It reduces the associated costs including computing, storage, network bandwidth, and human resources. 

These movements of data can be scheduled on a regular basis or triggered by an event.

Common Use Cases for ETL Tools

ETL tools have various applications, but the following are the most common use cases for them:

  • Rolling up transaction data for business people to work with in Data Warehouses.
  • Migrating application data from old systems to new ones.
  • Integrating data from recent corporate mergers and acquisitions.
  • Integrating data from external suppliers or partners.

Steps to Perform Change Data Capture

Change Data Capture (CDC) can be implemented using the following 3 steps:

Step 1: Extract the Data

Raw data is extracted from an array of sources and sometimes placed in a Data Lake. This data can arrive in forms such as:

  • JSON – Social media (Facebook, etc.)
  • XML – Third-party sources
  • RDBMS – CRM
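
A minimal sketch of extracting from these three kinds of sources into one common record shape, using only the Python standard library. The payloads, the contacts table, and the field names are invented for this example.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET


def from_json(text):
    """Parse a JSON array of objects into common records."""
    return [
        {"source": "json", "id": str(item["id"]), "name": item["name"]}
        for item in json.loads(text)
    ]


def from_xml(text):
    """Parse <partner id="..."><name>...</name></partner> elements."""
    root = ET.fromstring(text)
    return [
        {"source": "xml", "id": el.get("id"), "name": el.findtext("name")}
        for el in root.iter("partner")
    ]


def from_rdbms(conn):
    """Read rows from a relational table into common records."""
    rows = conn.execute("SELECT id, name FROM contacts ORDER BY id")
    return [{"source": "rdbms", "id": str(i), "name": n} for i, n in rows]


# Hypothetical sample inputs standing in for the three source types.
json_text = '[{"id": 1, "name": "Ada"}]'
xml_text = '<partners><partner id="7"><name>Acme</name></partner></partners>'
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO contacts VALUES (3, 'Grace')")

records = from_json(json_text) + from_xml(xml_text) + from_rdbms(conn)
```

Bringing every source into the same record shape at this stage keeps the transformation step independent of where the data came from.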

Step 2: Transform the Data

The transformation stage is where you apply business rules and regulations to the data to achieve:

  • Standardization
  • Deduplication
  • Verification
  • Sorting
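
The four operations above can be sketched in a single pass over a batch. The record fields, the choice of email as the deduplication key, and the very basic validity check are assumptions chosen for illustration. Real business rules will differ.

```python
def transform(records):
    """Standardize, verify, deduplicate, and sort a batch of records."""
    seen = set()
    clean = []
    for rec in records:
        # Standardization: trim whitespace and normalize letter case.
        email = rec.get("email", "").strip().lower()
        name = rec.get("name", "").strip().title()
        # Verification: drop records that fail a basic sanity check.
        if "@" not in email or not name:
            continue
        # Deduplication: keep the first record seen for each email.
        if email in seen:
            continue
        seen.add(email)
        clean.append({"name": name, "email": email})
    # Sorting: order the batch by email for a deterministic load.
    return sorted(clean, key=lambda r: r["email"])


raw = [
    {"name": "  grace hopper ", "email": "Grace@Example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},
    {"name": "GRACE HOPPER", "email": "grace@example.com"},
    {"name": "No Email", "email": "not-an-address"},
]

result = transform(raw)
```

Standardizing before deduplicating matters here: the two Grace Hopper records only become recognizable as duplicates once their emails are normalized.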

Step 3: Load the Data

Load the extracted and transformed data into its new home, such as a Data Warehouse, by executing a task (job) from a CLI or GUI.
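
What such a load job does internally can be sketched as below, with SQLite standing in for the target. The dim_customer table, the (operation, id, name) change format, and the load_changes helper are hypothetical.

```python
import sqlite3


def load_changes(conn, changes):
    """Apply a batch of change records to the target table in one transaction."""
    with conn:  # commits on success, rolls back if an error is raised
        for op, row_id, name in changes:
            if op == "delete":
                conn.execute("DELETE FROM dim_customer WHERE id = ?", (row_id,))
            else:  # "insert" and "update" are both handled as an upsert
                conn.execute(
                    "INSERT OR REPLACE INTO dim_customer (id, name) VALUES (?, ?)",
                    (row_id, name),
                )


# Hypothetical target table that already holds two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)

changes = [
    ("update", 1, "Ada L."),
    ("insert", 3, "Linus"),
    ("delete", 2, None),
]
load_changes(conn, changes)

final = conn.execute("SELECT id, name FROM dim_customer ORDER BY id").fetchall()
```

Treating inserts and updates as one upsert makes the load safe to re-run with the same batch, which helps when a job is retried after a failure.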

Use Cases for Change Data Capture in ETL

Following are the major use cases for implementing Change Data Capture in ETL:

1) Transaction Analysis

  • Fraud detection
    You want to analyze transactions in batches to see whether a credit card is being used from multiple locations at the same time.
  • Kafka pipeline
    You want analysis done at the level of individual transactions rather than on aggregated data.
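
The fraud detection case can be sketched as a check over a batch of captured transactions. The (card, location, timestamp) tuple format, the 10-minute window, and the flag_suspicious_cards helper are assumptions for illustration. A production rule would be considerably more involved.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_suspicious_cards(transactions, window=timedelta(minutes=10)):
    """Return card ids seen in two different locations within `window`."""
    by_card = defaultdict(list)
    for card, location, ts in transactions:
        by_card[card].append((datetime.fromisoformat(ts), location))

    flagged = set()
    for card, events in by_card.items():
        events.sort()  # chronological order
        # Comparing neighbours is enough: if any two events within the
        # window differ in location, some adjacent pair between them does too.
        for (t1, loc1), (t2, loc2) in zip(events, events[1:]):
            if loc1 != loc2 and t2 - t1 <= window:
                flagged.add(card)
    return flagged


# Hypothetical batch of transaction-level change events.
transactions = [
    ("card-1", "Berlin", "2021-09-15T10:00:00"),
    ("card-1", "Tokyo", "2021-09-15T10:04:00"),
    ("card-2", "Paris", "2021-09-15T10:00:00"),
    ("card-2", "Paris", "2021-09-15T10:05:00"),
    ("card-3", "Oslo", "2021-09-15T09:00:00"),
    ("card-3", "Lima", "2021-09-15T18:00:00"),
]

flagged = flag_suspicious_cards(transactions)
```

This kind of check needs every individual transaction, which is why it depends on change-level data rather than aggregates.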

2) Data Duplication

  • Database mirroring
    Database mirroring is a strategy used in High Availability (HA) and Disaster Recovery (DR) database deployments. It involves two or three SQL Server instances: one acts as the primary instance (principal), another as the mirrored instance (mirror), and an optional third acts as the witness.
  • Database replication
    Database replication is the process of copying data from a database in one server to a database in another server so that all users share the same level of information without any inconsistency. It can be a one-time operation or an ongoing process.

At this point, you might be wondering which is the better option: hand-coding the CDC infrastructure for ETL, or investing in a tool that can handle this out of the box?

Hand coding comes with many challenges:

  • Managing, supporting, and reusing code is complex.
  • Having many coders on board results in high maintenance costs.
  • Custom-code developers are scarce.

The right data pipeline platform lets you and your team avoid these challenges.

Conclusion

This blog introduced you to Change Data Capture (CDC) and explained the steps to implement it. It also discussed the relationship between CDC and the ETL process, listed the use cases of both, and outlined the challenges you will face if you hand-code the CDC process.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

What are your thoughts on change data capture in ETL? Let us know in the comments.
