This blog is aimed at discussing how to implement Change Data Capture in ETL. Before we dive in, let us briefly understand the importance and need for such a paradigm.
Table of Contents
- What is Change Data Capture?
- Change Data Capture in ETL
- How does Change Data Capture work?
- Business Benefits of Change Data Capture
- Why use Change Data Capture?
- Methods to Load Data from Source to Target Tables
- Which CDC Method should I use?
- Types of Change Data Capture (CDC)
- Implementation Techniques for Change Data Capture (CDC)
- Benefits of Log-based Change Data Capture
- ETL and Data Warehousing (DW)
- Common Use Cases for ETL Tools
- Steps to Perform CDC
- Use Cases for Change Data Capture in ETL
What is Change Data Capture?
Change Data Capture (CDC) is a collection of software design patterns used to detect any data change in the database. It triggers the event associated with data so that a particular action will be taken for any Change Data Capture. Companies need access to real-time data streams for Data Analytics. It excludes the process of bulk data loading by implementing incremental loading of data in nearly real-time. It allows Data Warehouse or Databases to stay active for some action to perform as soon any Change Data Capture occurs.
CDC is a Data Integration approach that allows high-velocity data to achieve reliable, low latency, and scalable data replication using fewer computation resources. With the help of CDC, companies deliver new data changes to BI (Business Intelligence) tools and team members in real-time, keeping them up-to-date.
Change Data Capture in ETL
In the Big Data era, data has become more important for Business Intelligence and Enterprise Data Analytics. Data plays an important role in nearly every business operation. For your data to be valuable, you need a way to gather data from an unlimited number of sources, organize it together, and centralize it to a single repository. This means you need ETL or Data Integration processes.
Traditionally, data warehouses do not contain up-to-date data. Up to date data often resides in operational systems and are then loaded into the data warehouse in a set frequency.
To learn more about Change Data Capture, visit here.
Hevo Data for Change Data Capture in ETL
Using Hevo for your data pipelines allows you to complete integration jobs a lot faster than hand-coding, that too at a fraction of the cost. Hevo supports CDC out of the box and can bring data into your target data warehouse in real-time.
It is easy to set up and can be integrated with your data stack instantly. Hevo offers 100+ built-in connectors paired with enterprise-grade security and support.Get Started with Hevo for Free
Check out what makes Hevo amazing:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Hevo will allow your organization to scale and improve the speed with which you can ingest data in your data warehouse.Sign up here for a 14-Day Free Trial!
How does Change Data Capture work?
When you update a source database- usually a relational database such as PostgreSQL, MySQL, Oracle, or Microsoft SQL Server- you might have to update various related resources such as a search index and a cache. With CDC, you can easily avoid issues like dual writes to, as opposed to updating resources concurrently and precisely.
It can execute this by tracking row-level changes in database-source tables categorized as update, insert, and delete events- and then making those change notifications available to any other services or systems that depend on the same data. The change notifications are sent in the same order as they were generated in the original database. This way, CDC ensures that all the interested parties of a given dataset are precisely informed of the change and can react accordingly, by either refreshing their own version of the data or by triggering business processes.
Business Benefits of Change Data Capture
CDC Generates more Savings
90% of the world’s data was generated in just the past 2 years. The infrastructure of the internet, built in a few cases, a few decades ago, does not possess the bandwidth to move massive volumes of data instantaneously. This can pose a serious problem for businesses that wish to undertake projects with high data volumes, for instance, database migrations. These data transfers can severely congest network traffic. This might lead to cloud migrations that are costlier and slower in nature.
As opposed to this approach, CDC loads the data incrementally instead of all at once. Every time a data point gets modified in the source system, it gets updated on the target, needing minimal bandwidth. So, with CDC, businesses never have to worry about large data transfers that impact the network bandwidths.
CDC Generates more Revenue
With Change Data Capture, teams get provided with updated data that allow them to make decisions and perform actions with the best data available. CDC plays an important role in improving the accuracy and speed of the data so that the data is both accurate and faster. With CDC, you can act on opportunities sooner. It allows you to beat your competitors to deals while tackling a higher volume of opportunities.
CDC Protects Business Assets
Change Data Capture plays a pivotal role in protecting data, from destruction and deletion. By tracking changes, to metadata and data, CDC offers companies that face data loss a chance to rejuvenate affected datasets. Once data is gone, it can’t be regenerated. However, by protecting the data, Change Data Capture ensures that companies can recover their integral data to propel business growth.
CDC gets rid of Opportunity Costs
Change Data Capture, specifically, the log-based type, never burdens a production data’s CPU. Log-based CDC is modified directly from the database logs and does not add any additional SQL loads to the system. Apart from this, incremental loading ensures that data transfers have minimal impact on performance. In business terms, CDC eliminates the opportunity costs that pop up when a business is forced to slow down vital tech infrastructure.
Why use Change Data Capture?
You can leverage CDC to stream data from your primary database to your Data Warehouse to solve the following problems:
- Since the data is sent continuously in CDC and in much smaller batches, you don’t have to provision as much network to make it work, plus, you can save money on network costs.
- CDC does not need you to perform high load queries on a periodic basis. Therefore, you don’t have to worry about spiky behaviors in the load. Although changefeeds aren’t free, they are much cheaper and are spread evenly throughout the day.
- Since you are continuously streaming data from your database to your Data Warehouse, the data in your house needs to be up-to-date, allowing you to generate real-time insights, giving you a considerable edge over your competitors since you are making business decisions on fresher data.
Apart from this, you can also leverage CDC for the following supplementary applications:
- Anomaly detection and streaming analytics.
- Streaming updates to your online data stores for your production Machine Learning models.
- Streaming updates of your search indexes.
Methods to Load Data from Source to Target Tables
The data can be loaded from source to target Tables using the following 2 methods:
Method 1: Database Dump
Taking a Database dump is an easy solution that might come to mind e.g. Export the Database and import it to your new Data Mart/Lake/Warehouse. This works fine while the data size is small. However, this approach doesn’t scale.
Method 2: Change Data Capture (CDC)
You will get to a point where doing a SQL dump is not a viable solution to meet your data needs. That is where CDC comes in. As the name suggests, will only capture the change in the data.
CDC or Change Data Capture is an innovative mechanism for Data Integration. It is a technology for efficiently reading the changes made to a source Database and applying those to a target Database. It records the modifications that happen for one or more Tables in a Database. CDC records write, delete, and update events. It copies a selection of tables in their entirety from a source Database into the target database.
Which CDC Method should I use?
When it comes to choosing the right CDC method, there is no clear-cut answer. It all boils down to the requirements for capturing changes and what the data in the target system will be used for.
If the use cases for the target system are based on the data being up-to-date at all times, then you should definitely implement a push-based CDC solution. If a push-based solution isn’t possible, you can always look to implement pull-based solutions. However, pull-based solutions are dependent on a large number of factors. First, if you can amend the source schema before adding update timestamps or row versions should be pretty simple by generating some database triggers. The overhead of managing an update timestamp system is far less than a row-versioning system, so using update timestamps is the way to go.
On the other hand, if modifying the source system isn’t possible then the only options at your disposal are: change scanning or leveraging any in-built change log capabilities of the source database.
Types of Change Data Capture
In a broader sense, CDC can be classified into 2 categories:
- Query Based: In this CDC technique, executing SQL statements in one way or the other is required at the source. Implementing CDC with this technique involves a performance impact on the source from which the data is extracted. In the real world, this involves performing an I/O operation at the Database by scanning through an entire Table containing a large volume of records.
- Log Based: The CDC process is a more non-intrusive approach and does not involve the execution of SQL statements at the source. Instead, this method involves reading log files of the source Database to identify the data that is being created, modified, or deleted from the source into the target Data Warehouse.
Implementation Techniques for Change Data Capture
At a high level, there are several techniques and technologies for handling the Change Data Capture processes (CDC process).
The top 4 CDC implementation techniques are:
1) Timestamp Based Technique
This technique depends on a timestamp field in the source to identify and extract the changed data sets.
2) Triggers Based Technique
This technique requires the creation of database triggers to identify the changes that have occurred in the source system and then capture those changes into the target database.
The implementation of this technique is specific to the database on which the triggers need to be created.
3) Snapshot Based Technique
This technique involves creating a complete extract of data from the source table in the target staging area.
Therefore, the next time the incremental data needs to be loaded, a second version or snapshot of the source table is compared to the original one for spotting the changes.
4) Log Based Technique
Almost all Database Management Systems have a transaction log file that records all changes and modifications in the database made by each transaction.
In general, every DML operation such as CREATE, UPDATE, DELETE is captured in a log file in the database, along with the time-stamp or a database-specific unique identifier indicating when each of these operations was incurred.
This log-based technique depends on this log information to spot the changes and perform CDC operations.
Benefits of Log-based Change Data Capture
The most important benefit of leveraging Log-based Change Data Capture is the asynchronous nature of CDC: here, changes are captured independent of the source application making these changes. Log-based CDC is basically considered the superior approach to CDC that can be applied to all possible scenarios including systems with extremely high transaction volumes.
ETL and Data Warehousing (DW)
In an ETL process, the first step is the extraction of data from various source systems and storing the extracted data in staging tables. ETL stands for Extract Transform Load. Just as the name implies, ETL tools extract data from a source, transform the data while on transit, then load the data into the target storage of your choice.
CDC with ETL tools provides a new approach to moving information into a Data Warehouse. CDC delivers change data to a data pipeline tool either in batch or real-time. This approach drastically improves the efficiency of the entire data transfer process. It reduces the associated costs including computing, storage, network bandwidth, and human resources.
These movements of data can be scheduled on a regular basis or triggered to occur.
Common Use Cases for ETL Tools
ETL tools have various applications but the following are the most common uses cases for them:
- Rolling up transaction data for business people to work within Data Warehouses.
- Migrating application data from old systems to new ones.
- Integrating data from recent corporate mergers and acquisitions.
- Integrating data from external suppliers or partners.
Steps to Perform Change Data Capture
Change Data Capture (CDC) can be implemented using the following 3 steps:
Step 1: Extract the Data
Raw data is extracted from an array of sources and sometimes placed in a Data Lake. This data could be formatted in:
- JSON – Social media (Facebook, etc.)
- XML – Third-party sources
- RDBMS – CRM
Step 2: Transform the Data
The transformation stage is where you apply any business rules and regulations to achieve.
Step 3: Load the Data
Load this extracted transformed data into a new home by executing a task (job) from a CLI or GUI interface.
Use Cases for Change Data Capture in ETL
Following are the major use cases for implementing CDC in ETL:
1) Transaction Analysis
- Fraud detection
You want to analyze transactions in some sort of batch manner to see if credit cards are being used from multiple locations at the same time.
- Kafka pipeline
You want some sort of analysis done on a transaction level and not an aggregated level.
2) Data Duplication
- Database mirroring
Database mirroring is a strategy used in High Availability (HA) and Disaster Recovery (DR) database deployments. It involves two or three SQL Server instances where one acts as a primary instance (principal), the other as a mirrored instance (mirror), while the third instance acts as the witness.
- Database replication
Database replication is the process of copying data from a database in one server to a database in another server so that all users share the same level of information without any inconsistency. It can be a one-time operation or an ongoing process.
At this point, you might be wondering which is the better option, hand-coding the CDC infrastructure for ETL or investing in a tool that can handle this out of the box?
Hand coding comes with many challenges:
- Managing, supporting, and reusing code is complex.
- Having many coders onboard results in high maintenance costs.
- Custom-code developers are scarce.
Therefore, the opportunities become endless when you empower yourself and your team with the right data pipeline platform.
This blog introduced you to Change Data Capture (CDC) and explained the steps to implement it. Moreover, the blog discussed the relationship between the CDC and ETL process and also listed the various use cases of both of them. Also, it mentioned the limitations that you will face if you will perform the CDC process manually using the ETL method.Visit our Website to Explore Hevo
Using Hevo Data for your data pipelines allows you to complete integration jobs a lot faster than hand-coding, that too at a fraction of the cost. Hevo supports CDC out of the box and can bring data into your target Data Warehouse in real-time.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
What are your thoughts on change data capture in ETL? Let us know in the comments.