This blog is aimed at discussing how to implement change data capture in ETL. Before we dive in, let us briefly understand the importance and need for such a paradigm.
Change Data Capture in ETL – Introduction
In the big data era, data has become more important for Business Intelligence and Enterprise Data Analytics. Data plays an important role in nearly every business operation. For your data to be valuable, you need a way to gather data from an unlimited number of sources, organize it together, and centralize it to a single repository. Which means you need ETL or Data Integration processes.
Traditionally, data warehouses do not contain up to date data. Up to date data often resides in operational systems and are then loaded into the data warehouse in a set frequency.
How to load data from source to target tables:
1. Database Dump
Taking a Database dump is an easy solution that might come to mind e.g. Export the database and import it to your new data mart/lake/warehouse. This works fine while the data size is small. However, this approach doesn’t scale.
2. Change Data Capture (CDC)
You will get to a point where doing a SQL dump is not a viable solution to meet your data needs. That is where CDC comes in. As the name suggests, change data capture will only capture the change in the data.
CDC or change Data capture is an innovative mechanism for data integration. It is a technology for efficiently reading the changes made to a source database and applying those to a target database. It records the modifications that happen for one or more tables in a database. CDC records write, delete, and update events. It copies a selection of tables in their entirety from a source database into the target database.
Change Data Capture – Classification
In a broader sense, CDC can be classified into two categories:
1. Query Based
In this CDC technique, executing SQL statements in one way or the other is required at the source. Implementing CDC with this technique involves a performance impact on the source from which the data is extracted from. In real-world this involves performing an I/O operation at the database by scanning through an entire table containing a large volume of records.
2. Log Based
The CDC process is a more non-intrusive approach and does not involve the execution of SQL statements at the source.
Instead, this method involves reading log files of the source database to identify the data that is being created, modified, or deleted from the source into the target Data Warehouse.
At a high level, there are several techniques and technologies for handling the change data capture processes (CDC process).
The top 4 change data capture implementation techniques are:
1. Timestamp Based Technique
This technique depends on a timestamp field in the source to identify and extract the changed data sets.
2. Triggers Based Technique
This technique requires the creation of database triggers to identify the changes that have occurred in the source system and then capture those changes into the target database.
The implementation of this technique is specific to the database on which the triggers need to be created.
3. Snapshot Based Technique
This technique involves creating a complete extract of data from the source table in the target staging area.
Therefore, the next time the incremental data needs to be loaded, a second version or snapshot of the source table is compared to the original one for spotting the changes.
4. Log Based Technique
Almost all database management systems have a transaction log file that records all changes and modifications in the database made by each transaction.
In general, every DML operation such as CREATE, UPDATE, DELETE is captured in a log file in the database, along with the time-stamp or a database-specific unique identifier indicating when each of these operations were incurred.
This log-based technique depends on this log information to spot the changes and perform CDC operations.
ETL and Data Warehousing (DW)
In an ETL process, the first step is the extraction of data from various source systems and storing the extracted data in staging tables. ETL stands for Extract Transform Load. Just as the name implies, ETL tools extract data from a source, transform the data while on transit, then it loads the data into target storage of your choice.
CDC with ETL tools provides a new approach to moving information into a data warehouse. CDC delivers change data to a data pipeline tool (example, Hevo Data) either in batch or real-time. This approach drastically improves the efficiency of the entire data transfer process. It reduces the associated costs including compute, storage, network bandwidth, and human resources.
These movements of data can be scheduled on a regular basis or triggered to occur.
ETL tools are commonly used for:
- Rolling up transaction data for business people to work within Data Warehouses.
- Migrating application data from old systems to new ones.
- Integrating data from recent corporate mergers and acquisitions.
- Integrating data from external suppliers or partners.
STEP – 1: Extract
Raw data is extracted from an array of sources and sometimes placed in a Data Lake. This data could be formatted in:
- JSON – Social media (Facebook, etc.)
- XML – Third-party sources
- RDBMS – CRM
STEP – 2: Transformation
The transformation stage is where you apply any business rules and regulations to achieve.
- … and other custom tasks
STEP – 3: Load
Load this extracted transformed data into a new home by executing a task (job) from a CLI or GUI interface.
Change Data Capture in ETL – Use Cases
1. Transaction Analysis
- Fraud detection
You want to analyze transactions in some sort of batch manner to see if credit cards are being used from multiple locations at the same time.
- Kafka pipeline
You want some sort of analysis done on a transaction level and not an aggregated level.
2. Data Duplication
- Database mirroring
Database mirroring is a strategy used in High Availability (HA) and Disaster Recovery (DR) database deployments. It involves two or three SQL Server instances where one acts as a primary instance (principal), the other as a mirrored instance (mirror), while the third instance acts as the witness.
- Database replication
Database replication is the process of copying data from a database in one server to a database in another server so that all users share the same level of information without any inconsistency. It can be a one-time operation or an ongoing process.
At this point, you might be wondering which is the better option, hand-coding the CDC infrastructure for ETL or investing in a tool that can handle this out of the box?
Hand coding comes with many challenges:
- Managing, supporting and reusing code is complex.
- Having many coders onboard results in high maintenance costs.
- Custom-code developers are scarce.
Therefore, the opportunities become endless when you empower yourself and your team with the right data pipeline platform.
Not all platforms are the same. For a robust infrastructure, you need:
- Full cloud compatibility.
- Swift Setup – A platform that can be set up instantly – in a few minutes or hours and not months.
- Scalability – So that the platform can handle the growing data volumes of your business.
- Integrations – Lots of connectors to accommodate the many data sources (systems and applications) your company implements.
- Easy of Use – An ETL tool that is easy to learn and easy to use with a GUI interface.
Hevo Data for Change Data Capture in ETL
Using Hevo for your data pipelines allows you to complete integration jobs a lot faster than hand-coding, that too at a fraction of the cost. Hevo supports CDC out of the box and can bring data into your target data warehouse in real-time.
Hevo will allow your organization to scale and improve the speed with which you are able to ingest data in your data warehouse. Sign up for a 14-day free trial to experience the simple, yet powerful Hevo Data platform.
What are your thoughts on change data capture in ETL? Let us know in the comments.