Kafka CDC: A Comprehensive 101 Guide

Change Data Capture, Data Streaming, Kafka • February 2nd, 2022

In Cloud data migrations, where data is constantly changing, time-sensitive Data Replication is a key consideration. Businesses have traditionally relied on batch-based methods to move data once or several times each day. However, batch movement introduces delay and lowers the operational value of the data to the organization.

Change Data Capture (CDC) has proven to be an excellent option for moving data from Relational Databases to Data Warehouses, Data Lakes, and other destinations in near real-time. Although CDC captures Database changes, it still needs a messaging service to deliver those changes to the appropriate systems and applications. The most effective approach is to treat the changes as events and broadcast them asynchronously, as in an Event-Driven Architecture (EDA).

Apache Kafka is a great way to offer asynchronous communication between the Database and the data consumers that need a high-volume & repeatable consumption pattern. Kafka is a real-time distributed streaming platform that allows you to publish, subscribe to, store, and process Event Streams. It’s built to take data streams from a variety of sources and transport them to a variety of destinations, all while maintaining high throughput and scalability. Kafka CDC guarantees that the events sent by Kafka match the changes in the original source system or Database.

This article provides a detailed description of Kafka CDC, why you need it, its benefits, and how you can get started with it. You will also gain a basic understanding of the different types of Kafka CDC and some of the challenges faced while working with it. So, let’s dive into Kafka CDC and see how it can be useful for your use cases.

What is Apache Kafka?

Kafka CDC - Kafka Logo

Apache Kafka is a Distributed Event Streaming system that helps applications manage enormous quantities of data effectively. Its fault-tolerant, scalable design can handle billions of events. Kafka is a distributed Publish-Subscribe Messaging system written in Java and Scala that takes Data Streams from several sources and makes them available for real-time analysis. It can scale out rapidly with little downtime. Kafka’s worldwide appeal has grown as a result of its fault tolerance and its durable, replicated storage. Kafka is used by thousands of companies, including more than 60% of the Fortune 100. Among them are Box, Goldman Sachs, Target, Cisco, Intuit, and others.

Key Features of Apache Kafka

With high throughput, low latency, and fault tolerance, Apache Kafka is one of the most popular Open-Source Stream-processing platforms. Take a peek at some of its key features:

  • Fault-Tolerant & Durable: Kafka protects data from server failure by distributing partitions across brokers and replicating them to other servers. If a broker fails, a replica on another broker can take over as the partition leader.
  • Highly Scalable with Low Latency: Kafka’s partitioned log model distributes data over several servers, allowing it to scale beyond a single server’s capabilities. Because partitions can be written and read in parallel, Kafka delivers low latency and high throughput.
  • Robust Integrations: Kafka supports third-party integrations through Kafka Connect and exposes several APIs (Producer, Consumer, Streams, Connect, and Admin), so you can add new sources, destinations, and processing steps without changing existing applications. Take a look at how you can use Kafka with Amazon Redshift, Cassandra, and Spark.
  • Detailed Analysis: Kafka is a reliable approach for tracking operational data. It allows you to collect data in real-time from many platforms and structure it into consolidated feeds while keeping track using metrics.

Want to explore more about Apache Kafka? You can visit the Kafka website or refer to the Kafka documentation.

What is Change Data Capture?

Kafka CDC - Change Data Capture

Change Data Capture (CDC) is the process of observing changes in a database and making them available in a format that other systems can use, such as a stream of events. For example, as data is written to the Database, you can detect the change events and update a search index.

By reading a source Database’s binary log (binlog) or relying on Trigger Functions to ingest just the data that has changed since the previous ETL operation, CDC reduces the resources required for the ETL process. Change Data Capture also improves the consistency of data-driven systems. Instead of dealing with concerns like dual writes, where an application has to write the same change to two systems and keep them in agreement, CDC lets you write once to the Database and propagate the change to other resources reliably.

CDC does this by detecting row-level changes in Database source tables, which are characterized as Insert, Update, and Delete events. It then notifies any other systems or services that rely on the same data. The change alerts are sent out in the same order that they were made in the Database. As a result, CDC guarantees that all parties interested in a given data set are reliably notified of the change and may respond appropriately, either by refreshing their own version of the data or by activating business processes.
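As a rough illustration of how a downstream system might refresh its own copy of the data, here is a minimal Python sketch that applies Insert, Update, and Delete events in the order they were made. The event layout (`op`, `key`, `row`) and the `apply_change` helper are assumptions made for this example, not the format of any particular CDC tool.

```python
def apply_change(replica, event):
    """Apply one row-level change event to an in-memory copy of a table."""
    op, key = event["op"], event["key"]
    if op == "insert":
        replica[key] = dict(event["row"])
    elif op == "update":
        # Merge only the changed columns into the existing row.
        replica.setdefault(key, {}).update(event["row"])
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


# Hypothetical change events, listed in the order they were committed.
events = [
    {"op": "insert", "key": 1, "row": {"email": "a@example.com", "club_status": "bronze"}},
    {"op": "update", "key": 1, "row": {"club_status": "gold"}},
    {"op": "insert", "key": 2, "row": {"email": "b@example.com", "club_status": "silver"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Because the events are applied in commit order, the replica ends up with customer 1 at `gold` and without customer 2. Replaying the same events in a different order could leave the copy in a different state, which is why ordered delivery matters.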

Importance of Kafka CDC

Change Data Capture refers to a collection of techniques that enable you to discover and record data that has changed in your Database so that you can act on it later. CDC can help you streamline and optimize your data and application infrastructures.

More and more enterprises are turning to Change Data Capture when an Apache Kafka system requires continuous and real-time data intake from corporate Databases. The following are the main reasons why Kafka CDC is superior to other methods:

  • Kafka is a messaging system that allows you to handle events and transmit data to applications in real-time. Kafka CDC transforms Databases into streaming data sources, delivering new transactions to Kafka in real-time rather than batching them and causing delays for Kafka consumers.
  • When done non-intrusively by reading the Database redo or transaction logs, Kafka CDC has the least impact on source systems. Log-based Kafka CDC avoids both performance degradation of your production sources and changes to them.
  • Kafka CDC makes better use of your network bandwidth by continually transferring only the changed data rather than large quantities of data in batches.
  • When you transfer changed data constantly rather than utilizing Database snapshots, you obtain more specific information about what happened in the period between snapshots. Granular data flow helps downstream Analytics systems to produce more accurate and richer insights.
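The last point can be made concrete with a small Python sketch. It compares what two snapshots reveal with what a continuous change stream reveals for the same record. The order data, the snapshot times, and the `diff_snapshots` helper are all made up for illustration.

```python
def diff_snapshots(before, after):
    """Return {key: (old_value, new_value)} for keys that differ between two snapshots."""
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# Two hypothetical snapshots of an orders table, taken hours apart.
snapshot_morning = {"order-1": "created"}
snapshot_evening = {"order-1": "delivered"}

# The same period as seen through a change stream: (key, old_status, new_status).
change_stream = [
    ("order-1", "created", "paid"),
    ("order-1", "paid", "shipped"),
    ("order-1", "shipped", "delivered"),
]

snapshot_view = diff_snapshots(snapshot_morning, snapshot_evening)
stream_states = [new for _key, _old, new in change_stream]

print(snapshot_view)
print(stream_states)
```

The snapshot comparison only shows that the order went from `created` to `delivered`, while the change stream also exposes the intermediate `paid` and `shipped` states.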

Simplify Kafka ETL and Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ Data Sources including Apache Kafka, Kafka Confluent Cloud, and other 40+ Free Sources. You can use Hevo Pipelines to replicate the data from your Apache Kafka Source or Kafka Confluent Cloud to the Destination system. It loads the data onto the desired Data Warehouse/destination and transforms it into an analysis-ready form without having to write a single line of code.

Hevo’s fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. Hevo supports two variations of Kafka as a Source. Both these variants offer the same functionality, with Confluent Cloud being the fully-managed version of Apache Kafka.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your ETL & Data Analysis with Hevo today! 
SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Types of Kafka CDC

Kafka CDC allows you to capture everything that is currently in the Database, as well as any fresh data updates. There are 2 types of Kafka CDC:

1) Query-Based Kafka CDC

Kafka CDC - Query-Based Kafka CDC

Query-based Kafka CDC pulls fresh data from the Database using a Database query. The query includes a predicate that determines what has changed, based on a timestamp column, an incrementing identifier column, or both. The JDBC connector for Kafka Connect provides Query-based Kafka CDC. It is available as a fully managed connector in Confluent Cloud or as a self-managed connector.
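To show how such a predicate works, here is a minimal Python sketch that polls an in-memory SQLite table using a timestamp column together with an incrementing id. It is loosely modelled on the idea behind the JDBC connector and is not its actual implementation. The `customers` table, its columns, and the `poll_changes` function are assumptions made for this example.

```python
import sqlite3


def poll_changes(conn, last_ts, last_id):
    """Fetch rows changed since the previous poll, using timestamp + incrementing id."""
    rows = conn.execute(
        """
        SELECT id, email, updated_at
          FROM customers
         WHERE updated_at > ?
            OR (updated_at = ? AND id > ?)
         ORDER BY updated_at, id
        """,
        (last_ts, last_ts, last_id),
    ).fetchall()
    if rows:
        # Remember the position of the last row seen for the next poll.
        last_id, _email, last_ts = rows[-1]
    return rows, last_ts, last_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", 100), (2, "b@example.com", 100)],
)

# First poll picks up both existing rows.
first_poll, ts, row_id = poll_changes(conn, 0, 0)

# Two updates to the same row and one delete happen before the next poll.
conn.execute("UPDATE customers SET email = 'a2@example.com', updated_at = 110 WHERE id = 1")
conn.execute("UPDATE customers SET email = 'a3@example.com', updated_at = 120 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

second_poll, ts, row_id = poll_changes(conn, ts, row_id)
print(first_poll)
print(second_poll)
```

In this sketch the second poll returns only the latest state of customer 1. The intermediate update and the delete of customer 2 never appear, because a query can only see rows as they exist at polling time.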

2) Log-Based Kafka CDC

Kafka CDC - Log-Based Kafka CDC

Log-based Kafka CDC leverages the transaction log of the Database to extract details of every modification performed. The implementation and specifications of the transaction log will differ with every Database, but all are built on the same concepts. The transaction log records every modification made to the Database.

Kafka CDC - Logs

Insertions, updates, and even deletes are all recorded in the transaction log. As a result, when 2 rows are written to the Database, 2 entries are created in the transaction log. These 2 transaction log entries are decoded, and the actual data from the Database records is published as 2 new Apache Kafka events. One of the many advantages of Log-based Kafka CDC is that it can record not only the current state of the table rows but also the state of the table rows before they were altered.
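The sketch below illustrates the idea in Python: each decoded log entry becomes one event that carries the row state before and after the change. The log entry layout (`op`, `old_row`, `new_row`) and the `to_change_events` helper are hypothetical. Real transaction logs are binary and Database-specific, and the `before`/`after` naming is only loosely inspired by the change event envelope used by tools such as Debezium.

```python
def to_change_events(log_entries):
    """Turn already-decoded transaction-log entries into before/after change events."""
    events = []
    for entry in log_entries:
        events.append(
            {
                "op": entry["op"],
                "before": entry.get("old_row"),  # None for inserts
                "after": entry.get("new_row"),   # None for deletes
            }
        )
    return events


# Hypothetical decoded log entries, one per modification.
transaction_log = [
    {"op": "insert", "new_row": {"id": 1, "club_status": "bronze"}},
    {
        "op": "update",
        "old_row": {"id": 1, "club_status": "bronze"},
        "new_row": {"id": 1, "club_status": "gold"},
    },
    {"op": "delete", "old_row": {"id": 1, "club_status": "gold"}},
]

events = to_change_events(transaction_log)
for event in events:
    print(event)
```

Each log entry produces exactly one event, and the update event keeps the previous `bronze` status alongside the new `gold` one. Deletes show up as events with no `after` state.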

To read the pros and cons of these Kafka CDC methods, refer to Log-Based vs. Query-Based CDC for Apache Kafka.

Steps to Set Up Kafka CDC using the MySQL CDC Source Connector

Kafka CDC (Change Data Capture) is a great approach to integrating streaming Analytics with your existing Database. In this section, you will learn to capture data changes in a MySQL Database using the Kafka MySQL CDC Connector. Follow the steps below to get started with MySQL Kafka CDC:

Step 1: Set Up your MySQL Database

Before building the MySQL Kafka CDC Connector, make sure you have a MySQL Database ready. If not, you can refer to the How to Set Up a Streaming Data Pipeline on Confluent documentation to set up a sample MySQL Database.

After creating the Database and connecting it with Confluent, make sure that the customer data is present using the query given below:

mysql> SELECT first_name, last_name, email, club_status 
         FROM demo.CUSTOMERS 
         LIMIT 5;

Step 2: Create Kafka Topics

Now, after connecting the MySQL Database, let’s create a Kafka Topic for this customer data. Follow the steps below:

  • Go to the Confluent Cloud Cluster, select Topics, and click on Add Topic.
  • Now, give a suitable name to your Topic and set the number of partitions to 6 as shown below.
  • Next, navigate to Customize Settings → Storage → Cleanup Policy. Set it to Compact as shown below.
Kafka CDC - Create New Kafka Topic
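To see why the Compact cleanup policy suits a topic that holds change events, the following Python sketch models what compaction retains: the most recent record for each key. This is a simplified model for illustration only. The `compact` function and the sample records are assumptions, and real Kafka compaction runs in the background on the broker and is not applied instantly.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving offset order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset
    return [
        record
        for offset, record in enumerate(log)
        if latest_offset[record[0]] == offset
    ]


# Hypothetical (key, value) records written to a topic, oldest first.
topic = [
    ("customer-1", {"club_status": "bronze"}),
    ("customer-2", {"club_status": "silver"}),
    ("customer-1", {"club_status": "gold"}),
]

compacted = compact(topic)
print(compacted)
```

After compaction the older `bronze` record for `customer-1` is gone, so the topic still holds the latest known state of every customer without growing with every update.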

Step 3: Set Up the Kafka MySQL CDC Connector

Now it’s time to create the Kafka MySQL CDC Connector. Follow the steps below:

  • Go to Connectors and click on Add Connector.
  • Now, search for MySQL CDC Source Connector and click on it to add as shown below.
Kafka CDC - MySQL CDC Source Connector
  • Next, configure the MySQL Kafka CDC Connector with the help of the table below.
Kafka CDC - Configure MySQL CDC Source Connector Table1
Kafka CDC - Configure MySQL CDC Source Connector Table2

Step 4: Test & Launch the MySQL Kafka CDC Connector

After configuring the MySQL Kafka CDC Connector, follow the steps below to launch it:

  • Click on Next and check that the connection is validated. If the Kafka CDC Connector is successfully connected, the following JSON summary will be displayed:
{
  "name": "MySqlCdcSourceConnector_0",
  "config": {
    "connector.class": "MySqlCdcSource",
    "name": "MySqlCdcSourceConnector_0",
    "kafka.api.key": "****************",
    "kafka.api.secret": "**************************",
    "database.hostname": "kafka-data-pipelines.xxxxx.rds.amazonaws.com",
    "database.port": "3306",
    "database.user": "admin",
    "database.password": "********************",
    "database.server.name": "mysql01",
    "database.ssl.mode": "preferred",
    "table.include.list": "demo.CUSTOMERS",
    "snapshot.mode": "when_needed",
    "output.data.format": "AVRO",
    "after.state.only": "true",
    "tasks.max": "1"
  }
}
  • Next, click on Launch. After a few seconds, the status will be changed to Running as shown below.
Kafka CDC - Running Status of MySQL CDC Connector
  • Now click on Topics and select the topic you created. Then click on Messages → Offset. Set the Offset to 0 and select the 1st option on the list as shown below.
Kafka CDC - Set Offset
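If you are unsure which topic to look in, the Python sketch below derives the likely topic name from the connector summary shown earlier. It assumes the Debezium-style naming convention of server name, database, and table joined by dots, which you should confirm against your connector's documentation. The `expected_topics` and `missing_keys` helpers and the `REQUIRED_KEYS` checklist are hypothetical and exist only for this example.

```python
import json

# Excerpt of the connector summary from Step 4 (credentials and host omitted).
connector_summary = json.loads("""
{
  "name": "MySqlCdcSourceConnector_0",
  "config": {
    "connector.class": "MySqlCdcSource",
    "database.server.name": "mysql01",
    "table.include.list": "demo.CUSTOMERS",
    "snapshot.mode": "when_needed",
    "output.data.format": "AVRO",
    "after.state.only": "true",
    "tasks.max": "1"
  }
}
""")


def expected_topics(config):
    """Assumed convention: <database.server.name>.<database>.<table>."""
    server = config["database.server.name"]
    tables = [t.strip() for t in config["table.include.list"].split(",") if t.strip()]
    return [f"{server}.{table}" for table in tables]


def missing_keys(config, required):
    """Return the required keys that are absent or empty in the config."""
    return [key for key in required if not config.get(key)]


# Hypothetical checklist for this walkthrough, not an official list of settings.
REQUIRED_KEYS = ("database.server.name", "table.include.list", "output.data.format")

config = connector_summary["config"]
topics = expected_topics(config)
problems = missing_keys(config, REQUIRED_KEYS)

print(topics)
print(problems)
```

Under that assumption, the configuration above would write the changes from `demo.CUSTOMERS` to a topic named `mysql01.demo.CUSTOMERS`.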

Now, you will see the messages captured using MySQL Kafka CDC present on the Kafka Topic as shown below:

Kafka CDC - Messages

Hurray!! You have successfully captured changes made in MySQL using the MySQL Kafka CDC Connector.

Benefits of Kafka CDC

Change Data Capture systems like Debezium track changes in the Database by observing the transaction log as updates are committed. A simple poll-based or query-based procedure is an alternative to this method. Kafka CDC, which is based on transaction logs, has various benefits over these alternatives, including:

  • Low Overhead: Log-based CDC combined with Kafka delivers changes in near real-time without repeatedly querying the source tables. This avoids the CPU load that frequent polling places on the Database.
  • Maximize Data Value: Kafka CDC assists businesses in maximizing the value of data by allowing them to use it for different purposes. Kafka CDC lets the company get the most out of its data while maintaining data integrity by offering a mechanism to constantly update the same data in separate silos.
  • Keep the Business Up to Date: Kafka CDC allows many Databases and applications to keep in sync with the newest data, providing the most up-to-date information to business stakeholders.
  • Make Better & Faster Decisions: Kafka CDC enables business users to make more accurate and timely choices using the most up-to-date data. As decision-making data typically loses its value quickly, it’s critical to make it available to all stakeholders as soon as feasible. Kafka CDC lets you have access to precise, near-real-time information when it comes to establishing and retaining a competitive edge.

Challenges Faced with Kafka CDC

There are a few reasons why you might face some challenges initially using a Kafka CDC tool when connecting a database with Kafka. Some of the challenges have been highlighted below:

  • The Kafka CDC connector can be more complicated than the Kafka JDBC connector. Given your objectives, this complexity may be justified, but keep in mind that you’re introducing additional moving pieces to your solution.
  • Because it works with relatively low-level log files, Kafka CDC can initially be more difficult to set up.
  • Kafka CDC can be overkill for quick prototyping. Initial Kafka CDC setup frequently necessitates administrative access to the Database, which might slow down quick prototyping.
  • Many Kafka CDC solutions, especially those that operate with proprietary sources, are commercial offers. Hence, the cost can be a major consideration.

Conclusion

In a nutshell, you gained a deep understanding of Kafka CDC (Change Data Capture). You understood the key features of Apache Kafka and why Kafka CDC is important. You also explored the key types of Kafka CDC and learned the key steps to capture changes in your MySQL Database using the MySQL Kafka CDC Connector. Finally, you discovered various benefits of Kafka CDC and the challenges faced while working with it.

You might have observed that streaming data and performing Kafka CDC from various sources/databases to Apache Kafka or vice versa can be quite challenging and cumbersome. If you are facing these challenges and are looking for some solutions, then check out a simpler automated alternative like Hevo.

Hevo Data is a No-Code Data Pipeline that offers a faster way to move data from 100+ Data Sources including Apache Kafka, Kafka Confluent Cloud, and other 40+ Free Sources, into your Data Warehouse to be visualized in a BI tool. You can use Hevo Pipelines to replicate the data from your Apache Kafka Source or Kafka Confluent Cloud to the Destination system. Hevo is fully automated and hence does not require you to code. 

Want to take Hevo for a spin?

SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Feel free to share your experience with Kafka CDC (Change Data Capture) with us in the comments section below!
