Debezium is an open-source Event Streaming Platform that tracks real-time changes in Databases. To capture the changes, it uses different connectors for Databases like MySQL, SQL Server, Oracle, and MongoDB and stores the change events in Kafka Topics. Kafka Topics act as categories that organize data change events and stream them to subscribers. Kafka Connect is usually used to stream the messages, but you can also take a Kafka-less approach by embracing other message servers like Amazon Kinesis, Apache Pulsar, Google Pub/Sub, and Redis.
This tutorial will teach you CDC with the Debezium Server through a Kafka-less process using Apache Pulsar.
Table of Contents
- What is Debezium?
- Getting started with Debezium Server
- Message Servers – Kafka Less
- CDC using Debezium Server with Apache Pulsar Server
What is Debezium?
Debezium is a Change Data Capture (CDC) tool that replicates data between Databases in real-time. It consists of different connectors for different Databases like MongoDB, MySQL, Oracle, SQL Server, PostgreSQL, and many more. When Debezium Connectors are connected to Databases, they generate events from real-time changes in Databases and store them in Kafka Topics. These change events are then accessed by applications for further processing. Today, along with Debezium, there are many other CDC tools like Talend, Oracle GoldenGate, Qlik Replicate, StreamSets, etc.
Simplify Kafka Data Analysis with Hevo’s No-code Data Pipeline
Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up data integration from 100+ Data Sources (including 30+ Free Data Sources) such as Kafka and lets you directly load data to a Data Warehouse or the destination of your choice. It automates your data flow in minutes without a single line of code. Its fault-tolerant architecture ensures that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data. Get started with Hevo for free.
Let’s look at some of the salient features of Hevo:
- Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
- Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for hundreds of sources that can help you scale your data infrastructure as required.
- Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
- Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Getting started with Debezium Server
Debezium Server provides a ready-to-use application that can stream events from source Databases to message servers like Amazon Kinesis, Google Pub/Sub, Apache Pulsar, or Redis. However, you can also use Kafka as a message server with Kafka Connect for streaming change data events.
Follow the below-mentioned steps to get started with CDC using Debezium Server, a Kafka-less approach.
- Source Configuration
- Format Configuration
- Transformation Configuration
- Message Servers – Kafka Less
- CDC using Debezium server with Apache Pulsar Server
- Install the Debezium Server and unpack the server distribution.
- Create a directory named debezium-server with the below contents.
- In this layout, the server is started using the run.sh script, dependencies are stored in the lib directory, and the conf directory contains configuration files.
For configuration, Debezium Server uses MicroProfile configuration, which means the server application can be configured from sources like environment variables, system properties, configuration files, etc.
The configuration file in conf/application.properties consists of the below attributes.
- debezium.source: It is the source connector configuration where each instance of the Debezium Server runs exactly one connector.
- debezium.sink: It is for the sink system configuration.
- debezium.format: It is for the output serialization format.
- debezium.transform: It is for the configuration of the message transformation.
The conf/application.properties file combines these attributes into a single configuration.
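As a sketch, a minimal conf/application.properties for streaming MySQL changes to Apache Pulsar might look like the following (hostnames, credentials, server names, and ports are illustrative and must be adapted to your environment):

```properties
# Sink: where change events are delivered (Pulsar in this sketch)
debezium.sink.type=pulsar
debezium.sink.pulsar.client.serviceUrl=pulsar://localhost:6650

# Source: each Debezium Server instance runs exactly one connector
debezium.source.connector.class=io.debezium.connector.mysql.MySqlConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=localhost
debezium.source.database.port=3306
debezium.source.database.user=mysqluser
debezium.source.database.password=mysqlpw
debezium.source.database.server.id=184054
debezium.source.database.server.name=tutorial

# Output serialization format for keys and values (JSON is the default)
debezium.format.key=json
debezium.format.value=json
```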
When the Debezium Server starts, it generates log messages that confirm the loaded connector and sink configuration.
The source configuration uses the configuration properties of the Debezium Connectors with the debezium.source prefix. It also has some additional properties, listed below, that are necessary when running outside Kafka Connect.
- debezium.source.connector.class: It consists of the name of the Java class implementing the source connector.
- debezium.source.offset.storage.file.filename: It consists of the file in which connector offsets are stored for non-Kafka deployments.
- debezium.source.offset.flush.interval.ms: It defines how frequently offsets are flushed into the file.
The message output format is configured separately for the key and the value. The output is in JSON format by default, but you can also use an arbitrary implementation of Kafka Connect’s Converter.
The format configuration consists of the below properties.
- debezium.format.key: It is the name of the output format for the key.
- debezium.format.key.*: These are the configuration properties that are passed to the key converter.
- debezium.format.value: It is the name of the output format for the value.
- debezium.format.value.*: These are the configuration properties that are passed to the value converter.
Messages run through a sequence of transformations before they are delivered to sink connectors. Debezium Server allows the use of Single Message Transformations (SMTs) as defined by Kafka Connect.
The transformation configuration consists of a list of transformations, the implementation class for each transformation, and the configuration. Properties of transformations are as follows.
- debezium.transforms: It is the list of symbolic names of transformations.
- debezium.transforms.<name>.type: It is the name of the Java class that implements the transformation.
- debezium.transforms.<name>.*: It is to pass configuration properties to the transformation with the name.
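For example, a transformation chain applying Debezium’s ExtractNewRecordState SMT could be configured as follows (the symbolic name "unwrap" is arbitrary; this is an illustrative sketch, not taken from this tutorial’s setup):

```properties
# Comma-separated symbolic names of the transformations to apply
debezium.transforms=unwrap
# Java class implementing the "unwrap" transformation
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
# Additional properties passed to the named transformation
debezium.transforms.unwrap.drop.tombstones=false
```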
Message Servers – Kafka Less
Let’s take a look at some of the Kafka-less Message Servers.
Redis is an open-source data structure store used as a message broker, cache, and database. Redis Streams is a data type that abstractly models a log data structure and overcomes the drawbacks of plain log files by implementing operations on the log.
It consists of the below properties.
- debezium.sink.type: It is the sink type, which should be set to redis.
- debezium.sink.redis.address: It is the address of the target Redis instance.
- debezium.sink.redis.user: It is the username used to communicate with Redis.
- debezium.sink.redis.password: It is the password used to communicate with Redis.
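Put together, a minimal Redis sink section might look like this (the address and credentials are examples, not values from this tutorial):

```properties
# Illustrative Redis sink configuration
debezium.sink.type=redis
debezium.sink.redis.address=localhost:6379
debezium.sink.redis.user=debezium
debezium.sink.redis.password=s3cret
```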
Google Pub/Sub stands for Publish and Subscribe. It is a communication platform that allows messaging between applications. Publisher applications send messages to topics, while subscriber applications access those messages. Google Pub/Sub is mainly designed for scalable batch and streaming applications. It consists of REST API and a Java SDK to implement the sink.
It consists of the below properties.
- debezium.sink.type: It is the sink type that should be set to pubsub.
- debezium.sink.pubsub.project.id: It is the project name where the target topics are created.
- debezium.sink.pubsub.ordering.enabled: When enabled, the message key is used as the ordering key to guarantee that messages with the same key are delivered in the order they were sent. Users can also disable this feature.
- debezium.sink.pubsub.null.key: It is the surrogate key used as the message key for records from tables without a primary key.
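As a sketch, a Google Pub/Sub sink section could look like the following (the project id and surrogate key value are hypothetical):

```properties
# Illustrative Google Pub/Sub sink configuration
debezium.sink.type=pubsub
debezium.sink.pubsub.project.id=my-gcp-project
debezium.sink.pubsub.ordering.enabled=true
debezium.sink.pubsub.null.key=default
```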
Amazon Kinesis provides real-time processing of large amounts of data and allows users to ingest data from different sources. It captures, stores, and processes data from large distributed systems. Kinesis distributes this data to multiple consumers after processing it. Kinesis supports stream sharding and other techniques that increase scalability. It consists of a REST API and a Java SDK to implement the sink.
It consists of the below properties.
- debezium.sink.type: It is the sink type that should be set to kinesis.
- debezium.sink.kinesis.region: It is the region name where the target streams are provided.
- debezium.sink.kinesis.endpoint: It is an endpoint URL where the target streams are provided.
- debezium.sink.kinesis.credentials.profile: It provides a profile name to communicate with Kinesis.
- debezium.sink.kinesis.null.key: It is used as a message key from tables without a primary key.
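Combined, a Kinesis sink section might look like this (the region, profile name, and surrogate key value are examples):

```properties
# Illustrative Amazon Kinesis sink configuration
debezium.sink.type=kinesis
debezium.sink.kinesis.region=eu-central-1
debezium.sink.kinesis.credentials.profile=default
debezium.sink.kinesis.null.key=default
```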
Note: The above are just three of the supported message servers; this tutorial uses the Apache Pulsar Server (with the Debezium Server) to explain the process in detail.
CDC using Debezium Server with Apache Pulsar Server
Apache Pulsar consists of a set of built-in connectors based on the Pulsar IO framework, which can run Kafka Connect-based connectors such as Debezium through its Kafka Connect adaptor.
Apache Pulsar is used with the MySQL Connector through the below steps.
- Start a MySQL Server.
Open a new terminal to start a new container that runs the MySQL Database.
docker run --rm --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:0.9
- Start the Pulsar Service.
Start the Pulsar Service in standalone mode. You can download Pulsar 2.3.0; the IO connectors are packaged as separate TAR files.
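As a rough sketch, the download and startup steps might look like the following; the exact archive and connector file names should be checked against the Apache Pulsar downloads page for your version:

```shell
# Download and unpack the Pulsar 2.3.0 binary distribution
wget https://archive.apache.org/dist/pulsar/pulsar-2.3.0/apache-pulsar-2.3.0-bin.tar.gz
tar xzf apache-pulsar-2.3.0-bin.tar.gz
cd apache-pulsar-2.3.0

# The IO connectors ship separately; place the Debezium connector
# archive for your version under connectors/ (file name varies by release)
mkdir -p connectors

# Start Pulsar in standalone mode
bin/pulsar standalone
```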
- Start the Debezium MySQL Connector in Pulsar.
The debezium-mysql-source-config.yaml file contains all the connector configuration; the main parameters are listed under the configs node, including the task.class parameter. The configuration file also contains the MySQL-related parameters like server, username, and password, and the names of two Pulsar topics for history and offsets.
Run the configuration file.
bin/pulsar-admin source localrun --sourceConfigFile debezium-mysql-source-config.yaml
The debezium-mysql-source-config.yaml file consists of the below contents.
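A sketch of such a file, assuming the Debezium inventory example database and illustrative tenant, namespace, and topic names (all values should be adapted to your setup):

```yaml
tenant: "public"
namespace: "default"
name: "debezium-mysql-source"
topicName: "debezium-mysql-topic"
parallelism: 1

configs:
  # Kafka Connect task class run through Pulsar's adaptor
  task.class: "io.debezium.connector.mysql.MySqlConnectorTask"
  # MySQL connection parameters (match the docker run command above)
  database.hostname: "localhost"
  database.port: "3306"
  database.user: "debezium"
  database.password: "dbz"
  database.server.id: "184054"
  database.server.name: "dbserver1"
  database.whitelist: "inventory"
  # Pulsar topics used for database history and connector offsets
  database.history: "org.apache.pulsar.io.debezium.PulsarDatabaseHistory"
  database.history.pulsar.topic: "history-topic"
  database.history.pulsar.service.url: "pulsar://127.0.0.1:6650"
  offset.storage.topic: "offset-topic"
  pulsar.service.url: "pulsar://127.0.0.1:6650"
  # Serialization of keys and values
  key.converter: "org.apache.kafka.connect.json.JsonConverter"
  value.converter: "org.apache.kafka.connect.json.JsonConverter"
```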
Since the tables are created automatically in the MySQL server, the Debezium MySQL Connector can read the history records from the MySQL binlog from the beginning. In the following output, you can see that the connector finished processing in 47 seconds.
The records captured by the MySQL Connector can be automatically published to the Pulsar Topic.
You can start a new terminal to list the current topics in the Pulsar cluster with the following command.
bin/pulsar-admin topics list public/default
Each table that has been changed is stored in a separate Pulsar Topic. Another two topics, for History and Offset, are used to store the database history and the connector offsets, as shown in the output of the topics list command.
- Subscribe to Pulsar Topic to Monitor the MySQL changes.
Subscribe to the persistent://public/default/dbserver1.inventory.products topic and use the below command to consume it and monitor the changes as the ‘products’ table changes.
bin/pulsar-client consume -s "sub-products" public/default/dbserver1.inventory.products -n 0
You can also use the offset topic to monitor the offset changes that occur when the table changes are stored in the persistent://public/default/dbserver1.inventory.products Pulsar topic, using the below command.
bin/pulsar-client consume -s "sub-offset" offset-topic -n 0
- Make the changes in the MySQL Server and verify them in the Pulsar Topic.
Start a MySQL command-line client in a Docker container.
docker run -it --rm --name mysqlterm --link mysql --rm mysql:5.7 sh -c 'exec mysql -h"$MYSQL_PORT_3306_TCP_ADDR" -P"$MYSQL_PORT_3306_TCP_PORT" -uroot -p"$MYSQL_ENV_MYSQL_ROOT_PASSWORD"'
Change the names of the two items in the ‘products’ table.
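As an illustration, assuming the inventory example database (the product IDs and new names below are hypothetical), the updates could look like:

```sql
-- Rename two items in the 'products' table (IDs and names are examples)
UPDATE products SET name = 'scooter-2020' WHERE id = 101;
UPDATE products SET name = 'mini-jacket' WHERE id = 102;
```

Each UPDATE produces one change event on the corresponding Pulsar topic.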
The above changes will be added to the table as follows.
Check the changes in the Pulsar topic.
When you consume the offset topic, you will find the two offsets added by the above changes.
Check in the MySQL connector, and you find two more records have been processed, as in the below image.
- Clean Up.
Close the terminal and use docker ps and docker kill to stop the MySQL-related containers.
Delete the data directory in the Pulsar binary directory to remove the Pulsar data.
In this tutorial, you have learned about the Change Data Capture approach using the Debezium Server. When you use the Debezium Server with message servers like Amazon Kinesis, Apache Pulsar, or Google Pub/Sub, you do not require Kafka; that is why it is called a Kafka-less approach. Among all these message servers, this tutorial explained the Debezium Server with Apache Pulsar in detail.
To get a complete overview of your business performance, it is important to consolidate data from various Data Sources such as Kafka into a Cloud Data Warehouse or a destination of your choice for further Business Analytics. This is where Hevo comes in. Visit our website to explore Hevo.
Hevo Data with its strong integration with 100+ Sources & BI tools, such as Kafka, allows you to not only export data from sources & load data in the destinations, but also transform & enrich your data, & make it analysis-ready so that you can focus only on your key business needs and perform insightful analysis using BI tools.
Share your experience of working with Debezium Server in the comments section below.