What is Debezium: The Ultimate Guide

Manjiri Gaikwad • Last Modified: December 29th, 2022

Organizations can streamline several business processes by keeping track of the changes in their databases. Any modification in a database can be used as a trigger to start or stop other associated services and so automate tedious tasks. Debezium is one such solution that allows data-driven organizations to act quickly on recent changes and provide better services in real time.

Many organizations use Debezium not only to enhance the capabilities of their databases but also to reduce operational costs through automation.

Prerequisites

Understanding of event streams.

What is Debezium?

Debezium is an open-source distributed platform for change data capture, built by Red Hat to monitor changes in databases. Applications that depend on it can respond to data changes as they happen.

Applications use Debezium to react immediately whenever there are changes in databases. Debezium reads each database's transaction log, in which the changes are recorded, and publishes them as change events.

All the dependent applications can consume these change events and start reacting to them. Such changes can be operations like insert, update, and delete events in databases. Debezium ensures that all such modifications are noticed and captured.

Learn more about Debezium.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 40+ free data sources) like Asana, and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without requiring a single line of code.

Its completely automated pipeline delivers data in real time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

What is the need for Debezium?

Debezium follows the concept of Change Data Capture (CDC) to turn existing databases into event streams. CDC is the process of identifying and capturing changes to data in databases, thereby allowing you to handle different real-time tasks effectively.

Debezium is a CDC tool that monitors real-time changes in databases and streams them to other destinations. Today, Debezium connectors are used for event streaming of databases.

Features of Debezium

Debezium uses Apache Kafka Connect source connectors that track changes from different databases using CDC. In general, CDC can be done with one of two methods: log-based and polling. In the log-based method, the changes are read from the database logs, whereas in the polling method, they are found by repeatedly querying the database tables for new or updated rows.

Debezium uses the log-based method. As database logs record the database changes in exact order, Debezium captures all the changes in the data.

In the polling method, however, if a row is changed more than once or deleted between two polls, some events like inserts and updates may not be recorded or tracked. Hence, frequent polling is needed to decrease the chances of missing changes.

Frequent polling takes more CPU time, which leads to performance issues. Consequently, organizations often use a log-based system like Debezium so that applications can react to data changes in near real time.

Getting Started with Debezium

To start with Debezium, you need to install the latest version of Docker, an open-source platform that uses client-server technology to manage and deploy containerized applications. To get the latest version of Docker, refer to Docker’s installation page.

Services

To start the Debezium services, you need to start three separate services: ZooKeeper, Kafka, and the Debezium connector service.

  1. ZooKeeper service

Apache ZooKeeper is an open-source server that enables highly reliable distributed coordination. Clusters in distributed systems use it to share group services like configuration information, naming, etc. Apache ZooKeeper is a centralized service and follows a leader-follower architecture.

  2. Kafka connector service

Kafka connectors are ready-to-use components for scalable and reliable data streaming between Apache Kafka and other data systems. Source connectors import data from external systems into Kafka topics, and sink connectors export data from Kafka topics to external systems; the JDBC source connector is one example. Kafka topics are categories that are used to organize messages.

  3. Debezium connector service

Debezium is a collection of source connectors for Apache Kafka. Each connector captures the changes from a different database management system and records them in the Kafka log, a collection of data segments on the disk.

In this tutorial, you will use the Debezium and the Docker container images to set up the instance of each service.

Start ZooKeeper

To start ZooKeeper in a container, use the following command.

debezium: start zookeeper
Image Source: Self

Note: This command uses version 1.8 of the Debezium ZooKeeper image.

Here:

  • -it: It stands for interactive and attaches the terminal’s input and output to the container.
  • --rm: It removes the container when it is stopped.
  • --name: It sets the name of the container.
  • -p 2181:2181 -p 2888:2888 -p 3888:3888: It lets other containers and applications outside the container communicate with ZooKeeper. It maps three container ports to the same ports on the Docker host.
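
The original command was published as a screenshot. Based on the flags described above, it can be sketched roughly as follows. The image name `debezium/zookeeper:1.8` is an assumption taken from the Debezium tutorial for the version this article states, and `start_zookeeper` is a hypothetical wrapper name, used only so the command is defined once and runs when you call it.

```shell
# Assumption: image name and tag follow the Debezium tutorial for version 1.8.
ZOOKEEPER_IMAGE="debezium/zookeeper:1.8"

# Hypothetical helper: wraps the docker command so it only runs when called.
start_zookeeper() {
  # -it attaches the terminal, --rm removes the container when it stops,
  # and the three -p flags publish ZooKeeper's ports on the Docker host.
  docker run -it --rm --name zookeeper \
    -p 2181:2181 -p 2888:2888 -p 3888:3888 \
    "$ZOOKEEPER_IMAGE"
}
```

Call `start_zookeeper` in its own terminal, because the container stays attached to that terminal while it runs.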

To confirm that ZooKeeper started and is listening on port 2181, look for output similar to the following.

debezium: starting zookeeper 2
Image Source: Self

Start Kafka in the new container

To check the compatibility between Kafka and Debezium versions, you can go through the Debezium Test Matrix website. In this tutorial, Debezium 1.8.0 has been used along with Kafka Connect.

To start Kafka in a container, open a new terminal and run the following command.

debezium: start kafka in new container
Image Source: Self

This command runs version 1.8 of the Debezium Kafka image in a new container.

  • --name kafka: It sets the name of the container.
  • -p 9092:9092: It lets applications outside the container communicate with Kafka. Port 9092 in the container is mapped to the same port on the Docker host.
  • --link zookeeper: It tells the container to find ZooKeeper in the zookeeper container that is running on the same Docker host.

Other containers on the same Docker host can connect to Kafka by linking to the kafka container. To connect to Kafka from outside a Docker container, use the -e option to advertise the Kafka address through the Docker host, i.e., -e ADVERTISED_HOST_NAME= followed by the address of the Docker host.
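
As a rough sketch of the command described above (the screenshot itself is not reproduced here): the image name `debezium/kafka:1.8` is an assumption taken from the Debezium tutorial, and `start_kafka` is a hypothetical wrapper name. Any extra options you pass to the function, such as an `-e ADVERTISED_HOST_NAME=...` setting, are forwarded to `docker run`.

```shell
# Assumption: image name and tag follow the Debezium tutorial for version 1.8.
KAFKA_IMAGE="debezium/kafka:1.8"

# Hypothetical helper: wraps the docker command so it only runs when called.
start_kafka() {
  # -p 9092:9092 publishes the broker port on the Docker host.
  # --link zookeeper:zookeeper lets this container find the running
  # ZooKeeper container under the alias "zookeeper".
  # "$@" forwards optional extra docker options given to the function.
  docker run -it --rm --name kafka \
    -p 9092:9092 \
    --link zookeeper:zookeeper \
    "$@" \
    "$KAFKA_IMAGE"
}
```

Run `start_kafka` in a second terminal once ZooKeeper is up.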

If you see the following output, your Kafka has started successfully.

debezium: kafka new container
Image Source: Self

Start a MySQL Database

Open a new terminal and use it to start a new container using the below command, which runs a MySQL database server with an inventory database.

debezium: start mysql
Image Source: Self

The above command runs a new container from version 1.8 of the Debezium example MySQL image, which is based on the MySQL 8.0 image. It comes with a sample inventory database, which holds the example tables whose changes are captured in this tutorial.

  • --name mysql: It sets the container’s name.
  • -p 3306:3306: It lets applications outside the container communicate with the database server. It maps port 3306 in the container to the same port on the Docker host.
  • -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw: It sets the MySQL root password and creates a MySQL user with its own password.
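
Putting the options above together gives a command along these lines. This is a sketch rather than the exact command from the screenshot: the image name `debezium/example-mysql:1.8` is an assumption taken from the Debezium tutorial, and `start_mysql` is a hypothetical wrapper name.

```shell
# Assumption: example image name and tag follow the Debezium tutorial for 1.8.
MYSQL_IMAGE="debezium/example-mysql:1.8"

# Hypothetical helper: wraps the docker command so it only runs when called.
start_mysql() {
  # The three -e options set the root password and create a regular user,
  # using the values listed in the article.
  docker run -it --rm --name mysql \
    -p 3306:3306 \
    -e MYSQL_ROOT_PASSWORD=debezium \
    -e MYSQL_USER=mysqluser \
    -e MYSQL_PASSWORD=mysqlpw \
    "$MYSQL_IMAGE"
}
```

Run `start_mysql` in a new terminal and wait for the server to report that it is ready for connections.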

If you see the following output, your MySQL server has started successfully.

debezium: start mysql database
Image Source: Self

Start MySQL command-line client

After starting the MySQL server, you need to start the MySQL command-line client to access the inventory database.

Open a new terminal and start a MySQL command-line client in a container.

debezium: start mysql commandline
Image Source: Self
  • --name mysqlterm: It sets the name of the container.
  • --link mysql: It links the container to the mysql container.
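
A sketch of the client command, assuming the form used in the Debezium tutorial: the `mysql:8.0` image and the `MYSQL_PORT_3306_TCP_ADDR`, `MYSQL_PORT_3306_TCP_PORT`, and `MYSQL_ENV_MYSQL_ROOT_PASSWORD` variables (which Docker's legacy `--link` option defines inside the linked container) are assumptions not stated in this article, and `start_mysql_client` is a hypothetical wrapper name.

```shell
# Assumption: the stock MySQL 8.0 image is used for the client.
MYSQL_CLIENT_IMAGE="mysql:8.0"

# Hypothetical helper: wraps the docker command so it only runs when called.
start_mysql_client() {
  # The single quotes keep the $MYSQL_... variables unexpanded on the host;
  # they are resolved inside the container, where --link mysql defines them.
  docker run -it --rm --name mysqlterm --link mysql "$MYSQL_CLIENT_IMAGE" \
    sh -c 'exec mysql -h"$MYSQL_PORT_3306_TCP_ADDR" -P"$MYSQL_PORT_3306_TCP_PORT" -uroot -p"$MYSQL_ENV_MYSQL_ROOT_PASSWORD"'
}
```

Calling `start_mysql_client` should leave you at the `mysql>` prompt, logged in as root.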

If you get the following output, the MySQL command-line client started successfully.

debezium: start mysql commandline client
Image Source: Self

Once you are at the MySQL command prompt, switch to the inventory database.

Use the following command.

mysql> use inventory;

Look at the tables in the database.

mysql> show tables;

It shows:

debezium: show table
Image Source: Self

Use a MySQL query to view the data in a table, e.g.:

mysql> SELECT * FROM customers;

debezium: select statement
Image Source: Self

Kafka Connect

Start the Kafka Connect service after connecting to the inventory database through the MySQL command-line client. The Kafka Connect service exposes an API used to manage the Debezium connector.

Steps to start Kafka Connect:

  1. Open a new terminal.
  2. Start the Kafka Connect service in a container using the following command.
debezium: kafka connect
Image Source: Self
  • --name connect: It sets the name of the container.
  • -p 8083:8083: It lets applications outside the container use the Kafka Connect API to communicate with the service. It maps port 8083 in the container to the same port on the Docker host.
  • -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets -e STATUS_STORAGE_TOPIC=my_connect_statuses: It sets the environment variables needed by the Debezium image.
  • --link zookeeper:zookeeper --link kafka:kafka --link mysql:mysql: It links this container to the already running containers: ZooKeeper, Kafka, and MySQL.
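
The options above combine into a command roughly like the one below. The image name `debezium/connect:1.8` and the `GROUP_ID=1` setting are assumptions taken from the Debezium tutorial (the article does not list `GROUP_ID`), and `start_connect` is a hypothetical wrapper name.

```shell
# Assumption: image name and tag follow the Debezium tutorial for version 1.8.
CONNECT_IMAGE="debezium/connect:1.8"

# Hypothetical helper: wraps the docker command so it only runs when called.
start_connect() {
  # GROUP_ID is an assumption from the Debezium tutorial; the three
  # *_STORAGE_TOPIC variables name the Kafka topics in which Kafka Connect
  # keeps connector configs, offsets, and statuses.
  docker run -it --rm --name connect \
    -p 8083:8083 \
    -e GROUP_ID=1 \
    -e CONFIG_STORAGE_TOPIC=my_connect_configs \
    -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
    -e STATUS_STORAGE_TOPIC=my_connect_statuses \
    --link zookeeper:zookeeper --link kafka:kafka --link mysql:mysql \
    "$CONNECT_IMAGE"
}
```

Run `start_connect` only after the ZooKeeper, Kafka, and MySQL containers are up, since the `--link` options refer to them by name.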

If you get the following output, Kafka Connect is running successfully.

debezium: kafka run successfully
Image Source: Self

Open a new terminal to check the status of Kafka Connect.

debezium: kafka connect status
Image Source: Self

To check the list of connectors registered with Kafka Connect, you can use the below command.

debezium: kafka connectors
Image Source: Self
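
Both checks above are plain HTTP requests against the Kafka Connect REST API on port 8083. A minimal sketch with `curl`, assuming the service is reachable on `localhost`; the function names `connect_status` and `list_connectors` are hypothetical.

```shell
# Assumption: Kafka Connect is published on the Docker host at port 8083.
CONNECT_URL="http://localhost:8083"

# Hypothetical helper: asks the root endpoint for version information,
# which confirms that the service is up.
connect_status() {
  curl -H "Accept:application/json" "$CONNECT_URL/"
}

# Hypothetical helper: lists the names of all registered connectors.
list_connectors() {
  curl -H "Accept:application/json" "$CONNECT_URL/connectors/"
}
```

Before any connector is registered, `list_connectors` should return an empty JSON array.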

Deploy the MySQL connector

After starting the Debezium and MySQL services, you deploy the MySQL connector by registering it with Kafka Connect so that it monitors the inventory database.

After that, it will monitor the MySQL server’s binlog, a binary log that keeps track of all the operations in the order in which they are committed to the database.

Steps:

  1. Open the terminal.
  2. Use the curl command below to register the Debezium MySQL connector. 
debezium: deploy mysql
Image Source: Self

The above command uses the Kafka Connect service’s API to submit a POST request. This request is sent to the connectors resource with a JSON document that describes the new connector, called inventory-connector.

The above command uses localhost to connect to the Docker host.
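
The registration request can be sketched as below. The JSON body is an assumption modelled on the Debezium tutorial for the 1.x MySQL connector: the `debezium`/`dbz` credentials, the server ID, the `dbserver1` logical name, and the history topic name are not given in this article, and later Debezium releases renamed some of these properties. `register_connector` is a hypothetical wrapper name.

```shell
# Assumption: Kafka Connect is published on the Docker host at port 8083.
CONNECT_URL="http://localhost:8083"

# Assumption: connector settings follow the Debezium 1.x tutorial; adjust the
# credentials and names to match your own MySQL server.
CONNECTOR_CONFIG='{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.inventory"
  }
}'

# Hypothetical helper: POSTs the JSON document to the connectors resource.
register_connector() {
  curl -i -X POST \
    -H "Accept:application/json" \
    -H "Content-Type:application/json" \
    "$CONNECT_URL/connectors/" \
    -d "$CONNECTOR_CONFIG"
}
```

Keeping the JSON in a variable makes it easy to review or edit the settings before calling `register_connector`.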

To check whether the inventory connector is present in the list of connectors, use the following command.

debezium: connectors inventory
Image Source: Self

Review the connector’s task using the below command.

debezium: inventory
Image Source: Self
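
The two checks above can be sketched with `curl` as follows, assuming the connector was registered under the name `inventory-connector` and Kafka Connect is reachable on `localhost:8083`. The function names are hypothetical.

```shell
# Assumptions: Connect address and connector name as used in the registration.
CONNECT_URL="http://localhost:8083"
CONNECTOR_NAME="inventory-connector"

# Hypothetical helper: lists registered connectors; the new connector's
# name should appear in the returned JSON array.
list_connectors() {
  curl -H "Accept:application/json" "$CONNECT_URL/connectors/"
}

# Hypothetical helper: returns the connector's configuration and its tasks.
show_connector() {
  curl -i -X GET -H "Accept:application/json" \
    "$CONNECT_URL/connectors/$CONNECTOR_NAME"
}
```

Call `list_connectors` first, then `show_connector` to review the tasks assigned to the connector.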

To verify, check the following output.

debezium: no of processes
Image Source: Self

The output below shows the steps the connector goes through when it is created and starts reading the MySQL server binlog.

debezium: binlog
Image Source: Self

The above output shows the inventory connector being created and started.

The below output shows the activity of the inventory connector after it has started.

debezium: binlog after start
Image Source: Self

The above is the Debezium connector’s log output. Each line carries thread-specific information, which makes the multithreaded Kafka Connect service easier to follow. The log also includes messages from the MySQL connector, the logical name of the connector, and the connector activity involved, such as task, snapshot, or binlog.

The first few lines of the output relate to the task activity, followed by the connector’s snapshot activity, which reports that a snapshot of the MySQL database is being started.

After the MySQL connector is deployed, it begins monitoring the inventory database for data change events such as inserting, updating, or deleting records.

If you want to stop the services, use the following command. Omit watcher if you did not start a container with that name.

$ docker stop mysqlterm watcher connect mysql kafka zookeeper

To verify that all the containers are stopped and removed, use the below command.

$ docker ps -a

You can also stop any single container by its name or its ID.

$ docker stop <container-name>
$ docker stop <container-id>

Conclusion

This tutorial gives you an idea about Debezium, why it is needed, its features, and its services. Along with Debezium, there are other similar tools like Keboola, Oracle GoldenGate, Talend, HVR, etc. In this tutorial, only the MySQL Debezium connector is explained, but there are different connectors for Oracle, SQL Server, and PostgreSQL databases, which can be used instead of MySQL.

An automated data pipeline makes it easier to build ETL processes, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline and has 100+ pre-built Integrations that you can choose from.

Hevo can help you integrate your data from numerous sources and load it into a destination to analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about Debezium in the comments section below.
