Change Data Capture (CDC) is a well-known software design pattern for monitoring and recording data changes so that other software can react to them. CDC captures row-level changes in database tables and publishes them to a data streaming bus as change events. Applications can read these change event streams and access the changes in the order in which they occurred. As a result, CDC helps integrate legacy data stores with modern cloud-native, event-driven systems. Many CDC tools exist today, but Debezium is one of the most popular: it records every event (Insert, Update, Delete) from the source database and sends it to Kafka using connectors.
Upon a complete walkthrough of this article, you will gain a solid understanding of Debezium and the Debezium MySQL Connector, along with the salient features that they offer. This article will also provide you with a step-by-step guide on how to install and configure the Debezium MySQL Connector. Read along!
Prerequisites
- Understanding of Event Streams
Understanding the Need for CDC
CDC is a widely used approach for replicating data to other databases, feeding analytics systems, extracting microservices from monoliths, and invalidating caches. As your organization’s databases grow to house ever-increasing volumes of data, you will rely on CDC to keep cloud data services in sync and operating smoothly. Since much of your business centers on how business-critical data changes, those modifications need to trigger further processing; it is therefore important for modern application designs to support such processes.
To address real-time data streaming challenges, CDC relies on write-ahead or transaction logs to keep track of changes. Because changes are persisted in the datastore, no data is lost if a consumer is temporarily offline, and changes can be pushed downstream as soon as they occur. The changes can also be made available in a format that data integration tools can consume during the ETL (Extract, Transform, and Load) process. By reading the source database’s binary log (or using trigger functions) to ingest only the data that has changed since the last ETL run, rather than the complete database, CDC also minimizes the resources required for ETL procedures.
Why Debezium?
RedHat’s Debezium is a popular tool that captures real-time change data from multiple data sources and produces output data streams. Debezium is an open-source, distributed CDC tool developed on top of Kafka Connect that can stream changes in real time from MySQL, PostgreSQL, MongoDB, Oracle, and Microsoft SQL Server into Kafka. It connects to the source database through Kafka Connect, and because it offers low latency and high data fidelity, it is exceptionally fast and durable, reacts to events rapidly, and captures database changes efficiently.
On the database side, Debezium’s behavior is determined by the database it connects to. For example, in MySQL it reads the binary log (binlog) to determine the nature of each transaction, whereas in MongoDB it hooks into the native replication mechanism (the oplog).
Built on top of Apache Kafka and managed through the Kafka Connect REST API, Debezium provides a connector library that captures changes from a number of databases and produces events with fairly similar structures, making it easy for applications to consume and respond to the events regardless of where the changes came from. Debezium detects changes and translates them into events within a single connector. The following connectors are presently available in Debezium:
- MySQL Connector
- MongoDB Connector
- PostgreSQL Connector
- SQL Server Connector
- Oracle Connector
- Cassandra (Incubating)
- Vitess (Incubating)
Although multiple connectors exist, they all create events with fairly similar formats, making it much easier for your applications to consume and respond to the events irrespective of where the changes came from.
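As an illustration, here is a simplified sketch of the value envelope a Debezium change event carries (schema metadata omitted; exact fields vary by connector and version, and the row values shown are hypothetical):

{
  "before": { "id": 1001, "email": "old@example.com" },
  "after": { "id": 1001, "email": "new@example.com" },
  "source": { "connector": "mysql", "db": "inventory", "table": "customers" },
  "op": "u",
  "ts_ms": 1650000000000
}

The op field distinguishes creates (c), updates (u), deletes (d), and snapshot reads (r), while before and after hold the row state on either side of the change.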
Key Features of Debezium
- Database Consistency: Debezium ensures eventual consistency by relying on the database’s transaction log to capture changes. Debezium replays the transaction log (e.g., PostgreSQL’s WAL or MySQL’s binlog) in much the same way a read replica or a database restore does. In fact, the resulting topic can be thought of as a read replica of the database table.
- Failure Handling: Debezium keeps track of how far it has read in the transaction log and stores the read offset in Kafka. In the event of a failure, such as poor network connectivity, Debezium continues from where it left off. As a result, you can be confident that no modifications will be missed during downtime.
Debezium Architecture
Debezium is made up of the following key components:
- Source: This is a database that we commonly use in projects, such as MySQL, PostgreSQL, Oracle, MariaDB, and others.
- Sink: Stores all of the data we acquire from the Source.
- Intermediate elements: This section includes:
- The Debezium library, which reads a MySQL binlog file (or the transaction log of another database) and transforms CRUD operations into Debezium events. Generally, a database’s transactions are recorded in the binlog (such as changes to individual rows and changes to schemas), and the binlog is also what the database engine uses for backup and replication.
- Kafka Connect, which is used to get data out of Debezium. Kafka Connect is a scalable and reliable data streaming tool that connects Apache Kafka to other systems. Like the Kafka broker, Kafka Connect is a standalone service; it is essentially a framework for Source connectors and Sink connectors.
- At present, Debezium has three deployment options: you may embed it as a library in your application, run it as a standalone server, or deploy it as an Apache Kafka Connect service, which is the most common choice for production use cases.
Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up data integration from MySQL and 150+ Data Sources (including 40+ Free Data Sources) and will let you directly load data to a Data Warehouse or the destination of your choice. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.
Let’s look at some of the salient features of Hevo:
- Fully Managed: It requires no management or maintenance, as Hevo is a fully automated platform.
- Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for 150+ sources that can help you scale your data infrastructure as required.
- Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-day free trial!
Understanding Debezium MySQL Connector
Debezium MySQL Connector is a Source connector that can take a snapshot of existing data and track all row-level changes in the databases on a MySQL server or cluster. The first time it connects to a MySQL server, the connector reads a consistent snapshot of all the databases. This initial snapshot is needed because MySQL is often set up to purge binlogs after a defined length of time, so the full history may no longer be available in the log. After the snapshot, the connector reads the binlog from the snapshot’s starting point.
When the snapshot is complete, the connector reads the changes committed to MySQL and creates INSERT, UPDATE, and DELETE events as needed. Each table’s events are stored in their own Kafka topic, where applications and services may readily access them.
The binary log (binlog) in MySQL records all database actions in the order in which they are committed. This includes both modifications to table schemas and changes to table data. The binlog is used by MySQL for replication and recovery.
The MySQL connector also guards against event loss. As it reads the binlog and creates events, the connector records the binlog position with each event. When the connector shuts down (for example, due to communication problems, network difficulties, or crashes), it resumes reading the binlog from where it left off once restarted. Note that if the initial snapshot is not finished when the connector stops, a new snapshot is started when the connector restarts.
To learn about Debezium SQL Server Connector, click here.
Setting up the Debezium MySQL Connector
Before the Debezium MySQL connector can be used to track changes on a MySQL server, the server must be configured for row-level binary logging and must have a database user with the required permissions. If MySQL is set up to use global transaction identifiers (GTIDs), the Debezium connector can reconnect more quickly when one of the MySQL servers goes down. GTIDs were introduced in MySQL 5.6.5 and uniquely identify a transaction that takes place on a specific server within a cluster. The use of GTIDs substantially simplifies replication and makes it easy to verify that sources and replicas are in sync.
The Debezium MySQL connector writes all INSERT, UPDATE, and DELETE operations from a single table to a single Kafka topic. The Kafka topic naming convention is as follows:
serverName.databaseName.tableName
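For example, assuming a connector whose logical server name is fulfillment, a change to the customers table in the inventory database would be written to the topic:

fulfillment.inventory.customers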
There are some MySQL setup tasks required before you can install and run a Debezium connector:
A) Creating User
A Debezium MySQL connector requires a MySQL user account. This user must have appropriate permissions on all databases for which the connector captures changes. Create the user:
mysql> CREATE USER 'user'@'localhost' IDENTIFIED BY 'password';
- Grant the required permissions to the user:
mysql> GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'user'@'localhost';
- Finalize the user’s permissions:
mysql> FLUSH PRIVILEGES;
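To verify that the permissions took effect, you can inspect the user’s grants:

mysql> SHOW GRANTS FOR 'user'@'localhost';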
B) Enabling binlog
For MySQL replication, binary logging must be enabled.
- Check whether the log-bin option is already on:
mysql> SELECT variable_value as "BINARY LOGGING STATUS (log-bin) ::"
FROM information_schema.global_variables WHERE variable_name='log_bin';
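Note that on MySQL 5.7.6 and later, this table lives in performance_schema rather than information_schema, so the equivalent check on newer servers is:

mysql> SELECT variable_value as "BINARY LOGGING STATUS (log-bin) ::"
FROM performance_schema.global_variables WHERE variable_name='log_bin';

Binary logging is also enabled by default from MySQL 8.0 onward.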
- If it is OFF, configure your MySQL server configuration file with the following properties, which are described below:
server-id = 223344
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 10
Here,
- server-id: Each server and replication client in a MySQL cluster must have a unique server-id value.
- log_bin: Its value is the base name of the binlog file series.
- binlog_format: The binlog_format must be set to ROW or row.
- binlog_row_image: The binlog_row_image must be set to FULL or full.
- expire_logs_days: The number of days after which binlog files are automatically removed. The default value is 0, implying that there will be no automatic removal.
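If you are running MySQL 8.0 or later, note that expire_logs_days is deprecated there in favor of binlog_expire_logs_seconds; an equivalent 10-day retention would be:

binlog_expire_logs_seconds = 864000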
- Confirm your changes by checking the binlog status once more:
mysql> SELECT variable_value as "BINARY LOGGING STATUS (log-bin) ::"
FROM information_schema.global_variables WHERE variable_name='log_bin';
C) Configuring Session Timeouts
When an initial consistent snapshot is taken of a large database, your established connection might time out while the tables are being read. You can prevent this by setting the interactive_timeout and wait_timeout parameters in your MySQL configuration file.
- Configure interactive_timeout and wait_timeout:
interactive_timeout=<duration-in-seconds>
wait_timeout=<duration-in-seconds>
D) Enabling Query Log Events
Enabling the binlog_rows_query_log_events option in the MySQL configuration file allows you to see the original SQL statement for each binlog event.
binlog_rows_query_log_events=ON
Here, ON refers to enabled status.
Installing the Debezium MySQL connector
To install the Debezium MySQL connector, you need to download the connector archive, extract the JAR files into your Kafka Connect environment, and make sure the plug-in’s parent directory is listed in your Kafka Connect plugin path. Before proceeding, ensure you have Zookeeper, Kafka, and Kafka Connect installed. You must also have MySQL Server installed and configured.
- Download the Debezium MySQL Connector plug-in.
- In your Kafka Connect environment, extract the files.
- In Kafka Connect’s plugin.path, add the directory containing the JAR files (see the example after this list).
- Configure the connector and add it to your Kafka Connect cluster’s settings.
- Restart the Kafka Connect process. This guarantees that the new JARs are picked up.
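As an illustration, assuming the plug-in was extracted to a hypothetical /opt/connectors/debezium-connector-mysql directory, the Kafka Connect worker configuration (e.g., connect-distributed.properties) would point plugin.path at its parent directory:

plugin.path=/opt/connectors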
The Debezium MySQL connector is typically configured in a .yaml file via the connector’s configuration settings.
- Set the connector’s “name” in the .yaml file.
- Set the configuration parameters that your Debezium MySQL connector needs.
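Whether the configuration is written as YAML or as JSON for the Kafka Connect REST API, the same settings apply. Below is a minimal sketch using Debezium 1.x property names (newer releases rename some of these, e.g., database.server.name became topic.prefix) and placeholder connection details:

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "user",
    "database.password": "password",
    "database.server.id": "223344",
    "database.server.name": "fulfillment",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

Here, database.server.name supplies the serverName prefix used in the Kafka topic naming convention described earlier.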
Adding Connector Configuration
Create a connector configuration and add it to your Kafka Connect cluster to start using a MySQL connector.
- Construct a Debezium MySQL connector configuration.
- To add that connector configuration to your Kafka Connect cluster, use the Kafka Connect REST API.
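For example, assuming the configuration above is saved as register-mysql.json and the Kafka Connect REST API is listening on its default port 8083, the connector can be registered with:

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @register-mysql.json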
When the Debezium MySQL connector first starts up, it takes a consistent snapshot of the MySQL databases for which it is configured. After that, the connector generates data change events for row-level operations and streams the change event records to Kafka topics.
For more info on Debezium MySQL Connector, visit here.
Conclusion
In this article, you learned about the need to leverage CDC tools and the basics of Debezium. You also learned that the Debezium MySQL connector monitors table structure, takes snapshots, converts binlog events into Debezium change events, and records its progress as offsets in Kafka.
Migrating large volumes of data from MySQL to a Cloud-based Data Warehouse is a time-consuming and inefficient operation, but with a Data Integration tool like Hevo Data, you can do it in no time and with no effort.
Hevo Data with its strong integration with 150+ Sources & BI tools such as MySQL, PostgreSQL, MS SQL Server, etc., allows you to not only export data from sources & load data to the destinations, but also transform & enrich your data, & make it analysis-ready so that you can focus only on your key business needs and perform insightful analysis using BI tools.
Give Hevo Data a try and sign up for a 14-day free trial today. Hevo offers plans & pricing for different use cases and business needs; check them out!
Share your experience of learning about Debezium MySQL Connector. Tell us in the comments below!
Preetipadma is passionate about freelance writing within the data industry, expertly delivering informative and engaging content on data science by incorporating her problem-solving skills.