MySQL Change Data Capture (CDC): Steps to Stream Data


MySQL is an open-source relational database system that finds a place in most enterprise tech stacks worldwide. Even though recent developments in licensing restrictions imposed by Oracle have caused a slight drop in popularity, it still remains one of the most used structured databases. In a typical architecture, MySQL serves as the transactional database, with a separate data warehouse supporting the analytical and reporting requirements. This architecture means that there is a recurring need to move or copy data from MySQL to the data warehouse. In some cases, this copying operation needs to happen in real-time because of the criticality of the reports involved. The paradigm called change data capture (CDC) is the answer to such real-time sync requirements. This post is about the methods to implement MySQL CDC to different target databases.

Methods to Set Up Change Data Capture (CDC) in MySQL

This blog covers three approaches in detail: implementing CDC with triggers, implementing CDC with binary logs, and using an automated platform like Hevo. Additionally, the blog highlights the pros and cons of each approach so that you can evaluate all options and choose the path that suits you best.

MySQL CDC – Using Triggers

The simplest way to implement change data capture is to use a timestamp column in the table. The timestamp column should be updated every time there is a change in a row. An external script can then poll the table and take appropriate actions whenever the timestamp changes from the last access time; a minimal polling sketch is shown below. The problem with this approach is that developers usually will not have the flexibility to modify existing tables to accommodate CDC logic. The solution for this is MySQL triggers.
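For illustration, here is a minimal sketch of such a polling script written in Python with the PyMySQL driver. The connection details, the students table, and its last_modified_dt column are assumptions made for this sketch:

import time
from datetime import datetime

import pymysql

# Placeholder connection details; autocommit so each poll sees fresh data.
conn = pymysql.connect(host='127.0.0.1', user='mysql_user',
                       password='mysql_password', database='school',
                       autocommit=True)

last_seen = datetime(1970, 1, 1)  # timestamp of the most recent row processed

while True:
    with conn.cursor() as cursor:
        # Fetch rows modified since the last poll.
        cursor.execute(
            "SELECT id, age, name, last_modified_dt FROM students "
            "WHERE last_modified_dt > %s ORDER BY last_modified_dt",
            (last_seen,),
        )
        for id_, age, name, modified_at in cursor.fetchall():
            print(id_, age, name)  # replace with a write to the target database
            last_seen = modified_at
    time.sleep(10)  # poll every 10 seconds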

Triggers are database hooks that get executed when predefined events happen in the database. Triggers can be implemented to listen to INSERT, UPDATE or DELETE operations. While implementing CDC, triggers are configured to insert a new row into an external table every time an INSERT, UPDATE or DELETE happens. Let us consider a simple example of how to create a trigger for an INSERT operation. For now, let us assume there is a source table ‘students’ with id, age, and name as columns.

  1. Create a second table student_cdc to capture the updates as shown below.

    CREATE TABLE IF NOT EXISTS student_cdc (
      cdc_id BIGINT NOT NULL AUTO_INCREMENT,
      action VARCHAR(100),
      id BIGINT,
      age BIGINT,
      name VARCHAR(100),
      last_modified_dt TIMESTAMP,
      PRIMARY KEY (cdc_id)
    ) ENGINE=InnoDB;
  2. Create a trigger as below.

    DELIMITER $$
    CREATE TRIGGER student_insert
    AFTER INSERT ON students
    FOR EACH ROW
    BEGIN
      INSERT INTO student_cdc (action, id, age, name, last_modified_dt)
      VALUES ('insert', NEW.id, NEW.age, NEW.name, NOW());
    END$$
    DELIMITER ;

Once this configuration is done, a script can be implemented to poll the student_cdc table and take appropriate actions to write the data to the target database, as sketched below.
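As an illustration, here is a minimal sketch of such a script in Python with the PyMySQL driver. The connection details are placeholders, and deleting rows after they are processed is just one possible bookkeeping strategy:

import time

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='mysql_user',
                       password='mysql_password', database='school',
                       autocommit=True)

while True:
    with conn.cursor() as cursor:
        # Read pending change events in the order they were captured.
        cursor.execute("SELECT cdc_id, action, id, age, name, last_modified_dt "
                       "FROM student_cdc ORDER BY cdc_id")
        rows = cursor.fetchall()
        for row in rows:
            print(row)  # replace with a write of this change to the target database
        if rows:
            # Delete processed events so they are not handled twice.
            cursor.execute("DELETE FROM student_cdc WHERE cdc_id <= %s",
                           (rows[-1][0],))
    time.sleep(10)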

Even though this is a straightforward approach, it has multiple disadvantages.

  1. Triggers are usually performance-intensive and can lead to a significant drop in the performance of the transactional database.
  2. Triggers are executed along with the transaction, so if there is anything wrong with the trigger, the entire transaction can fail.

MySQL Change Data Capture Using Binary Logs

MySQL binary logs provide a very efficient way to track data changes. They contain events that describe the modifications to data. In short, binary logs contain all the information that is required to trace the global status of the server right from the time it was started. To enable binary logs, the MySQL server needs to be started with the --log-bin option.

The best way to understand how binary logs work is to use the mysqlbinlog utility, which can print the binary logs in a human-readable form. There are a number of open-source tools built on top of the binary log functionality to ease the developer burden in implementing MySQL CDC. A great example is the Maxwell tool, which reads the binary log and writes the changes to Kafka. Developers can implement Kafka consumers to process the events generated by Maxwell and write them to the appropriate target databases.

Before using these binary-log-based applications, add the below snippet to the MySQL server configuration to enable binary logs.

[mysqld]
server-id        = 1
log_bin          = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
max_binlog_size  = 100M
binlog-format    = row

Once the server is restarted with the above configuration, log in to the MySQL shell and perform an insert operation.

insert into students (age, name) values (15, 'alex');

Assuming Maxwell is installed, use the below command to view the Maxwell output in the terminal.

./bin/maxwell --user='mysql_maxwell_user' --password='maxwell_password' --host='127.0.0.1' --producer=stdout

The output will be as below.

{"database":"school","table":"students","type":"insert","ts":1472937475,"xid":211209,
"commit":true,"data":{"age":15,”name”:”alex”}}

Another great example of a binary-log-based application is the python-mysql-replication library. Once installed, this Python tool provides a binary log stream that can be processed to write to target databases. The code to print the stream is as simple as the snippet below.

from pymysqlreplication import BinLogStreamReader

# Connection settings for the MySQL server whose binary log is to be streamed.
mysql_settings = {'host': '127.0.0.1', 'port': 3306,
                  'user': 'mysql_user', 'passwd': 'mysql_password'}

# server_id must be unique among the replicas connected to this server.
log_stream = BinLogStreamReader(connection_settings=mysql_settings, server_id=100)

# Print every replication event to stdout.
for binlogevent in log_stream:
    binlogevent.dump()

log_stream.close()

As simple as it may seem, this is only a small part of the job done. Some of the challenges here are as follows.

  1. Once the binary log listening is implemented, there is still the tedious job of learning the basics of the target database and implementing custom logic to write the changes to it.
  2. Another caveat is that there needs to be a mechanism to keep track of the position up to which the binary log has been processed, so that the process can be restarted safely if something goes wrong; a sketch of such a mechanism follows this list.
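For illustration, below is a minimal sketch of such a checkpoint mechanism built on python-mysql-replication. The checkpoint file name, its JSON format, and the connection details are assumptions made for this sketch:

import json
import os

from pymysqlreplication import BinLogStreamReader

CHECKPOINT_FILE = 'binlog_checkpoint.json'  # hypothetical local checkpoint file

mysql_settings = {'host': '127.0.0.1', 'port': 3306,
                  'user': 'mysql_user', 'passwd': 'mysql_password'}

# Load the last saved binlog position, if any, so processing can resume.
log_file, log_pos = None, None
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE) as f:
        saved = json.load(f)
        log_file, log_pos = saved['log_file'], saved['log_pos']

log_stream = BinLogStreamReader(
    connection_settings=mysql_settings,
    server_id=100,
    resume_stream=True,  # continue from the saved position instead of the start
    log_file=log_file,
    log_pos=log_pos,
)

for binlogevent in log_stream:
    binlogevent.dump()  # replace with a write to the target database

    # Persist the current position after each processed event.
    with open(CHECKPOINT_FILE, 'w') as f:
        json.dump({'log_file': log_stream.log_file,
                   'log_pos': log_stream.log_pos}, f)

log_stream.close()

In a production pipeline, the checkpoint would ideally be updated atomically with the write to the target so that events are neither lost nor applied twice.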

In comparison to the above approaches, a much better way would be to use an automated ETL solution like Hevo, which can perform a CDC-based copy operation seamlessly.

An Easier Approach to Implement MySQL Change Data Capture

Hevo is a fully-managed Data Integration Platform that abstracts away all the above challenges and provides an easy-to-use interface to perform MySQL CDC for a variety of target databases and data warehouses. Hevo's point-and-click interface allows developers to implement such a real-time data load with the lowest possible time to production. Hevo can help you set up MySQL CDC in 3 simple steps:

  1. Authenticate and connect your MySQL data source
  2. Select Change Data Capture (CDC) as your replication mode
  3. Point to the destination where you want to move data

Sign up for a 14-day free trial to experience hassle-free MySQL CDC to your destination database or data warehouse.


Here are some of the additional aspects that Hevo can take care of for you:

  1. Easy Implementation – Hevo is a completely managed, self-serve platform. You can have your MySQL replication infrastructure up and running in just a few minutes.
  2. Zero Maintenance – Once set up, Hevo takes care of monitoring and maintaining your data pipelines from MySQL to your target data warehouse. In case of any schema changes, Hevo can automatically handle them and replicate the same schema on the target data warehouse.
  3. Granular Logging – Hevo provides granular activity logs that allow you to observe the data flow from MySQL to your destination. This allows you to always stay on top of your data.
  4. Notifications and Alerts – Should something need your attention, Hevo sends notifications over email. This allows you to take immediate action to ensure that the data from MySQL is always up-to-date in your target database.
  5. Scalable Infrastructure – While your ETL requirements may have started with MySQL CDC, they often do not end there. As your business grows, you will need to bring data from many different sources into your target/destination. Hevo natively integrates with hundreds of data sources so that you can scale as and when your needs change.

Sign up for a 14-day free trial with Hevo to experience hassle-free MySQL CDC to any target of your choice.
