MySQL CDC: 3 Easy Methods to Stream Data

on Tutorials, Data Integration • February 1st, 2022 • Write for Hevo

MySQL CDC Change Data Capture

MySQL is an open-source relational database system that finds a place in most enterprise tech stacks worldwide. Even though the recent developments in licensing restrictions imposed by Oracle have caused a slight drop in popularity, it still remains one of the most used structured databases. In the typical architecture, MySQL is used as the transactional database with a separate data warehouse to support the analytical and reporting requirements.

This architecture means that there is a recurring need to move or copy data from MySQL to the data warehouse. In some cases, this copying operation needs to happen in real-time because of the criticality of reports involved. The paradigm called change data capture is the answer for such real-time sync requirements. This post is about the methods to implement MySQL CDC to different target databases. This blog will touch upon all 3 aspects in detail. Additionally, the blog also highlights the pros and cons of each approach so that you can evaluate all options and choose the path that suits you best.

Table of Contents

Introduction to MySQL

MySQL Logo
Image Source

MySQL was used by more than 39% of developers in the world back in 2019, making it the world’s most popular Database. It certainly lacks the extensive features that are provided by PostgreSQL, but it is still useful for a large number of applications like web applications. 

Since it comes standard in the LAMP stack, where LAMP stack is an Open-Source suite of web applications that consists of Linux, Apache HTTP Server, MySQL, PHP; MySQL is the go-to choice for scalable web applications. Let’s talk about a few salient features that make MySQL such a great catch.    

Key Features of MySQL

  • Maintained by Oracle: Oracle owns and maintains MySQL. It also offers premium versions of MySQL with additional services, proprietary plugins, user support, and extensions. 
  • Long History: MySQL has been around for over 20 years since its first release in 1995.
  • Frequent Updates: MySQL is made more robust with frequent updates with new features and security improvements. The latest release is Version 8.0.23 released on 18 January 2021.
  • MVCC Features: MySQL recently started offering MVCC (Multi-Version Concurrency Control) features. 
  • A Supportive Community: A devoted community of developers is available to help with troubleshooting as and when needed.
  • Open-Source: MySQL is also a free and Open-Source Relational Database Management System (RDBMS).
  • Users: MySQL is used widely by Google, NASA, Flickr, GitHub, Netflix, Tesla, Twitter, Uber, Wikipedia, YouTube, Zendesk to name a few.

Key Advantages of Using MySQL

MySQL is a lightweight database that can be installed and used by developers on production application servers with large multi-tier applications. Leveraging MySQL as a part of your workflow has several advantages:  

  • Features like Master-Slave Replication, Scale-Out are supported by MySQL.
  • Offload Reporting, Geographic Data Distribution is also supported by MySQL.
  • There’s a very low overhead with the MyISAM storage engine when used for read-only applications.
  • For frequently used tables support is provided for the Memory Storage Engine.
  • For repeatedly used statements there exists a Query Cache. 
  • MySQL is easy to learn and troubleshoot given a wide number of helpful sources like blogs, white papers, and books on the subject. 
  • MySQL is a highly flexible and scalable Database Management System.

Understanding the Methods to Set up MySQL CDC

CDC Diagram
Image Source

These are the methods you can use to setup MySQL CDC in a seamless fashion:

MySQL CDC Setup: Using MySQL Triggers

The simplest way to implement a change data capture is to use a timestamp column with the table. The timestamp column should be changed every time there is a change in a row. An external script can poll the table and take appropriate actions whenever the timestamp changes from the last access time. The problem with this approach is that developers usually will not have the flexibility to modify existing tables to accommodate CDC logic. The solution for this is MySQL triggers. 

Triggers are database hooks that get executed when predefined events happen in the database. Triggers can be implemented to listen to INSERT, UPDATE or DELETE operations. While implementing CDC, triggers are configured to insert a new row into an external table every time an INSERT, UPDATE, or DELETE happens. Let us consider a simple example of how to create a trigger for an INSERT operation. For now, let us assume there is a source table ‘students’ with id, age, and name as columns.

  1. Create a second table student_cdc to capture the updates as shown below.

     

    CREATE TABLE IF NOT EXISTS student_cdc (
    
     cdc_id BIGINT NOT NULL AUTO_INCREMENT,
    
     action varchar(100),
    
      id BIGINT,
    
      age BIGINT,
    
      name VARCHAR(100),
    
      last_modified_dt timestamp,
    
     PRIMARY KEY (‘cdc_id’)
    
     ) ENGINE=InnoDB;
  2. Create a trigger as below.

     

    CREATE TRIGGER ‘student_insert’
    
     AFTER INSERT ON ‘student’
    
     FOR EACH ROW BEGIN
    
     insert into ‘student_cdc’ (‘action’, id, age, name, last_modified_dt)
    
     values ('insert', NEW.id, NEW.age, NEW.name,now());

Once this configuration is done, a script can be implemented to poll the second table and take appropriate actions to write the data to the target database.

Even though this is a straightforward approach, there are multiple disadvantages to this. 

  • Triggers are usually performance-intensive and it can lead to a significant drop in the performance of the transactional database.
  • Triggers are executed along with the transaction. So if there is anything wrong with the trigger, the entire transaction can fail.

MySQL CDC Setup: Using Binary Logs

MySQL binary logs provide a very efficient way to track data changes for MySQL CDC. They contain events that describe the modifications to data. In short, binary logs contain all the information that is required to trace the global status of the server right from the time it was started. To enable binary logs, MySQL server needs to be started with –log-bin option. 

The best way to understand how the binary logs works would be to use the mysqlbinlog that can print the binary logs in a user-readable form. There are a number of open-source tools that are built on top of binary log functionality to ease the developer burden in implementing MySQL CDC. A great example is the maxwell tool which reads the binary log and writes to Kafka. Developers can implement Kafka consumers to process the events generated by maxwell and write to appropriate target databases. 

Before using these binary log applications, use the below snippet to change the MySQL server configuration to enable binary logs. 

[mysqld]

server-id  = 1

log_bin  = /var/log/mysql/mysql-bin.log

expire_logs_days = 10

max_binlog_size  = 100M

binlog-format    = row 

Once the server is started with the above configuration, log in to MySQL shell and perform an insert operation.

insert into students (age,nam) values (15,’alex’);

Assuming maxwell is installed, use the below command to view the maxwell output in the terminal.

./bin/maxwell --user=’mysql_maxwell_user’' --password=’maxwell_passwordl' --host='127.0.0.1' --producer=stdout      

The output will be as below.

{"database":"school","table":"students","type":"insert","ts":1472937475,"xid":211209,
"commit":true,"data":{"age":15,”name”:”alex”}}

Another great example of a binary log-based application is the python-mysql-replication that is available here. Once installed, this python tool can provide a binary log stream that can be processed to write to target databases. The code to print the stream will be as simple as the below snippet.

from pymysqlreplication import BinLogStreamReader

mysql_settings = {'host': '127.0.0.1', 'port': 3306, 'user': mysql_user, 'passwd': 'mysql_password'}

log_stream = BinLogStreamReader(connection_settings = mysql_settings, server_id=100)

for binlogevent in log_stream:

    binlogevent.dump()

log_stream.close()

As simple as it may seem, this is only a very small percentage of the job done. Some of the challenges here are as follows.

  • Once the binary log listening is implemented, there is the tedious job of learning the basics of the target database and implementing custom logic to write to the target database.
  •  Another caveat here is that there needs to be a mechanism to keep track of the extent to which the binary log was processed so that there is some way of restarting the process if something goes wrong.

In comparison to the above approaches, a much better way would be to implement an automated ETL solution like Hevo, that can perform a CDC-based copy operation seamlessly.

MySQL CDC Setup: Using Hevo Data

Hevo Logo
Image Source

Hevo Data is a fully managed Data Integration Platform that masks away all the above challenges and provides an easy-to-use interface to perform MySQL CDC for a variety of target databases and data warehouses. Hevo’s point and click interface allow developers to implement such a real-time data load with the lowest time to production possible.

Hevo can help you set up MySQL CDC in 3 simple steps:

  1. Authenticate and connect your MySQL data source
  2. Select Change Data Capture (CDC) as your replication mode
  3. Point to the destination where you want to move data
SIGN UP HERE FOR A 14-DAY FREE TRIAL

Here are some of the additional aspects that Hevo can take care of you:

  1. Easy Implementation – Hevo is a completely managed self-serve platform. You can have your MySQL replication infrastructure up and running in just a few minutes.
  2. Zero Maintenance – Once set up, Hevo takes care of monitoring and maintaining your data pipelines from MySQL to your target data warehouse. In case of any schema changes, Hevo can automatically handle it and replicate the same schema on the target data warehouse.
  3. Granular Logging  – Hevo provides granular activity logs that allow you to observe the data flow from MySQL to your destination. This will allow you to always stay on top of your data.
  4. Notification and Alerts – Should something need your attention, Hevo sends notifications over email. This will allow you to take immediate action to ensure that the data from MySQL data is always up-to-date in your target database.
  5. Scalable Infrastructure – While your ETL requirements have started with MySQL CDC, they often do not end there. As your business grows, you would need to bring data from many different data sources into your target/destination. Hevo natively integrates with 100’s of data sources so that you can scale as and when your needs change.

Conclusion

This blog talks about the different ways you can use to setup MySQL CDC in a seamless fashion namely: MySQL Triggers, Binary Logs, and with the help of a third-party tool, Hevo.

visit our website to explore hevo

Extracting complex data from a diverse set of data sources can be a challenging task and this is where Hevo saves the day! Hevo offers a faster way to move data from Databases or SaaS applications into your Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

SIGN UP for a 14-day free trial and see the difference!

You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs! 

No-code Data Pipeline For MySQL