When handling big data, two of the biggest challenges data engineers face are data inconsistency and monitoring very large databases. These challenges become even harder at the scale of data maintenance in large companies.

Because Debezium is a distributed platform, it helps data engineers tackle these big data challenges. It is also a good fit for applications that need a fault-tolerant, high-performance, scalable, and reliable platform for tracking change events.

Today, because of these features, Debezium is used by data engineers at many organizations.

Understanding Debezium

  • Debezium is a distributed services platform that is open-source and released under the Apache License, Version 2.0. Randall Hauch founded the Debezium project while working as a Software Developer at Red Hat to solve the data monitoring challenge.
  • Debezium is a low-latency data streaming platform for change data capture (CDC). It turns existing databases into event streams, so applications can see and respond immediately to each row-level change in the databases.
  • Debezium is built on Kafka and Kafka Connect, and it derives its durability, reliability, and fault tolerance from them.
  • Kafka Connect is a distributed, scalable, and fault-tolerant service. Each Debezium connector deployed to it monitors a single database server and records all changes in one or more Kafka topics.
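
To make the idea of a row-level change event concrete, here is a minimal sketch of what a consumer might receive for an update. The field names (before, after, source, op, ts_ms) follow Debezium's event envelope; the table, values, and the describe helper are hypothetical, and real events may additionally be wrapped in a schema/payload structure depending on the converter settings.

```python
# Hypothetical, simplified change event value for an UPDATE on a "customers" table.
# Field names follow Debezium's envelope; the values are made up for illustration.
update_event = {
    "before": {"id": 1004, "email": "annek@noanswer.org"},
    "after": {"id": 1004, "email": "anne.k@example.com"},
    "source": {"connector": "mysql", "db": "inventory", "table": "customers"},
    "op": "u",
    "ts_ms": 1700000000000,
}

# Debezium operation codes: c = create, u = update, d = delete, r = read (snapshot).
OPERATIONS = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}


def describe(event: dict) -> str:
    """Return a one-line, human-readable summary of a change event."""
    source = event["source"]
    operation = OPERATIONS.get(event["op"], "unknown")
    return f"{operation} on {source['db']}.{source['table']}"


print(describe(update_event))  # update on inventory.customers
```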

Top 5 Key Debezium Features to Consider in 2024

1. Log-Based CDC

  • A change data capture (CDC) system monitors and captures the changes that occur in a database so that other systems or applications can respond to them. CDC is commonly used to keep data warehouses up to date as the data in the source databases changes.
  • Debezium is a modern, distributed, open-source CDC platform that monitors various database systems. There are different CDC approaches, such as query-based and log-based. Debezium implements the log-based approach.
  • The following are some of the benefits that log-based CDC brings to Debezium:

i. Every data change is captured

  • When dealing with a huge database, one of the most important tasks is recording every change made in the database.
  • Log-based CDC reads the database log, so users see the complete list of changes in the exact order in which they were made.
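
The difference from query-based capture can be illustrated with a toy simulation. This is not Debezium code: the table, the log, and the write and poll helpers are all hypothetical in-memory stand-ins. The sketch shows that a poller only sees the state at poll time, while the log keeps every change in order.

```python
# Toy comparison of query-based polling and log-based capture.
# Everything here is an in-memory simulation, not Debezium code.
table = {}   # current state of the table: primary key -> row
log = []     # append-only change log, in commit order


def write(op, key, row=None):
    """Apply a change to the table and append it to the log."""
    if op == "delete":
        table.pop(key, None)
    else:
        table[key] = row
    log.append((op, key, row))


def poll(previous):
    """Query-based CDC: return rows that differ from the last poll result."""
    return {k: v for k, v in table.items() if previous.get(k) != v}


last_seen = dict(table)  # result of the previous poll (the table was empty)
write("insert", 1, {"status": "new"})
write("update", 1, {"status": "paid"})
write("update", 1, {"status": "shipped"})
write("insert", 2, {"status": "new"})
write("delete", 2)

polled = poll(last_seen)
print(polled)    # {1: {'status': 'shipped'}}
print(len(log))  # 5
```

The poller misses the intermediate states of row 1 and never learns that row 2 existed, whereas the log contains all five changes.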

ii. Low Delays of Events While Avoiding Increased CPU Load

  • With log-based CDC, users can react to data changes in near real time without spending CPU time on continuously running polling queries.
  • As a result, events are delivered with low delay while the CPU load is reduced.

iii. Can Capture Old Record State and Further Metadata

  • Log-based CDC provides the entire history of the changes made in the databases. It can also provide the old record state for update and delete events, depending on the capabilities of the source database.
  • Moreover, log-based approaches can offer streams of schema changes as well as additional metadata such as transaction IDs.
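
When the old record state is available, a consumer can work out exactly which columns changed. The sketch below assumes an event with before and after fields as in Debezium's envelope; the changed_columns helper, the sample row, and the transaction ID value are hypothetical. The before field may be empty if the source database is not configured to log the old row, which the helper tolerates.

```python
def changed_columns(event: dict) -> dict:
    """Map each modified column to its (old, new) pair for a change event."""
    before = event.get("before") or {}
    after = event.get("after") or {}
    return {
        column: (before.get(column), after.get(column))
        for column in sorted(set(before) | set(after))
        if before.get(column) != after.get(column)
    }


# Hypothetical update event; source.txId stands in for transaction metadata.
event = {
    "op": "u",
    "before": {"id": 7, "status": "paid", "total": 40},
    "after": {"id": 7, "status": "shipped", "total": 40},
    "source": {"txId": 5512},
}
print(changed_columns(event))  # {'status': ('paid', 'shipped')}
```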

2. Revised CDC capabilities

Incremental Snapshots

  • During the snapshot phase, Debezium captures the data that already exists in the tables. The goal of this phase is to capture a consistent view of the data at a certain moment in time. By default, it runs only once, when the connector is first started.
  • However, this can be a lengthy process. To solve this problem, Debezium adopted the snapshotting approach of DBLog, Netflix’s in-house CDC framework. With this approach, the data is snapshotted in chunks while the connector keeps streaming changes, and the snapshot can be resumed if the connector crashes or is stopped.
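
The core idea of chunked, resumable snapshotting can be sketched in a few lines. This is a simplified illustration, not Debezium's implementation: the snapshot_in_chunks function and the in-memory table are hypothetical, and the sketch omits the step in which each chunk is reconciled with changes arriving from the log at the same time.

```python
def snapshot_in_chunks(rows, chunk_size, resume_after=None):
    """Yield (chunk, last_key) pairs, reading rows in primary-key order.

    rows: dict mapping primary key -> row (stands in for a table).
    resume_after: last key already snapshotted, or None to start from scratch.
    """
    keys = sorted(k for k in rows if resume_after is None or k > resume_after)
    for start in range(0, len(keys), chunk_size):
        chunk_keys = keys[start:start + chunk_size]
        yield [rows[k] for k in chunk_keys], chunk_keys[-1]


table = {i: {"id": i} for i in range(1, 8)}  # 7 hypothetical rows

# First run: the process "crashes" after two chunks have been read.
checkpoint = None
for n, (chunk, last_key) in enumerate(snapshot_in_chunks(table, 3)):
    checkpoint = last_key  # real code would persist this durably
    if n == 1:
        break

# Second run: resume from the stored checkpoint instead of starting over.
remaining = [
    row["id"]
    for chunk, _ in snapshot_in_chunks(table, 3, checkpoint)
    for row in chunk
]
print(checkpoint, remaining)  # 6 [7]
```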

Message Filtering 

  • Debezium, by default, delivers every received data change event to the Kafka broker.
  • However, when a consumer needs only a subset of the events, Debezium provides a message filtering option in the form of a single message transform (SMT).
  • The filter SMT evaluates each event in the stream against the configured filter conditions and passes only the events that meet them on to the broker.
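
As a rough sketch, the filter SMT is enabled through connector properties like the ones below, shown here as a Python dictionary. The alias "filter" and the condition are assumptions chosen for illustration, and the SMT needs a JSR 223 scripting implementation such as Groovy available to Kafka Connect. The passes_filter function is a hypothetical Python stand-in that mimics the condition locally; it is not part of Debezium.

```python
# Hypothetical connector properties enabling Debezium's filter SMT.
# The condition is a Groovy expression; "filter" is just an alias chosen here.
filter_config = {
    "transforms": "filter",
    "transforms.filter.type": "io.debezium.transforms.Filter",
    "transforms.filter.language": "jsr223.groovy",
    "transforms.filter.condition": "value.op == 'u'",
}


def passes_filter(event: dict) -> bool:
    """Python stand-in for the condition above: keep only update events."""
    return event.get("op") == "u"


events = [{"op": "c", "id": 1}, {"op": "u", "id": 1}, {"op": "d", "id": 1}]
forwarded = [e for e in events if passes_filter(e)]
print(forwarded)  # [{'op': 'u', 'id': 1}]
```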

Masking the data 

  • The risk of exposing sensitive data is one of the biggest concerns when handling critical data.
  • Debezium provides a masking feature for sensitive data: users can configure selected columns so that their values are masked, for example replaced with a fixed string of characters or with a salted hash, before the events are published. This feature strengthens data security.
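
The following sketch illustrates two common masking styles in plain Python. It is an illustration of the idea only, not Debezium's algorithm: the mask_fixed and mask_hash helpers, the salt, and the sample row are hypothetical. In Debezium itself, masking is configured per column through connector properties (at the time of writing, the relational connectors document options of the form column.mask.with.<length>.chars and column.mask.hash.<algorithm>.with.salt.<salt>; check the documentation for your connector version).

```python
import hashlib


def mask_fixed(length: int) -> str:
    """Return a fixed-length run of asterisks that replaces the real value."""
    return "*" * length


def mask_hash(value: str, salt: str) -> str:
    """Return a salted SHA-256 digest of the value as a hex string."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical row with two sensitive columns.
row = {"id": 1001, "email": "sally@example.com", "phone": "555-0100"}
masked = dict(
    row,
    email=mask_hash(row["email"], salt="demo-salt"),
    phone=mask_fixed(8),
)
print(masked["phone"])       # ********
print(len(masked["email"]))  # 64
```

A salted hash maps equal inputs to equal outputs, so masked values can still be compared or joined across events without revealing the original value.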

Monitoring Debezium 

  • To monitor Debezium, users can rely on the JMX metrics that Apache Kafka, Apache ZooKeeper, and Kafka Connect provide through their built-in support.
  • Each connector also provides additional metrics that you can use to monitor its activity.
  • Connector-specific metrics are available for the SQL Server, MySQL, PostgreSQL, and MongoDB connectors, among others.

Message Transformations

  • Debezium supports a number of single message transformations (SMTs) that may be used to alter records either before they are transmitted to Apache Kafka (by applying them to the Debezium connectors) or when they are retrieved from Kafka by a sink connector.
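
One frequently used transformation is ExtractNewRecordState, which flattens the change event envelope so that downstream consumers see only the new row state. The configuration below is a sketch with an assumed alias ("unwrap"), and the extract_new_record_state function is a hypothetical Python imitation of the effect. It simply returns None for deletes, whereas the real SMT has configurable delete handling.

```python
# Hypothetical connector properties enabling the ExtractNewRecordState SMT.
unwrap_config = {
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
}


def extract_new_record_state(event: dict):
    """Return only the row state after the change (None for deletes)."""
    return event.get("after")


envelope = {
    "op": "u",
    "before": {"id": 1, "name": "Ann"},
    "after": {"id": 1, "name": "Anne"},
    "source": {"table": "customers"},
}
print(extract_new_record_state(envelope))  # {'id': 1, 'name': 'Anne'}
```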

3. Apache Kafka support

  • Kafka Connect is a framework that allows data to be streamed reliably and scalably between Apache Kafka and other systems.
  • The Kafka Connect service exposes a RESTful API for deploying and managing connectors.
  • This service can be clustered, and the connectors are automatically distributed across the cluster, so that a connector keeps running even if a worker fails.
  • Debezium is built on top of Apache Kafka and provides Kafka Connect connectors that monitor the database management systems.
  • It records data changes in Kafka logs, from which applications can consume them.
  • Because the Kafka logs retain the history of data changes even while an application is not running, the application can process all events correctly and completely once it is running again.
  • All Debezium connectors are Kafka Connect source connectors, which means they can be deployed and managed using the Kafka Connect service.
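
Deploying a Debezium connector through the Kafka Connect REST API might look like the sketch below. The URL, host names, credentials, and connector name are assumptions for a local setup, and the property names follow the Debezium 2.x MySQL connector (older versions use different names for some of them). The code only builds the HTTP request; sending it requires a running Kafka Connect worker.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Kafka Connect address

# Hypothetical MySQL connector definition; all values are placeholders.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "184054",
        "topic.prefix": "dbserver1",
        "database.include.list": "inventory",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.inventory",
    },
}


def build_request(definition: dict) -> urllib.request.Request:
    """Build the POST request that registers a connector with Kafka Connect."""
    return urllib.request.Request(
        CONNECT_URL,
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(connector)
print(request.get_method(), request.full_url)

# To submit it against a running Kafka Connect worker:
# urllib.request.urlopen(request)
```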

4. Monitors ‘n’ number of databases

  • The number of connectors that can be deployed to a single Kafka Connect service cluster depends on the volume and rate of events.
  • Debezium can work with multiple Kafka Connect service clusters and, if necessary, multiple Kafka clusters.
  • When many applications share a single database, it becomes difficult for one application to be aware of the changes made by the other applications.
  • A ‘message bus’ is one way to handle such situations. With Debezium, any application can easily monitor the databases and respond to the changes.
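
The message-bus idea can be sketched as a small dispatcher in which each application registers interest in a table. The sketch assumes topic names of the form prefix.database.table, which is the default naming of the MySQL connector; the prefix, the table, and all helper names (parse_topic, on_change, dispatch) are hypothetical.

```python
def parse_topic(topic: str) -> tuple:
    """Split '<prefix>.<database>.<table>' into its three parts."""
    prefix, database, table = topic.split(".", 2)
    return prefix, database, table


handlers = {}  # (database, table) -> callback


def on_change(database: str, table: str):
    """Decorator that registers a callback for changes to one table."""
    def register(func):
        handlers[(database, table)] = func
        return func
    return register


@on_change("inventory", "orders")
def handle_order(event: dict) -> str:
    return f"order {event['after']['id']} changed"


def dispatch(topic: str, event: dict):
    """Route a change event to the handler registered for its table, if any."""
    _, database, table = parse_topic(topic)
    handler = handlers.get((database, table))
    return handler(event) if handler else None


result = dispatch("dbserver1.inventory.orders", {"op": "c", "after": {"id": 10001}})
print(result)  # order 10001 changed
```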

5. Integrated Platform

  • Debezium supports a range of data sources, including MySQL, PostgreSQL, MongoDB, Oracle, and SQL Server databases.
  • It provides users with a standard technology and platform for data integration, eliminating the need to write custom code to link each data source.
  • Data is frequently kept in several locations, especially when it is used for multiple purposes and takes on somewhat different formats. Keeping numerous systems in sync can be difficult, but simple ETL-type solutions can be built easily with Debezium and simple event processing logic.
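
The "simple event processing logic" can be as small as the sketch below, which keeps a second store in sync by applying change events in order. The apply_change function, the dictionary that stands in for the target system, and the sample events are hypothetical; the op codes and the before/after fields follow Debezium's event envelope.

```python
def apply_change(target: dict, event: dict, key: str = "id") -> None:
    """Apply one change event to a target store keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):    # create, snapshot read, update: upsert new state
        row = event["after"]
        target[row[key]] = row
    elif op == "d":              # delete: remove the row identified by old state
        target.pop(event["before"][key], None)


replica = {}  # stands in for a cache, search index, or warehouse table
stream = [
    {"op": "c", "before": None, "after": {"id": 1, "qty": 2}},
    {"op": "u", "before": {"id": 1, "qty": 2}, "after": {"id": 1, "qty": 5}},
    {"op": "c", "before": None, "after": {"id": 2, "qty": 1}},
    {"op": "d", "before": {"id": 2, "qty": 1}, "after": None},
]
for event in stream:
    apply_change(replica, event)

print(replica)  # {1: {'id': 1, 'qty': 5}}
```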

Use Cases of Debezium

The following are some of the companies and organizations that use Debezium.

Debezium at Myntra

  • Myntra realized that a Data Ingestion Platform was required to act as a central gateway for all data ingestion into the Data Platform. With the help of Debezium, Myntra could capture change data from MySQL sources.
  • A Debezium connector is used for each database. It supports stream ingestion and does not rely on a last-modified timestamp. In addition, the data ingestion strategy was changed from an ETL to an ELT model.

Debezium at Delhivery 

  • Delhivery is one of the leading logistics and e-commerce supply chain services companies in India. At Delhivery, all the transactional data is primarily maintained in a document database such as MongoDB and in PostgreSQL for various services.
  • However, in order to surface insights, there was a requirement for efficient and real-time analysis of transactional data across all services. To solve this problem, Delhivery uses Debezium to perform Change Data Capture on transactional data and make it available in Kafka.

Debezium at Reddit

  • Reddit is a popular social news aggregation and web content rating platform. The two major issues Reddit was dealing with were snapshotting its data and maintaining a fragile infrastructure.
  • To address the snapshotting problem, Reddit uses Debezium’s streaming change data capture (CDC), which builds on its existing Kafka infrastructure through Kafka Connect. Debezium also listens for schema changes and writes them to a Kafka topic.
  • The fragile infrastructure problem is also addressed, since Reddit can now manage small, lightweight Debezium pods that read directly from the primary Postgres instance rather than large EC2 instances.

Debezium at Bolt

  • Bolt is a mobility company operating in Europe, Africa, Western Asia, and Latin America that provides vehicle rental, car-sharing, and food delivery services.
  • In recent years, the amount of data written to MySQL at Bolt has grown significantly, resulting in manual database sharding, which is an expensive, time-consuming, and error-prone task.
  • To tackle this problem, Bolt utilizes the Debezium MySQL Connector to capture data change events and send them to Kafka once they have been committed to the database. This makes it simple and reliable to communicate changes amongst back-end microservices. 

Conclusion

  • In this blog, we learned that Debezium is designed to tolerate faults and failures, and that a distributed system is the only way to achieve this efficiently.
  • It distributes the monitoring processes, or connectors, across different machines so that they can be resumed if something goes wrong.
  • To reduce the chance of data loss, the events are captured and replicated across multiple machines.

Freelance Technical Content Writer, Hevo Data

Shravani is a data science enthusiast who loves to delve deeper into complex topics on data science and solve the problems related to data integration and analysis through comprehensive content for data practitioners and businesses.
