Debezium is an open-source, distributed platform that converts real-time changes in existing databases into event streams so that applications can consume and respond to them immediately. Debezium provides connectors for databases such as PostgreSQL, SQL Server, MySQL, Oracle, and MongoDB to stream such changes.
When a Debezium connector is running, it emits a large volume of messages that describe data change events.
However, not every consumer needs every message in its original form, so organizations use Single Message Transformations (SMTs) to filter, route, and reshape these messages.
An SMT operates on every single message before it is written to a Kafka topic. Debezium lets users modify these messages through its SMTs before they are sent to the Kafka topic.
Prerequisites
- Basic understanding of Kafka
What is Debezium?
Debezium is an open-source, distributed platform for streaming real-time changes in databases as events. Its connectors track changes as they happen and write them to Kafka topics as events.
Once you start a Debezium connector, it emits messages that describe each change event.
Debezium organizes these messages into Kafka topics, each of which has a unique name within the Kafka cluster. A Kafka cluster consists of one or more servers called Kafka brokers.
Debezium Single Message Transformation (SMT) allows for targeted modification of event records before they are sent to Kafka. Users configure connectors to apply transformations selectively based on predefined predicates, which specify conditions for processing specific subsets of messages.
When setting up Debezium SMT, predicates play a crucial role in determining when transformations are applied. These predicates can be reused across configurations and include an option to negate conditions, effectively inverting them.
By default, Debezium SMT applies transformations to every record emitted by the connector. However, users can configure it to selectively modify subsets of data rather than all records.
Starting from Apache Kafka 2.6, it’s possible to apply transformations selectively based on conditions defined in predicates. When a Debezium connector emits a data change event, Kafka Connect evaluates the message against these predicates. If the condition evaluates to true, the specified transformation is applied before the message is sent to the Kafka topic; otherwise, the message is sent unchanged.
This approach ensures that transformations are applied precisely where needed, enhancing flexibility and efficiency in managing data flows within Kafka.
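As a rough sketch of how a predicate is wired up, the following Kafka Connect configuration applies a transformation only to records whose topic name matches a pattern. The alias names (unwrap, isCustomerTopic) and the topic pattern are hypothetical and chosen only for illustration. TopicNameMatches is one of the predicate classes that ships with Kafka Connect 2.6 and later.

```properties
# Declare one transformation and one predicate by alias (aliases are illustrative).
transforms=unwrap
predicates=isCustomerTopic

# The transformation to apply conditionally.
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
# Apply the transformation only when the named predicate evaluates to true.
transforms.unwrap.predicate=isCustomerTopic
# Set negate=true to invert the condition (apply only when the predicate is false).
transforms.unwrap.negate=false

# Built-in Kafka Connect predicate that matches on the topic name of the record.
predicates.isCustomerTopic.type=org.apache.kafka.connect.transforms.predicates.TopicNameMatches
# Hypothetical topic pattern; adjust it to your own topic naming.
predicates.isCustomerTopic.pattern=myserver.mydb.customers.*
```

Because the predicate is declared separately from the transformation, the same predicate alias can be referenced by more than one transformation in the same connector configuration.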
What types of SMT does Debezium provide?
Transform | Description |
Topic Routing | Re-routes records to different topics based on a regular expression applied to the original topic name. |
Content-Based Routing | Reroute selected events to other topics, based on the event content. |
Message Filtering | Applies a filter to the change events emitted by the connectors, based on their content. This lets you propagate only those records that are relevant to you. |
New Record State Extraction | Extracts the flat structure of field names and values from Debezium change events, facilitating sink connectors which cannot process Debezium’s complex event structure. |
Outbox Event Router | Provides a way to safely and reliably exchange data between multiple (micro) services. |
MongoDB New Document State Extraction | The MongoDB-specific counterpart to the New Record State Extraction SMT. |
1. Topic routing SMT
Every Kafka record that contains a data change event has a default destination topic. However, users can re-route a record to a different topic before it reaches the Kafka Connect converter. Debezium provides the topic routing SMT for this purpose, and you configure it in the Kafka Connect configuration for the Debezium connector.
Use case
By default, a topic receives records for one physical table. When you want a particular topic to receive records for more than one physical table, you must configure the Debezium connector to re-route the records to that topic.
Logical tables
A logical table is a common use case for routing records for multiple physical tables to one topic. A logical table consists of multiple sharded tables that have the same schema. These tables are physically distinct, but together they form one logical table. Users can re-route change event records for tables in any of the shards to the same topic.
Partitioned PostgreSQL tables
When the Debezium PostgreSQL connector captures changes in a partitioned table, the default behavior is that change event records are routed to a different topic for each partition. To emit records from all partitions to one topic, configure the topic routing SMT.
To route change event records for multiple physical tables to the same topic, configure the topic routing transformation in the Kafka Connect configuration for the Debezium connector.
Configuring the topic routing SMT requires regular expressions that determine the following.
- The tables for which to route records.
- The destination topic name.
The configuration is specified in a .properties file and uses the following options.
topic.regex: It specifies a regular expression that the transformation applies to each change event record to determine whether it should be routed to a particular topic.
For example, (.*)customers_shard(.*) matches records for changes to tables whose names include the customers_shard string. It would re-route records for tables such as myserver.mydb.customers_shard1, myserver.mydb.customers_shard2, and myserver.mydb.customers_shard3.
topic.replacement: It specifies a regular expression that represents the destination topic name. For example, with the replacement $1customers_all_shards, records for the three sharded tables listed above would be routed to the myserver.mydb.customers_all_shards topic.
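Putting the two options together, a minimal sketch of the topic routing configuration might look like the following. The alias Reroute is illustrative, and the shard topic names are assumed to follow the myserver.mydb.customers_shardN pattern. ByLogicalTableRouter is the Debezium class that implements topic routing.

```properties
# Alias for the transformation (the name Reroute is illustrative).
transforms=Reroute
transforms.Reroute.type=io.debezium.transforms.ByLogicalTableRouter
# Match any topic whose name contains customers_shard.
transforms.Reroute.topic.regex=(.*)customers_shard(.*)
# $1 is the first capture group, for example "myserver.mydb." for a shard topic.
transforms.Reroute.topic.replacement=$1customers_all_shards
```

With this sketch, a record destined for myserver.mydb.customers_shard2 would be written to myserver.mydb.customers_all_shards, while records for topics that do not match the expression keep their default destination.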
2. Content-based routing SMT
By default, Debezium streams all the change events that it reads from a table to a single topic, but there might be situations in which you want to re-route selected events to other topics based on the event content.
The process of routing messages based on their content is described in the content-based routing messaging pattern. To apply this pattern in Debezium, you use the content-based routing SMT to write an expression that is evaluated for each change event.
To use the content-based routing SMT, follow the steps below.
- Open the Red Hat Integration site and download the Debezium scripting SMT archive, for example debezium-scripting-1.5.4.Final.tar.gz.
- Extract the contents of the archive into the Debezium plug-in directories of your Kafka Connect environment.
- Restart the Kafka Connect process so that it picks up the new JAR files.
The Groovy language needs the following libraries on the classpath in the environment.
- groovy
- groovy-json (optional)
- groovy-jsr223
Similarly, the JavaScript language needs the following libraries on the classpath.
- graalvm.js
- graalvm.js.scriptengine
To enable the content-based routing SMT on a Debezium connector, you must configure the ContentBasedRouter SMT in the Kafka Connect configuration for the connector.
For example, you can configure the SMT with an expression that re-routes all update (u) records to an updates topic.
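A minimal sketch of such a configuration, assuming the Groovy script engine is on the classpath, could look like this. The alias route and the topic name updates are illustrative.

```properties
# Alias for the transformation (the name route is illustrative).
transforms=route
transforms.route.type=io.debezium.transforms.ContentBasedRouter
# Expression language, written as jsr223.<engine name>.
transforms.route.language=jsr223.groovy
# Return the target topic name for update events; null keeps the default topic.
transforms.route.topic.expression=value.op == 'u' ? 'updates' : null
```

Returning null for every other operation type means that inserts and deletes are still written to the default topic of the connector.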
3. Message Filtering SMT
By default, Debezium delivers every data change event that it receives to the Kafka broker. But in some cases, users might be interested in only a subset of the events.
To enable users to process only the relevant records, Debezium provides the message filtering SMT.
As the message filtering SMT processes the event stream, it evaluates each event against the filter condition. Only events that meet the filter condition are passed to the broker.
To use the message filtering SMT, follow the steps below.
- Open the Red Hat Integration site and download the Debezium scripting SMT archive, for example debezium-scripting-1.5.4.Final.tar.gz.
- Extract the contents of the archive into the Debezium plug-in directories of your Kafka Connect environment.
- Obtain a JSR 223 script engine implementation and add its contents to the Debezium plug-in directories of your Kafka Connect environment.
- Restart Kafka Connect so that it picks up the new JAR files.
The Groovy language needs the following libraries on the classpath in the environment.
- groovy
- groovy-json (optional)
- groovy-jsr223
The JavaScript language needs the following libraries on the classpath in the environment.
- graalvm.js
- graalvm.js.scriptengine
To configure a Debezium connector to filter change event records, you must configure the filter SMT in the Kafka Connect configuration for the connector.
The configuration names an expression language and a filter condition.
For example, with the Groovy expression language, the condition value.op == 'u' && value.before.id == 2 removes all messages except those that represent update (u) records with id values equal to 2.
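As a sketch, the filter described here could be configured as follows, assuming the Groovy script engine is available. The alias filter is illustrative.

```properties
# Alias for the transformation (the name filter is illustrative).
transforms=filter
transforms.filter.type=io.debezium.transforms.Filter
# Expression language, written as jsr223.<engine name>.
transforms.filter.language=jsr223.groovy
# Keep only update events whose before image has id 2; all other events are dropped.
transforms.filter.condition=value.op == 'u' && value.before.id == 2
```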
4. New record state extraction SMT
A Debezium data change event has a complex structure that provides a wealth of information. The Kafka records that convey Debezium change events contain all of this information.
However, some consumers, such as sink connectors, might expect Kafka records that provide a flat structure of field names and values. To provide this kind of record, you can configure the event flattening SMT.
Change event structure
Debezium generates data change events that have a complex structure consisting of the following parts.
- The operation that made the change.
- Source information such as the name of the database and table where the change was made.
- Timestamp for when the change was made.
- Row data before and after the change.
To configure the event flattening SMT on a Kafka Connect source or sink connector, add the SMT configuration details to the configuration of the connector.
In a .properties file, you declare an alias for the SMT in transforms= and set the type of that alias to io.debezium.transforms.ExtractNewRecordState.
You can set transforms= to multiple, comma-separated SMT aliases in the order in which you want Kafka Connect to apply them.
The event flattening SMT supports several options, including the following.
drop.tombstones=false: It keeps tombstone records for DELETE operations in the event stream.
delete.handling.mode=rewrite: For DELETE operations, it edits the Kafka record by flattening the value field that was in the change event. The value field then directly contains the key-value pairs that were in the before field.
The SMT also adds a __deleted field and sets it to true.
add.fields=table,lsn: It adds change event metadata for the table and lsn fields to the simplified Kafka record.
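Putting these options together, a minimal sketch of an event flattening configuration might look like the following. The alias unwrap is illustrative. The option names are those of the Debezium 1.5-era SMT that this article refers to; later Debezium releases may rename or deprecate some of them, so check the documentation for your version.

```properties
# Alias for the transformation (the name unwrap is illustrative).
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
# Keep tombstone records for DELETE operations in the event stream.
transforms.unwrap.drop.tombstones=false
# Emit a flattened record for DELETE operations and mark it with __deleted.
transforms.unwrap.delete.handling.mode=rewrite
# Copy the table and lsn metadata into the flattened record.
transforms.unwrap.add.fields=table,lsn
```

Note that lsn is metadata specific to certain connectors such as PostgreSQL, so the fields you list in add.fields should match what your connector actually emits.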
5. Outbox event router
The outbox pattern is a way to safely and reliably exchange data between multiple microservices. An implementation of the pattern avoids inconsistencies between a service’s internal state and the state in the events that other services consume.
To implement the outbox pattern with Debezium, you need to configure the Debezium connector to:
- Capture changes in the outbox table
- Apply the outbox event router SMT
A Debezium connector that is configured to apply the outbox event router SMT transforms the raw Debezium change event captured from the outbox table into a simpler outbox message.
The default outbox event routing configuration assumes a particular outbox table structure and routes events based on aggregates.
Users can customize the behavior of the SMT by modifying the values of the default configuration options.
Outbox table structure
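To apply the default outbox event router SMT configuration to the Debezium connector, the outbox table should have the following columns.
- id: The unique ID of the event.
- aggregatetype: The type of aggregate that the event relates to, which is used to route the event to a topic.
- aggregateid: The ID of the aggregate, which becomes the key of the emitted Kafka message.
- type: A user-defined value that helps categorize events.
- payload: The content of the event.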
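As a starting point for customization, the following sketch applies the outbox event router and restates a few of its default settings explicitly. The alias outbox is illustrative, and the option names and defaults shown are assumed from the Debezium 1.5-era documentation, so verify them against the version you run.

```properties
# Alias for the transformation (the name outbox is illustrative).
transforms=outbox
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
# Column that holds the unique event ID.
transforms.outbox.table.field.event.id=id
# Column whose value becomes the key of the emitted Kafka message.
transforms.outbox.table.field.event.key=aggregateid
# Column that holds the event payload.
transforms.outbox.table.field.event.payload=payload
# Column whose value decides the destination topic.
transforms.outbox.route.by.field=aggregatetype
# Topic name pattern; routedByValue is replaced by the aggregatetype value.
transforms.outbox.route.topic.replacement=outbox.event.${routedByValue}
```

If your outbox table uses different column names, changing the values on the right-hand side is usually all that is needed.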
6. MongoDB New Document State Extraction
The Debezium MongoDB connector emits data change messages that have a complex structure.
Each message consists of:
- The operation and metadata.
- For an insert, the whole document after the insert has been executed.
- For an update, a patch element that describes the altered fields.
The after and patch elements are strings that contain JSON representations of the inserted or altered data.
The general structure of a message for an insert event contains the operation type, the after element with its JSON string, and source metadata.
Although this structure is suitable for representing changes to schemaless MongoDB collections, existing sink connectors do not understand it.
For this reason, Debezium provides an SMT that converts the after/patch information from MongoDB CDC events into a structure that is suitable for sink connectors.
The SMT parses the JSON strings and reconstructs correctly typed Kafka Connect records that sink connectors can consume.
When it is serialized as JSON, the emitted record contains only the fields of the document and their values.
You apply this SMT on a sink connector.
Configuration
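You configure the SMT in the .properties file of the sink connector by declaring an alias in transforms= and setting the type of that alias to io.debezium.connector.mongodb.transforms.ExtractNewDocumentState.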
Array encoding
The SMT converts MongoDB arrays into arrays as defined by the Apache Kafka Connect schema. Such arrays must contain elements of the same type, but MongoDB allows elements of different types to be stored in the same array.
To overcome this mismatch, the array can be encoded in two different ways by using the array.encoding configuration option.
- array: It encodes the array as the array datatype.
- document: It converts the array into a struct of structs. The main struct contains fields named _0, _1, _2, and so on, where each name represents the index of the element in the array.
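Consider a source MongoDB document that contains an array of heterogeneous elements, for example an array field whose two elements are sub-documents with different fields or field types.
With the document encoding, that array is emitted as a struct in which the _0 field holds the first element and the _1 field holds the second element.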
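A minimal sketch of a sink connector configuration that applies the MongoDB SMT and selects the document encoding might look like this. The alias unwrap is illustrative, and the class name is the one used in the Debezium 1.5-era documentation.

```properties
# Alias for the transformation (the name unwrap is illustrative).
transforms=unwrap
transforms.unwrap.type=io.debezium.connector.mongodb.transforms.ExtractNewDocumentState
# "array" is the default; "document" tolerates arrays with mixed element types.
transforms.unwrap.array.encoding=document
```

Choosing the array encoding keeps the output closer to the source document, but it leaves it to you to ensure that every element in an array has the same type.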
Conclusion
In this tutorial, you have learned about Single Message Transformations in Debezium, including their types, examples, and configuration properties. Some SMTs are available by default in the Debezium container image, while others, such as message filtering and content-based routing, need to be downloaded.
Many companies rely on Debezium because it provides many benefits, but transferring data from it into a data warehouse can be a laborious task. An automated data pipeline helps solve this issue, and this is where Hevo comes into the picture.
Hevo Data is a no-code data pipeline with 150+ pre-built integrations that you can choose from.
Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also check out our pricing to choose the best plan for your organization.
Frequently Asked Questions
1. What is SMT in Debezium?
SMT (Single Message Transform) in Debezium is a mechanism that allows you to modify Kafka records as they pass through the Kafka Connect pipeline.
Example: SMT can be used to add a field to all records or drop specific fields not needed in downstream systems.
2. What is Debezium connector used for?
The Debezium connector is used for change data capture (CDC). It monitors databases (e.g., MySQL, PostgreSQL, MongoDB) and streams changes (inserts, updates, deletes) to Kafka topics in real-time, allowing downstream consumers to process and react to database changes efficiently.
3. What happens if the Debezium connector fails?
If a Debezium connector fails, it stops capturing changes and sending them to Kafka. The failure could be due to various reasons such as network issues, database connectivity problems, or misconfigurations.
Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.