Debezium is an open-source, distributed system that can convert real-time changes of existing databases into event streams so that various applications can consume and respond immediately. Debezium uses connectors like PostgreSQL, SQL, MySQL, Oracle, MongoDB, and more for respective databases to stream such changes.

When the Debezium connector is connected, it releases a lot of information about change data events through messages.

However, since messages can create confusion among data change events, organizations use Single Message Transformation (SMT) to segregate messages.

Debezium SMT is an API tool used to operate on every single message that has to be passed to the Kafka topic. Debezium provides the privilege to users to change or modify these messages through the Debezium SMT before sending them to the Kafka topic.

Prerequisites

  • Basic understanding of Kafka 

What is Debezium?

debezium smt: debezium logo

Debezium is an open-source, distributed platform used for event streaming of real-time changes in databases. It uses connectors that keep track of real-time changes and writes them into the Kafka topic as events.

Once you start the Debezium connector, it generates helpful information about events called log messages.

Debezium uses Kafka topics to organize these messages, a unique name in the Kafka clusters. Kafka clusters consist of one or more servers called Kafka brokers.

What is Debezium SMT (Single Message Transformation)?

Debezium Single Message Transformation (SMT) allows for targeted modification of event records before they are sent to Kafka. Users configure connectors to apply transformations selectively based on predefined predicates, which specify conditions for processing specific subsets of messages.

When setting up Debezium SMT, predicates play a crucial role in determining when transformations are applied. These predicates can be reused across configurations and include an option to negate conditions, effectively inverting them.

By default, Debezium SMT applies transformations to every record emitted by the connector. However, users can configure it to selectively modify subsets of data rather than all records.

Starting from Apache Kafka 2.6, it’s possible to apply transformations selectively based on conditions defined in predicates. When a Debezium connector emits a data change event, Kafka Connect evaluates the message against these predicates. If the condition evaluates to true, the specified transformation is applied before the message is sent to the Kafka topic; otherwise, the message is sent unchanged.

This approach ensures that transformations are applied precisely where needed, enhancing flexibility and efficiency in managing data flows within Kafka.

Debezium provides the following types of Debezium SMT:

1. Topic routed SMT

Every data change event in Kafka consists of the default destination called Kafka topic. However, users can re-route the topic before the record reaches Kafka Connect. Debezium can provide such topic routing Debezium SMT through the Kafka Connect configuration. 

Use case

A topic receives records from one or more physical tables. When you want a particular topic to receive records from more than one physical table, you should configure the Debezium connector to reroute the records from that topic.

Logical tables

It is a common use case for routing records for multiple physical tables. Logical tables consist of multiple sharded tables having the same schema. These tables are physically distinct, but they form a logical table together. Users can re-route change data event records for tables in any shards to the same topic.

Partitioned PostgreSQL tables

When the Debezium PostgreSQL connector captures any data change in the partitioned table, the default behavior tells that the change data records are routed to a different topic for each partition.

For example, to configure the topic routing transformation in Kafka Connect configuration for the Debezium connector, you need to route change data events records for multiple physical tables to the same topic.

Configuration topic routing Debezium SMT consists of regular expressions that determine the following.

  • The tables to route records.
  • The destination topic name.

The configuration in the a.properties file consists of the following configuration.

debezium smt: a.properties configration
Debezium SMT : a.properties configration

Topic.regex: It specifies a regular expression that the transformation should apply to each data change event to determine if it should be routed to a particular topic.

From the example above, (.*)customers_shard(.*) matches records for changes to tables that include the name customers_shard string. It would reroute records for the following tables.

debezium smt: customers_shard code
Customers_shard code

Topic.replacement: It consists of a regular expression that specifies the destination topic. From the above example, records for the three sharded tables listed above would be routed to the

myserver.mydb.customers_all_shards topic

2. Content-based SMT

Debezium streams all the change events that it reads to a single topic by default, but there might be a situation where you want to re-route the selected events to other topics based on the event content.

Therefore, the process of routing messages based on their content is described in a content-based routing messaging pattern. To apply content-based routing in Debezium, you can use the expression that the change events evaluate.

To use Content-based Debezium SMT, you need to follow the below steps.

  1. Open the RedHat Integration site and download the Debezium scripting SMT. 
debezium-scripting-1.5.4.Final.tar.gz
  1. You have to extract the archive’s content into the Debezium plugin directories of your Kafka Connect environment.
  1. Restart the Kafka connection to connect the new JAR files.

The Groovy language needs the following libraries on the classpath in the environment.

groovy                    
groovy-json (optional)                        
groovy-jsr223    

Similarly, the Javascript language needs the following libraries on the classpath.

graalvm.js                        
graalvm.js.scriptengine    

To enable content-based Debezium SMT on the Debezium connector, you must configure the ContentBasedRouter in the Kafka Connect environment. 

For example, to re-route all updated records to an updated topic, you need the below configuration for your Debezium Connector.

Debezium connectoer configration
Debezium connectoer configration

3. Message Filtering SMT

Debezium delivers every data change that it receives to the Kafka broker by default. But in some cases, users might be interested in only a subset of the event.

Therefore, to enable users to only process relevant records, Debezium provides message filtering Debezium SMT.

As the Message Filtering Debezium SMT processes the event streams, it evaluates each event against the filter condition. Only those events that meet the filter conditions can be passed to the broker.

To use the Message Filtering Debezium SMT, follow the below steps.

  1. Open the RedHat Integration site and download the Debezium scripting SMT. It can be as follows.
debezium-scripting-1.5.4.Final.tar.gz
  1. You need to extract the archive into the Debezium plugin directories of your Kafka environment.

You have to obtain the SR-223 script engine implementation and add its content to the Debezium plugin directories of your Kafka environment.

  1. You have to restart Kafka Connect to enable the new JAR files.
  1. The Groovy language needs the following libraries on the classpath in the environment.
groovy 
groovy-json (optional)       
groovy-jsr223       
  1. The JavaScript language needs the following libraries on the classpath in the environment.   
graalvm.js
graalvm.js.scriptengine         

To configure the Debezium connector to the filter change event records, you must configure the filter Debezium SMT in the Kafka Connect configuration for the Debezium connector.

Add the below configuration to your connector configuration.

Connecter configration
Connecter Configration

The above example uses Groovy expression language. The regular expression value.op == ‘u’ && value.before.id == 2 removes all the messages except the one that represents update(u) records with id values that are equal to 2.  

4. New record state extraction SMT

A Debezium data change event consists of a complex structure of information, which are stored in Kafka records to convey Debezium about change data events.

Some Kafka records might expect Kafka records to provide a flat structure of field names and values. Therefore, Debezium can configure the event flattening Debezium SMT to provide such records.

Change event structure

Debezium consists of the data change events in a complex structure that consists of the below parts.

  • The operation that made the change.
  • Source information such as the name of the database and table where the change was made.
  • Timestamp for when the change was made.
  • Row data before and after the change.

To configure the event flattening Debezium SMT on the Kafka Connect source or sink connector, you can add the configuration to the connector’s configuration.

In the a.properties file, you should specify the below configuration.

Properties config file
A.properties config file

You can set the transforms= multiple commas separated SMT alias in the order you want.

The below .properties example sets several events flattening Debezium SMT options.

Flattening smt
Flattening smt

drop.tombstones=false: It keeps the tombstones records for DELETE operation in the event stream.

delete.handling.mode=rewrite: For the DELETE operation, you can edit the Kafka record by flattening the field in the change event. The value field consists of the key-value pairs.

The Debezium SMT adds __deleted and sets it to true, for example, below.

Deleted = true
Deleted = true

add.fields=table,lsn:  It adds the change event metadata for the table, and lsn is to simplify the Kafka record.

5. Outbox event router

The outbox pattern is safe and provides reliable data exchange between the multiple microservices. Its implementation avoids inconsistencies between the service’s internal state and state in events.

To configure the outbox pattern to Debezium, you need to configure the Debezium connector to:

  • Capture changes in the outbox table
  • Apply the Debezium outbox event router Debezium SMT

Consider the below Debezium outbox message.

Outbox message
Outbox Message

A Debezium connector configured to apply the outbox event router Debezium SMT generates the above message by transforming a Debezium raw message as below.

debezium smt: raw message
Raw Message

The above example is based on the default outbox event routing configuration that assumes an outbox table structure and event routing based on aggregates.

Users can add customization to the behavior of SMY by modifying the values of the default configuration options.

Outbox table structure

The outbox table should have the below columns to apply the default outbox event router Debezium SMT configuration to the Debezium connector.

Outbox table structure
Outbox table structure

6. MongoDB New Document State Extraction

When the Debezium MongoDB connector is connected, it generates data in the complex structure. 

The message consists of:

  • Operation and metadata.
  • The whole data after the insert has been executed.
  • The patch element describes the altered field.

The after and patch elements are strings that contain JSON representations of the inserted or altered data.

The General message structure for the insert event looks as follows.

debezium smt: insert event
Insert event

When the above structure represents changes to the MongoDB schema-less collections, the existing sink connectors do not understand it.

As a result, Debezium provides an SMT that converts the after/patch information from MongoDB CDC events into a suitable structure for the sink connectors.

Therefore, the Debezium SMT parses the JSON strings and reconstructs correctly typed Kafka Connect records from which the connectors can consume.

If the emitted record structure uses JSON, it looks like the below output.

Emitted record structure
Emitted record structure

You have to add Debezium SMT on a sink connector.

Configuration

The configuration consists of the below properties.

debezium smt: configration
Configration

Array encoding

SMT converts MongoDB arrays into an array defined by Apache Connect schema. Such arrays must contain elements of the same type, but MongoDB stores elements of different kinds into the same array.

Thus, to overcome this issue, the array can be encoded in two different ways using array.encoding configuration.

  • Value array: It encodes the array as the array datatype.
  • Value document: It converts the array into a struct of structs. The main struct contains field names _0,_1,_2, etc. The name represents the index of the element.

Consider an example of a source MongoDB document with an array of a heterogeneous type, as shown below.

Array encoding hetrogeneous
Array encoding hetrogeneous

It can be encoded as shown in the below image.

Array encoding
Array encoding

Conclusion

In this tutorial, you have learned about Single Message Transformation in Debezium with its types, examples, and configuration properties. Some Debezium SMT is available by default in the Debezium container image, while some Debezium SMT like Message Filtering and Content-based routing need to be downloaded.

Debezium is a trusted source that a lot of companies use as it provides many benefits but transferring data from it into a data warehouse is a hectic task. The Automated data pipeline helps in solving this issue and this is where Hevo comes into the picture.

Hevo Data is a No-code Data Pipeline and has awesome 150+ pre-built Integrations that you can choose from.

Hevo can help you Integrate your data from numerous sources and load them into a destination to Analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

Want to take Hevo for a spin? Sign Up or a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also checkout our unbeatable pricing to choose the best plan for your organization.

Share your experience of learning about Debezium SMT in the comments section below.

Manjiri Gaikwad
Technical Content Writer, Hevo Data

Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.

No-code Data Pipeline For Your Data Warehouse