Debezium is an open-source, distributed system that converts real-time changes in existing databases into event streams that various applications can consume and respond to immediately. Debezium provides connectors for databases such as PostgreSQL, MySQL, SQL Server, Oracle, and MongoDB to stream these changes.
When a Debezium connector is running, it emits a lot of information about change data events through messages.
However, since raw change messages are rarely in the exact shape downstream consumers need, organizations use Single Message Transformations (SMTs) to modify, filter, and route those messages.
Debezium SMTs are transformations that operate on every single message passed to the Kafka topic. Debezium lets users change or modify these messages through an SMT before sending them to the Kafka topic.
Prerequisites
- Basic understanding of Kafka
What is Debezium?
Debezium is an open-source, distributed platform used for event streaming of real-time changes in databases. It uses connectors that keep track of real-time changes and write them to Kafka topics as events.
Once you start a Debezium connector, it generates change event messages that carry helpful information about each event.
Debezium organizes these messages into Kafka topics, each identified by a unique name within the Kafka cluster. A Kafka cluster consists of one or more servers called Kafka brokers.
Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 150+ data sources (including 50+ free data sources) and replicates data in 3 steps: Select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.
GET STARTED WITH HEVO FOR FREE
Its completely automated pipeline delivers data in real time, without any loss, from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional customer support through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
SIGN UP HERE FOR A 14-DAY FREE TRIAL
Debezium provides many Single Message Transformations (SMTs) for modifying events. You configure the connector to apply a transformation that modifies each record before it is sent to Kafka.
You can also apply a Debezium SMT to a sink connector to change records as they are read from a topic.
When users configure a single message transformation for a connector, they can define a predicate for the transformation. The predicate applies the transformation conditionally, to a specific subset of the messages that the connector processes.
Once you define a predicate, you can reuse it for multiple transformations. A predicate also includes a negate option, used to invert the condition written in the predicate statement.
Debezium SMTs are used to modify event records before Kafka Connect saves them to the Kafka topic. When you configure an SMT for a Debezium connector, Kafka Connect applies that transformation to every record the connector emits by default.
But if you want to apply a transformation selectively, you can modify just a subset of the emitted records instead of all of them.
With Apache Kafka 2.6 or higher, you can transform only the event messages from a particular table. You use a predicate statement to tell Kafka Connect which records to transform and the condition that the messages must meet.
When the Debezium connector emits a data change message, Kafka Connect checks the message against the predicate condition.
If the condition evaluates to true, the transformation is applied and the modified message is sent to the Kafka topic; otherwise, the message is sent without modification.
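For example, here is a minimal .properties sketch that applies a filter transformation only to messages on topics matching a pattern. The alias names and the topic pattern are illustrative; the TopicNameMatches predicate class ships with Kafka Connect 2.6 and later:
transforms=filter
transforms.filter.type=io.debezium.transforms.Filter
transforms.filter.predicate=onlyOutboxTopics
predicates=onlyOutboxTopics
predicates.onlyOutboxTopics.type=org.apache.kafka.connect.transforms.predicates.TopicNameMatches
predicates.onlyOutboxTopics.pattern=outbox.(.*)
Adding transforms.filter.negate=true would instead apply the transformation to every message whose topic does not match the pattern.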
Debezium provides the following types of Debezium SMT:
1. Topic Routing SMT
Every data change event has a default destination Kafka topic. However, you can re-route records to a different topic before they reach Kafka Connect. Debezium provides this topic routing SMT, which you configure through the Kafka Connect configuration.
Use case
By default, a topic receives records from a single physical table. When you want a particular topic to receive records from more than one physical table, you configure the Debezium connector to re-route those records to that topic.
Logical tables
Routing records from multiple physical tables to one topic is a common use case involving logical tables. A logical table consists of multiple sharded tables that share the same schema. The tables are physically distinct, but together they form a single logical table. You can re-route change data event records for the tables in any of the shards to the same topic.
Partitioned PostgreSQL tables
When the Debezium PostgreSQL connector captures a data change in a partitioned table, the default behavior routes the change data records to a different topic for each partition.
To route change event records from multiple physical tables, or partitions, to the same topic instead, you configure the topic routing transformation in the Kafka Connect configuration for the Debezium connector.
The topic routing SMT configuration consists of regular expressions that determine the following:
- The tables whose records should be routed.
- The destination topic name.
In a .properties file, the configuration looks like the example below.
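A minimal sketch that matches the shard naming used in this example (the transformation alias Reroute is illustrative):
transforms=Reroute
transforms.Reroute.type=io.debezium.transforms.ByLogicalTableRouter
transforms.Reroute.topic.regex=(.*)customers_shard(.*)
transforms.Reroute.topic.replacement=$1customers_all_shards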
topic.regex: It specifies a regular expression that the transformation applies to each change data event to determine whether the event should be routed to a particular topic.
In the example above, (.*)customers_shard(.*) matches records for changes to any table whose name includes the string customers_shard. It would re-route records for sharded tables with names such as myserver.mydb.customers_shard1, myserver.mydb.customers_shard2, and myserver.mydb.customers_shard3.
topic.replacement: It specifies a regular expression that resolves to the name of the destination topic. In this example, records for the three sharded tables listed above would be routed to the myserver.mydb.customers_all_shards topic.
2. Content-Based Routing SMT
By default, Debezium streams all of the change events that it reads to a single topic, but there might be situations where you want to re-route selected events to other topics based on the event content.
Routing messages based on their content is described by the content-based routing messaging pattern. To apply content-based routing in Debezium, you write an expression that is evaluated against each change event.
To use the content-based Debezium SMT, follow the steps below.
- Open the Red Hat Integration site and download the Debezium scripting SMT archive:
debezium-scripting-1.5.4.Final.tar.gz
- Extract the archive's contents into the Debezium plugin directories of your Kafka Connect environment.
- Restart Kafka Connect so that it picks up the new JAR files.
The Groovy language needs the following libraries on the classpath in the environment.
groovy
groovy-json (optional)
groovy-jsr223
Similarly, the JavaScript language needs the following libraries on the classpath.
graalvm.js
graalvm.js.scriptengine
To enable the content-based Debezium SMT for a Debezium connector, you configure the ContentBasedRouter transformation in the Kafka Connect environment.
For example, to re-route all update records to an updates topic, you would use a configuration like the one below for your Debezium connector.
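A minimal sketch using the Groovy expression language (the alias route and the target topic name updates are illustrative):
transforms=route
transforms.route.type=io.debezium.transforms.ContentBasedRouter
transforms.route.language=jsr223.groovy
transforms.route.topic.expression=value.op == 'u' ? 'updates' : null
Records for which the expression returns null keep their default destination topic.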
3. Message Filtering SMT
Debezium delivers every data change event that it captures to the Kafka broker by default. In some cases, however, users might be interested in only a subset of those events.
Therefore, to let users process only the relevant records, Debezium provides the message filtering SMT.
As the message filtering SMT processes the event stream, it evaluates each event against a filter condition. Only events that satisfy the filter condition are passed to the broker.
To use the message filtering Debezium SMT, follow the steps below.
- Open the Red Hat Integration site and download the Debezium scripting SMT archive:
debezium-scripting-1.5.4.Final.tar.gz
- Extract the archive into the Debezium plugin directories of your Kafka Connect environment.
- Obtain a JSR 223 script engine implementation and add its contents to the Debezium plugin directories of your Kafka Connect environment.
- Restart Kafka Connect to enable the new JAR files.
- The Groovy language needs the following libraries on the classpath in the environment.
groovy
groovy-json (optional)
groovy-jsr223
- The JavaScript language needs the following libraries on the classpath in the environment.
graalvm.js
graalvm.js.scriptengine
To configure a Debezium connector to filter change event records, you configure the filter SMT in the Kafka Connect configuration for that connector.
Add the configuration below to your connector configuration.
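A minimal sketch using the Groovy expression language (the alias filter is illustrative):
transforms=filter
transforms.filter.type=io.debezium.transforms.Filter
transforms.filter.language=jsr223.groovy
transforms.filter.condition=value.op == 'u' && value.before.id == 2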
The example above uses the Groovy expression language. The expression value.op == 'u' && value.before.id == 2 removes all messages except those that represent update (u) records with an id value equal to 2.
4. Event Flattening SMT
A Debezium data change event has a complex structure: it carries a great deal of information about the change, stored in the Kafka record.
Some consumers, however, might expect Kafka records to provide a flat structure of field names and values. Therefore, you can configure the event flattening SMT so that Debezium provides such records.
Change event structure
Debezium produces data change events with a complex structure that contains the parts below.
- The operation that made the change.
- Source information such as the name of the database and table where the change was made.
- Timestamp for when the change was made.
- Row data before and after the change.
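A minimal sketch of such an event for an update operation (field names and values are illustrative):
{
  "op": "u",
  "source": { "db": "mydb", "table": "customers" },
  "ts_ms": 1558965515,
  "before": { "id": 2, "email": "old@example.com" },
  "after": { "id": 2, "email": "new@example.com" }
}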
To configure the event flattening SMT for a Kafka Connect source or sink connector, add it to the connector's configuration.
In a .properties file, you specify the configuration below.
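A minimal sketch (the alias unwrap is illustrative):
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState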
You can set transforms= to multiple comma-separated SMT aliases, listed in the order in which you want Kafka Connect to apply them.
The .properties example below sets several event flattening SMT options.
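A sketch that combines the options discussed next:
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
transforms.unwrap.delete.handling.mode=rewrite
transforms.unwrap.add.fields=table,lsn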
drop.tombstones=false: It keeps tombstone records for DELETE operations in the event stream.
delete.handling.mode=rewrite: For DELETE operations, the SMT rewrites the Kafka record by flattening the value field of the change event into key-value pairs.
The SMT also adds a __deleted field and sets it to true, as in the example below.
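A minimal sketch of the rewritten value for a deleted row (the column names pk and cola are illustrative):
{
  "pk": 2,
  "cola": null,
  "__deleted": "true"
}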
add.fields=table,lsn: It adds the change event metadata fields table and lsn to the simplified Kafka record.
5. Outbox Event Router SMT
The outbox pattern provides safe and reliable data exchange between multiple microservices. Implementing it avoids inconsistencies between a service's internal state and the state it exposes to event consumers.
To implement the outbox pattern with Debezium, you configure the Debezium connector to:
- Capture changes in the outbox table
- Apply the Debezium outbox event router SMT
A minimal configuration sketch is shown after this list.
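A minimal sketch (the alias outbox is illustrative):
transforms=outbox
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter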
Consider the below Debezium outbox message.
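A minimal sketch of a routed outbox message, assuming an order aggregate (all names and values are illustrative). The destination topic, outbox.event.order, is derived from the aggregatetype column; the message key is taken from the aggregateid column; and the message value is the contents of the payload column:
{"id": 1, "customerId": 42, "orderDate": "2022-01-01"}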
A Debezium connector configured to apply the outbox event router SMT generates the above message by transforming a raw Debezium message like the one below.
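A minimal sketch of the raw change event emitted for an INSERT into the outbox table (values are illustrative, and the source and ts_ms fields are omitted):
{
  "before": null,
  "after": {
    "id": "da8d6de6-3b77-43c3-b330-b3d2d0a40a22",
    "aggregatetype": "order",
    "aggregateid": "1",
    "type": "OrderCreated",
    "payload": "{\"id\": 1, \"customerId\": 42, \"orderDate\": \"2022-01-01\"}"
  },
  "op": "c"
}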
The example above is based on the default outbox event routing configuration, which assumes a particular outbox table structure and routes events based on aggregates.
You can customize the behavior of the SMT by modifying the values of the default configuration options.
Outbox table structure
To apply the default outbox event router SMT configuration to the Debezium connector, the outbox table should have the columns below.
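A minimal sketch of such a table on PostgreSQL, using the default column names expected by the SMT (the column types are illustrative):
CREATE TABLE outbox (
  id UUID PRIMARY KEY,
  aggregatetype VARCHAR(255) NOT NULL,
  aggregateid VARCHAR(255) NOT NULL,
  type VARCHAR(255) NOT NULL,
  payload JSONB
);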
6. MongoDB New Document State Extraction SMT
When the Debezium MongoDB connector is running, it generates change event data in a complex structure.
The message consists of:
- Operation and metadata.
- The whole document after an insert has been executed.
- A patch element that describes the fields altered by an update.
The after and patch elements are strings that contain JSON representations of the inserted or altered data.
The general message structure for an insert event looks as follows.
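A minimal sketch of an insert event (values are illustrative; note that after is a JSON string rather than a nested document):
{
  "op": "c",
  "after": "{\"_id\": 1, \"first_name\": \"Anne\"}",
  "ts_ms": 1558965515
}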
Because this structure represents changes to schema-less MongoDB collections, existing sink connectors cannot interpret it directly.
Therefore, Debezium provides an SMT that converts the after/patch information from MongoDB CDC events into a structure suitable for sink connectors.
The SMT parses the JSON strings and reconstructs correctly typed Kafka Connect records that sink connectors can consume.
If the emitted record structure is rendered as JSON, it looks like the output below.
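A minimal sketch of the flattened record for the insert event above:
{
  "_id": 1,
  "first_name": "Anne"
}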
You add this Debezium SMT to a sink connector.
Configuration
The configuration consists of the properties below.
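A minimal sketch (the alias unwrap is illustrative; the array.encoding option is described in the next section):
transforms=unwrap
transforms.unwrap.type=io.debezium.connector.mongodb.transforms.ExtractNewDocumentState
transforms.unwrap.array.encoding=document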
Array encoding
The SMT converts MongoDB arrays into arrays as defined by the Apache Kafka Connect schema. Such arrays must contain elements of the same type, whereas MongoDB can store elements of different types in the same array.
To overcome this mismatch, the array can be encoded in two different ways using the array.encoding configuration option.
- array: It encodes the array as the array datatype.
- document: It converts the array into a struct of structs. The main struct contains fields named _0, _1, _2, and so on, where each name represents the index of an element.
Consider the example below of a source MongoDB document containing an array with elements of heterogeneous types.
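A minimal sketch (values are illustrative):
{
  "_id": 1,
  "a1": [
    { "a": 1 },
    { "a": "c" }
  ]
}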
With array.encoding=document, the array is converted into the struct of structs shown below.
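A sketch of the encoded field:
"a1": {
  "_0": { "a": 1 },
  "_1": { "a": "c" }
}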
Conclusion
In this tutorial, you have learned about Single Message Transformations in Debezium, along with their types, examples, and configuration properties. Some Debezium SMTs are available by default in the Debezium container image, while others, such as message filtering and content-based routing, need to be downloaded separately.
Debezium is a trusted tool that a lot of companies use for its many benefits, but transferring data from it into a data warehouse is a hectic task. An automated data pipeline helps solve this issue, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline with 150+ pre-built integrations that you can choose from.
VISIT OUR WEBSITE TO EXPLORE HEVO
Hevo can help you integrate data from numerous sources and load it into a destination to analyze real-time data with a BI tool such as Tableau. It will make your life easier and make data migration hassle-free. It is user-friendly, reliable, and secure.
SIGN UP for a 14-day free trial and see the difference!
Share your experience of learning about Debezium SMT in the comments section below.