Debezium SMT (Single Message Transformations): Syntax and Usage Simplified

API, Data Streaming, Debezium, Kafka, Single Message Transformation, SMT • February 23rd, 2022

Debezium is an open-source, distributed system that converts real-time changes in existing databases into event streams so that applications can consume them and respond immediately. Debezium provides connectors for databases such as PostgreSQL, SQL Server, MySQL, Oracle, and MongoDB to stream such changes. Once a Debezium connector is running, it emits a detailed message for every data change event.

However, these messages do not always have the shape or destination that downstream consumers need, so organizations use Single Message Transformations (SMTs) to modify, route, or filter them. An SMT is a Kafka Connect API component that operates on each message individually before it is passed to the Kafka topic. Debezium lets users modify these messages through its SMTs before they are sent to the Kafka topic.

Prerequisites

Basic understanding of Kafka 

What is Debezium?

Debezium is an open-source, distributed platform for streaming real-time database changes as events. It uses connectors that keep track of real-time changes and write them to Kafka topics as events. Once you start a Debezium connector, it emits a message with detailed information about each change event. Debezium organizes these messages into Kafka topics, each of which has a name that is unique within the Kafka cluster. A Kafka cluster consists of one or more servers called Kafka brokers.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 40+ free data sources) like Asana, and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

What is Debezium SMT (Single Message Transformation)?

Debezium provides several Single Message Transformations (SMTs) for modifying events. You configure a connector to apply a transformation so that each record is modified before it is sent to Kafka. You can also apply a Debezium SMT on a sink connector to modify the records that it consumes. When you configure an SMT for a connector, you can optionally define a predicate for the transformation. The predicate specifies how to apply the transformation conditionally to a subset of the messages that the connector processes. Once you define a predicate, you can reuse it across multiple transformations. A transformation that references a predicate can also set a negate option, which inverts the predicate’s condition.

A Debezium SMT modifies event records before Kafka Connect saves them to a Kafka topic. When you configure an SMT for a Debezium connector, Kafka Connect by default applies that transformation to every record that the connector emits. If you want to apply a transformation selectively, you can restrict it to a subset of the emitted messages instead.

With Apache Kafka 2.6 or higher, you can apply a transformation only to selected messages, for example to events from a particular table. You use a predicate statement to tell Kafka Connect which condition a message must meet for the transformation to apply. When the Debezium connector emits a data change message, Kafka Connect checks the message against the predicate condition. If the condition is true, Kafka Connect applies the transformation before writing the message to the Kafka topic; otherwise, it sends the message without modification.
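The predicate check described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not Kafka Connect's implementation: the record layout, the topic names, and the helpers topic_name_matches, apply_smt, and mask_email are all hypothetical.

```python
import re
from typing import Callable, Optional

Record = dict  # hypothetical record shape: {"topic": str, "value": dict}


def topic_name_matches(pattern: str) -> Callable[[Record], bool]:
    """Build a predicate that is true when the whole topic name matches the regex."""
    compiled = re.compile(pattern)
    return lambda record: compiled.fullmatch(record["topic"]) is not None


def apply_smt(
    record: Record,
    transform: Callable[[Record], Record],
    predicate: Optional[Callable[[Record], bool]] = None,
    negate: bool = False,
) -> Record:
    """Apply `transform` only when the (optionally negated) predicate holds."""
    if predicate is None:
        return transform(record)  # no predicate: every record is transformed
    matched = predicate(record)
    if negate:
        matched = not matched  # negate inverts the predicate's result
    return transform(record) if matched else record


def mask_email(record: Record) -> Record:
    """Example transformation: return a copy with the email field masked."""
    return {**record, "value": {**record["value"], "email": "***"}}


is_customers = topic_name_matches(r"myserver\.mydb\.customers")
customer = {"topic": "myserver.mydb.customers", "value": {"id": 1, "email": "a@example.com"}}
order = {"topic": "myserver.mydb.orders", "value": {"id": 7, "email": "b@example.com"}}

masked_customer = apply_smt(customer, mask_email, predicate=is_customers)
untouched_order = apply_smt(order, mask_email, predicate=is_customers)
negated_order = apply_smt(order, mask_email, predicate=is_customers, negate=True)
```

The same predicate object is reused for three calls, which mirrors how one predicate definition can be shared by several transformations.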

Debezium provides the following types of SMT:

1. Topic routing SMT

Every Kafka record that contains a data change event has a default destination topic. However, users can re-route records to other topics before the records reach the Kafka Connect converter. Debezium provides the topic routing SMT for this purpose, and you enable it through the Kafka Connect configuration.

Use case

By default, a topic receives records for one physical table. When you want a particular topic to receive records for more than one physical table, you must configure the Debezium connector to re-route the records to that topic.

Logical tables

A logical table is a common use case for routing records for multiple physical tables to one topic. A logical table consists of multiple sharded tables that have the same schema. These tables are physically distinct, but together they form one logical table. Users can re-route change event records for tables in any of the shards to the same topic.

Partitioned PostgreSQL tables

When the Debezium PostgreSQL connector captures changes in a partitioned table, the default behavior is that change event records are routed to a different topic for each partition. To emit records from all partitions to one topic, configure the topic routing SMT.

To route change event records for multiple physical tables to the same topic, configure the topic routing transformation in the Kafka Connect configuration for the Debezium connector. The topic routing SMT configuration requires regular expressions that determine the following.

  • The tables for which to route records.
  • The destination topic name.

In a .properties file, the configuration looks as follows.

[Image: topic routing SMT configuration in a .properties file. Source: RedHat]

topic.regex: It specifies a regular expression that the transformation applies to each change event record to determine whether the record should be routed to a particular topic.

In the example above, (.*)customers_shard(.*) matches records for changes to tables whose names include the customers_shard string. It would re-route records for the following tables.

[Image: tables matched by the customers_shard expression. Source: Self]

topic.replacement: It specifies a regular expression that represents the destination topic name. In the example above, records for the three sharded tables would be routed to the myserver.mydb.customers_all_shards topic.
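To make the regular-expression routing concrete, here is a minimal Python sketch of the matching and replacement step. The topic.regex value comes from the text; the replacement string $1customers_all_shards and the three shard topic names are assumptions chosen to be consistent with the destination topic named above, and the reroute function is hypothetical rather than Debezium's code.

```python
import re

# topic.regex from the example in the text.
TOPIC_REGEX = re.compile(r"(.*)customers_shard(.*)")
# Assumed topic.replacement "$1customers_all_shards"; Java's "$1" group
# reference is written "\g<1>" in Python's re module.
TOPIC_REPLACEMENT = r"\g<1>customers_all_shards"


def reroute(topic: str) -> str:
    """Return the destination topic, or the original topic if the regex does not match."""
    match = TOPIC_REGEX.fullmatch(topic)
    if match is None:
        return topic
    return match.expand(TOPIC_REPLACEMENT)


# Hypothetical shard topic names, used only for illustration.
shard_topics = [
    "myserver.mydb.customers_shard1",
    "myserver.mydb.customers_shard2",
    "myserver.mydb.customers_shard3",
]
routed = [reroute(topic) for topic in shard_topics]
other = reroute("myserver.mydb.orders")
```

All three shard topics collapse into myserver.mydb.customers_all_shards, while a topic that does not match the expression keeps its original name.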

2. Content-based routing SMT

By default, Debezium streams all the change events that it reads from a table to a single topic, but there might be situations where you want to re-route selected events to other topics based on the event content. The process of routing messages based on their content is described by the content-based routing messaging pattern. To apply this pattern in Debezium, you use the content-based routing SMT to write an expression that is evaluated for each change event.

To use the content-based routing SMT, follow the steps below.

  1. Open the RedHat Integration site and download the Debezium scripting SMT archive (debezium-scripting-1.5.4.Final.tar.gz).
  2. Extract the archive’s contents into the Debezium plugin directories of your Kafka Connect environment.
  3. Restart Kafka Connect to pick up the new JAR files.

The Groovy language needs the following libraries on the classpath in the environment.

groovy                    
groovy-json (optional)                        
groovy-jsr223    

Similarly, the JavaScript language needs the following libraries on the classpath.

graalvm.js                        
graalvm.js.scriptengine    

To enable the content-based routing SMT on a Debezium connector, you must configure the ContentBasedRouter SMT in the Kafka Connect configuration for the connector.

For example, to re-route all update records to an updates topic, add the configuration below to your Debezium connector configuration.

[Image: Debezium connector configuration for content-based routing. Source: RedHat]
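The routing decision itself is small enough to sketch in Python. This is an illustration under stated assumptions, not the SMT's implementation: the destination topic name updates, the record layout, and the functions topic_expression and route are hypothetical, and the real SMT evaluates an expression written in a scripting language such as Groovy.

```python
from typing import Optional


def topic_expression(value: dict) -> Optional[str]:
    """Return a new topic name for update events, or None to keep the default topic."""
    return "updates" if value.get("op") == "u" else None


def route(record: dict) -> dict:
    """Return a copy of the record with its topic replaced when the expression yields a name."""
    new_topic = topic_expression(record["value"])
    if new_topic is None:
        return record  # expression produced nothing: leave the record where it was
    return {**record, "topic": new_topic}


update_event = {"topic": "myserver.mydb.customers", "value": {"op": "u", "after": {"id": 1}}}
create_event = {"topic": "myserver.mydb.customers", "value": {"op": "c", "after": {"id": 2}}}

routed_update = route(update_event)
routed_create = route(create_event)
```

Only the update event changes topic; the create event stays on its default topic because the expression returns no value for it.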

3. Message Filtering SMT

By default, Debezium delivers every data change event that it receives to the Kafka broker. But in some cases, users might be interested in only a subset of the events. To let users process only the relevant records, Debezium provides the message filtering SMT. As the message filtering SMT processes the event stream, it evaluates each event against the filter condition. Only events that meet the filter condition are passed to the broker.

To use the message filtering SMT, follow the steps below.

  1. Open the RedHat Integration site and download the Debezium scripting SMT archive (debezium-scripting-1.5.4.Final.tar.gz).
  2. Extract the archive’s contents into the Debezium plugin directories of your Kafka Connect environment.
  3. Obtain a JSR 223 script engine implementation and add its contents to the Debezium plugin directories of your Kafka Connect environment.
  4. Restart Kafka Connect to pick up the new JAR files.

The Groovy language needs the following libraries on the classpath in the environment.

groovy
groovy-json (optional)
groovy-jsr223

The JavaScript language needs the following libraries on the classpath in the environment.

graalvm.js
graalvm.js.scriptengine

To configure a Debezium connector to filter change event records, you must configure the Filter SMT in the Kafka Connect configuration for the Debezium connector.

Add the below configuration to your connector configuration.

[Image: connector configuration for the message filtering SMT. Source: RedHat]

The above example uses the Groovy expression language. The expression value.op == 'u' && value.before.id == 2 removes all messages except those that represent update (u) records with id values equal to 2.
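As a rough Python equivalent of that condition, the sketch below applies the same test to a handful of hypothetical events. The event layout and the keep function are assumptions for illustration; unlike the Groovy expression, this version explicitly guards against a missing before state.

```python
def keep(value: dict) -> bool:
    """Mirror of the Groovy condition: value.op == 'u' && value.before.id == 2."""
    before = value.get("before") or {}  # create events have no before state
    return value.get("op") == "u" and before.get("id") == 2


# Hypothetical change events: create, two updates, and a delete.
events = [
    {"op": "c", "before": None, "after": {"id": 2}},
    {"op": "u", "before": {"id": 1}, "after": {"id": 1}},
    {"op": "u", "before": {"id": 2}, "after": {"id": 2}},
    {"op": "d", "before": {"id": 2}, "after": None},
]

# Only events for which the condition is true are passed on.
passed = [event for event in events if keep(event)]
```

Of the four events, only the update whose before state has id 2 survives the filter.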

4. New record state extraction SMT

A Debezium data change event has a complex structure that carries a lot of information, and the Kafka records that convey Debezium change events contain all of it. However, some consumers in the Kafka ecosystem expect Kafka records that provide a flat structure of field names and values. To produce such records, you can configure the event flattening SMT.

Change event structure

Debezium generates data change events with a complex structure that consists of the parts below.

  • The operation that made the change.
  • Source information such as the name of the database and table where the change was made.
  • Timestamp for when the change was made.
  • Row data before and after the change.

To configure the event flattening SMT on a Kafka Connect source or sink connector, add the SMT configuration to the connector’s configuration.

In a .properties file, specify the configuration below.

[Image: event flattening SMT configuration in a .properties file. Source: RedHat]

You can set transforms= to multiple comma-separated SMT aliases, in the order in which you want Kafka Connect to apply them.

The .properties example below sets several event flattening SMT options.

[Image: event flattening SMT options in a .properties file. Source: RedHat]

drop.tombstones=false: It keeps tombstone records for DELETE operations in the event stream.

delete.handling.mode=rewrite: For DELETE operations, the SMT edits the Kafka record by flattening the value field that was in the change event. The value field then directly contains the key-value pairs that were in the before field.

The SMT also adds a __deleted field and sets it to true, as in the example below.

[Image: flattened record with __deleted set to true. Source: RedHat]

add.fields=table,lsn: It adds change event metadata for the table and lsn fields to the simplified Kafka record.
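The combined effect of delete.handling.mode=rewrite and add.fields=table,lsn can be sketched in Python as follows. This is a simplified model, not the SMT's code: the event layout and sample values are hypothetical, tombstone records are not modeled, and the double-underscore prefix on added metadata fields and the string-valued __deleted flag are assumptions based on Debezium's documented defaults.

```python
def flatten(event: dict, add_fields=("table", "lsn")) -> dict:
    """Flatten a change event envelope, rewriting deletes instead of dropping them."""
    is_delete = event["op"] == "d"
    # For deletes the row state lives in "before"; otherwise it lives in "after".
    row = dict(event["before"] if is_delete else event["after"])
    # Rewrite mode marks every record with a __deleted flag.
    row["__deleted"] = "true" if is_delete else "false"
    # Copy the selected source metadata, prefixed with a double underscore.
    for name in add_fields:
        row[f"__{name}"] = event["source"][name]
    return row


update_event = {
    "op": "u",
    "before": {"id": 2, "name": "Ann"},
    "after": {"id": 2, "name": "Anne"},
    "source": {"table": "customers", "lsn": 101},
}
delete_event = {
    "op": "d",
    "before": {"id": 2, "name": "Anne"},
    "after": None,
    "source": {"table": "customers", "lsn": 102},
}

flat_update = flatten(update_event)
flat_delete = flatten(delete_event)
```

Both results are flat dictionaries of field names and values, which is the shape that consumers expecting simple records can work with.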

5. Outbox event router

The outbox pattern is a way to exchange data safely and reliably between multiple microservices. An implementation of the pattern avoids inconsistencies between a service’s internal state and the state in the events that other services consume.

To implement the outbox pattern with Debezium, configure a Debezium connector to:

  • Capture changes in the outbox table
  • Apply the Debezium outbox event router SMT

Consider the below Debezium outbox message.

[Image: Debezium outbox message. Source: RedHat]

A Debezium connector that is configured to apply the outbox event router SMT generates the above message by transforming a raw Debezium change event message such as the one below.

[Image: raw Debezium change event message. Source: RedHat]

The above example is based on the default outbox event router configuration, which assumes a particular outbox table structure and event routing based on aggregates. Users can customize the behavior of the SMT by modifying the values of the default configuration options.

Outbox table structure

The outbox table should have the below columns to apply the default outbox event router Debezium SMT configuration to the Debezium connector.

[Image: default outbox table structure. Source: RedHat]
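As a rough illustration of aggregate-based routing, the Python sketch below turns one outbox row into a message. Everything here is an assumption made for the example: the column names id, aggregatetype, aggregateid, type, and payload and the outbox.event.<aggregatetype> topic naming follow what the Debezium documentation describes as defaults, the row values are invented, and route_outbox_row is a hypothetical helper rather than the SMT's code.

```python
def route_outbox_row(row: dict) -> dict:
    """Turn one captured outbox row into a message with a topic, key, value, and headers."""
    return {
        # The aggregate type decides which topic receives the event.
        "topic": f"outbox.event.{row['aggregatetype']}",
        # The aggregate id becomes the message key, so events for one aggregate stay ordered.
        "key": row["aggregateid"],
        # Only the payload column is carried as the message value.
        "value": row["payload"],
        # The unique event id travels as a header for deduplication by consumers.
        "headers": {"id": row["id"]},
    }


# Hypothetical outbox row, for illustration only.
outbox_row = {
    "id": "406c07f3-26f0-4eea-a50c-109940064b8f",
    "aggregatetype": "order",
    "aggregateid": "1",
    "type": "OrderCreated",
    "payload": {"id": 1, "customerId": 123, "status": "ENTERED"},
}

message = route_outbox_row(outbox_row)
```

Keying by the aggregate id is the design choice that matters most here: all events for the same aggregate land in the same partition and keep their order.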

6. MongoDB New Document State Extraction

The Debezium MongoDB connector emits data change messages that have a complex structure.

The message consists of:

  • The operation and metadata.
  • An after element that holds the whole document after an insert has been executed.
  • A patch element that describes the altered fields after an update has been executed.

The after and patch elements are strings that contain JSON representations of the inserted or altered data.

The general message structure for an insert event looks as follows.

[Image: general message structure for an insert event. Source: Debezium]

Although the above structure is suitable for representing changes to MongoDB’s schemaless collections, existing sink connectors do not understand it. For this reason, Debezium provides an SMT that converts the after/patch information from MongoDB CDC events into a structure that is suitable for sink connectors. The SMT parses the JSON strings and reconstructs correctly typed Kafka Connect records that the sink connectors can consume.

When the emitted record is represented as JSON, it looks like the output below.

[Image: emitted record structure in JSON. Source: Debezium]
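The core idea, parsing the JSON string carried in the after element into typed values, can be sketched in Python with the standard json module. This is a deliberately simplified model: it handles only insert events, ignores the patch element and MongoDB's extended JSON types, and the event contents and the extract_after helper are hypothetical.

```python
import json
from typing import Optional


def extract_after(event: dict) -> Optional[dict]:
    """Parse the JSON string in `after` into a typed structure, or return None if absent."""
    after = event.get("after")
    return json.loads(after) if after is not None else None


# Hypothetical insert event: `after` is a string that contains JSON.
insert_event = {
    "op": "c",
    "after": '{"_id": 1, "name": "Sally", "tags": ["new", "vip"]}',
    "patch": None,
}

document = extract_after(insert_event)
```

After parsing, the id is a number and tags is a list rather than text inside one opaque string, which is what lets a downstream connector map fields to columns.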

You have to apply this SMT on a sink connector.

Configuration

The configuration consists of the below properties.

[Image: MongoDB new document state extraction SMT configuration. Source: Debezium]

Array encoding

The SMT converts MongoDB arrays into arrays as defined by the Apache Kafka Connect schema. Such arrays must contain elements of the same type, but MongoDB allows elements of different types in the same array. To overcome this issue, the array can be encoded in two different ways using the array.encoding configuration option.

array.encoding=array: It encodes the array as the array datatype. This works only when all elements of a given array have the same type.

array.encoding=document: It converts the array into a struct of structs. The main struct contains fields named _0, _1, _2, and so on, where each name represents the index of the element in the array.

Consider an example of a source MongoDB document that contains an array with elements of heterogeneous types, as shown below.

[Image: source MongoDB document with a heterogeneous array. Source: Debezium]

It can be encoded as shown in the below image.

[Image: the array encoded with document encoding. Source: Debezium]
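The two encodings can be sketched in Python as follows. This is an illustrative model only: encode_array is a hypothetical helper, the sample array is invented, and a Python dict stands in for the Kafka Connect struct that the SMT would actually build.

```python
def encode_array(values: list, encoding: str = "array"):
    """Encode a list the way the two array.encoding modes are described in the text."""
    if encoding == "array":
        # A typed array requires all elements to share one type.
        if len({type(value) for value in values}) > 1:
            raise ValueError("array encoding requires elements of a single type")
        return list(values)
    if encoding == "document":
        # Each element becomes a field named after its index: _0, _1, _2, ...
        return {f"_{index}": value for index, value in enumerate(values)}
    raise ValueError(f"unknown encoding: {encoding}")


# Hypothetical heterogeneous array from a MongoDB document.
mixed = [1, "two", {"three": 3}]

as_document = encode_array(mixed, "document")
as_array = encode_array([1, 2, 3], "array")
```

The document encoding accepts the mixed array because every element gets its own field, whereas the array encoding rejects it.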

Conclusion

In this tutorial, you have learned about Single Message Transformations in Debezium, along with their types, examples, and configuration properties. Some Debezium SMTs are available by default in the Debezium container image, while others, such as the message filtering and content-based routing SMTs, need to be downloaded.

Debezium is a trusted tool that many companies use because it provides many benefits, but transferring data from it into a data warehouse is a tedious task. An automated data pipeline helps solve this issue, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline with 100+ pre-built integrations that you can choose from.

Hevo can help you integrate data from numerous sources and load it into a destination so that you can analyze real-time data with a BI tool such as Tableau. It makes data migration hassle-free, and it is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about Debezium SMT in the comments section below.
