The main objective of this post is to introduce you to Apache Kafka so that you can build a strong foundation and understand how to stream change data from your SQL Server database.

  1. In this post, you will learn how to leverage Apache Kafka to continuously stream the DML and DDL events committed to your SQL Server tables in real time, so that they can be easily consumed by other applications and services.
  2. Change data capture (CDC) was introduced in SQL Server 2008 and became available in Standard Edition with SQL Server 2016 SP1. The feature lets SQL Server monitor databases and tables and stores the committed changes in dedicated CDC tables.

What Is Kafka and What Are Its Uses?

Kafka is a distributed streaming platform built on the principles of a messaging system. It is designed to handle continuous streams of events such as:

  • New log entries from your web servers.
  • New sensor data from your IoT systems.
  • New stock trades.

Common uses of Kafka include:

  • An enterprise Pub/Sub messaging service (see the example commands after this list).
  • Stream processing.
  • Connectors that import and export change records from databases and other systems as they occur.
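
As a minimal sketch of the Pub/Sub pattern (assuming a Kafka broker is already running on localhost:9092, as set up later in Step 5, and using a hypothetical topic named web-logs), one terminal publishes records to a topic while another consumes them independently:

> bin/kafka-topics.sh --create --topic web-logs --bootstrap-server localhost:9092
> bin/kafka-console-producer.sh --topic web-logs --bootstrap-server localhost:9092
> bin/kafka-console-consumer.sh --topic web-logs --from-beginning --bootstrap-server localhost:9092

Anything typed into the producer window appears in the consumer window. This is the same publish/subscribe flow the Debezium connector uses to push change records downstream.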

You will use the Debezium SQL Server connector. Debezium is an open-source distributed platform built on top of Kafka Connect, and its SQL Server connector streams rows captured by SQL Server Change Data Capture (CDC) into Kafka topics.

Use Debezium and Kafka

Use Debezium and Kafka to propagate CDC (Change Data Capture) data from SQL Server to downstream consumers. Debezium captures row-level changes, such as INSERT, UPDATE, and DELETE operations, in the upstream database and publishes them as events to Kafka through Kafka Connect-compatible connectors.
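
Each change event Debezium publishes is a structured record describing a single row change. As an illustration only (the field names follow Debezium's event envelope, but the values below are hypothetical), an UPDATE to a row in a Posts table produces a payload roughly like this, with before and after row images and an op code (c for insert, u for update, d for delete):

{
  "before": { "Id": 42, "Title": "Old title" },
  "after":  { "Id": 42, "Title": "New title" },
  "source": { "db": "TestDB", "schema": "dbo", "table": "Posts" },
  "op": "u",
  "ts_ms": 1595185712000
}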

Prerequisites

  • You will need a Linux VM (or install one in VirtualBox).
  • You will need SQL Server installed and set up, using an edition that supports CDC (Enterprise Edition, or Standard Edition on SQL Server 2016 SP1 and later).
  • You will need to install the Microsoft ODBC Driver for SQL Server.

Method 1: SQL CDC Using Kafka

Step 1: Enable CDC For The Table(s) That Should Be Captured By The Connector

1. Enable CDC For The SQL Server Database

USE TestDB 
GO 
EXEC sys.sp_cdc_enable_db 
GO

2. Enable CDC For The SQL Server Database Table That You Are Tracking

USE TestDB 
GO 
EXEC sys.sp_cdc_enable_table 
@source_schema = N'dbo', 
@source_name = N'Posts', 
@role_name = N'Admin', 
@supports_net_changes = 1 
GO
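
Before moving on, you can confirm that CDC is active for both the database and the table with a quick check against the system catalog views (both queries should return 1 in the second column once CDC is enabled):

USE TestDB 
GO 
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = 'TestDB' 
GO 
SELECT name, is_tracked_by_cdc FROM sys.tables WHERE name = 'Posts' 
GO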

Step 2: Install Java On Your Linux VM

Kafka runs on the JVM, so you need to install Java and point the JAVA_HOME environment variable at the installation.

> sudo yum install java-1.8.0
> export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
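
To confirm the installation and keep JAVA_HOME set across sessions, you can optionally verify the version and persist the variable (assuming bash and the install path used above):

> java -version
> echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0' >> ~/.bashrc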


Step 3: Download And Install Apache Kafka

Download the 2.5.0 release and un-tar it.

> mkdir kafka/
> cd kafka
> wget https://archive.apache.org/dist/kafka/2.5.0/kafka_2.12-2.5.0.tgz
> tar -xzf kafka_2.12-2.5.0.tgz
> cd kafka_2.12-2.5.0

Step 4: Download And Install A Debezium Connector

Download the SQL Server Connector plug-in archive and extract the files into your Kafka Connect environment. Next, add the parent directory of the extracted plug-in(s) to Kafka Connect’s plugin path.

> mkdir -p /kafka/connect
> wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-sqlserver/1.2.0.Final/debezium-connector-sqlserver-1.2.0.Final-plugin.tar.gz
> tar -xzf debezium-connector-sqlserver-1.2.0.Final-plugin.tar.gz -C /kafka/connect

Then set plugin.path=/kafka/connect in your Kafka Connect worker configuration (for example, in config/connect-distributed.properties) so that Kafka Connect can find the plug-in.

Step 5: Start A Kafka Server

The next step is to start the Kafka server. Kafka depends on ZooKeeper, another open-source Apache project (originally a Hadoop subproject), to coordinate the brokers in its distributed cluster, so ZooKeeper must be started first.

Kafka ships with a script for this in its bin directory. In a separate terminal window, run the following command.

> bin/zookeeper-server-start.sh config/zookeeper.properties
[2020-07-19 19:08:32,351] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

The script takes a configuration file; the zookeeper.properties file shipped with Kafka contains sensible defaults.

Now start the Kafka server.

> bin/kafka-server-start.sh config/server.properties
[2020-07-19 19:08:48,052] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2020-07-19 19:08:48,101] INFO Property socket.send.buffer.bytes is overridden to 1044389 (kafka.utils.VerifiableProperties)
...
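
As a quick sanity check that the broker is up, you can list the topics it knows about (the list will be empty on a fresh install):

> bin/kafka-topics.sh --list --bootstrap-server localhost:9092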

Step 6: Configure Kafka Connect

Connector configurations are loaded into Kafka Connect via its REST API. The following example adds the Debezium SQL Server connector configuration in JSON format.

This example specifies that the Debezium connector instance should monitor a SQL Server instance running at port 1433 on 102.5.232.115.
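
Before posting the configuration, the Kafka Connect worker itself must be running so that its REST API is available on port 8083. A minimal sketch using the distributed worker configuration shipped with Kafka (make sure its plugin.path points at the directory from Step 4):

> bin/connect-distributed.sh config/connect-distributed.properties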

curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d '
{
  "name": "sqlserver-posts-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "102.5.232.115",
    "database.port": "1433",
    "database.user": "admin",
    "database.password": "Supersecretpassword!",
    "database.dbname": "TestDB",
    "database.server.name": "staging",
    "table.whitelist": "dbo.Posts",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.Posts"
  }
}
'

Now check that the connector is running successfully using this command.

curl http://localhost:8083/connectors/sqlserver-posts-connector/status


Step 7: Monitor Kafka Logs For Change Data

If the connector is in the RUNNING state, the Connect worker logs should show data being ingested from SQL Server into Kafka.
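
You can also read the change events directly from the table's topic using the console consumer shipped with Kafka. The topic name below is an assumption based on the <database.server.name>.<schemaName>.<tableName> naming convention and the connector configuration above:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic staging.dbo.Posts --from-beginning

Each INSERT, UPDATE, or DELETE committed to dbo.Posts should appear as a JSON change event like the one sketched earlier in this post.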

Method 2: SQL CDC Using Hevo

Method 1 requires engineering bandwidth and expertise to execute cleanly; otherwise, you risk getting bogged down in the complexity and the details. Things become a lot smoother when you let the right platform handle this work.

Hevo supports SQL Server CDC out of the box and provides an intuitive graphical interface to enable this feature. You also get a granular activity trail of all the changes made and synced to any target database or data warehouse.

Here’s how easy it is to set up SQL Server CDC with Hevo:

  • Authenticate and connect your SQL Server data source.
  • Select Change Data Capture as your replication mode and configure your destination.

Hevo’s point-and-click interface ensures the shortest possible time to production.

Features of the SQL Server CDC Source (Debezium) Connector

  • Automatic Topic Creation: Kafka topics are automatically created using the naming convention <database.server.name>.<schemaName>.<tableName>, with default properties: topic.creation.default.partitions=1 and topic.creation.default.replication.factor=3.
  • Table Inclusion/Exclusion: Configure which tables to monitor for changes. By default, all non-system tables are monitored.
  • Task Limits: Supports running multiple connectors, each limited to one task ("tasks.max": "1").
  • Snapshot Mode: Specify criteria for running snapshots.
  • Tombstones on Delete: Option to generate tombstone events after delete operations (default is true).
  • Database Authentication: Supports password authentication.
  • Output Formats: Supports Avro, JSON Schema, Protobuf, or JSON (schemaless) for record values and keys. Requires Schema Registry for Avro, JSON Schema, or Protobuf formats.
  • Incremental Snapshot: Supports incremental snapshotting via signaling (see the sketch after this list).
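
As a rough illustration of how signaling works (incremental snapshots require a newer Debezium release than the 1.2.0 used above, and the signaling table name dbo.debezium_signal is hypothetical and must match the connector's signal.data.collection setting), you trigger an ad hoc snapshot by inserting a row into the signaling table:

-- Hypothetical signaling table; the connector watches it for snapshot requests
INSERT INTO dbo.debezium_signal (id, type, data)
VALUES ('ad-hoc-1', 'execute-snapshot', '{"data-collections": ["dbo.Posts"]}');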

Conclusion

  1. In this blog, you have gained a detailed understanding of Kafka and of change data capture in SQL Server.
  2. The two methods to set up SQL Server CDC, using Kafka and using Hevo, are explained in detail. Get started with Hevo for a seamless SQL Server CDC setup.
Talha
Software Developer, Hevo Data

Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent 4 years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves to explain complex information related to data engineering to his peers through writing. He has written many blogs related to data integration, data management aspects, and key challenges data practitioners face.
