DB2 Change Data Capture Kafka Simplified

on Change Data Capture, Data Integration, Data Streaming, IBM DB2, Kafka • February 4th, 2022 • Write for Hevo

db2 change data capture-fea

It is generally well known today that if you’re going to have any successful application or business, you’re going to need a database and a way to integrate its data with other datasets for business analytics and machine learning workloads.

This guide goes over how to implement a data pipeline using DB2 change data capture with Kafka. You will go through a demo project that shows you how to ingest data from a DB2 database into Kafka using CDC and the Outbox pattern.

Table of Contents

What is DB2?

DB2 Change Data Capture: ibm db2
Image Source: assets-global.website-files.com

DB2 is IBM’s enterprise-grade SQL database. DB2 was first released in 1983 and it is one database among a family of database servers developed by IBM. DB2 is both suitable for transactional and analytics processing. It is mostly used for e-commerce, online banking, and data warehousing purposes. DB2 is ACID-compliant and this means that it supports atomicity, consistency, isolation, and durability. Being a relational database means that DB2 works with structured data but it also supports unstructured data.

DB2 is platform-independent and can run on multiple platforms:

  • DB2 luw – This version runs on Linux/Unix/Windows and it is the one that is used by most organizations. This option gives you full management control over your DB2 instance.
  • DB2 z/os – This version is designed for mainframes and it is mostly used by organizations for running critical applications that hold huge amounts of data where many data processing techniques are used.
  • DB2 i/os – This version is platform-dependent and is used on IBM power systems.
  • DB2 on Cloud – This option is fully managed for you; you don’t need to provision any hardware, install DB2, or set up a DB2 database. You can provision DB2 Cloud Data Warehouse in a couple of ways:
    • Single-node plan which gives you 99.5% SLA.
    • Duo node HA option which gives you 99.99% SLA

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 30+ free data sources) like Asana and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

GET STARTED WITH HEVO FOR FREE[/hevoButton]

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
SIGN UP HERE FOR A 14-DAY FREE TRIAL

What is Kafka?

DB2 Change Data Capture: kafka
Image Source: www.ovhcloud.com

Kafka is an open-source platform that is used both as a stream processing engine and a distributed data store for streaming data. Kafka can easily ingest and process huge amounts of streaming data from thousands of data sources (e.g. IoT devices) and store it in sequential and incremental order. Kafka is mostly used for developing streaming data pipelines real-time streaming applications. This is because Kafka:

  • Allows data sources to publish streaming data and downstream consumers to subscribe to these records
  • Provides immutable persistence for streams of records in the order in which the records were created
  • Can process streaming data in real-time

Kafka can also be used as a message broker for mediating communication between two or more applications.

What is Change Data Capture (CDC)?

DB2 Change Data Capture: CDC
Image Source: www.projectcubicle.com

CDC is a style of data movement. It works by using a set of technologies that allow you to identify and capture data that has changed in the database. Those UPDATE, INSERT, DELETE, and CREATE change events are captured and then sent either to a table in the same database or a different database, to a streaming data platform, or to cloud storage.

For example following an INSERT into a table, a change event with the following JSON structure is generated. It has metadata about the before state (NULL) and the after state.

{
"key":
{"customerid":79614},
   "value":{
      "before":null,
      "after":{
         "customerid":79614,
         "firstname":"Luke",
         "lastname":"Shaw",
         "dob":"31/07/85"
      },
      "source":{
         "version":"1.1.9.Final",
         "connector":"db2",
         "name":"db2-5273be3-ibm-55g1",
         "ts_ms":2015033814752,
         "snapshot":"true",
         "db":"defaultdb",
         "schema":"public",
         "table":"customers",
         "txId":5587630,
         "lsn":220441041305,
         "xmin":null
      }
   }
}

When you think about moving data from one source system to a target system, traditional data movement is based around bulk where you’re using an ETL tool to move data between databases. The challenge with this approach is that you’ve got a batch window/period of time where you can move all this data. ETL is also complex to set up and maintain.

A different way to do it is with CDC. CDC surpasses ETL by being real-time which is its biggest selling point over ETL. With CDC you’re capturing every change/transaction in real-time from the source database and you’re delivering it to a target in real-time as those changes happen.

The benefit of the CDC is that it is a very effective way of moving data for a variety of use cases. It may be moving data into a data warehouse, data lake, creating an operational data store, or a replica of that data in real-time which makes it perfect for the cloud. CDC delivers data in real-time which is great for analytics, moving data to a stream processing engine like Kafka, and real-time fraud detection.

However, all CDC is not created equally. There are 3 main styles of doing CDC:

  • Log-based – The best and most practical way to do CDC is with a log-based approach. In every database, when a new transaction comes in, it gets logged into a log file. CDC picks up those changes and then moves them from the log. This method is very efficient and has no impact on the source system.
  • Query-based – In this approach, you insert a special column called LAST_UPDATED to all tables in the source database and then query that for changes. Polling the LAST_UPDATED column is not real-time and it imposes a severe load on the database.
  • Triggers – You can also use database triggers to capture row-level operations in real-time. But triggers can impact the database performance.

Practical Use Cases for CDC

There are many use cases that need real-time synchronization.

1. Streaming data from your database to a data warehouse

One big use case of CDC is to stream data to your OLAP data warehouse for analytical processing.

2. Cache invalidation

This can be done by replaying the captured change events into a cache like Redis or Memcache.

3. Building search indexes

You can use change events to build a full-text search index.

4. Database migration

CDC can be used when migrating data to a different database vendor, moving data from On-Premises infrastructure to the cloud, or when synchronizing data between two different data systems.

5. Data synchronization in Microservices

CDC is becoming the de facto standard for synchronizing data across Microservices especially when the Outbox pattern is used 

How to Implement DB2 Change Data Capture with Kafka

For the rest of this DB2 Change Data Capture with Kafka guide, you will use Docker Compose to run a basic infrastructure with Kafka and Kafka Connect.

After provisioning the DB2 database server and Kafka Connect, the next phase will be to register the database connector. For that, you will use a Debezium connector. Debezium is an open-source tool that allows you to pull change events from DB2 and pushes them to a Kafka topic. It is the most popular tool that is used when working with Kafka.

For the rest of this How to Implement DB2 Change Data Capture with Kafka guide, you are going to use a fictional vaccine ordering service that uses DB2 for persistence and Kafka for reactive messaging. Let’s walk through the implementation:

1.DB2 Change Data Capture with Kafka: Set Up the Development Environment

The first is to set up your local environment. For this, you are going to be using docker-compose.

Let’s start by cloning the sample repository:

$ git clone https://github.com/ibm-cloud-architecture/vaccine-order-mgr

The next step of DB2 change data capture, use Docker Compose to stand up the five processes in the Microservices application. These are Kafka, Kafka Connect, ZooKeeper, DB2 for persistence, DB2 JDBC driver, and the vaccine-order-service.

Navigate to the project root directory and run the following command:

docker-compose -f strimzi-docker-compose.yaml up -d --build

Navigate to the environment direct and execute the following commands to create and validate the necessary Kafka topics for DB2 change data capture:

./createTopic.sh
 
./listTopics.sh

  __consumer_offsets
  db_history_vaccine_orders
  src_connect_configs
  src_connect_offsets
  src_connect_statuses
  vaccine_shipmentplan

You will be tracking the db_history_vaccine_orders topic for changes in the database schema on the vaccine orders table.

2. DB2 Change Data Capture with Kafka: Verify the State of the DB2 Database

To validate that the database is running, run the following script to connect to the DB2 server and list the existing tables. Note that the password for the test user is db2inst1:

docker exec -ti db2 bash

db2 connect to TESTDB USER DB2INST1

db2 "select * from syscat.schemata"

db2 list tables

3. DB2 Change Data Capture with Kafka: Configure the DB2 Connector

In this step of DB2 change data capture, you’re going to deploy and start the Debezium DB2 connector. The DB2 connector’s configuration is declared in the register-db2.json file under the environment/cdc directory.

To deploy the connector, navigate to this directory and run the following POST command:

curl -i -X POST -H "Accept:application/json" -H  "Content-Type:application/json" http://localhost:8083/connectors/ -d 
@cdc/register-db2.json

You can verify the status of the Kafka connector at:

http://localhost:8083/connectors/orderdb-connector/

Confirm that the Kafka topics have been created:

 ./listTopics.sh 
  vaccine_lot_db
  vaccine_lot_db.DB2INST1.ORDEREVENTS

You will find from the results that there are two Kafka topics that have been created. They are vaccine_lot_db which contains the definition of the DB2 database and the Debezium connector, and vaccine_lot_db.DB2INST1.ORDEREVENTS which will be used by the application for emitting events. The Debezium connector will start by taking a snapshot of the DB2INST1.ORDEREVENTS table and then it will send existing records to the Kafka topic.

4. DB2 Change Data Capture with Kafka: Start a Kafka Consumer

In this step of DB2 Change Data Capture with Kafka, you are going to use the console consumer tool, to initiate a consumer:

docker-compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh     --bootstrap-server kafka:9092     --from-beginning     --property print.key=true     --topic db2server.DB2INST1.ORDERS

5. DB2 Change Data Capture with Kafka: Create a New Order in the Application

Use the user interface at http://localhost:8080 to create a new order.

Here is an example of the resulting records in the Kafka topic:

{"ID":"knr6sqy/T+bDwLnQgBHZF=="}   
{"before":null,"after":{"ID":"knr6sqy/T+bDwLnQgBHZF==","AGGREGATETYPE":"VaccineOrderEntity","AGGREGATEID":"21","TYPE":"OrderCreated","TIMESTAMP":1706315550441250,"PAYLOAD":"{"orderID":11,"deliveryLocation":"Zurich","quantity":100,"priority":1,"deliveryDate":"2022-02-23","askingOrganization":"Swiss Government","vaccineType":"MORDERNA","status":"OPEN","creationDate":"23-FEB-2022 23:24:00"}"},"source":{"version":"1.3.0.Final","connector":"db2","name":"vaccine_lot_db","ts_ms":1706315550441250,"snapshot":"last","db":"TESTDB","schema":"DB2INST1","table":"ORDEREVENTS","change_lsn":null,"commit_lsn":"00000000:0000150f:0000000000048fca"},"op":"r","ts_ms":1706315550441250,"transaction":null}

Conclusion

Hopefully, from this DB2 Change Data Capture with Kafka guide, you’ve been able to nail down the basics of implementing a real-time streaming pipeline that uses a log-based CDC approach to move DB2 change data capture from a DB2 instance to a Kafka topic.

SQL Server is a trusted source that a lot of companies use as it provides many benefits but transferring data from it into a data warehouse is a hectic task. The Automated data pipeline helps in solving this issue and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline and has awesome 100+ pre-built Integrations that you can choose from.

visit our website to explore hevo[/hevoButton]

Hevo can help you Integrate your data from numerous sources and load them into a destination to Analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about DB2 change data capture in the comments section below.

No-code Data Pipeline For Your Data Warehouse