Any successful application or business needs a database, along with a way to integrate its data with other datasets for business analytics and machine learning workloads.

This guide goes over how to implement a data pipeline using DB2 change data capture with Kafka. You will go through a demo project that shows you how to ingest data from a DB2 database into Kafka using CDC and the Outbox pattern.

What is DB2?


DB2 is IBM’s enterprise-grade SQL database. DB2 was first released in 1983 and is one database among a family of database servers developed by IBM. DB2 is suitable for both transactional and analytical processing, and it is mostly used for e-commerce, online banking, and data warehousing purposes. DB2 is ACID-compliant, meaning it guarantees atomicity, consistency, isolation, and durability. As a relational database, DB2 primarily works with structured data, but it also supports unstructured data.
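As a quick illustration of the transactional side, here is a minimal sketch using the DB2 command line processor (CLP); the database name TESTDB and the ORDERS table are just placeholders for this example:

# Minimal DB2 CLP session (illustrative names; autocommit is on by default).
db2 connect to TESTDB
db2 "CREATE TABLE ORDERS (ID INT NOT NULL PRIMARY KEY, STATUS VARCHAR(20))"
db2 "INSERT INTO ORDERS (ID, STATUS) VALUES (1, 'OPEN')"
db2 "SELECT ID, STATUS FROM ORDERS"
db2 terminate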

DB2 is platform-independent and can run on multiple platforms:

  • DB2 LUW – This edition runs on Linux, UNIX, and Windows, and it is the one used by most organizations. This option gives you full management control over your DB2 instance.
  • DB2 for z/OS – This edition is designed for mainframes and is mostly used by organizations to run critical applications that hold huge amounts of data and demand heavy processing.
  • DB2 for i – This edition is tied to the IBM i operating system and runs on IBM Power Systems.
  • DB2 on Cloud – This option is fully managed for you; you don’t need to provision any hardware, install DB2, or set up a DB2 database. You can provision it in a couple of ways:
    • A single-node plan, which gives you a 99.5% SLA.
    • A two-node HA option, which gives you a 99.99% SLA.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data is a no-code data pipeline that simplifies the ETL process by loading data from 150+ sources, including databases, SaaS apps, and cloud storage, into your chosen destination. With a 3-step process—select the source, enter credentials, and choose the destination—Hevo not only loads data but also enriches and transforms it for analysis, all without coding.

Check out why Hevo is the Best:

  • Secure & Fault-Tolerant: Hevo ensures secure, consistent data handling with zero data loss.
  • Automated Schema Management: Hevo auto-detects and maps schemas, eliminating manual tasks.
  • Scalable & Efficient: Hevo scales effortlessly, handling millions of records with minimal latency and efficient bandwidth use.
  • Real-Time Monitoring & Support: Hevo provides live monitoring and 24/5 customer support via chat, email, and calls.

What is Kafka?


Kafka is an open-source platform that is used both as a stream processing engine and a distributed data store for streaming data. Kafka can easily ingest and process huge amounts of streaming data from thousands of data sources (e.g., IoT devices) and store it in sequential, append-only order. Kafka is mostly used for developing streaming data pipelines and real-time streaming applications. This is because Kafka:

  • Allows data sources to publish streaming data and downstream consumers to subscribe to these records
  • Provides immutable persistence for streams of records in the order in which the records were created
  • Can process streaming data in real-time

Kafka can also be used as a message broker for mediating communication between two or more applications.
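To make that concrete, here is a minimal publish/subscribe round trip using the Kafka command line tools. This is a sketch that assumes a recent Kafka installation with the CLI tools on the PATH and a broker reachable at localhost:9092; the orders topic is just an example:

# Create a topic, publish two records, then read them back in the order they were written.
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic orders --partitions 1 --replication-factor 1

printf 'order-1\norder-2\n' | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic orders

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic orders --from-beginning --max-messages 2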

What is Change Data Capture (CDC)?


CDC is a style of data movement. It works by using a set of technologies that allow you to identify and capture data that has changed in the database. Those INSERT, UPDATE, and DELETE change events (along with schema changes) are captured and then sent either to a table in the same or a different database, to a streaming data platform, or to cloud storage.

For example, following an INSERT into a table, a change event with the following JSON structure is generated. It carries the before state (null), the after state, and metadata about the source.

{
   "key":{"customerid":79614},
   "value":{
      "before":null,
      "after":{
         "customerid":79614,
         "firstname":"Luke",
         "lastname":"Shaw",
         "dob":"31/07/85"
      },
      "source":{
         "version":"1.1.9.Final",
         "connector":"db2",
         "name":"db2-5273be3-ibm-55g1",
         "ts_ms":2015033814752,
         "snapshot":"true",
         "db":"defaultdb",
         "schema":"public",
         "table":"customers",
         "txId":5587630,
         "lsn":220441041305,
         "xmin":null
      }
   }
}

Traditional data movement relies on bulk ETL, which operates in batches and is complex to maintain. In contrast, Change Data Capture (CDC) enables real-time data transfer by capturing and delivering every change as it happens. CDC is ideal for cloud use cases, such as loading data into warehouses, lakes, or operational stores, and is beneficial for real-time analytics, streaming engines like Kafka, and fraud detection. Its real-time nature makes it more effective than ETL for various applications.

However, not all CDC is created equal. There are three main ways of doing CDC:

  • Log-based – The most practical way to do CDC is the log-based approach. Every database writes incoming transactions to a transaction log; CDC reads those changes from the log and forwards them downstream. This method is very efficient and has minimal impact on the source system.
  • Query-based – In this approach, you add a column such as LAST_UPDATED to every table in the source database and query it for changes (see the sketch after this list). Polling the LAST_UPDATED column is not real-time, it cannot see deleted rows, and it puts extra query load on the database.
  • Triggers – You can also use database triggers to capture row-level operations in real time, but triggers add overhead to every write on the source database.
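For comparison, here is what the query-based approach boils down to. This is illustrative only: it assumes a LAST_UPDATED timestamp column maintained by the application, an illustrative DB2INST1.ORDERS table, and a high-water mark saved between runs:

# Query-based CDC sketch: poll for rows changed since the last run (cannot see deletes).
LAST_SEEN="2024-01-01-00.00.00"   # high-water mark persisted from the previous poll
db2 connect to TESTDB
db2 "SELECT * FROM DB2INST1.ORDERS WHERE LAST_UPDATED > TIMESTAMP('${LAST_SEEN}')"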

How to Implement DB2 Change Data Capture with Kafka

For the rest of this DB2 Change Data Capture with Kafka guide, you will use Docker Compose to run a basic infrastructure with Kafka and Kafka Connect.

After provisioning the DB2 database server and Kafka Connect, the next phase will be to register the database connector. For that, you will use a Debezium connector. Debezium is an open-source tool that pulls change events from DB2 and pushes them to Kafka topics. It is one of the most widely used CDC tools in the Kafka ecosystem.
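Once Kafka Connect is up (it is provisioned in step 1 below), you can confirm that the Debezium DB2 connector plugin is installed by querying Kafka Connect’s REST API, which listens on port 8083 in this setup:

curl -s http://localhost:8083/connector-plugins
# The response should list io.debezium.connector.db2.Db2Connector among the installed plugins.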

For the rest of this How to Implement DB2 Change Data Capture with Kafka guide, you are going to use a fictional vaccine ordering service that uses DB2 for persistence and Kafka for reactive messaging. Let’s walk through the implementation:

1. Set Up the Development Environment

The first step is to set up your local environment. For this, you are going to be using docker-compose.

Let’s start by cloning the sample repository:

$ git clone https://github.com/ibm-cloud-architecture/vaccine-order-mgr

In the next step of DB2 change data capture, use Docker Compose to stand up the processes of the microservices application: Kafka, ZooKeeper, Kafka Connect (packaged with the DB2 JDBC driver), DB2 for persistence, and the vaccine-order-service.

Navigate to the project root directory and run the following command:

docker-compose -f strimzi-docker-compose.yaml up -d --build
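Once the containers start, a quick sanity check (using the same compose file) is to confirm that every service reports an Up state:

docker-compose -f strimzi-docker-compose.yaml ps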

Navigate to the environment directory and execute the following commands to create and validate the necessary Kafka topics for DB2 change data capture:

./createTopic.sh
 
./listTopics.sh

  __consumer_offsets
  db_history_vaccine_orders
  src_connect_configs
  src_connect_offsets
  src_connect_statuses
  vaccine_shipmentplan

You will be tracking the db_history_vaccine_orders topic for changes in the database schema on the vaccine orders table.
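Under the hood, listTopics.sh is roughly equivalent to calling the kafka-topics tool inside the Kafka container (a sketch; the actual script in the repository may differ slightly):

docker-compose exec kafka /opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka:9092 --list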

2. Verify the State of the DB2 Database

To validate that the database is running, run the following commands to connect to the DB2 server and list the existing tables. Note that the password for the test user is db2inst1:

docker exec -ti db2 bash

db2 connect to TESTDB USER DB2INST1

db2 "select * from syscat.schemata"

db2 list tables
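If the application has already created its schema, you can also confirm that the outbox table (DB2INST1.ORDEREVENTS, which you will see again in the next step) exists and check how many rows it holds, from the same db2 session:

db2 "SELECT COUNT(*) FROM DB2INST1.ORDEREVENTS"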

3. Configure the DB2 Connector

In this step of DB2 change data capture, you’re going to deploy and start the Debezium DB2 connector. The DB2 connector’s configuration is declared in the register-db2.json file under the environment/cdc directory.
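The authoritative configuration is the register-db2.json file in the repository, but pieced together from the names used elsewhere in this guide (server name vaccine_lot_db, table DB2INST1.ORDEREVENTS, history topic db_history_vaccine_orders), a Debezium 1.3-era DB2 connector registration looks roughly like this (values are illustrative):

{
  "name": "orderdb-connector",
  "config": {
    "connector.class": "io.debezium.connector.db2.Db2Connector",
    "tasks.max": "1",
    "database.hostname": "db2",
    "database.port": "50000",
    "database.user": "db2inst1",
    "database.password": "db2inst1",
    "database.dbname": "TESTDB",
    "database.server.name": "vaccine_lot_db",
    "table.include.list": "DB2INST1.ORDEREVENTS",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "db_history_vaccine_orders"
  }
}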

To deploy the connector, navigate to the environment directory and run the following POST command:

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" \
  http://localhost:8083/connectors/ -d @cdc/register-db2.json

You can verify the status of the Kafka connector at:

http://localhost:8083/connectors/orderdb-connector/
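For a more direct health check, Kafka Connect also exposes a status endpoint you can hit from the host (assuming curl is available):

curl -s http://localhost:8083/connectors/orderdb-connector/status
# A healthy connector reports "state": "RUNNING" for both the connector and its task.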

Confirm that the Kafka topics have been created:

 ./listTopics.sh 
  vaccine_lot_db
  vaccine_lot_db.DB2INST1.ORDEREVENTS

You will find from the results that two Kafka topics have been created: vaccine_lot_db, which the Debezium connector uses for schema-level events describing the structure of the captured DB2 database, and vaccine_lot_db.DB2INST1.ORDEREVENTS, which carries the change events emitted for the application. The Debezium connector starts by taking a snapshot of the DB2INST1.ORDEREVENTS table and then sends the existing records to the Kafka topic.

4. Start a Kafka Consumer

In this step of DB2 Change Data Capture with Kafka, you are going to use the Kafka console consumer tool to start a consumer:

docker-compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server kafka:9092 \
    --from-beginning \
    --property print.key=true \
    --topic vaccine_lot_db.DB2INST1.ORDEREVENTS

5. Create a New Order in the Application

Use the user interface at http://localhost:8080 to create a new order.

Here is an example of the resulting records in the Kafka topic:

{"ID":"knr6sqy/T+bDwLnQgBHZF=="}   
{"before":null,"after":{"ID":"knr6sqy/T+bDwLnQgBHZF==","AGGREGATETYPE":"VaccineOrderEntity","AGGREGATEID":"21","TYPE":"OrderCreated","TIMESTAMP":1706315550441250,"PAYLOAD":"{"orderID":11,"deliveryLocation":"Zurich","quantity":100,"priority":1,"deliveryDate":"2022-02-23","askingOrganization":"Swiss Government","vaccineType":"MORDERNA","status":"OPEN","creationDate":"23-FEB-2022 23:24:00"}"},"source":{"version":"1.3.0.Final","connector":"db2","name":"vaccine_lot_db","ts_ms":1706315550441250,"snapshot":"last","db":"TESTDB","schema":"DB2INST1","table":"ORDEREVENTS","change_lsn":null,"commit_lsn":"00000000:0000150f:0000000000048fca"},"op":"r","ts_ms":1706315550441250,"transaction":null}

You can read more on DB2 to SQL Server migration.

Practical Use Cases for CDC

There are many use cases that need real-time synchronization.

1. Streaming data from your database to a data warehouse

One big use case of CDC is to stream data to your OLAP data warehouse for analytical processing.

2. Cache invalidation

This can be done by replaying the captured change events into a cache such as Redis or Memcached, as sketched below.
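For illustration (not part of the demo project), a small consumer loop could evict a cache entry for every change event it sees. The sketch assumes the Kafka CLI tools, jq, and redis-cli are available, and reuses the topic and key format from the demo; the order:<ID> cache key is purely hypothetical:

# Evict the cached order whenever a change event arrives for it (illustrative only).
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic vaccine_lot_db.DB2INST1.ORDEREVENTS --property print.key=true |
while IFS=$'\t' read -r key value; do
  id=$(printf '%s' "$key" | jq -r '.ID')   # the event key carries the row ID
  redis-cli DEL "order:$id"                # drop the stale cache entry
done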

3. Building search indexes

You can use change events to build a full-text search index.

4. Database migration

CDC can be used when migrating data to a different database vendor, moving data from On-Premises infrastructure to the cloud, or when synchronizing data between two different data systems.

5. Data synchronization in Microservices

CDC is becoming the de facto standard for synchronizing data across microservices, especially when combined with the Outbox pattern: instead of publishing to Kafka directly, a service writes an event row to an “outbox” table in its own database within the same transaction as its state change, and CDC forwards that row to Kafka.
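The ORDEREVENTS table captured in this demo is exactly such an outbox. A table along these lines is all the pattern needs; the column types below are assumptions based on the change event shown earlier, and the demo’s actual DDL may differ:

# Hypothetical outbox DDL matching the fields seen in the change event above.
db2 "CREATE TABLE DB2INST1.ORDEREVENTS (
  ID            VARCHAR(36)  NOT NULL PRIMARY KEY,   -- event identifier
  AGGREGATETYPE VARCHAR(255) NOT NULL,                -- e.g. VaccineOrderEntity
  AGGREGATEID   VARCHAR(255) NOT NULL,                -- id of the business entity
  TYPE          VARCHAR(255) NOT NULL,                -- e.g. OrderCreated
  TIMESTAMP     BIGINT       NOT NULL,                -- event time in microseconds
  PAYLOAD       VARCHAR(8000)                         -- serialized JSON body
)"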

Conclusion

Hopefully, from this DB2 Change Data Capture with Kafka guide, you’ve been able to nail down the basics of implementing a real-time streaming pipeline that uses a log-based CDC approach to move change events from a DB2 instance to a Kafka topic. Discover how PostgreSQL and DB2 differ in features, pricing, and support to help you choose the best fit for your projects. Find out more at DB2 vs PostgreSQL.

DB2 is a trusted database that a lot of companies use as it provides many benefits, but transferring data from it into a data warehouse can be a tedious task. An automated data pipeline helps solve this issue, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline and has awesome 150+ pre-built integrations that you can choose from.

Hevo can help you integrate your data from numerous sources and load it into a destination to analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure. With transparent pricing and 150+ sources to choose from, Hevo will give you a seamless experience.

Check out Hevo’s 14-day free trial and see the difference!

Share your experience of learning about DB2 change data capture in the comments section below.

Frequently Asked Questions

1. What is CDC in DB2?

CDC (Change Data Capture) in DB2 refers to a technology that captures changes made to data in a database and makes them available for use in downstream applications or processes.

2. How to enable CDC for DB2?

Install CDC software such as IBM InfoSphere Data Replication, configure the source database, define and activate subscriptions, and monitor the CDC environment. Alternatively, as shown in this guide, you can use the open-source Debezium DB2 connector with Kafka Connect.

3. What is CDC used for?

Real-time data integration, data warehousing, auditing, ETL processes, backup and recovery, and event-driven architectures.

Jeremiah
Technical Content Writer, Hevo Data

Jeremiah is a specialist in crafting insightful content for the data industry, and his writing offers informative and simplified material on the complexities of data integration and analysis. He enjoys staying updated on the latest trends and uses his knowledge to help businesses succeed.
