Kafka CDC Postgres: Easy Steps To Set Up Real-Time Sync

on Data Integration, Tutorials • June 17th, 2020

Nowadays, the use of CDC (Change Data Capture) in applications is a must. You don't want to discover that the changes you made to a database were never reflected in your analytics. This can happen when an hourly or nightly sync job fails to propagate changes from one database server to another.

Kafka CDC Postgres: Performing real-time analytics directly on a database system such as PostgreSQL requires big joins and aggregations. In most cases, these queries take locks, because most database systems comply with the ACID properties of transactions and offer strong isolation levels. The locks may stay in place for a long time, degrading performance for the live users of your applications. It therefore makes more sense to stream the data to other teams in your organization so they can perform analytics on it.

This is exactly what this article will show you, using PostgreSQL, Debezium, and Kafka.

Achieving Change Data Capture In Postgres

The two ways to achieve CDC in Postgres are:

Method 1: Using Kafka
First, you have to run instances of PostgreSQL, Zookeeper, and Kafka, create a connector with scripts, and finally establish CDC using the Kafka console consumer. This approach is time-consuming and involves writing custom scripts.

Method 2: Using Hevo
Using Hevo, you can accomplish CDC within minutes. Since Hevo is a fully managed platform, you wouldn’t have to maintain and monitor custom scripts. Sign up for a 14-day free trial now!

Here is a snapshot of what this post covers:

  1. What Is CDC (Change Data Capture)?
  2. An Overview Of Kafka
  3. Where Is Kafka Used?
  4. How Does Kafka Work?
  5. Method 1: Using Kafka
  6. Method 2: Using Hevo
  7. More Reasons To Try Hevo

What Is CDC (Change Data Capture)?

CDC is the process/technology of identifying and capturing the changes that have been made to a database. The changes can then be applied to some other data repository. 

The changes can also be made available in a format that can be consumed by data integration tools during the ETL (Extract, Transform, and Load) process. 

It reduces the resources required for ETL by using the binary log of the source database, or trigger functions, to ingest only the data that has changed since the last ETL operation, rather than the entire database.
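As a quick illustration of log-based CDC in PostgreSQL itself, the pg_recvlogical utility can stream decoded changes out of the write-ahead log. This is a sketch only: it assumes a local server with wal_level=logical, uses the built-in test_decoding output plugin, and the slot name cdc_demo is made up.

```shell
# Create a logical replication slot that decodes WAL entries with the
# built-in test_decoding plugin (slot name "cdc_demo" is arbitrary).
pg_recvlogical -d postgres --slot cdc_demo --create-slot -P test_decoding

# Stream decoded changes to stdout; INSERT/UPDATE/DELETE statements run
# in another session will appear here as textual change records.
pg_recvlogical -d postgres --slot cdc_demo --start -f -
```

Debezium builds on this same logical-decoding mechanism, but publishes the changes to Kafka instead of a terminal.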

An Overview Of Kafka

Apache Kafka is a distributed streaming platform. Any streaming application is characterized by the following capabilities:

  1. It can publish and subscribe to a stream of records, just like an enterprise messaging system or a message queue. 
  2. It can store a stream of records in a durable and fault-tolerant way. 
  3. It can process a stream of records in real-time. 

Kafka stores data changes in an ordered log. In Kafka, each such log is called a "topic partition". A set of partitions distributed across many nodes can be grouped to form a logical "topic". You can configure topics according to your needs; it is even possible to have each log write to one topic.

Topics can be broken down into partitions to enable parallel processing across nodes. Kafka is open-source and has a large community of developers who contribute new features and updates and offer support to new users. It was designed and developed to run in a distributed environment.

This means that it doesn’t sit on one computer, but it runs across many servers, giving it the ability to leverage additional storage capacity and processing power. 
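Topics and their partitions can be managed with the scripts that ship with Kafka. The following is a sketch, assuming a single broker listening on localhost:9092 (newer Kafka releases; older ones address Zookeeper with --zookeeper instead) and a made-up topic name:

```shell
# Create a topic with 3 partitions (replication factor 1 suits a
# single-broker development setup).
bin/kafka-topics.sh --create --topic inventory-changes \
  --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092

# Show how the partitions are laid out across brokers.
bin/kafka-topics.sh --describe --topic inventory-changes \
  --bootstrap-server localhost:9092
```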

Image Source: https://kafka.apache.org/

Where Is Kafka Used?

Every business wants to remain competitive. One way to achieve this is by analyzing real-time data, which allows for faster insights and quicker response times. With those insights, businesses can make sound decisions about their operations.

Traditionally, data was transmitted across networks in batches, which comes with several limitations, most notably low speed. Thanks to its distributed nature and its streamlined way of managing incoming data, Kafka operates very quickly: large clusters can monitor and respond to millions of changes to a dataset each second. Kafka makes it possible to work with, and react to, streaming data in real time.

How Does Kafka Work?

Apache Kafka reads information from a source and organizes it into "topics". A good example of such a data source is a transactional log in which a supermarket records each sale. Kafka suits a supermarket chain whose branches across the country process thousands of sales each minute. Ingestion is handled by a component called the Producer: the interface between applications and topics, which publishes data to the chosen topics.

Another interface, called the Consumer, reads the topic logs and passes the stored information on to the applications that need it. When you combine it with other big data analytics components, Kafka acts as the "central nervous system", passing data between input applications, storage lakes, and data processing engines.
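The Producer and Consumer interfaces can be tried from the command line with the console clients bundled with Kafka. A sketch, assuming a broker on localhost:9092 and a hypothetical sales topic:

```shell
# Publish one record to the "sales" topic via the console producer.
echo "store-12,apples,3" | bin/kafka-console-producer.sh \
  --topic sales --bootstrap-server localhost:9092

# Read the topic from the beginning via the console consumer.
bin/kafka-console-consumer.sh --topic sales --from-beginning \
  --bootstrap-server localhost:9092
```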

Method 1: Postgres CDC Using Kafka

Before getting into the details, let us look at what you will cover here:

Prerequisites

To demonstrate the concept of CDC in PostgreSQL, you will need the following:

  • Apache Kafka: This will create a topic to which data changes can be published as they happen. Find all the resources to set it up here.
  • Kafka Connect: This will be used to stream data between PostgreSQL and Apache Kafka. You will use it to define connectors that move data into and out of Kafka. 
  • Debezium: This will convert WALs (write-ahead logs) into a data stream. 

You will then use the Kafka Connect API to stream data from the database into Kafka. 

An Example For Kafka CDC Postgres

This section shows you how to implement CDC using PostgreSQL, Kafka, and Debezium. Ensure that you have Docker installed, then follow the steps given below:

Step 1: Start An Instance Of PostgreSQL
Step 2: Start An Instance Of Zookeeper
Step 3: Start An Instance Of Kafka
Step 4: Start An Instance Of Debezium
Step 5: Establish A Connection To PostgreSQL
Step 6: Create A Database And Table
Step 7: Create A Connector
Step 8: Check Whether The Connector Was Created
Step 9: Launch Kafka Console Consumer And Watch Changes

Step 1: Start An Instance Of PostgreSQL

Run the following command:

docker run --name postgres -p 5000:5432 debezium/postgres

The above command starts a PostgreSQL container on your system. 

The --name option sets the name of the container, postgres. The -p option maps host port 5000 to the container's default PostgreSQL port 5432, so the database will be reachable on port 5000. 

Step 2: Start An Instance Of Zookeeper

Just run the following command:

docker run -it --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper

The command starts Zookeeper on your system, which stores the consumer offsets. 

The -it option can also be written as -i -t. The -i option tells Docker that you need an interactive session, and the -t option that you need a tty (teletypewriter) attached, showing terminal-related information. The -p options specify the ports used by the Zookeeper instance. 

Step 3: Start An Instance Of Kafka

Use the following command:

docker run -it --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka

The -it option tells Docker that you need an interactive session with a tty (teletypewriter) attached, showing terminal-related information. The --name option sets the name of the instance, the -p option specifies the port on which the instance will run, and the --link option links the two containers. 

Step 4: Start An Instance Of Debezium

Simply run the following command:

docker run -it --name connect -p 8083:8083 -e GROUP_ID=1 \
  -e CONFIG_STORAGE_TOPIC=my-connect-configs \
  -e OFFSET_STORAGE_TOPIC=my-connect-offsets \
  -e ADVERTISED_HOST_NAME=$(echo $DOCKER_HOST | cut -f3 -d'/' | cut -f1 -d':') \
  --link zookeeper:zookeeper --link postgres:postgres --link kafka:kafka debezium/connect

The above command achieves a lot. Other than starting a Debezium Kafka Connect instance, it links the existing postgres, kafka, and zookeeper containers together so that they can communicate with each other. 
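The ADVERTISED_HOST_NAME expression is just shell text processing. As a sketch (the address below is a made-up example), here is how the cut pipeline extracts the bare host address from a DOCKER_HOST-style URL:

```shell
# Splitting on '/' takes field 3 ("192.168.99.100:2376"); splitting that
# on ':' takes field 1, leaving only the host address.
DOCKER_HOST_EXAMPLE="tcp://192.168.99.100:2376"
ADVERTISED=$(echo "$DOCKER_HOST_EXAMPLE" | cut -f3 -d'/' | cut -f1 -d':')
echo "$ADVERTISED"   # prints 192.168.99.100
```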

Step 5: Establish A Connection To PostgreSQL

You can connect to the PostgreSQL database by running the following command:

psql -h localhost -p 5000 -U postgres

Step 6: Create A Database And Table 

First, let’s create the database:

CREATE DATABASE inventory_db;

Connect to the new database with \c inventory_db. The table will have two columns, id (SERIAL data type) and name (VARCHAR data type), with the id column as the primary key. 

To create the table, run the following command:

CREATE TABLE my_table(id SERIAL PRIMARY KEY, name VARCHAR);

Congratulations! Your setup is now ready. Next, you will register a connector with Kafka Connect.

Step 7: Create A Connector

Here is the code to register a connector with Kafka Connect:

curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
  "name": "inventory_db-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "inventory_db",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory_db",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}'

The code creates a connector named inventory_db-connector, with all of its parameters set: the hostname, port, user, password, database name, and so on. Let's check whether the connector was created successfully!

Step 8: Check Whether The Connector Was Created

Use the following command:

curl -X GET -H "Accept:application/json" localhost:8083/connectors/inventory_db-connector
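Kafka Connect's REST API also exposes a status endpoint, which is handy if the connector exists but no events are showing up; a healthy connector reports its tasks in a RUNNING state:

```shell
# Inspect the connector's state and the state of its tasks.
curl -s -H "Accept:application/json" \
  localhost:8083/connectors/inventory_db-connector/status
```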

Now that you have a ready connector, you can start a Kafka console and begin to watch the changes.

Step 9: Launch Kafka Console Consumer And Watch Changes 

Just run the command given below:

docker run -it --name watcher --rm --link zookeeper:zookeeper debezium/kafka watch-topic -a -k dbserver1.public.my_table

And you’re done!
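To see change events flow end to end, open a second terminal and modify the monitored table; each statement should cause the watcher above to print a change record for the dbserver1.public.my_table topic:

```shell
# Insert and then update a row in the monitored table.
psql -h localhost -p 5000 -U postgres -d inventory_db \
  -c "INSERT INTO my_table(name) VALUES ('alpha');"
psql -h localhost -p 5000 -U postgres -d inventory_db \
  -c "UPDATE my_table SET name = 'beta' WHERE id = 1;"
```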

This is what you’ve learnt in this article:

  • You launched instances of PostgreSQL, Zookeeper, Kafka, and Debezium. 
  • You connected to the PostgreSQL database and created a new database for monitoring. 
  • You created a connector using the Kafka Connect. 
  • You verified whether the connector was created successfully. 
  • You launched a Kafka Console Consumer for watching changes as they happen in real-time. 

Method 2: Postgres CDC Using Hevo

Hevo, a No-code Data Pipeline, can achieve CDC in minutes. You do not need working knowledge of any of the previously mentioned tools such as Zookeeper and Kafka. If you want to automate CDC for your PostgreSQL database, Hevo is the best product for you. It also offers pre-built integrations with 100+ data sources.


More Reasons To Try Hevo

  • Simplicity: Hevo has an intuitive user interface, and a data transfer can be set up in just a few minutes.
  • Scalability: Hevo handles data from a wide variety of sources like databases, analytics applications, and more at any scale. Thus, Hevo can help you scale, to meet your accelerating growth demands.
  • Reliable Data Load: Hevo ensures that your data migrations are reliable and done with negligible loss through its fault-tolerant architecture.
  • Real-Time: Hevo has a real-time streaming architecture that enables you to instantly move your data and thus gain real-time insights. 
  • Minimal Setup: Hevo is automated and completely managed. This ensures that it requires minimal effort on your part to set up the platform. 
  • 24*7 Support: Hevo’s team is available for 24*7 support over email, chat, and calls.

Give Hevo a try by signing up for a free 14-day trial!

Let us know about your experience with Kafka CDC Postgres in the comments section below.

No-code Data Pipeline for PostgreSQL