Nowadays, the use of CDC (Change Data Capture) in applications is a MUST.
Of course, you don’t want to hear that the changes that you made to the database were not reflected in the analytics. This can result from a failure by an hourly or nightly sync job to pull or push the changes made to data on one database server from/to another database server.
- Kafka CDC Postgres: To perform real-time data analytics on database systems such as PostgreSQL, big joins and aggregations are needed.
- In most cases, this creates locks because most database systems comply with the ACID properties of transactions and offer good isolation levels.
- The locks may stay in place for long, reducing the performance of the applications in use by live users. Thus, it will make more sense for you to stream data to other organization teams to perform analytics on it.
Prerequisites
- Apache Kafka: This will create a topic to which data changes can be published as they happen.
- Kafka Connect: This will be used to stream data between PostgreSQL and Apache Kafka. You will use it to define connectors that will help us move data from PostgreSQL in and out of Kafka.
- Debezium: This will help us convert WALs (write-ahead logs) into a data stream.
What Is CDC (Change Data Capture)?
- CDC is the process/technology of identifying and capturing the changes that have been made to a database. The changes can then be applied to some other data repository.
- The changes can also be made available in a format that can be consumed by data integration tools during the ETL (Extract, Transform, and Load) process.
- It reduces the resources that are required for the ETL processes by using the binary log of the source database or trigger functions to ingest the only data that has changed since the last ETL operation rather than the entire database.
An Overview Of Kafka
Any streaming application is characterized by the following capabilities:
- It can publish and subscribe to a stream of records, just like an enterprise messaging system or a message queue.
- It can store a stream of records in a durable and fault-tolerant way.
- It can process a stream of records in real-time.
Methods to Move Data From Postgres CDC Using Kafka
Step 1: Start An Instance Of PostgreSQL
Run the following command:
docker run — name postgres -p 5000:5432 debezium/postgres
The above command will start the PostgreSQL database on our system.
The -name option helps us specify the name of the database, which is Postgres. The -p option helps us specify the port on which the database will be running. The database will be running on port number 5000.
Step 2: Start An Instance Of Zookeeper
Just run the following command:
docker run -it — name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
The command will start Zookeeper on our system for storage of offsets from the consumer.
The -it option can also be written as -i -t. The -i option tells docker that you need an interactive session. The -t option tells docker that you need a tty (teletypewriter) attached, showing terminal-related information. The -p option helps us specify the ports that will be used by the Zookeeper instance.
Step 3: Start An Instance Of Kafka
Use the following command:
docker run -it — name kafka -p 9092:9092 — link zookeeper:zookeeper debezium/kafka
The -it option tells docker that you need an interactive session with tty (teletypewriter) attached, showing terminal-related information. The -name option helps us specify the name of the instance. The -p option helps us specify the port on which the instance will be running. The –link option helps us link the two containers.
Step 4: Start An Instance Of Debezium
Simply run the following command:
docker run -it — name connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my-connect-configs -e
OFFSET_STORAGE_TOPIC=my-connect-offsets -e
ADVERTISED_HOST_NAME=$(echo $DOCKER_HOST | cut -f3 -d’/’ | cut -f1 -d’:’) — link zookeeper:zookeeper — link postgres:postgres — link kafka:kafka debezium/connect
The above command helps us achieve a lot. Other than starting an instance of Debezium, you have used this instance to link our existing containers together. The containers are Postgres, Kafka, and Zookeepers. The purpose of linking the containers is to have them communicate with each other.
Step 5: Establish A Connection To PostgreSQL
You can connect to the PostgreSQL database by running the following command:
psql -h localhost -p 5000 -U postgres
Step 6: Create A Database And Table
First, let’s create the database:
CREATE DATABASE inventory_db;
The table will have two columns, id (SERIAL data type) and name (VARCHAR data type). The id column will be the primary key for the table.
To create the table, run the following command:
CREATE TABLE my_table(id SERIAL PRIMARY KEY, name VARCHAR);
Congratulations! You now have a ready setup. Next, you will be creating a connector to the Kafka connector.
Step 7: Create A Connector
Here is the code to help you register a connector to the Kafka Connect.
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
"name": "inventory_db-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname" : "inventory_db",
"database.server.name": "dbserver1",
"database.whitelist": "inventory_db",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "schema-changes.inventory"
}
}'
The code will create a connector and give it the name inventory_db-connector. All the parameters for the connector have been set, including the hostname, port, user, password, database name, etc. Let’s check whether the connector was created successfully!
Step 8: Check Whether The Connector Was Created
Use the following command:
curl -X GET -H "Accept:application/json" localhost:8083/connectors/inventory_db-connector
Now that you have a ready connector, you can start a Kafka console and begin to watch the changes.
Step 9: Launch Kafka Console Consumer And Watch Changes
Just run the command given below:
docker run -it — name watcher — rm — link zookeeper:zookeeper debezium/kafka watch-topic -a -k dbserver1.public.my_table
And you’re done!
Method 2: Postgres CDC Using Hevo
- Hevo Data, a No-code Data Pipeline can achieve CDC easily in minutes. You would not need a working knowledge of any of the previously mentioned applications like Zookeeper and Kafka. If you want to automate the process of the CDC in your PostgreSQL database, Hevo is the best product for you. They also offer pre-built integration to 150+ data sources.
- Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
More Reasons To Try Hevo
- Simplicity: Hevo has a comfortable user interface. Using Hevo for your data transfer ensures that it is done in a few minutes.
- Scalability: Hevo handles data from a wide variety of sources like databases, analytics applications, and more at any scale. Thus, Hevo can help you scale, to meet your accelerating growth demands.
- Reliable Data Load: Hevo ensures that your data migrations are reliable and done with negligible loss through its fault-tolerant architecture.
- Real-Time: Hevo has a real-time streaming architecture that enables you to instantly move your data and thus gain real-time insights.
- Minimal Setup: Hevo is automated and completely managed. This ensures that it requires minimal effort on your part to set up the platform.
- 24*7 Support: Hevo’s team is available for 24*7 support over email, chat, and calls.
Conclusion
This is what you’ve learned in this article:
- You launched instances of PostgreSQL, Zookeeper, Kafka, and Debezium.
- You connected to the PostgreSQL database and created a new database for monitoring.
- You created a connector using Kafka Connect.
- You verified whether the connector was created successfully.
- You launched a Kafka Console Consumer for watching changes as they happen in real-time.
Nicholas Samuel is a technical writing specialist with a passion for data, having more than 14+ years of experience in the field. With his skills in data analysis, data visualization, and business intelligence, he has delivered over 200 blogs. In his early years as a systems software developer at Airtel Kenya, he developed applications, using Java, Android platform, and web applications with PHP. He also performed Oracle database backups, recovery operations, and performance tuning. Nicholas was also involved in projects that demanded in-depth knowledge of Unix system administration, specifically with HP-UX servers. Through his writing, he intends to share the hands-on experience he gained to make the lives of data practitioners better.