Setting Up the Debezium PostgreSQL Connector Simplified: A Comprehensive Guide 101

• January 25th, 2022

Debezium PostgreSQL

Debezium PostgreSQL Connector allows organizations to monitor changes in databases and create triggers for other applications. Although it supports one task, the PostgreSQL connector is widely used for building highly reliable applications due to the flexibility of data processing it offers. Similar to other Debezium connectors, PostgreSQL connectors take snapshots and then determine the changes in the databases. The changes are then sent to Kafka topics, which are consumed by applications.

Upon a complete walkthrough of this article, you will gain a decent understanding of PostgreSQL and Debezium. You will also learn about the working of the Debezium PostgreSQL connector. Read along!

Table of Contents

Prerequisites

  • Basic understanding of Event Streams.
  • Basic understanding of Kafka and ZooKeeper Services

What is Debezium?

Debezium PostgreSQL: Debezium logo| Hevo Data
Image Source

Debezium is a Change Data Capture (CDC) tool that uses a log-based method to track changes in databases. The primary use of Debezium is to monitor such changes and store them to a different destination. In the previous databases, Database Administrators used to have access to such changes, and they used to save the changes to the source control system manually. But now, through Debezium, you can use connectors to keep track of changes and store them in different locations by replication. Replication is nothing but copying or moving the data of one central database to another. However, in Debezium, replication is carried out through connectors as it establishes the connection to its source databases.

What is PostgreSQL?

PostgreSQL is an advanced and Open-Source Object-Relational Database Management System. It is the extension of the SQL along with some advanced features. Today, PostgreSQL functions as both relational and non-relational queries. It has helped developers build applications while protecting data integrity — many websites as well as mobile applications leverage PostgreSQL for the flexibility and reliability of digital solutions. 

What are Debezium Connectors?

The main goal of Debezium connectors is to capture changes from the databases and produce events. Therefore, it becomes easier for applications to react to changes very quickly. These connectors then publish changes to Kafka. Kafka is an open-sourced and well-known Event Streaming Platform. When Debezium connectors are deployed to the Kafka cluster, they monitor databases for new changes. After monitoring, they also write events into Kafka.

These events are then independently consumed by the different applications. Kafka acts as a distributed system, ensuring all the connectors are running and configured properly. If any Kafka Connect endpoints in the Kafka cluster go down, the remaining Kafka connect endpoints will restart the previously running connector on the terminated endpoint. Consequently, it provides high fault tolerance and scalability to applications.

But, it’s possible that not every application needs such type of fault tolerance or scalability. As a result, such applications do not need to rely on the external Kafka clusters or Kafka connect services. This is where embedded Debezium connectors are used directly. Whenever there is any data change, the connectors will inform applications rather than the Kafka clusters. 

Supercharge PostgreSQL ETL & Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as PostgreSQL, Google Search Console, Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 40+ free data sources) and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get started with hevo for free

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-day free trial!

Getting Started with Debezium PostgreSQL Connector

The Debezium PostgreSQL connector mainly captures the row-level changes in the PostgreSQL databases. Initially, when the Debezium is connected to the PostgreSQL database, the Debezium PostgreSQL connector takes a continuous snapshot of all the schemas. Schemas are the basic framework of the databases. When the snapshots are entirely captured, the connectors notice all the row-level changes like insert, delete and update committed to the PostgreSQL database. The Debezium PostgreSQL connector generates the change data events and sends them to Kafka topics, which applications can consume for further tasks.

However, before connecting the Debezium PostgreSQL connectors, you need to check the compatible version of Debezium with the PostgreSQL connector. You can check the compatibility at Debezium PostgreSQL connectors.

PostgreSQL consists of two different ways to capture changes in databases:

  • A Logical Decoding Output Plugin: You need to install this output plugin. It should run first before running the PostgreSQL. The plugins can be decoderbufs, wal2json, or pgoutput. You can leverage one of them for your system. 
  • Java Code: Java code reads the changes produced by the logical decoding output plugins. The PostgreSQL JDBC driver uses the streaming replication protocol of PostgreSQL.

PostgreSQL makes use of the WAL (Write-Ahead Log) segments, which consist of the changes to actual data in PostgreSQL. And if the connector stops for any reason, it restarts from the WAL position where it last stopped. 

The most crucial feature that Debezium PostgreSQL provides is security. It is achieved through giving privileges to the users. Instead of providing unauthorized access to the Debezium users, the PostgreSQL connectors provide them with superuser privileges. It means a Debezium replicated user is provided with superuser privileges. For providing such privileges, you can refer to User Privileges.

How to use the Debezium PostgreSQL Connector with the PostgreSQL database?

In this tutorial, you will use the Debezium version 0.10 along with the docker images that can be accessed through GitHub. To start with the Debezium services, you must know about Kafka and the ZooKeeper services. ZooKeeper Services are the open-source services used in clusters in distributed systems. It is used for providing group and naming information. Follow the steps given below to use the Debezium PostgreSQL Connector with the PostgreSQL database:

  • Step 1: Start the ZooKeeper for Kafka to understand configuration information stored in Kafka topics.  The following command starts the ZooKeeper.
docker run -it --rm --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:0.10
  • it: It stands for interactive and is used to connect the input-output of the console to the ZooKeeper container.
  • rm: It is used to remove the container when the container gets stopped.
  • –name zookeeper: It is used to name the container.
  • -p 2181:2181 -p 2888:2888 -p 3888:3888: It is a port used by ZooKeeper to communicate with its functionalities and Kafka.
  • debezium/zookeeper:0.10: It is used to specify the version.
  • Step 2: Open another terminal and type the following command.
docker run -it --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:0.10
  • As mentioned above, the -it and -rm have the same purpose here.
  • –name kafka: It names the container as Kafka.
  • -p 9092:9092: It tells the user that port 9092 is being used for the communication of Kafka and ZooKeeper.
  • –link zookeeper:zookeeper: This command determines the container that it can find ZooKeeper in the ZooKeeper container.

To connect the source database and the connectors, you need a database to create and store the data change events whenever there is any data modification activity on the source database. Use the PostgreSQL database and the Docker container from the Crunchy data container.

To enable logical replication, there is a lot of configuration needed. The Crunchy data container consists of a PostgreSQL file that contains the necessary configuration. When you start the container, it consists of the following configuration.

Image Source: Self
  • Step 3: You need environment variables to pass into the container and create a file. You can create it from the following commands.
Debezium PostgreSQL: Add Highlight Action
Image Source
  • Step 4: From the below command, use the pg-env.list to run the container. It creates environment variables and adds the Postgresql.config file by mounting the local pgconfig directory as a volume in the container.
docker run -it --rm --name=pgsql --env-file=pg-env.list --volume=`pwd`/pgconf:/pgconf -d crunchydata/crunchy-postgres:centos7-11.4-2.4.1
  • Step 5: You need an IP address to run the PostgreSQL in Docker. You can ask Docker the IP address of the container through the command docker inspect pgsql | grep IPAddress. Here, pgsql is the name given to the container. It shows the following result:
  • Step 6: Attach the terminal to the pgsql container and use the label to reference the running container with the following command.
  • Step 7: Run a new container, which has pgsql container in it with the below command.

–link pgsql: It allows the use of the host address of pg11. Add the password PG_PRIMARY_PASSWORD that is specified in the environment variable when the container starts.

  • Step 8: Once you get into Postgre, you need to create tables in databases. You can create it with the following set of queries.
  • Step 9: You also need a database to send the changes to the container. Create it with the following command.
CREATE DATABASE customers;
  • Step 10: Connect the JDBC driver to the container. To join the jdbc sink jar file to the container, you can refer to JDBC Drivers. From the above JDBC directory, run the Dockerfile. It will give the following output:

successfully built 62b583dce71b 
  • Step 11: Once the container is built, use the following command to run it.
docker tag 62b583dce71b jdbc-sink
  • Step 12: Run the Docker through the below command.
  • Step 13: Kafka consists of the endpoints that know which type of connectors are connected in the containers. These are REST endpoints. Hence, use the following commands to see the connectors.

You haven’t created any, so the result is null. You have to create the source connectors. For that, you require JSON to send to the Rest API for the configuration of the source connectors.

  • Step 14: The above command consists of the name of connectors and databases and how you have connected them along with the table name. Connect the table customers to the outside database of postgre from the below command.
Debezium PostgreSQL: Added Location| Hevo Data
Image Source
  • Step 15: Create a JSON file for the configuration of the sink connectors.

The Kafka JDBC sink creates the same destination table name as that of the topic name. Therefore, in this case, it is fulfillment.public.customers.

  • Step 16: Connect the connector with the database through the following command.
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d @jdbc-sink.json
  • Step 17: Use the pgsql container with PostgreSQL database with the below command.

docker run -it --rm --link pgsql:pg11 debezium/postgres:11 psql -h pg11 -U postgres

Now that you are in the PostgreSQL database run some queries and note the changes.

From above, the first query runs in the postgre database of the customer’s table, whereas the second one runs in the customer’s database. After using Debezium, any changes you make in the postgre database will be updated accordingly in the second database, which is the customer database.

From above, the first commands show the customer table on the PostgreSQL database. And the second commands show the customer table on the customer database. Now, any changes made in PostgreSQL will be propagated to the customer table of the second database.

Conclusion

This article gave us a brief on the working of Debezium PostgreSQL connectors. In Debezium, there are two ways of using connectors. The first one is using the Kafka connect clusters which are considered the high fault-tolerant for applications. And the second one is the embedded Debezium connectors that work the same as the first ones, but instead of sending the change in the data events to the clusters, it directly sends it to applications, making applications the least fault-tolerant compared to the first one. 

Furthermore, this article discussed the steps using which you can use the Debezium PostgreSQL connector for the PostgreSQL database.  Integrating and analyzing your data from a diverse set of data sources such as PostgreSQL can be challenging and this is where Hevo Data comes into the picture.

visit our website to explore hevo

Hevo Data provides its users with a simpler platform for integrating data from 100+ sources (including 40+ free sources) such as PostgreSQL for Analysis. It is a No-code Data Pipeline that can help you combine data from multiple sources. You can use it to transfer data from multiple data sources into your Data Warehouse, Database, or a destination of your choice. It provides you with a consistent and reliable solution to managing data in real-time, ensuring that you always have Analysis-ready data in your desired destination.

Give Hevo Data a try and sign up for a 14-day free trial today. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Share your experience of learning about the working of the Debezium PostgreSQL connector. Tell us in the comments below!

No Code Data Pipeline For PostgreSQL