Real-time data synchronization is essential to draw instant insights and stay on top of your data from your ever-growing PostgreSQL databases. This is where CDC comes into the picture.
In this blog, we will delve into the intricacies of Postgres CDC and explore the ins and outs of implementing Change Data Capture (CDC) in Postgres databases. This will enable you to achieve seamless, real-time data replication and keep your systems up-to-date with the latest information.
Table of Contents
What is PostgreSQL?
PostgreSQL is an open-source relational database that is versatile in nature and has advanced features that allow it to be used both as a database and a data warehouse. It adheres to SQL standards and and can be extended through the PgAdmin tool. It is completely free and the strong community support makes it one of the top choices for data integrity, reliability and data management.
Looking for the best ETL tools to connect your PostgreSQL? Rest assured, Hevo’s no-code platform helps streamline your ETL process. Try Hevo and equip your team to:
- Integrate data from 150+ sources(60+ free sources).
- Utilize drag-and-drop and custom Python script features to transform your data.
- Risk management and security framework for cloud-based systems with HIPAA, GDPR, SOC2 Compliance.
Try Hevo and discover why 2000+ customers have chosen Hevo over tools like AWS DMS to upgrade to a modern data stack.
Try Hevo Postgres CDC for FreeIntroduction to Postgres CDC
- PostgreSQL is one of the most widely used open-source relational databases. Its comprehensive querying layer and technical standards compliance make it a favorite for a variety of use cases ranging from OLTP workloads to data warehouse and analytics use cases. A challenge here is ensuring that the data warehouse has the latest version of data in the transactional database.
- It is common to have time-critical reporting requirements that cannot afford hourly or daily batch sync between these databases. Based on the change data capture paradigm, such requirements are handled using continuous sync between the instances.
- In databases, change data capture (CDC) is a set of software design patterns used to determine and track the changed data so that action can be taken using the changed data. PostgreSQL has in-built functionality for CDC. This post is about Postgres CDC and the ways to achieve this.
What is the need for Postgres CDC?
- Real-time and accurate data synchronization is essential for enterprises handling large volumes of data.
- Batch sync limitations can lead to outdated information, errors, and hindered decision-making.
- Postgres CDC methods track and manage database changes in real time.
- Continuous replication ensures up-to-date data is always available for reporting and analytics.
- Avoids batch processing delays, improving operational efficiency.
- Enhances data-driven decision-making by ensuring accuracy and timeliness.
How does Postgres CDC work?
Change Data Capture (CDC) is a methodology employed in PostgreSQL (and other databases) to track and propagate changes in data and its structure. It identifies, captures, and delivers modifications made to the source data, enabling organizations to leverage the updated information. By doing so, Postgres CDC facilitates rapid and efficient data transfers to data warehouses, where the changes can be analyzed and utilized.
The core purpose of CDC in PostgreSQL is to ensure that multiple databases remain synchronized and updated with the latest changes occurring in the PostgreSQL database. The Postgres CDC process broadly encompasses three key steps:
- Capturing data changes: CDC monitors and records any inserts, updates, or deletions made to the data in the PostgreSQL database.
- Transforming the change data: The captured changes are transformed into a format compatible with the target system, ensuring seamless integration.
- Loading the data: The transformed change data is then loaded into the target database or data warehouse, keeping it aligned with the source PostgreSQL database.
Benefits of CDC in PostgreSQL
- Enables quick, informed business decisions based on latest, updated data.
- Reduces data transfer costs by sending only latest changes, not entire datasets.
- Minimizes impact on Postgres database performance by creating copies refreshed with small change volumes.
Postgres Change Data Capture Using Custom Code
Approach 1: Using Kafka for Postgres CDC
Step 1.1: Start An Instance Of PostgreSQL
docker run — name postgres -p 5000:5432 debezium/postgres
The above command will start the PostgreSQL database on our system.
The -name option helps us specify the name of the database, which is Postgres. The -p option helps us specify the port on which the database will be running. The database will be running on port number 5000.
Step 1.2: Start An Instance Of Zookeeper
Just run the following command:
docker run -it — name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
The command will start Zookeeper on our system for storage of offsets from the consumer.
The -it option can also be written as -i -t. The -i option tells docker that you need an interactive session. The -t option tells docker that you need a tty (teletypewriter) attached, showing terminal-related information. The -p option helps us specify the ports that will be used by the Zookeeper instance.
Step 1.3: Start An Instance Of Kafka
Use the following command:
docker run -it — name kafka -p 9092:9092 — link zookeeper:zookeeper debezium/kafka
The -it option tells docker that you need an interactive session with tty (teletypewriter) attached, showing terminal-related information. The -name option helps us specify the name of the instance. The -p option helps us specify the port on which the instance will be running. The –link option helps us link the two containers.
Step 1.4: Start An Instance Of Debezium
Simply run the following command:
docker run -it — name connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my-connect-configs -e
OFFSET_STORAGE_TOPIC=my-connect-offsets -e
ADVERTISED_HOST_NAME=$(echo $DOCKER_HOST | cut -f3 -d’/’ | cut -f1 -d’:’) — link zookeeper:zookeeper — link postgres:postgres — link kafka:kafka debezium/connect
The above command helps us achieve a lot. Other than starting an instance of Debezium, you have used this instance to link our existing containers together. The containers are Postgres, Kafka, and Zookeepers. The purpose of linking the containers is to have them communicate with each other.
Step 1.5: Establish A Connection To PostgreSQL
You can connect to the PostgreSQL database by running the following command:
psql -h localhost -p 5000 -U postgres
Step 1.6: Create A Database And Table
First, let’s create the database:
CREATE DATABASE inventory_db;
The table will have two columns: id (SERIAL data type) and name (VARCHAR data type). The id column will be the primary key for the table.
To create the table, run the following command:
CREATE TABLE my_table(id SERIAL PRIMARY KEY, name VARCHAR);
Congratulations! You now have a ready setup. Next, you will be creating a connector to the Kafka connector.
Step 1.7: Create A Connector
Here is the code to help you register a connector to the Kafka Connect.
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
"name": "inventory_db-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname" : "inventory_db",
"database.server.name": "dbserver1",
"database.whitelist": "inventory_db",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "schema-changes.inventory"
}
}'
The code will create a connector and give it the name inventory_db-connector. All the parameters for the connector have been set, including the hostname, port, user, password, database name, etc. Let’s check whether the connector was created successfully!
Step 1.8: Check Whether The Connector Was Created
Use the following command:
curl -X GET -H "Accept:application/json" localhost:8083/connectors/inventory_db-connector
Now that you have a ready connector, you can start a Kafka console and begin to watch the changes.
Step 1.9: Launch Kafka Console Consumer And Watch Changes
Just run the command given below:
docker run -it — name watcher — rm — link zookeeper:zookeeper debezium/kafka watch-topic -a -k dbserver1.public.my_table
And you’ve performed kafka postgres cdc!
Approach 2: Postgres CDC using Postgres Audit Triggers
Trigger-based methods involve creating audit triggers on the database to capture all the events related to INSERT
, UPDATE
, and DELETE
methods. The advantage of this method is that everything happens at the SQL level, and the developer just needs to read a separate table that contains all the audit logs.
To create a trigger in the Postgres table execute the below command.
SELECT audit.audit_table('target_table');
This statement is to be executed with a user role with permission to access the audit schema. The app which writes to the tables must not be connecting to the database using a superuser role.
It is possible to turn off these audit logs in case of specific fields if the requirement does not need to track fields that do not change frequently.
What we have then is a table that captures the audit log. The developer still needs to write a program to read and parse these logs and push these into the target databases. The complexity of this logic will also depend on the target database or data warehouse to which data is being pushed.
Limitations of Using Postgres Audit Triggers for Postgres CDC
- A downside with this method is that having triggers affects the performance of the database.
- A common practice to avoid this performance hit is to have a separate table which tracks the main table and have triggers on the second table.
- The syncing between master and the secondary table can be done using Postgres logical replication feature.


Approach 3: Postgres CDC using Logical Decoding
Logical decoding uses the content of the write-ahead log to create a log of activities happening in the database. Write ahead log is an internal log that describes database changes on a storage level. The advantage of this approach is that it does not affect the performance of the database in any way.
The approach works based on the installation of an output plugin. To enable the use of logical decoding the following parameters need to be set in the Postgres configuration.
wal_level = logical
max_replication_slots = 10
Once these parameters are set, execute the following command to create a logical decoding slot.
SELECT * FROM pg_create_logical_replication_slot('slot_repl', 'decode_test');
This will ensure that every update to the tables will create audit events that can be accessed view select statements. Logical decoding supports two kinds of select statements to access these events. The below command can be used to take a peek at the events,
SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
To consume or get the results, use the below command.
SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);
The difference between ‘consume’ and ‘peek’ is that the second command removes the events once it is consumed. Let’s try to understand this with an example.
Let’s say there is a student table and we insert an entry into this table.
Insert into students (id, name, age) values (1, ‘Peter’, '30');
Once we execute the peek command, the result will be as follows.
SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102
Again, if we execute the same command, you will still see the same audit events.
SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102
Now let’s execute the get command.
SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102
The results are the same. But if you execute the same command or the peek command again, you will find that there are zero results.
SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);
(0 rows)
This means the results are served and deleted when get command is executed, greatly enhancing our ability to write the logic for using these events to create a replica of the table.
Postgres also supports the use of output plugins – which are basically c programs that can execute based on the INSERT, UPDATE and DELETE events and transform the events into a more user-friendly format. Wal2json is one such plugin that can output these events in JSON format. This plugin is widely used by most of the popular Postgres CDC add-ons like Debezium.
A point to note here is that these events will not capture the table creation and modification steps.
The downside of Using Postgres Logical Decoding for Postgres CDC
- In this case, the developers also need to write an elaborate logic to process these events and then transform them into statements for the target database.
- Depending on the use case you are trying to solve, this can increase the project timelines.
Approach 4: Postgres CDC using a Timestamp Column
The Postgres engine provides the above two methods to implement CDC. There is also a slightly convoluted custom method if you have the flexibility of a timestamp column in your table. This means the developer will need to query the table periodically and monitor the changes to the timestamp column. When changes are detected, the script can create appropriate database statements to write these changes to the target database. This approach is, however, effort-intensive and would demand a lot of time and effort from the developer.
Limitations of Using This Approach
- This custom method of implementing CDC by monitoring a timestamp column has several limitations.
- It requires the developer to manually query the table at regular intervals, which can be resource-intensive and prone to inefficiencies.
- Since it depends on the accuracy of the timestamp, any clock misalignment or inconsistency in timestamp updates could lead to missed or incorrect changes.
- Additionally, it significantly burdens the developer to write and maintain custom scripts, increasing both development time and the potential for errors.
Approach 5: Postgres CDC Using Queries
This approach is based on running periodic SQL queries against the source database to detect changes. It can be implemented in two ways:
- Direct Queries for Delta Changes:
- Query the database periodically to fetch records with changes based on specific conditions, such as a change flag or an updated timestamp.
- Example query for delta changes:
SELECT * FROM target_table
WHERE last_modified > NOW() - INTERVAL '5 minutes';
- Snapshot Comparison:
- Take periodic snapshots of the data and compare them with the previous snapshot to detect
INSERT
,UPDATE
, orDELETE
operations. - This approach is resource-intensive as it involves maintaining and comparing large datasets.
- Take periodic snapshots of the data and compare them with the previous snapshot to detect
- Advantages:
- Simple to implement for small datasets or low-frequency updates.
- No requirement for database-level configurations or plugins.
- Limitations:
- Performance overhead due to frequent querying of the database.
- Greater latency to detect changes than real-time CDC methods.
- Not scalable for high-transaction environments because of resource consumption.
Approach 6: Postgres CDC Using Write-Ahead Logs (WAL)
- The WAL method uses the internal mechanism of Postgres to log every change made in the database before committing it to the main database files.
- This method is very reliable and efficient for capturing database changes.
How it Works:
- WAL maintains a log of all INSERT, UPDATE, DELETE, and DDL statements in a sequential file.
- Tools or custom code can be used to read and process these logs for CDC purposes.
Implementation:
- Enable Archiving: Set the wal_level parameter to archive or higher in the Postgres configuration.
wal_level = logical
archive_mode = on
archive_command = 'cp %p /path_to_wal_archive/%f
- Read WAL Files: Use tools or libraries, such as pg_waldump or custom scripts, to parse WAL files and extract changes.
- Example Tool:
- Debezium: This is a popular tool that uses Postgres WAL to provide a real-time stream of change events.
- Advantages:
- High performance because WAL works at the storage layer.
- A complete record of changes ensures reliability.
- Change detection in real time with minimal latency.
- Limitations:
- Complex to implement without third-party tools.
- Takes up space to store the WAL files, especially when high transaction.
- Does not automatically capture schema changes such as table creation or modification.
Conclusion
In a nutshell, even though Postgres provides support for continuous syncing through triggers, logical decoding or custom logic, it is still upon the developers to capture these events and convert them to the target database. According to use cases, this logic must be written explicitly for different target databases. An alternative is a cloud-based service like Hevo which can use Postgres CDC to sync data to most destinations continuously.
Sign up for Hevo’s 14-day free trial and experience seamless data migration.
Frequently Asked Questions
1. How does CDC affect database performance?
Change Data Capture (CDC) can impact database performance by increasing memory usage, slowing down DML operations, raising CPU consumption, and consuming more disk space. To minimize these effects, enable CDC selectively, manage retention periods, allocate sufficient resources, and monitor performance closely.
2. How to check if CDC is enabled in Postgres?
To check if CDC is enabled in PostgreSQL, run:
SELECT * FROM pg_replication_slots WHERE slot_type = ‘logical’;
If entries appear, logical replication (often used for CDC) is active. You can also check for replication publications with:
SELECT * FROM pg_publication;
If found, CDC is likely enabled.
3. How to setup CDC for Postgres?
To set up CDC for PostgreSQL:
1.Enable logical replication in the PostgreSQL configuration.
2.Create a logical replication slot.
3.Capture database changes using logical decoding.
4.Optionally, use an output plugin for easier data handling.