Postgres CDC: 5 Easy Methods to Capture Data Changes

Keeping data in sync across systems can be a real challenge, especially when your PostgreSQL database is constantly updating with new inserts, updates, and deletes. Without an efficient Change Data Capture (CDC) mechanism, you might find yourself dealing with data inconsistencies, slow performance from constant polling, and frustrating delays in analytics or real-time processing. By setting up the right CDC strategy, you can track changes as they happen and replicate data smoothly, without putting unnecessary load on your database.

PostgreSQL gives you multiple ways to implement CDC, and in this blog, we’ll walk through six of the most popular ones. We’ll start with Kafka-Based CDC, a powerful option for real-time streaming, then look at Audit Triggers, which log every change using Postgres triggers. We’ll also cover Logical Decoding, which reads changes from Write-Ahead Logs (WAL), and Timestamp-Based CDC, a simpler approach that tracks updates using a dedicated column. For cases where those aren’t ideal, we’ll explore Query-Based CDC, which polls the database for modifications, and WAL-Based CDC, the most reliable method for capturing low-level changes. Each Postgres CDC approach has its strengths and trade-offs, and by the end of this guide, you’ll have a clear idea of which one fits your needs best.

Table of Contents

What is Postgres CDC?

Postgres CDC refers to the process of identifying, capturing, and delivering data changes (inserts, updates, deletes) in a PostgreSQL database to target systems in real-time or near real-time. It enables efficient data synchronization, replication, and real-time analytics by tracking changes at the source and transmitting them downstream.

Postgres CDC can be implemented using various methods, including triggers, queries with timestamp columns, logical replication, and reading transactional logs (WAL). Each method balances factors like latency, system impact, and complexity, these methods will be discussed in detail later.

Looking for the best ETL tools to connect to your PostgreSQL? Rest assured, Hevo’s no-code platform helps streamline your ETL process. Try Hevo and equip your team to:

Integrate data from 150+ sources(60+ free sources).
Utilize drag-and-drop and custom Python script features to transform your data.
Risk management and security framework for cloud-based systems with HIPAA, GDPR, SOC2 Compliance.

Try Hevo and discover why 2000+ customers have chosen Hevo over tools like AWS DMS to upgrade to a modern data stack.

Try Hevo Postgres CDC for Free

What is the need for Postgres CDC?

Real-time and accurate data synchronization is essential for enterprises handling large volumes of data.
Batch sync limitations can lead to outdated information, errors, and hindered decision-making.
Postgres CDC methods track and manage database changes in real time.
Continuous replication ensures up-to-date data is always available for reporting and analytics.
Avoids batch processing delays, improving operational efficiency.
Enhances data-driven decision-making by ensuring accuracy and timeliness.

How does Postgres CDC work?

Change Data Capture (CDC) is a methodology employed in PostgreSQL (and other databases) to track and propagate changes in data and its structure. It identifies, captures, and delivers modifications made to the source data, enabling organizations to leverage the updated information. By doing so, Postgres CDC facilitates rapid and efficient data transfers to data warehouses, where the changes can be analyzed and utilized.

The core purpose of CDC in PostgreSQL is to ensure that multiple databases remain synchronized and updated with the latest changes occurring in the PostgreSQL database. The Postgres CDC process broadly encompasses three key steps:

Capturing data changes: CDC monitors and records any inserts, updates, or deletions made to the data in the PostgreSQL database.
Transforming the change data: The captured changes are transformed into a format compatible with the target system, ensuring seamless integration.
Loading the data: The transformed change data is then loaded into the target database or data warehouse, keeping it aligned with the source PostgreSQL database.

Integrate PostgreSQL to Snowflake

Get a Demo Try it

Integrate PostgreSQL to Redshift

Get a Demo Try it

Integrate PostgreSQL on Amazon RDS to BigQuery

Get a Demo Try it

Benefits of CDC in PostgreSQL

The following are the key benefits of CDC in PostgreSQL:

By enabling change tracking, it allows quick, informed business decisions based on the latest, updated data.
Reduces data transfer costs by sending only the latest changes, not entire datasets.
Minimizes impact on Postgres database performance by creating copies refreshed with small change volumes.

Postgres Change Data Capture Using Custom Code

Approach 1: Using Kafka for Postgres CDC

Step 1.1: Start An Instance Of PostgreSQL

Run the following command:

docker run — name postgres -p 5000:5432 debezium/postgres

The above command will start the PostgreSQL database on our system.

The -name option helps us specify the name of the database, which is Postgres. The -p option helps us specify the port on which the database will be running. The database will be running on port number 5000.

Step 1.2: Start An Instance Of Zookeeper

Just run the following command:

docker run -it — name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper

The command will start Zookeeper on our system for storage of offsets from the consumer.

The -it option can also be written as -i -t. The -i option tells docker that you need an interactive session. The -t option tells docker that you need a tty (teletypewriter) attached, showing terminal-related information. The -p option helps us specify the ports that will be used by the Zookeeper instance.

Step 1.3: Start An Instance Of Kafka

Use the following command:

docker run -it — name kafka -p 9092:9092 — link zookeeper:zookeeper debezium/kafka

The -it option tells docker that you need an interactive session with tty (teletypewriter) attached, showing terminal-related information. The -name option helps us specify the name of the instance. The -p option helps us specify the port on which the instance will be running. The –link option helps us link the two containers.

Step 1.4: Start An Instance Of Debezium

Simply run the following command:

docker run -it — name connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my-connect-configs -e 
OFFSET_STORAGE_TOPIC=my-connect-offsets -e 
ADVERTISED_HOST_NAME=$(echo $DOCKER_HOST | cut -f3 -d’/’ | cut -f1 -d’:’) — link zookeeper:zookeeper — link postgres:postgres — link kafka:kafka debezium/connect

The above command helps us achieve a lot. Other than starting an instance of Debezium, you have used this instance to link our existing containers together. The containers are Postgres, Kafka, and Zookeepers. The purpose of linking the containers is to have them communicate with each other.

Step 1.5: Establish A Connection To PostgreSQL

You can connect to the PostgreSQL database by running the following command:

psql -h localhost -p 5000 -U postgres

Step 1.6: Create A Database And Table

First, let’s create the database:

CREATE DATABASE inventory_db;

The table will have two columns: id (SERIAL data type) and name (VARCHAR data type). The id column will be the primary key for the table.

To create the table, run the following command:

CREATE TABLE my_table(id SERIAL PRIMARY KEY, name VARCHAR);

Congratulations! You now have a ready setup. Next, you will be creating a connector to the Kafka connector.

Step 1.7: Create A Connector

Here is the code to help you register a connector to the Kafka Connect.

curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
"name": "inventory_db-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname" : "inventory_db",
"database.server.name": "dbserver1",
"database.whitelist": "inventory_db",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "schema-changes.inventory"
}
}'

The code will create a connector and give it the name inventory_db-connector. All the parameters for the connector have been set, including the hostname, port, user, password, database name, etc. Let’s check whether the connector was created successfully!

Step 1.8: Check Whether The Connector Was Created

Use the following command:

curl -X GET -H "Accept:application/json" localhost:8083/connectors/inventory_db-connector

Now that you have a ready connector, you can start a Kafka console and begin to watch the changes.

Step 1.9: Launch Kafka Console Consumer And Watch Changes

Just run the command given below:

docker run -it — name watcher — rm — link zookeeper:zookeeper debezium/kafka watch-topic -a -k dbserver1.public.my_table

And you’ve performed kafka postgres cdc!

Approach 2: Postgres CDC using Postgres Audit Triggers

Trigger-based methods involve creating audit triggers on the database to capture all the events related to INSERT, UPDATE, and DELETE methods. The advantage of this method is that everything happens at the SQL level, and the developer just needs to read a separate table that contains all the audit logs.

To create a trigger in the Postgres table execute the below command.

SELECT audit.audit_table('target_table');

This statement is to be executed with a user role with permission to access the audit schema. The app which writes to the tables must not be connecting to the database using a superuser role.

It is possible to turn off these audit logs in case of specific fields if the requirement does not need to track fields that do not change frequently.

What we have then is a table that captures the audit log. The developer still needs to write a program to read and parse these logs and push these into the target databases. The complexity of this logic will also depend on the target database or data warehouse to which data is being pushed.

Limitations of Using Postgres Audit Triggers for Postgres CDC

A downside with this method is that having triggers affects the performance of the database.
A common practice to avoid this performance hit is to have a separate table which tracks the main table and have triggers on the second table.
The syncing between master and the secondary table can be done using Postgres logical replication feature.

Solve your data replication problems with Hevo’s reliable, no-code, automated pipelines with 150+ connectors.

Get your free trial right away!

Approach 3: Postgres CDC using Logical Decoding

Logical decoding uses the content of the write-ahead log to create a log of activities happening in the database. Write ahead log is an internal log that describes database changes on a storage level. The advantage of this approach is that it does not affect the performance of the database in any way.

The approach works based on the installation of an output plugin. To enable the use of logical decoding the following parameters need to be set in the Postgres configuration.

wal_level = logical           

max_replication_slots = 10

Once these parameters are set, execute the following command to create a logical decoding slot.

SELECT * FROM pg_create_logical_replication_slot('slot_repl', 'decode_test');

This will ensure that every update to the tables will create audit events that can be accessed view select statements. Logical decoding supports two kinds of select statements to access these events. The below command can be used to take a peek at the events,

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);

To consume or get the results, use the below command.

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

The difference between ‘consume’ and ‘peek’ is that the second command removes the events once it is consumed. Let’s try to understand this with an example.

Let’s say there is a student table and we insert an entry into this table.

Insert into students (id, name, age) values (1, ‘Peter’, '30');

Once we execute the peek command, the result will be as follows.

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);

0/16CEE88 | 102 | BEGIN 102

0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30

0/16CEFA0 | 102 | COMMIT 102

Again, if we execute the same command, you will still see the same audit events.

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);

0/16CEE88 | 102 | BEGIN 102

0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30

0/16CEFA0 | 102 | COMMIT 102

Now let’s execute the get command.

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

0/16CEE88 | 102 | BEGIN 102

0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30

0/16CEFA0 | 102 | COMMIT 102

The results are the same. But if you execute the same command or the peek command again, you will find that there are zero results.

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

(0 rows)

This means the results are served and deleted when get command is executed, greatly enhancing our ability to write the logic for using these events to create a replica of the table.

Postgres also supports the use of output plugins – which are basically c programs that can execute based on the INSERT, UPDATE and DELETE events and transform the events into a more user-friendly format. Wal2json is one such plugin that can output these events in JSON format. This plugin is widely used by most of the popular Postgres CDC add-ons like Debezium.

A point to note here is that these events will not capture the table creation and modification steps.

The downside of Using Postgres Logical Decoding for Postgres CDC

In this case, the developers also need to write an elaborate logic to process these events and then transform them into statements for the target database.
Depending on the use case you are trying to solve, this can increase the project timelines.

Approach 4: Postgres CDC using a Timestamp Column

The Postgres engine provides the above two methods to implement CDC. There is also a slightly convoluted custom method if you have the flexibility of a timestamp column in your table. This means the developer will need to query the table periodically and monitor the changes to the timestamp column. When changes are detected, the script can create appropriate database statements to write these changes to the target database. This approach is, however, effort-intensive and would demand a lot of time and effort from the developer.

Limitations of Using This Approach

This custom method of implementing CDC by monitoring a timestamp column has several limitations.
It requires the developer to manually query the table at regular intervals, which can be resource-intensive and prone to inefficiencies.
Since it depends on the accuracy of the timestamp, any clock misalignment or inconsistency in timestamp updates could lead to missed or incorrect changes.
Additionally, it significantly burdens the developer to write and maintain custom scripts, increasing both development time and the potential for errors.

Approach 5: Postgres CDC Using Queries

This approach is based on running periodic SQL queries against the source database to detect changes. It can be implemented in two ways:

Direct Queries for Delta Changes:
- Query the database periodically to fetch records with changes based on specific conditions, such as a change flag or an updated timestamp.
- Example query for delta changes:

SELECT * FROM target_table
WHERE last_modified > NOW() - INTERVAL '5 minutes';

Snapshot Comparison:
- Take periodic snapshots of the data and compare them with the previous snapshot to detect INSERT, UPDATE, or DELETE operations.
- This approach is resource-intensive as it involves maintaining and comparing large datasets.

Advantages:
- Simple to implement for small datasets or low-frequency updates.
- No requirement for database-level configurations or plugins.
Limitations:
- Performance overhead due to frequent querying of the database.
- Greater latency to detect changes than real-time CDC methods.
- Not scalable for high-transaction environments because of resource consumption.

Approach 6: Postgres CDC Using Write-Ahead Logs (WAL)

The WAL method uses the internal mechanism of Postgres to log every change made in the database before committing it to the main database files.
This method is very reliable and efficient for capturing database changes.

How it Works:

WAL maintains a log of all INSERT, UPDATE, DELETE, and DDL statements in a sequential file.
CDC tools or custom code can be used to read and process these logs for CDC purposes.

Implementation:

Enable Archiving: Set the wal_level parameter to archive or higher in the Postgres configuration.

wal_level = logical
archive_mode = on
archive_command = 'cp %p /path_to_wal_archive/%f

Read WAL Files: Use tools or libraries, such as pg_waldump or custom scripts, to parse WAL files and extract changes.
Example Tool:
- Debezium CDC is a popular tool that uses Postgres WAL to provide a real-time stream of change events.

Advantages:
- High performance because WAL works at the storage layer.
- A complete record of changes ensures reliability.
- Change detection in real time with minimal latency.
Limitations:
- Complex to implement without third-party tools.
- Takes up space to store the WAL files, especially when high transaction.
- Does not automatically capture schema changes such as table creation or modification.

Conclusion

In a nutshell, even though Postgres provides support for continuous syncing through triggers, logical decoding or custom logic, it is still upon the developers to capture these events and convert them to the target database. According to use cases, this logic must be written explicitly for different target databases. An alternative is a cloud-based service like Hevo which can use Postgres CDC to sync data to most destinations continuously.

Frequently Asked Questions

1. How does CDC affect database performance?

Change Data Capture (CDC) can impact database performance by increasing memory usage, slowing down DML operations, raising CPU consumption, and consuming more disk space. To minimize these effects, enable CDC selectively, manage retention periods, allocate sufficient resources, and monitor performance closely.

2. How to check if CDC is enabled in Postgres?

To check if CDC is enabled in PostgreSQL, run:

SELECT * FROM pg_replication_slots WHERE slot_type = ‘logical’;

If entries appear, logical replication (often used for CDC) is active. You can also check for replication publications with:

SELECT * FROM pg_publication;

If found, CDC is likely enabled.

3. How to setup CDC for Postgres?

To set up CDC for PostgreSQL:
1.Enable logical replication in the PostgreSQL configuration.
2.Create a logical replication slot.
3.Capture database changes using logical decoding.
4.Optionally, use an output plugin for easier data handling.

Vivek Sinha Director of Product Management, Hevo Data

Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.

Best Change Data Capture Guide for Postgres CDC (With Code)

What is Postgres CDC?

What is the need for Postgres CDC?

How does Postgres CDC work?

Benefits of CDC in PostgreSQL

Postgres Change Data Capture Using Custom Code

Approach 1: Using Kafka for Postgres CDC

Step 1.1: Start An Instance Of PostgreSQL

Step 1.2: Start An Instance Of Zookeeper

Step 1.3: Start An Instance Of Kafka

Step 1.4: Start An Instance Of Debezium

Step 1.5: Establish A Connection To PostgreSQL

Step 1.6: Create A Database And Table

Step 1.7: Create A Connector

Step 1.8: Check Whether The Connector Was Created

Step 1.9: Launch Kafka Console Consumer And Watch Changes

Approach 2: Postgres CDC using Postgres Audit Triggers

Limitations of Using Postgres Audit Triggers for Postgres CDC

Approach 3: Postgres CDC using Logical Decoding

The downside of Using Postgres Logical Decoding for Postgres CDC

Approach 4: Postgres CDC using a Timestamp Column

Limitations of Using This Approach

Approach 5: Postgres CDC Using Queries

Approach 6: Postgres CDC Using Write-Ahead Logs (WAL)

Conclusion

Frequently Asked Questions

1. How does CDC affect database performance?

2. How to check if CDC is enabled in Postgres?

3. How to setup CDC for Postgres?

Related Posts

Using PostgreSQL as a Source with AWS DMS: A Step-by-Step Guide

PostgreSQL Query Optimization: 10 Best Tricks & Techniques (Explained with Code)

What Is DynamoDB CDC? Methods, Benefits & Use Cases