Real-time data synchronization is essential for drawing instant insights from your ever-growing PostgreSQL databases and staying on top of your data. This is where Change Data Capture (CDC) comes into the picture.

In this blog, we will explore the ins and outs of implementing Change Data Capture (CDC) in Postgres databases. This will enable you to achieve seamless, real-time data replication and keep your systems up to date with the latest information.

What is PostgreSQL?


PostgreSQL is a versatile open-source relational database with advanced features that allow it to serve both as a transactional database and as a data warehouse. It adheres to SQL standards, can be extended with custom functions and extensions, and can be managed through tools such as pgAdmin. It is completely free, and its strong community support makes it one of the top choices for data integrity, reliability, and data management.

Introduction to Postgres CDC

PostgreSQL is one of the most widely used open-source relational databases. Its comprehensive querying layer and technical standards compliance make it a favorite for a variety of use cases, ranging from OLTP workloads to data warehousing and analytics. In a typical architecture, organizations have a relational database handle the transactional load and a separate data warehouse perform all the analytics and aggregated reporting.

A challenge here is in ensuring that the data warehouse has the latest version of data that is present in the transactional database. It is common to have time-critical reporting requirements that cannot afford hourly or daily batch sync between these databases. Such requirements are handled using continuous sync between the instances based on the change data capture paradigm.

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. PostgreSQL has in-built functionality for CDC. This post is about Postgres CDC and the ways to achieve this.


What is the need for Postgres CDC?

The need for Postgres CDC arises from the critical requirement for real-time and accurate data synchronization in modern enterprises dealing with vast volumes of data. In scenarios where immediate access to the most up-to-date data is crucial for reporting and decision-making, waiting for scheduled batch syncs can lead to errors, outdated information, and hindered decision-making processes.

Postgres CDC methods offer a solution by enabling the tracking and managing of database changes and facilitating the replication of source data changes to destination data stores. By leveraging Postgres CDC, organizations can overcome the limitations of batch processing, ensuring continuous and precise data synchronization between databases, thereby enhancing the accuracy and timeliness of data-driven decision-making processes.

Simplify Postgres CDC with Hevo’s no-code data pipeline

Hevo is the only real-time, no-code ELT data pipeline platform that cost-effectively automates data pipelines flexible to your needs. With integrations for 150+ data sources (40+ free sources), Hevo helps you not only export data from sources and load it into destinations but also transform and enrich your data to make it analysis-ready.

Start for free now!


How does Postgres CDC work?

Change Data Capture (CDC) is a methodology employed in PostgreSQL (and other databases) to track and propagate changes in data and its structure. It identifies, captures, and delivers modifications made to the source data, enabling organizations to leverage the updated information. By doing so, Postgres CDC facilitates rapid and efficient data transfers to data warehouses, where the changes can be analyzed and utilized.

The core purpose of CDC in PostgreSQL is to ensure that multiple databases remain synchronized and updated with the latest changes occurring in the PostgreSQL database. The Postgres CDC process broadly encompasses three key steps:

  1. Capturing data changes: CDC monitors and records any inserts, updates, or deletions made to the data in the PostgreSQL database.
  2. Transforming the change data: The captured changes are transformed into a format compatible with the target system, ensuring seamless integration.
  3. Loading the data: The transformed change data is then loaded into the target database or data warehouse, keeping it aligned with the source PostgreSQL database.
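The three steps above can be sketched as a tiny pipeline. The example below is purely illustrative (the event format, function names, and in-memory target store are hypothetical, not any particular tool's API); real CDC tools read these events from the database itself:

```python
# A minimal, hypothetical sketch of the capture -> transform -> load loop.
# Real CDC pipelines read these events from the database; here they are hardcoded.

# Step 1: captured change events, as a CDC reader might emit them.
captured_events = [
    {"op": "INSERT", "table": "students", "row": {"id": 1, "name": "Peter", "age": 30}},
    {"op": "UPDATE", "table": "students", "row": {"id": 1, "name": "Peter", "age": 31}},
    {"op": "DELETE", "table": "students", "row": {"id": 1}},
]

def transform(event):
    """Step 2: convert a change event into the target system's record format."""
    return (event["table"], event["op"], event["row"]["id"], event["row"])

def load(target, record):
    """Step 3: apply the change to the target store (a dict keyed by row id)."""
    table, op, row_id, row = record
    rows = target.setdefault(table, {})
    if op == "DELETE":
        rows.pop(row_id, None)
    else:  # INSERT and UPDATE both upsert the row
        rows[row_id] = row

target_store = {}
for event in captured_events:
    load(target_store, transform(event))

print(target_store)  # the student row was inserted, updated, then deleted
```

Applying the three events in order leaves the target empty again, mirroring the final state of the source table.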

Benefits of CDC in PostgreSQL

  • Enables quick, informed business decisions based on the latest data.
  • Reduces data transfer costs by sending only the latest changes, not entire datasets.
  • Minimizes the impact on Postgres database performance by refreshing replicas with small volumes of changes.
  • Enhances master data management by continuously updating centralized data, ensuring everyone accesses the same data for better decisions.
  • Facilitates disaster recovery and backups by replicating key databases in real-time.

Postgres Change Data Capture Using Custom Code

Approach 1: Postgres CDC using Postgres Audit Triggers

Trigger-based methods involve creating audit triggers on the database to capture all events arising from INSERT, UPDATE, and DELETE statements. The advantage of this method is that everything happens at the SQL level, and the developer just needs to read a separate table that contains all the audit logs.

To create the audit trigger on a Postgres table, execute the command below (this assumes an audit trigger function, such as the widely used community audit-trigger scripts, has already been installed in an audit schema).

SELECT audit.audit_table('target_table');

This statement must be executed by a user role with permission to access the audit schema. The application that writes to the tables should not connect to the database using a superuser role.

It is also possible to exclude specific columns from the audit log if the requirement does not call for tracking them, such as frequently changing fields of no analytical value.

What we have then is a table that captures the audit log. The developer still needs to write a program to read and parse these logs and push them into the target database. The complexity of this logic also depends on the target database or data warehouse to which the data is being pushed.
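To make the idea concrete, here is a small, self-contained sketch that uses SQLite in place of Postgres so it runs anywhere (the table names and trigger are illustrative). The principle is the same: a trigger writes every change into an audit table, and a sync program reads that table and replays the changes into the target database.

```python
import sqlite3

# Source database with an audit trigger (SQLite stands in for Postgres here).
src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE audit_log (action TEXT, row_id INTEGER, name TEXT, age INTEGER);

-- Equivalent in spirit to a Postgres audit trigger: log every INSERT.
CREATE TRIGGER students_audit AFTER INSERT ON students
BEGIN
    INSERT INTO audit_log VALUES ('INSERT', NEW.id, NEW.name, NEW.age);
END;
""")
src.execute("INSERT INTO students (id, name, age) VALUES (1, 'Peter', 30)")

# The sync program: read the audit log and replay it into the target database.
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
for action, row_id, name, age in src.execute("SELECT * FROM audit_log"):
    if action == "INSERT":
        tgt.execute("INSERT INTO students VALUES (?, ?, ?)", (row_id, name, age))

print(tgt.execute("SELECT * FROM students").fetchall())  # [(1, 'Peter', 30)]
```

A production version would also handle UPDATE and DELETE triggers, track which audit rows have already been synced, and translate rows into the target warehouse's dialect.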

Limitations of Using Postgres Audit Triggers for Postgres CDC

A downside of this method is that triggers affect the performance of the database. A common practice to avoid this performance hit is to have a secondary table that mirrors the main table and to place the triggers on that secondary table, keeping the two tables in sync using Postgres's logical replication feature.

Approach 2: Postgres CDC using Logical Decoding

Logical decoding uses the contents of the write-ahead log (WAL), an internal log that describes database changes at the storage level, to create a stream of the activities happening in the database. Because it reads from the WAL instead of firing per-statement triggers, its impact on database performance is minimal.

The approach works through an output plugin that decodes the WAL contents into a consumable format. To enable logical decoding, the following parameters need to be set in the Postgres configuration (changing wal_level requires a server restart).

wal_level = logical           

max_replication_slots = 10

Once these parameters are set, execute the following command to create a logical decoding slot.

SELECT * FROM pg_create_logical_replication_slot('slot_repl', 'test_decoding');

This ensures that every update to the tables creates change events that can be accessed via SELECT statements. Logical decoding supports two kinds of SELECT statements for accessing these events. The command below can be used to take a peek at the events.

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);

To consume the events, use the command below.

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

The difference between ‘peek’ and ‘get’ is that the get command removes the events from the slot once they are consumed, while peek leaves them in place. Let’s try to understand this with an example.

Let’s say there is a students table and we insert an entry into it.

INSERT INTO students (id, name, age) VALUES (1, 'Peter', 30);

Once we execute the peek command, the result will be as follows.

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102

Again, if we execute the same command, you will still see the same audit events.  

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102

Now let’s execute the get command.

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102

The results are the same. But if you execute the same command or the peek command again, you will find that there are zero results. 

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

(0 rows)

This means that when the get command is executed, the results are served and then deleted from the slot. This exactly-once delivery makes it much easier to write the logic that uses these events to maintain a replica of the table.
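The events returned by the slot still have to be parsed before they can drive a replica. Below is a hypothetical parser for the text format shown above (real pipelines typically prefer a structured plugin such as wal2json, but this illustrates the kind of logic a developer would otherwise write by hand):

```python
import re

# Hypothetical parser for the textual output of the test_decoding plugin.
# BEGIN/COMMIT lines carry no row data and are skipped.
ROW_RE = re.compile(r"table (\S+): (INSERT|UPDATE|DELETE): (.*)")
COL_RE = re.compile(r"(\w+)\[([^\]]+)\]:('[^']*'|\S+)")

def parse_change(line):
    """Turn one decoded WAL line into (table, operation, {column: value})."""
    m = ROW_RE.search(line)
    if not m:
        return None  # transaction markers such as BEGIN 102 / COMMIT 102
    table, op, cols = m.groups()
    row = {}
    for name, _coltype, value in COL_RE.findall(cols):
        row[name] = value.strip("'")
    return table, op, row

line = ("0/16CEE88 | 102 | table public.students: INSERT: "
        "id[integer]:1 name[character varying]:'Peter' age[integer]:30")
print(parse_change(line))
# ('public.students', 'INSERT', {'id': '1', 'name': 'Peter', 'age': '30'})
```

A real consumer would feed each parsed tuple into statements for the target database, which is exactly the glue code CDC services automate.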

Postgres also supports custom output plugins – C programs loaded by the server that receive the INSERT, UPDATE, and DELETE events and transform them into a more user-friendly format. wal2json is one such plugin, which outputs these events in JSON format; it is widely used by popular Postgres CDC tools such as Debezium.

A point to note here is that these events do not capture DDL changes such as table creation or schema modification.

Downside of Using Postgres Logical Decoding for Postgres CDC

In this case too, developers need to write elaborate logic to process these events and transform them into statements for the target database. Depending on the use case you are trying to solve, this can increase project timelines.

Approach 3: Postgres CDC using a Timestamp Column

The two methods above are provided by the Postgres engine to implement CDC. There is also a somewhat convoluted custom method available if your tables include a timestamp column that is updated on every change. The developer queries the table periodically, monitors the timestamp column for new values, and, when changes are detected, generates the appropriate database statements to write those changes to the target database. This approach is, however, effort-intensive and demands considerable time and effort from the developer.
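The polling logic can be sketched in a few lines. The example below uses SQLite so it is self-contained, and the table, column names, and timestamps are all hypothetical; against Postgres, the same query would run on a schedule with the watermark persisted between runs:

```python
import sqlite3

# Hypothetical polling sketch: the table carries an updated_at timestamp,
# and a periodic job pulls only the rows modified since the last sync.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "shipped",   "2024-01-01 09:00:00"),
    (2, "pending",   "2024-01-02 10:30:00"),
    (3, "cancelled", "2024-01-02 11:45:00"),
])

def fetch_changes(conn, last_sync):
    """Return rows changed since the previous sync, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at", (last_sync,)).fetchall()
    new_watermark = rows[-1][2] if rows else last_sync
    return rows, new_watermark

changes, watermark = fetch_changes(db, "2024-01-01 23:59:59")
print(changes)    # only the two rows updated on 2024-01-02
print(watermark)  # '2024-01-02 11:45:00'
```

Note one inherent limitation of this approach: deleted rows simply stop appearing in the query results, so DELETEs cannot be detected without extra bookkeeping such as soft-delete flags.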

An Easier Approach to Set Up Postgres Change Data Capture

In a nutshell, even though Postgres provides support for continuous syncing through triggers, logical decoding, or custom logic, it is still up to the developers to capture these events and convert them into statements for the target database. This logic needs to be written specifically for each target database and use case. An alternative is to use a cloud-based service like Hevo, which can use Postgres CDC to continuously sync data to most destinations.

Sign up here for a 14-day free trial!

What are your thoughts on setting up Postgres CDC? Let us know in the comments!

Frequently Asked Questions

Q. How does CDC affect database performance?

Change Data Capture (CDC) can impact database performance by increasing memory consumption, slowing down DML operations, raising CPU usage, consuming more disk space, and introducing slight latency. To mitigate these effects, it is advisable to enable CDC selectively, manage retention periods efficiently, allocate adequate compute resources, and monitor performance closely to ensure optimal database operation while leveraging the benefits of real-time data integration.

Vivek Sinha
Director of Product Management, Hevo Data

Vivek Sinha has more than 10 years of experience in real-time analytics and cloud-native technologies. With a focus on Apache Pinot, he was a driving force in shaping innovation and defensible differentiators, including enhanced query processing, data mutability support, and cost-effective tiered storage solutions at Hevo. He also demonstrates a passion for exploring and implementing innovative trends within the dynamic data industry landscape.
