PostgreSQL CDC: How to Set Up Real-time Sync

Tutorial • February 28th, 2020

Introduction to PostgreSQL CDC

PostgreSQL is one of the most widely used open-source relational databases. Its comprehensive querying layer and standards compliance make it a favorite for a variety of use cases, ranging from OLTP workloads to data warehousing and analytics. In a typical architecture, organizations have a relational database handle the transactional load and a separate data warehouse serve analytics and aggregated reporting. A challenge here is ensuring that the data warehouse holds the latest version of the data in the transactional database. Time-critical reporting requirements often cannot afford an hourly or daily batch sync between these databases. Such requirements are handled using continuous sync between the instances, based on the change data capture paradigm.

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data (Wikipedia). PostgreSQL has built-in functionality for CDC. This post covers PostgreSQL CDC and the ways to achieve it.

[Image: Postgres CDC — by Ramesh Naik E via Medium]


Methods to Implement PostgreSQL Change Data Capture (CDC)

There are many ways to implement Postgres CDC. This blog discusses the following two approaches:

Method 1: Building Custom Code to achieve PostgreSQL CDC
We will discuss the following approaches in this blog:

  1. Using Postgres Audit Triggers
  2. Using Postgres Logical decoding
  3. Using a timestamp column

Method 2: Using a Completely Automated Data Pipeline Solution like Hevo Data
While building custom code may look like a very lucrative option, there are many roadblocks, challenges, and limitations that make it effort-intensive. A solution like Hevo helps you achieve PostgreSQL CDC to any destination of your choice without having to write a single line of code.

Try Hevo for free by signing up here.

Postgres Change Data Capture Using Custom Code

Approach 1: Using Postgres Audit Triggers

Trigger-based methods involve creating audit triggers on the database to capture all events related to INSERT, UPDATE, and DELETE operations. The advantage of this method is that everything happens at the SQL level; the developer just needs to read a separate table that contains all the audit logs.

To create an audit trigger on a Postgres table, execute the command below. Note that the audit schema and its functions are not built into Postgres; this assumes you have already installed them, for example from the open-source audit-trigger project.

SELECT audit.audit_table('target_table');

This statement must be executed by a user role with permission to access the audit schema. The application that writes to the tables should not connect to the database using a superuser role.

It is also possible to exclude specific columns from the audit log if there is no requirement to track them.

What we have then is a table that captures the audit log. The developer still needs to write a program to read and parse these logs and push them into the target database. The complexity of this logic will also depend on the target database or data warehouse to which the data is being pushed.
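As an illustration, the sketch below turns one audit-log row into a statement for the target database. The row shape used here (`schema_name`, `table_name`, `action`, `row_data`) is an assumption loosely modelled on the audit-trigger project's log table; adapt the field names to the audit schema you actually install.

```python
# Sketch: translate audit-log rows into SQL statements for a target database.
# The row format below is an assumption; adapt it to your audit table.

def audit_row_to_sql(row):
    """Translate one audit row into an INSERT/UPDATE/DELETE statement."""
    table = f"{row['schema_name']}.{row['table_name']}"
    data = row["row_data"]            # dict of column -> value
    if row["action"] == "I":          # INSERT
        cols = ", ".join(data)
        vals = ", ".join(repr(v) for v in data.values())
        return f"INSERT INTO {table} ({cols}) VALUES ({vals});"
    if row["action"] == "U":          # UPDATE (assumes an 'id' key column)
        sets = ", ".join(f"{c} = {v!r}" for c, v in data.items() if c != "id")
        return f"UPDATE {table} SET {sets} WHERE id = {data['id']!r};"
    if row["action"] == "D":          # DELETE
        return f"DELETE FROM {table} WHERE id = {data['id']!r};"
    raise ValueError(f"unknown action {row['action']!r}")

example = {
    "schema_name": "public",
    "table_name": "students",
    "action": "I",
    "row_data": {"id": 1, "name": "Peter", "age": 30},
}
print(audit_row_to_sql(example))
# -> INSERT INTO public.students (id, name, age) VALUES (1, 'Peter', 30);
```

In a real pipeline this translation also has to handle type mapping and quoting rules for the specific target warehouse, which is where much of the effort goes.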

Limitations of Using Postgres Audit Triggers for PostgreSQL CDC

A downside of this method is that triggers affect the performance of the database. A common practice to avoid this performance hit is to have a separate table that mirrors the main table and to place the triggers on that second table. Syncing between the main and secondary tables can be done using Postgres's logical replication feature.

Approach 2: Using Postgres Logical Decoding

Logical decoding uses the content of the write-ahead log (WAL), an internal log that describes database changes at the storage level, to create a stream of the activities happening in the database. Because it reads the WAL rather than firing triggers, its impact on database performance is minimal.

The approach works through an output plugin that formats the change stream. To enable logical decoding, set the following parameters in the Postgres configuration (changing wal_level requires a server restart):

wal_level = logical           

max_replication_slots = 10

Once these parameters are set, execute the following command to create a logical decoding slot.

SELECT * FROM pg_create_logical_replication_slot('slot_repl', 'test_decoding');

This ensures that every update to the tables creates change events that can be accessed via SELECT statements. Logical decoding supports two kinds of functions to access these events. The command below can be used to take a peek at the events:

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);

To consume the events, use the command below.

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

The difference between 'get' and 'peek' is that pg_logical_slot_get_changes removes the events from the slot once they are consumed, while peek leaves them in place. Let's try to understand this with an example.

Let’s say there is a student table and we insert an entry into this table.

INSERT INTO students (id, name, age) VALUES (1, 'Peter', 30);

Once we execute the peek command, the result will be as follows.

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);

0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102

If we execute the same command again, we will still see the same audit events.

SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);

0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102

Now let’s execute the get command.

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102

The results are the same. But if you execute the same command or the peek command again, you will find that there are zero results. 

SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);

(0 rows)

This means that when the get command is executed, the results are served and then deleted from the slot. This consume-once behavior makes it much easier to write the logic that uses these events to create a replica of the table.
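To illustrate what that consumer logic involves, here is a rough Python parser for the test_decoding text format shown in the output above. This format is intended for testing and can vary across Postgres versions, so treat this as a sketch rather than production code.

```python
import re

# Sketch: parse one line of test_decoding output (the third column of the
# pg_logical_slot_get_changes result) into a structured change event.
# Matches column entries like: name[character varying]:'Peter'
COLUMN_RE = re.compile(r"(\w+)\[([^\]]+)\]:('(?:[^']|'')*'|\S+)")

def parse_change(line):
    """Parse one "table ...: INSERT: ..." line into a dict, else None."""
    m = re.match(r"table (\S+): (INSERT|UPDATE|DELETE): (.*)", line)
    if not m:
        return None                   # BEGIN/COMMIT and other records
    table, action, rest = m.groups()
    columns = {}
    for name, coltype, value in COLUMN_RE.findall(rest):
        if value.startswith("'"):
            value = value[1:-1]       # strip quotes on text values
        elif coltype == "integer":
            value = int(value)
        columns[name] = value
    return {"table": table, "action": action, "columns": columns}

line = ("table public.students: INSERT: id[integer]:1 "
        "name[character varying]:'Peter' age[integer]:30")
print(parse_change(line))
```

Every consumer of the raw slot output ends up writing parsing code of this kind before it can even start mapping events to the target database.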

Postgres also supports the use of output plugins, which are C programs that hook into the INSERT, UPDATE, and DELETE events and transform them into a more user-friendly format. wal2json is one such plugin, which outputs these events in JSON format. It is widely used by popular Postgres CDC tools such as Debezium.
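To see why JSON output is easier to consume than the text format above, the snippet below decodes a wal2json-style payload into per-row events. The field names (`change`, `kind`, `schema`, `table`, `columnnames`, `columnvalues`) follow wal2json's default format; verify the exact shape against the plugin version you deploy.

```python
import json

# Sketch: decode one wal2json message into (kind, table, row) events.
def decode_wal2json(payload):
    """Yield (kind, table, row_dict) tuples from one wal2json message."""
    message = json.loads(payload)
    for change in message.get("change", []):
        table = f'{change["schema"]}.{change["table"]}'
        row = dict(zip(change.get("columnnames", []),
                       change.get("columnvalues", [])))
        yield change["kind"], table, row

# Hand-built example payload mirroring the students INSERT above.
payload = json.dumps({
    "change": [{
        "kind": "insert", "schema": "public", "table": "students",
        "columnnames": ["id", "name", "age"],
        "columnvalues": [1, "Peter", 30],
    }]
})
for kind, table, row in decode_wal2json(payload):
    print(kind, table, row)
```

Because the payload is already structured, no regex parsing is needed, which is a large part of why tools like Debezium build on plugins of this kind.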

A point to note here is that these events do not capture DDL changes such as table creation and schema modification.

Downside of Using Postgres Logical Decoding for Postgres CDC

In this case too, the developers need to write elaborate logic to process these events and transform them into statements for the target database. Depending on the use case you are trying to solve, this can increase project timelines.

Approach 3: Using a Timestamp Column

The above two methods are provided by the Postgres engine itself. There is also a slightly convoluted custom method available if your tables have a timestamp column recording the last change. The developer queries the table periodically and uses the timestamp column to detect rows that have changed since the previous poll. When changes are detected, the script creates the appropriate database statements to write these changes to the target database. This approach is, however, effort-intensive and demands a lot of time and effort from the developer.
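A minimal sketch of this polling loop, assuming a hypothetical `updated_at` column and with the database query replaced by a stand-in `fetch_rows` function:

```python
# Sketch: timestamp-based change polling. `fetch_rows` stands in for a
# real query such as:
#   SELECT * FROM students WHERE updated_at > %s ORDER BY updated_at
# The watermark (last seen timestamp) is carried between polls so each
# run picks up only rows changed since the previous one.

def poll_changes(fetch_rows, watermark):
    """Return rows newer than the watermark, plus the advanced watermark."""
    rows = [r for r in fetch_rows() if r["updated_at"] > watermark]
    if rows:
        watermark = max(r["updated_at"] for r in rows)
    return rows, watermark

# Simulated table state; updated_at here is a simple integer clock.
table = [
    {"id": 1, "name": "Peter", "updated_at": 100},
    {"id": 2, "name": "Mary", "updated_at": 250},
]

changes, mark = poll_changes(lambda: table, watermark=0)
print(len(changes), mark)   # first poll sees both rows, watermark advances
changes, mark = poll_changes(lambda: table, watermark=mark)
print(len(changes), mark)   # second poll sees nothing new
```

Note one inherent limitation of timestamp polling: deleted rows simply stop appearing in the query results, so deletes cannot be detected this way and must be handled separately (for example, with soft-delete flags).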

An Easier Approach to Set Up PostgreSQL Change Data Capture

In a nutshell, even though Postgres provides support for continuous syncing through triggers, logical decoding, or custom logic, it is still up to the developers to capture these events and apply them to the target database. This logic needs to be written separately for each target database and use case. An alternative is to use a cloud-based service like Hevo, which can use Postgres CDC to continuously sync data to most destinations.

What are your thoughts about setting up PostgreSQL CDC? Let us know in the comments!
