Real-time data synchronization is essential to draw instant insights and stay on top of your data from your ever-growing PostgreSQL databases. This is where CDC comes into the picture.
In this blog, we will delve into the intricacies of Postgres CDC and explore the ins and outs of implementing Change Data Capture (CDC) in Postgres databases. This will enable you to achieve seamless, real-time data replication and keep your systems up-to-date with the latest information.
What is PostgreSQL?
PostgreSQL is an open-source relational database that is versatile in nature and has advanced features that allow it to be used both as a database and a data warehouse. It adheres to SQL standards and and can be extended through the PgAdmin tool. It is completely free and the strong community support makes it one of the top choices for data integrity, reliability and data management.
Looking for the best ETL tools to connect your PostgreSQL? Rest assured, Hevo’s no-code platform helps streamline your ETL process. Try Hevo and equip your team to:
- Integrate data from 150+ sources(60+ free sources).
- Utilize drag-and-drop and custom Python script features to transform your data.
- Risk management and security framework for cloud-based systems with HIPAA, GDPR, SOC2 Compliance.
Try Hevo and discover why 2000+ customers have chosen Hevo over tools like AWS DMS to upgrade to a modern data stack.
Try Hevo Postgres CDC for Free
Introduction to Postgres CDC
- PostgreSQL is one of the most widely used open-source relational databases. Its comprehensive querying layer and technical standards-compliance make it a favorite for a variety of uses cases ranging from OLTP workloads to data warehouse and analytics use cases. A challenge here is in ensuring that the data warehouse has the latest version of data that is present in the transactional database.
- It is common to have time-critical reporting requirements that cannot afford hourly or daily batch sync between these databases. Such requirements are handled using continuous sync between the instances based on the change data capture paradigm.
- In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. PostgreSQL has in-built functionality for CDC. This post is about Postgres CDC and the ways to achieve this.
What is the need for Postgres CDC?
The need for Postgres CDC arises from the critical requirement for real-time and accurate data synchronization in modern enterprises dealing with vast volumes of data. In scenarios where immediate access to the most up-to-date data is crucial for reporting and decision-making, waiting for scheduled batch syncs can lead to errors, outdated information, and hindered decision-making processes.
Postgres CDC methods offer a solution by enabling the tracking and managing of database changes and facilitating the replication of source data changes to destination data stores. By leveraging Postgres CDC, organizations can overcome the limitations of batch processing, ensuring continuous and precise data synchronization between databases, thereby enhancing the accuracy and timeliness of data-driven decision-making processes.
How does Postgres CDC work?
Change Data Capture (CDC) is a methodology employed in PostgreSQL (and other databases) to track and propagate changes in data and its structure. It identifies, captures, and delivers modifications made to the source data, enabling organizations to leverage the updated information. By doing so, Postgres CDC facilitates rapid and efficient data transfers to data warehouses, where the changes can be analyzed and utilized.
The core purpose of CDC in PostgreSQL is to ensure that multiple databases remain synchronized and updated with the latest changes occurring in the PostgreSQL database. The Postgres CDC process broadly encompasses three key steps:
- Capturing data changes: CDC monitors and records any inserts, updates, or deletions made to the data in the PostgreSQL database.
- Transforming the change data: The captured changes are transformed into a format compatible with the target system, ensuring seamless integration.
- Loading the data: The transformed change data is then loaded into the target database or data warehouse, keeping it aligned with the source PostgreSQL database.
Integrate PostgreSQL to Snowflake
Integrate PostgreSQL to Redshift
Integrate PostgreSQL on Amazon RDS to BigQuery
Benefits of CDC in PostgreSQL
- Enables quick, informed business decisions based on latest, updated data.
- Reduces data transfer costs by sending only latest changes, not entire datasets.
- Minimizes impact on Postgres database performance by creating copies refreshed with small change volumes.
Postgres Change Data Capture Using Custom Code
Approach 1: Postgres CDC using Postgres Audit Triggers
Trigger-based methods involve creating audit triggers on the database to capture all the events related to INSERT, UPDATE and DELETE methods. The advantage with this method is that everything happens at the SQL level and the developer just needs to read a separate table that contains all the audit logs.
To create a trigger in the Postgres table execute the below command.
SELECT audit.audit_table('target_table');
This statement is to be executed with a user role with permission to access the audit schema. The app which writes to the tables must not be connecting to the database using a superuser role.
It is possible to turn off these audit logs in case of specific fields if the requirement does not need to track fields that do not change frequently.
What we have then is a table that captures the audit log. The developer still needs to write a program to read and parse these logs and push these into the target databases. The complexity of this logic will also depend on the target database or data warehouse to which data is being pushed.
Limitations of Using Postgres Audit Triggers for Postgres CDC
- A downside with this method is that having triggers affects the performance of the database.
- A common practice to avoid this performance hit is to have a separate table which tracks the main table and have triggers on the second table.
- The syncing between master and the secondary table can be done using Postgres logical replication feature.
PostgreSQL to BigQuery: Seamless CDC Integration
No credit card required
Approach 2: Postgres CDC using Logical Decoding
Logical decoding uses the content of the write-ahead log to create a log of activities happening in the database. Write ahead log is an internal log that describes database changes on a storage level. The advantage of this approach is that it does not affect the performance of the database in any way.
The approach works based on the installation of an output plugin. To enable the use of logical decoding the following parameters need to be set in the Postgres configuration.
wal_level = logical
max_replication_slots = 10
Once these parameters are set, execute the following command to create a logical decoding slot.
SELECT * FROM pg_create_logical_replication_slot('slot_repl', 'decode_test');
This will ensure that every update to the tables will create audit events that can be accessed view select statements. Logical decoding supports two kinds of select statements to access these events. The below command can be used to take a peek at the events,
SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
To consume or get the results, use the below command.
SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);
The difference between ‘consume’ and ‘peek’ is that the second command removes the events once it is consumed. Let’s try to understand this with an example.
Let’s say there is a student table and we insert an entry into this table.
Insert into students (id, name, age) values (1, ‘Peter’, '30');
Once we execute the peek command, the result will be as follows.
SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102
Again, if we execute the same command, you will still see the same audit events.
SELECT * FROM pg_logical_slot_peek_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102
Now let’s execute the get command.
SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);
0/16CEE88 | 102 | BEGIN 102
0/16CEE88 | 102 | table public.students: INSERT: id[integer]:1 name[character varying]:'Peter' age[integer]:30
0/16CEFA0 | 102 | COMMIT 102
The results are the same. But if you execute the same command or the peek command again, you will find that there are zero results.
SELECT * FROM pg_logical_slot_get_changes('slot_repl', NULL, NULL);
(0 rows)
This means when get command is executed the results are served and deleted, greatly enhancing our ability to write the logic for using these events to create a replica of the table.
Postgres also supports the use of output plugins – which are basically c programs that can execute based on the INSERT, UPDATE and DELETE events and transform the events into a more user-friendly format. Wal2json is one such plugin which can output these events in JSON format. This plugin is widely used by most of the popular Postgres CDC add ons like Debezium.
A point to note here is that these events will not capture the table creation and modification steps.
Downside of Using Postgres Logical Decoding for Postgres CDC
- In this case, also, the developers need to write an elaborate logic to process these events and then transform them into statements for the target database.
- Depending on the use case that you are trying to solve, this can increase the project timelines.
Approach 3: Postgres CDC using a Timestamp Column
The above two methods are provided by the Postgres engine to implement CDC. There is also a slightly convoluted custom method if you have the flexibility of a timestamp column in your table. This means the developer will need to query the table periodically and monitor the changes to the timestamp column. When changes are detected, the script can create appropriate database statements to write these changes to the target database. This approach is, however, effort-intensive and would demand a lot of time and effort from the developer.
Limitations of Using Approach 3:
- This custom method of implementing CDC by monitoring a timestamp column has several limitations.
- It requires the developer to manually query the table at regular intervals, which can be resource-intensive and prone to inefficiencies.
- Since it depends on the accuracy of the timestamp, any clock misalignment or inconsistency in timestamp updates could lead to missed or incorrect changes.
- Additionally, it places a significant burden on the developer to write and maintain custom scripts, increasing both development time and potential for errors.
Approach 4: Postgres CDC Using Queries
This approach is based on running periodic SQL queries against the source database to detect changes. It can be implemented in two ways:
- Direct Queries for Delta Changes:
- Query the database periodically to fetch records with changes based on specific conditions, such as a change flag or an updated timestamp.
- Example query for delta changes:
SELECT * FROM target_table
WHERE last_modified > NOW() - INTERVAL '5 minutes';
- Snapshot Comparison:
- Take periodic snapshots of the data and compare them with the previous snapshot to detect INSERT, UPDATE, or DELETE operations.
- This approach is resource-intensive as it involves maintaining and comparing large datasets.
- Advantages:
- Simple to implement for small datasets or low-frequency updates.
- No requirement for database-level configurations or plugins.
- Limitations:
- Performance overhead due to frequent querying of the database.
- Greater latency to detect changes than real-time CDC methods.
- Not scalable for high-transaction environments because of resource consumption.
Approach 5: Postgres CDC Using Write-Ahead Logs (WAL)
- The WAL method uses the internal mechanism of Postgres to log every change made in the database before committing it to the main database files.
- This method is very reliable and efficient for capturing database changes.
How it Works:
- WAL maintains a log of all INSERT, UPDATE, DELETE, and DDL statements in a sequential file.
- Tools or custom code can be used to read and process these logs for CDC purposes.
Implementation:
- Enable Archiving: Set the wal_level parameter to archive or higher in the Postgres configuration.
wal_level = logical
archive_mode = on
archive_command = 'cp %p /path_to_wal_archive/%f
- Read WAL Files: Use tools or libraries, such as pg_waldump or custom scripts, to parse WAL files and extract changes.
- Example Tool:
- Debezium: This is a popular tool that uses Postgres WAL to provide a real-time stream of change events.
- Advantages:
- High performance because WAL works at the storage layer.
- A complete record of changes ensures reliability.
- Change detection in real time with minimal latency.
- Limitations:
- Complex to implement without third-party tools.
- Takes up space to store the WAL files, especially when high transaction.
- Does not automatically capture schema changes such as table creation or modification.
Conclusion
In a nutshell, even though Postgres provides support for continuous syncing through triggers, logical decoding or through custom logic, it is still upon the developers to capture these events and convert them to the target database. This logic will need to be specifically written for different target databases according to use cases. An alternative is to use a cloud-based service like Hevo which can use Postgres CDC to continuously sync data to most destinations.
Sign up for Hevo’s 14-day free trial and experience seamless data migration.
Frequently Asked Questions
1. How does CDC affect database performance?
Change Data Capture (CDC) can impact database performance by increasing memory usage, slowing down DML operations, raising CPU consumption, and consuming more disk space. To minimize these effects, enable CDC selectively, manage retention periods, allocate sufficient resources, and monitor performance closely.
2. How to check if CDC is enabled in Postgres?
To check if CDC is enabled in PostgreSQL, run:
SELECT * FROM pg_replication_slots WHERE slot_type = ‘logical’;
If entries appear, logical replication (often used for CDC) is active. You can also check for replication publications with:
SELECT * FROM pg_publication;
If found, CDC is likely enabled.
3. How to setup CDC for Postgres?
To set up CDC for PostgreSQL:
1.Enable logical replication in the PostgreSQL configuration.
2.Create a logical replication slot.
3.Capture database changes using logical decoding.
4.Optionally, use an output plugin for easier data handling.
Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.