Data replication ensures the duplication of data on an ongoing basis so that the replicate is in a consistently updated state and is synchronized with the source. This ensures high data availability and safety of data against undesired events such as crashes, system errors, etc.

PostgreSQL streaming replication truly meets this requirement. This feature is achieved through a master-slave configuration. This blog is aimed at providing a detailed explanation of PostgreSQL replication.

What is Streaming Replication?

One of the standard features of PostgreSQL is streaming replication, which enables real-time movement of updated data from the primary server to the standby server, keeping the databases on both servers up to date. Your system will greatly benefit from the streaming replication features listed below.

Failover: The standby server has the ability to take over in the event that the primary server fails.

Read-only load balancing: SQL processing that is read-only can be split over several servers.

To divide up the read-only burden, you can process read-only SQL on the standby server. Nevertheless, PostgreSQL lacks a distribution function that takes server load and requests into account. An open source extension called pgpool-II’s load balancing capability can be used to evenly share workload and read-only queries among PostgreSQL instances.

Solve your data replication problems with Hevo’s reliable, no-code, automated pipelines with 150+ connectors.
Get your free trial right away!

Streaming Replication Mechanism

Let’s examine what needs to be transferred and how it should be transferred in more detail to better comprehend streaming replication.

What is shipped

For crash recovery or rollback, PostgreSQL stores the most recent data from the primary server in a transaction log called a write-ahead log, or WAL. The way streaming replication operates is by sending the WAL in real time to the standby server and using it there.

Setting up Streaming Replication

Steps to be performed on Primary database instance

1: Review the below parameter settings 

(a) Changes required in postgresql.conf

listen_addresses = '*'
archive_mode = on
max_wal_senders = 5 
max_wal_size = 10GB    
wal_level = replica
hot_standby = on   
archive_command = 'rsync -a %p /opt/pg_archives/%f'

Synchronous Replication and Asynchronous Replication

You can choose between synchronous and asynchronous replication for streaming replication on each standby server. The features of each are covered in this section, along with important setup points.

Features of replications that are synchronous and asynchronous

Whether or not to wait for the standby server’s answer before finishing the processing on the primary server is the distinction between synchronous and asynchronous replication. Select the configuration that best suits your needs since it has an impact on high availability and SQL processing response time.

Synchronous Replication

The primary server waits for a response from the standby server before completing a process. Therefore, the overall response time includes the log shipping time. Since there is no delay in WAL shipping to the standby server, the data freshness (reliability) of the standby server is improved. Suitable for failover and read-only load balancing operations.

Asynchronous Replication

The primary server completes a process without waiting for a response from the standby server. Therefore, the overall response time is about the same as when streaming replication is not used. Since WAL shipping and its application (data update) on the standby server are done asynchronously, the updated result on the primary server may not be immediately available on the standby server.

Depending on the timing of failover, data may be lost. It is suitable for replication to remote areas for disaster recovery.

You need to create /opt/pg_archives folder and set ownership to PostgreSQL superuser  i.e. postgres in this case.(b) Changes required in pg_hba.conf

Set up authentication on the primary server to allow replication connections from the standby server(s).

host    replication all   192.168.57.102/32   trust

2: Reload/restart the PostgreSQL database instance

[root@pg ~]# systemctl restart postgresql-14.service
OR
[root@pg ~]# su postgres
bash-4.2$ /usr/pgsql-14/bin/pg_ctl -D /var/lib/pgsql/14/data/ restart -mf

Replication Methods In PostgreSQL

Method 1: Hevo Data, The Easier Approach

Now restart the slave server to complete the process.

This is how a standby node can be set up for PostgreSQL streaming replication.

Hevo Data, a no-code Data Pipeline can help you replicate data from PostgreSQL (among 150+ data sources) swiftly to a database/data warehouse of your choice. Hevo is fully-managed and automates the process of monitoring and replicating the changes on the secondary database rather than making the user write the code repeatedly. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Check out what makes Hevo amazing:

  • Minimal Learning: Hevo with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to export. 
  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
Seamlessly replicate data from 150+ data sources in minutes

Remember that Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs.

Sign up here for a 14-day free trial!

Get Started with Hevo for Free

Method 2: PostgreSQL Streaming Replication

PostgreSQL streaming replication
Image Source

PostgreSQL streaming replication or WAL replication can be set up quite easily upon installing PostgreSQL on all servers. PostgreSQL streaming replication is based on transferring the WAL files from the primary to the target database.

PostgreSQL streaming replication is implemented using a master-slave configuration. The master is known as the primary instance and handles the primary database and its operations. The slave acts as the secondary instance and implements all modifications made to the primary database on itself, thus making itself an identical copy. The master is the read/write server whereas the slave is a read-only server.

Here you need to configure both the master node & the standby node.

Configuring Master Node

The master node can be configured using the following steps:

Step 1: Database Initialization

To initialize the database, make use of the initidb utility command.

Step 2: Creating A New User

Create a new user with replication privileges using the following command:

CREATE USER <user_name> REPLICATION LOGIN ENCRYPTED PASSWORD ’<password>’;

The user must specify a username and password for the given query. The replication keyword is used to provide the user with the necessary privileges.

Example query:

CREATE USER rep_user REPLICATION LOGIN ENCRYPTED PASSWORD 'rep_pass'
New User Created - postgresql streaming replication
Step 3: Configuring PostgreSQL Streaming Replication Properties

You can configure the streaming properties using the PostgreSQL configuration file (postgresql.conf). Make the following changes in the file:

wal_level = logical
wal_log_hints = on
max_wal_senders = 8
max_wal_size = 1GB
hot_standby = on

The parameters used are:

  • wal_level: This is used to enable PostgreSQL streaming replication. The possible values here are replica, minimal, and logical.
  • wal_log_hints: It is required for the pg_rewind capability, this helps when the standby server is out of sync with the master server.
  • max_wal_senders: It is used to specify the maximum number of concurrent connections that can be established with the standby servers.
  • max_wal_size: It is used to specify the size of WAL files that can be retained within log files.
  • hot_standby: This parameter enables a read-only connection with the slave when it is set to ON.
Step 4: Replication Entry In pg_hba.conf File

This is used to allow the servers to establish a connection with each other for replication. The file is usually found in the data directory of PostgreSQL.

Use the following lines of code:

host replication rep_user IPaddress md5

The master server now allows a user named rep_user to connect & act as the slave server using the specified IP for replication.

Example query:

host replication rep_user 192.168.0.22/32 md5

This is how the master node can be set up for PostgreSQL streaming replication.

Download the Ultimate Guide on Database Replication
Download the Ultimate Guide on Database Replication
Download the Ultimate Guide on Database Replication
Learn the 3 ways to replicate databases & which one you should prefer.

Configuring Standby Node

Once the master node is configured, the next step for PostgreSQL replication is to configure the standby node. You can configure the standby node in two simple steps:

Step 1: Create A Backup Of The Master Node

Use the pg_basebackup utility to create a backup of the master node. This will be used as the starting point for the slave mode.

The utility must be used with the following syntax:

pg_basebackp -D <data_directory> -h <master_host> -X stream -c fast -U rep_user -W

The parameters used are:

  • -D: This indicates the directory you are working on.
  • -h: This is used to specify the master host.
  • -X: This is used to include the necessary transactional log files.
  • -C: This is used to set the checkpoints.
  • -W: This is used to prompt the user for a password before connecting to the database.
Step 2: Creating The Replication Configuration File

Check if the replication configuration file exists. If it doesn’t, create the replication configuration file as recovery.conf. The file should be created in the data directory of the PostgreSQL installation.

You can create the file automatically making use of the -R option in the pg_basebackup utility.

The recovery.conf file should have the following commands:

standby_mode = ‘on’
primary_conninfo  = ‘host=<master_host> port=<postgres_port> user=<replication_user> password=<password> application_name=”host_name”’
recovery_target_timeline = ‘latest

The parameters used are:

  • standby_mode: When this parameter is set to ON, it causes the server to start as the slave.
  • primary_conninfo: This parameter is used to establish a connection between the slave & the master server with the help of a connection string.
  • recovery_target_timeline: It is used to set the recovery time.

You need to give the IP address, username & password of the master server as values for the primary_conninfo parameter to establish a connection.

Example query:

primary_conninfo = 'host=192.168.0.26 port=5432 user=rep_user password=rep_pass'

Additional Resources on PostgreSQL Streaming Replication

Conclusion

  • This article enriches you with in-depth knowledge about the various concepts behind the replication process in PostgreSQL.
  • It provides a step-by-step comprehensive guide to help you perform PostgreSQL streaming replication in the smoothest way possible.
  • PostgreSQL streaming replication method can, however, prove to be challenging especially for beginners & this is where Hevo can help you out.
  • Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs.

We would love to hear about your experience of performing PostgreSQL streaming replication! Share your thoughts in the comments section below.

Vishal Agrawal
Technical Content Writer, Hevo Data

Vishal Agarwal is a Data Engineer with 10+ years of experience in the data field. He has designed scalable and efficient data solutions, and his expertise lies in AWS, Azure, Spark, GCP, SQL, Python, and other related technologies. By combining his passion for writing and the knowledge he has acquired over the years, he wishes to help data practitioners solve the day-to-day challenges they face in data engineering. In his article, Vishal applies his analytical thinking and problem-solving approaches to untangle the intricacies of data integration and analysis.