Data replication ensures the duplication of data on an ongoing basis so that the replicate is in a consistently updated state and is synchronized with the source.
This ensures high data availability and safety of data against undesired events such as crashes, system errors, etc.
PostgreSQL streaming replication truly meets this requirement. This feature is achieved through a master-slave configuration. This blog is aimed at providing a detailed explanation of PostgreSQL replication.
Step-by-Step Replication Methods In PostgreSQL
Method 1: Hevo Data, The Easier Approach
Now restart the slave server to complete the process.
This is how a standby node can be set up for PostgreSQL streaming replication.
Hevo Data, a no-code Data Pipeline can help you replicate data from PostgreSQL (among 150+ data sources) swiftly to a database/data warehouse of your choice. Hevo is fully-managed and automates the process of monitoring and replicating the changes on the secondary database rather than making the user write the code repeatedly. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
Check out what makes Hevo amazing:
- Minimal Learning: Hevo with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to export.
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
Hevo simplifies replicating your PostgreSQL streaming data and migrates it to a destination of your choice with just a few clicks.
Easily set up and manage replication processes, ensuring data consistency and reliability across your PostgreSQL databases. Optimize your database operations effortlessly with Hevo.
Get Started with Hevo for free
Method 2: PostgreSQL Streaming Replication
PostgreSQL streaming replication or WAL replication can be set up quite easily upon installing PostgreSQL on all servers. PostgreSQL streaming replication is based on transferring the WAL files from the primary to the target database.
PostgreSQL streaming replication is implemented using a master-slave configuration. The master is known as the primary instance and handles the primary database and its operations. The slave acts as the secondary instance and implements all modifications made to the primary database on itself, thus making itself an identical copy. The master is the read/write server whereas the slave is a read-only server.
Here you need to configure both the master node & the standby node.
Configuring Master Node
The master node can be configured using the following steps:
Step 1: Database Initialization
To initialize the database, make use of the initidb utility command.
Step 2: Creating A New User
Create a new user with replication privileges using the following command:
CREATE USER <user_name> REPLICATION LOGIN ENCRYPTED PASSWORD ’<password>’;
The user must specify a username and password for the given query. The replication keyword is used to provide the user with the necessary privileges.
Example query:
CREATE USER rep_user REPLICATION LOGIN ENCRYPTED PASSWORD 'rep_pass'
Step 3: Configuring PostgreSQL Streaming Replication Properties
You can configure the streaming properties using the PostgreSQL configuration file (postgresql.conf). Make the following changes in the file:
wal_level = logical
wal_log_hints = on
max_wal_senders = 8
max_wal_size = 1GB
hot_standby = on
The parameters used are:
- wal_level: This is used to enable PostgreSQL streaming replication. The possible values here are replica, minimal, and logical.
- wal_log_hints: It is required for the pg_rewind capability, this helps when the standby server is out of sync with the master server.
- max_wal_senders: It is used to specify the maximum number of concurrent connections that can be established with the standby servers.
- max_wal_size: It is used to specify the size of WAL files that can be retained within log files.
- hot_standby: This parameter enables a read-only connection with the slave when it is set to ON.
Step 4: Replication Entry In pg_hba.conf File
This is used to allow the servers to establish a connection with each other for replication. The file is usually found in the data directory of PostgreSQL.
Use the following lines of code:
host replication rep_user IPaddress md5
The master server now allows a user named rep_user to connect & act as the slave server using the specified IP for replication.
Example query:
host replication rep_user 192.168.0.22/32 md5
This is how the master node can be set up for PostgreSQL streaming replication.
Configuring Standby Node
Once the master node is configured, the next step for PostgreSQL replication is to configure the standby node. You can configure the standby node in two simple steps:
Step 1: Create A Backup Of The Master Node
Use the pg_basebackup utility to create a backup of the master node. This will be used as the starting point for the slave mode.
The utility must be used with the following syntax:
pg_basebackup -D <data_directory> -h <master_host> -X stream -c fast -U rep_user -W
The parameters used are:
- -D: This indicates the directory you are working on.
- -h: This is used to specify the master host.
- -X: This is used to include the necessary transactional log files.
- -C: This is used to set the checkpoints.
- -W: This is used to prompt the user for a password before connecting to the database.
Step 2: Creating The Replication Configuration File
Check if the replication configuration file exists. If it doesn’t, create the replication configuration file as recovery.conf. The file should be created in the data directory of the PostgreSQL installation.
You can create the file automatically making use of the -R option in the pg_basebackup utility.
The recovery.conf file should have the following commands:
standby_mode = ‘on’
primary_conninfo = ‘host=<master_host> port=<postgres_port> user=<replication_user> password=<password> application_name=”host_name”’
recovery_target_timeline = ‘latest
The parameters used are:
- standby_mode: When this parameter is set to ON, it causes the server to start as the slave.
- primary_conninfo: This parameter is used to establish a connection between the slave & the master server with the help of a connection string.
- recovery_target_timeline: It is used to set the recovery time.
You need to give the IP address, username & password of the master server as values for the primary_conninfo parameter to establish a connection.
Example query:
primary_conninfo = 'host=192.168.0.26 port=5432 user=rep_user password=rep_pass'
What is Streaming Replication?
One of the standard features of PostgreSQL is streaming replication, which enables real-time movement of updated data from the primary server to the standby server, keeping the databases on both servers up to date. Your system will greatly benefit from the streaming replication features listed below.
Failover: The standby server has the ability to take over in the event that the primary server fails.
Read-only load balancing: SQL processing that is read-only can be split over several servers.
To divide up the read-only burden, you can process read-only SQL on the standby server. Nevertheless, PostgreSQL lacks a distribution function that takes server load and requests into account. An open source extension called pgpool-II’s load balancing capability can be used to evenly share workload and read-only queries among PostgreSQL instances.
Streaming Replication Mechanism
Let’s examine what needs to be transferred and how it should be transferred in more detail to better comprehend streaming replication.
What is shipped
For crash recovery or rollback, PostgreSQL stores the most recent data from the primary server in a transaction log called a write-ahead log, or WAL. The way streaming replication operates is by sending the WAL in real time to the standby server and using it there.
Setting up Streaming Replication
Steps to be performed on Primary database instance
1: Review the below parameter settings
(a) Changes required in postgresql.conf
listen_addresses = '*'
archive_mode = on
max_wal_senders = 5
max_wal_size = 10GB
wal_level = replica
hot_standby = on
archive_command = 'rsync -a %p /opt/pg_archives/%f'
Synchronous Replication and Asynchronous Replication
You can choose between synchronous and asynchronous replication for streaming replication on each standby server. The features of each are covered in this section, along with important setup points.
Features of replications that are synchronous and asynchronous
Whether or not to wait for the standby server’s answer before finishing the processing on the primary server is the distinction between synchronous and asynchronous replication. Select the configuration that best suits your needs since it has an impact on high availability and SQL processing response time.
Synchronous Replication
The primary server waits for a response from the standby server before completing a process. Therefore, the overall response time includes the log shipping time. Since there is no delay in WAL shipping to the standby server, the data freshness (reliability) of the standby server is improved. Suitable for failover and read-only load balancing operations.
Asynchronous Replication
The primary server completes a process without waiting for a response from the standby server. Therefore, the overall response time is about the same as when streaming replication is not used. Since WAL shipping and its application (data update) on the standby server are done asynchronously, the updated result on the primary server may not be immediately available on the standby server.
Depending on the timing of failover, data may be lost. It is suitable for replication to remote areas for disaster recovery.
You need to create /opt/pg_archives folder and set ownership to PostgreSQL superuser i.e. postgres in this case.(b) Changes required in pg_hba.conf
Set up authentication on the primary server to allow replication connections from the standby server(s).
host replication all 192.168.57.102/32 trust
2: Reload/restart the PostgreSQL database instance
[root@pg ~]# systemctl restart postgresql-14.service
OR
[root@pg ~]# su postgres
bash-4.2$ /usr/pgsql-14/bin/pg_ctl -D /var/lib/pgsql/14/data/ restart -mf
Additional Resources on PostgreSQL Streaming Replication
Conclusion
- This article enriches you with in-depth knowledge about the various concepts behind the replication process in PostgreSQL.
- It provides a step-by-step comprehensive guide to help you perform PostgreSQL streaming replication in the smoothest way possible.
- PostgreSQL streaming replication method can, however, prove to be challenging especially for beginners & this is where Hevo can help you out.
- Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs.
Vishal Agarwal is a Data Engineer with 10+ years of experience in the data field. He has designed scalable and efficient data solutions, and his expertise lies in AWS, Azure, Spark, GCP, SQL, Python, and other related technologies. By combining his passion for writing and the knowledge he has acquired over the years, he wishes to help data practitioners solve the day-to-day challenges they face in data engineering. In his article, Vishal applies his analytical thinking and problem-solving approaches to untangle the intricacies of data integration and analysis.