A Postgres cluster can be thought of as a collection of databases with their configurations. For example, you have a cluster with two databases that utilize Postgres v9. And all databases use the same cluster settings, such as buffer size, number of connections allowed, connection pool size, etc.

Similarly, you can create a cluster that runs on Postgres 12 and can contain many databases. Multiple clusters with the same version but different configurations are also possible. This article dives into the steps on how to do Postgres Cluster setup and the multiple options available to create a high-availability environment in PostgreSQL.

Steps for Setting Up a Basic PostgreSQL Cluster

All of a cluster’s data, including its configuration, can be kept entirely within the data directory. When you use Postgres tools to build a data directory, it is initially self-contained. Stopping the Postgres clusters’ process and copying this directory creates a complete backup of the cluster’s data. The entire directory can be moved and will continue to function normally. 

We’ll go over the steps on how to create cluster in PostgreSQL are given below. These work only in Ubuntu.

Integrate PostgreSQL to BigQuery
Integrate PostgreSQL to Snowflake
Integrate PostgreSQL to Redshift

PostgreSQL Cluster Set-Up: Creating a New PostgreSQL Cluster

  • To create a new PostgreSQL cluster, use the following command:
initdb -D /usr/local/pgsql/data
  • This PostgreSQL cluster builds a new database and a data directory in the specified location. This directory must be created by a user who has the authority to do so. Also, default configurations would be created, which are normally stored in /var/lib/postgresql/version/clusterName.
  • To connect to a PostgreSQL cluster, you can use the following command:
psql -U Postgres -p 5436 -h localhost
  • Each PostgreSQL cluster has its unique port number, so you have to use the correct port number while typing in the command.
  • To create a new database, use the above command and then use the one below:
CREATE DATABASE mynewdb;

PostgreSQL Cluster Set-Up: Start a Server for a Cluster

  • To start a server, use the following command:
pg_ctlcluster 12 main start
  • In the above code main is the name of the PostgreSQL cluster used and 12 is the Postgres version being used.

PostgreSQL Cluster Set-Up: Stop the Server for a Cluster

  • To stop the PostgreSQL cluster, use the following command:
pg_ctlcluster 12 main stop
  • Similarly, to check the status of the PostgreSQL Cluster, you can use this:
pg_ctlcluster 12 main status

PostgreSQL Cluster Set-Up: Reload or Restart the Server for the Cluster

  • To reload or restart the server for the PostgreSQL Cluster, use the following command:
pg_ctlcluster 12 main restart/reload
  • Restarting it stops the server if it is running and starts it and Reloading it Causes the configuration files to be re-read without a full shutdown of the server.

Here is an example of how you can cluster a table using an index.

You can cluster a table named employees based on its employees_ind index using the following command:

CLUSTER employees USING employees_ind;

You can cluster an employees table using the same index you used before with the following command:

CLUSTER employees;

You can cluster all the tables from a database that have been previously clustered using the following command:

CLUSTER;

Clustering Options to Create High Availability Environment in PostgreSQL

A Cluster is a collection of hosts that work together and are viewed as a single entity.  By adding servers, you can obtain horizontal scalability and the ability to process more work. It can withstand a node failure and continue to operate transparently.

Depending on what is exchanged, there are two models:

 Here are some of the Postgres clustering options:

Simplify PostgreSQL’s ETL & Data Analysis with Hevo’s No-code Data Pipeline

Are you looking for an easy way to replicate your PostgreSQL data? Hevo Data, a No-code Data Pipeline platform, helps load data from any data source, such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process.

Hevo supports 150+ Data Sources (including 40+ Free Sources) and loads the data onto the desired destination like PostgreSQL in real-time and enriches the data and transforms it into an analysis-ready form.

Simplify your Data Analysis with Hevo today!

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

PgCluster

Atsushi Mitani and the PGCluster developer’s team created PGCluster, a PostgreSQL multi-master and synchronous replication solution based on shared-nothing architecture. The main benefit of PgCluster is that it prevents data loss and the system from shutting down.

Because all nodes are synchronous, any machine in a 6-server arrangement can go down without losing data. It’s supposed to provide some read scaling as well, but it doesn’t for most people looking for a PostgreSQL Cluster.

Key Features

  • High Availability: During failover, auto-recovery, dynamic add and upgrade server, and so on, the service is not interrupted. Also, there isn’t a single point of failure. Data transfer speed is automatically adjusted depending on network bandwidth. Servers for load balancing and monitoring are included.
  • Synchronous multi‐master configuration: Any change to one master is immediately apparent to all other masters. There has been no change in the way transactions are handled. On top of this, the now() and random() values are duplicated.
  • Based on PostgreSQL: PgCluster is one of the crucial clustering options that is based on PostgreSQL.
  • Apps have the same API as PostgreSQL: PgCluster’s apps have the same API as PostgreSQL which makes it so compatible with it to begin with.

Pgpool-II

It’s a piece of software that connects PostgreSQL servers to PostgreSQL database clients. Pgpool-II communicates with both the backend and frontend of PostgreSQL and relays a connection between them.

As a result, a database application (frontend) considers Pgpool-II to be the PostgreSQL server, while the server (backend) considers Pgpool-II to be one of its clients.

Because Pgpool-II is transparent to both the server and the client, an existing database application can be utilized with it nearly without modifying its sources.

Key Features

  • Replication of the connection pool.
  • Load balancing.
  • Failover occurs automatically.
  • Queries that run in parallel.
  • It can be used in conjunction with streaming replication.

RubyRep

A multi-DBMS, asynchronous, multi-master, multiplatform replication solution. It does not support DDL, users, or grants because of triggers. Its primary goal is to make use and management as simple as possible.

Key Features

  • Configuration is simple.
  • Installation is simple.
  • Platform and table design are not dependent on each other.

Bucardo

Asynchronous row-based cascading master-slave replication with triggers and database queueing, as well as asynchronous master-master replication with triggers and configurable conflict resolution.

Bucardo requires a dedicated database and runs as a Perl daemon that communicates with it as well as any other databases in the replication process. It has two modes of operation: Multimaster and multi slave.

Key Features

  • Load balancing.
  • Slaves are not bound by any rules.
  • Replication in parts.
  • On-demand replication.
  • Slaves can be “pre-warmed” to save time during setup.

Postgres-XC

Postgres-XC is an open-source project that aims to provide a PostgreSQL cluster solution that is write-scalable, synchronous, symmetric, and transparent. It’s a set of tightly coupled database components that can be installed on several machines (physical or virtual).

The term “write-scalable” refers to the ability to scale up When opposed to a single database server, Postgres-XC can be configured with as many database servers as you wish and handle considerably more writes (updating SQL statements).

Key Features

  • Postgres-XC can be set up to run on several servers. 
  • Your data is kept in a distributed manner, i.e., partitioned or replicated, depending on the option you select for each table. 
  • When you run queries, Postgres-XC determines where the desired data is stored and sends queries to the servers that contain the data.

Citus

Citus is a PostgreSQL substitute that includes high-availability features like auto-sharding and replication. Citus partitions your database and replicate multiple copies of each shard among commodity machines in a cluster.

It seamlessly sends any writes or queries to one of the other nodes, which houses a replica of the impacted shard, if any node in the cluster becomes unavailable.

Key Features

  • Sharding logic is done automatically.
  • Replication is built-in.
  • Replication that is aware of the data center for disaster recovery.
  • Fault tolerance in the middle of a query with advanced load balancing.

You can improve the uptime of your PostgreSQL-powered real-time applications and reduce the impact of hardware failures on performance. This can be accomplished using built-in high availability tools, which reduce the need for costly and error-prone manual intervention.

Postgres-XL

It is a shared-nothing, Multimaster clustering system that can transparently distribute a table across a collection of nodes and execute queries on those nodes in parallel.

It also contains a Global Transaction Manager (GTM) component that provides a globally consistent view of the cluster. The project is based on the PostgreSQL 9.5 version. Commercial support is provided by some companies, such as 2nd Quadrant.

Key Features

They can handle a wide range of databases like:

  • OLTP workloads that need a lot of writing.
  • Business Intelligence that requires MPP parallelism.
  • An operational data store is available.
  • A key-value store is also available.
  • Mixed-workload scenarios. 
  • Multi-tenant environments.

Optimizing Postgres Cluster to Get High Throughput

You can efficiently increase the speed and efficiency of your database by optimizing your Postgres Cluster. It is crucial to consider hardware requirements, query patterns, and the database architecture to get high throughput in your Postgres cluster.

‍You can start by ensuring that your hardware can handle the load and throughput requirement of your DB. This includes considering the type of CPU, RAM, and the number of servers required to scale up your cluster. 

Then, you can ensure that your database is partitioned correctly and the tables can be accessed efficiently. It is also important to ensure that the query patterns are optimized to get quick and efficient database access.

Finally, you have to ensure that the database is maintained properly and regularly backed up. Once your Postgres cluster is optimized for high throughput, it needs to be monitored to measure the performance.

Monitoring PostgreSQL Database Cluster Performance

In PostgreSQL, there are two forms of monitoring:

  • System Resource Monitoring: Examines the infrastructure on which PostgreSQL is installed, revealing information such as CPU and memory use.
  • Database Monitoring: Ensures that your database is in good shape and performing at its best.

Here are a few tools you can use to monitor PostgreSQL Database Cluster Performance:

  • Sematext
  • Prometheus and Grafana
  • Pganalyze
  • pgDash
  • Datadog APM

Active-active PostgreSQL Cluster

In an active-active PostgreSQL cluster, the application can write to any instance in the cluster, and the data is written to all instances in the cluster, allowing any instance in the cluster to:

  • Provide near-zero downtime because the new instance is already configured for the read/write state; no reconfiguration is required.
  • Offer near-zero downtime changes from one version to another.
  • Reduce latency for users in geographically dispersed clusters. Latency can be minimized when an instance is closer to the user.

Before wrapping up, let’s cover some basics.

What are PostgreSQL Clusters?

The term “cluster” in PostgreSQL refers to a “cluster” of databases, as opposed to the more common notion of a collection of servers or virtual machines operating together.

The term “cluster” in PostgreSQL is a historical quirk, and it has nothing to do with the common definition of “compute cluster,” which often refers to groups of machines that collaborate to improve speed and/or availability.

It also has nothing to do with the PostgreSQL CLUSTER command, which is used to organize tables. A Postgres cluster is made up of two elements:

  • The Data Directory: contains a collection of files created and maintained by Postgres tools and processes. Databases, Tables, and all other logical entities with which you interact are stored in these files.
  • Postgres: which is the master process. It controls the files in this directory and provides interfaces for manipulating their contents.

Using PostgreSQL clusters turns out to be extremely handy in some cases, such as establishing test fixtures for automated tests, bringing up a temporary database without disrupting any existing Postgres configuration, running multiple versions of Postgres at the same time, and so on.

Need for a Postgres Cluster

You might consider setting up a Postgres cluster if your server requires the following key features:

  • Reduced Downtime: A Postgres cluster can minimize downtime during hardware failure or maintenance by providing an automatic failover mechanism.
  • Load Balancing: Incoming traffic can be efficiently distributed across nodes using a Postgres cluster to prevent overloading on a single server.
  • Improved Response Time: Data can be dispersed among nodes with improved server response time and data distribution.

Other similar functions boost the performance and reliability of a database system.

You can explore more about migration : Postgres to MySQL

Conclusion

There are many more products to create a high availability environment for PostgreSQL. But you need to consider your infrastructure. If you only have one application server, no matter how well the databases are configured for high availability, you will be inaccessible if the application server dies. You must thoroughly investigate the infrastructure’s single points of failure and attempt to resolve them.

You have learned how to set up clusters in PostgreSQL and how to optimize them for high availability. You also read how to manage these and monitor the cluster performance.

As a Developer, extracting complex data from a diverse set of data sources like Databases, CRMs, Project management Tools, Streaming Services, Marketing Platforms to your PostgreSQL Database can seem to be quite challenging. If you are from non-technical background or are new in the game of data warehouse and analytics, Hevo Data can help!

Want to take Hevo for a spin? Sign Up or a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also checkout our unbeatable pricing to choose the best plan for your organization.

FAQ on Postgres Cluster Setup

When to use Postgres cluster?

You can use a Postgres cluster when:
1. There is High Availability Requirements
2. Scalability Needs
3. Load Balancing
4. Data Distribution
5. Geographic Distribution

What is the difference between cluster and database in PostgreSQL?

A. Database:
1A) A database in PostgreSQL refers to a logical container that stores schemas, tables, indexes, and other objects.
2A) Each database operates independently within a PostgreSQL cluster.
B. Cluster:
1B) A PostgreSQL cluster refers to a collection of databases managed by a single PostgreSQL server process.
2B) It includes shared resources such as configuration settings, transaction logs (WAL), and background processes (e.g., WAL sender, background writer).

How to build a PostgreSQL cluster?

1. Choose a replication strategy
2. Install PostgreSQL
3. Configure PostgreSQL settings
4. Initialize the primary node
5. Set Up replication
6. Enable High Availability
7. Test and Validate
8. Monitor and Maintain

Sharon Rithika
Content Writer, Hevo Data

Sharon is a data science enthusiast with a hands-on approach to data integration and infrastructure. She leverages her technical background in computer science and her experience as a Marketing Content Analyst at Hevo Data to create informative content that bridges the gap between technical concepts and practical applications. Sharon's passion lies in using data to solve real-world problems and empower others with data literacy.

No-Code Data Pipeline for PostgreSQL