PostgreSQL is an open-source object-relational database that is good for OLTP workloads. Amazon Redshift is a cloud data warehouse optimized for scalable analytics. Redshift is based on PostgreSQL 8.0.2 but adds columnar storage and MPP architecture.
Companies often want to migrate operational data from PostgreSQL to Redshift for analysis. This enables them to build scalable OLAP systems leveraging Redshift’s analytics capabilities.
This article outlines two straightforward methods to migrate data from PostgreSQL to Amazon Redshift. By moving data into Redshift, you can optimize analytical workloads while retaining PostgreSQL for transactional needs. Read on to learn simple approaches to enable analytics on your PostgreSQL data.
Methods to Connect PostgreSQL to Redshift
The following 2 methods can be used to connect PostgreSQL to Redshift for data transfer process:
Method 1: Using Hevo Data to Connect PostgreSQL to Redshift
Hevo Data, an Automated No-code Data Pipeline, helps you directly load and transfer data from PostgreSQL to Redshift in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of migrating data from PostgreSQL to Redshift, enriching the data, and transforming it into an analysis-ready form without having to write a single line of code.
Get Started with Hevo for Free
Using Hevo Data, you can connect PostgreSQL to Redshift in the following 2 steps:
- Authenticate Source: You must authenticate your PostgreSQL account as the data source for Hevo Data as shown in the below image.
- Configure Destination: You must configure your Redshift account as the data destination for Hevo Data as shown in the image below.
With this, you have successfully set up PostgreSQL to Redshift Integration using Hevo Data.
For more information on the integration process, refer to the Hevo documentation:
Method 2: Manual ETL Process to Connect PostgreSQL to Redshift
You can build a pipeline to load data from PostgreSQL to Redshift using the following steps:
Each of these steps are elaborated along with code snippets in the sections below. Let’s dive in
Step 1: Build a Schema in the Target Database
The first step of this migrating PostgreSQL to Redshift is to keep your target database ready by building an Amazon Redshift compatible schema.
Postgres’s table definition is stored in the information_schema.tables and column definition stored in
information_schema.columns. Redshift’s column types differ slightly from Postgres and a few of them are not supported.
Here is a list of all Postgres data types that are not supported in Redshift:
- BIT, BIT VARYING
- Composite Types
- Date/Time Types
- Enumerated Types
- Geometric Types
- Network Address Types
- Numeric Types
- SERIAL, BIGSERIAL, SMALLSERIAL
- Object Identifier Types
- Range Types
- Text Search Types
You will have to convert the above data types into Redshift compatible types first.
Redshift makes use of Sort key and Distribution key to achieve performance gain. It is very important to choose these keys wisely at the time of table creation. With Sort Keys, you can configure the column on which sorting needs to be applied upon storing in Redshift. Distribution key is used to distribute data across multiple nodes based on the values of a particular column.
You can read more about choosing Amazon Redshift sort keys and Amazon Redshift distribution keys here.
Let us take an example of migrating Customer table from PostgreSQL to Redshift. To create a table in Redshift, we can use below command:
CREATE TABLE CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP)
SORTKEY (ID, BIRTH_DATE);
Note that the maximum character length supported by Redshift is 65535 hence any field larger than that (e.g. text) will need to be truncated/ split across multiple columns.
Step 2: Extracting Data from PostgreSQL to S3 Buckets
The best way to load a large amount of data to Redshift table is to use a
COPY command. Using
COPY command, you can load data from various sources like Amazon S3, Amazon EMR, and Remote Host(SSH). The most commonly used source for
COPY command is Amazon S3 as it offers the best performance by loading multiple data files in parallel.
You can use PostgreSQL
COPY statement to export a table data into a file using specific delimiters. You will need to login to the database via command line with the psql command. The following command connects to PostgreSQL Database and exports all data from customers table (an example) to a file customers_data.csv
COPY (SELECT (ID, NAME, ADDRESS, BIRTH_DATE) FROM CUSTOMERS) TO ‘CUSTOMERS_DATA.CSV WITH DELIMITER '|' CSV;
You can zip the data files for maximum performance since Redshift gives an option to directly import zipped files.
$ gzip customers_data.csv
Now the next step is to upload these files to Amazon S3 bucket. It can be done either manually from AWS Console or by using Amazon CLI. The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts. To install AWS CLI on your machine you can refer to AWS documentation here.
Post setup of AWS CLI on your machine, you can use CP command to copy data from your local machine to S3 bucket. Below command uploads customer_data.csv.gzip into a folder “my_folder” which is located inside an S3 bucket named as “my_s3_bucket”.
$ aws s3 cp customer_data.csv.gzip s3://my_s3_bucket/my_folder/
Step 3: Loading Data from S3 to Redshift Temporary Tables
By now, your data files that are to be imported are ready in S3. The final step is to load this data into Redshift target table.
COPY command inserts the data from a file to a table. An important thing to note is that if the table is not empty then it will simply append the new records without changing the existing ones. So, the best approach is to load the data from the file into a temporary table and then inserting into final target table using insert into command. With this approach, you can easily handle situations where you want to upsert records instead of appending them. The following steps are required:
- Create a Temporary Table in Redshift: Below SQL will create a temporary table named “TEMP_CUSTOMERS” in Redshift. Temporary tables are valid for a session in which they are created and will be auto-dropped once the session is over. The following command creates a temporary customer table.
CREATE TEMPORARY TABLE TEMP_CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP)
SORTKEY (ID, BIRTH_DATE);
- Copy Data from S3 to a Temporary Table in Redshift
- Insert Data from the Temporary Table to the Target Table: Now that you have data in Redshift temporary table, you can simply execute an insert query to load this into the final target table.
INSERT INTO CUSTOMERS
SELECT * FROM TEMP_CUSTOMERS;
Loading data into a temporary table first instead of directly inserting it into the target table helps in data validation. An important thing to note while using
COPY command is to handle NULLS. You can specify which character in your data file should be mapped as NULL in Redshift table. If NULL values are stored as blank in your data file then you use
EMPTYASNULL option in
COPY command to convert blank values into NULL values.
Any errors while loading the data can be checked in the STL_LOAD_ERRORS table in Redshift. The table gives you the row that was being loaded and the associated error like column length exceeded, inserting text to a number field, etc.
Take a look at this article that talks about replicating Postgres data to Snowflake for more understanding.
Limitations of Using Custom ETL Scripts
These challenges have an impact on ensuring that you have consistent and accurate data available in your Redshift in near Real-Time.
- The Custom ETL Script method works well only if you have to move data only once or in batches from PostgreSQL to Redshift.
- The Custom ETL Script method also fails when you have to move data in near real-time from PostgreSQL to Redshift.
- A more optimal way is to move incremental data between two syncs from Postgres to Redshift instead of full load. This method is called the Change Data Capture method.
- When you write custom SQL scripts to extract a subset of data often those scripts break as the source schema keeps changing or evolving.
Why Postgres to Redshift Migration?
Both PostgreSQL and Redshift have specific functions and are among the best tools to be used because:
- Online transactional processing (OLTP) databases capture and store transactional-oriented data in real-time and are used for the effective functioning of an organization’s everyday activities. For complex analysis, OLTP databases slow down the website’s operations. Examples of OLTP databases are Postgres and MySQL.
Online Analytical Processing (OLAP) databases allow fast execution of complex analytical queries. They are commonly known as data warehouses and have the ability to store large volumes of data and run complex queries on them to generate insights. OLAP data warehouses are AWS Redshift, Google BigQuery, Azure Synapse, and Snowflake.
So, to get the best of both OLTP and OLAP, moving data from PostgreSQL to Redshift appears to be a wiser option.
- PostgreSQL is highly customizable but is quite slow to draw analytical insights from a large volume of datasets as it is a row-based database. Here comes the role of Redshift, as it is a Massively Parallel Processing machine and can deliver data insights within a fraction of a second.
- Postgres can handle concurrency but has the risk of crashing when the data load increases. Pairing it with Redshift can ensure the safety of data and applications.
This article provided a step-by-step description of the 2 methods you can use to migrate PostgreSQL to Redshift. The second method which involves building the ETL process manually has many challenges and limitations. Whereas, if you require Real-Time data replication and looking for a fully automated real-time solution, then Hevo is the right choice for you.
With Hevo, you can achieve simple and efficient data migration from PostgreSQL to Redshift in minutes. Hevo can help you replicate data from PostgreSQL and 150+ data sources (including 50+ data sources) to Redshift or a destination of your choice and visualize it in a BI tool. This makes Hevo the right partner to be by your side as your business scales.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Check out our transparent pricing to make an informed decision!
Share your understanding of PostgreSQL to Redshift migration in the comments section below!