PostgreSQL to Redshift: 2 Easy Methods

on Tutorial, Data Integration, Data Warehouse, ETL • October 10th, 2018

PostgreSQL is an open-source Object-Relational Database System that uses and extends the SQL language, combined with many features that safely store and scale the most complicated data workloads. It is typically used as a traditional OLTP Database.

Amazon Redshift is a fully managed Database optimized for Data Warehousing. If your objective is to build a scalable OLAP system, Redshift is a strong choice. It is based on PostgreSQL 8.0.2. Being a columnar Database, Amazon Redshift supports a scalable architecture and multi-node processing. As it is offered on the Amazon Cloud Computing platform, there is no initial setup effort required, and users can get started in minutes using simple API calls or the AWS Management Console.

Often, for analytical or archival purposes, one would need to move data from PostgreSQL to Redshift. This article explains 2 simple methods you can follow to migrate your data from PostgreSQL to Redshift for building a Data Warehousing solution. Read along to learn more about these methods and decide which one suits you best!


Prerequisites

  • A running PostgreSQL Database.
  • An AWS account with an Amazon Redshift cluster.
  • Working knowledge of Redshift commands.

Introduction to PostgreSQL


PostgreSQL is a powerful open-source Object-Relational Database. It not only uses the SQL language but also extends it with many features that let you store and scale very complex data workloads. It has a reputation for proven Architecture, Data Integrity, Reliability, and Scalability, backed by powerful features and a committed open-source community.

Due to its Robustness and Feature-Rich Query Layer, PostgreSQL is very popular in use cases that traditionally require a strict table structure. It has been in development for more than 30 years, and most of the major problems have already been solved, which gives people the confidence to use it in a business environment.

To learn more about PostgreSQL, visit here.

Introduction to Amazon Redshift


AWS Redshift is a Data Warehouse built using MPP (Massively Parallel Processing) architecture. It is owned and maintained by Amazon Web Services (AWS). It can handle large volumes of data and analytical workloads. It is fully managed, easy to set up, and delivers high performance even for large datasets. These features, combined with its pricing, have made it one of the preferred Data Warehouses among modern data teams.

To learn more about Amazon Redshift, visit here.

Methods to Connect PostgreSQL to Redshift

Method 1: Manual ETL Process to Connect PostgreSQL to Redshift

This method requires you to manually write custom code for the ETL process that connects PostgreSQL to Redshift. It uses various PostgreSQL and Redshift commands to build this ETL process.

Method 2: Using Hevo Data to Connect PostgreSQL to Redshift

Hevo Data provides a hassle-free solution and helps you directly transfer data from PostgreSQL to Redshift and numerous other Databases/Data Warehouses or destinations of your choice instantly, without having to write any code. Hevo comes with a graphical interface that allows you to configure your PostgreSQL source and load data in real-time. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Hevo’s pre-built integrations with 100+ data sources (including 30+ free data sources) will take full charge of the data transfer process, allowing you to focus on key business activities.

Get Started with Hevo for Free

Methods to Connect PostgreSQL to Redshift

The following 2 methods can be used to connect PostgreSQL to Redshift for the data transfer process:

Method 1: Manual ETL Process to Connect PostgreSQL to Redshift

You can build a pipeline to load data from PostgreSQL to Redshift using the following steps:

  • Step 1: Build a Schema in the Target Database
  • Step 2: Extracting Data from PostgreSQL to S3 Buckets
  • Step 3: Loading Data from S3 to Redshift Temporary Tables

Each of these steps is elaborated along with code snippets in the sections below. Let’s dive in.

Step 1: Build a Schema in the Target Database

The first step in migrating from PostgreSQL to Redshift is to get your target Database ready by building an Amazon Redshift-compatible schema.

Postgres stores table definitions in information_schema.tables and column definitions in information_schema.columns. Redshift’s column types differ slightly from Postgres, and a few of them are not supported.

Here is a list of all Postgres data types that are not supported in Redshift:

  • Arrays
  • BIT, BIT VARYING
  • BYTEA
  • Composite Types
  • Date/Time Types
    • INTERVAL
    • TIME
  • Enumerated Types
  • Geometric Types
  • JSON
  • Network Address Types
  • Numeric Types
    • SERIAL, BIGSERIAL, SMALLSERIAL
    • MONEY
  • Object Identifier Types
  • Pseudo-Types
  • Range Types
  • Text Search Types
  • TXID_SNAPSHOT
  • UUID
  • XML

You will have to convert the above data types into Redshift compatible types first.
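
As a quick way to find what needs converting, you can query information_schema.columns on the source Database. The query below is only a sketch: it assumes your tables live in the public schema and checks just a handful of the unsupported types listed above.

SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'   -- assumed schema name; adjust as needed
  AND data_type IN ('ARRAY', 'bytea', 'interval', 'json', 'uuid', 'xml', 'money')
ORDER BY table_name, column_name;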

Redshift makes use of Sort key and Distribution key to achieve performance gain. It is very important to choose these keys wisely at the time of table creation. With Sort Keys, you can configure the column on which sorting needs to be applied upon storing in Redshift. Distribution key is used to distribute data across multiple nodes based on the values of a particular column.

You can read more about choosing Amazon Redshift sort keys and Amazon Redshift distribution keys here.

Let us take an example of migrating Customer table from PostgreSQL to Redshift. To create a table in Redshift, we can use below command:

CREATE TABLE CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP)
DISTSTYLE KEY
DISTKEY (ID)
SORTKEY (ID, BIRTH_DATE);

Note that the maximum VARCHAR length supported by Redshift is 65,535 bytes, hence any field larger than that (e.g. a large Postgres TEXT column) will need to be truncated or split across multiple columns.
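
One way to handle this, sketched below, is to truncate or split the oversized column while selecting it out of Postgres. The NOTES column here is purely illustrative (it is not part of the example CUSTOMERS table), and since Redshift’s limit is in bytes, multi-byte data may need a smaller cutoff.

SELECT ID,
       LEFT(NOTES, 65535)                    AS NOTES_PART1,  -- first chunk
       SUBSTRING(NOTES FROM 65536 FOR 65535) AS NOTES_PART2   -- overflow chunk
FROM CUSTOMERS;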

Step 2: Extracting Data from PostgreSQL to S3 Buckets

The best way to load a large amount of data into a Redshift table is to use the COPY command. Using the COPY command, you can load data from various sources like Amazon S3, Amazon EMR, and a Remote Host (SSH). The most commonly used source for the COPY command is Amazon S3, as it offers the best performance by loading multiple data files in parallel.

You can use PostgreSQL’s COPY statement to export table data into a file using a specific delimiter. You will need to log in to the Database via the command line with the psql command. The following commands connect to the PostgreSQL Database and export all data from the CUSTOMERS table (our example) to a file customers_data.csv.

\c database_name
\copy (SELECT ID, NAME, ADDRESS, BIRTH_DATE FROM CUSTOMERS) TO 'customers_data.csv' WITH (FORMAT CSV, DELIMITER '|');

You can gzip the data files for better performance, since Redshift gives an option to directly import compressed files.

$ gzip customers_data.csv

Now the next step is to upload these files to an Amazon S3 bucket. This can be done either manually from the AWS Console or by using the AWS CLI. The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts. To install the AWS CLI on your machine, you can refer to the AWS documentation here.

Once the AWS CLI is set up on your machine, you can use the cp command to copy data from your local machine to the S3 bucket. The below command uploads customers_data.csv.gz into a folder “my_folder” located inside an S3 bucket named “my_s3_bucket”.

$ aws s3 cp customers_data.csv.gz s3://my_s3_bucket/my_folder/

Step 3: Loading Data from S3 to Redshift Temporary Tables

By now, the data files to be imported are ready in S3. The final step is to load this data into the Redshift target table. The COPY command inserts the data from a file into a table. An important thing to note is that if the table is not empty, COPY will simply append the new records without changing the existing ones. So, the best approach is to load the data from the file into a temporary table and then insert it into the final target table using an INSERT INTO command. With this approach, you can easily handle situations where you want to upsert records instead of appending them (a sketch of the upsert variant follows the steps below). The following steps are required:

  • Create a Temporary Table in Redshift: The below SQL creates a temporary table named “TEMP_CUSTOMERS” in Redshift. Temporary tables are valid only for the session in which they are created and are automatically dropped once the session ends.
CREATE TEMPORARY TABLE TEMP_CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP)
DISTSTYLE ALL
SORTKEY (ID, BIRTH_DATE);
  • Copy Data from S3 to a Temporary Table in Redshift
COPY TEMP_CUSTOMERS
FROM 's3://my_s3_bucket/my_folder/customers_data.csv.gz'
CREDENTIALS 'aws_access_key_id=MY_ACCESS_KEY;aws_secret_access_key=MY_SECRET_KEY'
CSV DELIMITER '|'
GZIP;
  • Insert Data from the Temporary Table to the Target Table: Now that you have data in Redshift temporary table, you can simply execute an insert query to load this into the final target table.
INSERT INTO CUSTOMERS
SELECT * FROM TEMP_CUSTOMERS;
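
If you want to upsert rather than append, a common pattern in Redshift is to delete the rows that are about to be replaced and then insert the fresh rows from the temporary table, all inside one transaction. The sketch below assumes that ID uniquely identifies a customer.

BEGIN;
-- Remove target rows that have a fresh version in the temporary table
DELETE FROM CUSTOMERS
USING TEMP_CUSTOMERS
WHERE CUSTOMERS.ID = TEMP_CUSTOMERS.ID;
-- Insert the fresh rows (both new and replaced records)
INSERT INTO CUSTOMERS
SELECT * FROM TEMP_CUSTOMERS;
END;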

Loading data into a temporary table first, instead of directly inserting it into the target table, also helps in data validation. An important thing to note while using the COPY command is how NULLs are handled. You can specify which character in your data file should be mapped to NULL in the Redshift table. If NULL values are stored as blanks in your data file, you can use the EMPTYASNULL option of the COPY command to convert blank values into NULLs.
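
For example, a COPY that loads empty CHAR/VARCHAR fields as NULLs might look like the sketch below; EMPTYASNULL is a standard COPY parameter, but the exact set of format options should mirror how the file was exported.

COPY TEMP_CUSTOMERS
FROM 's3://my_s3_bucket/my_folder/customers_data.csv.gz'
CREDENTIALS 'aws_access_key_id=MY_ACCESS_KEY;aws_secret_access_key=MY_SECRET_KEY'
CSV DELIMITER '|'
GZIP
EMPTYASNULL;   -- load empty CHAR/VARCHAR fields as NULL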

Any errors encountered while loading the data can be checked in the STL_LOAD_ERRORS table in Redshift. The table gives you the row that was being loaded and the associated error, such as a column length being exceeded or text being inserted into a numeric field.
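
To inspect the most recent failures, you can query STL_LOAD_ERRORS directly; the columns used below (filename, line_number, colname, err_reason, raw_line) are part of that system table.

SELECT starttime,
       filename,
       line_number,
       colname,
       err_reason,
       raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;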

The Limitations of Using Custom ETL Scripts

The Custom ETL Script method has the following challenges, which impact your ability to have consistent and accurate data available in Redshift in near real-time:

  • The Custom ETL Script method works well only if you have to move data once or in periodic batches from PostgreSQL to Redshift.
  • The Custom ETL Script method fails when you have to move data in near real-time from PostgreSQL to Redshift.
  • A more optimal way is to move only the incremental data (the changes between two syncs) from Postgres to Redshift instead of a full load every time. This approach is called Change Data Capture (CDC), and it is hard to implement with custom scripts.
  • When you write custom SQL scripts to extract a subset of data, those scripts often break as the source schema keeps changing or evolving.

Method 2: Using Hevo Data to Connect PostgreSQL to Redshift


Hevo Data, a No-code Data Pipeline, helps you directly transfer data from PostgreSQL and 100+ other data sources to Data Warehouses such as Redshift, Databases, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Hevo Data takes care of all your Data Preprocessing needs and lets you focus on key business activities and draw much more powerful insights on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.

Sign up here for a 14-Day Free Trial!

Using Hevo Data, you can connect PostgreSQL to Redshift in the following 2 steps:

  • Authenticate Source: You must authenticate your PostgreSQL account as the data source for Hevo Data as shown in the below image. To learn more about this step, visit here.
Configuring PostgreSQL as Source
  • Configure Destination: You must configure your Redshift account as the data destination for Hevo Data as shown in the image below. To learn more about this step, visit here.
Configuring Redshift as Destination

The Advantages of Using Hevo Data:

  • No Data Loss – Hevo’s fault-tolerant architecture ensures that data is moved from PostgreSQL to Redshift reliably and without data loss.
  • 100s of Out-of-the-Box Integrations – In addition to PostgreSQL, Hevo brings data from Databases, Cloud Applications, SDKs, and so on into Redshift. This ensures that you have a reliable partner to cater to your growing data needs.
  • Minimal Setup: The data load from PostgreSQL to Redshift can be set up on a point-and-click interface, without any developer assistance.
  • Automatic Schema Detection and Mapping: Hevo scans the schema of incoming PostgreSQL data automatically. When changes are detected, it handles them seamlessly by incorporating the change in Redshift.
  • Exceptional Support: Hevo provides live support over email and chat. This ensures that you always have a team that can resolve your issues promptly.

Conclusion

This article introduced PostgreSQL and AWS Redshift and provided a step-by-step description of the 2 methods you can use to connect PostgreSQL to Redshift. The first method, which involves building the ETL process manually, has many challenges and limitations.

Visit our Website to Explore Hevo

To avoid these problems, you may try Hevo, a Data Integration Platform that takes care of all the issues described above. You can simply connect your source and destination and move data instantly without writing any code.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your understanding of connecting PostgreSQL to Redshift in the comments section below!
