Often for analytical or archival purposes, one would need to move data from PostgreSQL to Redshift. This post covers the detailed steps you need to follow to migrate data from Postgres to Redshift for building a data warehousing solution.
Overview of PostgreSQL and Amazon Redshift
PostgreSQL is an open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads. It works as a traditional OLTP database.
Amazon Redshift is a fully managed database optimized for data warehousing. If your objective is to build a scalable OLAP system then you can choose Redshift. It is based on PostgreSQL 8.0.2. Amazon Redshift being a columnar database supports scalable architecture and multi-node processing. As it is offered on Amazon cloud computing platform, there is no initial setup effort required and users can get started in minutes by using simple API calls or the AWS Management Console.
Methods to move data from PostgreSQL to Redshift
Method 1: A ready to use Hevo Data Integration Platform (7 Days Free Trial).
Method 2: Write Custom ETL Scripts to move data from PostgreSQL to Amazon Redshift.
In this article, we are going to describe Method 2 in detail along with its limitations and challenges. Towards the end of this article, we have also compared method 1 (Using Hevo Data Integration Platform) and Method 2 (Custom ETL Script).
PostgreSQL to Redshift Replication using Custom ETL Scripts
You can build a pipeline to load data fro PostgreSQL to Redshift using the following steps:
- Build a compatible schema on Redshift
- Extract data from PostgreSQL to S3 buckets
- Load data from S3 to Temporary table on Redshift
Each of these steps are elaborated along with code snippets in the sections below. Let’s dive in
Step 1: Build a Schema in the Target Database
The first step of this migrating PostgreSQL to Redshift is to keep your target database ready by building an Amazon Redshift compatible schema.
Postgres’s table definition is stored in the information_schema.tables and column definition stored in information_schema.columns. Redshift’s column types differ slightly from Postgres and a few of them are not supported.
Here is a list of all Postgres data types that are not supported in Redshift:
- BIT, BIT VARYING
- Composite Types
- Date/Time Types
- Enumerated Types
- Geometric Types
- Network Address Types
- Numeric Types
- SERIAL, BIGSERIAL, SMALLSERIAL
- Object Identifier Types
- Range Types
- Text Search Types
You will have to convert the above data types into Redshift compatible types first.
Redshift makes use of Sort key and Distribution key to achieve performance gain. It is very important to choose these keys wisely at the time of table creation. With Sort Keys, you can configure the column on which sorting needs to be applied upon storing in Redshift. Distribution key is used to distribute data across multiple nodes based on the values of a particular column.
Let us take an example of migrating Customer table from Postgres to Redshift. To create a table in Redshift, we can use below command:
CREATE TABLE CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP) DISTSTYLE KEY DISTKEY (ID) SORTKEY (ID, BIRTH_DATE);
Note that the maximum character length supported by Redshift is 65535 hence any field larger than that (e.g. text) will need to be truncated/ split across multiple columns.
Step 2: Extracting data from PostgreSQL to S3 buckets
The best way to load a large amount of data to Redshift table is to use a COPY command. Using COPY command, you can load data from various sources like Amazon S3, Amazon EMR, and Remote Host(SSH). The most commonly used source for COPY command is Amazon S3 as it offers the best performance by loading multiple data files in parallel.
You can use Postgres copy statement to export a table data into a file using specific delimiters. You will need to login to the database via command line with the psql command. The following command connects to Postgres database and exports all data from customers table (an example) to a file customers_data.csv
CONNECT DATABASE_NAME; COPY (SELECT (ID, NAME, ADDRESS, BIRTH_DATE) FROM CUSTOMERS) TO ‘CUSTOMERS_DATA.CSV WITH DELIMITER '|' CSV;
You can zip the data files for maximum performance since Redshift gives an option to directly import zipped files.
$ gzip customers_data.csv
Now the next step is to upload these files to Amazon S3 bucket. It can be done either manually from AWS Console or by using Amazon CLI. The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts. To install AWS CLI on your machine you can refer to AWS documentation.
Post setup of AWS CLI on your machine, you can use CP command to copy data from your local machine to S3 bucket. Below command uploads customer_data.csv.gzip into a folder “my_folder” which is located inside an S3 bucket named as “my_s3_bucket”.
$ aws s3 cp customer_data.csv.gzip s3://my_s3_bucket/my_folder/
Step 3: Loading Data from S3 to Redshift Temporary Tables
By now, your data files that are to be imported are ready in S3. The final step is to load this data into Redshift target table. COPY command inserts the data from a file to a table. An important thing to note is that if the table is not empty then it will simply append the new records without changing the existing ones. So, the best approach is to load the data from the file into a temporary table and then inserting into final target table using insert into command. With this approach, you can easily handle situations where you want to upsert records instead of appending them.
a. Create a temporary table in Redshift
Below SQL will create a temporary table named “TEMP_CUSTOMERS” in Redshift. Temporary tables are valid for a session in which they are created and will be auto-dropped once the session is over. The following command creates a temporary customer table.
CREATE TEMPORARY TABLE TEMP_CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP) DISTSTYLE ALL SORTKEY (ID, BIRTH_DATE);
b. Copy data from S3 to a temporary table in Redshift
COPY TEMP_CUSTOMERS FROM 's3://my_s3_bucket/my_folder/customers_data.csv.gz' 'AWS_ACCESS_KEY_ID=MY_ACCESS_KEY AWS_SECRET_ACCESS_KEY=MY_SECRET_KEY' GZIP;
c. Insert data from the temporary table to the target table
Now that you have data in Redshift temporary table, you can simply execute an insert query to load this into the final target table.
INSERT INTO CUSTOMERS SELECT * FROM TEMP_CUSTOMERS;
Loading data into a temporary table first instead of directly inserting it into the target table helps in data validation. An important thing to note while using COPY command is to handle NULLS. You can specify which character in your data file should be mapped as NULL in Redshift table. If NULL values are stored as blank in your data file then you use EMPTYASNULL option in COPY command to convert blank values into NULL values.
Any errors while loading the data can be checked in the STL_LOAD_ERRORS table in Redshift. The table gives you the row that was being loaded and the associated error like column length exceeded, inserting text to a number field, etc.
The Limitations of PostgreSQL to Redshift Replication using Custom ETL Scripts:
- The Custom ETL Script method works well only if you have to move data only once or in batches from PostgreSQL to Redshift.
- The Custom ETL Script method also fails when you have to move data in near real-time from PostgreSQL to Redshift.
- A more optimal way is to move incremental data between two syncs from Postgres to Redshift instead of full load. This method is called Change Data Capture method.
- When you write custom SQL scripts to extract a subset of data often those scripts break as the source schema keeps changing or evolving.
These challenges have an impact on ensuring that you have consistent and accurate data available in your Redshift in near Real-Time.
To address the problems you may try an alternative method – Hevo Data Integration Platform, which takes care of all the issues described above. You can simply connect your source and destination and move data instantly without writing any code. You can explore the product features here.
Feel free to compare the two methods of replicating data from PostgreSQL to Amazon Redshift and select the one that meets your needs.
Hevo offers a 7 Day Free Trial for you to experience how easy it is to use a Data Integration Platform to move data from any source to destination.