PostgreSQL is an open-source Object-Relational Database System that uses and extends the SQL language, combined with many features that let you safely store and scale the most complicated data workloads. It works as a traditional OLTP Database. Amazon Redshift, on the other hand, is a fully managed Database optimized for Data Warehousing. If your objective is to migrate data from PostgreSQL to Redshift, you have come to the right place.
However, before we discuss the methods for this data migration, let's talk a bit about Postgres and Redshift. If you want to build a scalable OLAP system, Redshift is a natural choice. It is based on PostgreSQL 8.0.2 and, being a columnar database, supports a scalable architecture and multi-node processing. As it is offered on the Amazon Cloud Computing platform, there is no initial setup effort required, and users can get started in minutes using simple API calls or the AWS Management Console.
Often for analytical or archival purposes, one would need to move data from PostgreSQL to Redshift. This article explains 2 simple methods in which you can do that. Read along to learn more about these methods and decide which method suits you the best!
Prerequisites
- A PostgreSQL account
- An Amazon Redshift account
- Working knowledge of Redshift commands
Introduction to PostgreSQL
PostgreSQL is a powerful open-source Object-Relational Database. It not only uses the SQL language but also adds many features to it, allowing you to safely store and scale very complex data workloads. It has a reputation for proven architecture, data integrity, reliability, and scalability, along with a powerful feature set and a commitment to a rich open-source community.
Due to its robustness and feature-rich query layer, PostgreSQL is very popular in use cases that traditionally require a strict table structure. It has been actively developed for more than 30 years, and most of the major problems have been solved, which makes people confident in using it in a business environment and migrating data from it to different destinations. Take a gander at this article that talks about replicating Postgres data to Snowflake for more understanding.
Introduction to Amazon Redshift
AWS Redshift is a Data Warehouse built using MPP (Massively Parallel Processing) architecture. It is owned and maintained by Amazon Web Services (AWS) and has the ability to handle large volumes of data and analytical workloads. It is completely managed, easy to set up, and delivers high performance even on large data sets. These features, combined with its pricing, have made it one of the preferred Data Warehouses among modern data teams.
Methods to Connect PostgreSQL to Redshift
The following 2 methods can be used to connect PostgreSQL to Redshift for the data transfer process:
Method 1: Manual ETL Process to Connect PostgreSQL to Redshift
You can build a pipeline to load data from PostgreSQL to Redshift using the following steps:
- Step 1: Build a Schema in the Target Database
- Step 2: Extract Data from PostgreSQL to S3 Buckets
- Step 3: Load Data from S3 to Redshift Temporary Tables
Each of these steps is elaborated along with code snippets in the sections below. Let's dive in.
Step 1: Build a Schema in the Target Database
The first step in migrating PostgreSQL to Redshift is to get your target database ready by building an Amazon Redshift-compatible schema.
Postgres stores table definitions in information_schema.tables and column definitions in information_schema.columns. Redshift's column types differ slightly from Postgres, and a few of them are not supported.
Here is a list of Postgres data types that are not supported (or only partially supported) in Redshift:
- Arrays
- BIT, BIT VARYING
- BYTEA
- Composite Types
- Date/Time Types
- Enumerated Types
- Geometric Types
- JSON
- Network Address Types
- Numeric Types
- SERIAL, BIGSERIAL, SMALLSERIAL
- MONEY
- Object Identifier Types
- Pseudo-Types
- Range Types
- Text Search Types
- TXID_SNAPSHOT
- UUID
- XML
You will have to convert the above data types into Redshift compatible types first.
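To see which source columns will need conversion, you can query the catalog views mentioned above. The snippet below is a minimal sketch that lists the column definitions of the example customers table; the table name is an assumption, so adapt it to your own schema.
-- List the source column definitions so each type can be mapped to a Redshift equivalent
SELECT COLUMN_NAME,
       DATA_TYPE,
       CHARACTER_MAXIMUM_LENGTH,
       IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'customers'   -- example table; replace with your own
ORDER BY ORDINAL_POSITION;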
Redshift makes use of a Sort Key and a Distribution Key to achieve performance gains, so it is very important to choose these keys wisely at the time of table creation. With a Sort Key, you configure the column(s) on which the data is sorted when stored in Redshift. The Distribution Key is used to distribute data across multiple nodes based on the values of a particular column.
You can read more about choosing Amazon Redshift sort keys and Amazon Redshift distribution keys here.
Let us take the example of migrating a Customers table from PostgreSQL to Redshift. To create the table in Redshift, you can use the below command:
CREATE TABLE CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP)
DISTSTYLE KEY
DISTKEY (ID)
SORTKEY (ID, BIRTH_DATE);
Note that the maximum character length supported by Redshift is 65535 (the VARCHAR limit, in bytes), hence any field larger than that (e.g. a Postgres text column) will need to be truncated or split across multiple columns.
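For example, if the source table had a TEXT column, one simple way to handle it is to truncate the value at extraction time. The snippet below is a sketch that assumes a hypothetical NOTES column on the customers table; also note that the 65535 limit is in bytes, so data with multi-byte characters may require a smaller cut-off.
-- Truncate a hypothetical TEXT column so it fits into a Redshift VARCHAR(65535)
SELECT ID,
       LEFT(NOTES, 65535) AS NOTES
FROM CUSTOMERS;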
Step 2: Extracting Data from PostgreSQL to S3 Buckets
The best way to load a large amount of data into a Redshift table is to use the COPY command. Using the COPY command, you can load data from various sources like Amazon S3, Amazon EMR, and a Remote Host (SSH). The most commonly used source for the COPY command is Amazon S3, as it offers the best performance by loading multiple data files in parallel.
You can use PostgreSQL's COPY statement to export table data into a file using a specific delimiter. You will need to log in to the database via the command line with the psql command. The following commands connect to the PostgreSQL database and export all data from the customers table (an example) to a file named customers_data.csv.
\c database_name
\copy (SELECT ID, NAME, ADDRESS, BIRTH_DATE FROM CUSTOMERS) TO 'customers_data.csv' WITH DELIMITER '|' CSV;
You can gzip the data files for better performance, since Redshift gives you the option to directly import compressed files.
$ gzip customers_data.csv
Now the next step is to upload these files to an Amazon S3 bucket. This can be done either manually from the AWS Console or by using the AWS CLI. The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts. To install the AWS CLI on your machine, you can refer to the AWS documentation here.
After setting up the AWS CLI on your machine, you can use the cp command to copy data from your local machine to the S3 bucket. The below command uploads customers_data.csv.gz into a folder "my_folder" located inside an S3 bucket named "my_s3_bucket".
$ aws s3 cp customers_data.csv.gz s3://my_s3_bucket/my_folder/
Step 3: Loading Data from S3 to Redshift Temporary Tables
By now, the data files that are to be imported are ready in S3. The final step is to load this data into the Redshift target table. The COPY command inserts the data from a file into a table. An important thing to note is that if the table is not empty, COPY will simply append the new records without changing the existing ones. So, the best approach is to load the data from the file into a temporary table and then insert it into the final target table using an INSERT INTO command. With this approach, you can easily handle situations where you want to upsert records instead of appending them (a minimal upsert sketch is shown after the steps below). The following steps are required:
- Create a Temporary Table in Redshift: The below SQL creates a temporary table named "TEMP_CUSTOMERS" in Redshift. Temporary tables are valid only for the session in which they are created and are automatically dropped once the session ends.
CREATE TEMPORARY TABLE TEMP_CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP)
DISTSTYLE ALL
SORTKEY (ID, BIRTH_DATE);
- Copy Data from S3 to the Temporary Table in Redshift:
COPY TEMP_CUSTOMERS
FROM 's3://my_s3_bucket/my_folder/customers_data.csv.gz'
CREDENTIALS 'aws_access_key_id=MY_ACCESS_KEY;aws_secret_access_key=MY_SECRET_KEY'
DELIMITER '|'
CSV
GZIP;
- Insert Data from the Temporary Table to the Target Table: Now that you have data in Redshift temporary table, you can simply execute an insert query to load this into the final target table.
INSERT INTO CUSTOMERS
SELECT * FROM TEMP_CUSTOMERS;
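If you need to upsert records rather than simply append them, as mentioned earlier, one common pattern is to delete the matching rows from the target table and then insert everything from the temporary table, all inside a single transaction. The sketch below assumes ID is the primary key of the example CUSTOMERS table.
BEGIN;

-- Remove rows that are about to be replaced by the newly loaded data
DELETE FROM CUSTOMERS
USING TEMP_CUSTOMERS
WHERE CUSTOMERS.ID = TEMP_CUSTOMERS.ID;

-- Insert the full contents of the staging table
INSERT INTO CUSTOMERS
SELECT * FROM TEMP_CUSTOMERS;

END;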
Loading data into a temporary table first, instead of directly inserting it into the target table, also helps with data validation. An important thing to note while using the COPY command is how to handle NULLs. You can specify which character in your data file should be mapped to NULL in the Redshift table. If NULL values are stored as blanks in your data file, you can use the EMPTYASNULL option in the COPY command to convert blank values into NULL values.
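As a sketch, the earlier COPY command could be extended with EMPTYASNULL so that blank CHAR/VARCHAR fields in the file are loaded as NULLs (the credentials and file path below are the same placeholders used above):
-- Load the file and treat empty CHAR/VARCHAR fields as NULL
COPY TEMP_CUSTOMERS
FROM 's3://my_s3_bucket/my_folder/customers_data.csv.gz'
CREDENTIALS 'aws_access_key_id=MY_ACCESS_KEY;aws_secret_access_key=MY_SECRET_KEY'
DELIMITER '|'
CSV
GZIP
EMPTYASNULL;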
Any errors encountered while loading the data can be checked in the STL_LOAD_ERRORS table in Redshift. The table gives you the row that was being loaded and the associated error, such as a column length being exceeded or text being inserted into a numeric field.
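For instance, a query along the following lines surfaces the most recent load errors together with the offending line and the reason:
-- Show the ten most recent load errors with the raw line and error reason
SELECT STARTTIME,
       FILENAME,
       LINE_NUMBER,
       COLNAME,
       ERR_REASON,
       RAW_LINE
FROM STL_LOAD_ERRORS
ORDER BY STARTTIME DESC
LIMIT 10;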
Limitations of Using Custom ETL Scripts
The following challenges have an impact on ensuring that you have consistent and accurate data available in Redshift in near real-time:
- The Custom ETL Script method works well only if you have to move data once or in batches from PostgreSQL to Redshift.
- The Custom ETL Script method fails when you have to move data in near real-time from PostgreSQL to Redshift.
- A more optimal way is to move only the incremental data that changed between two syncs from Postgres to Redshift, instead of doing a full load each time. This approach is called Change Data Capture (a minimal sketch of the idea follows this list).
- When you write custom SQL scripts to extract a subset of data, those scripts often break as the source schema keeps changing or evolving.
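As a rough illustration of the incremental idea, the export step could be restricted to rows that changed since the last sync. The sketch below assumes a hypothetical UPDATED_AT column on the source table and a hard-coded last-sync timestamp, which in practice you would track and substitute programmatically:
-- Export only the rows that changed since the last successful sync (timestamp is a placeholder)
\copy (SELECT ID, NAME, ADDRESS, BIRTH_DATE FROM CUSTOMERS WHERE UPDATED_AT > '2023-01-01 00:00:00') TO 'customers_delta.csv' WITH DELIMITER '|' CSV;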
Method 2: Using Hevo Data to Connect PostgreSQL to Redshift
Hevo Data, an Automated No-code Data Pipeline, helps you directly load and transfer data from PostgreSQL to Redshift in a completely hassle-free & automated manner. Hevo is fully managed: it automates the process of migrating data from PostgreSQL to Redshift, enriching the data, and transforming it into an analysis-ready form, without you having to write a single line of code. Its fault-tolerant architecture ensures that your data is handled in a secure, consistent manner while being transferred from PostgreSQL to Redshift for analysis through BI tools and for training Machine Learning models.
Using Hevo Data, you can connect PostgreSQL to Redshift in the following 2 steps:
- Authenticate Source: You must authenticate your PostgreSQL account as the data source for Hevo Data as shown in the below image.
- Configure Destination: You must configure your Redshift account as the data destination for Hevo Data as shown in the image below.
With this, you have successfully set up PostgreSQL to Redshift Integration using Hevo Data. Here are some of the key features that Hevo offers:
- Blazing Fast Setup: Hevo comes with a no-code and highly intuitive interface that allows you to create a data pipeline in minutes with only a few clicks. Moreover, you don’t need any extensive training to use Hevo; even non-data professionals can set up their own data pipelines seamlessly.
- Built To Scale: As the number of your data sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency. This ensures the long-term viability of your business.
- Ample Connectors: Hevo’s fault-tolerant data pipeline offers you a secure option to unify data from 150+ data sources and store it in a data warehouse of your choice.
- Best Security: Hevo handles your data in a secure and consistent way to ensure zero data loss. The database and API credentials you provide are encrypted with keys specific to you and can’t be accessed by anyone apart from the application itself.
- Smooth Schema Mapping: Hevo takes away the tedious task of schema management and automatically detects the schema of incoming data to map it to the destination schema.
- Live Support: The Hevo team is available round the clock to extend exceptional customer support through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor and view samples of incoming data in real-time as it loads from your source into your destination.
- Analysis Ready Data: Hevo houses an in-built functionality for data formatting and transformation that can automatically prepare your data for analysis in minutes.
With continuous real-time data movement, Hevo helps you combine your data from PostgreSQL with your other data sources and seamlessly load it into Redshift with a no-code, easy-to-set-up interface. Try our 14-day full access free trial or check out our transparent pricing to make an informed decision!
Conclusion
This article introduced PostgreSQL and AWS Redshift and provided a step-by-step description of the 2 methods you can use to migrate data from PostgreSQL to Redshift. The first method, which involves building the ETL process manually, has many challenges and limitations. Whereas, if you require real-time data replication and are looking for a fully automated solution, then Hevo is the right choice for you.
With Hevo, you can achieve simple and efficient data migration from PostgreSQL to Redshift in minutes. Hevo can help you replicate data from PostgreSQL and 150+ other data sources (including 50+ free sources) to Redshift or a destination of your choice and visualize it in a BI tool. This makes Hevo the right partner to be by your side as your business scales.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
Share your understanding of PostgreSQL to Redshift migration in the comments section below!