PostgreSQL is an open source Object-relational Database System that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads. It works as a traditional OLTP Database.
Amazon Redshift is a fully managed Database optimized for Data Warehousing. If your objective is to build a scalable OLAP system then you can choose Redshift. It is based on PostgreSQL 8.0.2. Amazon Redshift being a columnar Database supports scalable architecture and multi-node processing. As it is offered on the Amazon Cloud Computing platform, there is no initial setup effort required and users can get started in minutes by using simple API calls or the AWS Management Console.
Often for analytical or archival purposes, one would need to move data from PostgreSQL to Redshift. This article explains 2 simple methods that you need to follow to migrate your data from PostgreSQL to Redshift for building a Data Warehousing solution. Read along to learn more about these methods and decide which method suits you the best!
Table of Contents
- Introduction to PostgreSQL
- Introduction to Amazon Redshift
- Methods to Connect PostgreSQL to Redshift
- A PostgreSQL account.
- An Amazon Redshift account.
- Working knowledge of Redshift commands.
Introduction to PostgreSQL
PostgreSQL is a powerful open-source Object-Relational Database. It not only uses the SQL language but also adds many features to it, allowing you to store and expand very complex data workloads. It has a reputation for proven Architecture, Data Integrity, Reliability, and Scalability. Powerful features and commitment to a rich open source community are some aspects of this Database.
Due to its Robustness and Feature-Rich Query Layer, PostgreSQL is very popular in use cases that traditionally require strict table structure. It has been developed for more than 30 years and most of the major problems have been solved. This fact makes people confident in using it in a business environment
To learn more about PostgreSQL, visit here.
Introduction to Amazon Redshift
AWS Redshift is a Data Warehouse built using MPP (massively parallel processing) architecture. It is owned and maintained by Amazon Web Services (AWS). It has the ability to handle a large volume of data and analytical workloads. It is completely managed, easy to set up, and has a high performance even for large data sets. These features combined with its pricing have made it one of the preferred data warehouses among the modern data teams.
To learn more about Amazon Redshift, visit here.
Methods to Connect PostgreSQL to Redshift
This method requires you to manually custom code the ETL process that will connect PostgreSQL to Redshift. It will use various Redshift commands to build this ETL process.
Hevo Data, an Automated No-code Data Pipeline, provides a hassle-free solution and helps you transfer data from PostgreSQL to Redshift within minutes without having to write any code. Hevo comes with a graphical interface that allows you to configure your PostgreSQL source and load data in real-time to the Redshift destination. Hevo’s pre-built integrations with 100+ data sources (including 40+ free data sources) will take full charge of the Data Integration process, allowing you to focus on key business activities.
Methods to Connect PostgreSQL to Redshift
The following 2 methods can be used to connect PostgreSQL to Redsift for data transfer process:
- Method 1: Manual ETL Process to Connect PostgreSQL to Redshift
- Method 2: Using Hevo Data to Connect PostgreSQL to Redshift
Method 1: Manual ETL Process to Connect PostgreSQL to Redshift
You can build a pipeline to load data fro PostgreSQL to Redshift using the following steps:
- Step 1: Build a Compatible Schema on Redshift
- Step 2: Extracting Data from PostgreSQL to S3 Buckets
- Step 3: Load Data from S3 to Temporary Table on Redshift
Each of these steps are elaborated along with code snippets in the sections below. Let’s dive in
Step 1: Build a Schema in the Target Database
The first step of this migrating PostgreSQL to Redshift is to keep your target database ready by building an Amazon Redshift compatible schema.
Postgres’s table definition is stored in the information_schema.tables and column definition stored in information_schema.columns. Redshift’s column types differ slightly from Postgres and a few of them are not supported.
Here is a list of all Postgres data types that are not supported in Redshift:
- BIT, BIT VARYING
- Composite Types
- Date/Time Types
- Enumerated Types
- Geometric Types
- Network Address Types
- Numeric Types
- SERIAL, BIGSERIAL, SMALLSERIAL
- Object Identifier Types
- Range Types
- Text Search Types
You will have to convert the above data types into Redshift compatible types first.
Redshift makes use of Sort key and Distribution key to achieve performance gain. It is very important to choose these keys wisely at the time of table creation. With Sort Keys, you can configure the column on which sorting needs to be applied upon storing in Redshift. Distribution key is used to distribute data across multiple nodes based on the values of a particular column.
You can read more about choosing Amazon Redshift sort keys and Amazon Redshift distribution keys here.
Let us take an example of migrating Customer table from PostgreSQL to Redshift. To create a table in Redshift, we can use below command:
CREATE TABLE CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP) DISTSTYLE KEY DISTKEY (ID) SORTKEY (ID, BIRTH_DATE);
Note that the maximum character length supported by Redshift is 65535 hence any field larger than that (e.g. text) will need to be truncated/ split across multiple columns.
Step 2: Extracting Data from PostgreSQL to S3 Buckets
The best way to load a large amount of data to Redshift table is to use a COPY command. Using COPY command, you can load data from various sources like Amazon S3, Amazon EMR, and Remote Host(SSH). The most commonly used source for COPY command is Amazon S3 as it offers the best performance by loading multiple data files in parallel.
You can use PostgreSQL copy statement to export a table data into a file using specific delimiters. You will need to login to the database via command line with the psql command. The following command connects to PostgreSQL Database and exports all data from customers table (an example) to a file customers_data.csv
CONNECT DATABASE_NAME; COPY (SELECT (ID, NAME, ADDRESS, BIRTH_DATE) FROM CUSTOMERS) TO ‘CUSTOMERS_DATA.CSV WITH DELIMITER '|' CSV;
You can zip the data files for maximum performance since Redshift gives an option to directly import zipped files.
$ gzip customers_data.csv
Now the next step is to upload these files to Amazon S3 bucket. It can be done either manually from AWS Console or by using Amazon CLI. The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts. To install AWS CLI on your machine you can refer to AWS documentation here.
Post setup of AWS CLI on your machine, you can use CP command to copy data from your local machine to S3 bucket. Below command uploads customer_data.csv.gzip into a folder “my_folder” which is located inside an S3 bucket named as “my_s3_bucket”.
$ aws s3 cp customer_data.csv.gzip s3://my_s3_bucket/my_folder/
Step 3: Loading Data from S3 to Redshift Temporary Tables
By now, your data files that are to be imported are ready in S3. The final step is to load this data into Redshift target table. COPY command inserts the data from a file to a table. An important thing to note is that if the table is not empty then it will simply append the new records without changing the existing ones. So, the best approach is to load the data from the file into a temporary table and then inserting into final target table using insert into command. With this approach, you can easily handle situations where you want to upsert records instead of appending them. The following steps are required:
- Create a Temporary Table in Redshift: Below SQL will create a temporary table named “TEMP_CUSTOMERS” in Redshift. Temporary tables are valid for a session in which they are created and will be auto-dropped once the session is over. The following command creates a temporary customer table.
CREATE TEMPORARY TABLE TEMP_CUSTOMERS (ID INT, NAME VARCHAR(20), ADDRESS VARCHAR(50), BIRTH_DATE TIMESTAMP) DISTSTYLE ALL SORTKEY (ID, BIRTH_DATE);
- Copy Data from S3 to a Temporary Table in Redshift
COPY TEMP_CUSTOMERS FROM 's3://my_s3_bucket/my_folder/customers_data.csv.gz' 'AWS_ACCESS_KEY_ID=MY_ACCESS_KEY AWS_SECRET_ACCESS_KEY=MY_SECRET_KEY' GZIP;
- Insert Data from the Temporary Table to the Target Table: Now that you have data in Redshift temporary table, you can simply execute an insert query to load this into the final target table.
INSERT INTO CUSTOMERS SELECT * FROM TEMP_CUSTOMERS;
Loading data into a temporary table first instead of directly inserting it into the target table helps in data validation. An important thing to note while using COPY command is to handle NULLS. You can specify which character in your data file should be mapped as NULL in Redshift table. If NULL values are stored as blank in your data file then you use EMPTYASNULL option in COPY command to convert blank values into NULL values.
Any errors while loading the data can be checked in the STL_LOAD_ERRORS table in Redshift. The table gives you the row that was being loaded and the associated error like column length exceeded, inserting text to a number field, etc.
Limitations of Using Custom ETL Scripts
These challenges have an impact on ensuring that you have consistent and accurate data available in your Redshift in near Real-Time.
- The Custom ETL Script method works well only if you have to move data only once or in batches from PostgreSQL to Redshift.
- The Custom ETL Script method also fails when you have to move data in near real-time from PostgreSQL to Redshift.
- A more optimal way is to move incremental data between two syncs from Postgres to Redshift instead of full load. This method is called the Change Data Capture method.
- When you write custom SQL scripts to extract a subset of data often those scripts break as the source schema keeps changing or evolving.
Method 2: Using Hevo Data to Connect PostgreSQL to Redshift
Hevo Data, an Automated No-code Data Pipeline, helps you directly load and transfer data from PostgreSQL to Redshift in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of migrating data from PostgreSQL to Redshift and also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that your data is handled in a secure, consistent manner while transferring from PostgreSQL to Redshift for analyzing data through BI tools and training Machine Learning models.
Using Hevo Data, you can connect PostgreSQL to Redshift in the following 2 steps:
- Authenticate Source: You must authenticate your PostgreSQL account as the data source for Hevo Data as shown in the below image.
- Configure Destination: You must configure your Redshift account as the data destination for Hevo Data as shown in the image below.
With this, you have successfully set up PostgreSQL to Redshift Integration using Hevo Data.
Here are more reasons to love Hevo:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
With continuous real-time data movement, Hevo helps you combine your data from PostgreSQL along with your other data sources and seamlessly load it into Redshift with a no-code, easy to setup interface. Try our 14-day full access free trial!Get Started with Hevo for Free
This article introduced PostgreSQL and AWS Redshift and provided a step-by-step description of the 2 methods you can use to connect PostgreSQL to Redshift. The first method which involves building the ETL process manually has many challenges and limitations. Whereas, if you require Real-Time data replication and looking for a fully automated real-time solution, then Hevo is the right choice for you.
With Hevo, you can achieve simple and efficient data migration from PostgreSQL to Redshift in minutes. Hevo can help you replicate data from PostgreSQL and 100+ data sources (including 40+ data sources) to Redshift or a destination of your choice and visualize it in a BI tool. This makes Hevo the right partner to be by your side as your business scales.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
Share your understanding of connecting PostgreSQL to Redshift in the comments section below!