How to Replicate Postgres to Snowflake: 4 Easy Steps
Before diving deep into PostgreSQL to Snowflake data migration steps, let me give you a brief introduction to Snowflake. As you might have heard, Snowflake is an Analytic Data Warehouse on the Cloud. It is offered as a true Software-as-a-Service (SaaS) – no hardware to configure, no software to install, no maintenance. Snowflake’s architecture is defined newly from scratch, not an extension of the existing Big Data framework like Hadoop.
Table of Contents
Snowflake has a hybrid of the traditional shared-disk database and modern shared-nothing database architectures. It uses a central repository for persisted data that is accessible from all compute nodes in the data warehouse and processes queries using MPP (Massively Parallel Processing) compute clusters where each node in the cluster stores a portion of the data set. Snowflake processes using “Virtual Warehouses” which is an MPP compute cluster composed of multiple compute nodes. All components of Snowflake’s service run in a public Cloud-like AWS Redshift.
Snowflake Data Warehouse is considered a cost-effective high performing analytical solution and is used by many organizations for critical workloads. In this post, we will discuss how to move real-time data from Postgres to Snowflake. So, read along and understand the steps to migrate data from Postgres to Snowflake.
Table Of Contents
- What is Postgres?
- What is Snowflake?
- Method 1: Use Hevo ETL to Move Data From Postgres to Snowflake With Ease
- Method 2: Write a Custom Code to Move Data from Postgres to Snowflake
- Updating Snowflake Table
What is Postgres?
Postgres is an open-source Relational Database Management System (RDBMS) developed at the University of California, Berkeley. It is widely known for reliability, feature robustness, and performance, and has been in use for over 20 years.
Postgres supports not only object-relational data but also supports complex structures and a wide variety of user-defined data types. This gives Postgres a definitive edge over other open-source SQL databases like MySQL, MariaDB, and Firebird.
Businesses rely on Postgres as their primary data storage/data warehouse for online, mobile, geospatial, and analytics applications. Postgres runs on all major operating systems, including Linux, UNIX (AIX, BSD, HP-UX, SGI IRIX, Mac OS X, Solaris, Tru64), and Windows.
What is Snowflake?
Snowflake is a fully-managed Cloud-based Data Warehouse that helps businesses modernize their analytics strategy. Snowflake can query both structured and unstructured data using standard SQL. It delivers results of user queries spanning Gigabytes and Petabytes of data in seconds.
Snowflake automatically harnesses thousands of CPU cores to quickly execute queries for you. You can even query streaming data from your web, mobile apps, or IoT devices in real-time.
Snowflake comes with a web-based UI, a command-line tool, and APIs with client libraries that makes interacting with Snowflake pretty simple. Snowflake is secure and meets the most secure regulatory standards such as HIPAA, FedRAMP, and PCI DSS. When you store your data in Snowflake, your data is encrypted in transit and at rest by default and it’s automatically replicated, restored, and backed up to ensure business continuity.
In this guide, we explore Postgres to Snowflake data migration, using which you can perform advanced analytics on a system without having to worry about the production setup. This Postgres to Snowflake guide lists two steps to achieve Postgres to Snowflake data migration. Here are those:
Methods to Replicate Data From Postgres to Snowflake
There are two popular methods to perform Postgres to Snowflake ETL.
Method 1: A ready-to-use Hevo, Official Snowflake ETL Partner (14 Days Free Trial) to move data from Postgres to Snowflake.
Hevo Data, a No-Code Automation Platform can set up your Data Pipelines in minutes and start migrating your data from Postgres to Snowflake Data Warehouse effortlessly. Hevo supports multiple Database Connectors like PostgreSQL, MySQL, MS SQL Server, and Azure SQL Database and doesn’t require any maintenance or engineering support from your team.
GET STARTED WITH HEVO FOR FREE[/hevoButton]
Method 2: Write a Custom Code to Move Data from Postgres to Snowflake.
The second method of migrating data from Postgres to Snowflake involves writing a custom code. This is a four-step process that involves extracting data from Postgres using the COPY TO command, optional data type conversions and formatting, staging of data files, and copying these staged files from Postgres to the Snowflake table.
Method 1: Use Hevo ETL to Move Data From Postgres to Snowflake With Ease
Using Hevo, official Snowflake ETL partner you can easily load data from Postgres to Snowflake with just 3 simple steps: Select your Source, Provide Credentials, and Load to Destination.
This does not need you to write any code and will provide you with an error-free, fully managed set up to move data in minutes.
Check out some of the cool features of Hevo:
- Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
- Real-time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
- 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
- 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Method 2: Write a Custom Code to Move Data from Postgres to Snowflake
As in the above-shown figure, the four steps to replicate Postgres to Snowflake using custom code (Method 2) are as follows:
1. Extract Data from Postgres
COPY TO command is the most popular and efficient method to extract data from the Postgres table to a file. We can also use the pg_dump utility for the first time for full data extraction. We will have a look at both methods.
a. Extract Data Using the COPY Command
As mentioned above, COPY TO is the command used to move data between Postgres tables and standard file-system files. It copies an entire table or the results of a SELECT query to a file:
COPY table or sql_query TO out_file_name WITH options.
COPY employees TO 'C:tmpemployees_db.csv' WITH DELIMITER ',' CSV HEADER; COPY (select * from contacts where age < 45) TO 'C:tmpyoung_contacts_db.csv' WITH DELIMITER ',' CSV HEADER;
Some frequently used options are:
- FORMAT: The format of the data to be written are text, CSV, or binary (default is text).
- ESCAPE: The character that should appear before a data character that matches the QUOTE value.
- NULL: Represents the string that is a null value. The default is N (backslash-N) in text and an unquoted empty string in CSV.
- ENCODING: Encoding of the output file. The default value is the current client encoding.
- HEADER: If it is set, on the output file, the first line contains the column names from the table.
- QUOTE: The quoting character to be used when data is quoted. The default is double-quote(“).
- DELIMITER: The character that separates columns within each line of the file.
Next, we can have a look at how the COPY command can be used to extract data from multiple tables using a PL/PgSQL procedure. Here, the table named tables_to_extract contains details of tables to be exported.
CREATE OR REPLACE FUNCTION table_to_csv(path TEXT) RETURNS void AS $ declare tables RECORD; statement TEXT; begin FOR tables IN SELECT (schema || '.' || table_name) AS table_with_schema FROM tables_to_extract LOOP statement := 'COPY ' || tables.table_with_schema || ' TO ''' || path || '/' || tables.table_with_schema || '.csv' ||''' DELIMITER '';'' CSV HEADER'; EXECUTE statement; END LOOP; return; end; $ LANGUAGE plpgsql; SELECT db_to_csv('/home/user/dir'/dump); -- This will create one csv file per table, in /home/user/dir/dump/
Sometimes you want to extract data incrementally. To do that, add more metadata like the timestamp of the last data extraction to the table tables_to_extract and use that information while creating the COPY command to extract data changed after that timestamp.
Consider you are using a column name last_pull_time corresponding to each table in the table tables_to_extract which stores the last successful data pull time. Each time data in the table which are modified after that timestamp has to be pulled. The body of the loop in procedure will change like this:
Here a dynamic SQL is created with a predicate comparing last_modified_time_stamp from the table to be extracted and last_pull_time from table list_of_tables.
begin FOR tables IN SELECT (schema || '.' || table_name) AS table_with_schema, last_pull_time AS lt FROM tables_to_extract LOOP statement := 'COPY (SELECT * FROM ' || tables.table_with_schema || ' WHERE last_modified_time_stamp > ' || last_pull_time ') TO ' '' || path || '/' || tables.table_with_schema || '.csv' ||''' DELIMITER '';'' CSV HEADER'; EXECUTE statement; END LOOP; return; End;
b. Extract Data Using the pg_dump
As mentioned above, pg_dump is the utility for backing up a Postgres database or tables. It can be used to extract data from the tables also.
pg_dump --column-inserts --data-only --table=<table> <database> > table_name.sql
Here output file table_name.sql will be in the form of INSERT statements like
INSERT INTO my_table (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);
This output has to be converted into a CSV file with the help of a small script in your favorites like Bash or Python.
2. Data Type Conversion from Postgres to Snowflake
There will be domain-specific logic to be applied while transferring data. Apart from that following things are to be noted while migrating data to avoid surprises.
- Snowflake out-of-the-box supports a number of character sets including UTF-8. Check out the full list of encodings.
- Unlike many other cloud analytical solutions, Snowflake supports SQL constraints like UNIQUE, PRIMARY KEY, FOREIGN KEY, NOT NULL constraints.
- Snowflake by default has a rich set of data types. Below is the list of Snowflake data types and corresponding PostgreSQL types.
|PostgreSQL Data Type||Snowflake Data Type|
|DOUBLE PRECISION||DOUBLE, DOUBLE PRECISION|
|ENUM||No type for ENUM. Must use any type which can represent values in ENUM|
|TIME [WITHOUT TIME ZONE]||TIME|
|TIMESTAMP [WITHOUT TIME ZONE]||TIMESTAMP|
- Snowflake allows almost all of the date/time formats. The format can be explicitly specified while loading data to the table using the File Format Option which we will discuss in detail later. The complete list of supported date/time formats can be found.
3. Stage Data Files
Before inserting data from Postgres to Snowflake table it needs to be uploaded to a temporary location which is called staging. There are two types of stages – internal and external.
a. Internal Stage
Each user and table is automatically allocated an internal stage for data files. It is also possible to create named internal stages.
- The user named and accessed as ‘@~’.
- The name of the table stage will be the same as that of the table.
- The user or table stages can’t be altered or dropped.
- The user or table stages do not support setting file format options.
As mentioned above, Internal Named Stages can be created by the user using the respective SQL statements. It provides a lot of flexibility while loading data by giving options to you to assign file format and other options to named stages.
While running DDL and commands like load data, SnowSQL is quite a handy CLI client which can be used to run those commands and is available in Linux/Mac/Windows. Read more about the tool and options.
Below are some example commands to create a stage:
Create a names stage:
create or replace stage my_postgres_stage copy_options = (on_error='skip_file') file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);
PUT command is used to stage data files to an internal stage. The syntax of the command is as given below :
PUT file://path_to_file/filename internal_stage_name
Upload a file named cnt_data.csv in the /tmp/postgres_data/data/ directory to an internal stage named postgres_stage.
put file:////tmp/postgres_data/data/cnt_data.csv @postgres_stage;
There are many useful options that can be helpful to improve performance like set parallelism while uploading the file, automatic compression of data files, etc. More information about those options is listed here.
b. External Stage
Amazon S3 and Microsoft Azure are external staging locations currently supported by Snowflake. We can create an external stage with any of those locations and load data to a Snowflake table.
To create an external stage on S3, IAM credentials have to be given. If the data is encrypted, then encryption keys should also be given.
create or replace stage postgre_ext_stage url='s3://snowflake/data/load/files/' credentials=(aws_key_id='111a233b3c' aws_secret_key='abcd4kx5y6z'); encryption=(master_key = 'eSxX0jzYfIjkahsdkjamtnBKONDwOaO8=');
Data to the external stage can be uploaded using AWS or Azure web interfaces. For S3 you can upload using the AWS web console or any AWS SDK or third-party tools.
4. Copy Staged Files from Postgres to Snowflake Table
COPY INTO is the command used to load the contents of the staged file(s) from Postgres to the Snowflake table. To execute the command compute resources in the form of virtual warehouses are needed. You know more about it this command in the Snowflake ETL best practices.
To load from a named internal stage
COPY INTO postgres_table FROM @postgres_stage;
Loading from the external stage. Only one file is specified.
COPY INTO my_external_stage_table FROM @postgres_ext_stage/tutorials/dataloading/contacts_ext.csv;
You can even copy directly from an external location:
COPY INTO postgres_table FROM s3://mybucket/snow/data/files credentials = (aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY') encryption = (master_key = 'eSxX0jzYfIdsdsdsamtnBKOSgPH5r4BDDwOaO8=') file_format = (format_name = csv_format);
Files can be specified using patterns.
COPY INTO pattern_table FROM @postgre_stage file_format = (type = 'CSV') pattern='.*/.*/.*[.]csv[.]gz';
Some common format options for CSV format supported in the COPY command:
- COMPRESSION: Compression algorithm used for the input data files.
- RECORD_DELIMITER: Records or lines separator characters in an input CSV file.
- FIELD_DELIMITER: Character separating fields in the input file.
- SKIP_HEADER: How many header lines are to be skipped.
- DATE_FORMAT: To specify the date format.
- TIME_FORMAT: To specify the time format.
Check out the full list of options. So, now you have finally loaded data from Postgres to Snowflake.
Update Snowflake Table
We have discussed how to extract data incrementally from PostgreSQL. Now we will look at how to migrate data from Postgres to Snowflake effectively.
As we discussed in the introduction, Snowflake is not based on any big data framework and does not have any limitations for row-level updates like in systems like Hive. It supports row-level updates making delta data migration much easier. Basic idea is to load incrementally extracted data into an intermediate table and modify records in the final table as per data in the intermediate table.
There are three popular methods to modify the final table once data is loaded into the intermediate table.
- Update the rows in the final table and insert new rows from the intermediate table which are not in the final table.
UPDATE final_target_table t SET t.value = s.value FROM intermed_delta_table in WHERE t.id = in.id;
INSERT INTO final_target_table (id, value) SELECT id, value FROM intermed_delta_table WHERE NOT id IN (SELECT id FROM final_target_table);
- Delete all records from the final table which are in the intermediate table. Then insert all rows from the intermediate table to the final table.
DELETE .final_target_table f WHERE f.id IN (SELECT id from intermed_delta_table);
INSERT final_target_table (id, value) SELECT id, value FROM intermed_table;
- MERGE statement: Inserts and updates can be done with a single MERGE statement and it can be used to apply changes in the intermediate table to the final table with one SQL statement.
MERGE into final_target_table t1 using intermed_delta_table t2 on t1.id = t2.id WHEN matched then update set value = t2.value WHEN not matched then INSERT (id, value) values (t2.id, t2.value);
High-performing Data Warehouse solutions like Snowflake are getting more adoption and are becoming an integral part of a modern analytics pipeline. Migrating data from various data sources to this kind of cloud-native solution i.e from Postgres to Snowflake, requires expertise in the cloud, data security, and many other things like metadata management.
If you are looking for an ETL tool that facilitates the automatic migration and transformation of data from Postgres to Snowflake, then Hevo is the right choice for you. Hevo is a No-code Data Pipeline. It supports pre-built integration from 100+ data sources at a reasonable price. With Hevo, you can perfect, modify and enrich your data conveniently.
visit our website to explore hevo[/hevoButton]
SIGN UP for a 14-day free trial and see the difference!
Have any further queries about PostgreSQL to Snowflake? Get in touch with us in the comments section below.