Before diving deep into PostgreSQL to Snowflake data migration steps, let me walk give you a brief introduction about Snowflake. As you might have heard, Snowflake is an analytic data warehouse on the cloud. It is offered as a true Software-as-a-Service (SaaS) – no hardware to configure, no software to install, no maintenance. Snowflake’s architecture is defined newly from scratch, not an extension of the existing Big Data framework like Hadoop.
Snowflake has a hybrid of the traditional shared-disk database and modern shared-nothing database architectures. It uses a central repository for persisted data that is accessible from all compute nodes in the data warehouse and processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the data set. Snowflake processes using “virtual warehouses” which is an MPP compute cluster composed of multiple compute nodes. All components of Snowflake’s service run in a public cloud-like AWS.
Snowflake data warehouse is considered as a cost-effective high performing analytical solution and is used by many organizations for critical workloads. In this post, we will discuss how to move real-time data from PostgreSQL to Snowflake.
Methods to replicate PostgreSQL to Snowflake
There are two popular methods to perform PostgreSQL to Snowflake ETL.
Method 1: A ready to use Hevo, Official Snowflake ETL Partner (7 Days Free Trial).
Method 2: Write a Custom Code to move data from PostgreSQL to Snowflake.
As in the above-shown figure, steps to replicate PostgreSQL to Snowflake using Custom code (Method 2) are as follows:
- Extract data from PostgreSQL using the COPY TO command
- Optional data types conversion and formatting
- Stage Data Files
- Copy staged files to Snowflake table
1. Extract data From PostgreSQL
COPY TO command is the most popular and efficient method to extract data from the PostgreSQL table to a file. We can also use pg_dump utility for the first time full data extraction. We will have a look at both methods.
Extract data using the COPY command
As mentioned above, COPY TO is the command used to move data between PostgreSQL tables and standard file-system files. It copies an entire table or the results of a SELECT query to a file:
COPY table or sql_query TO out_file_name WITH options.
COPY employees TO 'C:\tmp\employees_db.csv' WITH DELIMITER ',' CSV HEADER; COPY (select * from contacts where age < 45) TO 'C:\tmp\young_contacts_db.csv' WITH DELIMITER ',' CSV HEADER;
Some frequently used options are:
- FORMAT – Format of the data to be written are text, CSV or binary (default is text).
- ESCAPE – The character that should appear before a data character that matches the QUOTE value.
- NULL – Represents the string that is a null value. The default is \N (backslash-N) in text and an unquoted empty string in CSV.
- ENCODING – Encoding of the output file. The default value is the current client encoding.
- HEADER – If it set, on the output file, the first line contains the column names from the table.
- QUOTE – The quoting character to be used when a data is quoted. The default is double-quote(“).
- DELIMITER – The character that separates columns within each line of the file.
Next, we can have a look at how COPY command can be used to extract data from multiple tables using a PL/PgSQL procedure. Here, the table named tables_to_extract contains details of tables to be exported.
CREATE OR REPLACE FUNCTION table_to_csv(path TEXT) RETURNS void AS $$ declare tables RECORD; statement TEXT; begin FOR tables IN SELECT (schema || '.' || table_name) AS table_with_schema FROM tables_to_extract LOOP statement := 'COPY ' || tables.table_with_schema || ' TO ''' || path || '/' || tables.table_with_schema || '.csv' ||''' DELIMITER '';'' CSV HEADER'; EXECUTE statement; END LOOP; return; end; $$ LANGUAGE plpgsql; SELECT db_to_csv('/home/user/dir'/dump); -- This will create one csv file per table, in /home/user/dir/dump/
Sometimes you want to extract data incrementally. To do that, add more metadata like the timestamp of last data extraction to the table tables_to_extract and use that information while creating copy command to extract data changed after that timestamp.
Consider you are using a column name last_pull_time corresponding to each table in the table tables_to_extract which stores last successful data pull time. Each time data in the table which are modified after that timestamp has to be pulled. The body of the loop in procedure will change like this:
Here a dynamic SQL is created with predicate comparing last_modified_time_stamp from the table to be extracted and last_pull_time from table list_of_tables.
begin FOR tables IN SELECT (schema || '.' || table_name) AS table_with_schema, last_pull_time AS lt FROM tables_to_extract LOOP statement := 'COPY (SELECT * FROM ' || tables.table_with_schema || ' WHERE last_modified_time_stamp > ' || last_pull_time ') TO ' '' || path || '/' || tables.table_with_schema || '.csv' ||''' DELIMITER '';'' CSV HEADER'; EXECUTE statement; END LOOP; return; End;
Extract data using the pg_dump
As mentioned above, pg_dump is the utility for backing up a PostgreSQL database or tables. It can be used to extract data from the tables also.
pg_dump --column-inserts --data-only --table=<table> <database> > table_name.sql
Here output file table_name.sql will be in the form of INSERT statements like
INSERT INTO my_table (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);
This output has to be converted into a CSV file with the help of a small script in your favorites like Bash or Python.
2. Data type conversion and formatting
There will be domain specific logic to be applied while transferring data. Apart from that following things to be noted while migrating data to avoid surprises.
- Snowflake out of box support number of character sets including UTF-8. Check out the full list of encodings.
- Unlike many other cloud analytical solutions, Snowflake supports SQL constraints like UNIQUE, PRIMARY KEY, FOREIGN KEY, NOT NULL constraints.
- Snowflake by default has a rich set of data types. Below is the list of Snowflake data types and corresponding PostgreSQL types.
|PostgreSQL Data Type||Snowflake Data Type|
|DOUBLE PRECISION||DOUBLE, DOUBLE PRECISION|
|ENUM||No type for ENUM. Must use any type which can represent values in ENUM|
|TIME [WITHOUT TIME ZONE]||TIME|
|TIMESTAMP [WITHOUT TIME ZONE]||TIMESTAMP|
- Snowflake allows almost all of the date/time format. Format it can be explicitly specified while loading data to the table using File Format Option which we will discuss this in detail later. The complete list of supported date/time formats can be found.
3. Stage Data Files
Before inserting the data into a Snowflake table it needs to be uploaded to a temporary location which is called staging. There are two types of stages – internal and external.
Each user and table is automatically allocated an internal stage for data files. It is also possible to create named internal stages.
- The user named and accessed as ‘@~’.
- The name of a table stage will be the same as that of the table.
- The user or table stages can’t be altered or dropped.
- The user or table stages do not support setting file format options.
As mentioned above, Internal Named Stages can be created by the user using the respective SQL statements. It provides a lot of flexibility while loading data by giving options to you to assign file format and other options to named stages.
While running DDL and commands like load data, SnowSQL is quite handy CLI client which can be used to run those commands and is available in Linux/Mac/Windows. Read more about the tool and options.
Below are some example commands to create stage:
Create a names stage:
create or replace stage my_postgres_stage copy_options = (on_error='skip_file') file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);
PUT command is used to stage data files to an internal stage. The syntax of the command is as given below :
PUT file://path_to_file/filename internal_stage_name
Upload a file named cnt_data.csv in the /tmp/postgres_data/data/ directory to an internal stage named postgres_stage.
put file:////tmp/postgres_data/data/cnt_data.csv @postgres_stage;
There are many useful options which can be helpful to improve performance like set parallelism while uploading the file, automatic compression of data files etc. More information about those options is listed here.
Amazon S3 and Microsoft Azure are external staging location currently supported by Snowflake. We can create an external stage with any of those locations and load data to a Snowflake table.
To create an external stage on S3, IAM credentials have to be given. If the data is encrypted, then encryption keys should also be given.
create or replace stage postgre_ext_stage url='s3://snowflake/data/load/files/' credentials=(aws_key_id='111a233b3c' aws_secret_key='abcd4kx5y6z'); encryption=(master_key = 'eSxX0jzYfIjkahsdkjamtnBKONDwOaO8=');
Data to the external stage can be uploaded using AWS or Azure web interfaces. For S3 you can upload using AWS web console or any AWS SDK or third-party tools.
4. Copy staged files from PostgreSQL to Snowflake table.
COPY INTO is the command used to load the contents of the staged file(s) into a Snowflake table. To execute the command compute resources in the form of virtual warehouses are needed. You know more about it this command in the Snowflake ETL best practices.
To load from a named internal stage
COPY INTO postgres_table FROM @postgres_stage;
Loading from the external stage. Only one file is specified.
COPY INTO my_external_stage_table FROM @postgres_ext_stage/tutorials/dataloading/contacts_ext.csv;
You can even copy directly from an external location:
COPY INTO postgres_table FROM s3://mybucket/snow/data/files credentials = (aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY') encryption = (master_key = 'eSxX0jzYfIdsdsdsamtnBKOSgPH5r4BDDwOaO8=') file_format = (format_name = csv_format);
Files can be specified using patterns.
COPY INTO pattern_table FROM @postgre_stage file_format = (type = 'CSV') pattern='.*/.*/.*[.]csv[.]gz';
Some common format options for CSV format supported in the COPY command:
- COMPRESSION – Compression algorithm used for the input data files.
- RECORD_DELIMITER – Records or lines separator character in an input CSV file.
- FIELD_DELIMITER – Character separating fields in the input file.
- SKIP_HEADER – How many header lines to be skipped.
- DATE_FORMAT – To specify the date format.
- TIME_FORMAT – To specify the time format.
Check out the full list of options.
Update Snowflake Table
We have discussed how to extract data incrementally from PostgreSQL. Now we will look at how to migrate PostgreSQL data into Snowflake effectively.
As we discussed in the introduction, Snowflake is not based on any big data framework and do not have any limitations for row-level updates like in systems like Hive. It supports row-level updates makes delta data migration much easier. Basic idea is to load incrementally extracted data into an intermediate table and modify records in final table as per data in the intermediate table.
There are three popular methods to modify the final table once data is loaded into the intermediate table.
- Update the rows in the final table and insert new rows from the intermediate table which are not in the final table.
UPDATE final_target_table t SET t.value = s.value FROM intermed_delta_table in WHERE t.id = in.id;
INSERT INTO final_target_table (id, value) SELECT id, value FROM intermed_delta_table WHERE NOT id IN (SELECT id FROM final_target_table);
- Delete all records from the final table which are in the intermediate table. Then insert all rows from the intermediate table to the final table.
DELETE .final_target_table f WHERE f.id IN (SELECT id from intermed_delta_table);
INSERT final_target_table (id, value) SELECT id, value FROM intermed_table;
- MERGE statement – Inserts and updates can be done with a single MERGE statement and it can be used to apply changes in the intermediate table to the final table with one SQL statement.
MERGE into final_target_table t1 using intermed_delta_table t2 on t1.id = t2.id WHEN matched then update set value = t2.value WHEN not matched then INSERT (id, value) values (t2.id, t2.value);
Easier Way to move data from PostgreSQL to Snowflake
Using Hevo, official Snowflake ETL partner you can easily load data from PostgreSQL to Snowflake with just 3 simple steps. This does not need you to write any code and will provide you with an error-free, fully managed set up to move data in minutes.
- Connect to your PostgreSQL database.
- Select the replication mode: (a) load selected PostgreSQL tables (b) load data via Custom Query
- Configure the Snowflake destination where it should be copied.
High performing Data Warehouse solutions like Snowflake are getting more adoption and are becoming an integral part of a modern analytics pipeline. Migrating data from various data sources to this kind of cloud-native solution requires expertise in the cloud, data security and many other things like metadata management. There are data pipeline or ETL platforms with in-built intelligence to take care of all aspects of data movement including security which might be a better choice most of the time – so that you can concentrate on your business. Sign up for a free trial on Hevo to create your safe and fast PostgreSQL to Snowflake data pipeline.