Often businesses have a different Database to store transactions (Eg: Amazon Aurora) and another Data Warehouse (Eg. Snowflake) for the company’s Analytical needs. There are 2 prime reasons to move data from your transactional Database to a Warehouse (Eg: Aurora to Snowflake).
Firstly, the transaction Database is optimized for fast writes and responses. Running Analytics queries on large data sets with many aggregations and Joins will slow down the Database. This might eventually take a toll on the customer experience. Secondly, Data Warehouses are built to handle scaling data sets and Analytical queries. Moreover, they can host the data from multiple data sources and aid in deeper analysis.
This post will introduce you to Aurora and Snowflake. It will also highlight the steps to move data from Aurora to Snowflake. In addition, you will explore some of the limitations associated with this method. You will be introduced to an easier alternative to solve these challenges. So, read along to gain insights and understand how to migrate data from Aurora to Snowflake.
Table of Contents
Understanding Aurora and Snowflake
AWS RDS (Relational Database) is the initial Relation Database service from AWS which supports most of the open-source and proprietary databases. Open-source offerings of RDS like MySQL and PostgreSQL are much cost-effective compared to enterprise Database solutions like Oracle. But most of the time open-source solutions require a lot of performance tuning to get par with enterprise RDBMS in performance and other aspects like concurrent connections.
AWS introduced a new Relational Database service called Aurora which is compatible with MySQL and PostgreSQL to overcome the much-known weakness of those databases costing much lesser than enterprise Databases. No wonder many organizations are moving to Aurora as their primary transaction Database system.
On the other end, Snowflake might be the best cost-effective and fast Data Warehousing solution. It has dynamically scaling compute resources and storage is completely separated and billed. Snowflake can be run on different Cloud vendors including AWS. So data movement from Aurora to Snowflake can also be done with less cost. Read about Snowflake’s features here.
This post will discuss Method 1 in detail to migrate data from Aurora to Snowflake. The blog will also highlight the limitations of this approach and the workarounds to solve them.
Move Data from Aurora to Snowflake using ETL Scripts
The steps to replicate data from Amazon Aurora to Snowflake are as follows:
1. Extract Data from Aurora Cluster to S3
SELECT INTO OUTFILE S3 statement can be used to query data from an Aurora MySQL cluster and save the result to S3. In this method, data reaches the client-side in a fast and efficient manner. To save data to S3 from an Aurora cluster proper permissions need to be set. For that –
- Create a proper IAM policy to access S3 objects – Refer to AWS documentation here.
- Create a new IAM role, and attach the IAM policy you created in the above step.
- Set aurora_select_into_s3_role or aws_default_s3_role cluster parameter to the ARN of the new IAM role.
- Associate the IAM role that you created with the Aurora cluster.
- Configure the Aurora cluster to allow outbound connections to S3 – Read more on this here.
Other important points to be noted while exporting data to S3:
- User Privilege – The user that issues the SELECT INTO OUTFILE S3 should have the privilege to do so.
To grant access –
GRANT SELECT INTO S3 ON *.* TO 'user'@'domain'.
- Output Files – The output is stored as delimited text files. As of now compressed or encrypted files are not supported.
- Overwrite Existing File – Set option OVERWRITE ON to delete if a file with exact name exists in S3.
- The default file size is 6 GB. If the data selected by the statement is lesser then a single file is created. Otherwise, multiple files are created. No rows will be split across file boundaries.
- If the data volume to be exported is larger than 25 GB, it is recommended to run multiple statements to export data. Each statement for a different portion of data.
- No metadata like table schema will be uploaded to S3
- As of now, there is no direct way to monitor the progress of data export. One simple method is set to manifest option on and the manifest file will be the last file created. Examples:
- The below statement writes to S3 of located in a different region. Each field is terminated by a comma and each row is terminated by ‘n’.
SELECT * FROM students INTO OUTFILE S3 's3-us-west-2://aurora-out/sample_students_data'
FIELDS TERMINATED BY ','
LINES TERMINATED BY 'n';
- Below is another example that writes to S3 of located in the same region. A manifest file will also be created.
SELECT * FROM students INTO OUTFILE S3 's3://aurora-out/sample_students_data'
FIELDS TERMINATED BY ','
LINES TERMINATED BY 'n'
2. Convert Data Types and Format them
There might be data transformations corresponding to business logic or organizational standards to be applied while transferring data from Aurora to Snowflake. Apart from those high-level mappings, some basic things to be considered generally are listed below:
- All popular character sets including UTF-8, UTF-16 are supported by Snowflake. The full list can be found here.
- Many Cloud-based and open source Big Data systems compromise on standard Relational Database constraints like Primary Key. But, note that Snowflake supports all SQL constraints like UNIQUE, PRIMARY KEY, FOREIGN KEY, NOT NULL constraints. This might be helpful when you load data.
- Data types support in Snowflake is fairly rich including nested data structures like an array. Below is the list of Snowflake data types and corresponding MySQL Aurora types.
|Aurora(MySQL) Data Type||Snowflake Data Type|
|FLOAT||FLOAT, FLOAT4, FLOAT8|
|DOUBLE||DOUBLE, DOUBLE PRECISION, REAL|
|ENUM||No type for ENUM.Must use any type which can represent values in ENUM|
|SET||No type for SET.Must use any type which can represent values in SET|
- Snowflake is really flexible with the date or time format. If a custom format is used in your file that can be explicitly specified using the File Format Option while loading data to the table. The complete list of date and time formats can be found here.
3. Stage Data Files to the Snowflake Staging Area
Snowflake requires the data to be uploaded to a temporary location before loading to the table. This temporary location is an S3 location that Snowflake has access to. This process is called staging. The snowflake stage can be either internal or external.
(A) Internal Stage
In Snowflake, each user and table is automatically assigned to an internal stage for data files. It is also possible internal stages explicitly and can be named.
- The stage assigned to the user is named as ‘@~’.
- The stage assigned to a table will have the name of the table.
- The default stages assigned to a user or table can’t be altered or dropped.
- The default stages assigned to a user or table do not support setting file format options.
As mentioned above, internal stages can also be created explicitly by the user using SQL statements. While creating stages explicitly like this, many data loading options can be assigned to those stages like file format, date format, etc.
While interacting with Snowflake for data loading or creating tables, SnowSQL is a very handy CLI client available in Linux/Mac/Windows which can be used to run Snowflake commands. Read more about the tool and options here.
Below are some example commands to create a stage:
- Create a named internal stage as shown below:
my_aurora_stage and assign some default options:
create or replace stage my_aurora_stage
copy_options = (on_error='skip_file')
file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);
- PUT is the command used to stage files to an internal Snowflake stage. The syntax of the PUT command is :
PUT file://path_to_file/filename internal_stage_name
Upload a file named students_data.csv in the /tmp/aurora_data/data/ directory to an internal stage named aurora_stage.
put file:////tmp/aurora_data/data/students_data.csv @aurora_stage;
Snowflake provides many options which can be used to improve the performance of data load like the number of parallelisms while uploading the file, automatic compression, etc. More information and the complete list of options are listed here.
(B) External Stage
Just like the internal stage Snowflake supports Amazon S3 and Microsoft Azure as an external staging location. If data is already uploaded to an external stage that can be accessed from Snowflake, that data can be loaded directly to the Snowflake table. No need to move the data to an internal stage.
To create an external stage on S3, IAM credentials with proper access permissions need to be provided. In case the data is encrypted, encryption keys should be provided.
create or replace stage aurora_ext_stage url='s3://snowflake_aurora/data/load/files/'
encryption=(master_key = 'eSxX0jzsdsdYfIjkahsdkjamNNNaaaDwOaO8=');
Data can be uploaded to the external stage with respective Cloud services. Data from Amazon Aurora will be exported to S3 and that location itself can be used as an external staging location which helps to minimize data movement.
4. Import Staged Files to Snowflake Table
Now data is present in an external or internal stage and has to be loaded to a Snowflake table. The command used to do this is COPY INTO. To execute the COPY INTO command compute resources in the form of Snowflake virtual warehouses are required and will be billed as per consumption.
- To load from a named internal stage:
copy into aurora_table
- To load data from the external stage. Only a single file is specified.
copy into my_external_stage_table
- You can even copy directly from an external location:
copy into aurora_table
encryption=(master_key = 'eSxX009jhh76jkIuLPH5r4BD09wOaO8=')
file_format = (format_name = csv_format);
- Files can be specified using patterns:
copy into aurora_pattern_table
file_format = (type = 'TSV')
Some commonly used options for CSV file loading using the COPY command
- COMPRESSION to specify compression algorithm used for the files
- RECORD_DELIMITER to indicate lines separator character
- FIELD_DELIMITER is the character separating fields in the file
- SKIP_HEADER is the number of header lines skipped
- DATE_FORMAT is the date format specifier
- TIME_FORMAT is the time format specifier
There are many other options. For the full list click here.
5. Update Snowflake Table
So far the blog talks about how to extract data from Aurora and simply insert it into a Snowflake table. Next, let’s look deeper into how to handle incremental data upload to the Snowflake table.
Snowflake’s architecture is unique. It is not based on any current/existing big data framework. Snowflake does not have any limitations for row-level updates. This makes delta data uploading to a Snowflake table much easier compared to systems like Hive. The way forward is to load incrementally extracted data to an intermediate table. Next, as per the data in the intermediate table, modify the records in the final table.
3 common methods that are used to modify the final table once data is loaded into a landing table ( intermediate table) are mentioned below.
1. Update the rows in the target table. Next, insert new rows from the intermediate or landing table which are not in the final table.
UPDATE aurora_target_table t SET t.value = s.value
FROM landing_delta_table in WHERE t.id = in.id;
INSERT INTO auroa_target_table (id, value)
SELECT id, value FROM landing_delta_table WHERE NOT id IN (SELECT id FROM aurora_target_table);
2. Delete all records from the target table which are in the landing table. Then insert all rows from the landing table to the final table.
DELETE .aurora_target_table f
WHERE f.id IN (SELECT id from landing_table);
INSERT aurora_target_table (id, value)
SELECT id, value FROM landing_table;
3. MERGE statement – Inserts and updates combined in a single MERGE statement and it is used to apply changes in the landing table to the target table with one SQL statement.
MERGE into aurora_target_table t1 using landing_delta_table t2 on t1.id = t2.id
WHEN matched then update set value = t2.value
WHEN not matched then INSERT (id, value) values (t2.id, t2.value);
Limitations of Writing Custom ETL Code to Move Data from Aurora to Snowflake
While the approach may look very straightforward to migrate data from Aurora to Snowflake, it does come with limitations. Some of these are listed below:
- You would have to invest precious engineering resources to hand-code the pipeline. This will increase the time for the data to be available in Snowflake.
- You will have to invest in engineering resources to constantly monitor and maintain the infrastructure. Code Breaks, Schema Changes at the source, Destination Unavailability – these issues will crop up more often than you would account for while starting the ETL project.
- The above approach fails if you need data to be streamed in real-time from Aurora to Snowflake. You would need to add additional steps, set up cron jobs to achieve this.
So, to overcome these limitations and to load your data seamlessly from Amazon Aurora to Snowflake you can use a third-party tool like Hevo.
On the other hand, a Data Pipeline Platform such as Hevo, an official Snowflake ETL partner, can help you bring data from Aurora to Snowflake in no time. Zero Code, Zero Setup Time, Zero Data Loss. Here are the simple steps to load data from Aurora to Snowflake using Hevo:
SIGN UP HERE FOR A 14-DAY FREE TRIAL!
- Authenticate and Connect to your Aurora DB.
- Select the replication mode: (a) Full Dump and Load (b) Incremental load for append-only data (c) Change Data Capture
- Configure the Snowflake Data Warehouse for data load.
For a next-generation digital organization, there should be a seamless data movement between Transactional and Analytical systems. Using an intuitive and reliable platform like Hevo to migrate your data from Aurora to Snowflake ensures that accurate and consistent data is available in Snowflake in real-time.
In this article, you gained a basic understanding of AWS Aurora and Snowflake. Moreover, you understood the steps to migrate your data from Aurora to Snowflake using Custom ETL scripts. In addition, you explored the limitations of this method. Hence, you were introduced to an easier alternative, Hevo to move your data from Amazon Aurora to Snowflake seamlessly.
VISIT OUR WEBSITE TO EXPLORE HEVO
Hevo Data is a No-Code Data Pipeline that offers a faster way to move data from 150+ Data Sources including 50+ Free Sources, into your Data Warehouse like Amazon Redshift to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code. You can easily load your data from Aurora to Snowflake in a hassle-free manner.
Want to take Hevo for a spin? Check out our transparent pricing to make an informed decision.
SIGN UP and experience a hassle-free data replication from Aurora to Snowflake.
Share your experience of migrating data from Aurora to Snowflake in the comments section below!