Press "Enter" to skip to content

Amazon Aurora to Snowflake ETL: Steps to Move Data Easily

Often, businesses keep one database for transactions (e.g., Amazon Aurora) and a separate warehouse (e.g., Snowflake) for their analytical needs. There are two prime reasons to move data from your transactional DB to a warehouse (e.g., Amazon Aurora to Snowflake):

  1. The transactional database is optimized for fast writes and responses. Running analytics queries with many aggregations and joins on large data sets will slow it down, which might eventually take a toll on the customer experience.
  2. Warehouses are built to handle growing data sets and analytical queries. Moreover, they can host data from multiple sources and aid in deeper analysis.

This post will highlight the steps to move data from Aurora DB to Snowflake.

Understanding Aurora and Snowflake

RDS is the original relational database service from AWS, and it supports most of the popular open-source and proprietary databases. The open-source offerings of RDS, such as MySQL and PostgreSQL, are much more cost-effective than enterprise database solutions like Oracle. However, open-source solutions often require a lot of performance tuning to be on par with enterprise RDBMSs in areas such as performance and concurrent connections.

To overcome these well-known weaknesses, AWS introduced Aurora, a relational database service compatible with MySQL and PostgreSQL that costs much less than enterprise databases. No wonder many organizations are adopting Aurora as their primary transactional database system.

On the other end, Snowflake is arguably the most cost-effective and fast data warehousing solution. Compute resources scale dynamically, and storage is separated from compute and billed independently. Snowflake runs on several cloud vendors, including AWS, so data movement from Aurora to Snowflake can be done at a low cost. Read about Snowflake’s features here.

Methods to load data from Amazon Aurora to Snowflake

Here are two ways that can be used to approach Aurora to Snowflake ETL:

Method 1: Implement a hassle-free, no-code Data Integration Platform like Hevo Data – 14 Day Free Trial (Official Snowflake ETL Partner)

Method 2: Build custom scripts to move data from Aurora to Snowflake

This post discusses Method 2 in detail. It also highlights the limitations of this approach and the workarounds to solve them.

Steps to Move Data from Aurora to Snowflake Using Custom Scripts

The steps to migrate data from Aurora to Snowflake are as follows:

  1. Extract data from Aurora cluster to S3
  2. Convert data types and format them
  3. Stage data files to Snowflake staging area
  4. Import staged files to Snowflake table

1. Extract Data from Aurora Cluster to S3

The SELECT INTO OUTFILE S3 statement can be used to query data from an Aurora MySQL cluster and save the result directly to S3, so the export is fast and efficient and the data does not need to pass through a client. To save data to S3 from an Aurora cluster, proper permissions need to be set. For that:

  1. Create a proper IAM policy to access S3 objects – refer to the AWS documentation here
  2. Create a new IAM role and attach the IAM policy you created in the above step
  3. Set the aurora_select_into_s3_role or aws_default_s3_role cluster parameter to the ARN of the new IAM role
  4. Associate the IAM role you created with the Aurora cluster
  5. Configure the Aurora cluster to allow outbound connections to S3 – read more on this here

Other important points to be noted while exporting data to S3:

  • User Privilege – The user that issues the SELECT INTO OUTFILE S3 statement should have the privilege to do so.
    To grant access –

    GRANT SELECT INTO S3 ON *.* TO 'user'@'domain';

    Note that this privilege is specific to Aurora. RDS doesn’t have such a privilege option.

  • Manifest File – You can set the MANIFEST ON option to create a JSON-format manifest file that lists the output files uploaded to the S3 path. Note that files are listed in the same order in which they are created.
    Eg:

    {
    "entries": [
    {
    "url":"s3-us-east-1://s3_bucket/file_prefix.part_00000"
    },
    {
    "url":"s3-us-east-1://s3_bucket/file_prefix.part_00001"
    },
    {
    "url":"s3-us-east-1://s3_bucket/file_prefix.part_00002"
    }
    ]
    }
  • Output Files – The output is stored as delimited text files. As of now, compressed or encrypted output files are not supported.
  • Overwrite Existing File – Set the OVERWRITE ON option to delete and replace a file with the exact same name if it already exists in S3.
  • The default maximum file size is 6 GB. If the data selected by the statement is smaller than that, a single file is created; otherwise, multiple files are created. No rows are split across file boundaries.
  • If the data volume to be exported is larger than 25 GB, it is recommended to run multiple statements, each exporting a different portion of the data.
  • No metadata, such as the table schema, is uploaded to S3.
  • As of now, there is no direct way to monitor the progress of the data export. One simple workaround is to turn the manifest option on; the manifest file will be the last file created. Examples:
    • The statement below writes to an S3 bucket located in a different region. Each field is terminated by a comma and each row is terminated by ‘\n’.
      SELECT * FROM students INTO OUTFILE S3 's3-us-west-2://aurora-out/sample_students_data'
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n';
    • The example below writes to an S3 bucket located in the same region. A manifest file will also be created.
      SELECT * FROM students INTO OUTFILE S3 's3://aurora-out/sample_students_data'
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      MANIFEST ON;

2. Data Type Conversion and Formatting

Data transformations corresponding to business logic or organizational standards might need to be applied while transferring data from Aurora to Snowflake. Apart from such high-level mappings, some basic things to keep in mind are listed below:

  • All popular character sets, including UTF-8 and UTF-16, are supported by Snowflake. The full list can be found here.
  • Many cloud-based and open-source big data systems compromise on standard relational database constraints such as Primary Key. Note that Snowflake lets you define SQL constraints like UNIQUE, PRIMARY KEY, FOREIGN KEY, and NOT NULL, although only NOT NULL is enforced. This might be helpful when you load data.
  • Data type support in Snowflake is fairly rich, including nested data structures like arrays. Below is the list of Aurora (MySQL) data types and the corresponding Snowflake types:
Aurora (MySQL) Data Type    Snowflake Data Type
TINYINT                     TINYINT
SMALLINT                    SMALLINT
MEDIUMINT                   INTEGER
INT                         INTEGER
BIGINT                      BIGINT
DECIMAL                     DECIMAL
FLOAT                       FLOAT, FLOAT4, FLOAT8
DOUBLE                      DOUBLE, DOUBLE PRECISION, REAL
BIT                         BOOLEAN
CHAR                        CHAR
VARCHAR                     VARCHAR
BINARY                      BINARY
VARBINARY                   VARBINARY
TINYTEXT                    STRING, TEXT
TEXT                        STRING, TEXT
MEDIUMTEXT                  STRING, TEXT
LONGTEXT                    STRING, TEXT
ENUM                        No ENUM type. Use any type that can represent the ENUM values
SET                         No SET type. Use any type that can represent the SET values
DATE                        DATE
TIME                        TIME
DATETIME                    DATETIME
TIMESTAMP                   TIMESTAMP
  • Snowflake is quite flexible with date and time formats. If a custom format is used in your files, it can be explicitly specified using the File Format option while loading data into the table. The complete list of date and time formats can be found here. A sample table definition applying the mappings above is shown below.
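
As an illustration, here is a minimal sketch of a Snowflake table for the students data exported earlier, applying the type mappings and constraint support discussed above. The column names are assumptions made for this example:

create or replace table students (
  id INTEGER NOT NULL PRIMARY KEY,   -- from Aurora INT; only NOT NULL is enforced
  name VARCHAR(255),                 -- from Aurora VARCHAR
  is_active BOOLEAN,                 -- from Aurora BIT
  enrolled_on DATE,                  -- from Aurora DATE
  created_at TIMESTAMP               -- from Aurora TIMESTAMP
);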

3. Stage Data Files

Snowflake requires the data to be uploaded to a temporary location before it is loaded into a table. This temporary location is storage that Snowflake has access to (on AWS, it is backed by S3). This process is called staging. A Snowflake stage can be either internal or external.

(A) Internal Stage

In Snowflake, each user and table is automatically assigned an internal stage for data files. It is also possible to create internal stages explicitly and give them names.

  • The stage assigned to a user is referenced as ‘@~’
  • The stage assigned to a table is referenced as ‘@%’ followed by the table name
  • The default stages assigned to a user or table can’t be altered or dropped
  • The default stages assigned to a user or table do not support setting file format options (see the examples just below)
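
For illustration, here is a hedged sketch of how these default stages can be referenced, assuming a table named students and the sample data file used later in this section:

-- Stage a file in the user stage and list its contents
put file:///tmp/aurora_data/data/students_data.csv @~/aurora_data/;
list @~/aurora_data/;

-- List files in the table stage of a table named students
list @%students;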

As mentioned above, internal stages can also be created explicitly by the user using SQL statements. While creating stages explicitly like this, many data loading options, such as file format and date format, can be assigned to them.

While interacting with Snowflake for data loading or creating tables, SnowSQL is a very handy CLI client available for Linux, Mac, and Windows that can be used to run Snowflake commands. Read more about the tool and its options here.

Below is an example command that creates a named internal stage my_aurora_stage and assigns some default options:

create or replace stage my_aurora_stage
copy_options = (on_error='skip_file')
file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);

PUT is the command used to stage files to an internal Snowflake stage. The syntax of the PUT command is:

PUT file://path_to_file/filename internal_stage_name

Eg:

Upload a file named students_data.csv from the /tmp/aurora_data/data/ directory to an internal stage named aurora_stage:

put file:///tmp/aurora_data/data/students_data.csv @aurora_stage;

Snowflake provides many options that can be used to improve the performance of the data load, such as the degree of parallelism used while uploading the file, automatic compression, etc. More information and the complete list of options are listed here.
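
For instance, here is a hedged sketch of a PUT command using the PARALLEL and AUTO_COMPRESS options; the values chosen are only illustrative:

put file:///tmp/aurora_data/data/students_data.csv @aurora_stage
parallel = 8
auto_compress = true;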

(B) External Stage

In addition to internal stages, Snowflake supports Amazon S3 and Microsoft Azure as external staging locations. If the data is already uploaded to an external stage that can be accessed from Snowflake, it can be loaded directly into the Snowflake table; there is no need to move the data to an internal stage.

To create an external stage on S3, IAM credentials with proper access permissions need to be provided. If the data is encrypted, the encryption keys should also be provided.

create or replace stage aurora_ext_stage url='s3://snowflake_aurora/data/load/files/'
credentials=(aws_key_id='13311a23344rrb3c' aws_secret_key='abddfgrrcd4kx5y6z')
encryption=(master_key = 'eSxX0jzsdsdYfIjkahsdkjamNNNaaaDwOaO8=');

Data can be uploaded to the external stage using the respective cloud services. Since the data from Amazon Aurora is already exported to S3, that same location can be used as the external staging location, which helps to minimize data movement, as shown below.
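
For example, a minimal sketch of an external stage that points at the S3 path used for the Aurora export in step 1; the bucket name matches the earlier example, while the credentials are placeholders:

create or replace stage aurora_export_stage url='s3://aurora-out/'
credentials=(aws_key_id='<aws_key_id>' aws_secret_key='<aws_secret_key>')
file_format = (type = 'CSV' field_delimiter = ',');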

4. Copy Files that are Staged into Snowflake Table

Now the data is present in an external or internal stage and has to be loaded into a Snowflake table. The command used to do this is COPY INTO. Executing the COPY INTO command requires compute resources in the form of a Snowflake virtual warehouse, which is billed as per consumption.

Eg:

To load from a named internal stage:

copy into aurora_table
from @aurora_stage;

To load data from the external stage (here only a single file is specified):

copy into my_external_stage_table
from @aurora_ext_stage/tutorials/dataloading/students_ext.csv;

You can even copy directly from an external location:

copy into aurora_table
from 's3://mybucket/aurora_snow/data/files'
credentials=(aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY')
encryption=(master_key = 'eSxX009jhh76jkIuLPH5r4BD09wOaO8=')
file_format = (format_name = csv_format);

Files can be specified using patterns:

copy into aurora_pattern_table
from @aurora_stage
file_format = (type = 'CSV')
pattern='.*/.*/.*[.]csv[.]gz';

Some commonly used options for CSV file loading with the COPY command are listed below; an example combining several of them follows the list.

  • COMPRESSION to specify the compression algorithm used for the files
  • RECORD_DELIMITER to indicate the character separating lines (records) in the file
  • FIELD_DELIMITER to indicate the character separating fields in the file
  • SKIP_HEADER to indicate the number of header lines to be skipped
  • DATE_FORMAT to specify the date format
  • TIME_FORMAT to specify the time format

There are many other options. For the full list, click here.
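
As a hedged illustration, here is a COPY command that combines several of these CSV options. The stage, table, and option values are assumptions carried over from the earlier examples:

copy into aurora_table
from @aurora_stage
file_format = (type = 'CSV' compression = 'GZIP' field_delimiter = ',' record_delimiter = '\n'
               skip_header = 1 date_format = 'YYYY-MM-DD' time_format = 'HH24:MI:SS');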

Update Snowflake Table

So far, the blog has talked about how to extract data from Aurora and simply insert it into a Snowflake table. Next, let’s look deeper into how to handle incremental data uploads to a Snowflake table.

Snowflake’s architecture is unique and not based on any existing big data framework. Snowflake has no limitations on row-level updates, which makes loading delta data into a Snowflake table much easier compared to systems like Hive. The approach is to load the incrementally extracted data into an intermediate table and then, based on the data in the intermediate table, modify the records in the final table.
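
As an illustration, here is a minimal sketch of creating the landing table and loading a delta extract into it; the table and stage names match the examples in this post and are otherwise assumptions:

create or replace table landing_delta_table like aurora_target_table;

copy into landing_delta_table
from @aurora_stage
file_format = (type = 'CSV' field_delimiter = ',');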

Three common methods used to modify the final table once data is loaded into the landing (intermediate) table are described below.

    1. Update the rows in the target table, then insert the new rows from the landing (intermediate) table which are not yet in the final table.
      UPDATE aurora_target_table
      SET value = landing_delta_table.value
      FROM landing_delta_table
      WHERE aurora_target_table.id = landing_delta_table.id;

      INSERT INTO aurora_target_table (id, value)
      SELECT id, value FROM landing_delta_table
      WHERE id NOT IN (SELECT id FROM aurora_target_table);
    2. Delete all records from the target table which are also present in the landing table, then insert all rows from the landing table into the final table.
      DELETE FROM aurora_target_table
      WHERE id IN (SELECT id FROM landing_delta_table);

      INSERT INTO aurora_target_table (id, value)
      SELECT id, value FROM landing_delta_table;
    3. MERGE statement – Inserts and updates are combined in a single MERGE statement, which applies the changes from the landing table to the target table in one SQL statement.

      MERGE INTO aurora_target_table t1 USING landing_delta_table t2 ON t1.id = t2.id
      WHEN MATCHED THEN UPDATE SET value = t2.value
      WHEN NOT MATCHED THEN INSERT (id, value) VALUES (t2.id, t2.value);

Limitations of Writing Custom ETL Code to Move Data from Aurora to Snowflake

While this approach may look very straightforward, it does come with limitations.

  1. You would have to invest precious engineering resources to hand-code the pipeline. This increases the time it takes for the data to be available in Snowflake.
  2. You will have to invest in engineering resources to constantly monitor and maintain the infrastructure. Code breaks, schema changes at the source, destination unavailability – these issues will crop up more often than you would account for at the start of the ETL project.
  3. The above approach fails if you need data to be streamed in near real-time from Aurora to Snowflake. You would need to add additional steps and set up cron jobs to achieve this.

An Easy Way to Move Data from Aurora to Snowflake

On the other hand, a Data Pipeline Platform such as Hevo, an official Snowflake ETL partner, can help you bring data from Aurora to Snowflake in no time. Zero Code, Zero Setup Time, Zero Data Loss. Here are the simple steps to load Aurora DB to Snowflake using Hevo:

  • Authenticate and Connect to your Aurora DB
  • Select the replication mode: (a) Full Dump and Load (b) Incremental load for append-only data (c) Change Data Capture
  • Configure the Snowflake data warehouse for data load

For a next-generation digital organization, there should be seamless data movement between transactional and analytical systems. Using an intuitive and reliable platform like Hevo ensures that accurate and consistent data is available in Snowflake in real-time.

Sign up for a 14-day free trial and experience hassle-free data replication from Aurora to Snowflake.
