Snowflake is a relatively younger modern Data Warehousing solution on the cloud. From the first day, Snowflake is well accepted by enterprises and medium/small businesses as it resolves many challenges with traditional Data Warehouse solutions. If you are thinking of a new cloud-native Data Warehousing solution then you should consider evaluating Snowflake. Hence, before diving into MySQL to Snowflake data migration steps, let’s do a quick walkthrough of Snowflake benefits:
- Breathtaking Performance – Compute and storage is separated. You can scale up and down on the fly without any disruption.
- Support of Unlimited Concurrency – You can scale up computation dynamically. No need to worry about occasional high usages.
- Cloud Vendor Agnostic – Snowflake architecture supports multiple cloud vendors and they keep on adding new vendors to list. So as a user, you will have more choices and even if you are working with different cloud vendors data can be analyzed with the same tools. No need to learn vendor-specific solutions.
- Simplicity – Snowflake will take care of almost everything from autoscaling of computing to the proper encoding of columns. Generally, no indexes to be defined – Snowflake will cluster data automatically. But in, particularly as DML, particularly a lot of DML occurs on very large tables, the user can define clustering keys which are used by Snowflake to co-locate the table data.
Methods for MySQL to Snowflake replication
There are two popular methods to perform MySQL to Snowflake ETL.
Method 1: A ready to use Hevo, Cloud ETL Platform (7 Days Free Trial).
Method 2: Write a Custom Code to move data from MySQL to Snowflake.
Now, we can have a look at how to move data from your MySQL database to Snowflake using Custom Code. To perform MySQL to Snowflake replication, we have to put data files into the Snowflake staging area (internal or external stage). Next, copy data from the stage area to the table.The high-level steps to be followed for MySQL to Snowflake migration as shown in the figure above are,
Extract data from MySQL
Data Types and Formatting
Stage Data Files
Copy staged files to Snowflake table.
1. Extract Data from MySQL
Mainly two methods are followed to extract from MySQL. One is using the command line tool – mysqldump and the other is running SQL query using MySQL client and saving the output to files.
Extracting data with mysqldump:
Mysqldump is a client utility available by default with standard Mysql installation. Its main usage is to create a logical backup of a database/table. It can be used to extract one table as shown below:
mysqldump -u <username> -h <host_name> -p database_name my_table > my_table_out.sql
Here, the output file table_name.sql will be in the form of insert statements like
INSERT INTO table_name (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);
To convert this format into a CSV file you have to write a small script or use some open source library available. You can refer MySQL official documentation for more information.
If the mysqldump is running on the same machine or different machine where the mysqld server runs, you have another simpler option to get CSV directly. Use below command to get CSV file:
mysqldump -u [username] -p -t -T/path/to/directory [database_name] --fields-terminated-by=,
Extract Data Using SQL Query
SQL commands can be executed using MySQL client utility and redirect output to a file.
mysql -B -u user database -h mysql_host -e "select * from my_table;" > my_table_data_raw.txt
The output can be transformed using text editing utilities like sed or awk to clean and format data.
mysql -B -u user database -h mysql_host -e "select * from my_table;" | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" \ > my_table_final_data.csv
2. Data Types and Formatting
Other than business-specific transformations, following things to be noted while replicating data from MySQL to Snowflake.
- Snowflake support a number of character sets including UTF-8, UTF-16 etc. To see the full list – click here.
- Snowflake supports UNIQUE, PRIMARY KEY, FOREIGN KEY, NOT NULL constraints, unlike many other cloud analytical solutions.
- Snowflake has a rich set of data types. Here is the list of Snowflake data types and corresponding MySQL.
|MySQL Data Type||Snowflake Data Type|
|FLOAT||FLOAT, FLOAT4, FLOAT8|
|DOUBLE||DOUBLE, DOUBLE PRECISION, REAL|
|ENUM||No type for ENUM. Must use any type which can represent values in ENUM.|
|SET||No type for SE. Must use any type which can represent values in SET.|
- Snowflake allows most of the date/time format and it can be explicitly specified while loading data to table using File Format Option ( we will discuss this in detail later). For the complete list of supported format please click here.
3. Stage Data Files
To insert MySQL data into a Snowflake table first data files needs to be uploaded to a temporary location which is called staging. Snowflake support internal and external stages.
Each user and table is automatically allocated an internal stage for staging data files. You can also create named internal stages.
- User stage is referenced using ‘@~’.
- The name of a table stage will be the same as the table name.
- User/Table stages can’t be altered or dropped.
- User/Table stages do not support setting file format options.
Internal Named Stages are explicitly created by the user using respective SQL statements. It provides a greater degree of flexibility while loading data. You can assign file format and other options to named stages which makes data load easier.
While working with Snowflake you will need to run a lot of DML and DDL statements in SQL and some specific commands like for data load as shown below. SnowSQL is a very handy CLI client which can be used to run those commands and is available in Linux/Mac/Windows.
create or replace stage my_mysql_stage copy_options = (on_error='skip_file') file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);
PUT command is used to stage data files to an internal stage. The syntax of the command is as given below :
PUT file://path/to/file/filename internal_stage_name
Upload a file named mysql_data.csv in the /tmp/mysql/data directory to an internal stage named mysql_stage.
PUT file:////tmp/mysql/data/mysql_data.csv @mysql_stage;
There are many useful options like set parallelism while uploading the file, automatic compression of data files etc. For more information click here.
Currently, Snowflake supports Amazon S3 and Microsoft Azure as an external staging location. You can create an external stage with those locations and load data to a Snowflake Table. To create an external stage on S3, you have to provide IAM credentials and encryption keys if data is encrypted as shown in the example below:
create or replace stage mysql_ext_stage url='s3://snoflake/load/files/' credentials= (aws_key_id='111a222b3c' aws_secret_key='abcd4x5y6z'); encryption= (master_key = 'eSxX0jzYfIamtnBKOEOwq80Au6NDwOaO8=');
Data to the external stage can be uploaded using respective cloud vendor interfaces. For S3 you can upload using web console or any SDK or third-party tools.
4. Copy Staged Files to Snowflake Table.
COPY INTO command is to load the contents of the staged file(s) into a Snowflake table. This command needs to compute resources in the form of virtual warehouses to run.
To load from a named internal stage:
COPY INTO mysql_table FROM @mysql_stage;
Loading from the external stage. Only one file is specified.
COPY INTO mycsvtable FROM @mysql_ext_stage/tutorials/dataloading/contacts1.csv;
You can even copy directly from an external location:
COPY INTO mysql_table FROM s3://mybucket/data/files credentials= (aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY') encryption= (master_key = 'eSxX0jzYfIamtnBKOEOwq80Au6NbSgPH5r4BDDwOaO8=') file_format = (format_name = my_csv_format);
Files can be specified using patterns:
COPY INTO mytable FROM @mysql_stage file_format = (type = 'CSV') pattern='.*/.*/.*[.]csv[.]gz';
Some common format options for CSV format supported in the COPY command are the following:
- COMPRESSION – Compression algorithm for the data files to be loaded.
- RECORD_DELIMITER – Character that separates records(lines) in an input CSV file
- FIELD_DELIMITER – Fields separating character in the input file.
- SKIP_HEADER – Number of header lines to be skipped.
- DATE_FORMAT – String to specify the date format.
- TIME_FORMAT – String to specify the time format.
For the full list of options available please visit here.
Incremental Data Load
After initial full data is loaded to the target table, most of the time changed data is extracted from the source and migrated to the target table at a regular interval. Sometimes for small tables, full data dump can be used even for recurring data migration but for the larger table we have to go with delta approach.
Increment Extract from MySQL
To get only modified records after a particular time, run SQL with proper predicates against the table and write output to file. mysqldump not useful here as it always extracts full data.
Example: Extracting records based on last_updated_timestamp column and formatting data using sed command.
mysql -B -u user database -h mysql_host -e "select * from my_table where last_updated_timestamp < now() and last_updated_timestamp >'#max_updated_ts_in_last_run#'"| sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" \ > my_table_data.csv
Any records deleted physically will be missing here and will not be reflected in the target.
Update Snowflake Table
Snowflake supports row-level updates which makes delta data migration much easier. Basic idea is to load incrementally extracted data into an intermediate table and modify records in final table as per data in the intermediate table.
We can choose three methods to modify the final table once data is loaded into the intermediate table.
- Update the existing rows in the final table and insert new rows from the intermediate table which are not in the final table.
UPDATE final_table t SET t.value = s.value FROM intermed_table in WHERE t.id = in.id;
INSERT INTO final_table (id, value) SELECT id, value FROM intermed_table WHERE NOT id IN (SELECT id FROM final_table);
- Delete all rows from the final table which are present in the intermediate table. Then insert all rows from the intermediate table to the final table.
DELETE .final_table f WHERE f.id IN (SELECT id from intermed_table);
INSERT final_table (id, value) SELECT id, value FROM intermed_table;
- MERGE statement – Insert and update can be done with a single MERGE statement which can be used to apply changes in the intermediate table to the final table.
MERGE into final_table t1 using intermed_table t2 on t1.id = t2.id WHEN matched then update set value = t2.vaue WHEN not matched then INSERT (id, value) values (t2.id, t2.value);
Easier Way to move data from MySQL to Snowflake
Using Hevo, official Snowflake ETL partner you can easily load data from MySQL to Snowflake with just 3 simple steps. This does not need you to write any code and will provide you with an error-free, fully managed set up to move data in minutes.
- Connect to your MySQL database.
- Select the replication mode: (a) load selected MySQL tables (b) load data via Custom Query (c) load data through Binlog.
- Configure the Snowflake destination where it should be copied.
Nowadays, high performing Data Warehouse solution like Snowflake is an integral part of an analytics pipeline. Moving data to this kind of cloud-native solution requires familiarity with the cloud best practices, especially with data security. There are third-party tools which take care of all aspects of data movement including security, connecting with other data sources, etc., which might be a better choice like Hevo, a cloud data integration platform.