MySQL to Redshift ETL – Steps to Migrate Data

on Tutorial • October 5th, 2018 • Write for Hevo

Is your MySQL server getting too slow for analytical queries now? Or are you looking to join data from another database while running queries? Whichever your use case, it is a great decision to move the data from MySQL to Redshift for analytics. This post covers the detailed steps you need to follow to migrate data from MySQL to Redshift. Here is a brief outline of this blog:

Overview of MySQL

MySQL is one of the world’s most known open-source relational database management system. With great security, reliability, and ease of use, MySQL has emerged as the leading choice for OLTP systems.

image via Logo.vine

Overview of Redshift

Amazon Redshift (based on PostgreSQL 8.0.2) is a columnar database that supports scalable architecture and multi-node processing. Due to its ability to handle and process a huge influx of data and easy setup options, it is emerging as the engineer’s favorite choice for an OLAP system.

Image via AWSforbusiness

Methods for MySQL to Redshift replication

There are two popular methods to move data from MySQL to Amazon Redshift.

Method 1: A ready to use Hevo Data Integration Platform (Sign-up here for the 14-Day Free Trial).

Method 2: Write Custom ETL Scripts to move data from MySQL to Amazon Redshift.

We will deep dive on Method 2 in this post and detail out its limitations and hardships. Additionally, towards the end of this article, we will compare method 1 (Using Hevo Data Integration Platform) and Method 2 (Custom ETL Scripts).

In Method 1, although the task seems easy enough, it involves quite a few steps and some of them are error-prone.

Let’s go through the steps to move data from MySQL to Redshift:

Step 1. Dump the Data to Files

The most efficient way of loading data in Redshift is through the COPY command that loads CSV/JSON files into Redshift. So, the first step is to bring the data in your MySQL database into CSV/JSON files.

There are essentially two ways of achieving this:

a. We can use mysqldump command.

mysqldump -h mysql_host -u user database_name table_name --result-file table_name_data.sql

The above command will dump data from a table table_name to the file table_name_data.sql. But, the file will not in CSV/JSON format required for loading into Redshift. This is how a typical row may look like in the output file:

INSERT INTO `users` (`id`, `first_name`, `last_name`, `gender`) VALUES (3562, ‘Kelly’, ‘Johnson’, 'F'),(3563,’Tommy’,’King’, 'M');

The above rows will need to be converted to the following format:

"3562","Kelly","Johnson", "F"

b. Query the data into a file.

mysql -B -u user database_name -h mysql_host  
-e "SELECT * FROM table_name;" |  
sed "s/'/'/;s/t/","/g;s/^/"/;s/$/"/;s/n//g"  
> table_name_data.csv

You will have to do this for all tables:

for tb in $(mysql -u user -ppassword database_name -sN -e "SHOW TABLES;"); do
     echo .....;

Step 2. Clean and Transform

There might be several transformations required before you load this data into Redshift. e.g. ‘0000-00-00’ is a valid DATE value in MySQL but in Redshift, it is not. Redshift accepts ‘0001-01-01’ though. Apart from this, you may want to clean up some data according to your business logic, you may want to make time zone adjustments, concatenate two fields, or split a field into two. All these operations will have to be done over files and will be error-prone.

Step 3. Upload to S3 and Import into Redshift

Once you have the files to be imported ready, you will upload them to an S3 bucket. Then run copy command:

COPY table_name FROM 's3://my_redshift_bucket/some-path/table_name/' credentials  

Again, the above operation has to be done for every table.

Once the COPY has been run you can check the stl_load_errors table for any copy failures.

In a happy scenario, the above steps should just work fine. However, in real-life scenarios, you may encounter errors in each of these steps. e.g. :

  • Network failures or timeouts during dumping MySQL data into files.
  • Errors encountered during transforming data due to an unexpected entry or a new column that has been added
  • Network failures during S3 Upload.
  • Timeout or data compatibility issues during Redshift COPY. COPY might fail due to various reasons, a lot of them will have to be manually looked into and retried.

Incremental Load to Redshift

If your MySQL tables are big (read > 50 GB), you may not want to upload complete tables every time you run the jobs. You will want to upload only the data that has been newly added or changed.

The following steps in your process will change for incremental load:

1. Dump the data to files

mysqldump cannot be used anymore. You will have to query the data into a file:

mysql -B -u user database_name -h mysql_host  
-e "SELECT * FROM table_name WHERE updated_ts < now() AND ((updated_ts = '#max_updated_ts_in_last_run#' AND id > #max_id_in_last_run#) OR updated_ts >'#max_updated_ts_in_last_run#') ORDER BY updated_ts, id ASC"|  
sed "s/'/'/;s/t/","/g;s/^/"/;s/$/"/;s/n//g"  > table_name_data.csv

2. Import Mysql Data To  Redshift

You can not simply COPY the files into Redshift. Doing that might result in duplicate Primary Keys as Redshift doesn’t enforce Primary Key constraints. The process to import these Mysql data files to Redshift will be as follows:

a. Create a temporary table to copy data into

CREATE TEMP TABLE temp_table_name (like table_name)

b. COPY the files into the temporary table

COPY temp_table_name FROM 's3://my_redshift_bucket/some-path/table_name/' credentials  'aws_access_key_id=my_access_key;aws_secret_access_key=my_secret_key';

c. DELETE rows that have been updated from the original table

BEGIN transaction;
DELETE FROM table_name USING temp_table_name WHERE table_name.primarykey = temp_table_name.primarykey;

d. Insert all of the rows from the temporary table.

INSERT INTO table_name SELECT * FROM temp_table_name;
END transaction;

e. Drop the temporary table.

DROP TABLE temp_table_name;

Challenges with Custom ETL Scripts while moving Data from MySQL to Redshift

  1. In cases where data needs to be moved once or in batches only, the custom script method works. This approach fails if you have to move data from MySQL to Redshift in real-time.
  2. Incremental load (change data capture) becomes tedious as there will be additional steps that you need to follow to achieve this.
  3. Often, when you write code to extract a subset of data often those scripts break as the source schema keeps changing or evolving. This can result in data loss.

The process mentioned above is brittle, error-prone and often frustrating. These challenges impact the consistency and accuracy of the data available in your Redshift in near real-time.

But fortunately, there is a better way!

Hevo Data, an easier way to address these problems.

A ready to use Hevo Data Integration Platform helps you migrate this data without having to write any code. This is how the process will look like when done through Hevo:

  • Connect to your MySQL database.
  • Select the replication mode: (a) load selected MySQL tables (b) load data via Custom Query (c) load data through Binlog.
  • For each table in MySQL choose a table name in Redshift where it should be copied.

Sign-up here for the 14 day free trial!

That’s it! You are all set. Hevo will take care of fetching your data in an incremental fashion and will upload that seamlessly from MySQL to Redshift in real-time.

In addition to the benefits mentioned above, you can use the Hevo Data Integration Platform to move data from 100s of different sources – databases, cloud applications, SDKs, and more. This will future proof your data integration set up and provide the flexibility to immediately replicate data from any source into Redshift.

Sign-up for a 14 Day Free Trial to see how easy it is to move Data from MySQL (and 100s of other sources) to Redshift using Hevo Data Integration Platform.

No-code Data Pipeline for Redshift