TokuDB to Redshift ETL – Steps to Replicate Data

June 27th, 2019

TokuDB delivers high write performance and much higher storage capacity than the default MySQL engine, without slowing down as data grows. Organizations with a write-heavy database load are therefore motivated to use TokuDB instead of the typical InnoDB engine for MySQL. Since MySQL itself is not ideal for offline analysis and aggregation requirements, these databases need to be periodically copied to a data warehouse like AWS Redshift. A few reasons why it is worth moving data from TokuDB to Redshift are as follows:

  1. A transactional database is optimized for fast writes and short response times. It is not ideal for business analysts to run their complex queries against the live transactional database, as this slows down the online system.
  2. A data warehouse can contain data from multiple sources, giving analysts a one-stop place where everything they need is stored and catalogued. This helps them perform deep, meaningful analysis.

Understanding TokuDB and Amazon Redshift

TokuDB provides an alternative high-performance storage engine for MySQL in place of the ubiquitous InnoDB storage engine. Its performance comes from its use of fractal tree indexing. The engine is designed for write-heavy workloads and performs at its best on SSDs. The important thing to note is that once it is installed as a storage engine for MySQL, the instance behaves exactly like any other MySQL instance running InnoDB, and there is no difference in querying capability. At the same time, it raises the practical storage limit of MySQL well beyond the tens of terabytes at which a normal MySQL instance starts to slow down. That said, TokuDB has a limitation that may be a deal breaker for some use cases: it does not enforce foreign key constraints, and it is the responsibility of the user to ensure that foreign key constraints are met.
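
For illustration, assuming the TokuDB engine is already installed and enabled on the MySQL server, placing a table on it is just a matter of naming the engine in the CREATE TABLE statement (the table and columns below are only hypothetical examples):

-- Confirm TokuDB is listed among the available engines
SHOW ENGINES;

-- Create a write-heavy table on TokuDB instead of the default InnoDB
CREATE TABLE page_events (
    id BIGINT NOT NULL AUTO_INCREMENT,
    user_id BIGINT NOT NULL,
    event_time DATETIME NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id)
) ENGINE=TokuDB;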

AWS Redshift offers a cloud-based data warehouse with a clean interface and all the APIs required to query and analyze petabytes of data. Its biggest advantage is that it lets developers focus on the analysis jobs and leaves the complexities of running a reliable warehouse service to AWS.

Now that we have covered why the two technologies are important, let us move on to the problem of copying data between them.

Methods for TokuDB to Redshift replication

There are two popular methods to move data from TokuDB to Amazon Redshift.

Method 1: A ready-to-use Hevo Data Integration Platform (14 Days Free Trial).

Method 2: Build custom ETL code to move data from TokuDB to Redshift.

Steps to move data from TokuDB to Redshift

Copying data from TokuDB

Since TokuDB does not change the basic behaviour of a MySQL instance, the simplest way to copy data from a TokuDB instance is to use the mysqldump command. The challenge with this approach is that you need enough free disk space on the server where MySQL is installed to hold a complete copy of the data. Note that with the -T (--tab) option, the data file is written by the MySQL server itself, so the target directory must be writable by the server process. If the server has enough space, execute the command below to take a copy of the source table.

mysqldump -u username -p -T /path/to/target_directory source_db_name source_table_name --fields-terminated-by=','

The above command writes two files to target_directory: source_table_name.sql, containing the table definition, and source_table_name.txt, containing the data rows with columns separated by commas.

If the MySQL server does not have enough storage space to hold the files and the copying operation needs to use a different machine to hold the data, use the below command from the helper machine.

mysql -B -N -u username -ppassword -h dbhost sourcedb -e "select * from source_table" | sed 's/"/""/g;s/\t/","/g;s/^/"/;s/$/"/' > source_table.csv

This command outputs a CSV file named source_table.csv in the directory where it is executed.

Moving the source data files to S3

Once the CSV/TXT file is generated, we need to copy it to a place where Redshift can access it. Create an S3 bucket (with a folder/prefix for the table) in your AWS account, install the AWS CLI, and then execute the command below.

aws s3 cp source_table.csv  s3://my_bucket/source_table/

Loading data to Redshift – Complete overwrite

If a complete overwrite of the Redshift data is acceptable for the use case, the data in the S3 folder can be loaded into the Redshift table with a single statement using Redshift's native COPY command. Note that COPY appends rows, so truncate the target table first if its existing contents are to be replaced. To do this, log in to the AWS Management Console, type the following command in the Query Editor of the Redshift console, and execute it.

copy target_table_name from 's3://my_bucket/source_table/' access_key_id '<access_key_id>' secret_access_key '<secret_access_key>' csv;

Here, access_key_id and secret_access_key are the IAM credentials of a user with read access to the S3 bucket.
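
If the Redshift cluster has an IAM role attached that grants read access to the bucket, the same load can be expressed with the IAM_ROLE parameter instead of raw keys (the role ARN below is only a placeholder):

copy target_table_name from 's3://my_bucket/source_table/' iam_role 'arn:aws:iam::<aws_account_id>:role/<redshift_s3_read_role>' csv;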

If a complete overwrite is not preferred for the use case, follow the steps below to execute an incremental load.

Incremental load using a temporary table

You may have noted that the above approach replaces the complete data in the target table. There will be scenarios where this is unacceptable and the existing data in the target table needs to be retained. In that case, we first create a temporary table, copy all the incoming data into it, and then insert it into the original table after removing the duplicates. The command below creates a temporary stage_table.

create temp table stage_table (like target_table_name);

It is important to use the 'like' keyword here so that the staging table has the same structure as the target table, including the distribution and sort keys.

Copy the complete data into the temporary table

Next, copy the data from the S3 folder into the temporary table using the AWS credentials.

copy stage_table from 's3://my_bucket/source_table/' access_key_id '<access_key_id>' secret_access_key '<secret_access_key>' csv;

Delete the rows which are already present in the target table

Our temporary table now contains the complete data from the TokuDB instance. If we loaded this data directly into the target table, the target table would end up with duplicate rows. To avoid this, we first delete all rows in the target table that are also present in the temporary table.

begin transaction; delete from target_table_name using stage_table where target_table_name.primarykey = stage_table.primarykey;

Load the data rows from the temporary table

This step loads all the data from the temporary table into the target table and completes the transaction. Note that the delete and insert statements are wrapped inside a single transaction so that the target table is not left in an inconsistent state if something goes wrong.

insert into target_table_name  select * from stage_table; end transaction;
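
Putting the steps together, the full incremental load run from the Query Editor looks roughly like the sketch below (the table, bucket, key, and primary key names are the same placeholders used above):

create temp table stage_table (like target_table_name);

copy stage_table from 's3://my_bucket/source_table/' access_key_id '<access_key_id>' secret_access_key '<secret_access_key>' csv;

begin transaction;
delete from target_table_name using stage_table where target_table_name.primarykey = stage_table.primarykey;
insert into target_table_name select * from stage_table;
end transaction;

-- the temporary table is dropped automatically at the end of the session,
-- but it can also be removed explicitly
drop table stage_table;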

Limitations and Challenges of Writing Code to Move Data from TokuDB to Redshift

  1. Both the mysqldump command and the SELECT-based approach slow down the source table during the operation when the dataset is large.
  2. There might be use cases where you want the data to be streamed to the data warehouse in real-time. Building such a system requires additional engineering effort.
  3. Since this is a completely manual process, data type conversions between TokuDB and Redshift can cause problems. The method above offers no way to enforce a specific data type conversion between TokuDB and Redshift.
  4. If you build this infrastructure to load only a subset of the data, the scripts often break as the source schema evolves. This can eventually result in data loss.

There is an easier way to move data from TokuDB to Redshift.

The best way to avoid the above limitations is to use a Data Integration Platform that works out of the box. A fully-managed Data Pipeline platform like Hevo can help you bring data from TokuDB to Redshift instantly. Hevo’s real-time streaming architecture reliably moves the latest data into your warehouse. No coding, no ETL scripts, no cron job management. With Hevo, you can move your data from TokuDB (MySQL) to Redshift in 3 simple steps:

  • Connect to TokuDB (MySQL) source by providing connection settings
  • Select the mode of replication you want: (a) Load the result set of a Custom Query (b) Full dump of MySQL Tables (c) Load Data via BinLog
  • Configure Redshift warehouse where the data needs to be moved

That’s it! Simple and straightforward. Hevo will do all the groundwork to ensure that the data is fetched in an incremental mode and copied to Redshift in real-time. 

In addition to TokuDB, you can use the Hevo Data Integration Platform to move data from 100s of different sources – Cloud Applications, Databases, SDKs, and more. You can check out the complete list here (www.hevodata.com/integrations). This ensures that all your data integration needs are taken care of as both your business and your data sources grow.

Sign-up for a 7 Day Free Trial to experience the simplicity of data replication from TokuDB (and 100s of other sources) to Redshift using Hevo Data Integration Platform.
