TokuDB to Redshift: 2 Easy Methods

on Tutorial • September 9th, 2021 • Write for Hevo

TokuDB displays high-performance and provides much higher storage capabilities, all without slowing down. Hence, organizations with a write-heavy load for their databases will be motivated to use TokuDB instead of the typical InnoDB engine for MySQL. Since MySQL as such is not ideal for offline analysis and aggregation requirements, these databases will need to be periodically copied to a data warehouse like AWS Redshift. Few reasons why it is imperative to move data from TokuDB to Redshift is as follows:

  1. Any transactional database is optimized for fast writes and short response times. It may not be ideal for the business analysts to use the live transactional database for their complex queries. This will lead to a slow down of the online database
  2. A data warehouse can contain data from multiple sources, thereby giving the analysts a one-stop place where everything they need is stored and catalogued. This helps them perform deep, meaningful analysis

Table of Contents

Introduction to TokuDB

TokuDB Logo
Image Source

TokuDB provides an alternate high-performance storage engine for MySQL in place of the ubiquitous InnoDB storage engine. This high-performance nature of the TokuDB engine comes from its use of fractal tree indexing. The engine is specifically designed for write-heavy workloads and performs at its best if used with SSDs.

The important thing to note here is that once installed as a storage engine for MySQL, it behaves exactly similar to any other MySQL instance with InnoDB and there is no difference to the querying capability. At the same time, it increases the storage limit of MySQL to even higher than tens of TBs, which is where a normal MySQL instance starts to slow down. That said TokuDB has a limitation, which may be of a deal breaker to some use cases. It does not enforce the foreign key constraint and it is the responsibility of the user to ensure foreign key constraints are met properly.

Introduction to Amazon Redshift

Amazon Redshift Logo
Image Source

Amazon Redshift offers a cloud-based data warehouse with a very clean interface and all the required APIs to query and analyze petabytes of data. The biggest advantage is that it allows the developer to focus only on the analysis jobs and leave all the complexities related to managing a reliable warehouse service.

Now that we are done with the reasons why the two technologies are important, let us move to the problem of copying data between the two.

Methods for TokuDB to Redshift replication

There are two popular methods to move data from TokuDB to Redshift:

Method 1: Using Custom ETL Scripts to Connect TokuDB to Redshift

This is a 7-step method that utilizes custom ETL Scripts to connect TokuDB to Redshift. It has considerable advantages and a few limitations as well.

Method 2: Using Hevo Data to Connect TokuDB to Redshift

Hevo provides a hassle-free solution and helps you set up GA4 Data Studio Integration without any intervention in an effortless manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. 

Get Started with Hevo for Free

Hevo’s pre-built integration with Google Analytics 4 (among 100+ Sources) will take full charge of the data transfer process, allowing you to focus on key business activities. 

Understanding the Methods to Connect TokuDB to Redshift

Here are the methods you can use to connect TokuDB to Redshift in a seamless fashion:

Method 1: Using Custom ETL Scripts to Connect TokuDB to Redshift

Here are the steps involved in using Custom ETL Scripts to connect TokuDB to Redshift:

Step 1: Copying Data from TokuDB

Since TokuDB does not change the basic behaviour of a MySQL instance, the simplest way to copy data from a TokuDB instance is to use mysqldump command. The challenge with this approach is that you need sufficiently large disk space to hold the complete data in your server where MySQL is installed. In case the server has enough space, execute the below command to take a copy of the source table.

mysqldump -u username -p -t -T/path/to/target_directory source_db_name source_table_name --fields-terminated-by=','

The above command will output two files to the target_directory –  source_table_name.sql and source_table_name.txt. The TXT file will contain data rows with columns separated by a comma.

If the MySQL server does not have enough storage space to hold the files and the copying operation needs to use a different machine to hold the data, use the below command from the helper machine.

Mysql -B -u username -p password sourcedb -h dbhost -e  "select * from source_table" -B | sed "s/'/'/;s/t/","/g;s/^/"/;s/$/"/;s/n//g" > source_table.csv

This command will output CSV files with the name source_table.csv directly in the location where the command is executed.

Step 2: Moving the Source Data Files to S3

Once the CSV/TXT file is generated, we need to copy this data to a place where redshift can access the data. Create an S3 folder in your AWS account and install AWS CLI. After that execute the below command.

aws s3 cp source_table.csv  s3://my_bucket/source_table/

Step 3: Loading Data to Redshift

If a complete overwrite of the Redshift data is permissible for the use case, the data in S3 folder can be copied to the Redshift table in one single command using the native copying command of Redshift. For doing this, login to AWS management console and type the following command in the Query Editor in Redshift console and execute.

copy target_table_name from ‘s3://my_bucket/source_table’ credentials access_key_id <access_key_id>  secret_access_key <secret_access_key>

Where access_key_id and secret_access_key represent the IAM credentials.

If a complete overwrite is not preferred for the use case, please go to the below step to execute an incremental load.

Step 4: Incremental Load Using a Temporary Table

You may have noted that in the above step complete data in the target table is overwritten by the command. There will be scenarios where this is unacceptable and the existing data in the target tables needs to be retained. If that is the case, we will first need to create a temporary table, copy all the data to that table and then insert it to the original table after removing the duplicates. The below command creates a temporary stage_table.

create temp table stage_table  (like target_table_name)

It is imperative that we use the ‘like’ keyword here so that the table structure of the staging table is the same as that of our target table even in the case of the distribution keys.

Step 5: Copying the Data into the Temporary Table

Next, you need to copy data from the temporary table in S3 folder using AWS credentials.

copy target_table_name from ‘s3://my_bucket/source_table’ credentials access_key_id <access_key_id>  secret_access_key <secret_access_key>

Step 6: Delete the Rows already present in the Temporary Table

Our temporary table now contains the complete data from TokuDB instance. If we load this data directly to the target table, it would result in the target table having duplicate rows. To avoid this, we first delete all the rows in the target table which are present in the temporary table.

begin transaction; delete from target_table_name using stage_table where target_table_name.primarykey = stage_table.primarykey;

Step 7: Load the Data Rows from the Temporary Table

This step loads all the data from the temporary table to the target table and the transaction is completed. Please note to wrap these commands inside a transaction so that the target table is not corrupted in case something goes wrong.

insert into target_table_name  select * from stage_table; end transaction;

Limitations of Using Custom ETL Scripts to Connect TokuDB to Redshift

Here are the limitations involved in using Custom ETL Scripts to connect TokuDB to Redshift:

  • Both mysqldump command and select table approach results in the table being slowed down during the operation in case of large datasets. 
  • There might be use cases where you want the data to be streamed in real-time to the data warehouse. You would have to put in additional effort in terms of engineering resources to build a system like this. 
  • Since this is a completely manual process, there will be problems related to data type conversions between TokuDB and Redshift. There is no scope for enforcing a specific data type conversion between TokuDB and Redshift in the above method. 
  • In case you build this infrastructure to load only a subset of data, often scripts break as the source schema is evolving. This can eventually result in data loss. 

Method 2: Using Hevo Data to Connect TokuDB to Redshift

Hevo Logo
Image Source

The best way to avoid the above limitations is to use a Data Integration Platform that works out of the box. Using a fully managed Data Pipeline platform like Hevo can help you bring data from TokuDB to Redshift instantly. Hevo’s real-time streaming architecture reliably moves the latest data into your warehouse.

Sign up here for a 14-Day Free Trial!

Hevo takes care of all your data preprocessing and lets you focus on key business activities and draw a much powerful insight on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. 

No coding, No ETL Scripts, No Cron Job management. With Hevo, move your data from Tokudb (MySQL) to Redshift in 3 simple steps:

  • Step 1: Connect to TokuDB (MySQL) source by providing connection settings.
  • Step 2: Select the mode of replication you want: (a) Load the result set of a Custom Query (b) Full dump of MySQL Tables (c) Load Data via BinLog.
  • Step 3: Configure the Redshift warehouse where the data needs to be moved.
Redshift Destination Configuration
Image Source

That’s it! Simple and straightforward. Hevo will do all the groundwork to ensure that the data is fetched in an incremental mode and copied to Redshift in real-time. 

Check out what makes Hevo amazing:

  • Real-Time Data Transfer: Hevo with its strong Integration with 100+ sources, allows you to transfer data quickly & efficiently. This ensures efficient utilization of bandwidth on both ends.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer. 
  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Tremendous Connector Availability: Hevo houses a large variety of connectors and lets you bring in data from numerous Marketing & SaaS applications, databases, etc. such as Google Analytics 4, Google Firebase, Airflow, HubSpot, Marketo, MongoDB, Oracle, Salesforce, Redshift, etc. in an integrated and analysis-ready form.
  • Simplicity: Using Hevo is easy and intuitive, ensuring that your data is exported in just a few clicks. 
  • Completely Managed Platform: Hevo is fully managed. You need not invest time and effort to maintain or monitor the infrastructure involved in executing codes.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Conclusion

This blog talks about the two methods you can use to move data from TokuDB to Redshift in a seamless fashion: using Google Data Studio and a third-party tool, Hevo.

Visit our Website to Explore Hevo

In addition to TokuDB, you can use Hevo Data Integration Platform to move data from 100s of different sources – Cloud Applications, Databases, SDKs, etc. You can check out the complete list here. (www.hevodata.com/integrations). This will ensure that all your Data integration needs are taken care of as your both your business and data sources grow. 

Sign Up for a 14 Day Free Trial to experience the simplicity of data replication from TokuDB to Redshift using Hevo Data Integration Platform. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

No-code Data Pipeline for Redshift