TokuDB delivers high performance and provides much higher storage capacity, all without slowing down. Hence, organizations with a write-heavy load on their databases are motivated to use TokuDB instead of the typical InnoDB engine for MySQL. Since MySQL as such is not ideal for offline analysis and aggregation requirements, these databases need to be periodically copied to a data warehouse like AWS Redshift.

Any transactional database is optimized for fast writes and short response times. It is not ideal for business analysts to run their complex queries against the live transactional database, as this will slow down the online database. Moreover, a Data Warehouse can contain data from multiple sources, giving analysts a one-stop place where everything they need is stored and cataloged. This helps them perform deep, meaningful analysis. Therefore, it is imperative to move data from TokuDB to Redshift.

This article will introduce you to TokuDB and Amazon Redshift. It will also explain an easy method you can use to transfer data from TokuDB to Amazon Redshift.

Introduction to TokuDB

TokuDB provides an alternate high-performance storage engine for MySQL in place of the ubiquitous InnoDB storage engine. This high-performance nature of the TokuDB engine comes from its use of fractal tree indexing. The engine is specifically designed for write-heavy workloads and performs at its best if used with SSDs.

The important thing to note here is that once TokuDB is installed as a storage engine for MySQL, the instance behaves exactly like any other MySQL instance running InnoDB, and there is no difference in querying capability. At the same time, it raises the practical storage limit of MySQL beyond tens of TBs, which is where a normal MySQL instance starts to slow down. That said, TokuDB has a limitation that may be a deal-breaker for some use cases: it does not enforce foreign key constraints, so it is the user's responsibility to ensure they are met.
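For example, on a MySQL or Percona Server installation where the TokuDB plugin has already been installed and enabled, you can confirm the engine is available and create a table that uses it. This is a minimal sketch; the table and column names below are purely illustrative.

SHOW ENGINES;

CREATE TABLE orders (
  order_id BIGINT NOT NULL PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  amount DECIMAL(10,2),
  created_at DATETIME
) ENGINE=TokuDB;

If TokuDB is listed in the SHOW ENGINES output with Support set to YES or DEFAULT, the engine is ready to use; an existing InnoDB table can also be converted with ALTER TABLE table_name ENGINE=TokuDB.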

Introduction to Amazon Redshift

Amazon Redshift offers a cloud-based data warehouse with a very clean interface and all the required APIs to query and analyze petabytes of data. Its biggest advantage is that it allows developers to focus only on their analysis jobs and offload all the complexities of managing a reliable warehouse service to Amazon.

Now that we have covered why these two technologies are important, let us move on to the problem of copying data between them.

Need a No-Code Data Pipeline to Transfer Your Data? Try Hevo

With its no-code data pipeline and easy-to-use interface, Hevo provides a hassle-free solution for transferring data from various sources to your desired destination or warehouse. It is fully managed and completely automates the process of not only loading data but also transforming it into an analysis-ready form.

Advantages of using Hevo Data:

  • Minimal Setup- You will require minimal setup and bandwidth to load data from TokuDB using the Hevo platform. 
  • No Data Loss- The Hevo architecture is fault-tolerant and allows easy, reliable, and seamless data transfer from various data sources to any data warehouse without data loss. 
  • 100’s of Out-of-the-Box Integrations– The Hevo platform brings data from various sources, such as SDKs, Cloud Applications, Databases, and so on, into Data Warehouses and Databases. 
  • Automatic Schema Detection and Mapping- The schema of incoming data is scanned automatically. If changes are detected, they are handled seamlessly and incorporated into the Database or Data Warehouse. 
  • Exceptional Support- Hevo has 24×7 Technical support through emails, calls, and chat.
  • Transparent Pricing-  Hevo offers transparent pricing with no hidden fees, allowing you to budget effectively while scaling your data integration needs.

Elevate your data migration and transformation journey with Hevo!

Get Started with Hevo for Free

Using Custom ETL Scripts to Connect TokuDB to Redshift

Here are the steps involved in using Custom ETL Scripts to connect TokuDB to Redshift:

Step 1: Copying Data from TokuDB

Since TokuDB does not change the basic behavior of a MySQL instance, the simplest way to copy data from a TokuDB instance is to use the mysqldump command. The challenge with this approach is that you need sufficiently large disk space to hold the complete data on the server where MySQL is installed. If the server has enough space, execute the command below to take a copy of the source table.

mysqldump -u username -p -t -T/path/to/target_directory source_db_name source_table_name --fields-terminated-by=','
  • mysqldump -u username -p: Starts the mysqldump utility to export a MySQL database. -u username specifies the username for authentication, and -p prompts for the password.
  • -t: Indicates that only the table data should be dumped, not the table structure.
  • -T/path/to/target_directory: Specifies the target directory where the output files will be stored. Each table will be dumped as a .txt file in this directory.
  • source_db_name: The name of the source database from which the table data will be exported.
  • source_table_name: The name of the specific table to export from the source database.
  • --fields-terminated-by=',': Specifies that fields in the output file should be separated by commas, creating a CSV format.

The above command will output two files to the target_directory – source_table_name.sql and source_table_name.txt. The .txt file will contain the data rows, with columns separated by commas.
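For instance, if the source database were named sales_db and the table orders (names used here purely for illustration), the invocation might look like the sketch below. Note that -T makes the MySQL server itself write the .txt file, so the target directory must be writable by the mysqld process and typically has to be allowed by the secure_file_priv setting.

mysqldump -u dbuser -p -t -T/var/lib/mysql-files/ sales_db orders --fields-terminated-by=','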

If the MySQL server does not have enough storage space to hold the files and the copy operation needs to use a different machine to hold the data, use the command below from the helper machine.

mysql -B -u username -ppassword sourcedb -h dbhost -e "select * from source_table" | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > source_table.csv
  • mysql -B -u username -ppassword sourcedb -h dbhost:
    • Connects to the MySQL database using:
      • -B: Enables batch mode, which prints the result rows as tab-separated values.
      • -u username: Specifies the username for authentication.
      • -ppassword: The password for the MySQL user (note there is no space between -p and the password).
      • sourcedb: The name of the source database.
      • -h dbhost: The hostname of the database server.
  • -e "select * from source_table": Executes the SQL command to select all records from source_table.
  • | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g":
    • Pipes the output through sed, which converts the tab-separated rows into CSV:
      • s/'/\'/: Escapes single quotes in the data.
      • s/\t/\",\"/g: Replaces each tab with "," so that fields are comma-separated and quoted.
      • s/^/\"/: Adds a starting double quote (") to each line.
      • s/$/\"/: Adds an ending double quote (") to each line.
      • s/\n//g: Removes newline characters.
  • > source_table.csv: Redirects the final output into a file named source_table.csv.

This command will output a CSV file named source_table.csv in the location where the command is executed.

Step 2: Moving the Source Data Files to S3

Once the CSV/TXT file is generated, we need to copy this data to a place where Redshift can access it. Create an S3 bucket (or a folder inside an existing bucket) in your AWS account and install the AWS CLI. After that, execute the command below.

aws s3 cp source_table.csv  s3://my_bucket/source_table/
  • aws s3 cp: This is the AWS CLI command used to copy files to or from Amazon S3.
  • source_table.csv: Specifies the local file that you want to copy. In this case, it’s source_table.csv.
  • s3://my_bucket/source_table/: Specifies the destination in Amazon S3 where the file will be copied. Here, it’s being copied to the source_table folder within the S3 bucket named my_bucket.
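If the AWS CLI has not been configured on the machine yet, it needs your credentials and default region before the copy command above will work. A minimal sketch; the aws s3 ls call afterwards is just a sanity check that the uploaded file is visible (my_bucket and source_table are the same placeholders used above):

aws configure
aws s3 ls s3://my_bucket/source_table/

aws configure prompts for the access key ID, secret access key, default region, and output format.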

Step 3: Loading Data to Redshift

If a complete overwrite of the Redshift data is permissible for the use case, the data in the S3 folder can be copied to the Redshift table with a single command using Redshift's native COPY command. To do this, log in to the AWS Management Console, enter the following command in the Query Editor of the Redshift console, and execute it.

copy target_table_name from 's3://my_bucket/source_table' credentials 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>';
  • COPY target_table_name: This command is used to load data into target_table_name from an external source, such as an S3 bucket.
  • FROM 's3://my_bucket/source_table': Specifies the source location of the data to be copied, which is a path to a folder or file in the S3 bucket named my_bucket.
  • CREDENTIALS '...': This clause provides the authentication details for accessing the S3 bucket, passed as a single quoted string.
  • aws_access_key_id=<access_key_id>: Specifies the AWS access key ID for authentication. You need to replace <access_key_id> with your actual access key ID.
  • aws_secret_access_key=<secret_access_key>: Specifies the AWS secret access key for authentication. You need to replace <secret_access_key> with your actual secret access key.

Here, access_key_id and secret_access_key represent the IAM user's credentials.
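Since the file produced in Step 1 is comma-separated (and quoted, if the mysql/sed variant was used), it can help to state the format explicitly, and an IAM role attached to the cluster can be used in place of raw keys. A hedged sketch, reusing the placeholder bucket and table names from above; the role ARN is a placeholder you would replace with your own:

copy target_table_name from 's3://my_bucket/source_table/' iam_role 'arn:aws:iam::<account_id>:role/<redshift_s3_role>' csv;

For the unquoted .txt output of mysqldump, delimiter ',' would be the closer match than csv.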

If a complete overwrite is not preferred for the use case, proceed to the steps below to execute an incremental load.

Step 4: Incremental Load Using a Temporary Table

You may have noted that in the above step, the complete data in the target table is overwritten by the command. There will be scenarios where this is unacceptable and the existing data in the target table needs to be retained. In that case, we first need to create a temporary table, copy all the data into that table, and then insert it into the original table after removing duplicates. The command below creates a temporary stage_table.

create temp table stage_table (like target_table_name);

It is imperative that we use the ‘like’ keyword here so that the table structure of the staging table is the same as that of the target table, including the distribution and sort keys.
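As an illustration (the columns and key choices below are hypothetical), if the target table was created with a distribution key and a sort key, a staging table created with LIKE inherits both, so the subsequent copy and insert behave consistently:

create table target_table_name (
  primarykey bigint,
  amount decimal(10,2),
  created_at timestamp
)
distkey (primarykey)
sortkey (created_at);

create temp table stage_table (like target_table_name);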

Step 5: Copying the Data into the Temporary Table

Next, you need to copy the data from the S3 folder into the temporary table using your AWS credentials.

copy stage_table from 's3://my_bucket/source_table' credentials 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>';
  • COPY stage_table:
    • This command loads the data into the temporary staging table stage_table, not into the target table directly.
  • FROM 's3://my_bucket/source_table':
    • Specifies the source of the data, which is located in the S3 bucket named my_bucket, under the source_table path.
  • CREDENTIALS '...':
    • Provides the authentication details for accessing the S3 bucket as a single quoted string.
  • aws_access_key_id=<access_key_id>:
    • Replace <access_key_id> with your actual AWS access key ID, used to identify your account.
  • aws_secret_access_key=<secret_access_key>:
    • Replace <secret_access_key> with your actual AWS secret access key, used for authentication.

Step 6: Delete the Rows already present in the Temporary Table

Our temporary table now contains the complete data from the TokuDB instance. If we load this data directly into the target table, it would result in the target table having duplicate rows. To avoid this, we first delete all the rows in the target table that are present in the temporary table.

begin transaction; delete from target_table_name using stage_table where target_table_name.primarykey = stage_table.primarykey;
  • BEGIN TRANSACTION;:
    • Starts a transaction for grouped operations.
  • DELETE FROM target_table_name:
    • Deletes rows from target_table_name.
  • USING stage_table:
    • References stage_table to identify which rows to delete.
  • WHERE target_table_name.primarykey = stage_table.primarykey;:
    • Deletes rows in target_table_name that have matching primary keys in stage_table.

Step 7: Load the Data Rows from the Temporary Table

This step loads all the data from the temporary table into the target table and completes the transaction. Note that these commands are wrapped inside a transaction so that the target table is not corrupted in case something goes wrong.

insert into target_table_name  select * from stage_table; end transaction;
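Putting Steps 4 through 7 together, the complete incremental (merge) load looks roughly like the sketch below. It assumes the staging table has already been loaded from S3 as shown in Step 5 and that primarykey uniquely identifies a row:

begin transaction;

delete from target_table_name using stage_table where target_table_name.primarykey = stage_table.primarykey;

insert into target_table_name select * from stage_table;

end transaction;

drop table stage_table;

The final drop table is optional clean-up; a temporary table is dropped automatically at the end of the session anyway.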

Limitations of Using Custom ETL Scripts to Connect TokuDB to Redshift

Here are the limitations involved in using Custom ETL Scripts to connect TokuDB to Redshift:

  • Both the mysqldump command and the SELECT-based approach slow down the source table while the operation runs, in the case of large datasets. 
  • There might be use cases where you want the data to be streamed in real time to the data warehouse. You would have to put in additional engineering effort to build a system like this. 
  • Since this is a completely manual process, there will be problems related to data type conversions between TokuDB and Redshift, and the above method offers no way to enforce a specific conversion. 
  • If you build this infrastructure to load only a subset of the data, the scripts often break as the source schema evolves. This can eventually result in data loss. 

Conclusion

This blog talks about the method you can use to move data from TokuDB to Redshift in a seamless fashion: using custom ETL scripts.

You can use Hevo Data to move data from 150+ Data Sources, such as Cloud Applications, Databases, SDKs, etc to the Data Warehouse of your choice. This will ensure that all your Data integration needs are taken care of as both your business and data sources grow. Connect with us today to improve your data management experience and achieve more with your data.

FAQs

How to migrate MongoDB to Redshift?

Export MongoDB data to an intermediary storage like S3, then use Redshift’s COPY command to load the data. Alternatively, tools like Hevo can be used to automate the migration.

What is the difference between Redshift and DynamoDB?

Redshift is a data warehouse optimized for complex queries and analytics, while DynamoDB is a NoSQL database designed for real-time applications with key-value and document data models.

Satyam Agrawal
CX Engineer

Satyam boasts over two years of adept troubleshooting and deliverable-oriented experience. His client-focused approach has enabled seamless data pipeline management for numerous SMEs and Enterprises. Proficient in Hevo’s ETL architecture and skilled in DBMS sources, he ensures smooth data movement for clients. Satyam leverages automated tools to extract and load data from various databases to warehouses, implementing SQL principles and API calls for day-to-day troubleshooting.