TokuDB to Snowflake ETL – Steps to Move Data Instantly

on Tutorial • February 25th, 2020

The need to store, transform, analyze, and share data is growing exponentially, and demand for cloud-based data analytics and data warehouse solutions is rising with it. Using the cloud for data processing, analytics, and reporting has become popular mainly due to its convenience and superior performance. In this blog post, we will walk through a migration scenario in which a fictional business moves its data from an on-premises TokuDB database to Snowflake, a cloud-based data warehouse. To set the stage, let's first compare the two solutions.

Understanding TokuDB

TokuDB is a highly scalable storage engine for MySQL and MariaDB. It offers high data compression and fast insertions and deletions, among many other features. This makes it a great fit for high-performance, write-intensive environments. It uses a fractal tree data structure and large data pages to manage and read data efficiently. However, concurrency, scale, resiliency, and security are bottlenecks that limit TokuDB's performance. It is available in an open-source version and an enterprise edition.

Understanding Snowflake

Snowflake is a cloud data warehouse that became generally available in 2015. It is primarily available on AWS and Azure. Snowflake is similar to BigQuery in that it stores data separately from where it does its compute: on AWS, it keeps the actual table data in S3 and can provision any number of compute nodes to process that data.

In contrast to an on-prem TokuDB deployment, Snowflake offers instant access to virtually unlimited compute and storage resources on demand.

Snowflake Benefits:

  1. Snowflake is specifically optimized for analytics workloads. It’s therefore ideal for businesses dealing with very complex data sets.
  2. Snowflake scales better, both in storage capacity and in query performance.
  3. Snowflake also offers better security compared to an on-prem data warehouse, since cloud data warehouse providers must meet stringent security and compliance requirements.
  4. Migrating your data to the cloud is also cost-effective since there is no huge initial outlay and you don’t have to maintain physical infrastructure.

Moving Data from TokuDB to Snowflake

  1. Building Custom ETL Code: This approach requires you to invest significant engineering resources. Broadly, you would need to export the data from TokuDB, stage it in S3, prepare it for loading, and finally copy it into Snowflake. The details and challenges of each step are described in the next sections.
  2. Using a Ready Data Integration Platform like Hevo (Official Snowflake ETL Partner)
    Hevo is a cloud data pipeline platform that seamlessly moves data from TokuDB to Snowflake in real-time without having to write any code. By deploying Hevo, the data transfer can be completely automated and would not need any human intervention. This frees your team's bandwidth for extracting meaningful insights instead of wrangling with code.

TokuDB to Snowflake: Building Custom Code to Move Data

1. Export TokuDB tables to CSV

There are multiple ways to back up a TokuDB database; here we will use a simple SQL statement to perform a logical export.

SELECT * FROM `database_name`.`table_name`
INTO OUTFILE 'path_to_folder/filename.csv'
FIELDS TERMINATED BY '|' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n';

This command dumps the data into CSV format which can then easily be imported into Snowflake.

Repeat this command for all tables and ensure that your TokuDB server has enough storage space to hold the CSV files.
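Repeating the export by hand for every table is error-prone; it can be scripted instead. Below is an illustrative Python sketch (the database and table names are placeholders) that generates one OUTFILE statement per table, using '|' as the field delimiter to match the Snowflake file format used in the load step:

```python
def outfile_statement(database: str, table: str, out_dir: str = "/tmp/export") -> str:
    """Build a SELECT ... INTO OUTFILE statement for one table."""
    return (
        f"SELECT * FROM `{database}`.`{table}` "
        f"INTO OUTFILE '{out_dir}/{table}.csv' "
        "FIELDS TERMINATED BY '|' ENCLOSED BY '\"' ESCAPED BY '\\\\' "
        "LINES TERMINATED BY '\\n';"
    )

# Illustrative table names; in practice, read them from SHOW TABLES.
for table in ["customers", "orders"]:
    print(outfile_statement("database_name", table))
```

Each generated statement can then be run through the mysql client against the TokuDB server.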

2. Upload the source data files to S3

After generating the CSV files, we need to upload the data to a location that Snowflake can access. Install the AWS CLI on your system (see AWS's guide: How to install the AWS CLI) and then execute the following command.

aws s3 cp filename.csv s3://{YOUR_BUCKET_NAME}/
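If you exported several tables, the upload can be looped. Here is a minimal shell sketch (the folder and bucket names are placeholders) that prints each upload command as a dry run; remove the echo to actually upload:

```shell
# Dry run: print one upload command per exported CSV.
# Remove 'echo' to perform the actual uploads (assumes the AWS CLI is configured).
for f in path_to_folder/*.csv; do
  [ -e "$f" ] || continue          # skip when no CSV files are present
  echo aws s3 cp "$f" "s3://{YOUR_BUCKET_NAME}/"
done
```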

3. Create an S3 Stage

Using the SnowSQL CLI client, first define the file format that the stage references, then create the stage:

create or replace file format mycsvformat
  type = 'csv' field_delimiter = '|' skip_header = 1;

create or replace stage my_csv_stage
  file_format = mycsvformat
  url = 's3://{YOUR_BUCKET_NAME}'
  credentials = (aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY');

The example above creates an external stage named my_csv_stage.

4. Create a table in Snowflake

Create a table whose columns match the schema of the exported TokuDB table. You will load data into this table in the next step.

create or replace table {YOUR_TABLE_NAME} (
  {COLUMN_DEFINITIONS}
);
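For illustration, a hypothetical orders table could be declared like this (the column names and types are assumptions, not derived from any real schema):

```sql
create or replace table orders (
  id integer,
  customer_name varchar,
  order_total number(10,2),
  created_at timestamp_ntz
);
```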

5. Loading data to Snowflake – Complete overwrite

Loading data requires a Snowflake compute cluster. Run the following command in the SnowSQL CLI client:

copy into {YOUR_TABLE_NAME}
  from 's3://{YOUR_BUCKET_NAME}/'
  credentials = (aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY')
  file_format = (type = csv field_delimiter = '|' skip_header = 1);

This command loads data from all CSV files at the given S3 path. Alternatively, you can load from the external stage created in step 3 by replacing the S3 URL and credentials with @my_csv_stage.

6. Validation

select * from {YOUR_TABLE_NAME} limit 10;
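Beyond spot-checking rows, comparing row counts between source and destination is a quick way to catch truncated loads; run the equivalent counts on both sides and check that they match:

```sql
-- On Snowflake (SnowSQL):
select count(*) from {YOUR_TABLE_NAME};

-- On TokuDB (mysql client), for comparison:
-- SELECT COUNT(*) FROM `database_name`.`table_name`;
```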

The above approach is effort-intensive. You would need to hand-code and coordinate many steps to achieve the objective.

TokuDB to Snowflake: Limitations of Custom Code Approach

  • This method is ideal for a one-time bulk load. In case you are looking to stream data in real-time, you might have to configure cron jobs and write additional code to achieve this. 
  • More often than not, the use case to move data from TokuDB to Snowflake data warehouse is not this straightforward. You might need to clean, transform and enrich the data to make it analysis-ready. This would not be easy to achieve. 
  • Since the data moved from TokuDB is critical to your business, you will need to constantly monitor the infrastructure to ensure that nothing breaks. A failure at any step could lead to irretrievable data loss.
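The cron-based scheduling mentioned above could look like the entry below; the script path and schedule are hypothetical, and the script itself would need to re-run the export, upload, and COPY steps end to end:

```
# Hypothetical crontab entry: re-run the pipeline hourly and log the output
0 * * * * /opt/etl/tokudb_to_snowflake.sh >> /var/log/tokudb_etl.log 2>&1
```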

Easier Way To Move Data From TokuDB To Snowflake

Hevo (an official Snowflake ETL partner) is a managed platform that simplifies data migration. Moving data from TokuDB to Snowflake requires just three steps:

  1. Connect to your TokuDB database.
  2. Select the mode of replication:
    - Load selected TokuDB tables
    - Load data via Custom Query
    - Load data through Binlog
  3. Configure the Snowflake destination.

That is it. Hevo will now take care of reliably loading data from TokuDB to Snowflake in real-time.

In addition to TokuDB, Hevo can also bring data into Snowflake from a wide array of sources: databases (MySQL, PostgreSQL, MongoDB, and more) and cloud applications (Google Analytics, Google Ads, Facebook Ads, Salesforce, and more). This allows Hevo to scale on demand as your data needs grow. Sign up for a full-featured 14-day free trial to see the simplicity of Hevo first-hand.

No-code Data Pipeline for Snowflake