TokuDB to BigQuery ETL – Steps to Move Data

Tutorials • March 29th, 2020

Storing and querying massive datasets is a huge challenge, especially if you lack the right hardware and infrastructure. Organizations of all sizes are looking to leverage the scale, simplicity, and security of deploying their data infrastructure on data warehouses. Google BigQuery is one such data warehouse, tailored for analyzing data at scale. This blog aims to help you move data from TokuDB to BigQuery.

Before we dive in, let us briefly understand a little about each of these applications.

What is TokuDB?

TokuDB is a transactional ACID + MVCC storage engine that makes MySQL and MariaDB even better by increasing performance, adding high compression, and allowing for true schema agility. These capabilities are built on Fractal Tree indexes.

Where does the performance come from?

  • TokuTek’s Fractal Tree indexes
    • Much faster than B-trees for workloads larger than RAM
  • Significant I/O reduction
    • Messages defer I/O on add/update/delete operations
  • All reads and writes are compressed
    • This enables users to add more indexes (see the sketch after this list)
  • Queries go faster
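
To make the compression and indexing points above concrete, here is a minimal sketch of creating a compressed TokuDB table with a secondary index. The table and column names are hypothetical, and the sketch assumes the TokuDB engine is already installed on your MySQL/MariaDB server:

mysql --user=[USER_NAME] --password [DATABASE] -e "
  CREATE TABLE orders (
    id          BIGINT NOT NULL AUTO_INCREMENT,
    customer_id BIGINT NOT NULL,
    total       DECIMAL(10,2),
    created_at  DATETIME,
    PRIMARY KEY (id),
    KEY idx_customer (customer_id)
  ) ENGINE=TokuDB ROW_FORMAT=TOKUDB_ZLIB;"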

When it comes to features, TokuDB is solid, but it lacks:

  • Foreign keys
  • Support for virtual columns
  • Transportable tablespaces
  • Sampling statistics

TokuDB is offered in 2 editions:

  1. Community
    • Community support (Google Groups “tokudb-user”)
  2. Enterprise subscription with
    • Commercial support
    • Hot backup
    • Access to TokuDB experts
    • Input to the product roadmap

What is Google BigQuery?

BigQuery is Google’s fully managed, highly scalable analytics data warehouse, delivered as a RESTful web service. Companies use BigQuery to ingest, store, query, and derive insights from their datasets. BigQuery is a serverless Software as a Service (SaaS) offering, which means your data professionals can spend most of their time analyzing data to find meaningful insights using standard SQL. BigQuery’s per-second billing approach is also very attractive to most businesses. You can read about the features of Google BigQuery here.
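
As a quick illustration of querying BigQuery with standard SQL from the command line, here is a sketch using the bq CLI; the project, dataset, table, and column names are hypothetical:

bq query --use_legacy_sql=false \
  'SELECT customer_id, SUM(total) AS lifetime_value
   FROM `myproject.mydataset.orders`
   GROUP BY customer_id
   ORDER BY lifetime_value DESC
   LIMIT 10'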

Why replicate TokuDB to Google BigQuery?

A major reason to replicate your TokuDB data into Google BigQuery is the ability to join multiple data sources for valuable analysis.

TokuDB is mostly used to store information and process transactions for Software as a Service (SaaS) applications. This is because TokuDB is optimized for transactions, which require the ACID properties: atomicity, consistency, isolation, and durability.

BigQuery, by contrast, is optimized for data retrieval and analysis rather than transactions.

Methods for TokuDB to BigQuery Replication

There are several approaches to database replication. The key objective of this post is to illustrate the steps required to extract, transform, and load (ETL) data from an existing TokuDB database into Google BigQuery.

  1. Full dump and load using custom code
    Since TokuDB does not change the basic behavior of a MySQL instance, a simple approach is to export data from the TokuDB instance to a CSV file and then load that file into BigQuery.
  2. Use Hevo – an automated data pipeline
    Use Hevo Data to load, transform, and cleanse the data. Use this method to load large volumes of data from multiple data sources, or to load data incrementally and automatically.

TokuDB to BigQuery: Full dump and data load

Prerequisites:

  1. TokuDB with authenticated access to the specified schema.
  2. Google Cloud SDK installed, with a shell configured for a GCP project (see the sketch after this list).
  3. A bucket belonging to the GCP project with write permissions.
  4. A BigQuery dataset in the same GCP project.
  5. Adequate space on the local drive to hold the exported table in uncompressed CSV format.
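
For prerequisite 2, configuring the Cloud SDK typically looks like the following sketch; the project name is the placeholder used throughout this post:

gcloud auth login
gcloud config set project [PROJECT_NAME]
gcloud config list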

TokuDB to BigQuery – General overview of the process:

Step 1 – Exporting table data to a CSV file.

Step 2 – Creating a GCS bucket.

Step 3 – Uploading the CSV file to GCS.

Step 4 – Loading data from GCS into BigQuery.

Step-by-step:

1. Export data to a CSV file from your local TokuDB Server.

mysql --host=[INSTANCE_IP] --user=[USER_NAME] --password [DATABASE] \
  -e "SELECT * FROM [TABLE] INTO OUTFILE '[FILE_NAME]' CHARACTER SET 'utf8mb4'
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"'"
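
Once the export finishes, a quick sanity check on the output file (the file name is the placeholder used above) helps catch empty or truncated dumps before you upload anything:

wc -l [FILE_NAME]
head -n 5 [FILE_NAME]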

2. Create a Google Cloud Storage (GCS) bucket for the export:

gsutil mb -p [PROJECT_NAME] -l [LOCATION_NAME] gs://[BUCKET_NAME]


3. Upload the CSV file to your GCS bucket:

gsutil cp [OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
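
You can confirm that the object landed in the bucket with:

gsutil ls -l gs://[DESTINATION_BUCKET_NAME]/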

4. Load CSV data from your GCS bucket into BigQuery:

bq load \
    --source_format=CSV --field_delimiter="," --null_marker="\N" --quote='"' \
    [DATASET_NAME].[TABLE_NAME] \
    gs://[SOURCE_BUCKET_NAME]/[FILE_NAME].csv \
    ./[SCHEMA_NAME].json



This command loads data from a GCS bucket into a table in your dataset. You must provide the table schema in a local JSON schema file; a sketch of such a file follows the example below.

Example:

    bq load \
     --source_format=CSV --field_delimiter="," --null_marker="\N" --quote='"' \
     mydataset.film_locations \
     gs://mygcsbucket/data.csv \
     ./schema.json
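
The local schema file referenced above is a JSON array describing each column. Below is a minimal sketch; the column names and types are hypothetical and must match your exported table:

cat > ./schema.json <<'EOF'
[
  {"name": "id",          "type": "INTEGER",   "mode": "REQUIRED"},
  {"name": "customer_id", "type": "INTEGER",   "mode": "NULLABLE"},
  {"name": "total",       "type": "NUMERIC",   "mode": "NULLABLE"},
  {"name": "created_at",  "type": "TIMESTAMP", "mode": "NULLABLE"}
]
EOF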


Limitations

  1. BigQuery performs best when your data is denormalized using nested and repeated fields. CSV files do not support nested or repeated data. If you want to load nested data, use newline-delimited JSON, ORC, Avro, or Parquet instead.
  2. Compressed CSV files are quicker to upload and take up less space, but BigQuery can load uncompressed files significantly faster. You cannot include both compressed and uncompressed files in the same load job.
  3. If you use gzip compression, expect the load to be slower than with uncompressed data, since BigQuery cannot read gzip-compressed files in parallel.
  4. When you load CSV data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
  5. When you load CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date segment of the timestamp, and the date should be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) segment of the timestamp must use a colon (:) separator.
  6. If any field in your TokuDB data contains the delimiter or the quote character and is not properly enclosed and escaped, it will break your columns.
  7. The above approach does not let you load data in real time. You will need to write additional code and set up a cron job to achieve this (see the crontab sketch after this list).
  8. If you need to clean, transform, and enrich the data before loading it into the warehouse, you will have to write custom code for that as well, which adds overhead.
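
As an illustration of point 7, the export, upload, and load steps above could be wrapped in a shell script and scheduled with cron. The script name, log path, and schedule below are hypothetical placeholders:

# run the dump-and-load script every night at 02:00
0 2 * * * /opt/etl/tokudb_to_bigquery.sh >> /var/log/tokudb_bq_etl.log 2>&1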

It is possible to build your own TokuDB to BigQuery ETL solution, and it makes sense to do so when starting out with small data volumes and a full dump and load approach. However, if your solution requires high throughput, high availability, low latency, or frequent schema changes, you are probably better off using a third-party solution like Hevo Data.

As a rule of thumb, once your tables grow beyond 1 GB or see more than a million updates per day, third-party solutions start to become the more feasible alternative. For instance, when dealing with terabytes of data flowing through your system daily, Hevo’s automated data pipeline ensures that you never lose an event and that all errors are caught and re-streamed, saving valuable engineering time.

Hevo Data – A Simpler Alternative to Move Data from TokuDB to BigQuery

Hevo provides a robust, automated data pipeline that continuously collects, transforms, and moves data from TokuDB to BigQuery in real time. Hevo comes with an intuitive drag-and-drop interface that lets you accomplish this task instantly, without having to write a single line of code.

Here are the steps to move data from TokuDB to BigQuery:

  1. Connect to TokuDB source by providing connection settings
  2. Select the mode of replication you want: (a) Load the result set of a Custom Query (b) Full dump of tables (c) Load data via log
  3. Configure BigQuery warehouse where you want to load the data

That’s it! Simple and straightforward. Hevo will do all the groundwork to ensure that your data is moved from TokuDB to Google BigQuery in real time.

In addition to TokuDB, you can use Hevo to move data from 100s of different sources – Cloud Applications, Databases, SDKs, etc. You can check out the complete list here (www.hevodata.com/integrations). This will ensure that all your data integration needs are taken care of as both your business and data sources grow. 

Sign-up for a 14-day free trial to experience the simplicity of data replication from TokuDB (and 100s of other sources) to BigQuery using Hevo.
