Storing and querying massive datasets is a huge challenge, especially if you lack the right hardware and infrastructure. Organizations of all sizes are looking to leverage the scale, simplicity, and security of deploying their data infrastructure on data warehouses. Google BigQuery is one such data warehouse, tailored for analyzing data at scale. This blog is aimed at helping you move data from TokuDB to BigQuery.

Before we dive in, let us briefly understand a little about each of these applications.

Introduction to TokuDB


TokuDB is a transactional ACID + MVCC storage engine that makes MySQL and MariaDB even better by increasing performance, adding high compression, and allowing for true schema agility. These features were achieved by implementing Fractal Tree indexes.

A few key benefits of TokuDB are as follows:

  • TokuTek’s Fractal Tree indexes
    • Much faster than B-trees for larger-than-RAM workloads
  • Significant I/O reduction
    • Messages defer I/O on add/update/delete operations
  • All reads and writes are compressed
    • This enables users to add more indexes
  • Faster queries

However, TokuDB lacks the following features:

  • Foreign keys
  • Virtual columns
  • Transportable tablespaces
  • Sampling statistics

TokuDB is offered in 2 editions:

  1. Community
    • Community support (Google Groups “tokudb-user”)
  2. Enterprise (subscription), which includes:
    • Commercial support
    • Hot backup
    • Access to TokuDB experts
    • Input to the product roadmap

Introduction to Google BigQuery


BigQuery is Google’s fully managed, highly scalable analytics data warehouse, exposed as a RESTful web service. Companies use BigQuery to ingest, store, query, and derive insights from their datasets. BigQuery is delivered as serverless Software as a Service (SaaS), which means your data professionals can spend most of their time analyzing data to find meaningful insights using standard SQL. BigQuery’s per-second billing approach is also very attractive to most businesses. You can read about the features of Google BigQuery here.
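
To give a feel for the workflow, here is a minimal sketch of running a standard SQL query from the command line with the bq CLI, assuming the Google Cloud SDK is installed and authenticated (the public shakespeare sample table is used purely for illustration):

bq query --use_legacy_sql=false \
  'SELECT corpus, SUM(word_count) AS total_words
   FROM `bigquery-public-data.samples.shakespeare`
   GROUP BY corpus
   ORDER BY total_words DESC
   LIMIT 5'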

Understanding the Importance of Connecting TokuDB to BigQuery

A major reason to replicate your TokuDB data into Google BigQuery is the ability to join multiple data sources for valuable analysis.

TokuDB is mostly used to store information and process transactions for Software as a Service (SaaS) applications. This is because TokuDB is optimized for transactions, which require the ACID properties: atomicity, consistency, isolation, and durability.

BigQuery, on the other hand, is optimized for data fetching and analysis rather than transactions.

Methods for TokuDB to BigQuery Replication

There are several approaches to database replication. The key objective of this post is to illustrate the steps required to extract, transform, and load (ETL) data from an existing TokuDB database into Google BigQuery.

Method 1: Writing Custom Scripts to Move Data from TokuDB to BigQuery

Since TokuDB does not change the basic behavior of a MySQL instance, a simple approach is to export data from the TokuDB instance to a CSV file and then load that file into BigQuery.

Method 2: Using Hevo Data to Move Data from TokuDB to BigQuery

Use Hevo Data to load, transform, and cleanse the data. Use this method to load large volumes of data from multiple data sources, or to load data incrementally and automatically.


Understanding the Methods to Connect TokuDB and BigQuery

You can use the following methods to establish a connection from TokuDB to BigQuery in a seamless fashion:

Method 1: Writing Custom Scripts to Move Data from TokuDB to BigQuery

These are the steps involved in using Custom ETL Scripts to move data from TokuDB to BigQuery:

Prerequisites:

  • TokuDB with authenticated access to the specified schema.
  • The Google Cloud SDK installed, with a shell configured for a GCP project (a quick sanity check of the GCP-side setup appears after this list).
  • A bucket belonging to the GCP project, with write permissions.
  • A Google BigQuery dataset to load the data into.
  • Adequate local disk space to hold the table in uncompressed CSV format.
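
As that quick sanity check, something along these lines can be run beforehand (the project, bucket, and dataset names are placeholders you would substitute):

# Authenticate and point the SDK at the right project (hypothetical project ID)
gcloud auth login
gcloud config set project my-gcp-project

# Confirm the target bucket exists and is accessible (hypothetical bucket name)
gsutil ls gs://my-tokudb-export-bucket

# Confirm the bq CLI can see the target dataset (hypothetical dataset name)
bq ls my_dataset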

Step 1: Exporting Table Data to a CSV File

mysql --host=[INSTANCE_IP] --user=[USER_NAME] --password [DATABASE] \
  -e "SELECT * FROM [TABLE] INTO OUTFILE '[FILE_NAME]' CHARACTER SET 'utf8mb4'
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"'"
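
Note that SELECT ... INTO OUTFILE writes the CSV file on the database server’s filesystem. If you only have client access to the TokuDB instance, a rough client-side alternative (a sketch, not an official command) is to redirect a batch query to a local file:

mysql --host=[INSTANCE_IP] --user=[USER_NAME] --password \
  --batch --skip-column-names \
  -e "SELECT * FROM [TABLE]" [DATABASE] > [FILE_NAME].tsv

Since batch output is tab-separated, the bq load command in Step 4 would then need --field_delimiter="\t" instead of a comma.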

Step 2: Creating a GCS Bucket

Create a Google Cloud Storage (GCS) bucket for the export:

gsutil mb -p [PROJECT_NAME] -l [LOCATION_NAME] gs://[BUCKET_NAME]
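
For example, with hypothetical project, location, and bucket names:

gsutil mb -p my-gcp-project -l US gs://my-tokudb-export-bucket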


Step 3: Uploading the CSV File to GCS

gsutil cp [OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
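
For instance, continuing with the hypothetical bucket name used above and an orders.csv export:

gsutil cp ./orders.csv gs://my-tokudb-export-bucket/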

Step 4: Loading Data from GCS to BigQuery

bq load \
    --source_format=CSV --field_delimiter="," --null_marker="\N" --quote='"' \
    [DATASET_NAME].[TABLE_NAME] \
    gs://[SOURCE_BUCKET_NAME]/[FILE_NAME].csv \
    ./[SCHEMA_NAME].json



This command loads data from a GCS bucket into a table in your dataset. You must provide the table schema in a local JSON schema file.

Example:

    bq load \
      --source_format=CSV --field_delimiter="," --null_marker="\N" --quote='"' \
      mydataset.san_francisco_film_locations \
      gs://mygcsbucket/data.csv \
      ./schema.json
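
The last argument is the local schema file (./schema.json in the example above). As a minimal sketch, assuming a table with three columns (the field names here are hypothetical), the file could be created like this:

cat > ./schema.json <<'EOF'
[
  {"name": "id",         "type": "INTEGER",   "mode": "REQUIRED"},
  {"name": "title",      "type": "STRING",    "mode": "NULLABLE"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
EOF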


Limitations of using Custom Scripts to Connect TokuDB to BigQuery

  • BigQuery performs best when your data is denormalized using nested and repeated fields. CSV files do not support nested or repeated data. If you want to load nested data, use the newline-delimited JSON, ORC, Avro, or Parquet formats.
  • Compressed CSV files are quicker to upload and take up less space, but BigQuery can load uncompressed files significantly faster. You also cannot include both compressed and uncompressed files in the same load job.
  • If you use gzip compression, be aware that the load will be slower than loading uncompressed data, since BigQuery cannot read gzip-compressed data in parallel.
  • When you load CSV data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
  • When you load CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date segment of the timestamp, and the date should be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) segment of the timestamp must use a colon (:) separator.
  • If any field in your TokuDB data contains the field delimiter or an unescaped quote character, it will break your columns.
  • The above approach does not allow you to load data in real time. You will need to write additional code and set up a cron job to achieve this (a rough sketch of such a job follows this list).
  • If you need to clean, transform, and enrich the data before loading it into the warehouse, you will have to write custom code for that as well, which adds overhead.
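
For illustration only, a bare-bones batch job along these lines could be scheduled with cron to refresh a BigQuery table periodically. All names and paths are hypothetical, and this full dump-and-reload approach is only practical for relatively small tables:

#!/bin/bash
# nightly_tokudb_to_bq.sh -- hypothetical full dump-and-reload job
set -euo pipefail

TABLE="orders"                          # hypothetical table name
CSV="/tmp/${TABLE}.csv"
BUCKET="gs://my-tokudb-export-bucket"   # hypothetical bucket

# 1. Export the table from the TokuDB/MySQL instance as tab-separated values
mysql --host=[INSTANCE_IP] --user=[USER_NAME] --password=[PASSWORD] \
  --batch --skip-column-names \
  -e "SELECT * FROM ${TABLE}" [DATABASE] > "${CSV}"

# 2. Upload the export to GCS
gsutil cp "${CSV}" "${BUCKET}/"

# 3. Replace the BigQuery table with the fresh export
bq load --replace --source_format=CSV --field_delimiter="\t" --null_marker="NULL" \
  "[DATASET_NAME].${TABLE}" "${BUCKET}/${TABLE}.csv" ./schema.json

# Example crontab entry (runs daily at 02:00):
# 0 2 * * * /path/to/nightly_tokudb_to_bq.sh >> /var/log/tokudb_to_bq.log 2>&1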

It is possible to build your own TokuDB to BigQuery ETL solution, and it makes sense to do so, especially when starting with small data volumes and a full dump-and-load approach. However, if your solution requires high throughput, high availability, low latency, or frequent schema changes, then you’re probably better off using a third-party solution like Hevo Data.

As a rule of thumb, once you have tables that are bigger than 1GB or once you have more than a million daily updates to your tables, third-party solutions start to become the more feasible alternative. For instance, when dealing with terabytes of data flowing through your system daily, Hevo’s automated data pipeline ensures that you never lose an event and all errors are caught to be re-streamed, saving valuable engineering time.

Method 2: Using Hevo Data to Move Data from TokuDB to BigQuery


Hevo provides a robust automated data pipeline that continuously collects, transforms, and moves data from TokuDB to BigQuery in real time. Hevo comes with an intuitive drag-and-drop interface that lets you accomplish this task without having to write a single line of code.

Here are the steps to move data from TokuDB to BigQuery:

  • Step 1: Connect to TokuDB source by providing connection settings
  • Step 2: Select the mode of replication you want: (a) Load the result set of a Custom Query (b) Full dump of tables (c) Load data via log
  • Step 3: Configure BigQuery warehouse where you want to load the data

Here are a few key advantages of leveraging Hevo:

  • Code-free data loading: You need not invest any engineering resources to write and maintain any ETL scripts or cron jobs. There is no need for constant monitoring and maintenance of the data transfer process.
  • Set up in minutes: Since Hevo is fully managed, the setup time is minimal. Data is moved from TokuDB to BigQuery in minutes.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Zero data loss: Hevo reliably delivers your data from TokuDB to BigQuery, ensuring that the data is accurate and there is no loss of data.
  • Automatic schema mapping: Hevo automatically detects the schema and any changes to it, and maps the data accordingly. It will detect any change in the incoming TokuDB schema and make the necessary changes in BigQuery, without any human intervention.

That’s it! Hevo will do all the groundwork to ensure that your data is copied from TokuDB to Google BigQuery in real time.

In addition to TokuDB, you can use Hevo to move data from 150+ different sources (including 40+ free sources): Cloud Applications, Databases, SDKs, and more. This ensures that all your data integration needs are taken care of as both your business and your data sources grow.

Conclusion

This blog talks about the two methods you can use to set up a connection from TokuDB to BigQuery: using custom ETL scripts and using a third-party tool, Hevo. It also gives a brief overview of TokuDB and BigQuery, highlighting their key features and benefits, before diving into the setup process. Setting up a custom ETL script can be challenging, especially for a beginner, and this is where Hevo Data saves the day.

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline, helps you transfer data from a source of your choice in a fully automated and secure manner without having to write any code repeatedly. Hevo, with its strong integration with 150+ sources (40+ free sources) and BI tools, allows you to not only export and load data but also transform and enrich your data and make it analysis-ready in a jiffy.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable Hevo Pricing that will help you choose the right plan for your business needs!

What is your preferred way of migrating data from TokuDB to BigQuery? Tell us in the comments below!

Former Director of Product Management, Hevo Data

Vivek Sinha has more than 10 years of experience in real-time analytics and cloud-native technologies. With a focus on Apache Pinot, he was a driving force in shaping innovation and defensible differentiators, including enhanced query processing, data mutability support, and cost-effective tiered storage solutions at Hevo. He also demonstrates a passion for exploring and implementing innovative trends within the dynamic data industry landscape.
