Storing and querying massive datasets is a huge challenge, especially if you lack the right hardware and infrastructure. Organizations of all sizes are looking to leverage the scale, simplicity, and security of deploying their data infrastructure on data warehouses. Google BigQuery is one such data warehouse, tailored for analyzing data at scale. This blog aims to help you move data from TokuDB to BigQuery.
Before we dive in, let us briefly understand a little about each of these applications.
Introduction to TokuDB
TokuDB is a transactional, ACID-compliant MVCC storage engine that makes MySQL and MariaDB even better by increasing performance, adding high compression, and allowing for true schema agility. These capabilities are achieved by implementing Fractal Tree indexes.
A few key benefits of TokuDB are as follows:
- TokuTek’s Fractal Tree indexes
  - Much faster than B-trees for larger-than-RAM workloads
  - Significant I/O reduction, since messages defer I/O on add/update/delete operations
  - All reads and writes are compressed, which enables users to add more indexes and makes queries run faster
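For illustration, once the TokuDB plugin is enabled on a MySQL or MariaDB server, tables are created on the engine like any other, with a per-table compression setting. A minimal sketch, in which the table and column names are hypothetical and the placeholders follow the convention used later in this guide:
mysql --user=[USER_NAME] --password [DATABASE] -e "
  CREATE TABLE orders (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    created_at DATETIME NOT NULL,
    KEY idx_customer (customer_id),
    KEY idx_created_at (created_at)
  ) ENGINE=TokuDB ROW_FORMAT=TOKUDB_ZLIB;  -- zlib compression applied to all reads and writes
"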
However, TokuDB lacks the following features:
- Foreign keys
- Support for virtual columns
- Transportable tablespaces
- Sampling statistics
TokuDB is offered in two editions:
- Community edition, with:
  - Community support (Google Groups “tokudb-user”)
- Enterprise subscription, with:
  - Commercial support
  - Hot backup
  - Access to TokuDB experts
  - Input to the product roadmap
Streamline your data transfer to Google BigQuery using Hevo’s no-code, automated platform. Hevo’s seamless integration ensures your data is always ready for analysis with just a few clicks.
Why Choose Hevo for BigQuery Data Transfer?
- No-Code Setup: Easily configure data pipelines without writing any code.
- Real-Time Sync: Enjoy continuous data updates for accurate analysis.
- Reliable & Scalable: Handle growing data volumes with Hevo’s robust infrastructure.
Experience hassle-free data transfer to BigQuery with Hevo’s integration process.
Get Started with Hevo for Free
Introduction to Google BigQuery
BigQuery is Google’s fully managed, highly scalable analytics data warehouse, delivered as a RESTful web service. Companies use BigQuery to store, ingest, and query their datasets and to retrieve insights from them. Because BigQuery is delivered as a serverless Software as a Service (SaaS) offering, your data professionals can spend most of their time analyzing data to find meaningful insights using standard SQL. BigQuery’s per-second billing approach is also very attractive to most businesses. You can read about the features of Google BigQuery here.
Understanding the Importance of Connecting TokuDB to BigQuery
A major reason to replicate your TokuDB data into Google BigQuery is the ability to join multiple data sources for valuable analysis.
TokuDB is mostly used to store information and process transactions for Software as a Service (SaaS) applications, because it is optimized for transactional workloads that require the ACID properties: atomicity, consistency, isolation, and durability.
In contrast, BigQuery is optimized for data retrieval and analysis rather than transactions.
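For illustration, once the TokuDB data has been replicated, it can be joined with other sources already in BigQuery using standard SQL. A minimal sketch, in which the project, dataset, and table names are hypothetical:
bq query --use_legacy_sql=false '
  SELECT o.customer_id,
         SUM(o.amount)      AS total_spend,
         COUNT(t.ticket_id) AS support_tickets
  FROM `myproject.sales.orders` AS o
  LEFT JOIN `myproject.support.tickets` AS t
    ON t.customer_id = o.customer_id
  GROUP BY o.customer_id
  ORDER BY total_spend DESC
  LIMIT 10'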
Writing Custom Scripts to Move Data from TokuDB to BigQuery
These are the steps involved in using Custom ETL Scripts to move data from TokuDB to BigQuery:
Prerequisites:
- TokuDB with authenticated access to the relevant schema.
- The Google Cloud SDK installed, with a shell configured for your GCP project.
- Permission to create and write to a Cloud Storage bucket in the GCP project.
- A Google BigQuery dataset to load the data into.
- Adequate space on the local drive to hold the table in uncompressed CSV format.
Step 1: Exporting Table Data to a CSV File
mysql --host=[INSTANCE_IP] --user=[USER_NAME] --password [DATABASE] \
 -e "SELECT * FROM [TABLE] INTO OUTFILE '[FILE_NAME]' CHARACTER SET 'utf8mb4'
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"' "
- Connects to the MySQL (TokuDB) database.
- Runs a query to select all data from a table.
- Exports the result to a CSV file on the database server.
- Uses UTF-8 encoding, with comma-separated fields optionally enclosed in double quotes.
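For illustration, a filled-in version of the export command might look like the following. The host, user, database, table, and file path are hypothetical, and note that INTO OUTFILE writes the file on the database server’s filesystem:
mysql --host=10.0.0.5 --user=export_user --password tokudb_app \
 -e "SELECT * FROM orders INTO OUTFILE '/tmp/orders.csv' CHARACTER SET 'utf8mb4'
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"' "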
Step 2: Creating a GCS Bucket
Create a Google Cloud Storage (GCS) bucket for the export:
gsutil mb -p [PROJECT_NAME] -l [LOCATION_NAME] gs://[BUCKET_NAME]
- gsutil mb: Creates a new Google Cloud Storage bucket.
- -p [PROJECT_NAME]: Specifies the Google Cloud project under which the bucket will be created.
- -l [LOCATION_NAME]: Defines the location (e.g., us-central1, europe-west1) where the bucket will be stored.
- gs://[BUCKET_NAME]: Specifies the name of the new storage bucket.
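For illustration, a filled-in version might look like this, where the project, location, and bucket names are hypothetical:
gsutil mb -p my-gcp-project -l us-central1 gs://tokudb-export-bucket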
Step 3: Uploading the CSV File to GCS
gsutil cp [OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
- gsutil cp: Copies files or objects to or from Google Cloud Storage.
- [OBJECT_LOCATION]: Specifies the path to the local file or Google Cloud Storage object that you want to copy.
- gs://[DESTINATION_BUCKET_NAME]/: Specifies the destination bucket in Google Cloud Storage where the object will be copied.
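For illustration, continuing with the hypothetical file and bucket names used above:
gsutil cp /tmp/orders.csv gs://tokudb-export-bucket/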
Step 4: Loading Data from GCS to BigQuery
bq load \
 --source_format=CSV --field_delimiter="," --null_marker="\N" --quote='"' \
 [DATASET_NAME].[TABLE_NAME] \
 gs://[SOURCE_BUCKET_NAME]/[FILE_NAME].csv \
 ./[SCHEMA_NAME].json
- bq load: Loads data into a BigQuery table.
- --source_format=CSV: Specifies that the source data format is CSV.
- --field_delimiter=",": Sets the field delimiter to a comma, matching the export in Step 1.
- --null_marker="\N": Treats "\N", the marker MySQL conventionally writes for NULL values, as NULL in the loaded data.
- --quote='"': Tells BigQuery that fields are optionally enclosed in double quotes, matching the export in Step 1.
- [DATASET_NAME].[TABLE_NAME]: Specifies the destination dataset and table in BigQuery where the data will be loaded.
- gs://[SOURCE_BUCKET_NAME]/[FILE_NAME].csv: Indicates the location of the CSV file in Google Cloud Storage.
- ./[SCHEMA_NAME].json: Points to a JSON file that defines the schema for the BigQuery table.
The following command loads data from a GCS bucket into a table in your dataset. You must provide a schema defined in a local schema file.
Example:
bq load \
 --source_format=CSV --field_delimiter="," --null_marker="\N" --quote='"' \
 mydataset.san_francisco_film_locations \
 gs://mygcsbucket/data.csv \
 ./schema.json
- bq load: Loads data into a BigQuery table.
- --source_format=CSV: Indicates the source format is CSV.
- --field_delimiter=",": Sets the field delimiter to a comma, matching the export format.
- --null_marker="\N": Specifies "\N" as the marker for null values.
- --quote='"': Specifies that fields are optionally enclosed in double quotes.
- mydataset.san_francisco_film_locations: The destination dataset and table; the destination must be a dataset in your own project (mydataset here is a placeholder).
- gs://mygcsbucket/data.csv: Location of the CSV file in Google Cloud Storage.
- ./schema.json: Path to the JSON file that defines the schema for the BigQuery table.
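The schema file referenced above is a JSON array describing each column’s name, type, and mode. A minimal sketch of such a file for the hypothetical orders table used earlier in this guide (the column names and types are assumptions):
cat > schema.json <<'EOF'
[
  {"name": "id",          "type": "INT64",    "mode": "REQUIRED"},
  {"name": "customer_id", "type": "INT64",    "mode": "REQUIRED"},
  {"name": "amount",      "type": "NUMERIC",  "mode": "NULLABLE"},
  {"name": "created_at",  "type": "DATETIME", "mode": "NULLABLE"}
]
EOF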
Limitations of using Custom Scripts to Connect TokuDB to BigQuery
- BigQuery performs best when your data is denormalized using nested and repeated fields. CSV files do not support nested or repeated data; if you want to load nested data, use one of the JSON (newline-delimited), ORC, Avro, or Parquet formats instead.
- Compressed CSV files are quicker to transfer and take up less space, but BigQuery can load uncompressed files faster. In either case, you cannot include both compressed and uncompressed files in the same load job.
- If you use gzip compression, beware that the ETL process will be slower compared to loading uncompressed data since BigQuery cannot read the data in parallel.
- When you load CSV data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
- When you load CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date segment of the timestamp, and the date should be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) segment of the timestamp must use a colon (:) separator.
- If any field in your TokuDB database contains an unescaped delimiter, quote, or newline character, it can break your columns, so the export must quote and escape values correctly.
- The above approach does not let you load data in real time. You will need to write additional code and schedule it, for example with a cron job (see the sketch after this list).
- If you need to clean, transform, and enrich the data before loading it into the warehouse, you will also have to write custom code for that, which adds further overhead.
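As a rough illustration of the scheduling point above, Steps 1 through 4 could be wrapped in a script and run periodically with cron. This is a minimal sketch, not a production setup: all paths, names, and the schedule are hypothetical, it assumes the script runs on the TokuDB host itself so that INTO OUTFILE and the upload see the same local file, and it still produces periodic batch loads rather than true real-time replication.
#!/bin/bash
# tokudb_to_bq.sh: hypothetical wrapper around Steps 1-4 of this guide
set -euo pipefail

# INTO OUTFILE refuses to overwrite an existing file, so remove any previous export
rm -f /tmp/orders.csv
mysql --user=export_user --password=[PASSWORD] tokudb_app \
 -e "SELECT * FROM orders INTO OUTFILE '/tmp/orders.csv' CHARACTER SET 'utf8mb4'
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"' "

# Upload the fresh dump to Cloud Storage
gsutil cp /tmp/orders.csv gs://tokudb-export-bucket/

# --replace overwrites the destination table with the new full dump
bq load --source_format=CSV --field_delimiter="," --null_marker="\N" --quote='"' \
 --replace mydataset.orders gs://tokudb-export-bucket/orders.csv ./schema.json

# Example crontab entry to run the dump hourly:
# 0 * * * * /usr/local/bin/tokudb_to_bq.sh >> /var/log/tokudb_to_bq.log 2>&1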
It is possible to build your own TokuDB to BigQuery ETL solution, and it can make sense to do so, especially when you are starting with small data volumes and a full dump-and-load approach. However, if your solution requires high throughput, high availability, low latency, or frequent schema changes, then you’re probably better off using a third-party solution like Hevo Data.
As a rule of thumb, once you have tables that are bigger than 1GB or once you have more than a million daily updates to your tables, third-party solutions start to become the more feasible alternative. For instance, when dealing with terabytes of data flowing through your system daily, Hevo’s automated data pipeline ensures that you never lose an event and all errors are caught to be re-streamed, saving valuable engineering time.
Why Consider Hevo for Data Transfer:
- Code-free data loading: You need not invest any engineering resources to write and maintain any ETL scripts or cron jobs. There is no need for constant monitoring and maintenance of the data transfer process.
- Set up in minutes: Since Hevo is fully managed, the setup time is minimal. Data is moved from TokuDB to BigQuery in minutes.
- Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
- Zero data loss: Hevo reliably delivers your data from TokuDB to BigQuery, ensuring that the data is accurate and that there is no loss of data.
- Automatic schema mapping: Hevo automatically detects the schema and any changes to it, and maps the data accordingly. It will detect any change in the incoming TokuDB schema and make the necessary changes in BigQuery, without any human intervention.
You can use Hevo to move data from 150+ different sources (including 60+ free sources): Cloud Applications, Databases, SDKs, and more. This will ensure that all your data integration needs are taken care of as both your business and your data sources grow.
Learn More About:
Methods to Connect Asana and BigQuery
Conclusion
This blog covered two approaches to setting up a connection from TokuDB to BigQuery: writing custom ETL scripts and using an automated platform like Hevo. It also gave a brief overview of TokuDB and BigQuery, highlighting their key features and benefits, before diving into the setup process.
Want to take Hevo for a spin? Try Hevo’s 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable Hevo Pricing that will help you choose the right plan for your business needs!
What was your experience of migrating data from TokuDB to BigQuery? Tell us in the comments below!
FAQs
1. How to load data from Hive to BigQuery?
You can use Hevo to easily load data from Hive to BigQuery. Hevo’s no-code platform lets you set up data pipelines that automatically extract, transform, and load data from Hive to BigQuery in real-time without any manual effort.
2. How do I transfer data to BigQuery?
Data can be transferred to BigQuery using Google’s built-in tools like Google Cloud Data Transfer Service or through platforms like Hevo, which support multiple data sources. Hevo simplifies the process by automating the data flow to BigQuery without coding.
3. How to convert Oracle query to BigQuery?
To convert an Oracle query to BigQuery, you’ll need to modify the SQL syntax to match BigQuery’s standards. Using a tool like Hevo can help, as it transforms the data automatically, ensuring it fits BigQuery’s requirements during the transfer.
Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.