For analytical or archival purposes, you may need to move data from your MariaDB transactional database to Google BigQuery. This post covers the detailed steps to move data from MariaDB to Google BigQuery for building an analytical data warehouse. Before we go through the steps to load data, let us briefly look at the strengths of Google BigQuery and MariaDB.
Understanding Google BigQuery
Google BigQuery is a serverless, NoOps, highly scalable enterprise data warehouse that comes with a built-in query engine for fast SQL analytics. It provides persistent storage for structured and semi-structured data over a network with petabit-scale bisection bandwidth. BigQuery can run SQL queries on terabytes of data in seconds.
BigQuery is a managed service and the ease of use that comes with serverless, ad hoc querying has introduced a modern way of working with big data. Since BigQuery is serverless, you can take advantage of this and use a small agile team of engineers to work on data analytics. You can read more about Google BigQuery here.
Understanding MariaDB
MariaDB is a relational online transaction processing (OLTP) database. Its major advantage is that it supports querying using Structured Query Language (SQL). MariaDB uses locking to ensure data consistency, which means you can query MariaDB even while data is concurrently being written to it. It is great for queries that require aggregation, filtering, grouping, and sorting within tables. However, it is not optimized for analytical queries that scan entire large datasets.
Why Move Data into Google BigQuery?
1. Powerful Analytics
There are many types of analytics workloads that you can run in a data warehouse. Since BigQuery is an SQL engine, you can use Business Intelligence tools like Google Data Studio, Looker, and Tableau to create descriptive data visualizations and reports.
2. Machine Learning Capabilities
BigQuery goes beyond conventional data warehousing in that you can use it to create machine learning models to carry out batch predictions without having to export data out of BigQuery.
3. Simplified Workflows
BigQuery by design is meant to encourage customers to focus on gathering insights as opposed to managing infrastructure. This approach allows teams to innovate faster with fewer resources. DBAs are not needed to provision and maintain servers, which enables you to work with a lean team of data analysts.
4. Scale-out Architecture
From an architectural point of view, the only limit on speed and scale in BigQuery is the amount of hardware available in the zone. Queries are automatically scaled out to thousands of machines and executed in parallel. This is the same infrastructure that powers other Google products like Google Ads, YouTube, Gmail, G Suite, Google Search, etc.
5. Rich Product Ecosystem
BigQuery is part of the GCP ecosystem and it integrates tightly with Cloud AI, Cloud DLP, AutoML, Data Studio, Cloud Scheduler, Cloud Functions, Cloud Composer, etc.
Moving Data from MariaDB to BigQuery
- Build an ETL pipeline in Apache Beam or Apache Spark.
- Upload MariaDB backup files to Google Cloud Storage in raw format, then load the data into BigQuery and transform it on the fly using custom ETL scripts.
- Run federated queries against CSV files uploaded as-is on Google Cloud Storage.
- Use Hevo, an automated data pipeline solution that can move data to BigQuery without any code.
This post covers the custom ETL script approach in great detail. It also highlights the limitations of this approach and discusses ways to overcome them.
MariaDB to BigQuery — Steps to load data using custom ETL scripts
1. Export your MariaDB data into CSV format before moving it to GCP. Note that SELECT ... INTO OUTFILE writes the file to the database server's filesystem, so run the export somewhere you can retrieve the file from.
mysql --host=[INSTANCE_IP] --user=[USER_NAME] --password [DATABASE] \
  -e "SELECT * FROM mydataset.mytable
      INTO OUTFILE 'myfile.csv'
      CHARACTER SET 'utf8mb4'
      FIELDS TERMINATED BY ','
      OPTIONALLY ENCLOSED BY '\"'
      ESCAPED BY '\"'"
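The FIELDS TERMINATED BY / OPTIONALLY ENCLOSED BY / ESCAPED BY settings above produce comma-delimited CSV in which embedded double quotes are escaped by doubling them, which is exactly what BigQuery's default CSV parser expects. The following stdlib-only Python sketch reproduces the same convention (the sample rows are purely illustrative):

```python
import csv
import io

# Sample rows (illustrative only) containing an embedded quote and a comma,
# the two cases that force quoting in CSV output.
rows = [
    ["1", 'He said "hi"', "plain"],
    ["2", "comma, inside", "ok"],
]

buf = io.StringIO()
# delimiter=',' + doublequote=True matches the mysql export settings:
# fields separated by commas, quoted when needed, quotes escaped as "".
writer = csv.writer(buf, delimiter=",", quotechar='"',
                    doublequote=True, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
print(buf.getvalue())  # embedded quote comes out as ""hi""
```

If a file exported with different escaping (for example, backslash escapes) is handed to BigQuery without adjusting the load flags, rows with quotes or delimiters inside fields will fail to parse.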
2. Create a Google Cloud Storage (GCS) bucket for the export:
gsutil mb gs://bucket
3. Stage the data on Cloud Storage using gsutil:
gsutil -m cp /some/dir/*.csv gs://bucket/some/dir
4. Invoke bq load (the load command of the bq command-line tool):
bq load \
  --source_format=CSV \
  --field_delimiter="," \
  --null_marker="\N" \
  mydataset.mytable \
  "gs://bucket/some/dir/*.csv" \
  ./myschema.json
This command loads the staged CSV files from the GCS bucket into a table in your dataset. You must provide a schema, defined here in a local schema file (myschema.json). Note that the field delimiter must match the export: the mysql command above produced comma-delimited, double-quoted CSV.
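The schema file is a JSON array of column definitions. A hypothetical myschema.json for a table with an integer key, a string column, and a timestamp might look like the following (the column names are illustrative, not from the original table):

```json
[
  {"name": "id",         "type": "INTEGER",   "mode": "REQUIRED"},
  {"name": "name",       "type": "STRING",    "mode": "NULLABLE"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
```

The order of entries must match the column order in the CSV files.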
NOTE: If this is not a one-off process, it is best to deploy a Cloud Function that automatically invokes bq load whenever a new file shows up on Cloud Storage.
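A minimal sketch of such a Cloud Function, assuming the google-cloud-bigquery client library and a (hypothetical) object-naming convention of some/dir/<table>-<nnn>.csv. The BigQuery import is deferred into the handler so that the table-naming helper can be tested on its own:

```python
import os

DATASET = "mydataset"  # assumed target dataset, adjust to your project


def table_from_object(object_name):
    """Derive 'dataset.table' from a GCS object name.

    Assumes files are named like 'some/dir/mytable-000.csv'
    (hypothetical convention, not from the original post).
    """
    base = os.path.basename(object_name)                  # mytable-000.csv
    stem = os.path.splitext(base)[0]                      # mytable-000
    table = stem.rsplit("-", 1)[0]                        # mytable
    return f"{DATASET}.{table}"


def load_on_finalize(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize."""
    # Deferred import: available in the Cloud Functions runtime.
    from google.cloud import bigquery

    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        null_marker="\\N",
    )
    # Blocks until the load job finishes, surfacing any load errors.
    client.load_table_from_uri(
        uri, table_from_object(event["name"]), job_config=job_config
    ).result()
```

In practice you would also pass the schema (or enable autodetect) in LoadJobConfig, and decide between WRITE_APPEND and WRITE_TRUNCATE depending on whether files are incremental or full dumps.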
MariaDB to BigQuery: Limitations of Loading Data Using Custom Code
- During this process, your MariaDB database may continue to receive writes, so any records written after the export will be missing from the destination and would need to be synced separately. The above approach does not work for use cases that require real-time replication. If you are not looking for a real-time transfer, this is not an issue.
- BigQuery is not designed for frequent updates to existing rows. If the use case requires streaming changes from MariaDB, implementing Change Data Capture (CDC) on top of this method can be cumbersome.
- This method is only suitable for small datasets. As data size increases, so does the likelihood of a load failing partway through.
- There is a maximum row and cell size limit of 100MB.
- You can’t have more than 10,000 columns per table.
- The maximum CSV file size that can be loaded into BigQuery is 5 TB uncompressed; compressed CSV files are limited to 4 GB.
- The maximum size per load job is 15 TB across all input files for CSV.
- BigQuery does not guarantee data consistency when using this method to ETL data to BigQuery. If the underlying data is updated while a query is running, the behavior is unpredictable.
- In case you need to clean, transform and enrich the data in MariaDB before loading it to BigQuery, then you would need to write additional code to accomplish this.
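As a simple illustration of the kind of pre-load transformation that requires extra code, the stdlib-only sketch below normalizes an exported CSV before upload: it trims whitespace and rewrites empty fields as \N so that the --null_marker="\N" flag used earlier treats them as NULLs (the cleaning rules are hypothetical examples, not from the original post):

```python
import csv


def clean_row(row):
    """Trim whitespace and mark empty fields as \\N (matching --null_marker)."""
    return [(value.strip() or "\\N") for value in row]


def clean_csv(src_path, dst_path):
    """Read src_path, apply clean_row to every record, write dst_path."""
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(clean_row(row))
```

Each additional rule (type coercion, date reformatting, enrichment from another source) means more custom code to write, test, and maintain, which is the limitation this bullet describes.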
Easier Way to Move Data from MariaDB to BigQuery
Using Hevo, you can build an automated data pipeline to move data from MariaDB to BigQuery in real-time, without writing any code. Hevo can load data into BigQuery in just 3 simple steps.
- Connect to your MariaDB database.
- Select the replication mode: (i) Full Table dump (ii) Load data using custom query (iii) Replicate Data using Binlog
- Configure the Google BigQuery data warehouse
Hevo will now move your MariaDB data to Google BigQuery in a consistent, secure and reliable fashion.
Hevo ingests data from many other databases (including MySQL, SQL Server, Oracle, PostgreSQL, Amazon Aurora, and more) via log-based Change Data Capture (CDC), as well as from log files, messaging systems, SaaS applications and more. You can look at the complete list of data sources here.
The complete suite of Hevo’s features can be found here. Sign up for a 14-day free trial to experience Hevo’s simplicity and robustness first-hand.