You have been running your transaction workload on MySQL for years. Occasional simple analytical queries and reporting using materialized views or ad hoc SQL in MySQL was sufficient. However, things are changing as we are becoming more digital. The volume of data to be analyzed is increasing day by day, new data sources like Clickstreams, Page Views, etc. are growing and the results have to be generated instantly for deeper analysis. This is where an analytical solution like BigQuery is coming into play.
This blog will introduce BigQuery and MySQL and explain their key features. Moreover, it will provide 2 easy methods to connect MySQL to BigQuery in real time. The first method involves writing custom ETL Scripts to perform this data transfer from MySQL to BigQuery while the second method uses Hevo’s automated Data Pipeline to set up this connection. Read along and decide which method suits you the best!
Table of Contents
- Introduction to MySQL
- Introduction to BigQuery
- Methods to Connect MySQL to BigQuery
- Working knowledge of Databases and Data Warehouses.
- A MySQL account.
- A BigQuery account.
- Clear idea regarding what data is to be transferred.
- Working Knowledge of SQL.
Introduction to MySQL
MySQL is an open-source Relational Database Management System, or RDMS, which uses Structured Query Language or SQL for management. MySQL was originally developed and owned by the Swedish company MySQL AB, but MySQL AB was acquired by Sun Microsystems in 2008. In return, Oracle acquired Sun Microsystems two years later, making them the current owners of MySQL.
Key Features of MySQL
The following features make MySQL such a popular RDBMS in the market:
- It provides high performance and usability for multiple users anytime, anywhere.
- Provides strong transaction support, is compatible with Web development, and can be easily integrated into the Data Warehouse.
- For all types of data and users, it has a high degree of scalability and flexibility.
- It is compatible with multiple server Operating Systems such as FreeBSD, Linux, Solaris, OS X, and Windows.
To learn more about MySQL, visit here.
Introduction to BigQuery
BigQuery is an enterprise Data Warehouse solution from Google Cloud Platform (GCP). BigQuery is best used for complex analytical queries. It is very helpful for Relational Databases to carry such a heavy query load. The characteristic of this Data Warehouse is that it links to other Google services, such as Spreadsheets, Google Drive, etc. This makes BigQuery a very attractive option in the market. It is also free and available everywhere in Google Cloud.
Key Features of BigQuery
BigQuery is unique among other Data Warehouse solutions in various aspects:
- Serverless: No need to worry about cluster sizing or scaling. BigQuery will automatically scale up and down based on data volume and query complexity.
- Suitable Payment Structure: Storage cost is similar to other Cloud Storages ($0.020 per GB per Month) and Query Cost is $5/TB. The latest pricing details are available here.
- Streaming Support: BigQuery supports inserting a stream of data into the table without affecting query performance.
- Secure: All the data is encrypted by default. Security is not optional.
- Support for nested data: BigQuery natively supports analyzing semi-structured data like JSON. No need to convert JSON to relational.
- ML Support: GCP is adding ML modules on top of BigQuery. ML models can be created and trained using SQL!
To know more about BigQuery, visit here.
Methods to Connect MySQL to BigQuery
This method involves building custom ETL Scripts to replicate data from MySQL to BigQuery. It contains 2 types of replication: Full Dump and Incremental. The Full Dump method is required if the data is to be transferred for the first time. For further data transfer processes, the Incremental approach is sufficient.
Hevo Data provides a hassle-free solution and helps you directly transfer data from MySQL to BigQuery and numerous other Databases/Data Warehouses or destinations of your choice without any intervention in an effortless manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Hevo’s pre-built integration with MySQL along with 100+ other data sources (including 30+ free data sources) will take full charge of the data transfer process, allowing you to focus on key business activities.Get Started with Hevo for Free
Methods to Connect MySQL to BigQuery
Following are the 2 methods using which you can set up your MySQL to BigQuery integration:
- Method 1: Manual ETL Process to Connect MySQL to BigQuery
- Method 2: Using Hevo Data to Connect MySQL to BigQuery
Method 1: Manual ETL Process to Connect MySQL to BigQuery
The manual method of connecting MySQL to BigQuery involves writing custom ETL scripts to set up this data transfer process. This method can be implemented in 2 different forms:
1. Full Dump and Load
This is a relatively simple approach, where complete data from the source MySQL table is extracted and migrated to BigQuery. If the target table already exists, drop it and create a new table ( Or delete complete data and insert newly extracted data).
Full Dump and Load is the only option for the first-time load even if the incremental load approach is used for recurring loads. For relatively small tables, the full load approach can be followed even for further recurring loads. You can also check out MySQL to Redshift integration.
The high-level steps to be followed to replicate MySQL to BigQuery are:
- Step 1: Extract Data from MySQL
- Step 2: Clean and Transform the Data
- Step 3: Upload to Google Cloud Storage(GCS)
- Step 4: Upload to the BigQuery Table from GCS
Let’s take a detailed look at each step.
Step 1: Extract Data from MySQL
There are 2 popular ways to extract data from MySQL – using mysqldump and using SQL query.
- Extract data using mysqldump
Mysqldump is a client utility coming with Mysql installation. It is mainly used to create a logical backup of a database or table. Here, is how it can be used to extract one table:
mysqldump -u <db_username> -h <db_host> -p db_name table_name > table_name.sql
Here output file table_name.sql will be in the form of insert statements like
INSERT INTO table_name (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);
This output has to be converted into a CSV file. You have to write a small script to perform this. Here is a well-accepted python library doing the same – mysqldump_to_csv.py
Alternatively, you can directly create a CSV file using the command mentioned below. However, this option works only when mysqldump is run on the same machine as the mysqld server which is not the case normally.
mysqldump -u [username] -p -t -T/path/to/directory [database] --fields-terminated-by=,
- Extract Data using SQL query
MySQL client utility can be used to run SQL commands and redirect output to file.
mysql -B -u user database_name -h mysql_host -e "select * from table_name;" > table_name_data_raw.txt
Further, it can be piped with text editing utilities like sed or awk to clean and format data.
mysql -B -u user database_name -h mysql_host -e "select * from table_name;" | sed "s/'/'/;s/t/","/g;s/^/"/;s/$/"/;s/n//g" > table_name_data.csv
Step 2: Clean and Transform the Data
Apart from transforming data for business logic, there are some basic things to keep in mind:
- BigQuery expects CSV data to be UTF-8 encoded.
- Primary Key and unique key constraints are not enforced by BigQuery. ETL process has to take care of that.
- Column types are slightly different. Most of the types have either equivalent or convertible types. Here is a list of common data types.
|MySQL Data Type||BigQuery Data Type|
|ENUM||No type for ENUM.Must use any type which can represent values in ENUM|
|SET||No type for SET.Must use any type which can represent values in SET|
Fortunately, the default date format in MySQL is the same, YYYY-MM-DD. Hence, while taking mysqldump there is no need to do any specific changes for this. If you are using a string field to store date and want to convert to date while moving to BigQuery you can useSTR_TO_DATE function.
DATE value must be dash(-) separated and in the form YYYY-MM-DD (year-month-day). You can visit their official page to know more about BigQuery data types.
Syntax : STR_TO_DATE(str,format)
Example : SELECT STR_TO_DATE(‘31,12,1999′,’%d,%m,%Y’);
Result : 1999-12-31
- The hh:mm: ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.
- Make sure text columns are quoted if it can potentially have delimiter characters.
Step 3: Upload to Google Cloud Storage(GCS)
Gsutil is a command-line tool for manipulating objects in GCS. It can be used to upload files from different locations to your GCS bucket.
To copy a file to GCS:
gsutil cp table_name_data.csv gs://my-bucket/path/to/folder/
To copy an entire folder:
gsutil cp -r dir gs://my-bucket/path/to/parent/
If the files are present in S3, the same command can be used to transfer to GCS.
gsutil cp -R s3://bucketname/source/path gs://bucketname/destination/path
Storage Transfer Service
Storage Transfer Service from Google cloud is another option to upload files to GCS from S3 or other online data sources like HTTP/HTTPS location. Destination or sink is always a Cloud Storage bucket. It can also be used to transfer data from one GCS bucket to another.
This service is extremely handy when comes to data movement to GCS with support for:
- Schedule one-time or recurring data transfer.
- Delete existing objects in the destination if no corresponding source object is present.
- Deletion of source object after transferring.
- Periodic synchronization between source and sink with advanced filters based on file creation dates, file name, etc.
Upload from Web Console
If you are uploading from your local machine, web console UI can also be used to upload files to GCS. Here are the steps to upload a file to GCS with screenshots.
- Login to your GCP account. In the left bar, you can click on Storage and go to Browser.
- Select the GCS bucket you want to upload the file. Here the bucket we are using is test-data-hevo. Click on the bucket.
- On the bucket details page shown below, click on the Upload files button and select file from your system.
- Wait till the upload is completed. Now, the uploaded file will be listed in the bucket:
Step 4: Upload to the BigQuery Table from GCS
You can use the bq command to interact with BigQuery. It is extremely convenient to upload data to the table from GCS.Use the bq load command, and specify CSV as the source_format.
The general syntax of bq load:
bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]
[LOCATION] is your location. This is optional.
[FORMAT] is CSV.
[DATASET] is an existing dataset.
[TABLE] is the name of the table into which you’re loading data.
[PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI.
[SCHEMA] is a valid schema. The schema can be a local JSON file or inline.– autodetect flag also can be used instead of supplying a schema definition.
There are a bunch of options specific to CSV data load :
|CSV option||Description||bq command flag|
|Field delimiter||Field delimiter: Comma, Tab, Pipe, Other||-F or –field_delimiter|
|Header rows||Header rows to skip||–skip_leading_rows|
|Number of bad records allowed||Number of errors allowed||–max_bad_records|
|Newline characters||Allow quoted newlines||–allow_quoted_newlines|
|Custom null values||Specifies a string that represents a null value in a CSV file||–null_marker|
|Trailing optional columns||Allow jagged rows||–allow_jagged_rows|
|Unknown values||Ignore unknown values||–ignore_unknown_values|
|Quote||The value that is used to quote data sections in a CSV file||–quote|
|Encoding||The character encoding of the data||-E or –encoding|
To see full list options visit Bigquery documentation on loading data cloud storage CSV, visit here.
Following are some example commands to load data:
- Specify schema using a JSON file:
bq --location=US load --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
- If you want schema auto-detected from the file:
bq --location=US load --autodetect --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv
If you are writing to the existing table, BigQuery provides three options – Write if empty, Append to the table, Overwrite table. Also, it is possible to add new fields to the table while uploading data. Let us see each with an example.
- To overwrite the existing table:
bq --location=US load --autodetect --replace --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv
- To append to an existing table:
bq --location=US load --autodetect --noreplace --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
- To add a new field to the table. Here new schema file with an extra field is given :
bq --location=asia-northeast1 load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
2. Incremental Dump and Load
In certain use cases, loading data once from MySQL to BigQuery will not be enough. There might be use cases where once initial data is extracted from the source, we need to keep the target table in sync with the source. For a small table doing a full data dump every time might be feasible but if the volume data is higher, we should think of a delta approach.
The following steps are used in the Incremental approach to connect MySQL to Bigquery:
Step 1: Extract Data from MySQL
For incremental data extraction from MySQL use SQL with proper predicates and write output to file. mysqldump cannot be used here as it always extracts full data.
Eg: Extracting rows based on the updated_timestamp column and converting to CSV.
mysql -B -u user database_name -h mysql_host -e "select * from table_name where updated_timestamp < now() and updated_timestamp >'#max_updated_ts_in_last_run#'"| sed "s/'/'/;s/t/","/g;s/^/"/;s/$/"/;s/n//g" > table_name_data.csv
Note: In case of any hard delete happened in the source table, it will not be reflected in the target table.
Step 2: Update Target Table in BigQuery
To upsert newly extracted data to the BigQuery table, first, upload the data into a staging table. This will be a full load. Please refer full data load section above. Let’s call it delta_table. Now there are two approaches to load data to the final table :
- Update the values existing records in the final table and insert new rows from the delta table which are not in the final table.
UPDATE data_set.final_table t SET t.value = s.value FROM data_set.delta_table s WHERE t.id = s.id;
INSERT data_set.final_table (id, value) SELECT id, value FROM data_set.delta_table WHERE NOT id IN (SELECT id FROM data_set.final_table);
2. Delete rows from the final table which are present in the delta table. Then insert all rows from the delta table to the final table.
DELETE data_set.final_table f WHERE f.id IN (SELECT id from data_set.delta_table);
INSERT data_set.final_table (id, value) SELECT id, value FROM data_set.delta_table;
Method 2: Using Hevo Data to Connect MySQL to BigQuery
Hevo Data, a No-code Data Pipeline, helps you directly transfer data from MySQL and 100+ other data sources to Data Warehouses like BigQuery, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.Sign up here for a 14-Day Free Trial!
With a ready-to-use Data Integration Platform, Hevo, you can easily move data from MySQL to BigQuery with just 3 simple steps. This does not need you to write any code and will provide you with an error-free, fully managed setup to move data in minutes.
- Connect and configure your MySQL database.
- Select the replication mode: (a) load selected MySQL tables (b) load data via Custom Query (c) load data through Binlog.
- Customize your BigQuery accounts to the destination as shown in the image below. For each table in MySQL choose a table name in BigQuery where it should be copied.
It is that simple. While you relax, Hevo will take care of fetching the data and sending it to your destination Warehouse.
More Reasons to Choose Hevo Data
- Faster Implementation: A very quick 2-stage process to get your pipeline setup. After that, everything’s automated while you watch data sync to Snowflake or any other destination in real-time.
- Fully Managed Service & Live Support: Every aspect of the data transfer process is fully managed by Hevo. Hevo also provides a dedicated team of talented support engineers to guide you through any issues that arise.
- Real-time Alerts & Notifications: With Hevo, you are always in the know about your data transfer pipelines. Receive real-time multiple-format notifications across your various devices.
- 100% Complete & Accurate Data Transfer: The entire ETL spectrum is covered by Hevo. Quality rules and expert engineering also ensure that data is always 100% transmitted so there’s no risk of data corruption or missing data.
- Scalability Assured: Hevo simply scales to meet your needs today and tomorrow thanks to its world-class infrastructure and platform.
- A wide array of Data Source Integrations: Hevo can bring data from not only Facebook Ads, but a wide array of data sources ranging from Sales and Marketing tools to databases and more. Explore all the tools Hevo can bring data from here.
In addition to this, Hevo lets you bring data from a wide array of sources – Cloud Apps, Databases, SDKs, and more. You can check out the complete list of available integrations.
The article discussed MySQL and Bigquery and explained their key features. Furthermore, it listed down 2 methods to set up your MySQL to BigQuery integration. The first method requires you to write custom scripts to perform ETL processes while the second method relies on Hevo’s automated Data Pipeline to transfer data from MySQL to BigQuery.
Complex analytics on data requires moving data to Data Warehouses like BigQuery. It takes multiple steps to extract data, clean it and upload it. It requires real effort to ensure there is no data loss at each stage of the process, whether it happens due to data anomalies or type mismatches.Visit our Website to Explore Hevo
To avoid all this trouble, use Hevo Data that offers a faster way to move data from 100+ data sources such as SaaS applications or Databases such as MySQL into your Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your understanding of connecting MySQL to BigQuery in the comments section below!