You have been running your transaction workload on MySQL for years. Occasional simple analytical queries and reporting using materialized views or ad hoc SQL in MySQL were sufficient. However, things are changing as businesses become more digital: the volume of data to be analyzed grows day by day, new data sources such as clickstreams and page views keep appearing, and results have to be generated instantly for deeper analysis. This is where an analytical solution like BigQuery comes into play. This post walks through the steps you need to take to move data from MySQL to BigQuery.
How to move data from MySQL to BigQuery?
There are two methods that can help you replicate data from MySQL to BigQuery:
Method 1: A ready-to-use Data Integration Platform like Hevo (7 Days Free Trial).
Method 2: Build Custom ETL Scripts to replicate data from MySQL to BigQuery.
BigQuery is an enterprise data warehouse solution from Google Cloud Platform (GCP). It stands out from other data warehouse solutions in several respects:
- Serverless – No need to worry about cluster sizing or scaling. BigQuery will automatically scale up and down based on data volume and query complexity.
- Pay for what you consume – Storage cost is similar to other cloud storage offerings ($0.020 per GB per month) and query cost is $5/TB. The latest pricing details are available here.
- Streaming support – BigQuery supports streaming inserts into a table without affecting query performance.
- Secure – All the data is encrypted by default. Security is not optional.
- Support for nested data – BigQuery natively supports analyzing semi-structured data like JSON. There is no need to flatten JSON into a relational schema.
- ML support – GCP is adding ML modules on top of BigQuery. ML models can be created and trained using SQL! Read more about it in BigQuery documentation.
You can visit their official page to know more about BigQuery features.
Moving Data From MySQL to BigQuery
There are two well-accepted ways to move data from MySQL to BigQuery using ETL Scripts.
Full dump and load
This is a relatively simple approach, where the complete data from the source MySQL table is extracted and migrated to BigQuery. If the target table already exists, drop it and create a new table (or delete the complete data and insert the newly extracted data).
Full dump and load is the only option for the first load, even if the incremental load approach is used for recurring loads. For relatively small tables, the full load approach can also be used for recurring loads.
As shown in the diagram, high-level steps to be followed to replicate MySQL to BigQuery are:
- Extract data from MySQL
- Clean and Transform
- Upload to Google Cloud Storage (GCS)
- Upload to the BigQuery table from GCS using bq tool or Console or any cloud SDK.
Let’s take a detailed look at each step.
a. Extract data from MySQL
There are two popular ways to extract data from MySQL – using mysqldump and using SQL query.
Extract data using mysqldump
mysqldump is a client utility that ships with the MySQL installation. It is mainly used to create a logical backup of a database or table. Here is how it can be used to extract one table:
mysqldump -u <db_username> -h <db_host> -p db_name table_name > table_name.sql
The output file table_name.sql will contain INSERT statements like:
INSERT INTO table_name (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);
This output has to be converted into a CSV file, which requires a small script. A well-known Python script that does exactly this is mysqldump_to_csv.py.
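For very simple dumps, the conversion step can even be sketched with sed alone. This toy version (file names are placeholders) handles only single-row INSERT statements with no embedded commas, quotes, or parentheses in the values; for real data, prefer a proper parser such as mysqldump_to_csv.py.

```shell
# Create a sample dump line to illustrate (a real table_name.sql comes from mysqldump).
printf "INSERT INTO table_name (id, name) VALUES (1,'alice');\n" > table_name.sql

# Strip the INSERT wrapper, keep the value list, then drop the single quotes.
sed -n 's/^INSERT INTO [^(]*([^)]*) VALUES (\(.*\));$/\1/p' table_name.sql \
  | sed "s/'//g" > table_name.csv

cat table_name.csv   # 1,alice
```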
Alternatively, you can create CSV files directly using the command mentioned below. However, this works only when mysqldump is run on the same machine as the mysqld server, which is usually not the case.
mysqldump -u [username] -p -t -T/path/to/directory [database] --fields-terminated-by=,
Extract data using SQL query
The mysql client utility can be used to run SQL commands and redirect the output to a file.
mysql -B -u user database_name -h mysql_host -e "select * from table_name;" > table_name_data_raw.txt
Further, it can be piped with text editing utilities like sed or awk to clean and format data.
mysql -B -u user database_name -h mysql_host -e "select * from table_name;" | sed 's/"/""/g;s/\t/","/g;s/^/"/;s/$/"/' > table_name_data.csv
b. Clean and Transform
Apart from transforming data for business logic, there are some basic things to keep in mind:
- BigQuery expects CSV data to be UTF-8 encoded.
- Primary key and unique key constraints are not enforced by BigQuery. The ETL process has to take care of that.
- Column types are slightly different. Most types have either an equivalent or a convertible type. Here is a list of common data types:
| MySQL Data Type | BigQuery Data Type |
|---|---|
| ENUM | No ENUM type. Use any type that can represent the ENUM values (e.g. STRING) |
| SET | No SET type. Use any type that can represent the SET values (e.g. STRING) |
Fortunately, the default DATE format in MySQL is the same, YYYY-MM-DD, so no specific changes are needed while taking the mysqldump. If you are using a string field to store dates and want to convert it to a date while moving to BigQuery, you can use MySQL's STR_TO_DATE function.
DATE value must be dash(-) separated and in the form YYYY-MM-DD (year-month-day). You can visit their official page to know more about BigQuery data types.
Syntax: STR_TO_DATE(str, format)
Example: SELECT STR_TO_DATE('31,12,1999', '%d,%m,%Y');
Result: 1999-12-31
- The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.
- Make sure text columns are quoted if they can potentially contain delimiter characters.
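The encoding requirement above is worth checking before upload. Here is a minimal sketch of the UTF-8 conversion with iconv (file names are placeholders, and the Latin-1 source encoding is an assumption; check yours with the file utility or your MySQL character-set settings):

```shell
# Simulate a Latin-1 export; a real file would come from the extraction step above.
# \351 is the Latin-1 byte for 'é'.
printf 'id,name\n1,caf\351\n' > table_name_data.csv

# Convert Latin-1 -> UTF-8 so BigQuery reads the CSV correctly.
iconv -f LATIN1 -t UTF-8 table_name_data.csv > table_name_data_utf8.csv
```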
c. Upload to Google Cloud Storage (GCS)
gsutil is a command-line tool for manipulating objects in GCS. It can be used to upload files from different locations to your GCS bucket.
To copy a file to GCS:
gsutil cp table_name_data.csv gs://my-bucket/path/to/folder/
To copy an entire folder:
gsutil cp -r dir gs://my-bucket/path/to/parent/
If the files are present in S3, the same command can be used to transfer to GCS.
gsutil cp -R s3://bucketname/source/path gs://bucketname/destination/path
Storage Transfer Service
Storage Transfer Service from Google Cloud is another option to upload files to GCS from S3 or other online data sources like HTTP/HTTPS locations. The destination or sink is always a Cloud Storage bucket. It can also be used to transfer data from one GCS bucket to another.
This service is extremely handy when it comes to data movement to GCS, with support for:
- Scheduling one-time or recurring data transfers.
- Deleting existing objects in the destination if no corresponding source object is present.
- Deleting source objects after transferring.
- Periodic synchronization between source and sink, with advanced filters based on file creation dates, file names, etc.
Upload from Web Console
If you are uploading from your local machine, the web console UI can also be used to upload files to GCS. Here are the steps to upload a file to GCS:
- Log in to your GCP account. In the left bar, click Storage and go to Browser.
- Select the GCS bucket to which you want to upload the file. Here the bucket we are using is test-data-hevo. Click on the bucket.
- On the bucket details page shown below, click the Upload files button and select the file from your system.
- Wait until the upload is completed. The uploaded file will now be listed in the bucket:
d. Upload to the BigQuery table from GCS
bq is the command-line tool to interact with BigQuery. It is extremely convenient for uploading data to a table from GCS. Use the bq load command, and specify CSV as the source_format.
General syntax of bq load:
bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]
[LOCATION] is your location. This is optional.
[FORMAT] is CSV.
[DATASET] is an existing dataset.
[TABLE] is the name of the table into which you’re loading data.
[PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI.
[SCHEMA] is a valid schema. The schema can be a local JSON file or typed inline. The --autodetect flag can also be used instead of supplying a schema definition.
There are a bunch of options specific to CSV data load:

| CSV option | Description | bq command flag |
|---|---|---|
| Field delimiter | Field delimiter: comma, tab, pipe, other | -F or --field_delimiter |
| Header rows | Header rows to skip | --skip_leading_rows |
| Number of bad records allowed | Number of errors allowed | --max_bad_records |
| Newline characters | Allow quoted newlines | --allow_quoted_newlines |
| Custom null values | Specifies a string that represents a null value in a CSV file | --null_marker |
| Trailing optional columns | Allow jagged rows | --allow_jagged_rows |
| Unknown values | Ignore unknown values | --ignore_unknown_values |
| Quote | The value that is used to quote data sections in a CSV file | --quote |
| Encoding | The character encoding of the data | -E or --encoding |
To see the full list of options, visit the BigQuery documentation on loading CSV data from Cloud Storage. Now, let's look at some example commands to load data:
Specify schema using a JSON file:
bq --location=US load --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
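For reference, a minimal ./myschema.json for such a load might look like the following (the field names here are made up for illustration; use your table's actual columns):

```json
[
  {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "value", "type": "STRING", "mode": "NULLABLE"}
]
```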
If you want schema auto-detected from the file:
bq --location=US load --autodetect --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv
If you are writing to the existing table, BigQuery provides three options – Write if empty, Append to the table, Overwrite table. Also, it is possible to add new fields to the table while uploading data. Let us see each with an example.
To overwrite existing table:
bq --location=US load --autodetect --replace --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv
To append to an existing table:
bq --location=US load --autodetect --noreplace --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
To add a new field to the table (a new schema file with the extra field is supplied):
bq --location=asia-northeast1 load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=CSV mydataset.mytable gs://mybucket/mydata.csv ./myschema.json
Incremental dump and load
In certain use cases, loading data once from MySQL to BigQuery will not be enough. Once the initial data is extracted from the source, we may need to keep the target table in sync with the source. For a small table, doing a full data dump every time might be feasible, but if the data volume is higher, we should consider a delta (incremental) approach.
a. Extract data from MySQL
For incremental data extraction from MySQL, use SQL with proper predicates and write the output to a file. mysqldump cannot be used here, as it always extracts the full data.
E.g.: extracting rows based on the updated_timestamp column and converting to CSV:
mysql -B -u user database_name -h mysql_host -e "select * from table_name where updated_timestamp < now() and updated_timestamp > '#max_updated_ts_in_last_run#'" | sed 's/"/""/g;s/\t/","/g;s/^/"/;s/$/"/' > table_name_data.csv
Note: if any hard deletes happened in the source table, they will not be reflected in the target table.
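One way to manage the #max_updated_ts_in_last_run# placeholder between runs is a small watermark file. A minimal sketch follows (the file name and timestamp format are assumptions; the mysql call is commented out since it needs a live server):

```shell
WATERMARK_FILE=last_updated_ts.txt

# First run: start from the epoch so everything is extracted.
[ -f "$WATERMARK_FILE" ] || echo '1970-01-01 00:00:00' > "$WATERMARK_FILE"
LAST_TS=$(cat "$WATERMARK_FILE")

# Capture this run's cutoff BEFORE extracting, so rows updated mid-run
# are picked up next time rather than missed.
CUTOFF=$(date -u '+%Y-%m-%d %H:%M:%S')
echo "extracting rows where updated_timestamp > '$LAST_TS' and <= '$CUTOFF'"

# mysql -B -u user database_name -h mysql_host \
#   -e "select * from table_name where updated_timestamp > '$LAST_TS' and updated_timestamp <= '$CUTOFF'" > table_name_delta.csv

# Persist the cutoff only after a successful extract.
echo "$CUTOFF" > "$WATERMARK_FILE"
```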
b. Update target table in BigQuery
To upsert newly extracted data into the BigQuery table, first upload the data into a staging table; this will be a full load (refer to the full data load section above). Let's call this staging table delta_table. There are then two approaches to load the data into the final table:
1. Update the values of existing records in the final table, and insert new rows from the delta table which are not present in the final table.
UPDATE data_set.final_table t SET t.value = s.value FROM data_set.delta_table s WHERE t.id = s.id;
INSERT data_set.final_table (id, value) SELECT id, value FROM data_set.delta_table WHERE id NOT IN (SELECT id FROM data_set.final_table);
2. Delete rows from the final table which are present in the delta table. Then insert all rows from delta table to final table.
DELETE data_set.final_table f WHERE f.id IN (SELECT id FROM data_set.delta_table);
INSERT data_set.final_table (id, value) SELECT id, value FROM data_set.delta_table;
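BigQuery's standard SQL also supports MERGE, which combines either of the two approaches above into a single atomic statement. A sketch using the same id and value columns as the examples above:

```sql
MERGE data_set.final_table t
USING data_set.delta_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value);
```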
Easier Way to move data from MySQL to BigQuery
With a ready-to-use Data Integration Platform like Hevo, you can move data from MySQL to BigQuery in just 3 simple steps. You do not need to write any code, and you get an error-free, fully managed setup to move data in minutes.
- Connect and configure your MySQL database.
- Select the replication mode: (a) load selected MySQL tables (b) load data via Custom Query (c) load data through Binlog.
- For each table in MySQL choose a table name in BigQuery where it should be copied.
It is that simple. While you relax, Hevo will take care of fetching the data and sending it to your destination warehouse.
In addition to this, Hevo lets you bring data from a wide array of sources – Cloud Apps, Databases, SDKs, and more. You can check out the complete list of available integrations.
Complex analytics on data requires moving it to warehouses like BigQuery. It takes multiple steps to extract the data, clean it, and upload it, and real effort to ensure there is no data loss at any stage of the process, whether due to data anomalies or type mismatches.