Are you looking to connect Cloud SQL to BigQuery? We have you covered. Moving data to a data warehouse like BigQuery helps you store large amounts of data for better analysis.
Table of Contents
- What is Cloud SQL?
- How to Load Data from Cloud SQL to BigQuery?
What is Cloud SQL?
Cloud SQL is a fully managed database service that lets you create, maintain, and manage relational databases on the Google Cloud Platform (GCP). It supports MySQL, PostgreSQL, and SQL Server, hosted on GCP.
You can use Cloud SQL with App Engine applications written in many languages, including Java, Python, PHP, Node.js, Go, and Ruby. It's popular because of its support for multiple database engines and programming languages.
How to Load Data from Cloud SQL to BigQuery?
There are primarily four strategies to load data from Cloud SQL into BigQuery:
1. Write Code to Directly Transfer from Source Database to BigQuery
Let's take Cloud SQL for MySQL as an example.
Your MySQL instance will be adding data continuously, so you will need to schedule your BigQuery transfers at a pre-decided interval. You will also need a marker (a flag or bookmark) that always points at the last record inserted by the previous iteration of the load job.
Each new iteration then begins from the record after the marker. The marker can be a timestamp stored with every record (telling you up to which record insertion has already been done) or a monotonically increasing unique key.
With every new iteration/load job, you start from the marker and transfer all records up to the latest one into BigQuery. Alternatively, you can first export your source data to CSV and let BigQuery import the CSVs.
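The marker-based iteration above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `orders` table with an `updated_at` timestamp column as the marker; the query it builds would be run against Cloud SQL and its results loaded into BigQuery by your scheduler:

```python
def build_incremental_query(table, marker_column, last_marker):
    """Build the MySQL query for one load iteration: everything
    inserted after the marker recorded by the previous run."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {marker_column} > '{last_marker}' "
        f"ORDER BY {marker_column}"
    )

# Marker persisted by the previous iteration (hypothetical value).
last_marker = "2024-01-01 00:00:00"
query = build_incremental_query("orders", "updated_at", last_marker)
# The rows returned by this query would then be CSV-exported or
# streamed into BigQuery, and the marker advanced to the largest
# updated_at value seen in this batch.
```

After each successful load, persist the new marker somewhere durable (a metadata table or GCS object), so a failed iteration can safely be retried from the same point.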
2. Using Federated Queries
BigQuery supports Cloud SQL federated queries which let you query your Cloud SQL database directly from BigQuery.
BigQuery will run this federated SQL on its own, and pick up records from your Cloud SQL database.
Using the SQL function EXTERNAL_QUERY(), BigQuery can query data residing in Cloud SQL in real time, without copying or moving data.
Its syntax is as follows:
SELECT * FROM EXTERNAL_QUERY(Your_Source_Database_Connection_ID, external_database_query);
For example, suppose you have an "employee_salary" table in BigQuery and "employee"/"salary" tables in Cloud SQL for MySQL:
INSERT YourBigQueryDatasetName.employee_salary
SELECT * FROM EXTERNAL_QUERY(
  "CloudSQL_Database_connection_ID",
  "SELECT e.employee_Name, s.last_salary_drawn FROM employee e JOIN salary s ON e.employee_ID = s.employee_ID"
);
EXTERNAL_QUERY executes the query in Cloud SQL and returns results as a temporary table where the source data types are automatically converted to BigQuery data types.
The downside is that you have to write and maintain these queries yourself, updating them as schemas and data volumes on both sides evolve.
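A federated statement like the one above can be built and issued from any BigQuery client. Here is a minimal sketch, assuming hypothetical table and connection names; the resulting string could be passed to `google.cloud.bigquery.Client().query()`:

```python
def build_federated_insert(bq_table, connection_id, inner_query):
    """Wrap a Cloud SQL query in EXTERNAL_QUERY and insert its
    result set into a BigQuery table."""
    # EXTERNAL_QUERY runs inner_query inside Cloud SQL and returns
    # the rows as a temporary table with converted data types.
    escaped = inner_query.replace('"', '\\"')
    return (
        f"INSERT {bq_table} "
        f'SELECT * FROM EXTERNAL_QUERY("{connection_id}", "{escaped}")'
    )

sql = build_federated_insert(
    "mydataset.employee_salary",             # hypothetical dataset.table
    "my-project.us.my-cloudsql-connection",  # hypothetical connection ID
    "SELECT e.employee_Name, s.last_salary_drawn "
    "FROM employee e JOIN salary s ON e.employee_ID = s.employee_ID",
)
# sql can now be executed by the BigQuery client of your choice.
```

Generating the statement in code keeps the connection ID and table names in one place, which helps when the same federated pattern is repeated across many tables.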
3. Cloud SQL to Cloud Pub/Sub to BigQuery
You can split the writing process: pick up data from Cloud SQL and store it temporarily in Cloud Pub/Sub, then have a Dataflow pipeline stream it into BigQuery.
First, create a connector/routine that publishes changes from Cloud SQL to Pub/Sub. Then write a Dataflow pipeline that consumes those changes from Pub/Sub and pushes them to BigQuery.
If you have relational data at source which must be denormalized/aggregated before transfer, this approach works well.
Also, if you have legacy relational tables whose data types differ significantly from BigQuery's and would need a lot of pre-processing, you can consider this option. The upside is that you can reduce overall latency to just a few seconds or minutes.
The obvious downside shows up with transactional/OLTP data, which requires a strong versioning scheme to track changes.
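The publish side of this pipeline boils down to serializing each Cloud SQL change as a Pub/Sub message. The sketch below (field names and the version counter are illustrative assumptions, not a fixed schema) shows one way to encode a change event, including the version number that gives the downstream pipeline a handle for ordering and de-duplicating OLTP changes:

```python
import json

def encode_change_event(table, operation, row, version):
    """Serialize a Cloud SQL change as a Pub/Sub message payload.
    The version number lets the consumer order and de-duplicate
    changes to the same row (the versioning scheme mentioned above)."""
    event = {
        "table": table,
        "op": operation,   # e.g. "INSERT", "UPDATE", "DELETE"
        "version": version,
        "row": row,
    }
    return json.dumps(event).encode("utf-8")

def decode_change_event(payload):
    """Inverse, used by the Dataflow pipeline before writing to BigQuery."""
    return json.loads(payload.decode("utf-8"))

msg = encode_change_event(
    "employee", "UPDATE",
    {"employee_ID": 7, "last_salary_drawn": 90000},
    version=42,
)
event = decode_change_event(msg)
# In the real pipeline, payloads like msg would be published with
# google-cloud-pubsub's PublisherClient.publish(), read in Dataflow
# via beam.io.ReadFromPubSub, and written with beam.io.WriteToBigQuery.
```

Keeping the payload self-describing (table name and operation inside the message) means one topic and one pipeline can serve many source tables.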
Finer details to remember:
- BigQuery imposes a four-gigabyte per-file limit: if your Cloud SQL data is exported in JSON format and stored in a Google Cloud Storage bucket before moving it to BigQuery, you will have to split your file(s) into multiple chunks.
- BigQuery is optimized for append-only workloads: writes are treated as immutable, and while DML updates and deletes exist, they are limited and comparatively expensive. So you must have a clear strategy to avoid duplication and errors if your application demands it.
- BigQuery does not guarantee data consistency for external data sources which are constantly updating.
- If the underlying data in Cloud SQL changes while a federated query/transfer is running, it can result in data inconsistency.
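The file-splitting point above is mechanical enough to sketch. This is a simplified illustration that groups newline-delimited JSON lines into chunks under a byte budget (a tiny budget is used in the example so the behavior is visible); each chunk would become its own object in the GCS bucket before the BigQuery load job runs:

```python
def split_ndjson_lines(lines, max_bytes):
    """Group newline-delimited JSON lines into chunks whose encoded
    size (including newlines) stays within max_bytes."""
    chunks, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_size > max_bytes:
            chunks.append(current)  # budget exceeded: start a new chunk
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append(current)
    return chunks

# 10 records of 10 bytes each (9 chars + newline), 40-byte budget.
records = ['{"id": %d}' % i for i in range(10)]
chunks = split_ndjson_lines(records, max_bytes=40)
# → chunks of 4, 4, and 2 records, each within the 40-byte budget.
```

In practice you would apply the same idea with a budget safely below BigQuery's per-file limit, writing each chunk to its own GCS object.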
4. The Simpler, More Efficient Option: Hevo
Use Hevo. Hevo manages all these complexities for you in an efficient and secure manner. You can transfer your data to BigQuery without writing any code or taking care of nuances that can be time-consuming and annoying.
Configure Cloud SQL as a Source
Configure BigQuery as a Destination
In this blog post, you have learned four methods to move your data from Cloud SQL to BigQuery. For a hassle-free transfer, use Hevo. Hevo Data, a No-Code Data Pipeline, helps you transfer data from a source of your choice in a fully automated and secure manner without writing any code.
Getting data from many sources into destinations can be a time-consuming and resource-intensive task. Instead of spending months developing and maintaining such data integrations, you can enjoy a smooth ride with Hevo Data's 150+ plug-and-play integrations (including 40+ free sources). Visit our website to explore Hevo Data.
Saving countless hours of manual data cleaning and standardization, Hevo Data's pre-load data transformations get it done in minutes via a simple drag-and-drop interface or your custom Python scripts. There's no need to go to your data warehouse for post-load transformations: you can run complex SQL transformations from the comfort of Hevo's interface and get your data in its final, analysis-ready form.