Are you looking to connect Cloud SQL to BigQuery? We have you covered. Moving data to a data warehouse like BigQuery helps you store large volumes of data for better analysis. BigQuery is a powerful, cloud-based data warehouse and analytics platform that provides a cost-effective way to store, process, and analyze data at scale.

With the help of BigQuery, companies can reduce costs and gain insights faster. In this article, we will discuss the advantages of Cloud SQL to BigQuery replication, the process of migration, and useful tips to help you make the move.

What is Cloud SQL?


Cloud SQL is a fully managed database service that lets you create, maintain, and manage your databases on the Google Cloud Platform (GCP). It supports MySQL, PostgreSQL, and SQL Server, all hosted on GCP.

You can use Cloud SQL with App Engine applications written in languages such as Java, Python, PHP, Node.js, Go, and Ruby. It’s popular because of its support for multiple database engines and programming languages.

How to Load Data from Cloud SQL to BigQuery? 

In this section, we will discuss the different methods to import data from Cloud SQL to BigQuery. There are six main strategies to move data from Cloud SQL to BigQuery:

Solve your Cloud SQL to BigQuery replication problems with Hevo’s reliable, no-code, automated pipelines with 150+ connectors.
Get your free trial right away!

1. The Simpler and More Efficient Option: Hevo

Hevo manages all the complexities of this migration for you in an efficient and secure manner. You can transfer your data to BigQuery without writing any code or handling nuances that can be time-consuming and annoying.

Prerequisites: Ensure your Google Cloud MySQL instance is active, MySQL version is 5.5 or higher, and BinLog replication is enabled. Have your database credentials ready.

  • User Setup: Confirm the database user has SELECT and REPLICATION privileges. If needed, create a new user with these privileges (a minimal sketch follows this list).
  • Hevo Setup: Log in to your Hevo account and whitelist Hevo’s IP on your Google Cloud MySQL instance.
  • Configure Pipeline: In Hevo, create a new Pipeline with a unique name, choose the ingestion mode, and select Google Cloud MySQL as the Source. Enter the required connection details and complete the source setup.
  • Test & Proceed: Test the connection in Hevo, then proceed to set up the Destination.
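If you do need to create a fresh user, here is a minimal sketch of how that could look from Python, assuming the mysql-connector-python package and admin access to the instance; the user name, password, and exact privilege list are illustrative placeholders, so follow Hevo’s documentation for the precise grants it expects.

import mysql.connector

conn = mysql.connector.connect(
    host="YOUR_CLOUD_SQL_IP",        # public IP of the instance (placeholder)
    user="root",
    password="YOUR_ADMIN_PASSWORD",
)
cursor = conn.cursor()

# Create a dedicated user for Hevo and grant it the privileges mentioned above:
# SELECT for reading tables, plus replication privileges for BinLog access.
cursor.execute("CREATE USER 'hevo_user'@'%' IDENTIFIED BY 'a-strong-password'")
cursor.execute("GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'hevo_user'@'%'")
cursor.execute("FLUSH PRIVILEGES")
conn.close()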

Step 1: Configure Cloud SQL as a Source 

Cloud SQL to BigQuery: Configure Source

Step 2: Configure BigQuery as a Destination

Here are the steps to set up BigQuery as the Destination.

  • In the Hevo dashboard, access the destinations option by clicking on the DESTINATIONS button located in the Navigation Bar.
  • Next, create a new destination by selecting the + CREATE button in the Destinations List View.
  • On the Add Destination page, choose Google BigQuery as the Destination type.
  • Once you’ve selected Google BigQuery, you will be redirected to the Configure your Google BigQuery Warehouse page, where you can set up your warehouse according to your preferences.
Cloud SQL to BigQuery: Configure Destination
  • Learn how to set up Cloud SQL as a Source here.
  • Learn how to set up BigQuery as a Destination here.

Key features of Hevo Data Platform to move data from Cloud SQL to BigQuery:

  • Data Transformation: It provides a simple interface to clean, modify, and enrich the data you want to transfer.
  • Schema Management: Hevo automatically detects the schema of the incoming data and maps it to the destination schema.
  • Incremental Data Load: Hevo transfers only the data that has been modified, in real time. This ensures efficient utilization of bandwidth on both ends.
Get started for Free with Hevo!

2. Write Code to Directly Transfer from Source Database to BigQuery

Let’s take Cloud SQL for MySQL as an example.

Your MySQL instance will be adding data continuously, so you will need to schedule your BigQuery transfers at a pre-decided interval. You will also need a marker (a flag or bookmark) that always points to the last record inserted during the previous iteration of the load job.

Each new iteration then begins from the record just after the marker. The marker can be a timestamp stored with every record (telling you up to which point insertion has already been done) or a unique, monotonically increasing key.

With every new iteration/load job, you start from the marker and transfer all records up to the latest added record to BigQuery.

Another option is to first export your source data in CSV format and let BigQuery import the CSV files.
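Here is a minimal sketch of the marker-based approach in Python, assuming the mysql-connector-python and google-cloud-bigquery packages, an auto-increment id column on the source table, and a destination table that already exists in BigQuery; all table, column, and credential names are illustrative.

import mysql.connector
from google.cloud import bigquery

def load_new_rows(last_marker: int) -> int:
    """Copy rows inserted after last_marker into BigQuery and return the new marker."""
    src = mysql.connector.connect(
        host="YOUR_CLOUD_SQL_IP", user="reader", password="YOUR_PASSWORD", database="sales"
    )
    cursor = src.cursor(dictionary=True)
    cursor.execute(
        "SELECT id, customer, amount FROM orders WHERE id > %s ORDER BY id",
        (last_marker,),
    )
    rows = cursor.fetchall()
    src.close()
    if not rows:
        return last_marker  # nothing new since the previous iteration

    # Stream the new rows into BigQuery. Values must be JSON-serializable, so
    # convert dates/decimals to strings if your table has such columns.
    bq = bigquery.Client()
    errors = bq.insert_rows_json("your_project.your_dataset.orders", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

    # In production, persist this marker (e.g. in a small metadata table) so the
    # next scheduled run starts from the record after it.
    return rows[-1]["id"]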

3. Using Federated Queries to Move Data from Cloud SQL to BigQuery

BigQuery supports Cloud SQL federated queries which let you query your Cloud SQL database directly from BigQuery. BigQuery will run this federated SQL on its own, and pick up records from your Cloud SQL database. 

Using the SQL function EXTERNAL_QUERY(), BigQuery can query data residing in Cloud SQL in real time, without copying or moving it.

Its syntax is as follows:

SELECT * FROM EXTERNAL_QUERY(Your_Source_Database_Connection_ID, external_database_query); 

For example, suppose you have an “employee_salary” table in BigQuery and “employee” and “salary” tables in Cloud SQL for MySQL:

INSERT 
   YourBigQueryDatabaseName.employee_salary 
SELECT
   *
FROM
   EXTERNAL_QUERY(
      "CloudSQL_Database_connection_ID",
      "SELECT employee_Name, last_salary_drawn FROM employee e, salary s    WHERE e.employee_ID = s.employee_ID ");

EXTERNAL_QUERY executes the query in Cloud SQL and returns the results as a temporary table, with the source data types automatically converted to BigQuery data types.

To view all tables in a database:

SELECT
   *
FROM
   EXTERNAL_QUERY(
      "connection_id",
      "select * from information_schema.tables;");

To view a table’s columns:

SELECT
   *
FROM
   EXTERNAL_QUERY(
      "connection_id",
      "select * from information_schema.columns where table_name='x';");

The downside to this Cloud SQL to BigQuery method is that you have to write and update your queries often as your data on both sides grows.

4. Cloud SQL to BigQuery through Cloud Pub/Sub

You can split the process in two: pick up data from Cloud SQL and store it temporarily in Cloud Pub/Sub, then have a Dataflow reader stream it into BigQuery.

First, create a connector/routine that publishes changes from Cloud SQL to Pub/Sub. Then write a Dataflow pipeline that consumes those changes from Pub/Sub and pushes them to BigQuery.

If you have relational data at the source which must be denormalized/aggregated before the transfer, this approach works well. 

Also, if you have legacy relational tables whose data types differ significantly from BigQuery’s and would need a lot of pre-processing, you can consider this option.

The upside is that you can reduce overall latency to just a few seconds or minutes. 

The obvious downside is with transactional/OLTP data, which requires a strong versioning scheme to track changes.
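As a rough illustration of the first half of this approach, the sketch below reads recently changed rows from Cloud SQL and publishes them to a Pub/Sub topic, assuming the google-cloud-pubsub and mysql-connector-python packages; the topic, table, and change-tracking column names are placeholders, not part of any official recipe.

import json
import mysql.connector
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project", "cloudsql-changes")

src = mysql.connector.connect(
    host="YOUR_CLOUD_SQL_IP", user="reader", password="YOUR_PASSWORD", database="sales"
)
cursor = src.cursor(dictionary=True)
# "updated_at" is assumed to be a change-tracking column on the source table.
cursor.execute("SELECT * FROM orders WHERE updated_at > NOW() - INTERVAL 5 MINUTE")

for row in cursor.fetchall():
    payload = json.dumps(row, default=str).encode("utf-8")
    publisher.publish(topic_path, payload)  # returns a future; fire-and-forget here

src.close()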

Finer details to remember:

  • BigQuery imposes a four-gigabyte per-file limit: if your Cloud SQL data is exported in JSON format and stored in a Google Cloud Storage bucket before being moved to BigQuery, you will have to split your file(s) into multiple chunks (see the sketch after this list).
  • BigQuery is “append-only”, i.e. it does not allow you to edit existing records; all writes are treated as immutable. So you must have a clear strategy to avoid duplication and errors if your application demands it.
  • BigQuery does not guarantee data consistency for external data sources which are constantly updating. 
  • If the underlying data in Cloud SQL changes while a federated query/transfer is running, it can result in data inconsistency.
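As referenced above, here is a minimal sketch of splitting a large newline-delimited JSON export into smaller chunks before loading; the file names and chunk size are illustrative.

MAX_CHUNK_BYTES = 3 * 1024 ** 3  # stay comfortably under the 4 GB per-file limit

def split_ndjson(path: str, prefix: str) -> None:
    """Split a newline-delimited JSON export into multiple smaller files."""
    part, written = 0, 0
    out = open(f"{prefix}-000.json", "w")
    with open(path) as src:
        for line in src:
            size = len(line.encode("utf-8"))
            if written + size > MAX_CHUNK_BYTES and written > 0:
                out.close()
                part += 1
                out = open(f"{prefix}-{part:03d}.json", "w")
                written = 0
            out.write(line)
            written += size
    out.close()

split_ndjson("cloudsql_export.json", "cloudsql_export_part")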

5. Using GCP to Create a Data Pipeline

Step 1: Enable API


Go to the GCP console and enable the BigQuery Connection API.

Step 2: Set Cloud SQL to Public

Navigate to your Cloud SQL instance’s Connections tab, enable the public IP, and save your settings.

Step 3: Authorize Address

Provide an authorized network address so the connection can be completed.

Step 4: Add External Data Source

In BigQuery, add an external data source and enter details about your Cloud SQL data.


Step 5: Share Connection

Assign roles and email addresses for user access by clicking on ‘Share Connection’.

Step 6: Query with Federated Queries

Use the EXTERNAL_QUERY function in BigQuery to query your Cloud SQL data.
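If you prefer to run this step from code rather than the console, a minimal sketch with the google-cloud-bigquery Python client could look like this; the connection ID, table, and column names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'your-project.us.your-cloudsql-connection',
  'SELECT employee_ID, employee_Name FROM employee;')
"""
# Run the federated query and print each returned row as a dict.
for row in client.query(sql).result():
    print(dict(row))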

Remember to check for permissions, supported regions for federated queries, and compatible data types between Cloud SQL and BigQuery. If you encounter any issues, consult your administrator or refer to the documentation for troubleshooting.

6. Using Cloud Dataflow to Create a Data Pipeline

  • Step 1: Create a routine/connector to publish changes from Cloud SQL to Pub/Sub.
  • Step 2: Write a Dataflow pipeline that consumes changes from Pub/Sub and migrates them to BigQuery.

This method is ideal when relational source data needs aggregation or denormalization before migration. Also, ensure you have a versioning schema to track changes if transporting OLTP/transactional data.

Remember, BigQuery treats all writes as immutable, so plan to avoid duplication and errors. Also, BigQuery has a 4 GB/file limit, so you may need to fragment large files before migration. If data in Cloud SQL changes during transfer, it could lead to inconsistencies.
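For Step 2, a minimal Apache Beam sketch (runnable on Dataflow) that consumes the Pub/Sub messages and streams them into an existing BigQuery table might look like this; the topic, project, and table names are placeholders.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pass --runner=DataflowRunner, --project, --region, etc. for an actual Dataflow run.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
              topic="projects/your-project/topics/cloudsql-changes")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
              "your-project:your_dataset.orders",
              create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table must already exist
              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )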

Key Benefits of Migrating From Cloud SQL to BigQuery

  • Dynamic Resource Allocation: BigQuery dynamically allocates resources based on usage, eliminating the need to provision and forecast compute and storage resources in advance.
  • Fast Analytics: Offers fast analytics on a petabyte scale due to its unique capabilities and columnar data store, which ensures high data compression and minimized data scanning.
  • Scalability: BigQuery’s scalability and performance allow for quick and cost-effective querying on large datasets.
  • Integration: Seamlessly integrates with other Google Cloud services, enhancing data analysis and insight generation.

Conclusion

In this blog post, you have learned six methods to move your data from Cloud SQL to BigQuery. For a hassle-free transfer, use Hevo. Hevo Data, a No-Code Data Pipeline, helps you transfer data from a source of your choice in a fully automated and secure manner without having to write any code.

Getting data from many sources into destinations can be a time-consuming and resource-intensive task. Instead of spending months developing and maintaining such data integrations, you can enjoy a smooth ride with Hevo Data’s 150+ plug-and-play integrations (including 50+ free sources).

Visit our Website to Explore Hevo Data

Saving countless hours of manual data cleaning & standardizing, Hevo Data’s pre-load data transformations get it done in minutes via a simple drag-and-drop interface or your custom Python scripts. No need to go to your data warehouse for post-load transformations. You can run complex SQL transformations from the comfort of Hevo’s interface and get your data in the final analysis-ready form. 

Want to take Hevo Data for a ride? Sign Up for a 14-day free trial and simplify your data integration process. Check out the Hevo pricing details to understand which plan fulfills all your business needs.

Pratik Dwivedi
Freelance Technical Content Writer, Hevo Data

Pratik writes about various topics related to the data industry and loves creating engaging content on data analytics, machine learning, AI, big data, and business intelligence.
