Easily move your data from GCS to BigQuery to enhance your analytics capabilities. With Hevo's intuitive pipeline setup, data flows in real time. Check out our 1-minute demo below to see the seamless integration in action!
Google BigQuery and Cloud Storage are two of the five storage options available on the Google Cloud Platform. Depending on your application, you might use one or several of these services to get the job done. Loading data from GCS to BigQuery is very convenient because imports and exports between BigQuery and Cloud Storage are free. Moreover, BigQuery supports federated queries against GCS, and the two services share a compatible security model for both Access Control and Encryption.
This article introduces you to Google BigQuery and Google Cloud Storage. It also provides two methods that you can use to connect GCS to BigQuery. The first method involves manual integration, while the second is completely automated and hassle-free.
Prerequisites
Listed below are the prerequisites for connecting GCS to BigQuery:
- You have installed the Google Cloud SDK on your CLI (Command Line Interface).
- You have configured the Google Cloud SDK to point to your Google Cloud project.
- You have created a Google Cloud project and enabled the BigQuery API.
- You have enabled billing on your project. If you have not enabled it yet, you can do so from the Billing page in the Google Cloud Console.
- Access to a supported Internet browser:
  - The latest version of Google Chrome, Firefox, or Microsoft Edge
  - Microsoft Internet Explorer 11+
  - Safari 8+
Introduction to Google BigQuery
As you scale and grow your customer base, everything else scales with it: servers, complexity, data volume, and so on. You are then faced with the challenge of transforming the way your business or organization ingests, processes, cleans, analyzes, and reports on corporate data.
Google BigQuery allows you to take all the data from different business divisions (Marketing, Sales, IT, Finance, etc.) and consolidate it in a secure Cloud Warehouse. The data here can then be made available to the Analysts and Stakeholders in the company to perform in-depth analysis.
Google BigQuery is an analytical database. It is both a storage service and a powerful analysis service, which is why it is listed under Big Data on the Google Developer Console.
The power of Google BigQuery is its ability to run SQL queries over terabytes of data in seconds. It can manage data at exabyte scale, and you can run analyses over billions of rows. You interact with Google BigQuery using SQL. Moreover, Google BigQuery is fully managed and serverless, so you don't have to worry about spinning up clusters or managing resources.
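If you already have the Google Cloud SDK configured (see the prerequisites above), a quick way to get a feel for this is to run a query from the command line against Google's publicly available Shakespeare sample dataset. This is just an illustrative sketch:

bq query --use_legacy_sql=false \
'SELECT word, SUM(word_count) AS total_occurrences
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total_occurrences DESC
LIMIT 5'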
Introduction to Google Cloud Storage
Google Cloud Storage (GCS) is a Data Lake. If you have a lot of unstructured data, it's probably best to store it in Google Cloud Storage. When we say that Google Cloud Storage stores unstructured data, we mean that you load a sequence of bytes into a storage bucket and Google Cloud Storage stores them for you.
Of course, there may be some internal structure to that sequence of bytes. For instance, it could be a ZIP archive with a table of contents and individual files inside of it, a JPEG image with a well-defined format, or a text file in Avro, CSV, JSON, or Parquet format.
The key is that Google Cloud Storage has no insight into that internal structure. It faithfully stores and retrieves the exact sequence of bytes you give it, regardless of any structure those bytes may contain.
Moreover, Google BigQuery and Google Cloud Storage work well together:
- Imports and exports between Google BigQuery and Cloud Storage are free.
- Google BigQuery supports federated queries from Cloud Storage (a short example follows this list).
- Google BigQuery provides a compatible security model for both Access Control and Encryption.
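To see the federated-query point in action: BigQuery can query a CSV file in a Cloud Storage bucket directly, without loading it first, by passing an inline external table definition to bq query. The bucket path, file name, table alias, and column names below are hypothetical placeholders, so substitute your own:

bq query \
--external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER@CSV=gs://mybucket/sales.csv \
'SELECT Region, SUM(Total_sales) FROM sales GROUP BY Region'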
Methods to Connect GCS to BigQuery
Now that you have a basic understanding of both tools (Google Cloud Storage and Google BigQuery), you can move on to the two methods for connecting GCS to BigQuery:
Method 1: Using the BigQuery Data Transfer Service to Manually Connect GCS to BigQuery
You can follow these 8 steps to manually connect GCS to BigQuery using the BigQuery Data Transfer Service:
Step 1: Enable the BigQuery Data Transfer Service
From the Developer Console:
- Open the BigQuery Data Transfer API page in the API library.
- From the dropdown menu, select the appropriate project.
- Click the ENABLE button.
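Alternatively, if you prefer the command line, the same API can be enabled with a single gcloud command (this assumes the Cloud SDK is already pointed at your project, per the prerequisites):

gcloud services enable bigquerydatatransfer.googleapis.com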
Step 2: Grant the bigquery.admin Access Permission
When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions.
All users that will create transfers must be granted the bigquery.admin Cloud IAM role. From the gcloud command-line tool, run the following command:
gcloud projects add-iam-policy-binding myproject \
--member group:user@example.com \
--role roles/bigquery.admin
Where:
- myproject is your project ID.
- group:user@example.com is the member being granted the role, written as the member type (group) followed by the group's email address.
The command will output the updated policy:
bindings:
- members:
  - group:user@example.com
  role: roles/bigquery.admin
Step 3: Grant the storage.objectAdmin Cloud IAM Role
When loading data from Cloud Storage, you'll need access to the bucket that contains your data. To get this access, you must have the storage.objectAdmin Cloud IAM role. From the gcloud command-line tool, run the following command:
gcloud projects add-iam-policy-binding myproject \
--member group:user@example.com \
--role roles/storage.objectAdmin
Step 4: Create a Dataset
Enter the following command to create a dataset named mydataset:
bq --location=US mk -d \
--default_table_expiration 3600 \
--description "This is my dataset." \
mydataset
Where:
- The dataset’s location is set to the US.
- 3600 is the default lifetime, in seconds, for newly created tables. This is the minimum allowed value and is equivalent to one hour.
- description is the description of the dataset in quotes.
- mydataset is the name of the dataset you’re creating.
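To confirm the dataset was created, you can inspect it directly (this mirrors the table verification in the next step):

bq show --format=prettyjson mydataset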
Step 5: Create an Empty Table with a Schema Definition
Enter the following command to create a table using a JSON schema file:
bq mk \
--table \
--expiration 3600 \
--description "This is my table" \
--label organization:development \
mydataset.mytable \
/tmp/myschema.json
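The last argument points to a JSON schema file. A BigQuery schema file is a JSON array of column definitions; as a minimal sketch of what /tmp/myschema.json might contain (the column names here are hypothetical):

[
  {"name": "full_name", "type": "STRING", "mode": "REQUIRED"},
  {"name": "age", "type": "INTEGER", "mode": "NULLABLE"}
]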
To verify that the table exists, run this command:
bq show --format=prettyjson mydataset.mytable
Step 6: Create a Storage Bucket
Use the gsutil mb (make bucket) command, specifying your project with -p and the bucket's location with -l:
gsutil mb -p myproject -l us-east1 gs://mybucket/
Step 7: Load CSV Data Files into your Bucket
Copy your data files into the bucket with gsutil cp, where [OBJECT_LOCATION] is the path to your local file:
gsutil cp [OBJECT_LOCATION] gs://mybucket/myfiles/
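For example, assuming a local file named sales.csv (a hypothetical name), you could upload it and then list the bucket path to verify the copy:

gsutil cp ./sales.csv gs://mybucket/myfiles/
gsutil ls gs://mybucket/myfiles/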
Step 8: Create a Transfer for your Data Source (Cloud Storage)
bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"gs://mybucket/myfiles/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false",
"delete_source_files":"true"}' \
--data_source=google_cloud_storage
By default, the transfer is scheduled to repeat every 24 hours. Apart from comma-separated values (CSV) files, you can also load data into BigQuery in the following formats:
- JSON (newline-delimited)
- Avro
- Parquet
- ORC
- Datastore exports
- Firestore exports
After running the command, you receive a message like the following:
[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.
Follow the instructions and paste the authentication code on the command line.
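To confirm the transfer configuration was created, you can list the transfer configs in your project. The location value below assumes the US, matching the dataset created in Step 4:

bq ls --transfer_config --transfer_location=us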
Limitations of Using the BigQuery Data Transfer Service to Manually Connect GCS to BigQuery
The manual method of connecting GCS to BigQuery using the BigQuery Data Transfer Service is effective; however, it comes with several limitations:
- All files in your transfer must have the same schema defined in the BigQuery table, otherwise, the transfer will fail.
- The transfer will fail if you change the table schema in-between load jobs.
- You cannot transfer objects in the Cloud Storage Archive storage class.
- If there are any changes to the underlying data while a query is running, then queries will return unexpected errors.
- The files in your Cloud Storage bucket need to be at least one hour old to qualify for transfer.
- The minimum interval time between recurring transfers is one hour which is not ideal if you have large amounts of data.
- When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and be in the format YYYY-MM-DD (year-month-day).
- Likewise, values in TIMESTAMP columns must use the dash (-) separator in the date portion (YYYY-MM-DD) and the colon (:) separator in the time portion, e.g. hh:mm:ss (see the sample row after this list).
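As a small illustration of the accepted DATE and TIMESTAMP formats, here is a made-up CSV row with one DATE column and one TIMESTAMP column:

order_id,order_date,created_at
1001,2021-07-15,2021-07-15 10:30:00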
Method 2: Using Hevo’s No Code Data Pipeline to Connect GCS to BigQuery
Hevo Data is an automated No-code Data Pipeline that connects GCS and 150+ other sources to Google BigQuery within minutes to create a single source of truth in a completely hassle-free manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching it and transforming it into an analysis-ready form, all without writing a single line of code.
You can load data from files in your GCS bucket into a Destination database or data warehouse using Hevo Pipelines. Hevo automatically unzips any Gzipped files on ingestion, and files are re-ingested in full whenever they are updated, since individual changes within a file cannot be identified.
You can reduce the effort required to extract, transform, and load data from GCS to BigQuery using Hevo. Using Hevo, you can build an automated data pipeline to move data from GCS to BigQuery in real-time, without writing any code. Hevo can load data into BigQuery in just 2 simple steps.
- Step 1: Connect and configure to your Google Cloud Storage data source by providing the Pipeline Name, Path Prefix, File Format, and Bucket Name.
- Step 2: Configure the Google BigQuery Data Warehouse where the data should be loaded by providing the project ID, dataset ID, destination name, GCS bucket, and Sanitize Table/Column names.
Here are more reasons to try Hevo:
- Secure: Hevo has a fault-tolerant architecture that ensures that your GCS data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema. This way you can leverage Hevo to work with AVRO, CSV, JSON, and XML formats of GCS data without worrying about schema management.
- Interactive UI: Hevo's simple and interactive UI makes it extremely easy for new customers to get started and perform operations. Moreover, you can connect GCS to BigQuery in just 2 simple steps.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- High Connector Availability: Hevo provides 150+ connectors that allow you to bring in data from the source of your choice and load it into Data Warehouses like Redshift, BigQuery, Snowflake, etc.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Conclusion
- This article provided you with a brief overview of Google Cloud Storage and Google BigQuery.
- It has also provided a detailed guide on how you can replicate data from GCS to BigQuery manually and using the Hevo Data pipeline.
- Moreover, you got to know about the limitations associated with establishing the connection from GCS to BigQuery manually.
- If those limitations are not a concern for you, you can go for the manual method; otherwise, you can try out the Hevo Data Pipeline to connect GCS to BigQuery.
Share your experience of connecting GCS to BigQuery in the comments section below!
Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.