Connecting GCS to BigQuery: 2 Easy Methods

on Data Engineering, Data Extraction, Data Integration, Data Management Tools, Data Mapping, Data Replication, Data Storage, Data Warehouses, Database Schema Design, ETL, Google BigQuery • October 19th, 2021

Almost all applications need some kind of durable, persistent storage to get their job done. This might be to store information about user accounts, to keep images you want to capture or serve, or to record a time series of events that you want to analyze. Not surprisingly, the Google Cloud Platform can store all of these things. However, because different applications have different storage requirements, the Google Cloud Platform offers not one but five ways of storing persistent data.

Google BigQuery and Cloud Storage are two of the five storage options available in the Google Cloud Platform. Depending on your application, you might use one or several of these services to get the job done. Loading data from GCS to BigQuery is particularly convenient because imports and exports between BigQuery and Cloud Storage are free. Moreover, BigQuery supports federated queries from GCS, and the two services share a compatible security model for both Access Control and Encryption.

This article introduces you to Google BigQuery and Google Cloud Storage. It also provides 2 methods that you can use to connect GCS to BigQuery. The 1st method involves manual integration, while the 2nd method is completely automated and hassle-free.

Prerequisites

Listed below are the prerequisites of connecting GCS to BigQuery:

  • You have installed the Google Cloud SDK on your CLI (Command Line Interface).
  • You have configured the Google Cloud SDK to point to your Google Cloud project.
  • You have created a Google Cloud project and enabled the BigQuery API.
  • You have enabled billing on your project. If you have not enabled it yet, visit this link.
  • Access to a supported Internet browser:
    1. The latest version of Google Chrome, Firefox, or Microsoft Edge
    2. Microsoft Internet Explorer 11+
    3. Safari 8+ 

Introduction to Google BigQuery

As you scale and grow your customer base, everything else scales with it: servers, operational complexity, data volume, and more. You are then faced with the challenge of transforming the way your business/organization ingests, processes, cleans, analyzes, and reports on corporate data.

Google BigQuery allows you to take all the data from different business divisions (Marketing, Sales, IT, Finance, etc.) and consolidate it in a secure Cloud Warehouse. The data here can then be made available to the Analysts and Stakeholders in the company to perform in-depth analysis.

Google BigQuery is an Analytical Database. It is both a storage service and a powerful analysis service which is why it is listed under Big Data on the Google Developer Console.

The power of Google BigQuery is its ability to run SQL queries over Terabytes of data in seconds. It can manage large data (Exabyte-scale) and you can run analysis on billions and billions of rows. You can interact with Google BigQuery using SQL. Moreover, Google BigQuery is fully managed and serverless, so you don’t have to worry about spinning up clusters and managing resources.

Key Features of Google BigQuery

Below are the major key features of Google BigQuery:

  • Extremely easy to use featuring a powerful abstraction layer.
  • Support for standard SQL. Queries are scheduled and executed on Dremel’s multi-tenant compute cluster. 
  • No tuning or indexing of your environment is required, because BigQuery doesn’t use indexes on tables.
  • Canonical saved and shared queries that help new users get started.
  • Partitioning with BigQuery’s require_partition_filter option, which forces all queries to filter on the partitioning Date column. Thanks to BigQuery’s partition pruning, this reduces both query time and the amount of data scanned.
  • Support for nested tables for effectively storing and retrieving data. Nested tables reduce slot usage, query time, and the amount of data scanned.
  • Google BigQuery was one of the first cloud MPP Data Warehouses to support geospatial data types and functions. The Google BigQuery GIS (Geographic Information Systems) feature and the Google BigQuery Geo Viz tool allow you to view your analysis against a map.
  • You don’t have to manage any servers. Google BigQuery is offered on a serverless service model.
  • The decoupled storage and compute through the Jupiter Network provides flexibility and cost control.
    • The Jupiter Network enables BigQuery to move data between storage and compute seamlessly. So today you can have 1 row in a table and the next day 1 trillion rows in the table and the only thing you need to worry about is paying extra for storage.
    • Segregation of storage and compute costs.
  • An embedded BI accelerator (BI Engine) for super-fast last-mile querying, so you don’t incur extra latency.
  • Connectors for Data Exploration and Visualization tools.
  • Scales seamlessly.
    • Under the hood, you have the power of Google’s Data Centers (Grid Computations, Colossus File Systems, Dremel Execution Engine, and Google’s Network).
  • Strong security through the use of Cloud IAM, Authentication & Audit Logs.
  • Support for multi-regional locations. Datasets can be stored in the Americas, Asia Pacific, and Europe.
  • Works well with Google Cloud Storage. Google BigQuery supports federated queries as well as batch and streaming ingest from Google Cloud Storage.
  • You can share datasets with other users on Google BigQuery, much like you would share documents on Google Docs.
  • A fair pricing model. On-demand query pricing is set at $5.00 per TB of data processed. Active storage is billed monthly at $0.020 per GB.
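As a rough sketch of how this pricing model plays out (using the 2021 rates quoted above, and the binary convention BigQuery uses where 1 TB = 2^40 bytes and 1 GB = 2^30 bytes), the arithmetic looks like this:

```python
# Back-of-the-envelope BigQuery cost estimates using the on-demand rates
# quoted above: $5.00 per TB of data processed, $0.020 per GB-month of
# active storage. Illustrative only; actual bills round differently.
ON_DEMAND_PER_TB = 5.00
ACTIVE_STORAGE_PER_GB_MONTH = 0.020

def query_cost(bytes_processed: int) -> float:
    """Cost of a single on-demand query."""
    return bytes_processed / 1024**4 * ON_DEMAND_PER_TB

def storage_cost(bytes_stored: int) -> float:
    """Monthly cost of active storage."""
    return bytes_stored / 1024**3 * ACTIVE_STORAGE_PER_GB_MONTH

# A query scanning 2 TB costs $10.00; storing 500 GB costs $10.00/month.
print(query_cost(2 * 1024**4))      # 10.0
print(storage_cost(500 * 1024**3))  # 10.0
```

Note that partition pruning and nested tables, described above, lower the bytes-processed figure directly, which is why they reduce cost as well as time.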

More information about Google BigQuery can be found here.

Introduction to Google Cloud Storage

Google Cloud Storage (GCS) is a Data Lake. If you have a lot of unstructured data, it’s probably best to store it in Google Cloud Storage. “Unstructured” here means that you load a sequence of bytes into a storage bucket, and Google Cloud Storage stores them for you as-is.

Of course, there may be some internal structure to that sequence of bytes. For instance, it could be a zip archive with a table of contents and individual files inside of it, a JPEG image with a well-defined format, or a data file in Avro, CSV, JSON, or Parquet format.

The key is that Google Cloud Storage has no insight into that internal structure. It just faithfully stores and retrieves the exact sequence of bytes you ask it to, regardless of any structure that may exist in those bytes.

Moreover, Google BigQuery and Google Cloud Storage work well together:

  • Imports and exports between Google BigQuery and Cloud Storage are free. 
  • Google BigQuery supports federated queries from Cloud Storage.
  • Google BigQuery provides a compatible security model for both Access Control and Encryption.

Key Features of Google Cloud Storage

  • A fully redundant data storage infrastructure where you can store and retrieve an unlimited amount of data with no minimum object size, at any time, from anywhere on the web.
  • Cost-efficient Storage classes. These are Standard, Nearline, Coldline, and Archive storage options which cost $0.026, $0.010, $0.007, and $0.004 respectively per GB per month.  
  • A central storage repository for recording all facets of your organization’s operations in data form.
  • Flexible processing where it allows you to select the right tool to analyze your data. Google Cloud Storage integrates natively with other BI (Business Intelligence) tools within the Google Cloud Platform such as Cloud Vision, Dataflow, Dataproc, BigQuery ML, and much more.
  • Granular access over your objects using access-control lists (ACLs).
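Using the per-GB-per-month rates listed above (as of this article's October 2021 publication; check current pricing before committing), a short sketch comparing the monthly cost of keeping the same data in each storage class:

```python
# Monthly storage cost per class, using the per-GB-month rates quoted above.
STORAGE_CLASS_RATES = {
    "Standard": 0.026,
    "Nearline": 0.010,
    "Coldline": 0.007,
    "Archive":  0.004,
}

def monthly_cost(gb: float, storage_class: str) -> float:
    """Monthly storage cost in dollars for `gb` gigabytes in a class."""
    return gb * STORAGE_CLASS_RATES[storage_class]

# For 1,024 GB of rarely accessed data, Archive is 6.5x cheaper than
# Standard (0.026 / 0.004 = 6.5) -- but retrieval costs differ by class.
for cls in STORAGE_CLASS_RATES:
    print(f"{cls}: ${monthly_cost(1024, cls):.2f}/month")
```

The trade-off, not shown here, is that the colder classes charge retrieval fees and impose minimum storage durations, so the right class depends on access patterns.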

This post provides an overview of loading data from Cloud Storage into BigQuery using the BigQuery Data Transfer Service.

This method allows you to schedule recurring data loads from Cloud Storage into Google BigQuery. The same service also supports Amazon S3, Amazon Redshift, and Teradata as sources. In addition, a third-party transfer tool like Hevo Data can make this process a lot easier.

More information about Google Cloud Storage can be found here.

Ways to Connect GCS to BigQuery

Method 1: Using Cloud Storage Transfer Service to Manually Connect GCS to BigQuery

This method involves setting up the Cloud Storage Transfer Service to connect GCS to BigQuery. It is an 8-step process that might also require you to troubleshoot certain errors manually.

Method 2: Using Hevo Data to Connect GCS to BigQuery

Hevo Data is an automated Data Pipeline platform that can move your data from GCS to BigQuery very quickly without writing a single line of code. It is easy to use, hassle-free, and reliable. 

Moreover, Hevo offers a fully-managed solution to set up data integration from 100+ data sources (including 30+ free data sources) and will let you directly load data to a Data Warehouse such as Snowflake, Amazon Redshift, Google BigQuery, etc. or the destination of your choice. It will automate your data flow in minutes without writing any line of code. Its Fault-Tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.

Get Started with Hevo for free

Methods to Connect GCS to BigQuery

Now that you have a basic understanding of both of the tools (Google Cloud Storage and Google BigQuery), you can go on to the two methods for connecting GCS to BigQuery. You can easily connect GCS to BigQuery using one of the two methods given below:

Method 1: Using Cloud Storage Transfer Service to Manually Connect GCS to BigQuery

You can follow these 8 steps to manually connect GCS to BigQuery using the Cloud Storage Transfer Service:

Step 1: Enable the BigQuery Data Transfer Service

From the Developer Console:

  • Open the BigQuery Data Transfer API page in the API library.
  • From the dropdown menu, select the appropriate project.
  • Click the ENABLE button.

Step 2: Grant the bigquery.admin Access Permission

When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions. 

All users that will create transfers must be granted the bigquery.admin Cloud IAM role. From the gcloud command-line tool, run the following command:

gcloud projects add-iam-policy-binding myproject \
--member group:user@example.com \
--role roles/bigquery.admin

Where:

myproject is your project ID.

group:user@example.com is the member to bind; use the group: prefix with a group’s email address, or the user: prefix with an individual user’s email address.

The command will output the updated policy:

    bindings:
    - members:
      - group:user@example.com
      role: roles/bigquery.admin

Step 3: Grant the storage.objectAdmin Cloud IAM Role

When loading data from Cloud Storage, you’ll need access to the bucket that contains your data. To get this access, you must have the storage.objectAdmin Cloud IAM role. From the gcloud command-line tool, run the following command:

gcloud projects add-iam-policy-binding myproject \
--member group:user@example.com \
--role roles/storage.objectAdmin

Step 4: Create a Data Set

Enter the following command to create a dataset in your default project:

bq --location=US mk -d \
--default_table_expiration 3600 \
--description "This is my dataset." \
mydataset

Where:

  • The dataset’s location is set to the US.
  • 3600 is the default lifetime (in seconds) for newly created tables. This is the minimum allowed value and is equivalent to one hour.
  • description is the description of the dataset in quotes.
  • mydataset is the name of the dataset you’re creating.

Step 5: Create an Empty Table with a Schema Definition

Enter the following command to create a table using a JSON schema file:

bq mk \
--table \
--expiration 3600 \
--description "This is my table" \
--label organization:development \
mydataset.mytable \
/tmp/myschema.json

To verify that the table exists, run this command:

bq show --format=prettyjson mydataset.mytable
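The command above references a JSON schema file at /tmp/myschema.json that is never shown. As an illustration, a hypothetical schema for a simple sales table might look like the one below (the column names are made up for this example, not taken from the article), and a few lines of Python can sanity-check it before you run bq mk:

```python
import json

# A hypothetical JSON schema file for bq mk --table. Each entry needs a
# name, a BigQuery type, and a mode (REQUIRED, NULLABLE, or REPEATED).
schema_json = """
[
  {"name": "transaction_date", "type": "DATE",      "mode": "REQUIRED"},
  {"name": "product_id",       "type": "STRING",    "mode": "REQUIRED"},
  {"name": "amount",           "type": "NUMERIC",   "mode": "NULLABLE"},
  {"name": "created_at",       "type": "TIMESTAMP", "mode": "NULLABLE"}
]
"""

# Sanity-check: the file parses as JSON and every field has the
# keys bq expects, catching typos before the load job fails.
schema = json.loads(schema_json)
for field in schema:
    assert {"name", "type", "mode"} <= field.keys(), field
print(f"{len(schema)} columns OK")  # 4 columns OK
```

Writing schema_json to /tmp/myschema.json would then give the bq mk command above exactly the file it expects.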

Step 6: Create a Storage Bucket

Use gsutil to create a bucket in your project:

gsutil mb -p myproject -l us-east1 gs://mybucket/

Step 7: Load CSV Data Files into your Bucket

gsutil cp [OBJECT_LOCATION] gs://mybucket/myfiles/

Where [OBJECT_LOCATION] is the path to your local CSV file.

Step 8: Create a Transfer for your Data Source (Cloud Storage)

bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"gs://mybucket/myfiles/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false",
"delete_source_files":"true"}' \
--data_source=google_cloud_storage

By default, the transfer is scheduled to repeat every 24 hours. Apart from Comma-Separated Values (CSV) files, you can also load data into BigQuery in the following formats:

  • JSON (newline-delimited)
  • Avro
  • Parquet
  • ORC
  • Datastore exports
  • Firestore exports

After running the command, you receive a message like the following:

[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.

Follow the instructions and paste the authentication code on the command line.
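The --params value must be a single valid JSON object, and hand-editing it across multiple lines is error-prone. One way to avoid a stray comma or quote silently breaking the bq mk invocation is to build the object in Python and serialize it (a convenience sketch, not part of the official workflow):

```python
import json

# Build the transfer-config params shown above as a dict, then serialize,
# so the result is guaranteed to be well-formed JSON for --params='...'.
params = {
    "data_path_template": "gs://mybucket/myfiles/*.csv",
    "destination_table_name_template": "MyTable",
    "file_format": "CSV",
    "max_bad_records": "1",
    "ignore_unknown_values": "true",
    "field_delimiter": "|",
    "skip_leading_rows": "1",
    "allow_quoted_newlines": "true",
    "allow_jagged_rows": "false",
    "delete_source_files": "true",
}

serialized = json.dumps(params)
# Round-trips cleanly, so it is safe to paste into the command line.
assert json.loads(serialized) == params
print(serialized)
```

Pasting the printed string into --params='…' reproduces the command above without any risk of malformed JSON.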

Limitations of Cloud Transfer Service to Manually Connect GCS to BigQuery

The manual method of connecting GCS to BigQuery using Cloud Transfer Service is effective, however, it comes with several limitations. Some of the limitations include:

  • All files in your transfer must match the schema defined in the BigQuery table; otherwise, the transfer will fail.
  • The transfer will fail if you change the table schema between load jobs.
  • You cannot transfer objects in the Cloud Storage Archive storage class.
  • If the underlying data changes while a query is running, the query can return unexpected errors.
  • The files in your Cloud Storage bucket need to be at least one hour old to qualify for transfer.
  • The minimum interval time between recurring transfers is one hour which is not ideal if you have large amounts of data. 
  • When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
  • When loading JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the Date portion of the timestamp. The Date needs to have this format: YYYY-MM-DD (year-month-day). The time portion of the timestamp must use a colon (:) separator e.g. hh:mm:ss.
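To avoid failed loads due to the last two limitations, you can pre-validate date and timestamp strings before uploading your CSV files. A simplified stdlib-only sketch (real TIMESTAMP values may also carry fractional seconds or a UTC offset, which this check ignores):

```python
from datetime import datetime

def valid_bq_date(value: str) -> bool:
    """True if value matches the dash-separated YYYY-MM-DD format
    BigQuery expects for DATE columns."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def valid_bq_timestamp(value: str) -> bool:
    """True if value has a dash-separated date portion and a
    colon-separated hh:mm:ss time portion."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False

print(valid_bq_date("2021-10-19"))                # True
print(valid_bq_date("2021/10/19"))                # False: slash separator
print(valid_bq_timestamp("2021-10-19 14:30:00"))  # True
```

Running a check like this over each DATE and TIMESTAMP column before the transfer picks up the files is cheaper than debugging a failed load job afterwards.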

Method 2: Using Hevo Data to Connect GCS to BigQuery

Hevo Data, a No-code Data Pipeline, helps you directly transfer data from Google Cloud Storage and 100+ other data sources to Data Warehouses such as Google BigQuery, Amazon Redshift, Databases, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Sign up here for a 14-day Free Trial!

You can reduce the effort required to extract, transform, and load data from GCS to BigQuery using Hevo. Using Hevo, you can build an automated data pipeline to move data from GCS to BigQuery in real-time, without writing any code. Hevo can load data into BigQuery in just 2 simple steps.

  • Step 1: Connect and configure your Google Cloud Storage data source by providing the Pipeline Name, Path Prefix, File Format, and Bucket Name.
  • Step 2: Configure the Google BigQuery Data Warehouse where the data should be loaded by providing the project ID, dataset ID, destination name, GCS bucket, Enable Streaming Inserts, and Sanitize Table/Column names.

Here are more reasons to try Hevo:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Conclusion

This article provided you with a brief overview of Google Cloud Storage and Google BigQuery. It also offered a detailed guide on how you can replicate data from GCS to BigQuery, both manually and using the Hevo Data Pipeline. Moreover, you learned about the limitations of establishing the connection from GCS to BigQuery manually. If those limitations are not a concern for you, you can go for the manual method; otherwise, you can try out the Hevo Data Pipeline to connect GCS to BigQuery.

Visit our Website to Explore Hevo

Hevo now takes care of moving data from GCS to BigQuery in real-time. In addition to this, Hevo comes with additional capabilities to clean, enrich and transform your data both before and after moving to Google BigQuery. This ensures that you always have analysis-ready data in your warehouse.

Businesses can use automated platforms like Hevo Data to set this integration and handle the ETL process. It helps you directly transfer data from a source of your choice to a Data Warehouse, Business Intelligence tools, or any other desired destination in a fully automated and secure manner without having to write any code and will provide you a hassle-free experience.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of connecting GCS to BigQuery in the comments section below!
