Almost all applications need to use some kind of persistent durable storage to get their job done. This might be to store information about user accounts, images you want to capture or serve a time series of events that are happening you want to analyze. Not surprisingly the Google Cloud Platform can store all these things. However, because different applications have different storage requirements, the Google Cloud Platform has not one but 5 ways of storing persistent data.
Google BigQuery and Cloud Storage are two out of 5 storage options available in the Google Cloud Platform. Depending on your application, you might use one or several of these services to get the job done. It is very convenient to load data from GCS to BigQuery because imports and exports between BigQuery and Cloud Storage are free. Moreover, BigQuery supports federated queries from GCS, and they have a compatible security model for both Access Control and Encryption.
This article introduces you to Google BigQuery and Google Cloud Storage. It also provides 2 methods that you can use to connect GCS to BigQuery. The 1st method includes Manual Integration while the 2nd method is completely automated and hassle-free.
Listed below are the prerequisites of connecting GCS to BigQuery:
- You have installed the Google Cloud SDK on your CLI (Command Line Interface).
- You have configured the Google Cloud SDK to point to your Google Cloud project.
- You have created a Google Cloud project and enabled the BigQuery API.
- You have enabled billing on your project. If you have not enabled it yet, visit this link.
- Access to a supported Internet browser:
- The latest version of Google Chrome, Firefox, or Microsoft Edge
- Microsoft Internet Explorer 11+
- Safari 8+
Introduction to Google BigQuery
As you scale and grow the number of customers, everything else scales including servers, difficulty, data, etc. You are then faced with the challenge of transforming the way your business/organization ingests, processes, cleans, analyzes, and reports on corporate data.
Google BigQuery allows you to take all the data from different business divisions (Marketing, Sales, IT, Finance, etc.) and consolidate it in a secure Cloud Warehouse. The data here can then be made available to the Analysts and Stakeholders in the company to perform in-depth analysis.
Google BigQuery is an Analytical Database. It is both a storage service and a powerful analysis service which is why it is listed under Big Data on the Google Developer Console.
The power of Google BigQuery is its ability to run SQL queries over Terabytes of data in seconds. It can manage large data (Exabyte-scale) and you can run analysis on billions and billions of rows. You can interact with Google BigQuery using SQL. Moreover, Google BigQuery is fully managed and serverless, so you don’t have to worry about spinning up clusters and managing resources.
Key Features of Google BigQuery
Below are the major key features of Google BigQuery:
- Extremely easy to use featuring a powerful abstraction layer.
- Support for standard SQL. Queries are scheduled and executed on Dremel’s multi-tenant compute cluster.
- No tuning or indexing of your environment because BigQuery doesn’t use indexes on tables.
- Canonical saved and shared queries for new users.
- Partitioning using BigQuery’s required partition filter option to require all queries to filter on the partitioned Date column. This reduces the time and data used when querying data due to the partition pruning capabilities of BigQuery.
- Support for nested tables for effectively storing and retrieving data. Nested tables reduce the slot usage, time used for querying, and data usage.
- Google BigQuery is one of the first clouds to offer an MPP Data Warehouse to support geospatial data types and functions. The Google BigQuery GIS (Geospatial Information Systems) feature and the Google BigQuery Geo Visualization tool allow you to view your analysis against a map.
- You don’t have to manage more servers. Google BigQuery is offered on a serverless service model.
- The decoupled storage and compute through the Jupiter Network provides flexibility and cost control.
- The Jupiter Network enables BigQuery to move data between storage and compute seamlessly. So today you can have 1 row in a table and the next day 1 trillion rows in the table and the only thing you need to worry about is paying extra for storage.
- Segregation of storage and compute costs.
- Embedded BI for super-fast last-mile querying so that you don’t incur extra latency.
- Connectors for Data Exploration and Visualization tools.
- Scales seamlessly.
- Under the hood, you have the power of Google’s Data Centers (Grid Computations, Colossus File Systems, Dremel Execution Engine, and Google’s Network).
- Strong security through the use of Cloud IAM, Authentication & Audit Logs.
- Enables Multi-regional locations. Datasets can be stored in the Americas, Asia Pacific, and Europe.
- Works well with Google Cloud Storage. Google BigQuery supports federated queries, batch and streaming ingest from Google Cloud Storage.
- The capability of sharing datasets with other users on Google BigQuery is much like you would share documents on Google Docs.
- A fair pricing model. On-demand query pricing is set at $5.00 per TB of data processed. Active storage is billed monthly at $0.020 per GB.
More information about Google BigQuery can be found here.
Introduction to Google Cloud Storage
Google Cloud Storage or GCS is a Data Lake. If you have a lot of unstructured data, it’s probably best to store it in Google Cloud Storage. When I say that Google Cloud Storage stores unstructured data, by that I mean, you load a sequence of bytes into a storage bucket and Google Cloud Storage will store them for you.
Of course, there may be some internal structure to that sequence of bytes, for instance, it could be a zip archive with a table of contents and individual archives inside of it or it might be a JPEG image file with a well-defined format, or a text file in Avro, CSV, JSON, or Parquet format.
The key is that Google Cloud Storage has no internal insight into that structured data. It just faithfully stores and retrieves the exact sequence of bytes you ask it to, regardless of any internal structure that may exist in those bytes.
Moreover, Google BigQuery and Google Cloud Storage work well together:
- Imports and exports between Google BigQuery and Cloud Storage are free.
- Google BigQuery supports federated queries from Cloud Storage.
- Google BigQuery provides a compatible security model for both Access Control and Encryption.
Key Features of Google Cloud Storage
- A fully redundant data storage infrastructure where you can store and retrieve an unlimited amount of data with no minimum object size, at any time, from anywhere on the web.
- Cost-efficient Storage classes. These are Standard, Nearline, Coldline, and Archive storage options which cost $0.026, $0.010, $0.007, and $0.004 respectively per GB per month.
- A central storage repository for recording all facets of your organization’s operations in data form.
- Flexible processing where it allows you to select the right tool to analyze your data. Google Cloud Storage integrates natively with other BI (Business Intelligence) tools within the Google Cloud Platform such as Cloud Vision, Dataflow, Dataproc, BigQuery ML, and much more.
- Granular access over your objects using access-control lists (ACLs).
This post provides an overview of loading data from Cloud Storage to BigQuery using the BigQuery Cloud Storage Transfer Service.
This method will allow you to schedule recurring data loads from Cloud Storage to Google BigQuery. It also works with Amazon S3, Amazon Redshift, and Teradata. In addition, a third-party transfer tool like Hevo Data can make this process a lot easier.
More information about Google Cloud Storage can be found here.
Did you know that 75-90% of data sources you will ever need to build pipelines for are already available off-the-shelf with No-Code Data Pipeline Platforms like Hevo?
Ambitious data engineers who want to stay relevant for the future automate repetitive ELT work and save more than 50% of their time that would otherwise be spent on maintaining pipelines. Instead, they use that time to focus on non-mediocre work like optimizing core data infrastructure, scripting non-SQL transformations for training algorithms, and more.
Step off the hamster wheel and opt for an automated data pipeline like Hevo. With a no-code intuitive UI, Hevo lets you set up pipelines in minutes. Its fault-tolerant architecture ensures zero maintenance. Moreover, data replication happens in near real-time from 150+ sources to the destination of your choice including Snowflake, BigQuery, Redshift, Databricks, and Firebolt.
Start saving those 20 hours with Hevo today.
Get started for Free with Hevo!
Methods to Connect GCS to BigQuery
Now that you have a basic understanding of both of the tools (Google Cloud Storage and Google BigQuery), you can go on to the two methods for connecting GCS to BigQuery. You can easily connect GCS to BigQuery using one of the two methods given below:
Download the Cheatsheet on How to Set Up High-performance ETL to BigQuery
Learn the best practices and considerations for setting up high-performance ETL to BigQuery
Method 1: Using Cloud Storage Transfer Service to Manually Connect GCS to BigQuery
You can follow these 8 steps to manually connect GCS to BigQuery using the Cloud Storage Transfer Service:
Step 1: Enable the BigQuery Data Transfer Service
From the Developer Console;
- Open the BigQuery Data Transfer API page in the API library.
- From the dropdown menu, select the appropriate project.
- Click the ENABLE button.
Step 2: Grant the bigquery.admin Access Permission
When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions.
All users that will create transfers must be granted the bigquery.admin Cloud IAM role. From the gcloud command-line tool, run the following command;
gcloud projects add-iam-policy-binding myproject
myproject is your project ID.
group:firstname.lastname@example.org is the group and user or group’s email address.
The command will output the updated policy:
Step 3: Grant the storage.objectAdmin Cloud IAM Role
When loading data from Cloud Storage, you’ll need access to the bucket that contains your data. To get this access, you must have the storage.objectAdmin Cloud IAM role. From the gcloud command-line tool, run the following command;
gcloud projects add-iam-policy-binding myproject
Step 4: Create a Data Set
bq --location=US mk -d
--description "This is my dataset."
- The dataset’s location is set to the US.
- 3600 is the default lifetime for newly created tables. This is the minimum value and it is equivalent to one hour.
- description is the description of the dataset in quotes.
- mydataset is the name of the dataset you’re creating.
Step 5: Create an Empty Table with a Schema Definition
Enter the following command to create a table using a JSON schema file:
--description "This is my table"
To verify that the table exists, run this command:
bq show --format=prettyjson mydataset.mytable
Step 6: Create a Storage Bucket
gsutil mb -p myproject -l us-east1 gs://mybucket/
Step 7: Load CSV Data Files into your Bucket
gsutil cp [OBJECT_LOCATION] gs://mybucket/myfiles/
Step 8: Create a Transfer for your Data Source (Cloud Storage)
bq mk --transfer_config
The transfer by default is scheduled to repeat every 24 hours. Apart from Comma-separated values (CSV) files, you can also load data into BigQuery using the following formats:
- JSON (newline-delimited)
- Datastore exports
- Firestore exports
After running the command, you receive a message like the following:
[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.
Follow the instructions and paste the authentication code on the command line.
Limitations of Cloud Transfer Service to Manually Connect GCS to BigQuery
The manual method of connecting GCS to BigQuery using Cloud Transfer Service is effective, however, it comes with several limitations. Some of the limitations include:
- All files in your transfer must have the same schema defined in the BigQuery table, otherwise, the transfer will fail.
- The transfer will fail if you change the table schema in-between load jobs.
- You cannot transfer objects in the cloud storage Archive Storage class.
- If there are any changes to the underlying data while a query is running, then queries will return unexpected errors.
- The files in your Cloud Storage bucket need to be at least one hour old to qualify for transfer.
- The minimum interval time between recurring transfers is one hour which is not ideal if you have large amounts of data.
- When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
- When loading JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the Date portion of the timestamp. The Date needs to have this format: YYYY-MM-DD (year-month-day). The time portion of the timestamp must use a colon (:) separator e.g. hh:mm:ss.
Method 2: Using Hevo’s No Code Data Pipeline to Connect GCS to BigQuery
Hevo Data is an automated No-code Data Pipeline that connects GCS and 100+ sources to Google Bigquery within minutes to create a single source of truth in a completely hassle-free manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.
You can load data from files in your GCS bucket into a Destination database or data warehouse using Hevo Pipelines. Hevo automatically unzips any Gzipped files on ingestion. Further, files are re-ingested if updated, as it is not possible to identify individual changes. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
You can reduce the effort required to extract, transform, and load data from GCS to BigQuery using Hevo. Using Hevo, you can build an automated data pipeline to move data from GCS to BigQuery in real-time, without writing any code. Hevo can load data into BigQuery in just 2 simple steps.
- Step 1: Connect and configure to your Google Cloud Storage data source by providing the Pipeline Name, Path Prefix, File Format, and Bucket Name.
- Step 2: Configure the Google BigQuery Data Warehouse where the data should be loaded by providing the project ID, dataset ID, destination name, GCS bucket, Enable Streaming Inserts, and Sanitize Table/Column names.
Here are more reasons to try Hevo:
- Secure: Hevo has a fault-tolerant architecture that ensures that your GCS data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema. This way you can leverage Hevo to work with AVRO, CSV, JSON, and XML formats of GCS data without worrying about schema management.
- Interactive UI: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations. Moreover, you can connect GCS to BigQuery using only 2 simple steps.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- High Connector Availability: Hevo provides you with 100+ connectors that allow you to bring data from the source of your choice and load into Data Warehouses like Redshift, BigQuery, Snoflake etc.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Hevo supports loading data from GCS to BigQuery in real time without writing any code. It will take full control of data transfer process and allow you to focus on other key aspects of your business. Try our 14 day full access free trial!
Sign up here for a 14-day Free Trial!
This article provided you with a brief overview of Google Cloud Storage and Google BigQuery. It has also provided a detailed guide on how you can replicate data from GCS to BigQuery manually and using the Hevo Data pipeline. Moreover, you got to know about the limitations associated while establishing the connection manually from GCS to BigQuery. If those limitations are not much of your concern, you can go for the manual method, otherwise, you can try out the Hevo Data Pipeline to connect GCS to BigQuery.
Hevo will provide you a much more hassle-free solution compared to Cloud Storage transfer service. Using Hevo, all of your ETL tasks will be automated and this will also free up your engineering bandwidth. Hevo now takes care of moving data from GCS to BigQuery in real-time. In addition to this, Hevo comes with additional capabilities to clean, enrich and transform your data both before and after moving to Google BigQuery. This ensures that you always have analysis-ready data in your warehouse.
Learn more about Hevo
Share your experience of connecting GCS to BigQuery in the comments section below!