GCS to BigQuery: Steps to Move Data Instantly

Tutorials • April 11th, 2020

Almost all applications need some kind of durable, persistent storage to get their job done. This might be to store information about user accounts, to capture or serve images, or to record a time series of events that you want to analyze.

Not surprisingly, the Google Cloud Platform can store all of these things. However, because different applications have different storage requirements, it offers not one but five ways of storing persistent data.

The storage services offered on the Google Cloud Platform are:

  • Cloud Bigtable: a scalable IaaS NoSQL database service designed for large companies and enterprises that often have Big Data analytics needs with complex backend workloads.
  • Cloud Datastore: a scalable NoSQL database optimized for serving transactional data to applications.
  • Cloud SQL: a managed SQL database.
  • Cloud Storage: a secure and durable object store that can scale to exabytes of data.
  • BigQuery: an enterprise data warehouse.

BigQuery and Cloud Storage are two of five storage options available in the Google Cloud Platform. Depending on your application, you might use one or several of these services to get the job done.

Cloud Bigtable, Cloud Datastore, and Cloud SQL are primarily operational databases. They are intended to be used as part of an application.

Understanding Google BigQuery

As your customer base grows, everything else grows with it: servers, data volumes, operational complexity, and so on. You are then faced with the challenge of transforming the way your business or organization ingests, processes, cleans, analyzes, and reports on corporate data.

How do you store and analyze data from all your corporate divisions?

BigQuery allows you to take all the data from different business divisions (Marketing, Sales, IT, Finance, etc.) and consolidate it in a secure cloud warehouse. The data here can then be made available to the analysts and stakeholders in the company to perform in-depth analysis.

BigQuery is an analytical database. It is both a storage service and a powerful analysis service which is why it is listed under BigData on the Google Developer Console.

The power of BigQuery is its ability to run SQL queries over terabytes of data in seconds. It can manage very large data (exabyte scale) and you can run analysis over billions of rows. You interact with BigQuery using SQL. BigQuery is fully managed and serverless, so you don’t have to worry about spinning up clusters and managing resources; that’s all done for you.
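
For example, a standard SQL query against one of BigQuery’s public datasets looks like any other SQL; this is a minimal sketch, and the aggregation shown is only illustrative:

SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10;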

Why choose BigQuery as your Cloud Data Warehouse?

  • Extremely easy to use, featuring a powerful abstraction layer.
  • Support for standard SQL. Queries are scheduled and executed on Dremel’s multi-tenant compute cluster. 
  • No tuning or indexing of your environment because BigQuery doesn’t use indexes on tables.
  • Saved and shared queries that can serve as canonical starting points for new users.
  • Partitioning, using BigQuery’s required partition filter option to force all queries to filter on the partitioned date column. This reduces the time and data used when querying, thanks to BigQuery’s partition pruning (a sketch appears after this list).
  • Support for nested tables for effectively storing and retrieving data. Nested tables reduce the slot usage, time used for querying, and data usage.
  • BigQuery was one of the first cloud MPP data warehouses to support geospatial data types and functions. The BigQuery GIS (Geographic Information Systems) feature and the BigQuery Geo Visualization tool allow you to view your analysis against a map.
  • You don’t have to manage more servers – BigQuery is offered on a serverless service model.
  • Decoupled storage and compute, connected through Google’s Jupiter network, provide flexibility and cost control.
    • The Jupiter network enables BigQuery to move data between storage and compute seamlessly: a table can hold one row today and a trillion rows tomorrow, and the only thing you need to worry about is the extra storage cost.
    • Segregation of storage and compute costs.
  • Embedded BI for super-fast last-mile querying so that you don’t incur extra latency.
  • Connectors for data exploration and visualization tools.
  • Scales seamlessly.
    • Under the hood, you have the power of Google’s Data Centers (Grid computations, Colossus file systems, Dremel execution engine, and Google’s Network).
  • Strong security through the use of Cloud IAM, Authentication & Audit Logs.
  • Multi-regional locations. Datasets can be stored in the Americas, Asia Pacific, and Europe.
  • Works well with GCS. BigQuery supports federated queries, batch and streaming ingests from GCS.
  • The ability to share datasets with other users on BigQuery, much like you would share documents on Google Docs.
  • A fair pricing model. On-demand query pricing is set at $5.00 per TB of data processed. Active storage is billed on a monthly basis at $0.020 per GB.
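
To make the partitioning point above concrete, here is a minimal sketch of a date-partitioned table; the mydataset.events table and its columns are hypothetical, and the require_partition_filter option is what forces every query to prune on the partition column:

CREATE TABLE mydataset.events
(
  event_date DATE,
  user_id STRING,
  payload STRING
)
PARTITION BY event_date
OPTIONS (require_partition_filter = TRUE);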

Understanding Google Cloud Storage

Google Cloud Storage (GCS) is object storage that commonly serves as a data lake. If you have a lot of unstructured data, it’s probably best to store it in GCS. When I say that Google Cloud Storage stores unstructured data, I mean that you load a sequence of bytes into a storage bucket and GCS stores them for you.

Of course, there may be some internal structure to that sequence of bytes. For instance, it could be a zip archive with a table of contents and individual files inside of it, a JPEG image with a well-defined format, or a data file in Avro, CSV, JSON, or Parquet format.

The key is that GCS has no insight into that internal structure. It just faithfully stores and retrieves the exact sequence of bytes you give it, regardless of any structure that may exist in those bytes.

Cloud Storage calls the place you store things a bucket and the things you store objects.

BigQuery and GCS work really well together:

  • Imports and exports between BigQuery and Cloud Storage are free. 
  • BigQuery supports federated queries over data in Cloud Storage (see the sketch after this list).
  • A compatible security model for both access control and encryption.
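
As a sketch of a federated query, assuming CSV files already sit under a hypothetical gs://mybucket/myfiles/ path with these illustrative columns, you can define an external table over the files and query it with ordinary SQL:

CREATE EXTERNAL TABLE mydataset.ext_orders
(
  order_id STRING,
  amount NUMERIC,
  order_date DATE
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/myfiles/*.csv'],
  skip_leading_rows = 1
);

SELECT order_date, SUM(amount) AS daily_total
FROM mydataset.ext_orders
GROUP BY order_date;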

GCS offers:

  • A fully redundant data storage infrastructure where you can store and retrieve an unlimited amount of data with no minimum object size, at any time, from anywhere on the web.
  • Cost-efficient Storage classes. These are Standard, Nearline, Coldline, and Archive storage options which cost $0.026, $0.010, $0.007, and $0.004 respectively per GB per month.  
  • A central storage repository for recording all facets of your organization’s operations in data form.
  • Flexible processing that allows you to select the right tool to analyze your data. GCS integrates natively with other services on the Google Cloud Platform, such as Cloud Vision, Dataflow, Dataproc, BigQuery ML, and more.
  • Granular access over your objects using access-control lists (ACLs).

Moving Data from GCS to BigQuery

  1. Load data into BigQuery using the bq command-line tool.
  2. Use Hevo, an automated data pipeline that integrates seamlessly with GCS and BigQuery. Hevo is a code-free data integration platform that enables any business to implement data lake and data warehouse ETL pipelines.
  3. Use the BigQuery Data Transfer Service for Cloud Storage.

This post provides an overview of loading data from Cloud Storage to BigQuery using the last method: the BigQuery Data Transfer Service for Cloud Storage.

This method will allow you to schedule recurring data loads from Cloud Storage to BigQuery. It also works with Amazon S3, Amazon Redshift, and Teradata. In addition, a third-party transfer tool like Hevo Data can make this process a lot easier.

Prerequisites

  1. You have installed the Google Cloud SDK on your CLI.
  2. You have configured the Google Cloud SDK to point to your Google Cloud project.
  3. You have created a Google Cloud project and enabled the BigQuery API (a sketch of these setup commands follows this list).
  4. You have enabled billing on your project.
  5. Access to a supported Internet browser:
    1. The latest version of Google Chrome, Firefox, or Microsoft Edge
    2. Microsoft Internet Explorer 11+
    3. Safari 8+ 
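
If the SDK and API prerequisites are not yet in place, a minimal sketch of the setup looks like this; the project ID myproject is just a placeholder:

gcloud init                                        # authenticate and choose defaults
gcloud config set project myproject                # point the SDK at your project
gcloud services enable bigquery.googleapis.com     # enable the BigQuery API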

GCS to BigQuery: Steps to Move Data Using the BigQuery Data Transfer Service

1. Enable the BigQuery Data Transfer Service

From the Developer Console:

  1. Open the BigQuery Data Transfer API page in the API library.
  2. From the dropdown menu, select the appropriate project.
  3. Click the ENABLE button.
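
The same API can also be enabled from the command line, assuming the Cloud SDK is already configured for your project:

gcloud services enable bigquerydatatransfer.googleapis.com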

2. Grant the bigquery.admin access permission

When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions. 

All users who will create transfers must be granted the bigquery.admin Cloud IAM role. From the gcloud command-line tool, run the following command:

gcloud projects add-iam-policy-binding myproject \
--member group:user@example.com \
--role roles/bigquery.admin

Where:

myproject is your project ID.

group:user@example.com is the member being granted the role: a user’s or group’s email address, prefixed with user: or group:.

The command will output the updated policy:

    bindings:
    - members:
      - group:user@example.com
      role: roles/bigquery.admin

3. Grant the storage.objectAdmin Cloud IAM role

When loading data from Cloud Storage, you’ll need access to the bucket that contains your data. To get this access, you must have the storage.objectAdmin Cloud IAM role. From the gcloud command-line tool, run the following command:

gcloud projects add-iam-policy-binding myproject \
--member group:user@example.com \
--role roles/storage.objectAdmin
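
If you would rather scope the grant to a single bucket than the whole project, gsutil can bind the equivalent role at the bucket level. This is a sketch, assuming the mybucket bucket created in step 6 below:

gsutil iam ch group:user@example.com:objectAdmin gs://mybucket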

4. Create a dataset

bq --location=US mk -d \
--default_table_expiration 3600 \
--description "This is my dataset." \
mydataset

Where:

  • The dataset’s location is set to the US.
  • 3600 is the default lifetime, in seconds, for newly created tables. This is the minimum value and is equivalent to one hour.
  • description is the description of the dataset in quotes.
  • mydataset is the name of the dataset you’re creating.
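
To confirm the dataset was created, you can inspect it with bq show, just as the next step does for the table:

bq show --format=prettyjson mydataset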

5. Create an empty table with a schema definition  

Enter the following command to create a table using a JSON schema file:

bq mk \
--table \
--expiration 3600 \
--description "This is my table" \
--label organization:development \
mydataset.mytable \
/tmp/myschema.json
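
The command above references /tmp/myschema.json, a JSON array of column definitions. A minimal sketch with hypothetical column names:

[
  {"name": "order_id",   "type": "STRING",  "mode": "REQUIRED"},
  {"name": "amount",     "type": "NUMERIC", "mode": "NULLABLE"},
  {"name": "order_date", "type": "DATE",    "mode": "NULLABLE"}
]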

To verify that the table exists, run this command:

bq show --format=prettyjson mydataset.mytable

6. Create a storage bucket

gsutil mb -p myproject -l us-east1 gs://mybucket/

7. Load CSV data files into your bucket

gsutil cp [OBJECT_LOCATION] gs://mybucket/myfiles/
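
For example, assuming your CSV exports live locally under /tmp (the filenames here are hypothetical), the upload might look like:

gsutil cp /tmp/orders_2020-04-*.csv gs://mybucket/myfiles/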

8. Create a transfer for your data source (Cloud Storage).

bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"gs://mybucket/myfiles/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false",
"delete_source_files":"true"}' \
--data_source=google_cloud_storage

The transfer by default is scheduled to repeat every 24 hours. Apart from Comma-separated values (CSV) files, you can also load data into BigQuery using the following formats:

  • JSON (newline-delimited)
  • Avro
  • Parquet
  • ORC
  • Datastore exports
  • Firestore exports

After running the command, you receive a message like the following:

[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.

Follow the instructions and paste the authentication code on the command line.
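
Once authenticated, you can confirm the transfer configuration exists from the same CLI; this assumes your transfers live in the US location:

bq ls --transfer_config --transfer_location=us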


Limitations of the above approach

  1. All files in your transfer must match the schema defined in the BigQuery table; otherwise, the transfer will fail.
  2. The transfer will fail if you change the table schema in-between load jobs.
  3. You cannot transfer objects in the Cloud Storage Archive storage class.
  4. If there are any changes to the underlying data while a query is running, then queries will return unexpected errors.
  5. The files in your Cloud Storage bucket need to be at least one hour old to qualify for transfer.
  6. The minimum interval between recurring transfers is one hour, which is not ideal if you need data loaded more frequently.
  7. When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
  8. When loading JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date needs to be in the format YYYY-MM-DD (year-month-day). The time portion of the timestamp must use a colon (:) separator, e.g. hh:mm:ss (see the example after this list).
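
As an illustration of points 7 and 8, a CSV row with hypothetical columns (an ID, a DATE column, and a TIMESTAMP column) would need to look like:

ORD-1001,2020-04-11,2020-04-11 08:30:00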

Easier Way to Move Data from GCS to BigQuery

You can reduce the effort required to extract, transform, and load data from GCS to BigQuery using Hevo. Using Hevo, you can build an automated data pipeline to move data from GCS to BigQuery in real time, without writing any code. Hevo can load data into BigQuery in just two simple steps:

  1. Connect and configure your GCS data source.
  2. Configure the Google BigQuery data warehouse where the data should be loaded.

Hevo now takes care of moving data from Google Cloud Storage to BigQuery in real time. In addition, Hevo comes with additional capabilities to clean, enrich, and transform your data both before and after moving it to BigQuery. This ensures that you always have analysis-ready data in your warehouse.

Hevo ingests data from many other databases (including MySQL, SQL Server, Oracle, PostgreSQL, Amazon Aurora, and others) via log-based Change Data Capture (CDC), as well as from log files, messaging systems, SaaS applications, and more. You can look at the complete list of data sources here.

Sign up for a 14-day free trial to experience Hevo’s simplicity and robustness first-hand.
