Google BigQuery – a fully managed Cloud Data Warehouse for analytics from Google Cloud Platform (GCP), is one of the most popular Cloud-based analytics solutions. Due to its unique architecture and seamless integration with other services from GCP, there are certain best practices to be considered while configuring Google BigQuery ETL (Extract, Transform, Load) & migrating data to BigQuery.

This article will give you a birds-eye on how Google BigQuery can enhance the ETL Process in a seamless manner. Read along to discover how you can use Google BigQuery ETL for your organization!

Best Practices to Perform Google BigQuery ETL

Given below are 11 Best Practices & Strategies individuals can use to perform Google BigQuery ETL:

Simplify BigQuery ETL with Hevo’s no-code Data Pipeline

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.

Get Started with Hevo for Free

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data.

SIGN UP HERE FOR A 14-DAY FREE TRIAL

1. GCS – Staging Area for BigQuery Upload

Unless you are directly loading data from your local machine, the first step in Google BigQuery ETL is to upload data to GCS. To move data to GCS you have multiple options:

  • Gsutil is a command line tool which can be used to upload data to GCS from different servers.
  • If your data is present in any online data sources like AWS S3 you can use Storage Transfer Service from Google cloud. This service has options to schedule transfer jobs.

Other things to be noted while loading data to GCS:

  • GCS bucket and Google BigQuery dataset should be in the same location with one exception – If the dataset is in the US multi-regional location, data can be loaded from GCS bucket in any regional or multi-regional location.

The format supported to upload from GCS to Google BigQuery are – Comma-separated values (CSV), JSON (newline-delimited), Avro, Parquet, ORC, Cloud Datastore exports, Cloud Firestore exports. 

2. Nested and Repeated Data

This is one of the most important Google BigQuery ETL best practices. Google BigQuery performs best when the data is denormalized. Instead of keeping relations, denormalize the data and take advantage of nested and repeated fields. Nested and repeated fields are supported in Avro, Parquet, ORC, JSON (newline delimited) formats. STRUCT is the type that can be used to represent an object which can be nested and ARRAY is the type to be used for the repeated value.

For example, the following row from a BigQuery table is an array of a struct:

{

 "id": "1",
 "first_name": "Ramesh",
"last_name": "Singh",
"dob": "1998-01-22",
 "addresses": [
 {
     "status": "current",
     "address": "123 First Avenue",
"city": "Pittsburgh",
     "state": "WA",
     "zip": "11111",
    "numberOfYears": "1"
},
{
     "status": "previous",
     "address": "456 Main Street",
     "city": "Pennsylvania",
     "state": "OR",
     "zip": "22222",
    "numberOfYears": "5"
  }
]
}

3. Data Compression

The next vital Google BigQuery ETL best practice is on Data Compression. Most of the time the data will be compressed before transfer. You should consider the below points while compressing data.

  • The binary Avro is the most efficient format for loading compressed data.
  • Parquet and ORC format are also good as they can be loaded in parallel.
  • For CSV and JSON, Google BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel.

4. Time Series Data and Table Partitioning

Time Series data is a generic term used to indicate a sequence of data points paired with timestamps. Common examples are clickstream events from a website or transactions from a Point Of Sale machine. The velocity of this kind of data is much higher and volume increases over time. Partitioning is a common technique used to efficiently analyze time-series data and Google BigQuery has good support for this with partitioned tables. Partitioned Tables are crucial in Google BigQuery ETL operations because it helps in the Storage of data.

A partitioned table is a special Google BigQuery table that is divided into segments often called as partitions. It is important to partition bigger table for better maintainability and query performance. It also helps to control costs by reducing the amount of data read by a query. Automated tools like Hevo Data can help you partition BigQuery ETL tables within the UI only which helps streamline your ETL even faster.

To learn more about partitioning in Google BigQuery, you can read our blog here.

Google BigQuery has mainly three options to partition a table:

  • Ingestion-time partitioned tables – For these type of table BigQuery automatically loads data into daily, date-based partitions that reflect the data’s ingestion date. A pseudo column named _PARTITIONTIME will have this date information and can be used in queries.
  • Partitioned tables – Most common type of partitioning which is based on TIMESTAMP or DATE column. Data is written to a partition based on the date value in that column. Queries can specify predicate filters based on this partitioning column to reduce the amount of data scanned.
    • You should use the date or timestamp column which is most frequently used in queries as partition column.
    • Partition column should also distribute data evenly across each partition. Make sure it has enough cardinality.
    • Also, note that the Maximum number of partitions per partitioned table is 4,000.
    • Legacy SQL is not supported for querying or for writing query results to partitioned tables.
  • Sharded Tables – You can also think of shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD and use a UNION while selecting data.

Generally, Partitioned tables perform better than tables sharded by date. However, if you have any specific use-case to have multiple tables you can use sharded tables. Ingestion-time partitioned tables can be tricky if you are inserting data again as part of some bug fix.

5. Streaming Insert

The next vital Google BigQuery ETL best practice is on actually inserting data. For inserting data into a Google BigQuery table in batch mode a load job will be created which will read data from the source and insert it into the table. Streaming data will enable us to query data without any delay in the load job. Stream insert can be performed on any Google BigQuery table using Cloud SDKs or other GCP services like Dataflow (Dataflow is an auto-scalable stream and batch data processing service from GCP ). The following things should be noted while performing stream insert:

  • Streaming data is available for the query after a few seconds of the first stream inserted in the table.
  • Data takes up to 90 minutes to become available for copy and export.
  • While streaming to a partitioned table, the value of  _PARTITIONTIME pseudo column will be NULL.
  • While streaming to a table partitioned on a DATE or TIMESTAMP column, the value in that column should be between 1 year in the past and 6 months in the future. Data outside this range will be rejected.

6. Bulk Updates

Google BigQuery has quotas and limits for DML statements which is getting increased over time. As of now the limit of combined INSERT, UPDATE, DELETE and MERGE statements per day per table is 1,000. Note that this is not the number of rows. This is the number of the statement and as you know, one single DML statement can affect millions of rows.

Now within this limit, you can run updates or merge statements affecting any number of rows. It will not affect any query performance, unlike many other analytical solutions.

7. Transforming Data after Load (ELT)

Google BigQuery ETL must also address ELT in some scenarios as ELT is the popular methodology now. Sometimes it is really handy to transform data within Google BigQuery using SQL, which is often referred to as Extract Load Transfer (ELT). BigQuery supports both INSERT INTO SELECT and CREATE TABLE AS SELECT  methods to data transfer across tables.

INSERT das.DetailedInve (product, quantity)
VALUES('television 50',
(SELECT quantity FROM ds.DetailedInv
WHERE product = 'television'))

CREATE TABLE mydataset.top_words
AS SELECT corpus,ARRAY_AGG(STRUCT(word, word_count)) AS top_words
FROM bigquery-public-data.samples.shakespeare GROUP BY corpus;

8. Federated Tables for Adhoc Analysis

You can directly query data stored in the location below from BigQuery which is called federated data sources or tables.

  • Cloud BigTable
  • GCS
  • Google Drive

Things to be noted while using this option:

  • Query performance might not be good as the native Google BigQuery table.
  • No consistency is guaranteed in case of external data is changed while querying.
  • Can’t export data from an external data source using BigQuery Job.
  • Currently, Parquet or ORC format is not supported.
  • The query result is not cached, unlike native BigQuery tables.

9. Access Control and Data Encryption

Data stored in Google BigQuery is encrypted by default and keys are managed by GCP Alternatively customers can manage keys using the Google KMS service.

To grant access to resources, BigQuery uses IAM(Identity and Access Management) to the dataset level. Tables and views are child resources of datasets and inherit permission from the dataset. There are predefined roles like bigquery.dataViewer and bigquery.dataEditor or the user can create custom roles.

10. Character Encoding

Sometimes it will take some time to get the correct character encoding scheme while transferring data. Take notes of the points mentioned below as it will help you to get them correct in the first place.

  • To perform Google BigQuery ETL, all source data should be UTF-8 encoded with the below exception
  • If a CSV file with data encoded in ISO-8859-1 format, it should be specified and BigQuery will properly convert the data to UTF-8
  • Delimiters should be encoded as ISO-8859-1
  • Non-convertible characters will be replaced with Unicode replacement characters: �

11. Backup and Restore

Google BigQuery ETL addresses backup and disaster recovery at the service level. The user does not need to worry about it. Still, Google BigQuery is maintaining a complete 7-day history of changes against tables and allows to query a point-in-time snapshot of the table.

Concerns when using BigQuery

You should be aware of potential issues or difficulties. You may create better data pipelines and data solutions where these problems can be solved by having a deeper understanding of these concerns.

Limited data type support

BigQuery does not accept arrays, structs, or maps as data types. Therefore, in order to make such data suitable with your data analysis requirements, you will need to modify them.

Dealing with unstructured data

When working with unstructured data in BigQuery, you need to account for extra optimisation activities or transformational stages. BigQuery handles structured and semi-structured data with ease. However, unstructured data might make things a little more difficult.

Complicated workflow

Getting started with BigQuery’s workflow function may be challenging for novices, particularly if they are unfamiliar with fundamental SQL or other aspects of data processing.

Lack of support for Modify/Update delete operations on individual rows

To change any row, you have to either alter the entire table or utilize an insert, update, and delete combo.

Serial operations

BigQuery is well-suited to processing bulk queries in parallel. However, if you try to conduct serial operations, you can discover that it performs worse.

Daily table update limit

A table can be updated up to 1000 times in a day by default. You will need to request and raise the quota in order to get more updates.

Common Stages in a BigQuery ELT Pipeline

Let’s look into the typical steps in a BigQuery ELT pipeline:

  • Transferring data from file systems, local storage, or any other media
  • Data loading into Google Cloud Platform services (GCP)
  • Data loading into BigQuery
  • Data transformation using methods, processes, or SQL queries

There are two methods for achieving data transformation with BigQuery:

Using Data Transfer Services

This method loads data into BigQuery using GCP native services, and SQL handles the transformation duties after that.

Using GCS

In this method, tools such as Distcp, Sqoop, Spark Jobs, GSUtil, and others are used to load data into the GCS (Google Cloud Storage) bucket. In this method, SQL may also do the change.

Conclusion

In this article, you have learned 11 best practices you can employ to perform. Google BigQuery ETL operations. However, performing these operations manually time and again can be very taxing and is not feasible. You will need to implement them manually, which will consume your time & resources, and writing custom scripts can be error-prone. Moreover, you need full working knowledge of the backend tools to successfully implement the in-house Data transfer mechanism. You will also have to regularly map your new files to the Google BigQuery Data Warehouse.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. Checkout Hevo pricing and find a plan that suits you best.

Have any further queries? Get in touch with us in the comments section below.

mm
Freelance Technical Content Writer, Hevo Data

Faisal loves data science and combines his problem-solving ability and passion for writing to help data teams in solving complex business problems.