BigQuery ETL: 11 Best Practices For High Performance

on Engineering • February 22nd, 2019

Google BigQuery, a fully managed cloud data warehouse for analytics from Google Cloud Platform (GCP), is one of the most popular cloud-based analytics solutions. Because of its unique architecture and seamless integration with other GCP services, there are certain best practices to keep in mind while migrating data to BigQuery.

Best Practices to Perform BigQuery ETL

Following are 11 best practices to perform BigQuery ETL:

  1. GCS as a Staging Area for BigQuery Upload
  2. Handling Nested and Repeated Data
  3. Data Compression Best Practices
  4. Time Series Data and Table Partitioning
  5. Streaming Insert
  6. Bulk Updates
  7. Transforming Data after Load (ELT)
  8. Federated Tables for Ad-hoc Analysis
  9. Access Control and Data Encryption
  10. Character Encoding
  11. Backup and Restore

Hevo Data Integration Platform – An Easier and More Reliable Way

If you want to load any data easily into Google BigQuery without any hassle, you can try out Hevo. Hevo automates the flow of data from various sources to Google BigQuery in real-time and at zero data loss. In addition to migrating data, you can also build aggregates and joins on Google BigQuery to create materialized views that enable faster query processing.

Let’s see some key highlights of Hevo Data:

  • Easy Setup And Highly Intuitive User Interface: Hevo has a minimal learning curve and can be set up in minutes. Once the user has quickly configured and connected both the data source and the destination data warehouse, Hevo moves data in real-time.
  • Fully Managed: No coding or pipeline maintenance is required from your team.
  • Unlimited Integrations: Hevo provides connectivity to numerous cloud-based and on-premises sources. Check out the complete list here.
  • Automatic Schema Mapping: Hevo automatically detects the schema of the incoming data and maps it to the destination schema. This frees you from the tedious job of configuring schemas manually.
  • Effortless Data Transformations: Hevo provides a simple Python interface to clean, transform, and enrich any data before moving it to the data warehouse.

Give Hevo a try by signing up for a 14-day free trial today.

1. GCS – Staging Area for BigQuery Upload

Unless you are loading data directly from your local machine, you first have to upload the data to Google Cloud Storage (GCS) before loading it into BigQuery. You have multiple options to move data to GCS:

  • The gsutil command-line tool can be used to upload data to GCS from your servers.
  • If your data resides in an online data source such as Amazon S3, you can use the Storage Transfer Service from Google Cloud. This service also lets you schedule recurring transfer jobs.

Other things to note while loading data to GCS:

  • The GCS bucket and the BigQuery dataset should be in the same location, with one exception: if the dataset is in the US multi-regional location, data can be loaded from a GCS bucket in any regional or multi-regional location.

The formats supported for loading from GCS into BigQuery are comma-separated values (CSV), JSON (newline-delimited), Avro, Parquet, ORC, Cloud Datastore exports, and Cloud Firestore exports.
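As a rough illustration, here is a minimal sketch of loading a staged file from GCS into BigQuery using the google-cloud-bigquery Python client; the bucket, dataset, and table names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Configure the load job for a CSV file with a header row.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

# Load a file that was previously staged in GCS (for example, with gsutil).
load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/sales/2019-02-22.csv",  # hypothetical URI
    "my-project.my_dataset.sales",                  # hypothetical table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete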

2. Nested and Repeated Data

This is one of the most important BigQuery best practices. BigQuery performs best when the data is denormalized. Instead of preserving relations, denormalize the data and take advantage of nested and repeated fields. Nested and repeated fields are supported in the Avro, Parquet, ORC, and JSON (newline-delimited) formats. STRUCT is the type used to represent a nested object, and ARRAY is the type used for repeated values.

For example, in the following row from a BigQuery table, the addresses field is an array of structs (a matching schema definition is sketched after the example):

{
  "id": "1",
  "first_name": "John",
  "last_name": "Doe",
  "dob": "1968-01-22",
  "addresses": [
    {
      "status": "current",
      "address": "123 First Avenue",
      "city": "Seattle",
      "state": "WA",
      "zip": "11111",
      "numberOfYears": "1"
    },
    {
      "status": "previous",
      "address": "456 Main Street",
      "city": "Portland",
      "state": "OR",
      "zip": "22222",
      "numberOfYears": "5"
    }
  ]
}
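As a rough sketch, the schema for such a table could be declared with the google-cloud-bigquery Python client as follows; the project, dataset, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("first_name", "STRING"),
    bigquery.SchemaField("last_name", "STRING"),
    bigquery.SchemaField("dob", "DATE"),
    # A repeated RECORD column corresponds to an ARRAY of STRUCTs in SQL.
    bigquery.SchemaField(
        "addresses",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("status", "STRING"),
            bigquery.SchemaField("address", "STRING"),
            bigquery.SchemaField("city", "STRING"),
            bigquery.SchemaField("state", "STRING"),
            bigquery.SchemaField("zip", "STRING"),
            bigquery.SchemaField("numberOfYears", "STRING"),
        ],
    ),
]

client.create_table(bigquery.Table("my-project.my_dataset.customers", schema=schema))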

3. Data Compression

Most of the time, data is compressed before transfer. Consider the points below when choosing a compression scheme and file format (a load sketch follows the list).

  • Binary Avro is the most efficient format for loading compressed data.
  • Parquet and ORC formats are also good choices, as they can be loaded in parallel.
  • For CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel.
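For instance, a compressed Avro file staged in GCS can be loaded with a minimal configuration, since Avro carries its own schema; the bucket and table names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Avro files embed their schema, so no autodetect or explicit schema is needed.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

client.load_table_from_uri(
    "gs://my-staging-bucket/events/2019-02-22.avro",  # hypothetical URI
    "my-project.my_dataset.events",                   # hypothetical table
    job_config=job_config,
).result()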

4. Time Series Data and Table Partitioning

Time series data is a generic term for a sequence of data points paired with timestamps. Common examples are clickstream events from a website or transactions from a point-of-sale machine. This kind of data arrives at high velocity, and its volume grows over time. Partitioning is a common technique used to efficiently analyze time-series data, and BigQuery supports it well with partitioned tables.

A partitioned table is a special BigQuery table that is divided into segments called partitions. It is important to partition bigger tables for better maintainability and query performance. Partitioning also helps to control costs by reducing the amount of data read by a query.

BigQuery offers three main options to partition a table:

  • Ingestion-time partitioned tables – For this type of table, BigQuery automatically loads data into daily, date-based partitions that reflect the data's ingestion date. A pseudo-column named _PARTITIONTIME holds this date and can be used in queries.
  • Partitioned tables – The most common type of partitioning, based on a TIMESTAMP or DATE column. Data is written to a partition based on the date value in that column, and queries can specify predicate filters on the partitioning column to reduce the amount of data scanned.
    • Use the date or timestamp column that is filtered on most frequently in queries as the partitioning column.
    • The partitioning column should also distribute data evenly across partitions; make sure it has enough cardinality.
    • Also note that the maximum number of partitions per partitioned table is 4,000.
    • Legacy SQL is not supported for querying partitioned tables or for writing query results to them.
  • Sharded tables – You can also shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD and use a UNION while selecting data.

Generally, partitioned tables perform better than tables sharded by date. However, if you have a specific use case that calls for multiple tables, you can use sharded tables. Ingestion-time partitioned tables can be tricky if you need to re-insert data as part of a bug fix. You can read a detailed comparison here. A minimal sketch of creating and querying a column-partitioned table follows.
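The sketch below assumes a hypothetical clickstream table and schema: it creates a table partitioned on a DATE column with the Python client and runs a query that filters on that column.

from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned on the event_date DATE column.
table = bigquery.Table(
    "my-project.my_dataset.clickstream_events",
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)
client.create_table(table)

# A filter on the partitioning column limits the query to matching partitions.
sql = """
    SELECT event_name, COUNT(*) AS events
    FROM `my-project.my_dataset.clickstream_events`
    WHERE event_date BETWEEN DATE '2019-02-01' AND DATE '2019-02-22'
    GROUP BY event_name
"""
rows = client.query(sql).result()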

5. Streaming Insert

When you insert data into a BigQuery table in batch mode, a load job is created that reads data from the source and inserts it into the table. Streaming inserts, by contrast, let you query data without waiting for a load job. Streaming inserts can be performed on any BigQuery table using the Cloud SDKs or other GCP services like Dataflow (an auto-scalable stream and batch data processing service from GCP). Note the following points while performing streaming inserts (a client-library sketch follows the list):

  • Streamed data is available for querying within a few seconds of the first streaming insert into the table.
  • Data takes up to 90 minutes to become available for copy and export.
  • While streaming to an ingestion-time partitioned table, the value of the _PARTITIONTIME pseudo-column is NULL while rows are in the streaming buffer.
  • While streaming to a table partitioned on a DATE or TIMESTAMP column, the value in that column should be between 1 year in the past and 6 months in the future. Data outside this range will be rejected.
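As a sketch, a streaming insert with the google-cloud-bigquery client uses insert_rows_json; the table name and rows below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

rows_to_insert = [
    {"event_date": "2019-02-22", "user_id": "u-123", "event_name": "page_view"},
    {"event_date": "2019-02-22", "user_id": "u-456", "event_name": "add_to_cart"},
]

# insert_rows_json streams rows into the table and returns any per-row errors.
errors = client.insert_rows_json(
    "my-project.my_dataset.clickstream_events", rows_to_insert
)
if errors:
    print("Some rows were rejected:", errors)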

6. Bulk Updates

BigQuery has quotas and limits for DML statements, and they have been increasing over time. At the time of writing, the limit on combined INSERT, UPDATE, DELETE, and MERGE statements per table per day is 1,000. Note that this is not a row limit: it is the number of statements, and a single DML statement can affect millions of rows.

Within this limit, you can run UPDATE or MERGE statements affecting any number of rows without hurting query performance, unlike many other analytical solutions. Batching many changes into a single MERGE, as sketched below, keeps you well within the statement quota.
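A rough sketch of a bulk upsert, assuming a hypothetical staging table of changes: one MERGE statement applies the whole batch and counts as a single DML statement against the daily quota.

from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
    MERGE `my-project.my_dataset.customers` AS target
    USING `my-project.my_dataset.customer_updates` AS source
    ON target.id = source.id
    WHEN MATCHED THEN
      UPDATE SET first_name = source.first_name, last_name = source.last_name
    WHEN NOT MATCHED THEN
      INSERT (id, first_name, last_name)
      VALUES (source.id, source.first_name, source.last_name)
"""
client.query(merge_sql).result()  # one statement, regardless of rows affected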

7. Transforming Data after Load (ELT)

Sometimes it is handy to transform data within BigQuery using SQL, an approach often referred to as Extract, Load, Transform (ELT). BigQuery supports both INSERT ... SELECT and CREATE TABLE ... AS SELECT for moving transformed data across tables, for example:

INSERT ds.DetailedInv (product, quantity)
VALUES ('countertop microwave',
        (SELECT quantity FROM ds.DetailedInv
         WHERE product = 'microwave'));

CREATE TABLE mydataset.top_words AS
SELECT corpus, ARRAY_AGG(STRUCT(word, word_count) ORDER BY word_count DESC LIMIT 10) AS top_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

8. Federated Tables for Ad-hoc Analysis

From BigQuery, you can directly query data stored in the locations below; these are known as federated data sources or external tables.

  • Cloud Bigtable
  • GCS
  • Google Drive

Things to note while using this option (a query sketch follows the list):

  • Query performance might not be as good as with a native BigQuery table.
  • Consistency is not guaranteed if the external data changes while the query is running.
  • You can't export data from an external data source using a BigQuery job.
  • At the time of writing, the Parquet and ORC formats are not supported.
  • Query results are not cached, unlike those of native BigQuery tables.
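For example, a CSV file in GCS can be queried ad hoc through a temporary external table definition; the bucket, file name, and table alias below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Describe the external data: a CSV file sitting in GCS.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-staging-bucket/exports/orders.csv"]
external_config.autodetect = True
external_config.options.skip_leading_rows = 1

# Attach the definition to the query under a temporary table name.
job_config = bigquery.QueryJobConfig(
    table_definitions={"orders_external": external_config}
)

sql = "SELECT status, COUNT(*) AS orders FROM orders_external GROUP BY status"
for row in client.query(sql, job_config=job_config).result():
    print(row.status, row.orders)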

9. Access Control and Data Encryption

Data stored in BigQuery is encrypted by default, with keys managed by GCP. Alternatively, customers can manage their own keys using the Cloud KMS service.

To grant access to resources, BigQuery uses IAM (Identity and Access Management) down to the dataset level. Tables and views are child resources of datasets and inherit permissions from the dataset. There are predefined roles such as bigquery.dataViewer and bigquery.dataEditor, or you can create custom roles.
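As a sketch, dataset-level access and a customer-managed key can be configured through the Python client; the dataset, user email, and KMS key path below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Grant a user read access at the dataset level; tables and views inherit it.
dataset = client.get_dataset("my-project.my_dataset")
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER", entity_type="userByEmail", entity_id="analyst@example.com"
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])

# Create a table encrypted with a customer-managed key from Cloud KMS.
table = bigquery.Table("my-project.my_dataset.sensitive_events")
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name="projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"
)
client.create_table(table)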

10. Character Encoding

Getting the character encoding right when transferring data can take some trial and error. Keep the points below in mind to get it correct the first time (a load-configuration sketch follows the list):

  • BigQuery expects all source data to be UTF-8 encoded, with the exception below.
  • If a CSV file is encoded in ISO-8859-1, you should specify that encoding explicitly, and BigQuery will properly convert the data to UTF-8.
  • Delimiters should be encoded as ISO-8859-1.
  • Non-convertible characters are replaced with the Unicode replacement character: �
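For example, an ISO-8859-1 (Latin-1) encoded CSV file can be loaded by declaring its encoding in the job configuration; the bucket and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",  # declare the source encoding; BigQuery stores UTF-8
    skip_leading_rows=1,
    autodetect=True,
)

client.load_table_from_uri(
    "gs://my-staging-bucket/legacy/customers_latin1.csv",  # hypothetical URI
    "my-project.my_dataset.customers_legacy",              # hypothetical table
    job_config=job_config,
).result()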

11. Backup and Restore

BigQuery addresses backup and disaster recovery at the service level, so users do not need to manage it themselves. In addition, BigQuery maintains a complete 7-day history of changes to each table and allows you to query a point-in-time snapshot of a table. You can read more about the snapshot query syntax here.
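As a sketch, a point-in-time snapshot can be queried with the standard SQL time-travel clause FOR SYSTEM_TIME AS OF; the table name below is a placeholder.

from google.cloud import bigquery

client = bigquery.Client()

# Read the table as it looked 24 hours ago.
sql = """
    SELECT *
    FROM `my-project.my_dataset.customers`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
"""
snapshot_rows = client.query(sql).result()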

Conclusion

In this article, you have learned 11 best practices you can employ while performing BigQuery ETL operations. However, performing these operations manually time and again can be very taxing and may not be feasible.

Try Hevo today to simplify this process. Sign up for a 14-day free trial to experience the ease Hevo brings to BigQuery ETL.

Have any further queries? Get in touch with us in the comments section below.
