Google BigQuery ETL: 11 Best Practices For High Performance

• January 31st, 2022


Google BigQuery, a fully managed Cloud Data Warehouse for analytics from Google Cloud Platform (GCP), is one of the most popular Cloud-based analytics solutions. Due to its unique architecture and seamless integration with other GCP services, there are certain best practices to consider while configuring Google BigQuery ETL (Extract, Transform, Load) and migrating data to BigQuery.

In a world where ETL plays a crucial role in Data Analysis, different tools have been developed to simplify, streamline, and visualize the ETL process. The scalability, security, and high processing power of Google BigQuery make it a favourite for large businesses. Today, companies are learning to use Google BigQuery ETL to streamline their data operations, applying the latest strategies and techniques to leverage the power of Google BigQuery and GCP to improve the overall ETL process.

This article will give you a birds-eye view of Google BigQuery, its key features, and how it can enhance the ETL Process in a seamless manner. Read along to discover how you can use Google BigQuery ETL for your organization!

Table of Contents

What is Google BigQuery?


Google BigQuery is a globally used Data Warehouse that offers storage for large amounts of data using Cloud technology. It operates on a serverless design and requires no infrastructure setup or hardware management from you. Moreover, Google BigQuery automatically replicates your data across different locations, ensuring high availability and reliability for your company's data.

Unlike some Cloud-based Data Warehouses in the market, this Google tool can deliver with little to no maintenance. Furthermore, Google allows you to access its Data Warehouse from all locations that have Google Cloud availability. Another advantage of using Google BigQuery is that you get seamless and free integrations with other Google-based services such as Google Analytics, Google Drive, etc.

It is also one of the most popular Data Warehouses for ETL operations; hence, performing Google BigQuery ETL will be a piece of cake!

Key Features of Google BigQuery

The following features of Google BigQuery make it beneficial for ETL workloads and businesses of all sizes:

  • Availability: Google BigQuery supports automatic Data Replication. It also keeps a 7-day log that contains information about any changes made to your data during the previous week. These features protect your information from system failures and provide high Data Availability. 
  • Easy Maintenance: Google BigQuery automates almost all data warehousing tasks, such as running queries, managing storage, and allocating resources. This minimizes the need for human intervention in running and maintaining the Data Warehouse system. 
  • Smart Pricing: Google BigQuery has separate modules for storing and processing data. This safeguards you from paying extra costs and allows Google BigQuery to offer smart and appealing pricing models for businesses. 
  • Machine Learning: Google BigQuery’s machine learning capabilities provide real-time insight into your data. Its high-speed API provides a solid foundation for faster data analysis. This allows you to upload your business data and analyze it instantly.
  • High Scalability: Google BigQuery operates on parallel processing. This way, it is able to offer you high scalability and consistency, making it a great fit for real-time data analysis at any scale.

To understand Google BigQuery better, visit here.

What is ETL?


ETL can be defined as a Data Integration process divided into three steps, i.e., Extract, Transform, and Load. It is primarily used to integrate data from multiple sources and load it into a centralized location, typically a Data Warehouse, for analytical purposes. During this process, necessary data is extracted from all data sources, transformed into a form suitable for analysis, and loaded into a destination based on the requirements. Google BigQuery ETL also follows this approach. Automated ETL Tools like Hevo Data make this tedious process far more manageable.

The three steps of ETL are given below:

  • Extract: Extraction can be defined as the process of gathering all essential data from the source systems. For most ETL processes, these sources can be Databases such as MySQL, MongoDB, Oracle, etc., Customer Relationship Management (CRM), Enterprise Resource Planning (ERP) tools, or various other files, documents, web pages, etc.
  • Transform: Transformation can be defined as the process of converting the data into a format suitable for analysis such that it can easily be understood by a Business Intelligence or Data Analysis tool. The following operations are usually performed in this phase:
    • Filtering, de-duplicating, cleansing, validating, and authenticating the data.
    • Performing all necessary translations, calculations, or summarizations on the extracted raw data. This can include operations such as changing row and column headers for consistency, standardizing data types, and many others to suit the organization’s specific Business Intelligence (BI) and Data Analysis requirements.
    • Encrypting, removing, or hiding data governed by industry or government regulations.
    • Formatting the data into tables and performing the necessary joins to match the Schema of the destination Data Warehouse.
  • Load: Loading can be defined as the process of storing the transformed data in the destination of choice, normally a Data Warehouse such as Amazon Redshift, Google BigQuery, Snowflake, etc.

To learn more about the Open-Source ETL Tools, please click on this link.

Set Up an Efficient Google BigQuery ETL Solution with Hevo’s No-code Data Pipeline

Hevo Data, an Automated No-code Data Pipeline, can help you set up Google BigQuery ETL in the most seamless and hassle-free way possible. Hevo is fully managed and completely automates the process of not only loading your data into BigQuery but also enriching the data and transforming it into an analysis-ready form, without you having to write a single line of code. The whole process of Google BigQuery ETL can be automated in a matter of a few clicks.

Get Started with Hevo for Free

“By using a Data Pipeline from Hevo, you can reduce your Data Extraction, Transformation, and Loading time & effort many folds! In addition, Hevo’s native integration with multiple Data Sources and BI Tools such as Tableau will empower you to set up your ETL Pipeline, Visualize it and gain actionable insights with ease!”

Experience an entirely automated hassle-free Google BigQuery ETL. Try our 14-day full access free trial today!

Best Practices to Perform Google BigQuery ETL


Given below are 11 Best Practices & Strategies individuals can use to perform Google BigQuery ETL:

1. GCS – Staging Area for BigQuery Upload

Unless you are directly loading data from your local machine, the first step in Google BigQuery ETL is to upload data to GCS. To move data to GCS you have multiple options:

  • gsutil is a command-line tool that can be used to upload data to GCS from different servers.
  • If your data is present in an online data source like Amazon S3, you can use the Storage Transfer Service from Google Cloud. This service also has options to schedule transfer jobs.

Other things to be noted while loading data to GCS:

  • The GCS bucket and the Google BigQuery dataset should be in the same location, with one exception: if the dataset is in the US multi-regional location, data can be loaded from a GCS bucket in any regional or multi-regional location.

The formats supported for upload from GCS to Google BigQuery are: Comma-Separated Values (CSV), JSON (newline-delimited), Avro, Parquet, ORC, Cloud Datastore exports, and Cloud Firestore exports.
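For illustration, a load from GCS can also be expressed in SQL using BigQuery's LOAD DATA statement (a newer alternative to the bq load command); the dataset, table, and bucket names below are placeholders:

```sql
-- Load newline-delimited JSON files from a GCS staging bucket
-- into a BigQuery table (placeholder names throughout).
LOAD DATA INTO mydataset.events
FROM FILES (
  format = 'JSON',
  uris = ['gs://my-staging-bucket/events/*.json']
);
```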

2. Nested and Repeated Data

This is one of the most important Google BigQuery ETL best practices. Google BigQuery performs best when the data is denormalized. Instead of keeping relations, denormalize the data and take advantage of nested and repeated fields. Nested and repeated fields are supported in the Avro, Parquet, ORC, and JSON (newline-delimited) formats. STRUCT is the type used to represent an object, which can itself be nested, and ARRAY is the type used for repeated values.

For example, the following row from a BigQuery table contains an array of structs in its addresses field:

{
  "id": "1",
  "first_name": "Ramesh",
  "last_name": "Singh",
  "dob": "1998-01-22",
  "addresses": [
    {
      "status": "current",
      "address": "123 First Avenue",
      "city": "Pittsburgh",
      "state": "WA",
      "zip": "11111",
      "numberOfYears": "1"
    },
    {
      "status": "previous",
      "address": "456 Main Street",
      "city": "Pennsylvania",
      "state": "OR",
      "zip": "22222",
      "numberOfYears": "5"
    }
  ]
}
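A table that stores rows like this one can be declared with STRUCT and ARRAY types and queried by flattening the repeated field with UNNEST. The sketch below mirrors the example row (the dataset and table names are placeholders):

```sql
CREATE TABLE mydataset.customers (
  id STRING,
  first_name STRING,
  last_name STRING,
  dob DATE,
  addresses ARRAY<STRUCT<
    status STRING,
    address STRING,
    city STRING,
    state STRING,
    zip STRING,
    numberOfYears STRING>>
);

-- Flatten the repeated field to filter on a nested value.
SELECT c.id, a.city
FROM mydataset.customers AS c, UNNEST(c.addresses) AS a
WHERE a.status = 'current';
```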

3. Data Compression

The next vital Google BigQuery ETL best practice concerns Data Compression. Most of the time, data will be compressed before transfer. You should consider the points below while compressing data.

  • The binary Avro is the most efficient format for loading compressed data.
  • Parquet and ORC format are also good as they can be loaded in parallel.
  • For CSV and JSON, Google BigQuery can load uncompressed files significantly faster than compressed files because uncompressed files can be read in parallel.

4. Time Series Data and Table Partitioning

Time Series data is a generic term used to indicate a sequence of data points paired with timestamps. Common examples are clickstream events from a website or transactions from a Point of Sale machine. The velocity of this kind of data is much higher, and its volume increases over time. Partitioning is a common technique used to efficiently analyze time-series data, and Google BigQuery has good support for this with partitioned tables. Partitioned tables are crucial in Google BigQuery ETL operations because they reduce the amount of data each query has to scan.

A partitioned table is a special Google BigQuery table that is divided into segments called partitions. It is important to partition bigger tables for better maintainability and query performance. Partitioning also helps to control costs by reducing the amount of data read by a query. Automated tools like Hevo Data can help you partition BigQuery ETL tables within the UI itself, which helps streamline your ETL even further.

To learn more about partitioning in Google BigQuery, you can read our blog here.

Google BigQuery offers three main options to partition a table:

  • Ingestion-time partitioned tables – For these tables, BigQuery automatically loads data into daily, date-based partitions that reflect the data’s ingestion date. A pseudo column named _PARTITIONTIME holds this date information and can be used in queries.
  • Partitioned tables – The most common type of partitioning, based on a TIMESTAMP or DATE column. Data is written to a partition based on the date value in that column, and queries can specify predicate filters on this column to reduce the amount of data scanned.
    • Use the date or timestamp column that is most frequently filtered on in queries as the partition column.
    • The partition column should also distribute data evenly across partitions; make sure it has enough cardinality.
    • Also, note that the maximum number of partitions per partitioned table is 4,000.
    • Legacy SQL is not supported for querying partitioned tables or for writing query results to them.
  • Sharded tables – You can also shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD and combine them with a UNION when selecting data.

Generally, partitioned tables perform better than tables sharded by date. However, if you have a specific use case that requires multiple tables, you can use sharded tables. Ingestion-time partitioned tables can be tricky if you need to re-insert data as part of a bug fix.
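As a sketch, a column-partitioned table and a partition-pruning query look like this (the names and dates are placeholders):

```sql
-- Partition on the event timestamp; queries that filter on the
-- partition column scan only the matching partitions.
CREATE TABLE mydataset.page_views (
  user_id STRING,
  url STRING,
  view_ts TIMESTAMP
)
PARTITION BY DATE(view_ts);

-- Only the seven partitions in the date range are read.
SELECT url, COUNT(*) AS views
FROM mydataset.page_views
WHERE DATE(view_ts) BETWEEN '2022-01-01' AND '2022-01-07'
GROUP BY url;
```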

5. Streaming Insert

The next vital Google BigQuery ETL best practice concerns actually inserting data. To insert data into a Google BigQuery table in batch mode, a load job is created that reads data from the source and inserts it into the table. Streaming inserts, by contrast, let you query data without waiting for a load job to finish. A stream insert can be performed on any Google BigQuery table using Cloud SDKs or other GCP services like Dataflow (an auto-scalable stream and batch data processing service from GCP). The following points should be noted while performing stream inserts:

  • Streaming data becomes available for querying within a few seconds of the first streaming insert into the table.
  • Data takes up to 90 minutes to become available for copy and export.
  • While streaming to an ingestion-time partitioned table, the value of the _PARTITIONTIME pseudo column will be NULL while the data is in the streaming buffer.
  • While streaming to a table partitioned on a DATE or TIMESTAMP column, the value in that column should be between 1 year in the past and 6 months in the future. Data outside this range will be rejected.

6. Bulk Updates

Google BigQuery has quotas and limits for DML statements, which have been raised over time. As of now, the limit on combined INSERT, UPDATE, DELETE, and MERGE statements per day, per table, is 1,000. Note that this is not a number of rows: it is the number of statements, and a single DML statement can affect millions of rows.

Now within this limit, you can run updates or merge statements affecting any number of rows. It will not affect any query performance, unlike many other analytical solutions.
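For example, a single MERGE statement can upsert an arbitrary number of rows while consuming only one statement from the daily quota (the table names below are placeholders):

```sql
-- Upsert: update matching products and insert new ones,
-- all in one DML statement.
MERGE mydataset.inventory AS t
USING mydataset.inventory_updates AS s
ON t.product = s.product
WHEN MATCHED THEN
  UPDATE SET quantity = s.quantity
WHEN NOT MATCHED THEN
  INSERT (product, quantity) VALUES (s.product, s.quantity);
```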

What makes Hevo’s Google BigQuery ETL Process Best-In-Class

Providing a high-quality ETL solution can be a cumbersome task if you just have a Data Warehouse and raw data. Hevo’s automated, No-code platform empowers you with everything you need to have a smooth Google BigQuery ETL experience. Our platform has the following in store for you!

Check out what makes Hevo amazing:

  • Integrations: Hevo’s fault-tolerant Data Pipeline offers you a secure option to unify data from 100+ sources (including 40+ free sources) and store it in Google BigQuery or any other Data Warehouse of your choice. This way you can focus more on your key business activities and let Hevo take full charge of the Data Transfer process.
  • High-Speed Data Loading: Loading compressed data into Google BigQuery is slower than loading uncompressed data. Hevo can decompress your data before feeding it to BigQuery. Hence your process would be simpler on the source side and will be completed efficiently on the destination side. 
  • Quick Setup: Hevo with its automated features, can be set up in minimal time. Moreover, with its simple and interactive UI, it is extremely easy for new customers to work on and perform operations.
  • Scalability: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Try our 14-day free trial & set up your Google BigQuery ETL with ease!


7. Transforming Data after Load (ELT)

Google BigQuery ETL must also address ELT in some scenarios, as ELT is now a popular methodology. It is often handy to transform data within Google BigQuery using SQL, which is referred to as Extract, Load, Transform (ELT). BigQuery supports both the INSERT INTO SELECT and CREATE TABLE AS SELECT methods for transforming data across tables.

INSERT ds.DetailedInv (product, quantity)
VALUES ('television 50',
  (SELECT quantity FROM ds.DetailedInv
   WHERE product = 'television'))

CREATE TABLE mydataset.top_words AS
SELECT corpus, ARRAY_AGG(STRUCT(word, word_count)) AS top_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

8. Federated Tables for Adhoc Analysis

You can directly query data stored in the locations below from BigQuery; these are called federated data sources or tables.

  • Cloud BigTable
  • GCS
  • Google Drive

Things to be noted while using this option:

  • Query performance might not be as good as with a native Google BigQuery table.
  • No consistency is guaranteed if the external data changes while the query is running.
  • You can’t export data from an external data source using a BigQuery job.
  • Currently, the Parquet and ORC formats are not supported.
  • Query results are not cached, unlike with native BigQuery tables.
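A GCS-backed federated table can be defined in SQL with CREATE EXTERNAL TABLE (the bucket and table names are placeholders; external tables can also be defined via table definition files with the bq tool):

```sql
-- Query CSV files in GCS in place, without loading them first.
CREATE EXTERNAL TABLE mydataset.ext_events
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/events/*.csv']
);

SELECT COUNT(*) FROM mydataset.ext_events;
```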

9. Access Control and Data Encryption

Data stored in Google BigQuery is encrypted by default, with keys managed by GCP. Alternatively, customers can manage their own keys using the Google Cloud KMS service.

To grant access to resources, BigQuery uses IAM (Identity and Access Management) down to the dataset level. Tables and views are child resources of datasets and inherit permissions from their dataset. There are predefined roles like bigquery.dataViewer and bigquery.dataEditor, or users can create custom roles.
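Dataset-level access can also be granted with BigQuery's SQL DDL; the role and principal below are examples only:

```sql
-- Give a user read-only access to every table in the dataset.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA mydataset
TO "user:analyst@example.com";
```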

10. Character Encoding

Getting the character encoding right can take a few attempts when transferring data. Take note of the points below, as they will help you get it correct the first time.

  • To perform Google BigQuery ETL, all source data should be UTF-8 encoded, with the following exception:
  • If a CSV file contains data encoded in ISO-8859-1, the encoding should be specified explicitly, and BigQuery will properly convert the data to UTF-8.
  • Delimiters should be encoded as ISO-8859-1.
  • Non-convertible characters will be replaced with the Unicode replacement character: �.

11. Backup and Restore

Google BigQuery addresses backup and disaster recovery at the service level, so the user does not need to manage them. Still, Google BigQuery maintains a complete 7-day history of changes to each table and allows you to query a point-in-time snapshot of a table.
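This seven-day history can be queried directly with the FOR SYSTEM_TIME AS OF clause, which also offers a simple restore path (the table names are placeholders):

```sql
-- Read the table as it existed one hour ago.
SELECT *
FROM mydataset.inventory
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- "Restore" by materializing that snapshot into a new table.
CREATE TABLE mydataset.inventory_restored AS
SELECT *
FROM mydataset.inventory
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```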

Conclusion

In this article, you have learned 11 best practices you can employ to perform Google BigQuery ETL operations. However, performing these operations manually time and again can be very taxing and is not feasible. Implementing them by hand consumes time and resources, and writing custom scripts can be error-prone. Moreover, you need full working knowledge of the backend tools to successfully implement an in-house data transfer mechanism, and you will have to regularly map new files to the Google BigQuery Data Warehouse.

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 100+ data sources (including 40+ free sources) and can seamlessly set up Google BigQuery ETL within minutes. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner without requiring you to write any code, making data migration hassle-free.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. 

Have any further queries? Get in touch with us in the comments section below.