Google BigQuery Structs: Syntax and Queries Simplified 101

By: Yash Sanghvi | Published: October 5, 2021


BigQuery is Google’s Data Warehousing solution. It is serverless, i.e., it allocates compute resources on the fly, as per requirements, so you need not worry about resource allocation. It performs parallel query execution, thanks to its columnar (rather than row-based) organization of data, and is well suited for spiky workloads, i.e., querying large chunks of data in a short duration.

Often, the data you are dealing with in your analysis does not belong to the conventional data types like int, float, boolean, string, etc. You often deal with collections. One of the common ways of representing data collections is through key-value pairs. BigQuery Structs allow the storage of key-value pair collections in your tables. Google BigQuery defines a struct as follows:

Container of ordered fields each with a type (required) and field name (optional).

If you have worked with JSON files in the past, or with dictionaries in Python, you will feel at home with structs in BigQuery.
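For example, the following query builds a Struct value inline, much like defining a small dictionary (the field names and values here are purely illustrative):

SELECT STRUCT("Alice" AS name, 30 AS age) AS person

The result is a single person column whose value holds the two named fields.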

In this article, you will learn how to create BigQuery Structs, how to use them in queries, and how to perform operations on these Structs. In the end, you’ll also briefly touch upon the concept of Nested Structs.


Prerequisites

It is assumed that you have worked with Google BigQuery in the past and know how to create datasets and tables in Google BigQuery. If you are not familiar with these concepts, it will be worthwhile to look at these helper articles:

Introduction to Google BigQuery


Google BigQuery is a fully-managed Cloud Data Warehouse that lets you use SQL to manage terabytes of data. It enables businesses to evaluate their data more quickly and generate insights using normal SQL queries. Since Google BigQuery is part of the Google Cloud Platform (GCP), it can take advantage of Google Cloud Functions and other Google products to help you save time and get better results. To process read-only data, Google BigQuery is built on Google’s Dremel engine. Users can scale up or down both the Storage and Compute power on their own, depending on their needs.

Download the Cheatsheet on How to Set Up High-performance ETL to BigQuery
Learn the best practices and considerations for setting up high-performance ETL to BigQuery

Separating the Processing Engine from the storage space gives Google BigQuery additional flexibility. This capability allows Google BigQuery to Store, Analyze and Evaluate data from any location. Google BigQuery also features advanced Data Analysis and Visualization capabilities, such as the Google BigQuery ML (Machine Learning) and BI (Business Intelligence) Engine.

Also, you can utilize Google BigQuery in three different ways:

  • Data Loading and Exporting: You may easily and rapidly import your data into BigQuery. After BigQuery has processed your data, you can export it for further study.
  • Interactive Queries: You may also use data to conduct Batch queries and construct Virtual tables.
  • Manage Data: Google BigQuery allows you to list jobs, datasets, projects, and tables, as well as update them. Any data you upload to Google BigQuery can be managed.
Simplify BigQuery ETL with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ Data Sources (including 40+ Free Data Sources) and involves a simple 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/Destination such as Google BigQuery but also enriches the data and transforms it into an analysis-ready form, without your having to write a single line of code.

Get Started with Hevo for Free

Its completely automated Data Pipeline delivers data in real-time, without any loss, from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Working with Google BigQuery Structs

Creating Google BigQuery Structs

Let’s get started by creating a table with a Struct column. Let’s create a student_records table, containing rollNo as an integer column and info as a Struct column. The info column is itself composed of 3 attributes: name, age, and department, with age being an integer and the other two being strings.

You can set an expiration time for your table as shown. Also, replace the ‘my_first_dataset’ with your dataset name.

CREATE TABLE my_first_dataset.student_records
 (
   rollNo INT OPTIONS(description="The identifier for each student"),
   info STRUCT<
     name STRING,
     age INT,
     department STRING
   >
 )
 OPTIONS(
   expiration_timestamp=TIMESTAMP "2021-10-07 00:00:00 UTC",
   description="Table with a struct column"
 )

Now that the table is created, let’s populate it with values. You can run a query like the one below (with more values as required).

INSERT INTO my_first_dataset.student_records (rollNo, info) VALUES (1, STRUCT("Yash", 26, "Mechanical Engineering"))

Note that if you are using the Google BigQuery Sandbox, the above query won’t execute, because DML (Data Manipulation Language) statements like INSERT, UPDATE, and DELETE are not supported in the Sandbox; you would have to provide billing information. However, that doesn’t mean you can’t have a table populated with data. Run the following query instead. It will not only create the table but also populate it with data.

CREATE OR REPLACE TABLE my_first_dataset.student_records AS (
SELECT 1 AS rollNo, STRUCT("Isha" as name,22 as age,"Pharmacy" as department) AS info
UNION ALL
SELECT 2 AS rollNo, STRUCT("Yash" as name,22 as age,"Mechanical Engineering" as department) AS info
UNION ALL 
SELECT 3 AS rollNo, STRUCT("Siddhi" as name,18 as age,"Metallurgy" as department) AS info
UNION ALL 
SELECT 4 AS rollNo, STRUCT("Kushal" as name,21 as age,"Electrical Engineering" as department) AS info
);

As you can see, you are creating a table using the result of a query and you are adding multiple rows by performing the UNION ALL operation. Of course, this approach is not scalable (you won’t do this to populate thousands of rows), but it will help you proceed further with this tutorial.
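Outside the Sandbox (i.e., with billing enabled), the scalable alternative is a multi-row INSERT or a load job. A multi-row INSERT for this table might look like the sketch below (the two extra students are made-up examples):

INSERT INTO my_first_dataset.student_records (rollNo, info)
VALUES
  (5, STRUCT("Asha", 20, "Civil Engineering")),
  (6, STRUCT("Ravi", 23, "Computer Science"));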

As things stand right now, you have a table student_records containing a column of type struct, populated with 4 rows of data. Now let’s explore further.

First of all, if you click on the student_records table in the explorer and look at the schema, you will see that the type for the info column is RECORD and the mode is NULLABLE. This combination (RECORD + NULLABLE) identifies a Struct in BigQuery.

Note: If the type is RECORD and the mode is REPEATED, it means that the column contains an Array of Structs.
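As a sketch of the difference, a hypothetical table with an Array-of-Structs column would be declared with ARRAY<STRUCT<...>>; in the schema view, such a column shows up as RECORD with mode REPEATED. (The course_records table and its columns are illustrative names, not part of this tutorial’s tables.)

CREATE TABLE my_first_dataset.course_records
 (
   courseId INT64,
   students ARRAY<STRUCT<name STRING, age INT64>>
 )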

BigQuery Structs - Table Schema

Querying Google BigQuery Structs

If you directly query a Struct column in Google BigQuery, the result will contain multiple columns, one for each of the attributes within the BigQuery Structs. Here’s an example:

SELECT info from my_first_dataset.student_records

The output contains 3 columns since the info column has 3 attributes.
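You can also request this expansion explicitly with the .* syntax, which selects every attribute of the Struct as its own column:

SELECT info.* from my_first_dataset.student_records

This is equivalent to listing info.name, info.age, and info.department individually.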

BigQuery Structs - Querying BigQuery Structs

In case you wish to query specific attributes of the Struct, you can use the Dot Notation. For instance, the following query fetches the roll no, name, and age for each student:

SELECT rollNo, info.name, info.age from my_first_dataset.student_records
BigQuery Structs - Querying BigQuery Structs from Database

Performing Operations on Google BigQuery Structs

Structs support limited operations: Equal (=), Not equal (!= or <>), IN, and NOT IN. An example with Equal is given below.

SELECT * from my_first_dataset.student_records
WHERE info = ("Yash",22,"Mechanical Engineering")
BigQuery Structs - Operations on Structs

Note that you did not use the STRUCT keyword before (“Yash”, 22, “Mechanical Engineering”) in the above query. This is because Google BigQuery constructs an anonymous Struct when it is given a tuple of values; this is the tuple syntax of constructing a Struct.
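For comparison, the same filter written with an explicit STRUCT constructor is equivalent:

SELECT * from my_first_dataset.student_records
WHERE info = STRUCT("Yash", 22, "Mechanical Engineering")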

An example with NOT IN is given below:

SELECT * from my_first_dataset.student_records
WHERE info NOT IN UNNEST([("Siddhi",18,"Metallurgy"), ("Kushal",21,"Electrical Engineering")])
BigQuery Structs - NOT IN Query

As you can see, you checked whether the Struct belongs to an Array of Structs by listing the items of the array using UNNEST.
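IN works analogously; this query keeps only the rows whose Struct appears in the list:

SELECT * from my_first_dataset.student_records
WHERE info IN UNNEST([("Isha",22,"Pharmacy"), ("Yash",22,"Mechanical Engineering")])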

Note that the limited set of operations listed above apply only to the Struct as a whole. Individual attributes within the Struct can support other operations (>, <, etc.) depending on the data type of that attribute. An example is given below showing the use of the ‘<’ operator on the age attribute.

SELECT * from my_first_dataset.student_records
WHERE info.age < 22
BigQuery Structs - Where Clause Applied

Understanding Nested Structs in Google BigQuery

A Struct having another Struct as one or more of its attributes is known as a Nested Struct. If you have understood the concept of Structs, then creating and querying Nested Structs should be straightforward for you.

Whether it’s a single record or repeated data, Google BigQuery allows nested records within tables. Unlike traditional denormalization methods, records in Google BigQuery are expressed using nested and repeated fields.

Google BigQuery’s built-in support for nested and repeated structures, as in JSON, is its preferred way of denormalizing data. Rather than flattening attributes into a table, this approach localizes a record’s sub-attributes within a single table. Employing nested records during analysis eliminates the need for repeated data, new subtables, or joins in Google BigQuery.

The following example shows the creation and population of a table containing the info column which is a Struct, which contains another BigQuery Struct (subjects) as one of its attributes.

CREATE OR REPLACE TABLE my_first_dataset.student_records_detailed AS (
SELECT 1 AS rollNo, STRUCT("Isha" as name,22 as age,"Pharmacy" as department, STRUCT("Chemistry" as sub1, "Bio" as sub2) as subjects) AS info
 
UNION ALL
 
SELECT 2 AS rollNo, STRUCT("Yash" as name,22 as age,"Mechanical Engineering" as department, STRUCT("Physics" as sub1, "Math" as sub2) as subjects) AS info);

When you query the Nested Struct column, the attributes within the Inner Struct also appear as columns.

SELECT info from my_first_dataset.student_records_detailed
BigQuery Structs - Nested Query

As you would have expected, the dot notation can be extended to queries of Nested Structs:

SELECT info.subjects.sub1 from my_first_dataset.student_records_detailed
BigQuery Structs - Dot Notation

Finally, operations work on Nested Structs like on normal Google BigQuery Structs. An example is shown below for the equal operator:

SELECT * from my_first_dataset.student_records_detailed
WHERE info = ("Isha",22,"Pharmacy",("Chemistry","Bio"))
BigQuery Structs - Equal Operator

I hope you enjoyed reading this article on creating & working with Google BigQuery Structs and found it useful. You can refer to the official documentation for any further reading on structs.

Conclusion

This article introduced you to the concept of Google BigQuery Structs. It also provided you with an in-depth guide with proper syntax and examples of creating, querying, and managing Google BigQuery Structs. It also covered Nested Structs and how they are just an extension of the concept of Google BigQuery Structs.

Companies store valuable data from multiple data sources into Google BigQuery. The manual process to transfer data from source to destination is a tedious task but this is where Hevo saves the day!

Visit our Website to Explore Hevo

Hevo Data is a No-code Data Pipeline that can help you transfer data from any data source to Google BigQuery. It fully automates the process to load and transform data from 100+ Data Sources (Including 40+ Free Sources) to a destination of your choice such as Google BigQuery without writing a single line of code. 

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Google BigQuery Structs in the comments section below!

Yash Sanghvi
Freelance Technical Content Writer, Hevo Data

Yash has helped companies succeed online and specializes in the data industry. He provides comprehensive and informative content for those delving deeper into the complexities of data integration and analysis.
