As businesses and individuals have acknowledged the benefits of keeping their data in Cloud Storage Systems, it has increased the adoption of the Cloud as a Data Storage Platform. As a result, the majority of businesses have migrated their data from On-Premise databases to Cloud Storage Systems.

The choice of data storage is critical for every company since it influences the speed with which queries are handled as well as the expenses associated. Analyzing data with pre-built functions such as ARRAY_AGG BigQuery, Max(), Min(), etc allows users to quickly perform data-related tasks.

Since its public availability in 2011, Google BigQuery has been marketed as a one-of-a-kind Analytics Data Warehouse solution. Its serverless design enables it to operate at scale and speed, allowing it to perform analytics on massive datasets in a matter of seconds.

In this article, you will delve into the basics of the Cloud Data Warehouse from Google, understand its features, and also learn about the ARRAY_AGG BigQuery Functions with key usage and examples.

Prerequisites

  • Basic understanding of SQL.
  • Basic understanding of databases.

Introduction to Google BigQuery

Google BigQuery Logo

Google BigQuery is a Cloud-based Data Warehouse with a Big Data Analytic Web Service that can handle Petabytes of data during analytics. Designed for large-scale Data Analysis, BigQuery is divided into two parts: Storage and Query Processing. Google BigQuery is built on the Colossus File System and uses the Dremel Query Engine over REST interface to process queries. These two parts are separated and rely on Google’s Jupiter Network to transport data incredibly quickly from one location to another. 

Google BigQuery employs a Columnar Storage structure, which facilitates rapid query processing and great data compression capabilities. It may be used in conjunction with other Google products and services, such as Predictive Analytics, Data Imports, and Google Analytics, to enhance your workflow.

Key Features of Google BigQuery

Some of the key features of Google BigQuery are as follows:

  • Flexible Scaling: With BigQuery, you do not have to manually adjust the cluster since computational resources are automatically scaled according to the workload, it can scale storage to Petabytes on-demand with ease. Google BigQuery also handles Patching, Updates, Computing, and Storage Resource Scaling, making this a totally managed service.
  • Partitioning: A partitioned table is a unique table that is separated into parts, known as partitions, to make data management and querying easier. The decoupled Storage and Computation architecture of Google BigQuery uses column-based partitioning to reduce the amount of data read from disc by slot workers. Once slot workers have read their data from the disc, BigQuery can automatically find more optimum data sharding and instantly repartition data using BigQuery’s in-memory shuffle function.

Functions in Google BigQuery

There are a number of functions offered by BigQuery, viz:

  • Aggregate functions: They offer a summary of a table’s values. E.g., SUM(field) and COUNT(field).
  • Date Functions: These functions let you alter the format of dates, choose the required field (day, month, or year), and move the date by a specified interval. E.g. DATE(Timestamp).
  • String Functions: You may use String Functions to produce a string, pick and replace substrings, determine the length of a string and the index sequence of the substring in the original string. E.g. Lower(‘str’) and Upper(‘str’).
  • Window Functions: These functions are quite similar to Aggregate functions. The primary distinction is that Window Functions don’t calculate the whole set of data specified by the query but just on a subset or window of that data. E.g. MAX(), MIN().

Aggregate Functions in Standard SQL

Functions that summarize the rows of a group into a single value are known as Aggregate Functions. They are as follows:

  • ANY_VALUE(): Returns expression for some row chosen from the group.
  • ARRAY_AGG(): Returns an ARRAY of expression values. Later in this article, you will also learn about ARRAY_AGG BigQuery Functions.
  • ARRAY_CONCAT_AGG(): Concatenates items from an ARRAY expression and returns a single ARRAY as a result. The NULL items in non-NULL input arrays are recognized by this function, while the NULL elements in NULL input arrays are ignored.
  • AVG(): If the input contains a NaN, this function returns the average of non-NULL input values or NaN.
  • BIT_AND(): Returns the result of a bitwise AND operation on expression.
  • BIT_OR(): Returns the result of a bitwise OR operation on expression.
  • BIT_XOR(): Returns the result of a bitwise XOR operation on expression.
  • COUNT: Returns the number of different items in a given expression.
  • COUNTIF: This function returns the number of TRUE values in an expression.
  • LOGICAL_AND(): The logical AND of all non-NULL expressions is returned.
  • LOGIAL_OR(): Returns the logical OR of all non-NULL expressions.
  • MAX: Returns the maximum non-NULL value of an expression.
  • MIN: Returns the minimum non-NULL value of an expression.
  • STRING AGG(): Returns a value (STRING or BYTES) consisting of non-null values concatenated together.
  • SUM(): Returns the total of all non-zero values.

ARRAY_AGG BigQuery Functions

Before proceeding to learn about the use of ARRAY_AGG BigQuery Functions, you must know what an array is. In BigQuery, an array is an ordered list consisting of zero or more values of the same data type. An array can also include NULL values.

The ARRAY_AGG BigQuery Functions returns an ARRAY of expression values. It is basically the opposite of UNNEST (used to flatten an array into its components to make it usable for analysis and database operations).

Syntax for ARRAY_AGG BigQuery Functions

ARRAY_AGG(
  [DISTINCT]
  expression
  [{IGNORE|RESPECT} NULLS]
  [ORDER BY key [{ASC|DESC}] [, ... ]]
  [LIMIT n]
)
[OVER (...)]
  • OVER: Used to specify a window, which refers to a group of rows in a table on which a function is applied. All other clauses in ARRAY_AGG BigQuery are presently incompatible with this clause ().
  • DISTINCT: Each distinct expression value is only aggregated once in the result.
  • IGNORE NULLS or RESPECT NULLS: If IGNORE NULLS is provided, the NULL values will be ignored. The NULL values are included in the outcome if RESPECT NULLS is provided. The NULL values are included in the result if neither is supplied. 
  • ORDER BY: This specifies the order in which the values are displayed. For each sort key, the default sort direction is ASC.
  • LIMIT: Limits the number of expression inputs that can be used in the result. The limit n must be an INT64.

Example 1

SELECT FORMAT("%T", ARRAY_AGG(x LIMIT 5)) AS array_agg
FROM UNNEST([2, 1, -2, 3, -2, 1, 2]) AS x;

+-------------------+
| array_agg         |
+-------------------+
| [2, 1, -2, 3, -2] |
+-------------------+

Example 2

SELECT
  x,
  FORMAT("%T", ARRAY_AGG(x) OVER (ORDER BY ABS(x))) AS array_agg
FROM UNNEST([2, 1, -2, 3, -2, 1, 2]) AS x;

+----+-------------------------+
| x  | array_agg               |
+----+-------------------------+
| 1  | [1, 1]                  |
| 1  | [1, 1]                  |
| 2  | [1, 1, 2, -2, -2, 2]    |
| -2 | [1, 1, 2, -2, -2, 2]    |
| -2 | [1, 1, 2, -2, -2, 2]    |
| 2  | [1, 1, 2, -2, -2, 2]    |
| 3  | [1, 1, 2, -2, -2, 2, 3] |
+----+-------------------------+

Conclusion

In this article, you learned about what makes Google BigQuery fast and scalable and its other features. The article also offers a better idea about the ARRAY_AGG BigQuery Functions. This method is useful when we have a denormalized table (or query result) and need to aggregate all data labels into array format to get fewer rows. As a result, query performance is improved, allowing for faster and more efficient analysis of big data stored in pre-joined tables, especially from object-based schemas.

With your Data Warehouse, Google BigQuery live and running, you’ll need to extract data from multiple platforms to carry out your analysis. However, integrating and analyzing your data from a diverse set of data sources can be challenging and this is where Hevo Data comes into the picture.

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations such as Google BigQuery, with a few clicks.

Hevo Data with its strong integration with 150+ sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin?

step=email" target="_blank">Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Share your experience of learning about ARRAY_AGG BigQuery Functions. Let us know in the comments section below!

mm
Senior Customer Experience Engineer

Veeresh specializes in JDBC, REST API, Linux, and Shell Scripting. He excels in resolving complex issues, conducting brainstorming sessions, and implementing Python transformations, contributing significantly to Hevo's success.

No-code Data Pipeline for Google BigQuery