Google BigQuery is an essential tool that enterprises can employ to store, analyze, and query data. The name evolved from the term Big Data, with a focus on overcoming the biggest challenge: supporting all kinds of data structures, i.e. structured, semi-structured, and unstructured data, in a single database. 

In this article, different aspects of Google BigQuery are explained along with its key features. The article also dives deep into the various BigQuery Functions available for your convenience.

It will also discuss the BigQuery Partition and the various types of Analytic BigQuery Functions. Read along to understand the tools that Google BigQuery has to offer!

What is BigQuery?

In 2011, Google introduced BigQuery to sharpen business agility. It is a cloud-based platform built on top of Bigtable. BigQuery is an enterprise-grade data warehouse that is serverless and fully managed. 

BigQuery performs exceptionally well on terabytes and even petabytes of data, returning results within seconds and without any degradation in performance. It is flexible enough to work on datasets of variable size, and its specialty is executing complex queries on petabyte-scale databases. 

Although the term BigQuery originates from Big Data, you interact with it primarily through SQL queries. Additionally, BigQuery is accessible from the Web UI and from client libraries for multiple languages like C#, Go, Java, Node.js, Ruby, Python, PHP, etc. 

Key Features of BigQuery

It is because of its distinct features that BigQuery is in such demand in the market today. BigQuery offers contemporary features that not only enhance productivity but also strengthen security and integrity.

  • Cost-Effective: BigQuery offers a flexible pricing model, and beginners who want to try out its operations are provided with free tier usage. It charges for certain operations and for using the BigQuery Storage APIs, and its pricing depends upon two components: analysis and storage. 
  • Faster SQL: BigQuery supports ANSI SQL, including sub-queries, which increases concurrency, results in high throughput, and improves response times. 
  • Data Encryption & Security: As BigQuery is a cloud-based platform, there is a potential risk to data and personal information. Therefore, the data is encrypted and integrated with Cloud Identity and Access Management for security. 
  • Effective Storage: BigQuery stores data very efficiently. Once the data is loaded into BigQuery, it is processed automatically. 
  • BigQuery Machine Learning: BigQuery ML allows Data Scientists and Data Analysts to perform activities like building, training, and testing models. Using SQL query syntax, you can fetch the results directly from BigQuery. 
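As a brief illustration of that last point, a regression model can be created and applied entirely in SQL. This is a hedged sketch: the dataset, table, and column names below are hypothetical placeholders, not objects from this article.

create or replace model mydataset.sample_model
  -- assumption: training_data has numeric feature columns and a "label" column
  options(model_type='linear_reg', input_label_cols=['label']) as
select feature_1, feature_2, label
from mydataset.training_data;

-- apply the trained model to new rows
select *
from ml.predict(model mydataset.sample_model,
  (select feature_1, feature_2 from mydataset.new_data));

ML.EVALUATE can be used in the same way to test the model before predicting.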

What is the BigQuery Partition?

Traditionally, users of Google BigQuery could only split large datasets into smaller tables delimited by time and date. This method involved complex data management to handle the multiple splits manually.

To simplify this process, Google BigQuery launched the Partitioned Table. This table contains all the functionality of an ordinary Google BigQuery table, and it is automatically partitioned for each date.

BigQuery Partitions enable you to query your data easily and manage it while maintaining the best possible performance. It also lets you control your Google BigQuery spending.

Using BigQuery partition, you can automatically create smaller blocks from a single table for specific criteria. For example, each day Google BigQuery might create a new table that contains data only for that particular day.
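For instance, a date-partitioned table can be declared in a single DDL statement; the table and column names here are illustrative only:

create table mydataset.daily_sales (
  sale_id int64,
  sale_date date,
  amount numeric )
-- one partition is created automatically per value of sale_date
partition by sale_date;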

How Does the BigQuery Partition Work?

Using the Google BigQuery Partition, you can request data for a specific date by running a suitable query and get fast results at a low price. This is because BigQuery reads only a particular partition, or a small set of partitions, in response to a query, instead of scanning the entire table. This has the following 2 main advantages:

The system can read smaller tables much faster, thus providing better performance. Therefore, you can even execute complex queries much more efficiently at runtime.

Google BigQuery’s cost plan relies on query size and data transfers. This way, you can save money by scanning only the required data instead of processing huge data chunks.
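To make that concrete, a query that filters on the partitioning column prunes all non-matching partitions, so only one day of data is scanned and billed. The table and column names below are illustrative:

select sum(amount) as total_amount
from mydataset.daily_sales
-- the filter on the partitioning column limits the scan to a single partition
where sale_date = date '2021-06-01';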

What are Analytic BigQuery Functions?

Analytic BigQuery Functions, or Window Functions, are a collection of functions that allow you to perform a computation over multiple rows and return one result for each of these rows.

Analytic Functions are important for situations in which you have to calculate metrics like moving averages and cumulative sums over a set of rows as part of your Data Analytics. An Analytic Function is designed to always return a single value for each row.

This is different from the Aggregate BigQuery Functions, which return a single value for a whole set of rows. Analytic Functions are very useful, and once you’ve passed the learning curve, you’ll be surprised by their capabilities. Some common use cases are:

  • Computing Running Total
  • Computing Moving Average
  • Ranking Rows by Custom Criteria
  • Calculating the Year-over-Year % Change
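For example, the first use case above, a running total, only needs SUM applied as an analytic function with an ORDER BY in its OVER clause. The table and column names here are illustrative:

select
  order_date,
  amount,
  -- cumulative sum of amount from the first date up to the current row
  sum(amount) over (order by order_date) as running_total
from mydataset.orders
order by order_date;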

Types of Analytic BigQuery Functions

The Analytic BigQuery Functions are of the following 3 types:

ROW NUMBER

This Analytic Function is beneficial when you have extracted data from a MongoDB or MySQL database and stored it in a table. Moreover, suppose this extraction is a daily routine, and a snapshot of the current values from the database is transferred to the table regularly.

Now, the ROW_NUMBER function can provide you with the rows that carry the most recent status for all the unique_ids required for your analysis.

-- Step 1: number the rows per unique_id, most recent first
SELECT
  fileName AS unique_id,
  historyStatus AS latest_status,
  lastUpdatedTimestamp AS last_update,
  -- we want to group the calculation by fileName (unique_id),
  -- then give each row a number based on recency (descending order)
  ROW_NUMBER() OVER (PARTITION BY fileName ORDER BY lastUpdatedTimestamp DESC) AS row_number
FROM
  `bigquery-public-data.fcc_political_ads.file_history`
ORDER BY
  1,
  3 DESC

-- Step 2: keep only the most recent row per unique_id
SELECT
  -- select all fields except the helper field
  * EXCEPT(row_number)
FROM (
  SELECT
    fileName AS unique_id,
    historyStatus AS latest_status,
    lastUpdatedTimestamp AS last_update,
    -- we want to group the calculation by fileName (unique_id),
    -- then give each row a number based on recency (descending order)
    ROW_NUMBER() OVER (PARTITION BY fileName ORDER BY lastUpdatedTimestamp DESC) AS row_number
  FROM
    `bigquery-public-data.fcc_political_ads.file_history` )
WHERE
  -- filter only the most recent rows, identified by row number 1
  row_number = 1

LAG and LEAD

You can use this Analytic Function if you are dealing with a daily event tracker. The LAG & LEAD functions enable you to calculate the change in numbers between today and yesterday. Keep in mind that LAG and LEAD look in opposite directions along the window, so with reversed sort orders they deliver the same results.

-- Step 1: bring the previous value alongside each row
SELECT
  city_name,
  date_time,
  percent_congestion,
  -- apply windowing with the LAG function, looking back 1 row;
  -- the LAG is applied per city_name, ordered by date_time
  LAG(percent_congestion, 1) OVER (PARTITION BY city_name ORDER BY date_time) AS prev_percent_congestion
FROM
  `bigquery-public-data.covid19_geotab_mobility_impact_eu.city_congestion`
ORDER BY
  1,
  2

-- Step 2: compute the difference against the previous value
SELECT
  *,
  -- simple subtraction: current value - previous value
  percent_congestion - prev_percent_congestion AS percent_congestion_diff
FROM (
  SELECT
    city_name,
    date_time,
    percent_congestion,
    -- apply windowing with the LAG function, looking back 1 row;
    -- the LAG is applied per city_name, ordered by date_time
    LAG(percent_congestion, 1) OVER (PARTITION BY city_name ORDER BY date_time) AS prev_percent_congestion
  FROM
    `bigquery-public-data.covid19_geotab_mobility_impact_eu.city_congestion` )
ORDER BY
  1,
  2

FIRST VALUE and LAST VALUE

The FIRST_VALUE & LAST_VALUE functions are also useful for calculating differences in the values of events or congestion. However, unlike the previous functions, they are applicable to situations in which the value being tracked does not change daily. Therefore, this Analytic Function can give you the difference between the current and the last recorded values.

WITH
  reworked_data AS (
  SELECT
    city_name,
    date_time,
    -- for example, assume there is no congestion at 00-05 AM, 10-11 AM, and 02-04 PM
    CASE
      WHEN EXTRACT(HOUR FROM date_time) IN (0, 1, 2, 3, 4, 5, 10, 11, 14, 15, 16) THEN NULL
    ELSE
    percent_congestion
  END
    AS percent_congestion
  FROM
    `bigquery-public-data.covid19_geotab_mobility_impact_eu.city_congestion` )
SELECT
  *,
  -- use LAST_VALUE with IGNORE NULLS to get the latest non-null value per city_name;
  -- notice that we add UNBOUNDED PRECEDING AND 1 PRECEDING:
  -- this window frame is used by each row to perform the evaluation;
  -- UNBOUNDED PRECEDING means it evaluates all of the previous rows,
  -- 1 PRECEDING means it evaluates up to the previous row
  LAST_VALUE(percent_congestion IGNORE NULLS) OVER (PARTITION BY city_name ORDER BY date_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS last_percent_congestion
FROM
  reworked_data
ORDER BY
  1,
  2

Other BigQuery Functions

In 2020, Google BigQuery launched several new functions that optimize small tasks to streamline your work. These 12 user-friendly, SQL-based BigQuery functions are as follows:

Add Table Columns via DDL

This feature allows developers to add columns to their tables anytime, directly using a DDL command. Its syntax is as follows:

alter table mydataset.mytable
add column a string,
add column if not exists b geography,
add column c array<numeric>,
add column d date options(description="my description") 

TRUNCATE Table

This function allows developers to remove all rows from a table with a single command and, unlike the DELETE command, without any extra scanning cost. Its syntax is as follows:

TRUNCATE TABLE [[project_name.]dataset_name.]table_name

Unicode Table Naming

Developers can now name their tables in a more creative way and are no longer limited to using only letters and numbers. Moreover, multilanguage support is available for table names.
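For example, a table name containing non-Latin characters can now be created and referenced by quoting it in backticks. The name below is purely illustrative:

create table `mydataset.Verkäufe_2020` (
  id int64,
  description string );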

Create external tables using a DDL statement

This function allows developers to quickly import files in different formats from Google Cloud Storage and use them to build an external table via DDL commands. Its syntax is as follows:

create external table dataset.CsvTable options(
format = 'CSV',
uris = ['gs://bucket/path1.csv', 'gs://bucket/path2.csv']);

EXPORT DATA

This is a powerful function that enables developers to transfer data from Google BigQuery to Google Cloud Storage using simple SQL commands. Its syntax is as follows:


export data options(
uri='gs://bucket/folder/*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') as
select field1, field2
from table1
order by field1 limit 10

EXECUTE IMMEDIATE

This function allows developers to write dynamic SQL code. However, use it sparingly, as the resulting code can be complex and very hard to debug. Its syntax is as follows:

-- create a temporary table called books
execute immediate
"create temp table books (title string, publish_date int64)";

-- add a row for Hamlet
execute immediate
"insert into books (title, publish_date) values('Hamlet', 1599)";

Authorized User-Defined Functions

These functions allow authorized consumers to extract data from tables using SQL queries, even if the user who calls the User-Defined Function does not have access to those tables. Google BigQuery already had User-Defined Functions; the new feature extends their usability by providing data access to authorized users.
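As a hedged sketch of how this fits together: the function itself is ordinary DDL, and the dataset that owns the underlying table must then mark the routine as authorized (via the Cloud Console or API). All dataset, table, and function names below are hypothetical:

-- a UDF in a shared dataset that reads a private table
create or replace function shared_dataset.active_user_count()
returns int64 as (
  (select count(*) from private_dataset.users where is_active)
);

-- once the routine is authorized on private_dataset, callers who can
-- access shared_dataset can run it without access to private_dataset.users:
select shared_dataset.active_user_count();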

Duplicate Column Names in Query Results

This feature allows developers to select the same column more than once in a single query. BigQuery disambiguates the duplicates by appending a suffix to each duplicate column name in the result.
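For example, a query that projects the same column twice now runs, and the duplicate appears in the result schema under a suffixed name. The table and column names here are illustrative:

select name, name
from mydataset.customers;
-- the result schema contains two columns, e.g. name and name_1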

New BigQuery LAST_DAY DATE Function

Apart from the DATE functions of Google BigQuery, you can now also use its LAST_DAY function to extract the last day of a certain period, such as a month or a week. Its syntax is as follows:


select last_day(date '2020-10-23',
       week(monday)) as last_day_of_week
-- returns 2020-10-25

Date Arithmetic in BigQuery

This feature allows you to use simple + and – operators to perform date calculations. It is a simpler form of the earlier DATE_ADD and DATE_SUB functions. Its syntax is as follows:


-- you can now do this
select date '2020-10-23' + 2
-- returns 2020-10-25

New BigQuery String Functions

Google BigQuery has launched 14 new functions that enable you to perform operations on strings. An example of one of them is as follows:


-- the new way
select 'The ' || 'quick ' || 'brown ' || 'fox' as quip

-- the old way
select concat('The ', 'quick ', 'brown ', 'fox') as quip

Expanded INFORMATION SCHEMA

This function streamlines the process of acquiring information from the metadata of multiple entities and objects. Its syntax is as follows:


select table_name
from mydataset.INFORMATION_SCHEMA.TABLES
where table_name like 'scenario%'

Limitations of BigQuery Functions

Now that the above sections have explained the various BigQuery Functions, it’s time to look over the challenges that you might face while using User-Defined Functions. 

  • JavaScript User-Defined Functions don’t support Document-Object-Model objects such as Window, Document, and Node, or functions that need them.
  • JavaScript functions that depend upon native code might fail.
  • UDFs are subject to applicable rate limits and quota limits. 

Conclusion

This article introduced you to Google BigQuery and explained its features. It also covered every aspect you need to know about BigQuery Functions: their types, their components, and their limitations. Additionally, it gave you hands-on experience of using BigQuery Functions with the help of examples. Through this article, you can grasp some handy tips for using BigQuery. 

While using User-Defined Functions is insightful, setting up the proper environment is a hectic task. To make things easier, Hevo comes into the picture. Hevo Data is a No-code Data Pipeline and offers 150+ pre-built Integrations that you can choose from.

Hevo can help you integrate your data from numerous sources and load it into a destination like BigQuery to analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about BigQuery Functions in the comments section below!

Nitin Birajdar
Lead Customer Experience Engineer

Nitin, with 9 years of industry expertise, is a distinguished Customer Experience Lead specializing in ETL, Data Engineering, SAAS, and AI. His profound knowledge and innovative approach in tackling complex data challenges drive excellence and deliver optimal solutions. At Hevo Data, Nitin is instrumental in advancing data strategies and enhancing customer experiences through his deep understanding of cutting-edge technologies and data-driven insights.