BigQuery is a prominent Data Warehouse offered by Google. It provides enterprises with a Cloud platform where they can store their data and get scalability. BigQuery also comes with Data Analytics features, allowing its users to analyze their data and draw meaningful insights.
When using BigQuery, you will need to get summaries from your data. The summaries will normally be computed over a group of rows or columns. BigQuery provides different ways of doing this. For example, Aggregate and Window functions. BigQuery Aggregate Functions can be limited because they return a single value for multiple rows.
On the other hand, BigQuery Window Functions, which are also called Analytic functions, return a single value for every row. This means that these functions are very powerful and they can be very helpful to you when using BigQuery.
In this article, we will discuss and explain to you Google BigQuery Window Functions in detail.
Table of Contents
What is BigQuery Window Functions?
BigQuery Window Functions provide BigQuery users with a way to calculate values over a group of rows. Unlike Aggregate functions, which return a single value for all rows, BigQuery Window Functions return a single value for every row.
The BigQuery Window Functions allow you to query either a subset or a full set of rows from your dataset and return a value for every row of the results. They are normally used to calculate running totals, rolling averages, and removing duplicate rows from a dataset.
BigQuery Window Functions have an OVER clause that defines a window of rows around the row you are evaluating. For every row, the result of the window function is calculated using the selected group of rows as the input, in most cases by doing aggregation.
BigQuery Window Functions Syntax
BigQuery Window Functions take the following syntax, but you will not be using every clause in every window function:
function_name ( [ arguments ] ) OVER over_clause
over_clause:
{ named_window | ( [ over_window ] ) }
window_specification:
[ named_window ]
[ PARTITION BY partition_expression [, ...] ]
[ ORDER BY expression [ { ASC | DESC } ] [, ...] ]
[ window_frame ]
window_frame:
{ rows_range } { frame_start | frame_between }
rows_range:
{ ROWS | RANGE }
The parameters used in the syntax are described below:
- function_name: The name of the window function. For example, the numbering function RANK().
- arguments: The list of arguments specific to the window function. Some BigQuery window functions have arguments while others do not have.
- OVER: The keyword required in a window function.
- over_clause: A window that specifies a group of rows in a table on which to apply the window function.
- over_window: The specification for the window.
- window-frame: A clause defining the window frame for the window.
- rows_range: The logical range or physical rows for a window frame.
The following notation rules are very important when using BigQuery window functions:
- Square brackets [ ] indicate optional clauses.
- Parentheses ( ) indicate literal parentheses.
- The vertical bar | indicates a logical OR.
- Curly braces { } enclose options.
- A comma, followed by an ellipsis inside square brackets [, … ] shows that the preceding item may repeat in a comma-separated list.
Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 100+ Data Sources including 40+ Free Sources. It is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination.
Hevo loads the data onto the desired Data Warehouse such as Google BigQuery in real-time and enriches the data and transforms it into an analysis-ready form without having to write a single line of code. Its completely automated pipeline, fault-tolerant, and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
GET STARTED WITH HEVO FOR FREE
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Simplify your BigQuery ETL with Hevo today!
SIGN UP HERE FOR A 14-DAY FREE TRIAL!
Using BigQuery Window Functions
In this section, we will be giving practical examples of how to use the BigQuery window functions.
ROW NUMBER
Consider this scenario: You have a table with extracted data from the database (PostgreSQL, MySQL, etc). The extractions are done daily and a daily snapshot of the current value is stored in the database. The last update field shows the last ingestion time.
The goal is to get the data that contains only the latest status of every id to be part of the analysis. To demonstrate this, we will use data from BigQuery public data fcc_political_ads.
SELECT
fileName AS id,
historyStatus AS status,
lastUpdatedTimestamp AS last_update_time
FROM
`bigquery-public-data.fcc_political_ads.file_history`
ORDER BY
1,
3 DESC;
The results will show that for every identifier, there is a status value. However, we only need to use the most recent ones in the analysis.
Image Source
We can use the ROW NUMBER window function to assign a rank to each row based on the recency of the update. The following query demonstrates this:
SELECT
fileName AS id,
historyStatus AS status,
lastUpdatedTimestamp AS last_update_time,
ROW_NUMBER() OVER (PARTITION BY fileName ORDER BY lastUpdatedTimestamp DESC) AS row_number
FROM
`bigquery-public-data.fcc_political_ads.file_history`
ORDER BY
1,
3 DESC;
The query returns the following result:
Image Source
Now that you have the new identifier field, you can use it to filter the row with the most, recent update:
SELECT
* EXCEPT(row_number)
FROM (
SELECT
fileName AS id,
historyStatus AS status,
lastUpdatedTimestamp AS last_update_time,
ROW_NUMBER() OVER (PARTITION BY fileName ORDER BY lastUpdatedTimestamp DESC) AS row_number
FROM
`bigquery-public-data.fcc_political_ads.file_history`
ORDER BY
1,
3 DESC )
WHERE
row_number = 1;
The query will return the following result:
Image Source
LAG & LEAD
These window functions can be very helpful to you if you are working with a daily event tracker. You can use the LAG & LEAD function to determine how numbers have changed between yesterday and today. Although these two BigQuery window functions work in reverse orders, they give the same results.
To demonstrate how to use these BigQuery window functions, we will use BigQuery public data from the `bigquery-public-data.covid19_geotab_mobility_impact_eu.city_congestion` table.
Image Source
The goal is to calculate the difference in percentage congestion between today and yesterday.
First, let’s create a helper field using the LAG function. We will give it the name yest_percent_congestion:
SELECT
city_name,
date_time,
percent_congestion,
LAG(percent_congestion, 1) OVER (PARTITION BY city_name ORDER BY DATE_TIME) AS yest_percent_congestion
FROM
`bigquery-public-data.covid19_geotab_mobility_impact_eu.city_congestion`
ORDER BY
1,
2;
The LAG has been applied by city_name and ordered by DATE_TIME. Let’s now calculate the difference between today’s and yesterday’s percentage:
SELECT
*,
percent_congestion - yest_percent_congestion AS diff_percent_congestion
FROM (
SELECT
city_name,
date_time,
percent_congestion,
LAG(percent_congestion, 1) OVER (PARTITION BY city_name ORDER BY DATE_TIME) AS yest_percent_congestion
FROM
`bigquery-public-data.covid19_geotab_mobility_impact_eu.city_congestion`
ORDER BY
1,
2);
The first data for every city will have a null because of the lack of yesterday’s data to start with.
The LEAD function can be used in the same way. However, it takes the next value rather than the previous value.
FIRST VALUE & LAST VALUE
These BigQuery window functions are used when the value being tracked doesn’t change daily. You can use the function to get the difference between the current and the last recorded values.
WITH
reworked_data AS (
SELECT
city_name,
date_time,
CASE
WHEN EXTRACT(HOUR FROM date_time) IN (0, 1, 2, 3, 4, 5, 10, 11, 14, 15, 16) THEN NULL
ELSE
percent_congestion
END
AS percent_congestion
FROM
`bigquery-public-data.covid19_geotab_mobility_impact_eu.city_congestion`
ORDER BY
1,
2 )
SELECT
*,
LAST_VALUE(percent_congestion IGNORE NULLS) OVER (PARTITION BY city_name ORDER BY date_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS last_congestion_percent
FROM
reworked_data;
Check out the BigQuery Window Functions Document to know more functions in detail.
Conclusion
This is what you’ve learned in this article:
- You’ve learned more about BigQuery Window Functions.
- You’ve learned the 3 types of BigQuery Window Functions and how to apply them in real-life scenarios.
However, extracting complex data from a diverse set of data sources like Databases, CRMs, Project Management Tools, Streaming Services, Marketing Platforms to your Google BigQuery can be quite challenging. This can be quite challenging. This is where a simpler alternative like Hevo can save your day!
For any information on BigQuery Partition, you can visit the former link.
Hevo Data is a No-Code Data Pipeline that offers a faster way to move data from 100+ Data Sources including 40+ Free Sources, into your Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.
VISIT OUR WEBSITE TO EXPLORE HEVO
Want to take Hevo for a spin?
SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Share your experience with BigQuery Window Functions in the comments section below!