If you need a solution for analyzing nested data, BigQuery ROW_NUMBER is a fantastic choice. The BI Engine and BigQuery ML are both capable of analyzing tons of data quickly and efficiently. BigQuery also supports a variety of numbering functions.

At their core, numbering functions are simply a subgroup of analytic functions. Numbering functions simply assign integer values to each row, depending on their position in the specified window. For example, take a look at the table below:

xrankdense_rankrow_num
11
2222
2223
6344
ROW_Number (): For x=6 row_num is 4

Now, it’s important to understand the ROW_NUMBER function better before you go ahead. 

What is the BigQuery ROW_NUMBER Function?

The BigQuery ROW_NUMBER is function is one of the most commonly used functions in SQL. It was made available in SQL Server 2005, and all later versions too. When analyzing a results grid, Bigquery ROW_NUMBER simply adds a discrete incrementing number to the order. 

To determine the order, you must use the ORDER BY expression. This determines the order in which row numbers are added. In most cases, one or more columns are generally stated in the ORDER BY expression. But you can also use a more complicated expression or add a sub-query too. 

As a result, the function generates a constantly increasing integral value that begins at 1, with a higher value assigned for each subsequent row. 

More importantly, the ROW_NUMBER can also be paired with the PARTITION BY clause. So, when the partition limit is crossed, the counter is reset, and begins from 1 again. Partitions can have several values, such as 1,2, 3, and so on, and when it’s reset, the counter starts from 1, 2, and 3 again. 

To put it simply, the ROW_NUMBER generates a series of temporary values that are assigned to figures, and is calculated dynamically based on when the query is executed. Both the ROW_NUMBER and the RANK function are generally similar.

However, whereas the ROW_NUMBER gives a sequence of values starting from 1 (with 1 added incrementally), the RANK function repeats values that are tied.

For people who already have experience with Oracle, the ROW_NUM function might be a bit more familiar. It’s very similar in nature to a Pseudo-Column, except this one’s a bit more dynamic. 

Keep in mind that the ROW_NUMBER is an analytic function. Analytic functions are those which compute values over groups of rows, returning a single result for each of the rows. This function is commonly used for implanting top or bottom-N reports. 

Simplify BigQuery ETL Using Hevo’s No-code Data Pipeline

Looking for an easy way to centralize data to your BigQuery? Hevo Data is a No-code Data Pipeline platform that offers a fully managed solution to set up Data Integration for 150+ Data Sources (Including 40+ Free Sources) and will let you directly load data to a data warehouse like Google BigQuery or a destination of your choice securely, in real-time, and with zero data loss.

Sign up here for a 14-Day Free Trial!

Examples of the BigQuery ROW_NUMBER Function

To better understand the BigQuery ROW_NUMBER function, here’s a simple example. 

The sample table for office.employees assigns a number to each row based on when an employee left the office. 

SELECT office_id, last_name, employee_id, ROW_NUMBER ()
	OVER (PARTITION BY department_id ORDER BY employee_id) AS work_id
	FROM employees; 

This query selects the office_id, last_name, employee_id and generates a new column, work_id. The new column contains a unique number for each row, which is reset for each department_id because of the PARTITION BY clause and is ordered by the employee_id within each department.

department_idlast_nameemployee_idwork_id
5Byrne1211
8Michael1342
12Alex1552
18Faye1884
24Mila2296

Now, if you were to run the following inner-N query, it’ll simply select all rows from the table shown above but returns only rows between 5-20. 

SELECT last_name FROM
	(SLEECT last_name, ROW_NUMBER () OVER (ORDER BY last_name) R FROM employees)
	Where R BETWEEN 5 and 20;

This is a simple way to sort the data. You can also use the OVER clause in order to manipulate the data and get another clause in it. For instance, in the following example, we can get a list of all customers for an organization by projecting several columns, including the OrderID, the OrderDate, the OrderNumber, the TotalDue, and ROW_NUMBER. 

For the purposes of this example, the ROW_NUMBER function shall be applied to the CustID column. The values continue until the table reaches its end. Keep in mind that since the ORDER BY clause is not used this query, the order of the CustID column is not specific. 

USE BusinessSales2021;
GO
SELECT ROW_NUMBER () OVER (
	ORDER BY CustID) AS RowNum,
	CustID,
	OrderID,
	OrderDate,
	OrderNumber
	TotalDue
From TotalSales.SalesOrderHeader;
RowNumCustIDOrderIDOrderDateOrderNumberTotalDue
1120098002021-12-6ON49949800
2120087652021-11-8ON41005600
3120298942021-12-9ON87552322
4144091722021-8-8ON94002455
5156020222021-9-13ON12005600
6180085762021-10-8ON45456400

You can also use the ROW_NUMBER function with the Bigquery row_number PARTITION by clause. For instance ,if you add that on the CustID and OrderDate fields, the output will change. Let’s carry on from the example given above. 

Use BusinessSales2021;
GO
Select ROW_NUMBER () OVER (PARTITION BY CustID,
					DATEADD (MONTH, DATEDIFF (Month, 0, OrderDate), 0)
	ODER BY SubTotal DESC) AS MonthlyOrders,
	CustID,
	OrderID,
	OrderDate,
	OrderNumber
	Subtotal,
	TotalDue
FROM TotalSales.SalesOrderHeader;
DayOrdersCustIDOrderIDOrderDateOrderNumberTotalDue
1120098002021-12-6ON49949800
1120087652021-12-8ON41005600
1120298942021-12-9ON87552322
1144091722021-8-8ON94002455
1156020222021-9-13ON12005600
2180085762021-10-8ON45456400

In this query, a partition is created for the OrderDate and the CustID. For each unique combination, the ROW_Number will repeat itself. This makes it easy for data analysts to figure out which order placed more than one order on the same day. As you can see, CustID 1200 placed two orders in the month of December 2021. 

ROW_NUMBER in Snowflake, Databricks, BigQuery, and Redshift​

Most, if not all, modern data warehouses support ROW_NUMBER and other similar ranking functions; the syntax is also the same across them. Use the table below to read more on your data warehouse’s documentation for the ROW_NUMBER function.

Data warehouseROW_NUMBER support?
Snowflake                  ✅
Databricks                  ✅
Amazon Redshift                  ✅
Google BigQuery                  ✅
Data Warehouse ROW_NUMBER support

Download the Cheatsheet on How to Set Up High-performance ETL to BigQuery
Download the Cheatsheet on How to Set Up High-performance ETL to BigQuery
Download the Cheatsheet on How to Set Up High-performance ETL to BigQuery
Learn the best practices and considerations for setting up high-performance ETL to BigQuery

ROW_NUMBER function use cases

We most commonly see the ROW_NUMBER function used in data work to:

  • In SELECT statements to add explicit and unique row numbers in a group of data or across an entire table
  • Paired with the QUALIFY statement, filter CTEs, queries, or models to capture one unique row per specified partition with the ROW_NUMBER function. This is particularly useful when removing duplicate rows from a dataset (but use this wisely!).

This isn’t an extensive list of where your team may be using the ROW_NUMBER function throughout your dbt models, but it contains some common scenarios analytics engineers face day-to-day.

How To Overcome the Resources Exceeded Error

ROW_NUMBER is a commonly used function, but from time to time, you might get the error “Resources Exceeded” when using it. As the volume of your data continues to increase, there might be an excess number of elements in your dataset to use ORDER BY to bring them in a single partition.

However, to avoid this issue, you should consider using ARRAY_AGG(), since the ORDER BY is capable of dropping all of the data, apart from the top record on each GROUP BY.

Obviously, the query is likely to run a bit slower, but at least you’ll be able to circumvent the “Resources Exceeded” error. 

It’s important to note that ARRAY_AGG () is an aggregate function. These functions simply summarize rows within a group, condensing them to a single value.

How Versatile is ROW_NUMBER?

ROW_NUMBER can be leveraged for use in a variety of different situations, such as:

  • Identifying quality gaps in your data
  • Minimizing the available data
  • Ranking values in datasets
  • Optimizing sessionization
  • Managing queries associated with preferences. 

However, when you think about it, ROW_NUMBER is more than just a traditional function. It’s actually a window function with peculiar properties. And, before we go further, it’s important to talk about the arguments that the function can take. 

So, if you take a look at ROW_NUMBER’s conventional syntax, you’ll realize it doesn’t take direct arguments. The OVER clause must be used for adding an argument to the ROW_NUMBER clause. This argument is then referred to as a window. 

The window simply gives definition to a subset of the data that must be used for computing data. Apart from the OVER clause, other arguments that can be used with this function include partitions, rows, and orders. 

Typically, ROW_NUMBER is used for ranking different records. While there are other functions that can be used, such as DENSE_RANK and RANK, they are all slightly different. The reason why ROW_NUMBER is unique is because it returns a unique and constantly increasing ranking for each of the records. 

This ultimately becomes one of the function’s greatest advantages. Since data analysts already know that only one record can exist for each value of a ROW_NUMBER, they don’t have to worry about cardinalities when grouping the different queries together. 

Conclusion

As mentioned in the article, BigQuery ROW_NUMBER is one of the many functions that you can use to gain better visibility over your data and manipulate it to offer valuable insights into your business performance. If you are using BigQuery to store your data, you should absolutely consider adding a no-code data pipeline to seamlessly pull your data into one location.

Hevo does just that. Hevo is the only real-time ELT No-code data pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you export data from sources & load data to different destinations including data warehouses like Redshift and Snowflake. but also transform & enrich your data, & make it analysis-ready.

Visit our Website to Explore Hevo

Want to take Hevo for a spin? Sign Up for a 14-day free trial and simplify your Data Integration process. Do check out the pricing details to understand which plan meets all your business needs.

Najam Ahmed
Freelance Technical Content Writer, Hevo Data

Skilled in freelance writing within the data industry, Najam is passionate about simplifying the complexities of data integration and data analysis through informative content for those delving deeper into these subjects.

No-code Data Pipeline For Google BigQuery