Snowflake Row Number: 4 Critical Aspects

• July 23rd, 2020

SNOWFLAKE ROW NUMBER

In this article, we are going to work through some examples of using the Snowflake ROW_NUMBER() analytic function to solve some business questions. We will see how to use this function to work with Snowflake row numbers. By the end, you will be able to understand the concepts behind the ROW_NUMBER() function.

Table of Contents

Introduction to Snowflake

Snowflake Row Number: Snowflake Logo
Image Source

Snowflake is a fully managed service that provides customers with near-infinite scalability of concurrent workloads to easily integrate, load, analyze, and securely share their data. Its common applications include data lakes, data engineering, data application development, data science, and secure consumption of shared data. 

Snowflake uses AWS, Azure, or GCP to host its services and provides an intuitive user interface that allows customers to perform analytics. It uses SQL to query the data that runs on its virtual machines.

Snowflake’s unique architecture natively integrates computing and storage. This architecture enables you to virtually enable your users and data workloads to access a single copy of your data without any detrimental effect on performance. Snowflake makes it possible by abstracting the complexity of underlying Cloud infrastructures.

Here are a few features of Snowflake as a Software as a Service (SaaS) offering:

  • Snowflake is a Cloud Data Warehouse operating on a SaaS (software-as-a-service) model.
  • Snowflake’s infrastructure is built on AWS (Amazon Web Services) and it offers an architecture that is based on a new SQL database engine that enables faster, more dynamic queries with performant data integration.
  • Snowflake’s analytics layer integrates with modern tools and services that facilitate and optimize Data Science exploration.

Key Features of Snowflake

Here are a few key features of Snowflake:

  • Scalable: Snowflake provides a secure, scalable architecture that can spin up an unlimited number of independent virtual warehouses in a few minutes. You can execute several parallel jobs without worrying about performance and memory management.
  • Pay Per Use Model: Snowflake is available as pay per usage; i.e., you only have to pay for the time you use the services offered by Snowflake. You can shut down the warehouses once the execution gets completed, to save costs.
  • High Processing Speed: Each Virtual Warehouse in Snowflake is associated with the MPP cluster (Massive Parallel Processing) that performs the parallel execution of a job without degrading other clusters’ performance.
  • Separate Storage and Compute Layer: Snowflake uses different storage and computes layers that can scale up or down without affecting the other.
  • Disaster Recovery: In Snowflake, data is replicated three times (by default) across the availability zones and regions, hence providing a complete fail-safe and fault-tolerant system.

To learn more about Snowflake, visit here.

Introduction to Snowflake ROW_NUMBER() Function

The ROW_NUMBER() is an analytic function that generates a non-persistent sequence of temporary values which are calculated dynamically when the query is executed. The ROW_NUMBER() function assigns a unique incrementing number for each row within a partition of a result set. The row number starts at 1 and continues up sequentially, to the end of the table.

Hevo, A Simpler Alternative to Move your Data to Snowflake

Hevo Data, a No-code Data Pipeline, provides you with a platform to export data from any source to Snowflake.

Get Started with Hevo for Free

It helps you move and transform data in real-time and provides state-of-the-art infrastructure. Some of the salient features of Hevo include:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors: Hevo supports 100+ integrations to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, MongoDB, TokuDB, DynamoDB, PostgreSQL databases to name a few. 
  • Scalable Infrastructure: Hevo has in-built integrations for 150+ data sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
Sign up here for a 14-Day Free Trial!

Snowflake Row Number Syntax and Arguments

row_number() over (
  [ partition by <expression1> [, <expression2> ... ] ]
  order by <expression3> [ , <expression4> ... ] [ { asc | desc } ]
  )

Snowflake Row Number Syntax: OVER

  • The OVER clause defines the window or set of rows that the ROW_NUMBER() function operates on. 
  • The possible components of the OVER clause are ORDER BY (required), and PARTITION BY (optional).

Snowflake Row Number Syntax: PARTITION BY

  • The PARTITION BY clause divides the rows into partitions (groups of rows) to which the function is applied.
  • The PARTITION BY clause is optional. When you omit it, the ROW_NUMBER() function will treat the whole result set as a single partition/group.

Snowflake Row Number Syntax: ORDER BY

  • The ORDER BY clause defines the sequential order of the rows within each partition of the result set.
  • The ORDER BY clause is required, you must include it because the ROW_NUMBER() function is order sensitive.

Snowflake Row Number Syntax: Expression1 and Expression2

  • expression1 and expression2 specify the column(s) or expression(s) to partition by. You can partition by one or more columns or expressions.
    For example, suppose that you are selecting data across multiple states (or provinces) and you want row numbers from 1 to N within each state; in that case, you can partition by the state.

Snowflake Row Number Syntax: expression3 and expression4

  • expression3 and expression4 are a component of the Snowflake Row Number specify the column(s) or expression(s) to use to determine the order of the rows. You can order by 1 or more expressions.

Snowflake ROW_NUMBER() Examples

In this section, we’ll take a look at some practical examples of using the Snowflake row number function to answer critical business questions. We’ll use sample data from the TPC-H dataset. The TPC Benchmark™ H (TPC-H) specification has broad industry-wide relevance.

snowflake row number - tpc-h schema

The following statement uses the row number function to return the row number, customer name, and account balance from the customer table. The row number values are assigned based on the order of account balance.

use schema snowflake_sample_data.tpch_sf1;   -- or snowflake_sample_data.{tpch_sf10 | tpch_sf100 | tpch_sf1000}
select name, c_acctbal, row_number()
  over (order by c_acctbal desc)
  from customer;

Here is the partial output:

Next, let’s look at how to assign row numbers within partitions. For example, we can get the most expensive line item partitioned by the supplier key using the following query:

with cte_lineitem as (
select 
    row_number() over(
        partition by ps_suppkey
        order by p_retailprice desc
    ) row_num,
    ps_suppkey,
     p_name,     
    p_retailprice
from 
    p_part
)
select * from cte_lineitem
where row_num = 1;

Here is the partial output:

The following statement returns the emails of employees in ascending order based on the date of hire. It shows the row number, email and hire date from the employee table (the employee who was hired first will have row number 1).

select email, hire_date, row_number()
over (order by hire_date asc)
from employees 

Using Snowflake ROW_NUMBER() Function to Get the Top-N Query 

Let’s switch to a simple inventories table with the following schema.

    INVENTORIES
*PRODUCT_ID
*WAREHOUSE_ID  
PRODUCT_NAME

To get the most abundant product in a warehouse, you can use the ROW_NUMBER() function of Snowflake Row Number as shown in the following query:

with list_products as (
select 
    row_number() over(
        partition by warehouse_id
        order by quantity desc
    ) row_num, 
    warehouse_id,
    product_name, 
    quantity
from 
    inventories
)
select * from list_products
where row_num = 1;

Here is the output:

In this example:

  • First, the PARTITION BY clause divided the rows into partitions by warehouse id.
  • Then, the ORDER BY clause sorted the products in each warehouse by quantity in descending order.
  • Next, the ROW_NUMBER() function is applied to each row in a specific warehouse id. It re-initialized the row number for each category.
  • After that, the outer query selected the rows with row number 1 which is the most expensive product in each warehouse.

To get more than one product with the same N-highest prices with Snowflake Row Number, you can use the RANK() or DENSE_RANK() function.

Conclusion

So far, we have reviewed the Snowflake ROW_NUMBER function in detail. You covered a few examples of how to use the Snowflake ROW_NUMBER() function to make useful queries such as inner-N, top-N, and bottom-N.

Visit our Website to Explore Hevo

You can make more of your insights by centralizing your various data sources into Snowflake faster with Hevo Data. Hevo supports real-time streaming from 150+ data sources and allows you to easily replicate these sources into your Snowflake data warehouse with just a few clicks. Our data ingestion tool stands out in terms of ease of use, the fact that it is zero maintenance, and that we have a knowledgeable and highly responsive customer support team that supports multiple time zones.

If you are interested in trying out Hevo for free, Sign Up for a 14-day free trial here.

Check out the short video to get a product overview.

Please share your thoughts on Snowflake ROW_NUMBER() function in the comments!

No-Code Data Pipeline for Snowflake