BigQuery Quantiles & Percentile: Using APPROX_QUANTILES Simplified 101

By: Published: March 31, 2022

BigQuery Quantiles

Quantiles and Percentiles are mathematical techniques related to statistical analysis and machine learning. They are important concepts because, in statistical analysis, most of the distribution of numeric variables that you’ll be observing in your data sets will often have a highly skewed or asymmetrical distribution. This could be attributed to outliers in the data, highly exponential distributions, and multi-modal distributions, among other reasons. 

It gets difficult to work with too much skewness and numerous statistical techniques. Machine Learning algorithms perform better when numerical input variables have a uniform probability distribution, and quantiles and percentiles expose a way to transform a numeric input variable to have a standard data distribution. The result, in turn, can be used as input to a predictive model.

In this post, you will learn how to use Google BigQuery quantiles to explore distributions in data sets. Code examples are given for their application in the context of sorted lists.

Table of Contents

What is Google BigQuery?

Google BigQuery Logo - BigQuery Quantiles
Image Source

Google BigQuery is a Cloud Data Warehouse service that is a part of the Google Cloud Platform (GCP). It uses Dremel technology to store data in a Columnar Storage structure which also enables Google BigQuery to offer fast query processing and high data compression features.

Moreover, Google BigQuery can easily integrate with other Google Cloud services and leverage features like Cloud Functions, Machine Learning, etc for enhanced data Analysis and Visualization. It supports SQL language that allows users to easily access and manipulate data from the Data Warehouse.

Key Features of Google BigQuery

Some of the main features of Google BigQuery are listed below:

  • BigQuery Omni: With the help of Google BigQuery Omni, users can access and analyze data across the cloud using SQL language support from the Google BigQuery interface.
  • BI Engine: to deliver sub-second query responses, Google BigQuery allows users to use in-memory analysis services to analyze large datasets in the Data Warehouse itself.
  • Integrations: Google BigQuery provides users with one-click integrations with other Google products and partnered third-party platforms for easy data access.
  • Standard SQL Support: Google BigQuery supports SQL dialect ANSI:2011 that reduces the requirement to rewrite the code and also offers ODBC and JDBC drivers support.

To learn more about Google BigQuery, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 150+ data sources and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Get Started with Hevo for Free

What is Google BigQuery Quantiles and Percentiles?

To understand Google BigQuery Quantiles, you first need to understand the term Percentile. A percentile is a measure used in statistics to measure the central tendency and spread within a distribution. It basically indicates the value below which a specific percentage of observations in a group of observations falls. For example, the 90th Percentile is the value below which 90% of the observations may be found.

Percentiles are made up of 99 points that divide any data set or distribution into 100 equal parts. These are denoted as P1, P2, …P99. This means that 1% of the observations, 2% of the observations,….99% of the observations are either equal to or fall below P1, P2,…P99 respectively.

From that, we can deduce that Percentile basically says where do you lie in a sorted list. 

Example:

P(X <= Qi) = i/100 for i = 1,2,3,…99

Then the value at the 10th index (P10) in the sorted list gives the 10th Percentile. What this Percentile tells us is basically 10% of the values are less than this point and 90% of values are more than this point. Here, the 50th Percentile is referred to as the Median.

Google BigQuery Quantile on the other hand is something that partitions the values in a sorted list into equal-sized subgroups of adjacent values or a probability distribution into distributions of equal probability. Based on these definitions, you can understand that just like Percentiles, Quantiles define how many values in a distribution are above or below a certain limit. 

Example:

zp will be a Quantile of order p if;

P(X <= zp) = p for 0 < p < 1

Percentiles range from 0 to 100 while Quantiles range from 0 to 1. A Quantile that partitions a distribution into 100 parts is a Percentile. Therefore, a Percentile is a type of Quantile. The 0.99 Quantile is the 99th Percentile.

The Median is also technically a Quantile because it splits the data into two groups that contain the same number of data points. The Median is also known as the .5 Quantile or the 50th Percentile.

However, unlike Percentiles, Quantiles are infinitely divisible. With the help of this model, you are not confined to predicting the median rather you can also predict unlimited types of Percentiles as per your objective or situation.

Example:

  • 0.25 quantile = 25 percentile
  • .5 quantile = 50 percentile (median)
  • .75 quantile = 75 percentile
  • 1 quantile = 100 percentile (max value)

Other types of Quantiles include:

  • 2-quantiles are called median.
  • 4-quantiles are called quartiles.
  • 5-quantiles are called quintiles.
  • 8-quantiles are called octiles.
  • 10-quantiles are called deciles.
  • 20-quantiles are called vigintiles.
  • 100-quantiles are called percentiles.
  • 1000-quantiles are permilles.

Therefore, if you divide a distribution into five equal partitions, you will speak of five quintiles. The first quintile includes all values that are smaller than 1/5 of all values. In a graphical representation, it corresponds to 20% of the total area of distribution.

You can also think of the Quantile as a transformation. If you express measurements in Google BigQuery Quantiles, they always have a uniform distribution from zero to one, regardless of the underlying distribution.

Often Google BigQuery Quantile measurements are more meaningful than raw numbers, that’s the theory behind grading on a curve.

Example:

The following sorted list that contains weights in grams is given below.

[3,4,4,6,7,9,12,13,14,16,17,19,22,23,23,25,28,29,34,37]

Using standard SQL you can find the Google BigQuery Quantiles & Percentile for this dataset.

PercentileQuantileValue
003
25.258.5
50.516.5
75.7523.5
100137

Here’s how to interpret these values:

  • The 0 percentile and 0 quantile is 3.
  • The 25th percentile and .25 quantile is 8.5.
  • The 50th percentile and .5 quantile is 16.5.
  • The 75th percentile and .75 quantile is 23.5.
  • The 100th percentile and 1 quantile is 37.

When to Use Google BigQuery Quantiles

Generally, Google BigQuery Quantiles can be used in the following conditions:

  • Key assumptions of linear regression are violated.
  • To predict the Median, the 0.25 Quantile, or any other Quantile.
  • In case of outliers in your data.
  • When error terms or residuals are not normal.
  • When there is an increase in error variance with an increase in the outcome variable.

Google BigQuery Quantiles are generally used to describe a sample, population, or distribution. As an example, Quantile functions can be used.

Monte Carlo Simulations: A Quantile function, also known as a CDF (Cumulative Distribution Function) can be used in both statistical applications and Monte Carlo simulations to estimate the possible outcomes of an uncertain event.

A sample from a given distribution may be calculated in principle by applying its Quantile function to a sample from a uniform distribution.

The demands for example of simulation methods in modern computation finance are focusing increasing attention on methods based on Quantile functions as they work well with multi-variant techniques based on Monte Carlo methods in finance.

Probability Distribution: A Quantile function is a way of defining a probability distributionwhich is an alternative to the probability distribution function. In statistical analysis, users need to know key percentage points of a given distribution, for example, 25%, 50%, 75%, etc.

Consumer Economics: In consumer economics, when analyzing the effects of household income on food expenditures for low and high expenditure households and the standard deviation, you don’t know if there’s a bell-shaped curve around the mean, or a barbell shape with half the households one standard deviation above the mean and the other half one standard deviation below, or some other distribution.

You can use Google BigQuery Quantiles to answer questions like:

What score does a student need to attain on a test to be ranked in the top 10%?

To answer this question, you first need to identify the 90th percentile or .9 quantile of all scores. This is the value that separates the top 10% from the bottom 90%.

What scores encompass the middle 50% of scores for students in a particular test?

To answer this, we would find the .75 quantile of scores and .25 quantile of scores, which are the two values that determine the upper and lower bounds for the middle 50% of scores.

How to Calculate BigQuery Quantile using APPROX_QUANTILES

Google BigQuery Quantile has an APPROX_QUANTILES function that is categorized as an approximate aggregate function. This Google BigQuery Quantile function exposes a way to explore distributions by computing a value for each row based on other rows (window) without the need for a self-join. The Google BigQuery Quantile (APPROX_QUANTILES) function will return a single value for every row.

Since the APPROX_QUANTILE function returns approximate results, it is much more scalable in terms of memory usage and execution time compared to exact aggregation functions like COUNT, MIN, MAX, and AVG

The size and skewness of the dataset will affect the accuracy of approximation. The fractional error per quantile is epsilon=1/N, so as N increases, the relative error becomes smaller and the accuracy becomes higher. 

To put this into context with Google BigQuery Quantiles:

  • quantile(<expression>, 2) computes min and max with 50% error.
  • quantile(<expression>, 3) computes min, median, and max with 33% error.
  • quantile(<expression>, 5) computes quartiles with 25% error.
  • quantile(<expression>, 11) computes deciles with 10% error.
  • quantile(<expression>, 21) computes vingtiles with 5% error.
  • quantile(<expression>, 101) computes percentiles with 1% error.

BigQuery Quantiles & Percentile function is appropriate for exploring distributions in large data sets using APPROX_QUANTILE for which linear memory usage is impractical.

Syntax

APPROX_QUANTILES(
  [DISTINCT]
  expression, number
  [{IGNORE|RESPECT} NULLS]
)

APPROX_QUANTILE function will return the approximate boundaries or the values for a group of expression values, in which number represents the number of quantiles you want to create. Using an approximation algorithm, this function returns an array of number + 1 elements, where the first element is the approximate minimum value, number – 2 intermediate values, and the last element is the approximate maximum value.

Arguments

  • expression: supports any data type except: ARRAY STRUCT
  • number: positive integer of type INT64.

Optional Clauses

You can apply these optional clauses in the following order:

  1. DISTINCT: Every distinct value is aggregated only once into the result.
  2. IGNORE NULLS: NULL values are excluded from the result.
  3. RESPECT NULLS: NULL values are included in the result.

Returned Data Types

It returns an ARRAY of the data type defined by the expression parameter.

Therefore, if you run:

APPROX_QUANTILES(values, 4)

You will get:

[minvalue, 1st quartile, 2nd quartile, 3rd quartile, maxvalue]

Google BigQuery Quantiles Examples

This is the output using an input of numbers between 1 and 100:

SELECT APPROX_QUANTILES(x, 4) AS output
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
+---------------------------------+
|                output           |
+---------------------------------+
|        [1,25,50,75,100]         |
+---------------------------------+

The output includes the minimum (1), the median (50), and the maximum (100).

Similarly, for an array with values between 1 and 10, you will get the following output:

SELECT APPROX_QUANTILES(x, 2) AS output
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

+------------------+
|      output      |
+------------------+
|     [1, 5, 10]   |
+------------------+

This output does not account for values that appear more than once. To ensure that each distinct value in the expression is aggregated only once, you must include the DISTINCT clause as follows:

SELECT APPROX_QUANTILES(DISTINCT x, 2) AS output
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

+-----------------------+
|          output       |
+-----------------------+
|        [1, 6, 10]     |
+-----------------------+

To calculate the .9 quantile or 90th percentile using Google BigQuery Quantiles, you need to include the OFFSET(90) clause:

SELECT APPROX_QUANTILES(x, 100)[OFFSET(90)] AS output
FROM UNNEST([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) AS x;

+------------------+
|      output      |
+------------------+
|           9      |
+------------------+

In the following query, we’re introducing the FORMAT specifier with the % symbol, which formats the data type expression as strings. Since the RESPECT NULLS clause is specified, NULL values will be included in the result.

SELECT FORMAT("%T", APPROX_QUANTILES(x, 2 RESPECT NULLS)) AS output
FROM UNNEST([NULL, NULL, 1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

+-----------------------+
|          output       |
+-----------------------+
|    [NULL, 4, 10]      |
+-----------------------+

When you run the same query with the DISTINCT clause to ignore repeated values, you will get a different result:

SELECT FORMAT("%T", APPROX_QUANTILES(DISTINCT x, 2 RESPECT NULLS)) AS output
FROM UNNEST([NULL, NULL, 1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

+-----------------------+
|          output       |
+-----------------------+
|    [NULL, 6, 10]      |
+-----------------------+

Conclusion

In this article, You just learnt how to use Google BigQuery Quantiles & Percentile function to assess the statistical significance of an observation whose distribution is known using APPROX_QUANTILES function. I hope you found this post helpful and that you will apply these concepts to explore your data and make meaningful inferences.

Visit our Website to Explore Hevo

Companies store valuable data from multiple data sources in Google BigQuery. The manual process to transfer data from source to destination is a tedious task. Hevo Data is a No-code Data Pipeline that can help you transfer data from any data source to desired Google BigQuery. It fully automates the process to load and transform data from 150+ data sources to a destination of your choice without writing a single line of code. 

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite firsthand.

Share your experience of learning about Google BigQuery Quantiles & Percentile in the comments section below!

Jeremiah
Freelance Technical Content Writer, Hevo Data

Jeremiah is specialized in writing for the data industry, offering valuable insights and informative content on the complexities of data integration and analysis. He loves to update himself about latest trends and help businesses through the knowledge.

No-code Data Pipeline For your Google BigQuery