Understanding Redshift Aggregate Functions Simplified 101

on Aggregate Functions, Amazon Redshift, Data Aggregation, Data Warehouses, Redshift Functions • March 23rd, 2022 • Write for Hevo

Redshift Aggregate Functions

Amazon Redshift is a widely used Data Warehouse offered by Amazon Web Services. It offers storage space to users for storing their business data and use for Analysis. It organizes the user’s data into tables, which are made up of rows and columns. It makes it easy for the users to understand their data.

Business users run many queries to access and manipulate data based on their requirements. Many times the basic queries to understand the data require getting a summary of data, the total or average of a column, etc. 

Instead of calculating these manually, Amazon Redshift provides Amazon Redshift Aggregate Functions to make your work easier. You simply have to call the Amazon Redshift Aggregate Function, and it will return the results to you. In this article, you will learn about Amazon Redshift Aggregate Functions in detail. 

Table of Contents

What is Amazon Redshift?

AWS Redshift Logo - Redshift Aggregate Functions
Image Source

Amazon Redshift is a Cloud Data Warehouse based on industry-standard SQL, with the extra capability of managing very big datasets and supporting high-performance data processing and reporting. It is built on top of PostgreSQL 8.0.2. This means that Redshift can be used with standard SQL queries.

However, this isn’t what sets it apart from competing providers. Redshift stands out because of the speed with which it responds to queries conducted on a big database with exabytes of data. MPP, or Massively Parallel Processing, is a design that allows for quick querying. 

Key Features of Amazon Redshift

Some of the main features of Amazon Redshift are listed below:

  • Data Sharing: Amazon Redshift saves costs for organizations and improves performance by data sharing from single cluster to multi-cluster.
  • Massively Parallel Processing: Amazon Redshift uses multiple compute nodes where each computes node processes a chunk of data and all these nodes perform the same operations on data. 
  • Fault-Tolerant: Amazon Redshift continuously monitors the health of the cluster. It automatically replicates data regularly to avoid any data loss at the time of disaster.

To learn more about Amazon Redshift, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

What is an Amazon Aggregate Function?

An aggregate function is a database management function of Amazon Redshift that groups the values of numerous rows into a single summary value. It takes a set of input values and produces a single return value. The GROUP BY clause groups and aggregates results based on the unique values contained in a column or columns.

Type of Redshift Aggregate Functions

The different Amazon Redshift Aggregate Functions are listed below:

1) Redshift Aggregate Functions: ANY_VALUE 

ANY VALUE is a nondeterministic function that returns any value from the input expression values. If the input expression does not return any rows or has a NULL value, this function can return NULL.

Syntax

ANY_VALUE ( [ DISTINCT | ALL ] expression )

Examples

The example below yields an instance of any dateid with Eagles as the eventname.

select any_value(dateid) as dateid, eventname from event where eventname ='Eagles' group by eventname;

Following are the results given below.

dateid  | eventname
------------------------
 1878   |  Eagles

When the eventname is Eagles or Cold War Kids, the following example returns an instance of any dateid.

select any_value(dateid) as dateid, eventname from event where eventname in('Eagles', 'Cold War Kids') group by eventname;

Following are the results shown below.

dateid  | eventname
------------------------
 1922   | Cold War Kids
 1878   | Eagles

2) Redshift Aggregate Functions: APPROXIMATE PERCENTILE_DISC 

The inverse distribution functions APPROXIMATE PERCENTILE DISC is based on the discrete distribution model. It returns an element from the supplied set using a percentile value and a sort specification. With a low relative error of roughly 0.5 percent, approximation permits the function to run substantially faster.

To approximate the discrete percentile of the expression in the ORDER BY clause for a given percentile value, APPROXIMATE PERCENTILE DISC employs a quantile summary algorithm. The value with the least cumulative distribution value that is larger than or equal to percentile is returned by APPROXIMATE PERCENTILE DISC (with respect to the same sort criteria).

The only compute-node function is APPROXIMATE PERCENTILE DISC. If a user-defined table or an Amazon Redshift system table isn’t referenced in the query, the function produces an error.

Syntax

APPROXIMATE  PERCENTILE_DISC ( percentile ) WITHIN GROUP (ORDER BY expr)

Examples

For the top ten dates, the following example returns the number of sales, total sales, and fiftieth percentile value.

select top 10 date.caldate,
count(totalprice), sum(totalprice),
approximate percentile_disc(0.5) 
within group (order by totalprice)
from listing
join date on listing.dateid = date.dateid
group by date.caldate
order by 3 desc;
 
caldate    | count | sum        | percentile_disc
-----------+-------+------------+----------------
2008-01-07 |   658 | 2081400.00 |         2020.00
2008-01-02 |   614 | 2064840.00 |         2178.00
2008-07-22 |   593 | 1994256.00 |         2214.00
2008-01-26 |   595 | 1993188.00 |         2272.00
2008-02-24 |   655 | 1975345.00 |         2070.00
2008-02-04 |   616 | 1972491.00 |         1995.00
2008-02-14 |   628 | 1971759.00 |         2184.00
2008-09-01 |   600 | 1944976.00 |         2100.00
2008-07-29 |   597 | 1944488.00 |         2106.00
2008-07-23 |   592 | 1943265.00 |         1974.00

3) Redshift Aggregate Functions: AVG 

The AVG function returns the input expression values’ average (arithmetic mean). NULL values are ignored by the AVG function, which only deals with numeric values.

Syntax

AVG ( [ DISTINCT | ALL ] expression )

Examples

The average quantity sold per transaction from the SALES table:

select avg(qtysold)from sales;
 
avg
-----
2
(1 row)

The average total price listed for all listings:

select avg(numtickets*priceperticket) as avg_total_price from listing;
 
avg_total_price
-----------------
3034.41
(1 row)

The average price paid, arranged by month in descending order:

select avg(pricepaid) as avg_price, month 
from sales, date
where sales.dateid = date.dateid
group by month
order by avg_price desc;
 
avg_price | month
-----------+-------
659.34 | MAR
655.06 | APR
645.82 | JAN
643.10 | MAY
642.72 | JUN
642.37 | SEP
640.72 | OCT
640.57 | DEC
635.34 | JUL
635.24 | FEB
634.24 | NOV
632.78 | AUG
(12 rows)

4) Redshift Aggregate Functions: COUNT 

The COUNT function counts the rows that the expression specifies. There are three versions of the COUNT function. COUNT (*) counts all rows in the target table, whether or whether they contain nulls.

COUNT (expression) returns the number of rows in a column or expression with non-NULL values. COUNT (Unique expression) determines how many distinct non-NULL values are present in a column or expression.

Syntax

[ APPROXIMATE ] COUNT ( [ DISTINCT | ALL ] * | expression )

Examples

Count the number of users in the state of Florida:

select count (*) from users where state='FL';
 
count
-------
510
(1 row)

Count all of the EVENT table’s unique venue IDs:

select count (distinct venueid) as venues from event;
 
venues
--------
204
(1 row)

Count how many times each seller advertised a batch of four or more tickets for sale. Sort the results according to the seller ID:

select count(*), sellerid from listing 
where numtickets > 4
group by sellerid
order by 1 desc, 2;
 
count | sellerid
-------+----------
12 |    6386
11 |    17304
11 |    20123
11 |    25428
...

The returns and execution times for COUNT and APPROXIMATE COUNT are compared in the following cases.

select  count(distinct pricepaid) from sales;
count
-------
  4528
(1 row)
 
Time: 48.048 ms
 
select approximate count(distinct pricepaid) from sales;
count
-------
  4541
(1 row)
 
Time: 21.728 ms

5) Redshift Aggregate Functions: LISTAGG 

The LISTAGG aggregate function sorts the results for each group in a query according to the ORDER BY expression and then combines the values into a single string for each group. LISTAGG is a function that only works on compute nodes. If the query does not reference a user-defined table or an Amazon Redshift system table, the function produces an error.

Syntax

LISTAGG( [DISTINCT] aggregate_expression [, ‘delimiter‘ ] ) [ WITHIN GROUP (ORDER BY order_list) ]   

Examples

The sample below combines seller IDs and sorts them by seller ID.

select listagg(sellerid, ', ') within group (order by sellerid) from sales
where eventid = 4337;
listagg                                                                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------
380, 380, 1178, 1178, 1178, 2731, 8117, 12905, 32043, 32043, 32043, 32432, 32432, 38669, 38750, 41498, 45676, 46324, 47188, 47188, 48294

DISTINCT is used in the following example to return a list of unique seller IDs.

select listagg(distinct sellerid, ', ') within group (order by sellerid) from sales
where eventid = 4337;
 
listagg                                                                                    
-------------------------------------------------------------------------------------------
380, 1178, 2731, 8117, 12905, 32043, 32432, 38669, 38750, 41498, 45676, 46324, 47188, 48294

The following example collects seller IDs in chronological order.

select listagg(sellerid) 
within group (order by dateid)
from winsales;
 
   listagg
-------------
 31141242333

For customer B, the following example gives a pipe-separated list of sales dates.

elect listagg(dateid,'|') 
within group (order by sellerid desc,salesid asc)
from winsales
where buyerid  = 'b';
 
listagg
---------------------------------------
2003-08-02|2004-04-18|2004-04-18|2004-02-12

For each buyer ID, the following example provides a comma-separated list of sales IDs.

select buyerid, 
listagg(salesid,',')
within group (order by salesid) as sales_id
from winsales
group by buyerid
order by buyerid;
 
   buyerid | sales_id
-----------+------------------------
        a  |10005,40001,40005	
        b  |20001,30001,30004,30003	
        c  |10001,20002,30007,10006

6) Redshift Aggregate Functions: MAX 

The MAX function returns the highest value in a row set. DISTINCT or ALL can be used, but they have no effect on the outcome.

Syntax

MAX ( [ DISTINCT | ALL ] expression )

Examples

Find the highest price paid from all sales:

select max(pricepaid) from sales;
 
max
----------
12624.00
(1 row)

Find the highest per-ticket price paid across all sales:

select max(pricepaid/qtysold) as max_ticket_price
from sales;
 
max_ticket_price
-----------------
2500.00000000
(1 row)

7) Redshift Aggregate Functions: MEDIAN

The median value for a range of variables is calculated. The range is skipped over if any of the values are NULL. MEDIAN is a continuous distribution model inverse distribution function. The MEDIAN function is only available on compute nodes. If a user-defined table or an Amazon Redshift system table isn’t referenced in the query, the function produces an error.

Syntax

MEDIAN ( median_expression )

Examples

MEDIAN delivers the following outcomes, as shown in the following example:

select top 10  distinct sellerid, qtysold, 
percentile_cont(0.5) within group (order by qtysold),
median (qtysold) 
from sales
group by sellerid, qtysold;
 
sellerid | qtysold | percentile_cont | median
---------+---------+-----------------+-------
       1 |       1 |             1.0 |    1.0
       2 |       3 |             3.0 |    3.0
       5 |       2 |             2.0 |    2.0
       9 |       4 |             4.0 |    4.0
      12 |       1 |             1.0 |    1.0
      16 |       1 |             1.0 |    1.0
      19 |       2 |             2.0 |    2.0
      19 |       3 |             3.0 |    3.0
      22 |       2 |             2.0 |    2.0
      25 |       2 |             2.0 |    2.0

8) Redshift Aggregate Functions: MIN 

The MIN function returns the row with the lowest value. DISTINCT or ALL can be used, but they have no effect on the outcome.

Syntax

MIN ( [ DISTINCT | ALL ] expression )

Examples

Find the best deal from all of the sales:

select min(pricepaid) from sales;
 
min
-------
20.00
(1 row)

Find the lowest per-ticket pricing across all sales:

select min(pricepaid/qtysold)as min_ticket_price
from sales;
 
min_ticket_price
------------------
20.00000000
(1 row)

9) Redshift Aggregate Functions: PERCENTILE_CONT 

The inverse distribution function PERCENTILE CONT presupposes a continuous distribution model. It accepts a percentile value and a sort specification and returns an interpolated value that would fall into the supplied percentile value when the sort specification is followed.

After arranging values, PERCENTILE CONT computes a linear interpolation between them. The function computes the row number after ranking the rows according to the sort specification using the percentile value (P) and the number of not null rows (N) in the aggregation group.

The formula RN = (1+ (P*(N-1)) is used to calculate the row number (RN). The aggregate function’s final output is derived via linear interpolation between the values from rows CRN = CEILING(RN) and FRN = FLOOR (RN).

Final result will be:

If (CRN = FRN = RN) then the result is (value of expression from row at RN)

Otherwise the result:

(CRN – RN) * (value of expression for row at FRN) + (RN – FRN) * (value of expression for row at CRN).

The function PERCENTILE CONT is only available on compute nodes. If the query does not reference a user-defined table or an Amazon Redshift system table, the function produces an error.

Syntax

PERCENTILE_CONT ( percentile ) WITHIN GROUP (ORDER BY expr)

Examples

The following example demonstrates how MEDIAN yields the following outcomes. 

select top 10  distinct sellerid, qtysold, 
percentile_cont(0.5) within group (order by qtysold),
median (qtysold) 
from sales
group by sellerid, qtysold;
 
sellerid | qtysold | percentile_cont | median
---------+---------+-----------------+-------
       1 |       1 |             1.0 |    1.0
       2 |       3 |             3.0 |    3.0
       5 |       2 |             2.0 |    2.0
       9 |       4 |             4.0 |    4.0
      12 |       1 |             1.0 |    1.0
      16 |       1 |             1.0 |    1.0
      19 |       2 |             2.0 |    2.0
      19 |       3 |             3.0 |    3.0
      22 |       2 |             2.0 |    2.0
      25 |       2 |             2.0 |    2.0

10) Redshift Aggregate Functions: STDDEV_SAMP and STDDEV_POP 

The functions STDDEV SAMP and STDDEV POP retrieve the sample and population standard deviations of a set of numeric values, respectively (integer, decimal, or floating-point). The square root of the sample variance of the same set of values is the result of the STDDEV SAMP function. The functions STDDEV SAMP and STDDEV are interchangeable.

Syntax

STDDEV_SAMP | STDDEV ( [ DISTINCT | ALL ] expression)
STDDEV_POP ( [ DISTINCT | ALL ] expression)

Examples

The average of the values in the VENUESEATS column of the VENUE database, as well as the sample standard deviation and population standard deviation of the same set of values, are returned by the following query. The column VENUESEATS is an INTEGER. The result’s scale is decreased to two digits.

select avg(venueseats),
cast(stddev_samp(venueseats) as dec(14,2)) stddevsamp,
cast(stddev_pop(venueseats) as dec(14,2)) stddevpop
from venue;
 
avg  | stddevsamp | stddevpop
-------+------------+-----------
17503 |   27847.76 |  27773.20
(1 row)

The sample standard deviation for the SALES table’s COMMISSION column is returned using the following query. A DECIMAL column is a COMMISSION. The result’s scale is lowered to ten digits.

select cast(stddev(commission) as dec(18,10))
from sales;
 
stddev
----------------
130.3912659086
(1 row)

The sample standard deviation for the COMMISSION column is cast as an integer in the following query.

select cast(stddev(commission) as integer)
from sales;
 
stddev
--------
130
(1 row)

For the COMMISSION column, the following query provides the sample standard deviation as well as the square root of the sample variance. These calculations provide the same conclusions.

select
cast(stddev_samp(commission) as dec(18,10)) stddevsamp,
cast(sqrt(var_samp(commission)) as dec(18,10)) sqrtvarsamp
from sales;
 
stddevsamp   |  sqrtvarsamp
----------------+----------------
130.3912659086 | 130.3912659086
(1 row)

11) Redshift Aggregate Functions: SUM 

The SUM function returns the total of the values in the input column or expression. NULL values are ignored by the SUM function, which only works with numeric values.

Syntax

SUM ( [ DISTINCT | ALL ] expression )

Examples

From the SALES table, calculate the total of all commissions paid:

select sum(commission) from sales;
 
sum
-------------
16614814.65
(1 row)

Find out how many seats are available in every arena in Florida:

select sum(venueseats) from venue
where venuestate = 'FL';
 
sum
--------
250411
(1 row)

To find out how many seats were sold in May, use the following formula:

select sum(qtysold) from sales, date
where sales.dateid = date.dateid and date.month = 'MAY';
 
sum
-------
32291
(1 row)

12) Redshift Aggregate Functions: VAR_SAMP and VAR_POP 

The functions VAR SAMP and VAR POP retrieve the sample and population variances of a set of numeric values, respectively.  The squared sample standard deviation of the same set of data is equivalent to the output of the VAR SAMP function. The functions VAR SAMP and VARIANCE are interchangeable.

Syntax

VAR_SAMP | VARIANCE ( [ DISTINCT | ALL ] expression)
VAR_POP ( [ DISTINCT | ALL ] expression)

Examples

The NUMTICKETS field in the LISTING table’s NUMTICKETS column returns the rounded sample and population variance using the following query.

select avg(numtickets),
round(var_samp(numtickets)) varsamp,
round(var_pop(numtickets)) varpop
from listing;
 
avg | varsamp | varpop
-----+---------+--------
10 |      54 |     54
(1 row)

The following query performs the identical calculations as the previous one, but converts the results to decimal numbers.

select avg(numtickets),
cast(var_samp(numtickets) as dec(10,4)) varsamp,
cast(var_pop(numtickets) as dec(10,4)) varpop
from listing;
 
avg | varsamp | varpop
-----+---------+---------
10 | 53.6291 | 53.6288
(1 row)

Conclusion

In this article, you learnt about Amazon Redshift Aggregate Functions that combine a group of input values into a unique output value. They are commonly used in Databases, spreadsheets, and statistical software programs frequently use functions. Aggregate functions take a set of input values and produce a single return value. Every function delivers a set of values categorized by the dimension or dimensions provided. There is a conditional aggregation for each aggregate.

Visit our Website to Explore Hevo

Amazon Redshift stores data from multiple sources and every source follows a different schema. Instead of manually writing scripts for every source to perform the ETL (Extract Transform Load) process, one can automate the whole process. Hevo Data is a No-code Data pipeline solution that can help you transfer data from 100+ data sources to Amazon Redshift or other Data Warehouse of your choice. Its fault-tolerant and user-friendly architecture fully automates the process of loading and transforming data to destination without writing a single line of code. 

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about the Amazon Redshift Aggregate Functions in the comments section below!

No-code Data Pipeline For your Amazon Redshift