Amazon Redshift is a fully managed, distributed relational database system that can query petabytes of data efficiently. It has become a natural choice for many data warehousing needs, which makes it important to understand sort keys in order to get optimum performance from it.
Another important prerequisite for making the most of your data warehouse is having all your data available in real time. The data also needs to be accurate so that business teams can trust it when making data-driven decisions.
To solve this, there are ready-to-use Data Integration Solutions such as Hevo (7-day free trial) that allow you to move data from hundreds of sources to your Redshift instance in minutes.
In this article, we will shed light on sort keys and their implementation in Redshift. Used correctly, sort keys can considerably improve query performance on an Amazon Redshift cluster.
Introduction to Redshift Sort Key
A Redshift Sort Key determines the order in which the rows of a table are stored. Redshift keeps the minimum and maximum values of each 1 MB block, so when a query filters on a sort key column, blocks whose range cannot contain matching rows do not have to be read.
In this way, Sort Keys allow large chunks of data to be skipped during query processing. Less data to scan means a shorter processing time, which improves the query’s performance.
A Sort Key can consist of multiple columns, and the data in the table is stored sorted by those columns. The query optimizer takes this sort order into account when it determines the optimal query plan.
Amazon Redshift supports two kinds of Sort Keys: Compound Sort Keys and Interleaved Sort Keys.
Compound Sort Key
A Compound Sort Key is made up of all the columns listed in the sort key definition when the table is created, in the order in which they are listed. Because the table is sorted by the first column before any other, it is advisable to put the most frequently used column first in the list. COMPOUND is the default sort type. Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY.
For example, let’s create a table with a compound sort key on two columns.
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id  INTEGER NOT NULL,
    c_name        VARCHAR(100) NOT NULL
)
COMPOUND SORTKEY (c_customer_id, c_country_id);
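To confirm that the sort key was applied as intended, one option is to query the PG_TABLE_DEF catalog view. This is only a sketch: it assumes the table was created in the public schema and that this schema is on your search_path, since PG_TABLE_DEF lists only tables in schemas on the search path.

```sql
-- Show each column of the customer table with its position in the sort key.
-- Assumption: the table lives in the public schema, which is on the search_path.
-- sortkey = 0 means the column is not part of the sort key; for a compound
-- sort key, a positive value gives the column's position in the key.
SELECT "column", type, sortkey
FROM pg_table_def
WHERE schemaname = 'public'
  AND tablename = 'customer';
```

For the table above, c_customer_id and c_country_id should be reported with sort key positions 1 and 2, and c_name with 0.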
Figure 1 shows how the data is stored in the table: it is sorted by the columns c_customer_id and c_country_id. Since c_customer_id is first in the list, the table is sorted first by c_customer_id and then by c_country_id.
Figure 1. Table compound sorted by columns – c_customer_id and c_country_id
As you can see in Figure 1, to get all country IDs for one customer you need to access only one block. To get the IDs of all customers in a specific country, you need to access all four blocks. This shows that Compound Sorting cannot optimize both kinds of queries at the same time.
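The two access patterns can be sketched as a pair of queries against the customer table defined earlier. The literal values 1 and 4 are made up for illustration.

```sql
-- Filter on the leading sort column: rows for one customer sit next to each
-- other, so blocks whose c_customer_id range excludes the value can be skipped.
SELECT c_country_id
FROM customer
WHERE c_customer_id = 1;

-- Filter only on the secondary sort column: rows for one country are spread
-- across the table, so most or all blocks still have to be read.
SELECT c_customer_id
FROM customer
WHERE c_country_id = 4;
```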
Interleaved Sort Key
An Interleaved Sort Key gives equal weight to each column in the sort key. As a result, it can significantly improve performance for queries that use restrictive predicates (for example, an equality operator in the WHERE clause) on secondary sort columns.
Adding rows to a sorted table that already contains data affects performance significantly, because the new rows are kept in an unsorted region until the table is re-sorted. Run the VACUUM and ANALYZE operations regularly to re-sort the rows and to update the statistical metadata for the query planner. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.
For example, let’s create a table with an Interleaved Sort Key.
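A minimal maintenance sketch for the customer table from this article is shown below. How often to run these commands depends on your load pattern and is not prescribed here. Note that VACUUM cannot run inside a transaction block.

```sql
-- Re-sort rows and reclaim space (suitable for a compound sort key).
VACUUM FULL customer;

-- For an interleaved sort key: re-analyze the distribution of the sort key
-- values and then re-sort. This usually takes longer than VACUUM FULL.
VACUUM REINDEX customer;

-- Refresh the statistics used by the query planner.
ANALYZE customer;

-- Check how skewed the interleaved sort columns have become.
-- A value of 1 means no skew; values well above 1 suggest a REINDEX may help.
SELECT tbl, col, interleaved_skew, last_reindex
FROM svv_interleaved_columns;
```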
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id  INTEGER NOT NULL
)
INTERLEAVED SORTKEY (c_customer_id, c_country_id);
Figure 2. Data stored in an interleaved manner
As you can see, the first block stores the first two customer IDs along with the first two country IDs. Therefore, you scan only two blocks to return the data for a given customer or a given country.
Query performance is much better for large tables that use interleaved sorting. If the table contains 1M blocks (1 TB per column) with an interleaved sort key on both customer ID and country ID, you scan 1K blocks when you filter on a specific customer or country, a speedup of 1000x compared to the unsorted case.
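The 1000x figure can be reproduced with a back-of-the-envelope calculation. It rests on a simplifying assumption that is not stated above: with two interleaved sort columns, a filter on one of them narrows the scan to roughly the square root of the total number of blocks. The same rule matches the small example in Figure 2, where 2 of 4 blocks are scanned.

```latex
N = 10^{6} \ \text{blocks}, \qquad
N_{\text{scanned}} \approx \sqrt{N} = 10^{3} \ \text{blocks}, \qquad
\text{speedup} \approx \frac{N}{N_{\text{scanned}}} = \frac{10^{6}}{10^{3}} = 1000
```

At 1 MB per block, 1M blocks correspond to the 1 TB per column quoted above.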
Choosing Sorting Keys
Selecting the right kind of sort key requires knowledge of the queries you plan to run.
- Use an Interleaved Sort Key when the WHERE clauses in your queries have highly selective, restrictive predicates on different sort key columns, or when the tables are huge. It can also help with a single sort column whose values share a long common prefix. You may want to check table statistics by querying the STV_BLOCKLIST system table. Look for tables with a high number of 1 MB blocks per slice that are distributed over all slices.
- Use a Compound Sort Key when you have more than one column in the Sort Key and your queries filter on the leading column, when your queries include JOINs, GROUP BY, ORDER BY and PARTITION BY, or when your table is small.
- Don’t use an interleaved sort key on columns with monotonically increasing attributes, like an identity column, dates or timestamps.
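The STV_BLOCKLIST check mentioned in the list above could look roughly like the query below. It is a sketch, not a tuned diagnostic: it joins to STV_TBL_PERM only to show table names, and it assumes you have the superuser visibility needed to see all rows in these system tables.

```sql
-- Count 1 MB blocks per table and slice to spot large tables
-- that are spread over all slices.
SELECT TRIM(p.name) AS table_name,
       b.slice,
       COUNT(*) AS blocks_1mb
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id
 AND b.slice = p.slice
GROUP BY TRIM(p.name), b.slice
ORDER BY blocks_1mb DESC;
```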
We have also talked about Redshift Distribution Keys and how to choose the right distribution style to optimize your AWS Redshift performance.
Another way to get optimum query performance from Redshift is to restructure the data from OLTP to OLAP. You can create derived tables by pre-aggregating and joining the data. A Data Integration Platform such as Hevo offers Data Modelling and Workflow capabilities to achieve this in a simple and reliable manner. You can sign up for a free trial at Hevo.
Let us know about your experience using different sort keys in the comments below.