Amazon Redshift is a fully managed, distributed relational database system that can run queries efficiently over petabytes of data.
Redshift has become a natural choice for many data warehousing needs, which makes it important to understand Redshift Sort Keys in order to get optimum performance from it.
Another important prerequisite for making the most of your data warehouse is having all your data available in real time. The data must also be accurate so that business teams can trust it when making data-driven decisions.
In this article, we will shed light on sort keys and their implementation in Redshift. Used correctly, sort keys can considerably improve query performance on an Amazon Redshift cluster.
Table of Contents
- Understanding Redshift Sort Keys
- Types of Redshift Sort Keys
- Choosing the Ideal Sort Key in Redshift
Understanding Redshift Sort Keys
A Redshift Sort Key determines the order in which the rows of a table are stored. When sort keys are used properly, query performance improves because the query engine can skip the blocks that cannot contain matching rows and read only the few that can.
Less data to scan means a shorter processing time, and therefore a faster query.
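The block skipping described above can be sketched with a toy model. Redshift keeps the minimum and maximum value of each block (a zone map) and skips blocks whose range cannot match a filter. The block size of four rows, the sample values, and the function names below are assumptions made for illustration only; the real storage engine is far more involved.

```python
from typing import List, Tuple

BLOCK_SIZE = 4  # hypothetical rows per block (real Redshift blocks are 1 MB)


def build_zone_map(values: List[int], block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Split a column into fixed-size blocks and record each block's (min, max)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map: List[Tuple[int, int]], target: int) -> int:
    """Count the blocks whose min/max range could contain `target`."""
    return sum(1 for low, high in zone_map if low <= target <= high)


# The same 16 values, stored in arrival order versus sorted order.
unsorted_column = [9, 2, 14, 7, 1, 16, 5, 11, 3, 12, 8, 15, 4, 10, 6, 13]
sorted_column = sorted(unsorted_column)

unsorted_scan = blocks_to_scan(build_zone_map(unsorted_column), target=7)
sorted_scan = blocks_to_scan(build_zone_map(sorted_column), target=7)

print(f"unsorted: {unsorted_scan} of 4 blocks, sorted: {sorted_scan} of 4 blocks")
```

In the unsorted layout every block's range overlaps the value being searched for, so nothing can be skipped. Once the column is sorted, only one block qualifies.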
Types of Redshift Sort Keys
More than one column can be defined as a Sort Key, and the data in the table is stored sorted by these columns. The query optimizer uses this sort order when determining optimal query plans.
Amazon Redshift supports two kinds of Sort Keys: Redshift Compound Sort Keys and Redshift Interleaved Sort Keys.
Redshift Compound Sort Key
A compound sort key is made up of all the columns listed in the sort key definition at table creation, in the order in which they are listed. It is therefore advisable to put the column that queries use most frequently first in the list. COMPOUND is the default sort type. Redshift compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY.
For example, let’s create a table with a compound sort key made up of two columns.
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id INTEGER NOT NULL,
    c_name VARCHAR(100) NOT NULL
)
COMPOUND SORTKEY (c_customer_id, c_country_id);
Figure 1 shows how the data is stored in the table: it is sorted by the columns c_customer_id and c_country_id. Since c_customer_id is first in the list, the table is sorted first by c_customer_id and then by c_country_id.
Figure 1. Table compound sorted by columns – c_customer_id and c_country_id
As Figure 1 shows, getting all country IDs for one customer requires access to a single block, whereas getting the IDs of all customers in a specific country requires access to all four blocks. In other words, compound sorting cannot optimize both kinds of queries at the same time.
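The block counts from Figure 1 can be reproduced with a small simulation. It assumes 4 customers, 4 countries, and 4 rows per block, which are illustrative numbers chosen so that 16 rows fill four blocks. The helper name `blocks_touched` is hypothetical.

```python
from itertools import product

ROWS_PER_BLOCK = 4  # hypothetical; chosen so that 16 rows fill 4 blocks

# Every (customer_id, country_id) pair for 4 customers and 4 countries.
rows = list(product(range(1, 5), range(1, 5)))

# Compound sort: order by customer_id first, then by country_id.
compound = sorted(rows, key=lambda row: (row[0], row[1]))
compound_blocks = [
    compound[i:i + ROWS_PER_BLOCK] for i in range(0, len(compound), ROWS_PER_BLOCK)
]


def blocks_touched(blocks, column, value):
    """Count blocks whose min/max range on `column` includes `value`."""
    count = 0
    for block in blocks:
        column_values = [row[column] for row in block]
        if min(column_values) <= value <= max(column_values):
            count += 1
    return count


by_customer = blocks_touched(compound_blocks, column=0, value=2)
by_country = blocks_touched(compound_blocks, column=1, value=2)

print(f"filter on customer: {by_customer} block(s), filter on country: {by_country} block(s)")
```

Each block holds a single customer but every country, so a filter on the leading column can skip three blocks while a filter on the second column can skip none.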
Redshift Interleaved Sort Key
Interleaved sort gives equal weight to each column in the Redshift sort keys. As a result, it can significantly improve query performance where the query uses restrictive predicates (equality operator in WHERE clause) on secondary sort columns.
Adding rows to a sorted table that already contains data significantly affects performance, because the new rows are not stored in sort order. Run the VACUUM and ANALYZE operations regularly to re-sort the table and update the statistical metadata used by the query planner. The performance degradation is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.
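To see why newly added rows hurt and why re-sorting helps, here is a toy model. It treats VACUUM as a plain re-sort of the whole column and uses an assumed block size of four rows; the values and function names are made up for illustration and do not reflect how Redshift implements the operation.

```python
BLOCK_SIZE = 4  # hypothetical rows per block, for illustration only


def zone_map(values, block_size=BLOCK_SIZE):
    """Record the (min, max) of each fixed-size block of the column."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(ranges, target):
    """Count the blocks whose min/max range could contain `target`."""
    return sum(1 for low, high in ranges if low <= target <= high)


sorted_region = list(range(1, 17))            # 16 rows already in sort order
appended_rows = [3, 15, 7, 11, 2, 14, 6, 10]  # new rows, stored in arrival order

before_vacuum = sorted_region + appended_rows  # new rows sit after the sorted region
after_vacuum = sorted(before_vacuum)           # modelled effect of re-sorting

scan_before = blocks_to_scan(zone_map(before_vacuum), target=5)
scan_after = blocks_to_scan(zone_map(after_vacuum), target=5)

print(f"blocks scanned before re-sort: {scan_before}, after re-sort: {scan_after}")
```

The appended blocks cover wide value ranges, so they match almost any filter until the table is re-sorted.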
For example, let’s create a table with Interleaved Sort Keys.
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id INTEGER NOT NULL
)
INTERLEAVED SORTKEY (c_customer_id, c_country_id);
Figure 2. Data stored in an interleaved manner
As you can see, the first block stores the first two customer IDs along with the first two country IDs. Therefore, you scan only two blocks to return the data for a given customer or a given country.
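The layout in Figure 2 can be approximated by interleaving the bits of the two keys (a Z-order style ordering). This is a simplified stand-in chosen because it reproduces the figure; this article does not state how Redshift orders interleaved data internally. The row counts, block size, and function names are assumptions.

```python
from itertools import product

ROWS_PER_BLOCK = 4  # hypothetical; chosen so that 16 rows fill 4 blocks


def interleave_bits(a: int, b: int, bits: int = 2) -> int:
    """Interleave the bits of a and b, most significant first (a's bit leads)."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((a >> i) & 1)
        code = (code << 1) | ((b >> i) & 1)
    return code


def blocks_touched(blocks, column, value):
    """Count blocks whose min/max range on `column` includes `value`."""
    count = 0
    for block in blocks:
        column_values = [row[column] for row in block]
        if min(column_values) <= value <= max(column_values):
            count += 1
    return count


# Every (customer_id, country_id) pair for 4 customers and 4 countries.
rows = list(product(range(1, 5), range(1, 5)))

# Order rows by the interleaved code of their zero-based key values.
interleaved = sorted(rows, key=lambda row: interleave_bits(row[0] - 1, row[1] - 1))
interleaved_blocks = [
    interleaved[i:i + ROWS_PER_BLOCK] for i in range(0, len(interleaved), ROWS_PER_BLOCK)
]

by_customer = blocks_touched(interleaved_blocks, column=0, value=2)
by_country = blocks_touched(interleaved_blocks, column=1, value=2)

print(f"filter on customer: {by_customer} block(s), filter on country: {by_country} block(s)")
```

Neither filter is as cheap as the best case under compound sorting, but neither degrades to a full scan.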
For large tables, interleaved sorting improves query performance considerably. If the table contains 1M blocks (1 TB per column) and has an interleaved sort key on both customer ID and country ID, you scan 1K blocks when you filter on a specific customer or country, a speedup of 1000x compared to the unsorted case.
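One way to read these numbers is as an idealized model in which a filter on one column of a two-column interleaved key narrows the scan to roughly the square root of the total block count. The short calculation below works through that assumption; real tables will deviate from it.

```python
import math

total_blocks = 1_000_000  # 1M blocks of 1 MB each, about 1 TB per column

# Idealized assumption: a two-column interleaved key lets a filter on either
# column narrow the scan to about sqrt(N) blocks.
blocks_scanned = math.isqrt(total_blocks)

# An unsorted table would have to scan every block.
speedup = total_blocks // blocks_scanned

print(f"blocks scanned: {blocks_scanned}, speedup over unsorted: {speedup}x")
```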
Choosing the Ideal Sort Key in Redshift
Selecting the right kind of sort key requires knowledge of the queries you plan to run.
- Use an Interleaved Sort Key when you plan to use one column as the Sort Key, when the WHERE clauses in your queries have highly selective, restrictive predicates, or when the tables are huge. To judge table size, you can check table statistics by querying the STV_BLOCKLIST system table: look for tables with a high number of 1 MB blocks per slice that are distributed over all slices.
- Use a Compound Sort Key when you have more than one column as the Sort Key, when your queries include JOINs, GROUP BY, ORDER BY, or PARTITION BY, or when your table is small.
- Don’t use an interleaved sort key on columns with monotonically increasing attributes, like an identity column, dates or timestamps.
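The guidelines above can be written down as a small decision helper. This is only one possible reading of the list: the function name, its parameters, and the precedence between the rules are hypothetical, and actual choices should be validated against real query workloads.

```python
def suggest_sort_style(
    selective_filters: bool,
    large_table: bool,
    uses_joins_or_ordering: bool,
    monotonic_sort_column: bool,
) -> str:
    """Return a suggested sort style based on the rules of thumb in this article."""
    # Monotonically increasing columns (identity, dates, timestamps)
    # are a poor fit for interleaved sorting.
    if monotonic_sort_column:
        return "COMPOUND"
    # JOINs, GROUP BY, ORDER BY, PARTITION BY, or a small table favour compound.
    if uses_joins_or_ordering or not large_table:
        return "COMPOUND"
    # Highly selective WHERE predicates on a large table favour interleaved.
    if selective_filters:
        return "INTERLEAVED"
    # COMPOUND is the default sort type.
    return "COMPOUND"


# A large table filtered by highly selective predicates.
print(suggest_sort_style(
    selective_filters=True,
    large_table=True,
    uses_joins_or_ordering=False,
    monotonic_sort_column=False,
))

# The same table, but sorted on a timestamp column.
print(suggest_sort_style(
    selective_filters=True,
    large_table=True,
    uses_joins_or_ordering=False,
    monotonic_sort_column=True,
))
```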
This is how you can choose the ideal sort key in Redshift for your unique data needs.
Another way to get optimum query performance from Redshift is to restructure the data from OLTP to OLAP. You can create derived tables by pre-aggregating and joining the data. A data integration platform such as Hevo offers data modelling and workflow capabilities to achieve this in a simple and reliable manner. You can sign up for a free trial at Hevo.
To learn how to optimise query performance further, check out our detailed guide on Redshift distribution keys. It will help you choose the ideal table distribution style and combine the power of the Redshift distkey and sortkey. With a clear idea of both, you will be able to compare Redshift distkey vs sortkey and figure out the type of optimisation you need to distribute and sort data effectively.
Let us know about your experience using different sort keys in the comments below.