Amazon Redshift is a fully managed, distributed relational database system. It can run queries efficiently over petabytes of data. Massively parallel processing, columnar storage, and data compression encodings help it achieve fast query processing.
In this article, we will see how to optimise the query performance on an Amazon Redshift cluster using Sort Keys.
Introduction to Redshift Sort Keys
Redshift Sort Keys determine the order in which rows in a table are stored. When Sort Keys are used properly, query performance improves because Redshift reads fewer chunks of data and filters out the majority of it.
In other words, Sort Keys let Redshift skip large chunks of data during query processing. Less data to scan means a shorter processing time.
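The block skipping described above can be illustrated with a toy model. This is a sketch in Python, not Redshift's implementation. It assumes that each block records the minimum and maximum value it holds (the metadata Redshift calls a zone map). The block size of 4 values and the helper names `make_zone_map` and `blocks_to_scan` are made up for illustration.

```python
def make_zone_map(values, block_size=4):
    """Split values into fixed-size blocks and keep (min, max) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for bmin, bmax in zone_map if bmin <= hi and bmax >= lo)


# 32 rows stored in sort order -> 8 blocks of 4 values each.
sorted_values = list(range(32))

# The same 32 values in a scrambled but deterministic storage order
# (multiplying by 7 modulo 32 is a permutation of 0..31).
scrambled_values = [(i * 7) % 32 for i in range(32)]

sorted_hits = blocks_to_scan(make_zone_map(sorted_values), 8, 11)
scrambled_hits = blocks_to_scan(make_zone_map(scrambled_values), 8, 11)

print(sorted_hits, scrambled_hits)  # prints: 1 8
```

With sorted storage, the predicate `8 <= x <= 11` matches the range of a single block. With scrambled storage, every block's range overlaps the predicate, so nothing can be skipped.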
You can define multiple columns as Sort Keys, and the data stored in the table is sorted using these columns. The query optimiser takes this sort order into account when determining optimal query plans.
Amazon Redshift supports two kinds of Sort Keys: Compound Sort Keys and Interleaved Sort Keys.
Compound Sort Keys
A compound sort key is made up of all the columns listed in the sort key definition when the table is created, in the order in which they are listed. The table is sorted by the first column, then by the second column within each value of the first, and so on. It is therefore advisable to put the most frequently used column first in the list. COMPOUND is the default sort type. Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY.
For example, let’s create a table with a compound sort key on two columns.
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id  INTEGER NOT NULL,
    c_name        VARCHAR(100) NOT NULL
)
COMPOUND SORTKEY (c_customer_id, c_country_id);
The data in this table is sorted by the columns c_customer_id and c_country_id. Since c_customer_id is first in the list, the table is sorted first by c_customer_id and then by c_country_id.
Figure 1. Table compound sorted by columns – c_customer_id and c_country_id
As Figure 1 shows, getting all country IDs for one customer requires reading only one block, whereas getting the IDs of all customers in a specific country requires reading all four blocks. This shows that Compound Sorting cannot optimise both kinds of queries at the same time.
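The two access patterns above can be reproduced with a small Python sketch. It is a toy model under stated assumptions, not Redshift internals: 4 customers and 4 countries give 16 rows, each block holds 4 rows, and a block is read whenever its min/max range for the filtered column could contain the value. The name `blocks_read` is hypothetical.

```python
from itertools import product

BLOCK_SIZE = 4  # rows per block in this toy model (an assumption)

# 16 rows of (c_customer_id, c_country_id), sorted the compound way:
# by customer first, then by country within each customer.
rows = sorted(product(range(1, 5), range(1, 5)))
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]


def blocks_read(blocks, column, value):
    """Count blocks whose min/max range for `column` could contain `value`."""
    count = 0
    for block in blocks:
        col_values = [row[column] for row in block]
        if min(col_values) <= value <= max(col_values):
            count += 1
    return count


by_customer = blocks_read(blocks, 0, 1)  # like WHERE c_customer_id = 1
by_country = blocks_read(blocks, 1, 1)   # like WHERE c_country_id = 1

print(by_customer, by_country)  # prints: 1 4
```

Filtering on the leading column touches one block, while filtering on the second column touches all four, because every block spans the full range of country IDs.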
Interleaved Sort Keys
An interleaved sort gives equal weight to each column in the sort key. As a result, it can significantly improve performance for queries that apply restrictive predicates (for example, an equality operator in the WHERE clause) to secondary sort columns.
Adding rows to a sorted table that already contains data can degrade query performance significantly, because the new rows are not stored in sort order. Run the VACUUM and ANALYZE operations regularly to re-sort the table and to update the statistical metadata used by the query planner. The degradation is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.
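To see why newly added rows hurt until the table is re-sorted, consider the following Python sketch. It is only a toy model: it assumes new rows are appended in load order after the sorted data, that a block is read when its min/max range could contain the value, and that re-sorting simply rebuilds all blocks in key order. It does not reproduce what VACUUM actually does internally, and the helper names are hypothetical.

```python
BLOCK_SIZE = 4  # values per block in this toy model (an assumption)


def to_blocks(values):
    """Cut a list of key values into fixed-size blocks, keeping their order."""
    return [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]


def blocks_read(blocks, value):
    """Count blocks whose [min, max] range could contain `value`."""
    return sum(1 for b in blocks if min(b) <= value <= max(b))


existing = list(range(0, 32, 2))            # even keys, already in sort order
new_rows = [29, 3, 17, 9, 21, 1, 13, 27]    # odd keys, arriving in load order

sorted_region = to_blocks(existing)
unsorted_region = to_blocks(new_rows)       # appended as-is, not merged

before_load = blocks_read(sorted_region, 10)
after_load = blocks_read(sorted_region + unsorted_region, 10)

# Toy stand-in for a re-sort: merge everything and rebuild blocks in key order.
resorted = to_blocks(sorted(existing + new_rows))
after_resort = blocks_read(resorted, 10)

print(before_load, after_load, after_resort)  # prints: 1 3 1
```

In this model, the lookup touches one block before the load, three blocks once the unsorted rows are appended (both new blocks span almost the whole key range), and one block again after the re-sort.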
For example, let’s create a table with Interleaved Sort Keys.
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id  INTEGER NOT NULL
)
INTERLEAVED SORTKEY (c_customer_id, c_country_id);
Figure 2. Data stored in an interleaved manner
As you can see, the first block stores the first two customer IDs along with the first two country IDs. Therefore, you scan only 2 blocks to return data for a given customer or a given country.
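The layout described above can be approximated in Python by ordering rows along a Z-order (bit-interleaved) key. Treat this as a sketch under assumptions rather than Redshift's exact algorithm: IDs run from 0 to 3 so that each fits in 2 bits, each block holds 4 rows, and `z_value` and `blocks_read` are illustrative names.

```python
from itertools import product

BLOCK_SIZE = 4  # rows per block in this toy model (an assumption)


def z_value(x, y, bits=2):
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)  # bits of x go to odd positions
        z |= ((y >> i) & 1) << (2 * i)      # bits of y go to even positions
    return z


# 16 rows of (customer, country), ordered by the interleaved key.
rows = sorted(product(range(4), range(4)), key=lambda r: z_value(r[0], r[1]))
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]


def blocks_read(blocks, column, value):
    """Count blocks whose min/max range for `column` could contain `value`."""
    count = 0
    for block in blocks:
        col_values = [row[column] for row in block]
        if min(col_values) <= value <= max(col_values):
            count += 1
    return count


by_customer = blocks_read(blocks, 0, 0)  # filter on one customer
by_country = blocks_read(blocks, 1, 0)   # filter on one country

print(by_customer, by_country)  # prints: 2 2
```

Each block ends up covering two customers and two countries, so a filter on either column reads 2 of the 4 blocks. Neither query is as fast as the best case of a compound sort, but neither falls back to reading every block.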
Query performance is much better for large tables that use interleaved sorting. If the table contains 1M blocks (1 TB per column) with an interleaved sort key on both customer ID and country ID, you scan about 1K blocks when you filter on a specific customer or country, a speedup of 1000x compared to the unsorted case.
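The arithmetic behind these numbers can be checked with a few lines of Python. The sketch assumes an idealised layout in which the two interleaved columns split the table into an even grid, so a filter on one column narrows the scan by the square root of the block count. Real tables will only approximate this.

```python
import math

total_blocks = 1_000_000  # 1M blocks of 1 MB each, about 1 TB per column

# Idealised assumption: with two interleaved columns, the blocks form a
# 1000 x 1000 grid, and a filter on one column selects a single "row" of it.
grid_side = math.isqrt(total_blocks)
blocks_scanned = total_blocks // grid_side

# An unsorted table cannot skip anything, so it reads every block.
speedup = total_blocks // blocks_scanned

print(blocks_scanned, speedup)  # prints: 1000 1000
```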
Choosing Sorting Styles
Choosing the right sort style requires knowing the queries you plan to run.
- Use Interleaved Sort Keys when your queries filter on one of the sort key columns at a time, when the WHERE clauses in your queries have highly selective restrictive predicates, or when the tables are huge. You may want to check table statistics by querying the STV_BLOCKLIST system table. Look for tables with a high number of 1 MB blocks per slice that are distributed over all slices.
- Use Compound Sort Keys when you have more than one column as the Sort Key and queries filter on the leading column, when your queries include JOIN, GROUP BY, ORDER BY and PARTITION BY, or when your table is small.
- Don’t use an interleaved sort key on columns with monotonically increasing attributes, like an identity column, dates or timestamps.
We have also talked about Redshift Distribution Keys and how to choose the right distribution style to optimise your AWS Redshift performance.
At Hevo, we use Amazon Redshift as one of our data warehouses and bring in data from different sources in real time. Data sources include databases, cloud applications and clickstreams. Let us know about your experience with different sort key styles in the comments below.