Amazon Redshift is a fully managed, distributed relational Data Warehouse system that can run queries efficiently over petabytes of data. It has become a natural choice for many organizations' Data Warehousing needs, which makes it important to understand Redshift Sortkeys in order to get optimum performance from it.
This article introduces the Amazon Redshift Data Warehouse and Redshift Sortkeys. It also covers the types of Sort Keys available and how they are implemented. Used correctly, Sort Keys can significantly improve query performance on an Amazon Redshift Cluster. Read along to understand the importance of Sort Keys and the points to keep in mind while selecting a type of Sort Key for your Data Warehouse!
Introduction to Amazon Redshift
Amazon Redshift is a well-known Cloud-based Data Warehouse. Developed by Amazon, Redshift can scale quickly and deliver services to users while reducing costs and simplifying operations. Moreover, it integrates well with other AWS services, so you can analyze data across your data warehouses and data lakes efficiently.
With machine learning, massively parallel query execution, and columnar storage on high-performance disks, Redshift is designed to deliver fast query performance. It is easy to operate and scale, and because it is queried with SQL, users don’t need to learn any new languages. By simply loading data into the cluster and using your favorite tools, you can start working on Redshift.
Introduction to Redshift Sortkeys
Redshift Sortkeys determine the order in which rows in a table are stored. When Sort Keys are used properly, query performance improves because the query optimizer can read fewer chunks of data and filter out the majority of it.
When your data is stored, some metadata is also generated. For example, the minimum and maximum values of each block are saved and can be read directly without scanning the data itself. Every time a query is executed, this metadata is passed to the query planner, which uses it to create more efficient execution plans. Sort Keys are what make this metadata effective for optimizing query processing.
Redshift Sortkeys allow large chunks of data to be skipped during query processing. Less data to scan means a shorter processing time, which improves the query’s performance.
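The effect of per-block minimum and maximum values can be illustrated with a small Python simulation. This is only a sketch of the idea, not Redshift's actual implementation: the `Block` class, the `build_blocks` and `blocks_to_scan` helpers, and the 4-row block size are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # rows per block; hypothetical, chosen to keep the example small


@dataclass
class Block:
    values: list
    min_value: int
    max_value: int


def build_blocks(values, block_size=BLOCK_SIZE):
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def blocks_to_scan(blocks, target):
    """Count blocks whose [min, max] range could contain the target value."""
    return sum(1 for b in blocks if b.min_value <= target <= b.max_value)


unsorted_column = [7, 1, 12, 4, 9, 16, 2, 11, 5, 14, 3, 10, 8, 15, 6, 13]
sorted_column = sorted(unsorted_column)

# Look up the value 8 in both layouts.
unsorted_hits = blocks_to_scan(build_blocks(unsorted_column), 8)
sorted_hits = blocks_to_scan(build_blocks(sorted_column), 8)
print(unsorted_hits, sorted_hits)  # prints: 4 1
```

In the unsorted layout every block's range overlaps the value being searched for, so nothing can be skipped. Once the column is sorted, the ranges no longer overlap and only one block has to be read.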
Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources, loads the data into the desired Data Warehouse such as Redshift, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.
Its completely automated pipeline delivers data in real time from source to destination without any loss. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is easy for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grow, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo transfers modified data in real time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Types of Redshift Sortkeys
One or more columns can be defined as Sort Keys, and the data stored in the table is sorted using these columns. The query optimizer uses this sort order while determining optimal query plans. There are 2 types of Amazon Redshift Sortkeys available:
1) Compound Redshift Sortkeys
A Compound Sort Key is made up of all the columns listed in the Sort Key definition during the creation of the table, in the order in which they are listed. Therefore, it is advisable to put the most frequently used column first in the list. COMPOUND is the default sort type. Compound Sort Keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY.
For example, let’s create a table with a Compound Sort Key on 2 columns.
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id  INTEGER NOT NULL,
    c_name        VARCHAR(100) NOT NULL
)
COMPOUND SORTKEY (c_customer_id, c_country_id);
The data in this table is stored sorted by the columns c_customer_id and c_country_id. Since the column c_customer_id is first in the list, the table is sorted first by c_customer_id and then by c_country_id.
As you can see in Figure 1, if you want to get all country IDs for a customer, you need to access only one block. If you need to get the IDs of all customers with a specific country, you need to access all four blocks. This shows that Compound Sorting cannot optimize both kinds of queries at the same time.
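The four-block scenario described above can be reproduced with a short Python simulation. This is a sketch under stated assumptions rather than a model of Redshift internals: it assumes 4 customers and 4 countries, a hypothetical capacity of 4 rows per block, and a hypothetical `blocks_touched` helper that applies the per-block minimum/maximum check.

```python
from itertools import product

ROWS_PER_BLOCK = 4  # hypothetical block capacity for illustration

# Every (customer_id, country_id) pair for 4 customers and 4 countries.
rows = list(product(range(1, 5), range(1, 5)))

# A compound sort key orders rows by the first column, then by the second.
compound_sorted = sorted(rows, key=lambda r: (r[0], r[1]))

compound_blocks = [
    compound_sorted[i:i + ROWS_PER_BLOCK]
    for i in range(0, len(compound_sorted), ROWS_PER_BLOCK)
]


def blocks_touched(blocks, column, value):
    """Count blocks whose min/max range on `column` includes `value`."""
    count = 0
    for block in blocks:
        col_values = [row[column] for row in block]
        if min(col_values) <= value <= max(col_values):
            count += 1
    return count


by_customer = blocks_touched(compound_blocks, 0, 1)  # filter on customer_id = 1
by_country = blocks_touched(compound_blocks, 1, 1)   # filter on country_id = 1
print(by_customer, by_country)  # prints: 1 4
```

Filtering on the leading column touches a single block, while filtering on the second column touches every block, because each block contains the full range of country IDs.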
2) Interleaved Redshift Sortkeys
Interleaved Sort gives equal weight to each column in the Redshift Sortkeys. As a result, it can significantly improve query performance where the query uses restrictive predicates (equality operator in WHERE clause) on secondary sort columns.
Adding rows to a Sorted Table already containing data affects the performance significantly. VACUUM and ANALYZE operations should be used regularly to re-sort and update the statistical metadata for the query planner. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.
For example, let’s create a table with Interleaved Sort Keys.
CREATE TABLE customer (
    c_customer_id INTEGER NOT NULL,
    c_country_id  INTEGER NOT NULL
)
INTERLEAVED SORTKEY (c_customer_id, c_country_id);
As you can see, the first block stores the first two customer IDs along with the first two country IDs. Therefore, you only need to scan 2 blocks to return data for a given customer or a given country.
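One way to picture this layout is with a Z-order (bit-interleaving) curve. The Python sketch below is only an illustration: it assumes Z-ordering as a stand-in for Redshift's interleaved sort, whose exact internals are not described here, and it reuses the same hypothetical setup of 4 customers, 4 countries, and 4 rows per block. The `interleave_bits` and `blocks_touched` helpers are hypothetical names.

```python
from itertools import product

ROWS_PER_BLOCK = 4  # hypothetical block capacity for illustration


def interleave_bits(x, y, bits=2):
    """Build a Z-order value by alternating the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def blocks_touched(blocks, column, value):
    """Count blocks whose min/max range on `column` includes `value`."""
    count = 0
    for block in blocks:
        col_values = [row[column] for row in block]
        if min(col_values) <= value <= max(col_values):
            count += 1
    return count


rows = list(product(range(1, 5), range(1, 5)))

# Subtract 1 so that IDs 1..4 map onto the 2-bit values 0..3.
interleaved_sorted = sorted(
    rows, key=lambda r: interleave_bits(r[0] - 1, r[1] - 1)
)
interleaved_blocks = [
    interleaved_sorted[i:i + ROWS_PER_BLOCK]
    for i in range(0, len(interleaved_sorted), ROWS_PER_BLOCK)
]

by_customer = blocks_touched(interleaved_blocks, 0, 1)  # customer_id = 1
by_country = blocks_touched(interleaved_blocks, 1, 1)   # country_id = 1
print(interleaved_blocks[0])    # prints: [(1, 1), (1, 2), (2, 1), (2, 2)]
print(by_customer, by_country)  # prints: 2 2
```

Under this assumption, the first block holds customers 1 and 2 together with countries 1 and 2, and a filter on either column touches 2 of the 4 blocks.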
Query performance improves considerably for large tables that use interleaved sorting. If the table contains 1M blocks (1 TB per column) with an Interleaved Sort Key on both customer ID and country ID, you scan about 1K blocks when you filter on a specific customer or country, a speedup of 1000x compared to the unsorted case.
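The arithmetic behind these figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch, and it rests on two assumptions that are labeled in the comments: an unsorted table requires reading every block, and a two-column interleaved key narrows a single-column filter to roughly the square root of the total block count.

```python
import math

total_blocks = 1_000_000  # 1M blocks of 1 MB each, about 1 TB per column

# Assumption: with no sort order, min/max ranges overlap and every block is read.
unsorted_scan = total_blocks

# Assumption: with a two-column interleaved key, a filter on either column
# narrows the scan to roughly sqrt(N) blocks.
interleaved_scan = math.isqrt(total_blocks)

speedup = unsorted_scan // interleaved_scan
print(interleaved_scan, speedup)  # prints: 1000 1000
```

The same square-root relationship matches the small example above, where 4 blocks shrink to 2 blocks per filter.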
Choosing the Ideal Redshift Sortkey
Both types of Redshift Sortkeys have their own uses and advantages. Keep the following points in mind when selecting the right Sort Key:
- Use Interleaved Sort Keys when you plan to use one column as the Sort Key, when the WHERE clauses in your queries have highly selective restrictive predicates, or when the tables are huge. You can check table statistics by querying the STV_BLOCKLIST system table: look for tables with a high number of 1 MB blocks per slice that are distributed over all slices.
- Use Compound Sort Keys when you have more than one column as the Sort Key, when your queries include JOINs, GROUP BY, ORDER BY, or PARTITION BY, or when your table is small.
- Don’t use an Interleaved Sort Key on columns with monotonically increasing attributes, like an identity column, dates, or timestamps.
This is how you can choose the ideal Sort Key in Redshift for your unique data needs.
This article introduced the Amazon Redshift Data Warehouse and Redshift Sortkeys. It also provided a detailed explanation of the 2 types of Redshift Sortkeys, namely Compound Sort Keys and Interleaved Sort Keys, and listed the points to remember while choosing Sort Keys for your Redshift Data Warehouse.
Another way to get optimum query performance from Redshift is to restructure the data from OLTP to OLAP. You can create derived tables by pre-aggregating and joining the data. A Data Integration Platform such as Hevo Data offers Data Modelling and Workflow capabilities to achieve this simply and reliably. Hevo Data offers a faster way to move data from 100+ data sources such as SaaS applications or Databases into your Redshift Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.
Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite first-hand.
Share your experience of using different Redshift Sortkeys in the comments below!