Amazon Redshift is a fully managed, distributed relational database system that can query petabytes of data efficiently. Massively parallel processing, columnar storage, and data compression encodings help it achieve fast query processing. To make the most of this parallelism, it is important to understand how Redshift Distribution Keys (Redshift DIST Keys) place data across the cluster.
Additionally, well-chosen Amazon Redshift sort keys can help you attain faster query times. In this article, we will discuss Amazon Redshift Distribution Keys in detail.
Understanding Redshift Distribution Key (DIST Keys)
A Redshift Distribution Key (DIST Key) determines where a table's rows are stored in Redshift. A cluster stores data across its compute nodes, and each node is divided into slices. Query performance suffers when a large amount of data is stored on a single node.
When a query runs, the query optimizer redistributes rows to the compute nodes as needed to perform joins and aggregations. This redistribution can involve shuffling entire tables across all the nodes, so the fewer rows that have to move, the better.
Uneven distribution of data across compute nodes skews the work each node has to do: some nodes are overloaded while others sit under-utilized. So, the distribution of the data should be uniform. Distribution is defined per table, so you can select a different distribution style for each table in your database.
Types of Distribution Styles
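Amazon Redshift lets you choose among three table distribution styles: EVEN, KEY, and ALL. There is also an AUTO style, used when no style is specified, in which Redshift chooses the distribution for you.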
Even Distribution
With Even Distribution, the leader node of the cluster distributes the rows of a table evenly across all slices using a round-robin approach, regardless of the values in any column. EVEN was the original default style; current Redshift versions default to AUTO when no style is specified.
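To make the round-robin idea concrete, here is a minimal Python sketch. It is a toy model, not Redshift's implementation: the `distribute_even` function, the four-slice cluster, and the sample orders are all assumptions made up for illustration.

```python
from itertools import cycle


def distribute_even(rows, num_slices):
    """Assign each row to a slice in round-robin order, ignoring row content."""
    slices = [[] for _ in range(num_slices)]
    for row, slice_id in zip(rows, cycle(range(num_slices))):
        slices[slice_id].append(row)
    return slices


# Hypothetical data: 10 orders, most of them from the same customer.
orders = [("order-%d" % i, "customer-A" if i < 8 else "customer-B") for i in range(10)]
slices = distribute_even(orders, num_slices=4)
rows_per_slice = [len(s) for s in slices]
print(rows_per_slice)  # [3, 3, 2, 2]
```

Even though one customer dominates the data, the slices stay balanced, because placement never looks at column values. The trade-off is that rows for the same customer are scattered across slices.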
Key Distribution
The leader node distributes rows across slices according to the values in a designated column, the distribution key. All rows with the same value in that column end up on the same slice.
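The sketch below illustrates why this matters for joins. It is a simplified model under stated assumptions: `zlib.crc32` stands in for Redshift's internal hash function (which is not public), and the `customers` and `orders` tables, the helper names, and the slice count are hypothetical.

```python
import zlib


def slice_for(value, num_slices):
    """Map a distribution-key value to a slice with a deterministic hash.

    zlib.crc32 is a stand-in; Redshift's real hash function is internal.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_slices


def distribute_by_key(rows, key_index, num_slices):
    """Place each row on the slice chosen by hashing its key column."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[slice_for(row[key_index], num_slices)].append(row)
    return slices


NUM_SLICES = 4
# Hypothetical tables that join on customer_id.
customers = [(cid, "name-%d" % cid) for cid in range(1, 7)]   # (customer_id, name)
orders = [(oid, oid % 6 + 1) for oid in range(100, 120)]      # (order_id, customer_id)

customer_slices = distribute_by_key(customers, key_index=0, num_slices=NUM_SLICES)
order_slices = distribute_by_key(orders, key_index=1, num_slices=NUM_SLICES)

# Every order sits on the same slice as its customer, so the join is local.
collocated = all(
    any(cust[0] == order[1] for cust in customer_slices[s])
    for s in range(NUM_SLICES)
    for order in order_slices[s]
)
print(collocated)  # True
```

Because both tables are hashed on the same column, matching rows always land on the same slice and no rows need to move between nodes at join time.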
All Distribution
The leader node maintains a full copy of the table on every compute node. Since each node has a local copy of the data, a query does not need to copy this table across the network, which results in faster joins. The downside of ALL is the storage cost: the table takes up space on every node in the cluster, and loading data with the COPY command takes longer because every copy must be written.
Understanding Redshift Sort Keys (SORTKEY)
In Amazon Redshift, Sort Keys (SORTKEYs) define the order in which data is stored on disk within each node. By organizing data based on a specified column or set of columns, SORTKEYs let Amazon Redshift skip data that cannot match a filter, reducing the number of rows scanned during query execution.
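One way to picture how sorting reduces scanning is through per-block minimum and maximum values (Redshift calls this metadata zone maps). The Python sketch below is a rough model, not Redshift's storage engine: the block size of 100 values, the `day_numbers` column, and both helper functions are assumptions chosen for illustration.

```python
import random


def build_blocks(values, block_size):
    """Split values into fixed-size blocks and record each block's (min, max)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the filter range [low, high]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= low and bmin <= high)


random.seed(0)  # fixed seed so the sketch is repeatable
day_numbers = list(range(1000))                          # hypothetical sort-key values
shuffled = random.sample(day_numbers, len(day_numbers))  # same values, unsorted

# Filter: WHERE day_number BETWEEN 250 AND 260
sorted_scan = blocks_to_scan(build_blocks(day_numbers, 100), 250, 260)
unsorted_scan = blocks_to_scan(build_blocks(shuffled, 100), 250, 260)
print(sorted_scan, unsorted_scan)
```

With sorted data, the filter range falls inside a single block, so only that block has to be read. With shuffled data, almost every block's min/max range overlaps the filter, so far more blocks are scanned.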
Types of Sort Keys
- Compound Sort Key
  - A compound sort key sorts data by the listed columns, in the order in which they are specified when the table is created.
  - It is appropriate when queries frequently filter on the leading columns. For example, if you normally filter by date and then by customer_id, date should be the first column of the compound sort key.
  - Best for predictable, fixed query patterns where the filter columns are known in advance.
- Interleaved Sort Key
  - It is suitable for dynamic query patterns where different queries filter on different columns.
  - An interleaved sort key gives equal weight to each column specified, so Redshift can sort and optimize based on multiple columns at once.
  - It is useful for tables where the filtering columns vary significantly between queries, because it spreads the benefit of sorting across all the key columns.
Choosing the Right Distribution Styles
The goal in selecting a table distribution style is to minimize the impact of redistribution by placing the data where it needs to be before the query is executed. Choosing the right key is not as straightforward as it may seem. In fact, setting the wrong DISTKEY can even worsen query performance.
Choose as the DISTKEY a column used in your queries that leads to the least skew. A good choice is a column with many distinct values, such as a timestamp. Avoid columns with few distinct values, such as month of the year or payment card type.
- If the table (e.g. a fact table) is highly denormalized and no JOIN is required, choose the EVEN style.
- Choose the ALL style for small tables that do not change often. For example, a table mapping telephone ISD codes to country names.
- It is beneficial to select KEY distribution if a table is used in JOINs. Also, consider the other tables in the join and their distribution styles, since a join is local only when both tables are distributed on the join column.
- If one node holds a skewed share of the data, processing on that node will be slower, which makes the total query time much longer. A query against a skewed table may take even longer than the same query against a table without a DISTKEY.
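To see why the number of distinct values matters, the following Python sketch compares two candidate key columns for the same table. It is illustrative only: `zlib.crc32` is a stand-in for Redshift's internal hash, and the eight-slice cluster, the 10,000-row table, the column contents, and the `skew_ratio` measure are all assumptions.

```python
import zlib
from collections import Counter


def rows_per_slice(key_values, num_slices):
    """Count rows per slice when rows are hashed on the given key values."""
    counts = Counter(
        zlib.crc32(str(v).encode("utf-8")) % num_slices for v in key_values
    )
    return [counts.get(s, 0) for s in range(num_slices)]


def skew_ratio(counts):
    """Ratio of the fullest slice to the average slice; 1.0 means perfectly even."""
    average = sum(counts) / len(counts)
    return max(counts) / average


NUM_SLICES = 8
NUM_ROWS = 10_000
# Hypothetical candidate columns for the same 10,000-row table.
order_ids = range(NUM_ROWS)  # unique per row
card_types = ["visa"] * 7000 + ["mastercard"] * 2500 + ["amex"] * 500  # 3 distinct values

high_cardinality_skew = skew_ratio(rows_per_slice(order_ids, NUM_SLICES))
low_cardinality_skew = skew_ratio(rows_per_slice(card_types, NUM_SLICES))
print(round(high_cardinality_skew, 2), round(low_cardinality_skew, 2))
```

With only three distinct card types, at most three of the eight slices receive any rows, and the slice holding "visa" carries at least 7,000 of them. The unique order ID spreads rows far more evenly, which is the behavior you want from a DISTKEY.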
DISTKEY vs SORTKEY
| Feature | DISTKEY | SORTKEY |
| --- | --- | --- |
| Purpose | Determines data distribution across nodes | Defines data sorting order within each node |
| Use Case | Optimizes joins and aggregations across tables | Speeds up data retrieval by minimizing scans |
| Impact on Data | Affects data placement across nodes | Affects data order within each node's storage |
| Ideal For | Columns used frequently in joins | Columns used frequently in filtering (WHERE) |
| Best Practice | Use for join columns with many distinct values | Use for columns in repetitive or dynamic filters |
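Both keys are declared when a table is created. As a rough illustration, the Python helper below assembles a CREATE TABLE statement using Redshift's DISTSTYLE, DISTKEY, and SORTKEY clauses. The `create_table_ddl` function and the `sales` table are hypothetical, and the helper does no identifier quoting or validation beyond what is shown.

```python
def create_table_ddl(table, columns, diststyle="AUTO", distkey=None,
                     sortkey=(), sort_style="COMPOUND"):
    """Build a Redshift CREATE TABLE statement with distribution and sort clauses.

    columns is a list of (name, type) pairs. Names are illustrative and are
    not quoted or validated here.
    """
    if diststyle not in {"AUTO", "EVEN", "KEY", "ALL"}:
        raise ValueError("unknown distribution style: %s" % diststyle)
    if (diststyle == "KEY") != (distkey is not None):
        raise ValueError("DISTKEY must be given with, and only with, DISTSTYLE KEY")
    if sort_style not in {"COMPOUND", "INTERLEAVED"}:
        raise ValueError("unknown sort style: %s" % sort_style)

    column_sql = ",\n".join("    %s %s" % (name, ctype) for name, ctype in columns)
    clauses = ["DISTSTYLE %s" % diststyle]
    if distkey is not None:
        clauses.append("DISTKEY (%s)" % distkey)
    if sortkey:
        clauses.append("%s SORTKEY (%s)" % (sort_style, ", ".join(sortkey)))
    return "CREATE TABLE %s (\n%s\n)\n%s;" % (table, column_sql, "\n".join(clauses))


# Hypothetical table: joined on customer_id, filtered by date and then customer.
ddl = create_table_ddl(
    "sales",
    [("sale_date", "DATE"), ("customer_id", "INTEGER"), ("amount", "DECIMAL(12,2)")],
    diststyle="KEY",
    distkey="customer_id",
    sortkey=("sale_date", "customer_id"),
)
print(ddl)
```

Here the join column becomes the DISTKEY and the filter columns form a compound SORTKEY with the date first. Passing `sort_style="INTERLEAVED"` would emit an interleaved sort key instead.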
Conclusion
In summary, DISTKEY and SORTKEY are powerful tools in Amazon Redshift: query performance depends heavily on how data is distributed and how quickly it can be retrieved. With the proper selection of DISTKEY, you can minimize data movement across the nodes, and the proper selection of SORTKEY can accelerate filter queries within each node. Used together, the two keys let you store large data sets efficiently and query them quickly.
Additionally, you could re-structure the data in Redshift from OLTP to OLAP to gain faster query processing time. This can be achieved by creating aggregates and joins, thereby precomputing data for analysis. With a Data Integration Platform like Hevo, you can model your data and define workflows in a simple and reliable manner.
FAQs
1. What are the distribution keys in Redshift?
Distribution keys (DISTKEYs) in Amazon Redshift determine how data is distributed across different compute nodes in a cluster. By selecting a DISTKEY, you can optimize how rows are stored, which helps improve query performance, especially for join operations.
2. Can we have multiple distkeys in Redshift?
No, you cannot have multiple DISTKEYs for a single table in Redshift. Each table can only have one DISTKEY, which controls how the data is distributed across compute nodes.
3. What is the difference between a sort key and a distkey in Redshift?
A DISTKEY determines how data is spread across different nodes in Redshift, while a SORTKEY defines the order of data storage within each node. The DISTKEY focuses on distribution efficiency for joins, whereas the SORTKEY enhances data retrieval speed during queries.
Vivek Sinha has extensive experience in real-time analytics and cloud-native technologies. With a focus on Apache Pinot, he was a driving force in shaping innovation and defensible differentiators, including enhanced query processing, data mutability support, and cost-effective tiered storage solutions at Hevo. He also demonstrates a passion for exploring and implementing innovative trends within the dynamic data industry landscape.