Amazon Redshift is a fully managed, distributed relational database system. It is capable of performing queries efficiently over petabytes of data. High parallel processing, columnar design and data compression encoding schemes help achieve fast query processing. Hence, it is important to understand how to optimize tables to leverage the highly parallel nature of Amazon Redshift by defining Redshift Distribution Keys (Redshift DIST Keys).
One of the crucial factors that can help you do more with your data warehouse is the availability of accurate and consistent data in Redshift in real-time. Ready solutions like Hevo Data Integration Platform (7-day free trial) can help you bring data from a variety of sources (databases, cloud applications, SDKs, File storage, and more) to Redshift in real-time.
Additionally, working on Amazon Redshift sort keys can help you attain faster query performance times.
In this article, we will discuss Amazon Redshift distribution Keys in detail.
Understanding Redshift Distribution Key (DIST Keys)
Redshift Distribution Keys (DIST Keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.
The query optimizer distributes less number of rows to the compute nodes to perform joins and aggregation on query execution. This redistribution of data can include shuffling of the entire tables across all the nodes.
Uneven distribution of data across computing nodes leads to the skewness of the work a node has to do and you don’t want an under-utilised compute node. So the distribution of the data should be uniform. Distribution is per table. So you can select a different distribution style for each of the tables you are going to have in your database.
Types of Distribution Styles
Amazon Redshift supports three kinds of table distribution styles.
This is the default distribution styles of a table. In Even Distribution the Leader node of the cluster distributes the data of a table evenly across all slices, using a round-robin approach.
The data is distributed across slices by the leader node matching the values of a designated column. So all the entries with the same value in the column end up in the same slice.
Leader node maintains a copy of the table on all the computing nodes resulting in more space utilisation. Since all the nodes have a local copy of the data, the query does not require copying data across the network. This results in faster query operations. The negative side of using ALL is that a copy of the table is on every node in the cluster. This takes up too much of space and increases the time taken by Copy command to upload data into Redshift.
Choosing the Right Distribution Styles
The motive in selecting a table distribution style is to minimize the impact of the redistribution by relocating the data where it was prior to the query execution. Choosing the right KEY is not as straightforward as it may seem. In fact, setting wrong DISTKEY can even worsen the query performance.
Choose columns used in the query that leads to least skewness as the DISTKEY. The good choice is the column with maximum distinct values, such as the timestamp. Avoid columns with few distinct values, such as months of the year, payment card types.
- If the table(e.g. fact table) is highly de-normalised and no JOIN is required, choose the EVEN style.
- Choose ALL style for small tables that do not often change. For example, a table containing telephone ISD codes against the country name.
- It is beneficial to select a KEY distribution if a table is used in JOINS. Also, consider the other joining tables and their distribution style.
- If one particular node contains the skew data, the processing on this node will be slower. This results in much longer total query processing time. This query under skewed configuration may take even longer than the query made against the table without a DISTKEY
We have also talked about Redshift Sort Key here and how to choose the right sort style to optimise your AWS Redshift performance.
Additionally, you could re-structure the data in Redshift from OLTP to OLAP to gain faster query processing time. This can be achieved by creating aggregates and joins thereby precomputing data for analysis. With a Data Integration Platform like Hevo, you can model your data and define workflows in a simple and reliable manner. You can use Hevo for – 14-Day Free Trial.