Amazon Redshift is a fully managed, distributed relational database system. It is capable of performing queries efficiently over petabytes of data. High parallel processing, columnar design and data compression encoding schemes help achieve fast query processing.
We will see how to optimise tables to leverage the highly parallel nature of Amazon Redshift by defining Redshift Distribution Keys.
Understanding Redshift Distribution Keys
Redshift Distribution Keys (DIST Keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.
The query optimiser distributes less number of rows to the compute nodes to perform joins and aggregation on query execution. This redistribution of data can include shuffling of the entire tables across all the nodes.
Uneven distribution of data across computing nodes leads to the skewness of the work a node has to do and you don’t want an under-utilised compute node. So the distribution of the data should be uniform. Distribution is per table. So you can select a different distribution style for each of the tables you are going to have in your database.
Types of Distribution Styles
Amazon Redshift supports three kinds of table distribution styles.
- Even distribution: This is the default distribution styles of a table. In Even Distribution the Leader node of the cluster distributes the data of a table evenly across all slices, using a round robin approach.
- Key distribution: The data is distributed across slices by the leader node matching the values of a designated column. So all the entries with the same value in the column end up in the same slice.
- All distribution: Leader node maintains a copy of the table on all the computing nodes resulting in more space utilisation. Since all the nodes have a local copy of the data, the query does not require copying data across the network. This results in faster query operations. The negative side of using ALL is that a copy of the table is on every node in the cluster. This takes up too much of space and increases the time taken by Copy command to upload data into Redshift.
Choosing the right Distribution Styles
The motive in selecting a table distribution style is to minimise the impact of the redistribution by relocating the data where it was prior to the query execution. Choosing the right KEY is not as straightforward as it may seem. In fact, setting wrong DISTKEY can even worsen the query performance.
Choose columns used in the query that leads to least skewness as the DISTKEY. The good choice is the column with maximum distinct values, such as the timestamp. Avoid columns with few distinct values, such as months of the year, payment card types.
- If the table(e.g. fact table) is highly de-normalised and no JOIN is required, choose the EVEN style.
- Choose ALL style for small tables that do not often change. For example, a table containing telephone ISD codes against the country name.
- It is beneficial to select a KEY distribution if a table is used in JOINS. Also, consider the other joining tables and their distribution style.
- If one particular node contains the skew data, the processing on this node will be slower. This results in much longer total query processing time. This query under skewed configuration may take even longer than the query made against the table without a DISTKEY
Amazon Redshift – Integral component at Hevo
At Hevo, we provide our customers with the capability to bring all their data to Amazon Redshift ( also BigQuery, SnowFlake), in real-time. Source data could be from – databases, cloud applications, clickstreams etc. Do let us know your experience with using different distribution styles in the comments below.