Redshift is a columnar database storage system that employs a unique methodology for index creation. Other database management systems normally use indexes to improve query performance. But, Redshift does not enforce the use of traditional indexes. As, in the Redshift document, they are termed “informational only.”
In Redshift, a user chooses between the primary and foreign key Redshift indexes — DISKEY, SORTKEY, and Column Compression Encoding — which are amongst the best-recommended practices.
In this article, we will talk about the basics of Redshift Architecture because the concept keeps on repeating. And then, we will talk about recommended Redshift Indexes and the best practices for their usage.
Redshift Architecture: Basics
Amazon Redshift architecture is based on an extensive communication channel network between the client application and the data warehouse clusters. These two communicate using the industry-standard JDBC and ODBC drivers for PostgreSQL. The Data warehouse cluster has a leader node, compute node, and node slices — these are the core infrastructure components of Redshift’s architecture.
Leader Node: The leader node communicates with the client application and the compute nodes. It resolves, develops, and executes database operations necessary to obtain results — for multiple, complex-in-nature queries.
Compute Node: A compute node consists of multiple node slices. Each compute node has its CPU and attached disk storage defined. It performs computational work in a cluster. The user can select from various node types, providing an optimum experience based on the user’s computation, storage, and business requirements.
Node Slice: A group of slices makes a node slice, making it the logical partition of a disk. Each slice has its own allocated share of the node’s memory and disk share. And the number of slices per node depends on the node’s size.
Use Hevo’s no-code data pipeline platform helps to load data from any data source to your desired destination. Hevo not only loads the data onto the desired Data Warehouse/destination like Amazon Redshift but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.
Why Hevo?
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
Discover why Hevo is rated as 4.3 on G2, pointing toward the intensity and effort toward customer satisfaction and powerful performance. Try out a 14-day free trial for seamless data integration.
Get Started with Hevo for Free
About Redshift Indexes: DISTKEY, SORTKEY, and Column Compression Encoding
In this section, we’ll specifically talk about three table constraints, i.e., DISTKEY, SORTKEY, and Define Column Compression Encoding. These table constraints are important to ensure and calibrate processes’ performance.
How to select a Sort Key?
In general, sorting makes engaging with rage-restricted predicates seamless. This property can be used in scanning petabytes of data in table format. Min and Max values can actually make scanning a large number of blocks efficient by skipping over a large number of blocks.
For example, if a table store ten years of data sorted by years and months and a query is placed with a specified date range of 2 months, you can eliminate more than 99% of the disk blocks from the scan. And understandably, if the data is not sorted, more disk blocks have to be scanned.
Nevertheless, you can select sort keys on the basis of the following points:
- If the most recent data is queried the most, then you must choose the timestamp column as the default leading column.
- If your use case requires you to filter information based on a range of values, that column should be the default leading column.
Below given are some examples of defining sort keys:
-- sale_date is the timestamp column
CREATE TABLE sales (
sale_id BIGINT NOT NULL PRIMARY KEY,
sale_date timestamp NOT NULL SORTKEY,
... <other colums>
);
-- use the SORTKEY table attribute keyword to create a multi-column sort key
-- In this case searches are done frequently by the location columns,
-- so state and city are part of sort key
CREATE TABLE dim_customers (
... <some columns>...
state VARCHAR,
city VARCHAR
)
SORTKEY (state, city);
Redshift Recommended Distribution Styles
DISKEY is the keyword that defines the data distribution style in Amazon Redshift Cluster. Redshift allocates (distribute) the data in the format of rows and columns to compute nodes according to the distribution style of the table selected. The Redshift query optimizer is responsible for redistributing rows to the compute nodes when you run a query. The redistribution is needed to perform joins and aggregations. The goal is to locate the required data before the query is active to minimize the impact of redistribution.
Listed below are the three distribution styles for Redshift Indexes:
1. EVEN Distribution: Even Distribution is the default setting using the generic round-robin method to distribute data evenly across the nodes. EVEN Distribution has its use case when tables are not used in queries with joins or in a scenario where there is no absolute choice of the distribution method chosen between the next two.
2. KEY Distribution: The values in columns are leveraged to determine the row distribution type in KEY Distribution. It’s then when Redshift will try attempting to place the matched values on the same node slice. KEY Distribution can be used for tables that are normally used together.
3. ALL Distribution: Using ALL Distribution will make the copy of an entire table to be stored on each node. This distribution style is normally used to improve execution efficiency, as for certain dimension tables KEY Distribution is not recommended.
In Redshift Indexes, the Distribution of a table is defined using the DISTSTYLE and/or DISTKEY.
-- Specifying a column as DISTKEY automatically sets distribution style to KEY
CREATE TABLE sales (
sale_id BIGINT NOT NULL PRIMARY KEY,
sale_date timestamp NOT NULL SORTKEY,
customer_id int DISTKEY,
amount float
);
-- Use DISTSTYLE table attribute to set it to ALL
CREATE TABLE atrribute_lookup (
attribute_id INT NOT NULL PRIMARY KEY,
attribute_name VARCHAR
)
DISTSTYLE ALL;
Column Compressing Encoding
The column compressing encoding reduces the disk I/O and substantially improves the query performance and speed. Although Redshift Indexes chooses compression based on its internally qualified conditions, you can still specify the compression per column.
List of the compressions that Redshift Indexes uses is as follows:
- By default, Redshift indexes assign RAW compression to all columns in the temporary tables, sort key defined tables, BOOLEAN, REAL, and DOUBLE PRECISION columns.
- In all other cases, Redshift indexes assign LZO compression.
Conclusion
In this tutorial article, we discussed in detail how to work our way through Redshift Indexes and talked, in brief, about Redshift Architecture. If you want to learn about table constraints, tables, data distribution, sort keys, and compression encoding below cited Redshift Documents can help a great deal.
- Defining table constraints
- CREATE TABLE
- Working with data distribution styles
- Working with sort keys
- Compression encodings
Nevertheless, if you are a data folk who needs a one-stop solution for data management and ETL-related tasks, including making Redshift Indexes hassle-free — Hevo has got your back!
Hevo Data is a no-code data pipeline platform that helps new-age businesses integrate their data from multiple source systems to a data warehouse and plug this unified data into any BI tool or Business Application. Try a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also, checkout our unbeatable pricing to choose the best plan for your organization.
FAQ
1. Does AWS Redshift have indexes?
No, AWS Redshift does not support traditional indexing as seen in other databases; instead, it uses a columnar storage model and sorts data based on specified sort keys.
2. Why doesn’t Redshift have indexing?
Redshift does not use traditional indexing because its architecture is optimized for large-scale data warehousing and analytics, relying on sorting and compression rather than indexes to enhance query performance.
3. Does Redshift support secondary indexes?
No, Redshift does not support secondary indexes; it primarily uses sort keys and distribution styles to optimize query performance and manage data distribution across nodes.
Yash is a Content Marketing professional with over three years of experience in data-driven marketing campaigns. He has expertise in strategic thinking, integrated marketing, and customer acquisition. Through comprehensive marketing communications and innovative digital strategies, he has driven growth for startups and established brands.