How to Create Redshift Indexes: Table Constraints Simplified

on Amazon Redshift, Data Warehouses, redshift architecture, Redshift Functions • January 21st, 2022 • Write for Hevo

How to Create Redshift Indexes: Table Constraints Simplified | Cover Image

Redshift is a columnar database storage system that employs a unique methodology for index creation. Other database management systems normally use indexes to improve query performance. But, Redshift does not enforce the use of traditional indexes. As, in the Redshift document, they are termed “informational only.”

In Redshift, a user chooses between the primary and foreign key Redshift indexes — DISKEY, SORTKEY, and Column Compression Encoding — which are amongst the best-recommended practices.

It’s recommended to use primary and foreign key indexes in the data loading process — the ETL process — as hints to protect their integrity. For example, do not create the unique key constraint if you are not sure the table’s product name is not unique. Nonetheless, if you are confident, create the unique key constraint such that the database engine can execute queries for the same.

In this article, we will talk about the basics of Redshift Architecture because the concept keeps on repeating. And then, we will talk about recommended Redshift Indexes and the best practices for their usage.

Table of Contents

  1. Redshift Architecture: Basics
  2. About Redshift Indexes: DISTKEY, SORTKEY, & Column Compression Encoding
  3. Conclusion

Redshift Architecture: Basics

Amazon Redshift architecture is based on an extensive communication channel network between the client application and the data warehouse clusters. These two communicate using the industry-standard JDBC and ODBC drivers for PostgreSQL. The Data warehouse cluster has a leader node, compute node, and node slices — these are the core infrastructure components of Redshift’s architecture.

Redshift Architecture: Basics | How to Create Redshift Indexes: Table Constraints Simplified
 Redshift Architecture  

Leader Node: The leader node communicates with the client application and the compute nodes. It resolves, develops, and executes database operations necessary to obtain results — for multiple, complex-in-nature queries.

Compute Node: A compute node consists of multiple node slices. Each compute node has its CPU and attached disk storage defined. It performs computational work in a cluster. The user can select from various node types, providing an optimum experience based on the user’s computation, storage, and business requirements.

Node Slice: A group of slices makes a node slice, making it the logical partition of a disk. Each slice has its own allocated share of the node’s memory and disk share. And the number of slices per node depends on the node’s size.

Simplify Data Analysis with Hevo’s No-code Data Pipeline!

Hevo Data, a No-code Data Pipeline, helps you directly transfer data from 100+ data sources to Data Warehouses, BI tools, or a destination of your choice in a completely hassle-free & automated manner.

Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready format without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Get Started with Hevo for Free

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

About Redshift Indexes: DISTKEY, SORTKEY, and Column Compression Encoding

In this section, we’ll specifically talk about three table constraints, i.e., DISTKEY, SORTKEY, and Define Column Compression Encoding. These table constraints are important to ensure and calibrate processes’ performance.

How to select a Sort Key?

By default, when you create a table in Redshift, columns (one or more) can be alternatively defined as sort keys, rows of data are stored in sorted order. Moreover, the information regarding the sort key columns is passed on to the query planner to leverage how data has been stored for construct processes.

Note: Columnar data in Amazon Redshift is stored in 1 MB disk blocks. And the min and the max value for each block are stored as part of the metadata.

In general, sorting makes engaging with rage-restricted predicates seamless. This property can be used in scanning petabytes of data in table format. Min and Max values can actually make scanning a large number of blocks efficient by skipping over a large number of blocks. For example, if a table store ten years of data sorted by years and months and a query is placed with a specified date range of 2 months, you can eliminate more than 99% of the disk blocks from the scan. And understandably, if the data is not sorted, more disk blocks have to be scanned. Learn more about Working with sort keys

Nevertheless, you can select sort keys on the basis of the following points:

  • If the most recent data is queried the most, then you must choose the timestamp column as the default leading column.
  • If your use case requires you to filter information based on a range of values, that column should be the default leading column.

Below given are some examples of defining sort keys:

-- sale_date is the timestamp column
CREATE TABLE sales (
  sale_id BIGINT NOT NULL PRIMARY KEY,
  sale_date timestamp NOT NULL SORTKEY,
  ... <other colums>
);

-- use the SORTKEY table attribute keyword to create a multi-column sort key
-- In this case searches are done frequently by the location columns,
-- so state and city are part of sort key
CREATE TABLE dim_customers (
  ... <some columns>...
  state VARCHAR,
  city VARCHAR
)
SORTKEY (state, city);

Redshift Recommended Distribution Styles

DISKEY is the keyword that defines the data distribution style in Amazon Redshift Cluster. Redshift allocates (distribute) the data in the format of rows and columns to compute nodes according to the distribution style of the table selected. The Redshift query optimizer is responsible for redistributing rows to the compute nodes when you run a query. The redistribution is needed to perform joins and aggregations. The goal is to locate the required data before the query is active to minimize the impact of redistribution.

Listed below are the three distribution styles for Redshift Indexes:

EVEN Distribution: Even Distribution is the default setting using the generic round-robin method to distribute data evenly across the nodes. EVEN Distribution has its use case when tables are not used in queries with joins or in a scenario where there is no absolute choice of the distribution method chosen between the next two.

KEY Distribution: The values in columns are leveraged to determine the row distribution type in KEY Distribution. It’s then when Redshift will try attempting to place the matched values on the same node slice. KEY Distribution can be used for tables that are normally used together. If that is the case, Redshift can combine the tables with similar columns to perform better efficiency. 

ALL Distribution: Using ALL Distribution will make the copy of an entire table to be stored on each node. This distribution style is normally used to improve execution efficiency, as for certain dimension tables KEY Distribution is not recommended.

In Redshift Indexes, the Distribution of a table is defined using the DISTSTYLE and/or DISTKEY.

-- Specifying a column as DISTKEY automatically sets distribution style to KEY
CREATE TABLE sales (
  sale_id BIGINT NOT NULL PRIMARY KEY,
  sale_date timestamp NOT NULL SORTKEY,
  customer_id int DISTKEY,
  amount float
);

-- Use DISTSTYLE table attribute to set it to ALL
CREATE TABLE atrribute_lookup (
  attribute_id INT NOT NULL PRIMARY KEY,
  attribute_name VARCHAR
)
DISTSTYLE ALL;

Column Compressing Encoding

The column compressing encoding reduces the disk I/O and substantially improves the query performance and speed. Although Redshift Indexes chooses compression based on its internally qualified conditions, you can still specify the compression per column. You can explore Redshift documents if you want to learn more about compression encoding.

List of the compressions that Redshift Indexes uses is as follows:

  • By default, Redshift indexes assign RAW compression to all columns in the temporary tables, sort key defined tables, BOOLEAN, REAL, and DOUBLE PRECISION columns.
  • In all other cases, Redshift indexes assign LZO compression.

Conclusion

In this tutorial article, we discussed in detail how to work our way through Redshift Indexes and talked, in brief, about Redshift Architecture. If you want to learn about table constraints, tables, data distribution, sort keys, and compression encoding below cited Redshift Documents can help a great deal.

  1. Defining table constraints
  2. CREATE TABLE
  3. Working with data distribution styles
  4. Working with sort keys
  5. Compression encodings

Nevertheless, if you are a data folk who needs a one-stop solution for data management and ETL-related tasks, including making Redshift Indexes hassle-free — Hevo has got your back!

Hevo Data is a no-code data pipeline platform that helps new-age businesses integrate their data from multiple sources systems to a data warehouse and plug this unified data into any BI tool or Business Application. The platform provides 100+ ready-to-use integrations with a range of data sources and is trusted by hundreds of data-driven organizations from 30+ countries.

Visit our Website to Explore Hevo
Hevo Product Video

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

No-code Data Pipeline for Amazon Redshift