Understanding Amazon Redshift Architecture

Amazon Redshift is a fully managed, highly scalable data warehouse service in AWS. You can start using Redshift with just a few gigabytes of data and scale to petabytes or more. In this article, we will talk about the Amazon Redshift architecture and its components at a high level.

Amazon Redshift Architecture

Redshift is designed to run as a cluster. A typical Redshift Cluster has two or more Compute Nodes, which are coordinated through a Leader Node. All client applications communicate with the cluster through the Leader Node.

Leader Node

The Leader Node in a Redshift Cluster manages all external and internal communication. It is responsible for preparing a query execution plan whenever a query is submitted to the cluster. Once the plan is ready, the Leader Node distributes the query execution code to the compute nodes and assigns slices of data to each compute node for computation of results.

The Leader Node distributes query load to the compute nodes only when the query involves accessing data stored on the compute nodes. Otherwise, the query is executed on the Leader Node itself. Several functions in the Redshift architecture are always executed on the Leader Node. You can read SQL Functions Supported on the Leader Node for more information on these functions.

Compute Nodes

Compute Nodes store the data and are responsible for the actual execution of queries. They execute the query code and return intermediate results to the Leader Node, which then aggregates them into the final result.
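The division of work described above can be sketched as a small scatter-gather model. This is only an illustration and not Redshift's actual implementation: the node data, the function names `compute_node_partial` and `leader_average`, and the use of threads to stand in for separate machines are all assumptions made for the example. Each "compute node" returns a partial (sum, count) per region, and the "leader" merges those partials to produce an average.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows held by three compute nodes: (region, sale_amount).
NODE_DATA = [
    [("east", 10.0), ("west", 4.0)],
    [("east", 6.0)],
    [("west", 8.0), ("west", 12.0)],
]


def compute_node_partial(rows):
    """Runs on each compute node: per-region (sum, count) over local rows only."""
    partial = {}
    for region, amount in rows:
        total, count = partial.get(region, (0.0, 0))
        partial[region] = (total + amount, count + 1)
    return partial


def leader_average(node_data):
    """Runs on the leader: fan the work out, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(compute_node_partial, node_data))
    merged = {}
    for partial in partials:
        for region, (total, count) in partial.items():
            m_total, m_count = merged.get(region, (0.0, 0))
            merged[region] = (m_total + total, m_count + count)
    return {region: total / count for region, (total, count) in merged.items()}


averages = leader_average(NODE_DATA)
print(averages)
```

Returning (sum, count) pairs rather than per-node averages matters here: averages of averages would be wrong when nodes hold different numbers of rows, so the intermediate result has to be something the leader can combine exactly.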

There are two types of Compute Nodes available in Redshift architecture:

  • Dense Storage (DS) – Dense Storage nodes allow you to create large data warehouses using Hard Disk Drives (HDDs) at a low price point.
  • Dense Compute (DC) – Dense Compute nodes allow you to create high performance data warehouses using Solid-State Drives (SSDs).

The diagram below shows in more detail how responsibilities are divided between the Leader and Compute Nodes:

Figure: Redshift Architecture - Leader and Compute Nodes

Node slices

A compute node consists of slices. Each slice is assigned a portion of the Compute Node’s memory and disk, which it uses to perform query operations. The Leader Node is responsible for assigning query code and data to each slice for execution. Once assigned their share of the query load, the slices work in parallel to generate query results.

Data is distributed among the Slices on the basis of Distribution Style and Distribution Key of a particular table. An even distribution of data enables Redshift to assign workload evenly to slices and maximises the benefit of parallel processing.
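To see why the choice of Distribution Key matters, here is a toy model of key-based distribution. It is a sketch under stated assumptions: Redshift's real hash function is not shown in this article, so CRC32 is used purely as a stand-in, and the slice count of 4 and the helper names `slice_for` and `rows_per_slice` are hypothetical.

```python
import zlib
from collections import Counter

NUM_SLICES = 4  # assumed slice count for the whole cluster


def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution key value to a slice with a stand-in hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def rows_per_slice(keys, num_slices=NUM_SLICES):
    """Count how many rows land on each slice for the given key values."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    return [counts.get(s, 0) for s in range(num_slices)]


# High-cardinality key: 10,000 distinct order ids.
by_order_id = rows_per_slice(range(10_000))

# Low-cardinality key: every row carries one of only two country codes.
by_country = rows_per_slice(["US" if i % 10 else "IN" for i in range(10_000)])

print("order_id:", by_order_id)
print("country: ", by_country)
```

In this model the two-value key can occupy at most two of the four slices no matter how many rows there are, and one slice ends up with at least 9,000 of the 10,000 rows. The order id key has 10,000 distinct values to spread across the slices, which is the kind of key that gives parallel processing a chance to help.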

The number of slices per Compute Node depends on the type of node. You can find more information on this in About Clusters and Nodes.

Massively parallel processing (MPP)

Redshift’s architecture allows it to use massively parallel processing (MPP) to run even the most complex queries quickly over huge amounts of data. Multiple compute nodes execute the same query code on portions of data to maximise parallel processing.

Columnar Data Storage

Data in Redshift is stored in a columnar fashion, which drastically reduces disk I/O. Because a query reads only the columns it references, columnar storage reduces the number of disk I/O requests and minimises the amount of data loaded into memory. Less I/O speeds up query execution, and loading less data means Redshift can perform more in-memory processing.
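A back-of-the-envelope calculation makes the I/O saving concrete. The table, column widths, and row count below are invented for illustration, and the model ignores compression, block sizes, and every other real-world detail; it only compares reading whole rows with reading two columns.

```python
# Assumed column widths in bytes for a hypothetical 'orders' table.
COLUMN_WIDTHS = {
    "order_id": 8,
    "customer_id": 8,
    "order_date": 4,
    "status": 16,
    "shipping_address": 200,
    "amount": 8,
}


def bytes_scanned_row_store(num_rows, widths):
    """A row store reads every column of every row it scans."""
    return num_rows * sum(widths.values())


def bytes_scanned_column_store(num_rows, widths, needed_columns):
    """A column store reads only the columns the query references."""
    return num_rows * sum(widths[c] for c in needed_columns)


ROWS = 1_000_000
# Columns touched by a query that sums 'amount' filtered on 'order_date'.
needed = ["order_date", "amount"]

row_bytes = bytes_scanned_row_store(ROWS, COLUMN_WIDTHS)
col_bytes = bytes_scanned_column_store(ROWS, COLUMN_WIDTHS, needed)

print(f"row store:    {row_bytes:,} bytes")
print(f"column store: {col_bytes:,} bytes")
```

With these made-up widths the row layout scans 244 bytes per row and the columnar layout 12 bytes per row, roughly a twentyfold difference. The wider the table and the fewer columns a query touches, the larger this gap becomes.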

Redshift uses Sort Keys to store a table’s rows in sorted order, which lets it skip chunks of data that cannot match a query’s filters. You can read more about Sort Keys in our post on Choosing the best Sort Keys.
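The effect of sorting on data skipping can be modelled with per-block minimum and maximum values. The sketch below is a simplification: the block size of 100 rows and the function names `build_zone_map` and `blocks_to_read` are assumptions chosen for the example, not Redshift internals. It counts how many blocks a range filter has to touch when the column is sorted versus shuffled.

```python
import random

BLOCK_ROWS = 100  # assumed rows per block, chosen only for illustration


def build_zone_map(values, block_rows=BLOCK_ROWS):
    """Record the (min, max) of each fixed-size block of a column."""
    blocks = [values[i:i + block_rows] for i in range(0, len(values), block_rows)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_read(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the filter low <= v <= high."""
    return sum(1 for lo, hi in zone_map if hi >= low and lo <= high)


random.seed(0)
day_numbers = list(range(10_000))  # a date-like column stored in sorted order
shuffled = random.sample(day_numbers, len(day_numbers))  # same values, unsorted

sorted_reads = blocks_to_read(build_zone_map(day_numbers), 2_000, 2_499)
unsorted_reads = blocks_to_read(build_zone_map(shuffled), 2_000, 2_499)

print("blocks read when sorted:  ", sorted_reads)
print("blocks read when unsorted:", unsorted_reads)
```

When the column is sorted, the 500 matching values sit in 5 consecutive blocks and the rest can be skipped. When the same values are shuffled, almost every block's range overlaps the filter, so almost nothing can be skipped.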

Data compression

Data compression is an important factor in query performance. It reduces the storage footprint and allows large amounts of data to be loaded into memory quickly. Owing to columnar data storage, Redshift can use adaptive compression encodings matched to each column’s data type. Read more about using compression encodings in Compression Encodings in Redshift.
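As a concrete example of why columns compress well, here is a minimal run-length encoding sketch. Run-length is one of the encodings Redshift offers, but the code below is a generic illustration of the idea and not Redshift's on-disk format; the sample column is made up.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in pairs for _ in range(length)]


# A low-cardinality column, as it might look after sorting by status.
status_column = ["shipped"] * 5 + ["pending"] * 3 + ["returned"] * 2

encoded = run_length_encode(status_column)
print(encoded)
```

Ten stored values shrink to three pairs here. The same trick would do little for a row-oriented layout, where neighbouring values belong to different columns and rarely repeat.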

Query Optimizer

Redshift’s Query Optimizer generates query plans that are MPP-aware and take advantage of Columnar Data Storage. It uses the statistics gathered by analyzing tables to generate efficient query plans for execution. Read more about Analyze to learn how to get the most out of the Query Optimizer.

We, at Hevo, are building an ETL solution which can help bring your data from various sources to Redshift in real time. You can reach out to us if you need help setting up your Redshift clusters or connecting your data sources to your Redshift instance.

  • Shobhit Singh

    We have a large team of analysts that will simultaneously use Redshift. I was wondering how many queries we can run on a Redshift cluster. Will more people querying Redshift slow down the queries?

    • Sourabh

      Thanks, Shobhit, for the question. There are 2 parts to your question.

      How many queries can we run on a Redshift cluster?

      Queries in Amazon Redshift are routed to query queues. Each query queue contains a number of query slots and can run up to 50 queries concurrently.
      Amazon Redshift allocates, by default, an equal, fixed share of available memory to each queue, and an equal, fixed share of a queue’s memory to each query slot in the queue.
      As a best practice, AWS recommends using a concurrency level of 15 or lower.

      Will more people querying Redshift slow down the queries?

      If you are running small queries that have to wait behind long-running queries, it is advisable to create a separate queue with a higher concurrency level and assign the smaller queries to that queue.
      A queue with a higher concurrency level has less memory allocated to each query slot, but the smaller queries require less memory.
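The trade-off between concurrency and memory per slot follows from the default split described above (an equal share of memory per queue, then an equal share per slot). The arithmetic can be sketched as follows; the total memory figure and the function name `memory_per_slot_mb` are hypothetical and do not correspond to any real node type.

```python
def memory_per_slot_mb(total_memory_mb, num_queues, concurrency):
    """Default split described above: equal share per queue, then per slot."""
    queue_memory = total_memory_mb / num_queues
    return queue_memory / concurrency


TOTAL_MB = 60_000  # hypothetical memory available to WLM, not a real node spec

# Two queues: one for long-running queries, one for many short ones.
long_queue_slot = memory_per_slot_mb(TOTAL_MB, num_queues=2, concurrency=5)
short_queue_slot = memory_per_slot_mb(TOTAL_MB, num_queues=2, concurrency=15)

print(f"long-running queue: {long_queue_slot:,.0f} MB per slot")
print(f"short-query queue:  {short_queue_slot:,.0f} MB per slot")
```

Under these assumptions, tripling the concurrency level cuts the memory available to each query in that queue to a third, which is acceptable only when the queries routed there are small.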

      If you have multiple queries that each access data on a single slice, you should set up a separate WLM queue to execute those queries concurrently. Amazon Redshift assigns concurrent queries to separate slices, which allows multiple queries to execute in parallel on multiple slices. You can read more about it in the official AWS doc here – http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html

  • Lalit Prakash

    Hi Sourabh, we are looking to use Redshift for taking daily snapshots of our transaction database by copying the tables through a nightly job. Does Redshift automatically reclaim space when we delete or update old rows? Or will we have to run vacuum after every dump?