Amazon Redshift is a fully managed highly scalable data warehouse service in AWS. You can start using Redshift with even a few GigaBytes of data and scale it to PetaBytes or more. In this article, we will talk about Amazon Redshift architecture and its components, at high level.
Redshift is meant to work in a Cluster formation. A typical Redshift Cluster has two or more Compute Nodes which are coordinated through a Leader Node. All client applications communicate with the cluster through the Leader Node.
The Leader Node in a Redshift Cluster manages all external and internal communication. It is responsible for preparing query execution plans whenever a query is submitted to the cluster. Once the query execution plan is ready, the Leader Node distributes query execution code on the compute nodes and assigns slices of data to each compute node for computation of results.
Leader Node distributes query load to compute node only when the query involves accessing data stored on the compute nodes. Otherwise, the query is executed on the Leader Node itself. There are several functions in Redshift architecture which are always executed on the Leader Node. You can read SQL Functions Supported on the Leader Node for more information on these functions.
Compute Nodes are responsible for actual execution of queries and have data stored with them. They execute queries and return intermediate results to the Leader Node which further aggregates the results.
There are two types of Compute Nodes available in Redshift architecture:
- Dense Storage (DS) – Dense Storage nodes allow you to create large data warehouses using Hard Disk Drives (HDDs) for a low price point.
- Dense Compute (DC) – Dense Compute nodes allow you to create high performance data warehouses using Solid-State Drives (SSDs).
A more detailed explanation of how responsibilities are divided among Leader and Compute Nodes is depicted in below diagram:
A compute node consist of slices. Each Slice has a portion of Compute Node’s memory and disk assigned to it where it performs Query Operations. The Leader Node is responsible for assigning a Query code and data to a slice for execution. Slices once assigned query load work in parallel to generate query results.
Data is distributed among the Slices on the basis of Distribution Style and Distribution Key of a particular table. An even distribution of data enables Redshift to assign workload evenly to slices and maximises the benefit of parallel processing.
Number of Slices per Compute Node is decided on the basis of the type of node. You can find more information on this in About Clusters and Nodes.
Massively parallel processing (MPP)
Redshift architecture allows it use Massively parallel processing (MPP) for fast processing even for the most complex queries and a huge amount of data. Multiple compute nodes execute the same query code on portions of data to maximise parallel processing.
Columnar Data Storage
Data in Redshift is stored in a columnar fashion which drastically reduces the I/O on disks. Columnar storage reduces the number of disk I/O requests and minimises the amount of data loaded into the memory to execute a query. Reduction in I/O speeds up query execution and loading less data means Redshift can perform more in-memory processing.
Redshift uses Sort Keys to sort columns and filter out chunks of data while executing queries. You can read more about Sort Keys in our post on Choosing the best Sort Keys
Data compression is one of the important factors in ensuring query performance. It reduces storage footprint and enables loading of large amounts of data in the memory fast. Owing to Columnar data storage, Redshift can use adaptive compression encoding depending on the column data type. Read more about using compression encodings in Compression Encodings in Redshift.
Redshift’s Query Optimizer generate query plans that are MPP-aware and takes advantage of Columnar Data Storage. Query Optimizer uses analyzed information about tables to generate efficient query plans for execution. Read more about Analyze to know how to make the best of Query Optimizer.
We, at Hevo, are building an ETL solution which can help bring your data from various sources to Redshift in real time. You can reach out to us if you need help in setting up your Redshift clusters or connecting your data sources to Redshift instance.