Amazon Redshift is a fully managed, highly scalable Data Warehouse service on AWS. You can start using Redshift with just a few gigabytes of data and scale it to petabytes or more. Amazon Redshift offers numerous benefits to its users, and its unique architecture plays a huge role in delivering them.
This article will introduce you to Amazon Redshift and will explain in detail the Amazon Redshift Architecture and its various components. Furthermore, the article will discuss the benefits of using Amazon Redshift as your Data Warehouse. Read along to learn more about this Data Warehouse and its Architecture!
What is Amazon Redshift?
Amazon Redshift is a Massively Parallel Processing (MPP) Data Warehouse owned and supported by Amazon Web Services (AWS). It can handle large volumes of data and demanding workloads while delivering high performance even on very large datasets. These features, coupled with its competitive pricing, have made it one of the preferred Data Warehouses among modern data teams.
Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 150+ data sources and loads the data onto the desired Data Warehouse, like Redshift, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.
Sign up here for a 14-Day Free Trial!
Understanding the Amazon Redshift Architecture Components
Redshift is designed to work in a Cluster formation. A typical Redshift Cluster has two or more Compute Nodes that are coordinated through a Leader Node. All client applications communicate with the Cluster only through the Leader Node. This Architecture can be broken down into the following components:
1: Leader Node
The Leader Node in an Amazon Redshift Cluster manages all external and internal communication. It is responsible for preparing query execution plans whenever a query is submitted to the Cluster. Once the query execution plan is ready, the Leader Node distributes the query execution code to the Compute Nodes and assigns Slices of data to each Compute Node for computation of results.
The Leader Node distributes the query load to the Compute Nodes only when the query involves accessing data stored on the Compute Nodes. Otherwise, the query is executed on the Leader Node itself. There are several SQL functions in the Redshift Architecture that are always executed on the Leader Node.
You can read SQL Functions Supported on the Leader Node for more information.
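For illustration, here is a minimal Python sketch (using the psycopg2 driver) that inspects the plan the Leader Node prepares and calls one Leader-Node-only function. The cluster endpoint, credentials, and the sales table are placeholders, not details from any real cluster.

```python
# Minimal sketch: connect to a (hypothetical) Redshift cluster and look at what
# the Leader Node does. All connection details below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",
)
conn.autocommit = True  # run each statement in its own transaction
cur = conn.cursor()

# EXPLAIN returns the execution plan the Leader Node builds before distributing work.
cur.execute("EXPLAIN SELECT country, COUNT(*) FROM sales GROUP BY country;")
for (step,) in cur.fetchall():
    print(step)

# A Leader-Node-only function (e.g. CURRENT_SCHEMA) never involves the Compute Nodes.
cur.execute("SELECT CURRENT_SCHEMA();")
print(cur.fetchone()[0])
```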
2: Compute Nodes
Compute Nodes are responsible for the actual execution of queries and store the data. They execute queries and return intermediate results to the Leader Node, which then aggregates the results.
There are two types of Compute Nodes available in the Amazon Redshift Architecture (a brief provisioning sketch follows the list):
- Dense Storage (DS): Dense Storage Nodes allow you to create large Data Warehouses using Hard Disk Drives (HDDs) for a low price point.
- Dense Compute (DC): Dense Compute nodes allow you to create high-performance Data Warehouses using Solid-State Drives (SSDs).
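As a quick, hedged illustration, the sketch below shows how a cluster with a chosen Compute Node type could be provisioned through the boto3 SDK; every identifier and credential is a placeholder, and dc2/ds2 are the classic Dense Compute and Dense Storage node families.

```python
# Hedged sketch: provision a cluster and pick the Compute Node type.
# All identifiers, credentials, and sizes are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="demo-cluster",      # placeholder name
    ClusterType="multi-node",
    NodeType="dc2.large",                  # Dense Compute (SSD); e.g. "ds2.xlarge" for Dense Storage (HDD)
    NumberOfNodes=2,                       # two Compute Nodes, coordinated by a Leader Node
    MasterUsername="awsuser",
    MasterUserPassword="********",
    DBName="dev",
)
```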
A more detailed explanation of how responsibilities are divided between the Leader and Compute Nodes is depicted in the diagram below:
3: Node Slices
A Compute Node consists of Slices. Each Slice is assigned a portion of the Compute Node’s memory and disk, where it performs its query operations. The Leader Node is responsible for assigning query code and data to each Slice for execution. Once assigned a query load, the Slices work in parallel to generate the query results.
Data is distributed among the Slices on the basis of the Distribution Style and Distribution Key of a particular table. An even distribution of data enables Redshift to assign workload evenly to Slices and maximizes the benefit of parallel processing.
The number of Slices per Compute Node is decided on the basis of the type of node. You can find more information on Clusters and Nodes.
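To make this concrete, the short sketch below (reusing the psycopg2 connection from the earlier example) lists the Slices exposed by each Compute Node through the STV_SLICES system table.

```python
# Sketch: how many Slices does each Compute Node expose?
# Reuses the `cur` cursor opened in the earlier connection sketch.
cur.execute("SELECT node, slice FROM stv_slices ORDER BY node, slice;")
for node, slice_id in cur.fetchall():
    print(f"node={node} slice={slice_id}")
```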
4: Massively parallel processing (MPP)
The Amazon Redshift Architecture uses Massively Parallel Processing (MPP) to quickly process even the most complex queries over huge datasets. Multiple Compute Nodes execute the same query code on their own portions of the data to maximize parallel processing.
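As a rough illustration of this parallelism, the per-Slice work done for a query can be inspected through the SVL_QUERY_REPORT system view; the sales table is again illustrative, and the snippet reuses the earlier cursor.

```python
# Sketch: run a query, then look at how its work was split across Slices.
cur.execute("SELECT COUNT(*) FROM sales;")      # 'sales' is an illustrative table
cur.execute("SELECT pg_last_query_id();")
query_id = cur.fetchone()[0]

cur.execute(
    "SELECT slice, step, rows, elapsed_time FROM svl_query_report "
    "WHERE query = %s ORDER BY slice, step;",
    (query_id,),
)
for row in cur.fetchall():
    print(row)  # one line per Slice/step pair that handled part of the query
```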
5: Columnar Data Storage
Data in the Amazon Redshift Data Warehouse is stored in a columnar fashion, which drastically reduces disk I/O. Columnar storage reduces the number of disk I/O requests and minimizes the amount of data loaded into memory to execute a query. The reduction in I/O speeds up query execution, and loading less data means Redshift can perform more in-memory processing.
Redshift uses Sort Keys to sort Columns and filter out chunks of data while executing queries. You can read more about Sort Keys in our post on Choosing the best Sort Keys.
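Below is a hedged sketch of defining a Sort Key on an illustrative table so that range-restricted queries can skip blocks; the table and column names are made up for the example.

```python
# Sketch: declare a Sort Key so queries that filter on event_time scan fewer blocks.
cur.execute("""
    CREATE TABLE IF NOT EXISTS web_events (
        event_time TIMESTAMP,
        user_id    BIGINT,
        page_url   VARCHAR(2048)
    )
    SORTKEY (event_time);
""")

# A range filter on the Sort Key lets Redshift skip blocks whose min/max fall outside the range.
cur.execute("SELECT COUNT(*) FROM web_events WHERE event_time >= '2024-01-01';")
print(cur.fetchone()[0])
```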
6: Data Compression
Data compression is one of the important factors in ensuring query performance. It reduces the storage footprint and enables large amounts of data to be loaded into memory quickly. Owing to Columnar storage, Redshift can use adaptive compression encodings depending on the Column data type. Read more about using compression encodings in Compression Encodings in Redshift.
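As an illustration, the sketch below asks Redshift to recommend encodings for the illustrative table created earlier (assuming it already holds some data) and then declares encodings explicitly on a new table; az64 and lzo are common choices, not a prescription.

```python
# Sketch: let Redshift suggest compression encodings for an existing, populated table.
cur.execute("ANALYZE COMPRESSION web_events;")
for row in cur.fetchall():
    print(row)  # (table, column, suggested encoding, estimated reduction %)

# Encodings can also be declared per column at table creation time.
cur.execute("""
    CREATE TABLE IF NOT EXISTS web_events_v2 (
        event_time TIMESTAMP     ENCODE az64,
        user_id    BIGINT        ENCODE az64,
        page_url   VARCHAR(2048) ENCODE lzo
    );
""")
```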
7: Query Optimizer
Redshift’s Query Optimizer generates query plans that are MPP-aware and take advantage of Columnar Data Storage. The Query Optimizer uses the statistics collected about tables to generate efficient query plans for execution. Read more about Analyze to know how to make the best of the Query Optimizer.
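A minimal sketch of that workflow, again on the illustrative web_events table: refresh statistics with ANALYZE, then inspect the plan the optimizer produces with EXPLAIN.

```python
# Sketch: update table statistics, then check the plan the Query Optimizer builds.
cur.execute("ANALYZE web_events;")
cur.execute("EXPLAIN SELECT user_id, COUNT(*) FROM web_events GROUP BY user_id;")
for (step,) in cur.fetchall():
    print(step)
```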
8: Cluster Internal Network
Amazon Redshift provides private, high-speed network communication between the Leader Node and the Compute Nodes by leveraging high-bandwidth network connections and custom communication protocols. The Compute Nodes run on an isolated network that can never be accessed directly by client applications.
Learn more about AWS Redshift Architecture.
Benefits of Using Amazon Redshift
The Amazon Redshift Architecture is designed to optimize the user experience. Users of Amazon Redshift will experience the following benefits:
1) Strong Data Encryption
All companies and organizations have to follow privacy and security regulations, and encryption is one of the foundational blocks of data protection. Amazon Redshift has robust encryption features. It supports both AWS-managed and customer-managed keys, and it allows the movement of data between encrypted and unencrypted Clusters. Furthermore, Redshift provides single or double encryption options, depending on the situation.
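As a hedged example, the boto3 sketch below requests encryption at rest at cluster creation time with a customer-managed KMS key; the identifiers and key ARN are placeholders, and omitting KmsKeyId falls back to the AWS-managed key.

```python
# Sketch: create an encrypted cluster using a customer-managed KMS key.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="encrypted-demo",    # placeholder name
    ClusterType="single-node",
    NodeType="dc2.large",
    MasterUsername="awsuser",
    MasterUserPassword="********",
    Encrypted=True,                        # encrypt data at rest
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # placeholder ARN
)
```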
2) Concurrency Constraints
The concurrency limit defines the maximum number of Nodes that a user can provision at once. In this sense, concurrency limits constrain the distribution of Nodes and ensure that all users have enough Nodes available to them.
Redshift supports the same kind of concurrency constraints as other Data Warehouses, but with some added flexibility. For example, the number of available Nodes in a Cluster is determined by the type of Cluster. Redshift also sets limits per AWS Region, rather than enforcing a single limit for all users. Furthermore, in certain situations, users may submit a request to increase the limit.
3) Columnar Data Storage
The most common way to organize data is to store it by rows. Row storage is ideal for handling large numbers of small operations quickly. The row storage model is used in Online Transaction Processing (OLTP) and in most Operational Databases.
However, Column-oriented Databases are faster when accessing large amounts of data. For example, in an Online Analytical Processing (OLAP) environment like Redshift, users tend to run fewer queries over much larger datasets. In this case, having a Column-oriented Database allows Redshift to quickly execute Big Data jobs. Thus, Redshift uses a Columnar approach to store data, and this has proven to work in favor of this Data Warehouse.
4) Workload Management
Multiple users are expected to query a Data Warehouse like Amazon Redshift concurrently. Hence, it becomes essential to manage and control queries and workloads effectively. With Workload Management (WLM), it is possible to prioritize workloads and queries in order to keep performance stable.
Amazon Redshift Workload Management (WLM) allows users to have full control over running queries. This way you can flexibly manage priorities within workloads, allowing short, fast-running queries to be executed before the long-running queries.
Amazon Redshift WLM creates query queues based on the Service Classes. Service Classes define the configuration parameters for various kinds of queues. Depending on a user’s user group or the query group label set by the user at runtime, WLM assigns the query to the respective queue.
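To illustrate, the sketch below sets a query group label at runtime so the session's queries are routed to the matching WLM queue, then peeks at queue activity through STV_WLM_QUERY_STATE; the 'dashboards' label and table are illustrative.

```python
# Sketch: route this session's queries to the WLM queue configured for 'dashboards'.
cur.execute("SET query_group TO 'dashboards';")
cur.execute("SELECT COUNT(*) FROM web_events;")  # runs under the 'dashboards' queue
print(cur.fetchone()[0])
cur.execute("RESET query_group;")

# Queries currently queued or executing, per WLM service class.
cur.execute("SELECT query, service_class, state FROM stv_wlm_query_state;")
for row in cur.fetchall():
    print(row)
```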
Conclusion
- The article introduced you to Amazon Redshift and its Architecture.
- It discussed in detail the various components involved in the Redshift Architecture and also listed down the benefits of using this Data Warehouse.
- Now you may want to transfer data from multiple sources to Redshift. This will require you to build a complex ETL process manually.
- Find out how to connect SQL Workbench to Redshift and enhance your database operations. Our guide offers practical tips for a smooth setup.
Sourabh is a seasoned tech entrepreneur with over a decade of experience in scalable real-time analytics. As the Co-Founder and CTO of Hevo Data, he has been instrumental in shaping a leading no-code data pipeline platform used by thousands globally. Previously, he co-founded SpoonJoy, a mass-market cloud kitchen platform acquired by Grofers. His technical acumen spans MySQL, Cassandra, Elastic Search, Redis, Java, and more, driving innovation and excellence in every venture he undertakes.