Amazon Redshift is a fully managed, highly scalable Data Warehouse service on AWS. You can start using Redshift with just a few gigabytes of data and scale it to petabytes or more. Amazon Redshift offers numerous benefits to its users, and its unique architecture plays a huge role in delivering them.

This article will introduce you to Amazon Redshift and will explain in detail the Amazon Redshift Architecture and its various components. Furthermore, the article will discuss the benefits of using Amazon Redshift as your Data Warehouse. Read along to learn more about this Data Warehouse and its Architecture!

What is Amazon Redshift?


Amazon Redshift is a Massively Parallel Processing (MPP) Data Warehouse owned and supported by Amazon Web Services (AWS). It can handle large amounts of data and demanding workloads while maintaining high performance, even on very large datasets. These features, coupled with its price tag, have made it one of the preferred Data Warehouses among modern data processing teams.

Redshift's appeal lies in its extremely fast processing times and high scalability. Whenever you need more storage capacity or speed, simply add more nodes using the AWS Console or the Cluster API, and your requirements will be fulfilled right away. Moreover, Redshift provides SORTKEY and DISTKEY clauses, which require some prior knowledge to use well but greatly reduce the execution time of queries with JOIN and WHERE clauses. You can check the guide about Redshift SORTKEY to know more about it. Since Redshift is SQL-based, it works with existing JDBC/ODBC drivers and easily integrates with most Business Intelligence tools.
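
For a concrete picture, here is a minimal sketch of how those clauses appear in a table definition. The `sales` and `customers` tables and their columns are purely hypothetical:

```sql
-- Hypothetical sales table: DISTKEY co-locates rows that share a
-- customer_id on the same node, and SORTKEY keeps rows ordered by
-- sale_date so date-range filters can skip whole disk blocks.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);

-- A JOIN on the DISTKEY column and a WHERE on the SORTKEY column can
-- then both be answered with far less data movement and disk I/O.
SELECT s.customer_id, SUM(s.amount) AS total_spend
FROM sales s
JOIN customers c ON c.customer_id = s.customer_id
WHERE s.sale_date >= '2023-01-01'
GROUP BY s.customer_id;
```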

Simplify your ETL Processes with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 150+ data sources and loads the data onto the desired Data Warehouse, like Redshift, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.

Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.

Get Started with Hevo for Free

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely easy for new customers to get started with and perform operations on.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Understanding the Amazon Redshift Architecture Components

Figure: AWS Redshift Architecture

Redshift is meant to work in a Cluster formation. A typical Redshift Cluster has two or more Compute Nodes, which are coordinated through a Leader Node. All client applications communicate with the Cluster only through the Leader Node. This Architecture can be broken down into the following components:

Amazon Redshift Architecture Component 1: Leader Node


The Leader Node in an Amazon Redshift Cluster manages all external and internal communication. It is responsible for preparing query execution plans whenever a query is submitted to the Cluster. Once the query execution plan is ready, the Leader Node distributes the query execution code to the Compute Nodes and assigns Slices of data to each Compute Node for computation of results.

The Leader Node distributes the query load to the Compute Nodes only when a query involves accessing data stored on them; otherwise, the query is executed on the Leader Node itself. Several SQL functions in the Redshift Architecture are always executed on the Leader Node.

You can read SQL Functions Supported on the Leader Node for more information.
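
As an illustration, catalog-style functions such as these run entirely on the Leader Node (the table name passed to HAS_TABLE_PRIVILEGE below is hypothetical):

```sql
-- Both of these are documented leader-node-only functions: no work is
-- dispatched to the Compute Nodes when they run.
SELECT CURRENT_SCHEMA();
SELECT HAS_TABLE_PRIVILEGE('sales', 'select');
```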

Learn more about Amazon Redshift.

Amazon Redshift Architecture Component 2: Compute Nodes

Compute Nodes are responsible for the actual execution of queries and store the data. They execute queries and return intermediate results to the Leader Node, which then aggregates them.

There are two types of Compute Nodes available in Amazon Redshift Architecture:

  • Dense Storage (DS): Dense Storage Nodes allow you to create large Data Warehouses using Hard Disk Drives (HDDs) at a low price point.
  • Dense Compute (DC): Dense Compute nodes allow you to create high-performance Data Warehouses using Solid-State Drives (SSDs).

A more detailed view of how responsibilities are divided between the Leader and Compute Nodes is depicted in the diagram below:

Figure: Division of responsibilities between the Leader Node and Compute Nodes

Amazon Redshift Architecture Component 3: Node Slices

A Compute Node consists of Slices. Each Slice is assigned a portion of the Compute Node's memory and disk, where it performs query operations. The Leader Node is responsible for assigning query code and data to the Slices for execution. Once assigned a query workload, Slices work in parallel to generate the query results.

Data is distributed among the Slices on the basis of the Distribution Style and Distribution Key of a particular table. An even distribution of data enables Redshift to assign workload evenly to Slices and maximizes the benefit of parallel processing.
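
To make that concrete, here is a hedged sketch of the three distribution styles Redshift offers; all table and column names below are invented for illustration:

```sql
-- DISTSTYLE EVEN: rows spread round-robin across Slices (a reasonable
-- default for large tables with no dominant join key).
CREATE TABLE events (
    event_id BIGINT,
    payload  VARCHAR(256)
) DISTSTYLE EVEN;

-- DISTSTYLE KEY: rows with the same customer_id land on the same Slice,
-- so joins on customer_id need no data movement between nodes.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT
) DISTSTYLE KEY DISTKEY (customer_id);

-- DISTSTYLE ALL: a full copy of the table on every Compute Node
-- (suited to small, frequently joined dimension tables).
CREATE TABLE countries (
    country_code CHAR(2),
    country_name VARCHAR(64)
) DISTSTYLE ALL;
```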

The number of Slices per Compute Node depends on the node type. You can find more information on Clusters and Nodes.

Amazon Redshift Architecture Component 4: Massively Parallel Processing (MPP)

The Amazon Redshift Architecture uses Massively Parallel Processing (MPP) for fast processing of even the most complex queries over huge datasets. Multiple Compute Nodes execute the same query code on portions of the data to maximize parallel processing.

Amazon Redshift Architecture Component 5: Columnar Data Storage


Data in the Amazon Redshift Data Warehouse is stored in a Columnar fashion, which drastically reduces the number of disk I/O requests and minimizes the amount of data loaded into memory to execute a query. This reduction in I/O speeds up query execution, and loading less data means Redshift can perform more in-memory processing.
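
For instance, a query that touches only two columns of a wide table reads only those columns' blocks from disk (reusing the hypothetical `sales` table sketched earlier):

```sql
-- Only the sale_date and amount column blocks are scanned; every other
-- column of the table is never read from disk.
SELECT sale_date, SUM(amount) AS daily_total
FROM sales
GROUP BY sale_date;
```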

Redshift uses Sort Keys to keep data sorted on disk and to filter out chunks of data while executing queries. You can read more about Sort Keys in our post on choosing the best Sort Keys.

Amazon Redshift Architecture Component 6: Data Compression

Data compression is one of the important factors in ensuring query performance. It reduces the storage footprint and enables large amounts of data to be loaded into memory quickly. Owing to Columnar storage, Redshift can apply adaptive compression encodings depending on each Column's data type. Read more about using compression encodings in Compression Encodings in Redshift.
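
As a rough sketch (the table, columns, and encoding choices below are illustrative, not recommendations), encodings can be declared per column, or Redshift can be asked to recommend them:

```sql
-- Declare a compression encoding explicitly for each column.
CREATE TABLE page_views (
    view_id   BIGINT       ENCODE az64,
    url       VARCHAR(512) ENCODE lzo,
    viewed_at TIMESTAMP    ENCODE az64
);

-- Ask Redshift to sample existing rows and suggest encodings per column.
ANALYZE COMPRESSION page_views;
```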

Amazon Redshift Architecture Component 7: Query Optimizer

Redshift’s Query Optimizer generates query plans that are MPP-aware and take advantage of Columnar Data Storage. The Query Optimizer uses statistics gathered about tables to generate efficient query plans for execution. Read more about ANALYZE to learn how to make the best of the Query Optimizer.
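
A minimal illustration, assuming the hypothetical `sales` table from earlier: refresh the table's statistics with ANALYZE, then use EXPLAIN to inspect the plan the optimizer produces:

```sql
-- Update the statistics the Query Optimizer relies on.
ANALYZE sales;

-- Show the MPP-aware execution plan without running the query.
EXPLAIN
SELECT sale_date, SUM(amount)
FROM sales
GROUP BY sale_date;
```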

Amazon Redshift Architecture Component 8: Cluster Internal Network

Amazon Redshift provides private, high-speed network communication between the Leader Node and Compute Nodes by leveraging high-bandwidth connections and custom communication protocols. The Compute Nodes run on an isolated network that can never be accessed directly by client applications.

Learn more about AWS Redshift Architecture.

Benefits of Using Amazon Redshift

The Amazon Redshift Architecture is designed to optimize the user experience. Users of Amazon Redshift will enjoy the following benefits:

1) Strong Data Encryption


All companies and organizations have to follow privacy and security regulations, and encryption is one of the foundational blocks of data protection. Amazon Redshift has robust encryption features: it has separate use cases for AWS-managed and client-managed keys, and it allows the movement of data between encrypted and unencrypted Clusters. Furthermore, Redshift provides single or double encryption options, depending on the situation.

2) Concurrency Constraints


The concurrency limit defines the maximum number of Nodes that a user can accumulate at once. In this sense, concurrency constrains the distribution of Nodes and ensures that all users have enough Nodes available to them.

Redshift supports the same concurrency constraints as other Data Warehouses, but with some added flexibility. For example, the number of available Nodes in a Cluster is determined by the type of Cluster. Redshift also sets limits based on region, rather than enforcing a single limit for all users. Furthermore, in certain situations, users may submit a request to increase the limit.

3) Columnar Data Storage


The most common way of organizing data is to store it by rows. Row storage is ideal for handling large numbers of small operations quickly, which is why it is used in Online Transaction Processing (OLTP) systems and most Operational Databases.

However, Column-oriented Databases are faster when accessing large amounts of data. In an Online Analytical Processing (OLAP) environment like Redshift, users tend to run fewer queries over much larger datasets, and a Column-oriented layout allows Redshift to execute such Big Data jobs quickly. Thus, Redshift uses a Columnar approach to store data, and this has proven to work in its favor.

4) Workload Management

Multiple users are expected to query a Data Warehouse like Amazon Redshift concurrently. Hence, it becomes essential to manage and control queries and workloads effectively. With Workload Management (WLM), it is possible to prioritize workloads and queries in order to keep the system stable.

Amazon Redshift Workload Management (WLM) allows users to have full control over running queries. This way, you can flexibly manage priorities within workloads, allowing short, fast-running queries to be executed ahead of long-running ones.

Amazon Redshift WLM creates query queues based on Service Classes, which define the configuration parameters for the various kinds of queues. Depending on a user's user group, or the query group label set by the user at runtime, WLM assigns each query to the appropriate queue.
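
For example, a session can route its queries to a specific queue at runtime. The query group name below is hypothetical and would need to match a label configured in your WLM setup:

```sql
-- Route subsequent statements to the queue configured for this
-- query group label, then return to the default queue.
SET query_group TO 'short_queries';
SELECT COUNT(*) FROM sales WHERE sale_date = '2023-01-01';
RESET query_group;
```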

Conclusion

This article introduced you to Amazon Redshift and its Architecture. It discussed in detail the various components involved in the Redshift Architecture and also listed the benefits of using this Data Warehouse. You may now want to transfer data from multiple sources to Redshift, which would otherwise require you to build a complex ETL process manually.

Visit our Website to Explore Hevo

Hevo Data can simplify your task by eliminating the need to write any code. It will automate the process of data transfer from 150+ data sources to Redshift and provide you with a hassle-free experience. Hevo provides granular logs that allow you to monitor the health and flow of your data. This allows you to scale up your data infrastructure on demand and start moving data from all the applications important for your business.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of this blog in the comments section below!

Sourabh
Founder and CTO, Hevo Data

Sourabh has more than a decade of experience building scalable real-time analytics and has worked for companies like Flipkart, tBits Global, and Unbxd. He is experienced in technologies like MySQL, Hibernate, Spring, CXF, PHP, ExtJS, and Shell.
