Amazon Redshift AQUA Simplified 101

on Amazon Redshift, AWS, Cloud Computing, Data Aggregation, Data Analytics, Data Driven, data management, Data Management Tools, Data Warehouse, SQL, Tutorials • September 27th, 2021 • Write for Hevo

Amazon Redshift is a Cloud-based Data Warehouse that allows you to analyse petabytes of data across multiple data sources and data lakes in a quick, easy, and cost-effective manner. Even with all that power, it’s likely that you’ll run into issues with query speed or workload scalability. This is where Amazon Redshift AQUA (Advanced Query Accelerator) comes into play.

Amazon Redshift AQUA is a cost-effective add-on to Amazon Redshift-managed storage that is designed for safe, transactional, multitenant access, and high-throughput analytic queries. It provides authentication, encryption, isolation, and compliance to keep your data safe at rest as well as in transit.

This article will give you a comprehensive guide to Amazon Redshift AQUA. You will get to know about Amazon Redshift and its key features. You will also explore the key concepts and performance tuning techniques associated with Amazon Redshift AQUA in the further sections. Let’s get started.

Table of Contents

Introduction to Amazon Redshift

Amazon Redshift Logo
Image Source

Amazon Redshift is a Cloud-based Data Warehouse service created by Amazon to manage massive datasets and make it easy to get insights from them. It allows you to query and integrate petabytes of structured and semi-structured data from a variety of Data Warehouses, Operational Databases, and Data Lakes.

Amazon Redshift is a database management system that uses industry-standard SQL to handle large datasets, execute high-performance analysis, generate reports, and undertake large-scale database migrations. A leader node and clusters of compute nodes make up Amazon Redshift, which is used to do data analytics. The Amazon Redshift architecture is depicted in the diagram below.

Amazon Redshift Architecture
Image Source

Key Features of Amazon Redshift

Amazon Redshift has gained significant popularity in the market. The following are some of Amazon Redshift’s key features:

  • Massively Parallel Processing: MPP (Massively Parallel Processing) is a distributed design solution to handle massive data workloads. A large processing job is divided into smaller jobs and distributed among a cluster of compute nodes. These nodes work in parallel rather than sequentially. As a result, Amazon Redshift’s time to finish a single, large operation is significantly reduced.
  • Materialized Views: Amazon Redshift enables you to query datasets ranging in size from gigabytes to petabytes. Data compression, columnar storage, and zone map all help to cut down on the amount of I/O required to run the queries.
  • Limitless Concurrency: Whether users query data directly from your Amazon S3 Data Lake or from your Amazon Redshift Data Warehouse, Amazon Redshift consistently offers fast performance, even when hundreds of queries are running at the same time.
  • Amazon Redshift ML (Machine Learning): Amazon Redshift ML is a feature of Amazon Redshift that allows Data Analysts and Database engineers to quickly design, train, and deploy Amazon SageMaker models using SQL.

To know more about Amazon Redshift, visit this link.

Introduction to Amazon Redshift AQUA 

Amazon Redshift AQUA
Image Source

AQUA (Advanced Query Accelerator) is a distributed and hardware-accelerated cache that enables Amazon Redshift to run up to 10x faster than other enterprise cloud Data Warehouses by automatically boosting certain types of queries. 

For queries that scan large datasets, Amazon Redshift AQUA acts as an analytics query accelerator. It uses specialized hardware and speeds up the queries while maintaining security, multi-tenant access, and high throughput.  

To know more about Amazon Redshift AQUA, visit this link.

Introduction to Amazon Redshift Cluster

Amazon Redshift Cluster
Image Source

Amazon Redshift Cluster consists of a set of nodes. In each cluster, there is a leader node and one or more compute nodes. Client applications send queries to the leader node, which parses them and creates query execution plans. The leader node then coordinates the concurrent execution of these plans with the compute nodes and collects the nodes’ intermediate results. Finally, the results are returned to the client applications.

Moreover, to serve these requests, compute nodes run query execution plans and communicate data among themselves. The intermediate results are transmitted to the leader node for aggregation before being sent to the client apps.

To know more about Amazon Redshift Cluster, visit this link.

Simplify Data Analysis Using Hevo’s No-code Data Pipeline

Hevo Data helps you directly transfer data from 100+ data sources (including 30+ free sources) to Business Intelligence tools, Data Warehouses, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Hevo takes care of all your data preprocessing needs required to set up the integration and lets you focus on key business activities and draw a much powerful insight on how to generate more Leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.

Get Started with Hevo for Free

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Key Concepts Involved in Amazon Redshift AQUA

Before diving deep into Amazon Redshift AQUA, let’s discuss a few key concepts that are deeply ingrained in Amazon Redshift AQUA.

Traditionally, there are 2 major bottlenecks to Data Warehouse performance:

  • Networking Bottleneck
  • CPU Bottleneck

1) Networking Bottleneck

Traditionally, warehousing structures hold data in central repositories, which have to be moved to compute nodes/clusters for processing and querying. Hence, the volume of network bandwidth needed to move all this data becomes a bottleneck on query performance.

Networking Bottleneck in Amazon Redshift AQUA
Image Source

2) CPU Bottleneck

The ability of CPUs to process in-memory data has only increased 2X in the last decade. Hence, traditional CPUs lag behind storage devices when it comes to speedy performance.

CPU Bottleneck in Amazon Redshift AQUA
Image Source

Today, Amazon Redshift AQUA uses 2 innovative and modern hardware structures to solve the above-mentioned problems:

  • Non-Volatile Memory Express (NVMe)
  • Field-Programmable Gate Arrays (FPGAs)

1) Non-Volatile Memory Express (NVMe)

NVMe is a new improved storage protocol that connects a host to a memory subsystem. It capitalizes on parallel, low latency data paths to the underlying media, resulting in low latency and very high performance. 

While traditional protocols require a large number of CPU cycles to make data available to applications, NVMe requires just a small number of CPU cycles and also has a reduced infrastructure footprint. It leverages SIMD (Single Instruction, Multiple Data) instructions for parallel processing and hardware acceleration in CPU chips. 

2) Field-Programmable Gate Arrays (FPGAs)

Another hardware construct that Amazon Redshift AQUA uses is a Contemporary Field-Programmable Gate Arrays (FPGAs), which is an integrated circuit that can be configured/programmed to suit its application, after being manufactured. It consists of an array of programmable logic blocks and memory elements that can be configured to perform complex combinational/logical functions, thus, allowing for faster data retrieval.

Hardware Components in Amazon Redshift AQUA
Image Source

The above hardware components are connected together in a unique way, allowing faster data scanning and aggregation of intermediate results in high-speed memory. 

Amazon Redshift AQUA is a fast data cache that maintains high-speed connections to Amazon Redshift managed storage. Another technique that Amazon Redshift AQUA uses is to maintain materialized views which are computed once and then queries many times. This Speed-up queries by orders of magnitude by caching joins, filters, aggregations, and projections. More specifically, materialized view converts join/aggregate into tables as shown in the below diagram.

Materialized View of Amazon Redshift AQUA
Image Source

To keep these materialized views up-to-date, Amazon Redshift uses incremental refresh and user-triggered maintenance. 

Architecture of Amazon Redshift AQUA Node

Amazon Redshift AQUA Node Architecture
Image Source

As shown above, an AQUA node uses Nitro-Accelerated compression & encryption. It houses a Custom processor for common analytics operations like scan (filter), aggregation and hashing, etc. Since this processor is near the data storage location, it also rules out the network latency factor.  The in-node compiler coordinates operations between HW accelerators and the CPU.

AZ64 is an Amazon-developed compression encoding technique that helps you to achieve a high compression ratio and faster query processing. Using these and some more intelligent innovations like having a custom “Enhanced Planner” for the above hardware, Amazon Redshift AQUA gives its users the ability to quickly analyze petabytes of data and deliver timely insights to multiple teams and multiple users, all at the same time. 

Amazon Redshift detects query sections that can be accelerated and send them to Amazon Redshift AQUA for processing. Next, this subset of the data that require extensive scans, filters, and aggregation, is processed quickly by Amazon Redshift AQUA, and results are sent back to coalesce with other wrappers/headings.

e.g. if a query contains LIKE or SIMILAR TO expressions, Amazon Redshift sends it to AQUA and then the entire operation runs on AQUA. 

SELECT  c.customer_name, p.product_name, sum(total) from orders o
  join customers c ON c.id = o.customer_id 
  join products p ON p.id = o.product_id
where c.customer_name LIKE ‘%John%Smith%’ 
order by sum(total);

Here, as the query has LIKE predicate, the scanning of the customer table based on customer_name, is sent to AQUA. After Amazon Redshift receives the results of this scan from AQUA, it will perform the JOINS locally and finish the query. 

Other queries, like INSERT, UPDATE, or SELECT queries without a predicate, are performed by the Amazon Redshift cluster itself, thereby letting AQUA process time-consuming queries only, enabling smart throughput in cases where the volume of the dataset and queries is high.

Performance Tuning Techniques for Amazon Redshift AQUA

There are certain performance tuning techniques that come in handy while using Amazon Redshift AQUA, these are:

1) Elastic Resize and Concurrency Scaling

If your application experiences bursts in workloads/queries,  you can use Elastic Resize and Concurrency Scaling features. Here, Elastic Resize allows you to quickly increase or decrease the number of compute nodes, as the need arises. Also, expand (scale) the cluster is used to provide additional processing power for periods when larger workloads hit your application. 

Not only does Concurrency scaling allow you to add capacity dynamically in response to the workload arriving at the cluster, but it also allows you to specify the maximum limit of scaling,  by setting the max_concurrency_scaling_clusters parameter, to keep your billing under control. 

2) Amazon Redshift Advisor

The Amazon Redshift Advisor makes recommendations tailored to your Amazon Redshift cluster to help you cut operating costs and boost performance.

Amazon Redshift maintains an event log and usage statistics, to fine-tune its recommendations for you. It also runs tests against your cluster, and if certain parameters go out of range, it alerts you and provides recommendations for rectification. Once you act on a recommendation, it gets removed from the list, thereby reducing the cluster as much as possible. 

Coupled with QMR(Query Monitoring Rules) and Amazon CloudWatch metrics, Amazon Redshift Advisor becomes a potent tool to effectively manage your clusters.  

3) Auto Workload Management

Auto Workload Management, or WLM is a smart solution that utilizes ML (Machine Learning) to dynamically manage memory and concurrency in your cluster. Also, Amazon Redshift Advisor evaluates your cluster’s current WLM consumption and provides recommendations to improve the throughput.

4) Short Query Acceleration

Short Query Acceleration or SQA is another feature of Amazon Redshift that allows running short-running jobs in their own queue. This way short-running jobs do not have to wait behind long-running jobs to get executed, thereby increasing your cluster’s responsiveness and efficiency. 

5) Federated Query

A Federated Query lets you run your analytics directly on your data residing on your OLTP (Online Transaction Processing) source system databases and Amazon S3 data lake, without even ingesting it in Amazon Redshift tables. This way, without incurring the overhead of performing ETL (Extract, Transform, and Load) into Amazon Redshift, you can coalesce your outside data with Amazon Redshift AQUA analytics, to provide a holistic view of your stakeholders. 

Conclusion

This article gave you a comprehensive guide to Amazon Redshift AQUA. You got to know about Amazon Redshift and its key features. You also explored the key concepts, architecture, challenges and performance tuning techniques associated with Amazon Redshift AQUA. You may now take advantage of Amazon Redshift AQUA to boost the query performance and challenges in scaling with ease.

Visit our Website to Explore Hevo

Businesses can use automated platforms like Hevo Data to set the integration and handle the ETL process. It helps you directly transfer data from a source of your choice to a Data Warehouse, Business Intelligence tools, or any other desired destination in a fully automated and secure manner without having to write any code and will provide you a hassle-free experience.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of learning about Amazon Redshift AQUA in the comments section below!

No-code Data Pipeline for your Data Warehouse