Amazon Redshift is a data warehousing system that enables rapid, simple, and cost-effective analysis of large datasets from data warehouses and data lakes. It can deliver performance up to ten times faster than other data warehousing systems by combining machine learning, massively parallel processing (MPP), and columnar storage on high-performance SSDs.
Despite its great capabilities, customers may experience swings in query performance or scaling issues. Optimizing Amazon Redshift performance requires thorough planning and regular monitoring as your data volume, user population, and cluster size grow.
In this blog, you will learn about 4 techniques to help you optimize your Amazon Redshift Performance effectively.
Introduction to Amazon Redshift
Amazon Redshift is a SaaS (Software as a Service) offering from AWS: a fully managed, enterprise-level Data Warehouse service coupled with high-performance analytics.
Amazon Redshift lets you federate data from various structured and semi-structured sources, scales easily to handle petabytes of storage, and integrates well with other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker.
Let’s see how Amazon Redshift is different from traditional SQL databases.
- Multiple Nodes: Amazon Redshift organizes data across multiple nodes; the first node to be commissioned acts as the leader node.
- Massively Parallel Processing: Instead of processing sequentially, Amazon Redshift uses multiple compute nodes; each compute node is assigned a chunk of the data, and all nodes perform the same operation(s) on their slice of the data.
The leader node then aggregates the partial results from all compute nodes to produce the final result computed over the whole dataset.
- Columnar Data Storage: Unlike traditional DBMSs, Amazon Redshift stores data in a columnar format, where data is organized, aggregated, and laid out on a column-by-column basis. As a crude analogy, visualize a product table: all product_IDs are stored together as one column, and all product_prices are stored together as another. The advantage is that Amazon Redshift needs to fetch only the columns a query is interested in.
So, if you want to find the average price, only the price column is read and processed; the rest of the columns are never accessed. In a row-based database, you would have to access each row, fetch the price entry from it, and then aggregate the values (see the short sketch after this list).
- Targeted Data Compression Encoding Schemes: Amazon Redshift compresses data to save storage space; during query execution this compressed data is read from storage into memory, decompressed there, and then processed by the participating nodes. Data is divided based on a distribution key (a column or attribute of your data, such as customer_ID) and distributed amongst the compute nodes that work on it.
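To illustrate the columnar storage point above, here is a minimal sketch using a hypothetical products table (the table and column names are illustrative only):

```sql
-- Hypothetical table used only to illustrate columnar storage.
CREATE TABLE products (
    product_id    BIGINT,
    product_name  VARCHAR(100),
    product_price DECIMAL(10,2)
);

-- Only the blocks of the product_price column are scanned for this
-- aggregation; product_id and product_name are never read from disk.
SELECT AVG(product_price) AS avg_price
FROM products;
```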
Though it is loosely based on PostgreSQL, it does not support tablespaces, table partitioning, or inheritance. Overall, Amazon Redshift is specifically designed for Online Analytical Processing (OLAP) and Business Intelligence (BI) applications, which run complex queries against large datasets.
Amazon Redshift Performance Tuning Techniques
Now that you have understood how massive data volumes and scalability affect Amazon Redshift performance, this section walks you through various Amazon Redshift performance tuning techniques that will speed up query processing. A few tuning techniques are listed below:
1) Choosing the Best Sort Key
If you query recent data frequently, choose the timestamp column as your sort key; this enables Amazon Redshift to skip entire blocks of data that fall outside your time range. Similarly, if you frequently filter or perform range scans on a certain column, specify that column as the sort key.
When Amazon Redshift organizes your data in blocks, it stores the minimum and maximum sort key values for each block, which allows it to skip entire blocks of data that cannot satisfy your filter or predicate range.
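A minimal DDL sketch of this idea, assuming a hypothetical events table that is usually filtered by a recent time range:

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type VARCHAR(50),
    event_time TIMESTAMP
)
SORTKEY (event_time);

-- With event_time as the sort key, Redshift can use each block's
-- min/max sort key values to skip blocks entirely outside this range.
SELECT COUNT(*)
FROM events
WHERE event_time >= '2024-01-01' AND event_time < '2024-02-01';
```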
2) Choosing Optimal Distribution Style
Before executing a query, the optimizer redistributes rows to the compute nodes as needed to perform joins, aggregations, and other processing. A few techniques for choosing an optimal distribution style are listed below:
- To minimize the impact of this redistribution, designate the primary key of the dimension table (and the corresponding foreign key of the fact table) as your DISTKEY.
- Instead of the total size of a table participating in the query, keep an eye on the size of the selected dataset after filtering, as only the rows that take part in the join will be redistributed.
- If your query’s resulting data lands on only a few slices, the workload will be skewed and hence slow. To avoid this, try to choose a high-cardinality column from the filtered result set. A high-cardinality column is one whose values are unique or rarely repeated, such as a primary or unique key. A minimal DDL sketch of a co-located join follows this list.
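Here is a sketch of a star-schema join distributed on the join key; the table and column names are hypothetical:

```sql
-- Dimension table distributed on its primary key.
CREATE TABLE dim_customer (
    customer_id   BIGINT,
    customer_name VARCHAR(100)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- Fact table distributed on the matching foreign key, so joining rows
-- land on the same slice and little or no redistribution is needed.
CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_amount DECIMAL(12,2),
    sale_date   DATE
)
DISTSTYLE KEY
DISTKEY (customer_id);

SELECT c.customer_name, SUM(f.sale_amount) AS total_spent
FROM fact_sales f
JOIN dim_customer c ON c.customer_id = f.customer_id
GROUP BY c.customer_name;
```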
3) Defining Foreign and Primary Key Constraints
Define primary key and foreign key constraints between your tables if your application logic calls for them. Although Amazon Redshift does not enforce unique, primary-key, or foreign-key constraints, defining these constraints helps the query optimizer generate more efficient query plans.
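As a sketch, informational constraints could be declared on the hypothetical tables from the previous example like this:

```sql
-- Informational only: Redshift does not enforce these constraints,
-- but the optimizer uses them when planning joins and aggregations.
-- Your load process must actually keep the data consistent.
ALTER TABLE dim_customer
    ADD PRIMARY KEY (customer_id);

ALTER TABLE fact_sales
    ADD FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id);
```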
4) Managing Automatic Compression Encoding
Though you can choose your own compression encodings, do so only if you are an experienced user and understand the repercussions. In most cases, the default for all tables, ENCODE AUTO, is the best choice.
For example, if the sort key columns are compressed much more highly than the other columns in a query, range-restricted scans will perform badly. Hence, ENCODE AUTO may choose a less efficient compression encoding to keep the sort key columns balanced with the other columns.
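Because ENCODE AUTO is the default when no encodings are specified, a sketch is simply a table created without explicit ENCODE clauses (hypothetical names again):

```sql
-- No ENCODE clauses: the table defaults to ENCODE AUTO, so Redshift
-- picks and adjusts per-column compression encodings over time,
-- keeping the sort key columns balanced with the other columns.
CREATE TABLE page_views (
    view_id   BIGINT,
    url       VARCHAR(256),
    viewed_at TIMESTAMP
)
SORTKEY (viewed_at);
```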
Amazon Redshift Performance Tuning Tips
In this section, you will get to know about some common tips that you can use for tuning Amazon Redshift Performance. A few tips are listed below:
- Amazon Redshift keeps your columnar data in compressed form, but while processing complex queries it creates temporary tables that store intermediate results in uncompressed form. It is good practice to use the smallest possible column size for your CHAR and VARCHAR columns, to avoid unnecessary waste of memory and temporary disk space, which could lead to slower queries.
- Don’t use CHAR or VARCHAR for dates and timestamps. Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR/VARCHAR, so it is good practice to define your time-based columns as DATE/TIMESTAMP types only. An added advantage is that queries on DATE/TIMESTAMP columns can use faster built-in comparisons and aggregations (see the sketch after this list).
- For datasets under 1 TB, DC2 node types are recommended because they deliver the best Amazon Redshift performance at the lowest price. If you expect your data to grow, RA3 nodes are a good option, as they let you scale compute and storage independently for better price-performance.
- To monitor the physical aspects of the cluster, such as CPU utilization, latency, and throughput, use Amazon CloudWatch metrics. Amazon Redshift also provides query and load performance data to help you monitor database activity in your cluster.
- You can increase query performance by adding nodes to your cluster because Amazon Redshift executes queries in parallel across all the nodes.
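A small sketch of the column-sizing and data-type tips above, with hypothetical names:

```sql
-- Right-sized columns: CHAR/VARCHAR lengths only as large as the data
-- needs, and native DATE/TIMESTAMP types instead of character strings.
CREATE TABLE orders (
    order_id     BIGINT,
    country_code CHAR(2),        -- not CHAR(50)
    order_status VARCHAR(20),    -- not VARCHAR(65535)
    ordered_at   TIMESTAMP,      -- not VARCHAR(30)
    ship_date    DATE            -- not CHAR(10)
);

-- Native types allow fast built-in date comparisons and aggregations.
SELECT DATE_TRUNC('month', ordered_at) AS order_month, COUNT(*)
FROM orders
GROUP BY 1
ORDER BY 1;
```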
Amazon Redshift Performance Tuning Based on Data
In this section, you will learn about some optimizations that are dictated by the peculiarities of your application and data. A few points for application-specific Amazon Redshift performance tuning are listed below:
For Long Tables of ~ 5-10 Billion Rows or More
Very long tables present a few maintenance challenges, such as expensive sorts and time-consuming deletes. Deletes are particularly expensive because every delete must be followed by the steps below (sketched as commands after the list):
- Reclaiming the Space on the table, which can be an expensive operation somewhat akin to defragmentation on hard disk drives.
- Sorting the Table, where sorting such a huge volume of data is resource-intensive and sometimes impractical.
- Updating Statistics, so that the query planner stays accurate for future queries and reports.
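In Redshift these maintenance steps map onto the VACUUM and ANALYZE commands; a minimal sketch on a hypothetical table:

```sql
-- On tables with billions of rows, each of these can run for hours.
VACUUM DELETE ONLY big_events;  -- reclaim space left by deleted rows
VACUUM SORT ONLY big_events;    -- re-sort the table on its sort key
ANALYZE big_events;             -- refresh planner statistics
```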
If your primary need is to get sorted results from the table and to work mostly with newer/current records, i.e. sorting and deleting will happen frequently on this long table, the suggestions below could be helpful.
Partition your very long table into smaller partitions, i.e. smaller tables with slightly different names but the same schema as the parent table, each holding a chunk of the very long data. Choose the most relevant partition key and split the rows into these multiple tables based on it.
Next, create a view that uses UNION (typically UNION ALL, to avoid a deduplication pass) to provide a holistic and consistent view of the whole data. This way, your SELECTs might be a bit slower, but the partitioning is transparent to your application. It also gives you the added flexibility of adding only sorted data to each partition, in case you need sorted results every time or most of the time.
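A minimal sketch of this pattern, assuming a hypothetical very long sales table split by year:

```sql
-- Hypothetical yearly partitions sharing the same schema.
CREATE TABLE sales_2023 (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_amount DECIMAL(12,2),
    sale_date   DATE
)
SORTKEY (sale_date);

CREATE TABLE sales_2024 (LIKE sales_2023);

-- The view unions the partitions so the application still sees one table.
-- UNION ALL avoids the deduplication pass a plain UNION would perform.
CREATE VIEW sales AS
    SELECT * FROM sales_2023
    UNION ALL
    SELECT * FROM sales_2024;

-- Ageing out a year of data is now a cheap DROP TABLE plus a view
-- redefinition, instead of a huge DELETE followed by VACUUM and ANALYZE.
```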
For Storing Legacy Data
Sometimes deleting data is not an option for regulatory or operational reasons. For example, financial transactions, healthcare records, and security audit trails mandate keeping large volumes of legacy data. Sooner or later, holding this data in Amazon Redshift becomes too expensive.
Keeping this in mind, AWS offers Amazon Redshift Spectrum as an alternative solution.
Amazon Redshift Spectrum allows you to keep your legacy or infrequently used data as-is in your S3 data lake and query it whenever needed via Amazon Redshift. So, while your frequently accessed or latest data resides in Amazon Redshift, your “not so current” data stays in Amazon S3 and is queried through Redshift Spectrum.
Amazon Redshift serves your perennial queries quickly, while sporadic queries that need to access your “colder” data can be served via Redshift Spectrum. Essentially, this technique decouples storage from compute and keeps your costs low without compromising Amazon Redshift performance.
This technique is useful only if you have significant legacy or infrequently accessed data alongside frequently accessed current datasets. Amazon Redshift takes advantage of optimizations such as data block temperature, data block age, and workload patterns to optimize performance and manage data placement automatically across different tiers of storage. Amazon Redshift also automatically scales storage to Amazon S3 without any manual action.
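A sketch of the Spectrum pattern, assuming archived data already sits in S3 and an external schema backed by the AWS Glue Data Catalog; the IAM role, S3 bucket, database, and table names below are placeholders:

```sql
-- External schema pointing at a data catalog; replace the role ARN
-- and database name with your own.
CREATE EXTERNAL SCHEMA legacy
FROM DATA CATALOG
DATABASE 'legacy_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over archived Parquet files in S3; nothing is loaded
-- into the Redshift cluster itself.
CREATE EXTERNAL TABLE legacy.sales_archive (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_amount DECIMAL(12,2),
    sale_date   DATE
)
STORED AS PARQUET
LOCATION 's3://my-archive-bucket/sales_archive/';

-- Sporadic queries read the colder S3 data on demand.
SELECT sale_date, SUM(sale_amount) AS daily_total
FROM legacy.sales_archive
GROUP BY sale_date;
```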
Need for Amazon Redshift Performance Tuning
In this section, you will read why Amazon Redshift performance tuning is required even though Redshift can deliver up to 10x faster query processing using Machine Learning. A few points are listed below:
- It lets you harness the full power of the AWS platform in general and Amazon Redshift in particular.
- It keeps your costs under control and avoids unnecessary spending.
- On average, your data volume may double every year, and managing Amazon Redshift performance becomes essential as the data volume grows.
- As more and more complex queries and analytics hit your data, and as data volume grows exponentially, proactive planning and thoughtful utilization of resources become paramount.
- Better infrastructure choices and planned execution will save programming hours and time spent fine-tuning your analytics.
- As you learn how to best use the platform, you also gain meaningful insights on what data should be collected and how it should be fed to AWS for better results.
Conclusion
In this article, you learnt different Amazon Redshift performance tuning techniques and strategies that can be used to increase the performance of Amazon Redshift when handling massive data volumes and query processing. You also read a few more tips on Amazon Redshift performance tuning based on the characteristics of your data. Amazon Redshift performance matters when it comes to scaling the Data Warehouse.
Amazon Redshift stores important data from multiple data sources. It is a tedious task to manually load and transform data from data sources to Amazon Redshift. Instead of manually writing scripts for every source to perform the ETL (Extract Transform Load) process, one can automate the whole process.
Share your experience of learning in the comments section below!
Arsalan is a research analyst at Hevo and a data science enthusiast with over two years of experience in the field. He completed his B.Tech in Computer Science with a specialization in Artificial Intelligence and finds joy in sharing the knowledge he has acquired with data practitioners. His interest in data analysis and architecture has driven him to write nearly a hundred articles on various topics related to the data industry.