Amazon Redshift is a petabyte-scale data warehouse, managing such mammoth disk space is no easy job. Unlike traditional databases which have limited disk space and performs housekeeping activity without user intervention, Redshift leaves it up to the user to perform its housekeeping activity so as not to hamper its performance. Ensuring the real-time availability of data should be one of the first things that you should work on to get the most out of your Redshift Data Warehouse. If you are dealing with a huge amount of data, then it will be an absolute necessity to guarantee accurate, consistent and latest data in the warehouse.

For the scope of this article, we will talk about Amazon Redshift Vacuum and Analyze and how they can help optimize Redshift Performance by improving Redshift space utilization.

Introduction to Amazon Redshift

Redshift Logo.
Image Source:https://www.cloud-rush.com/amazon-redshift-aws-analytics-tool/

Amazon Redshift is a fully-managed petabyte-scale cloud-based data warehouse, designed to store large scale data sets and perform insightful analysis on them in real-time.

It is highly column-oriented & designed to connect with SQL-based clients and business intelligence tools, making data available to users in real-time. Supporting PostgreSQL 8, Redshift delivers exceptional performance and efficient querying. Each Amazon Redshift data warehouse contains a collection of computing resources (nodes) organized in a cluster, each having an engine of its own and a database to it.

Amazon Redshift is known for providing consistently fast performance, even in the face of thousands of concurrent queries. Amazon Redshift Concurrency Scaling supports nearly unlimited concurrent queries and users. By leveraging Redshift’s managed storage, capacity is added to support workloads of up to 8 PB of compressed data.

For further information check out Amazon Redshift.

Understanding the Need for Amazon Redshift Vacuum and Analyze

Redshift does not automatically reclaim and reuse space that is freed when you delete rows and update rows. Moreover, when data is inserted into database Redshift does not sort it on the go. Redshift reclaims deleted space and sorts the new data when VACUUM query is issued.

Also to help plan the query execution strategy, Redshift Vacuum and Analyze uses stats from the tables involved in the query like the size of the table, distribution style of data in the table, sort keys of the table etc. These stats information needs to be kept updated for better performance of queries on Redshift, this is where Analyze command plays its role. In this document below we will discuss when and how to use Amazon Redshift Vacuum and Analyze.

Simplify your Data Analysis with Hevo’s No-code Data Pipeline

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.

Start for free now!

Get Started with Hevo for Free

Using Amazon Redshift Vaccum for High-Performance

Using Amazon Redshift Vaccum to Sort Data.
Image Source: https://docs.aws.amazon.com/redshift/latest/dg/vacuum-managing-volume-of-unmerged-rows.html

When data is inserted into Redshift, it is not sorted and is written on an unsorted block. With unsorted data on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. When you run a DELETE query, redshift soft deletes the data. Similar is the case when you are performing UPDATE, Redshift performs a DELETE followed by an INSERT in the background. When vacuum command is issued it physically deletes the data which was soft-deleted and sorts the data again.

Among other things, you might want to focus on Amazon Redshift Sort Keys and Amazon Redshift Distribution keys to optimize the query performance on Redshift.

Understanding When to Use Amazon Redshift Vaccum

Sorting and Merging rows with Amazon Redshift Vacuum.
Image Source: https://docs.aws.amazon.com/redshift/latest/dg/vacuum-managing-volume-of-unmerged-rows.html

Although VACUUM improves query performance it comes at a cost of time and hits performance during its execution. It is recommended to perform Redshift Vacuum and Analyze depending on the amount of space that needs to be reclaimed and also upon unsorted data. To figure out which tables require vacuuming we can run the following query.

SELECT "schema" + '.' + "table" FROM svv_table_info where unsorted > 10

The query above will return all the tables which have unsorted data of above 10%.

Figuring out tables which have soft deleted rows is not straightforward, as Redshift Vacuum and Analyze does not provide this information directly. What it provides is the number of total rows in a table including ones that are marked for deletion(tbl_rows column in the svv_table_info table). To get an actual number of rows (excluding ones which are marked for deletion) you will simply have to run a count query on the table and figure out the number of rows which have been marked for deletion. For example, let’s consider a hypothetical table ‘users’ in ‘public’ schema. we have to calculate soft deleted rows in the users table.

SELECT tbl_rows - (SELECT count(*) FROM public.users) 
FROM svv_table_info 
WHERE "schema" = 'public' AND "table" = 'users';

Understanding How to Run Amazon Redshift Vaccum

You can issue vacuum either on a table or on the complete database.

There are following 4 options available:

  • VACUUM DELETE ONLY: A DELETE ONLY vacuum reclaims disk space and do not sort new rows. For example, you might perform a DELETE ONLY vacuum operation if you don’t need to resort rows to optimize query performance.
  • VACUUM SORT ONLY: A SORT ONLY vacuum do not reclaim disk space it just sort new rows in the table.
  • VACUUM FULL: It is a combination of DELETE ONLY and SORT ONLY vacuum. As the operation is handled internally by Redshift it is better to run VACUUM FULL rather than manually running DELETE ONLY followed by SORT ONLY vacuum. VACUUM FULL is the same as VACUUM meaning VACUUM FULL is the default vacuum operation.
  • VACUUM REINDEX: Use this for tables that use interleaved sort keys. When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Redshift automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you need to run VACUUM REINDEX afterwards to initialize the interleaved index.

To perform vacuum use the query below

Amazon Redshift Vacuum Syntax.
Image Source: Self

E.g VACUUM FULL users;

This completes the first section of Amazon Redshift Vacuum and Analyze tutorial on when and how you can leverage Amazon Redshift Vacuum.

Using Amazon Redshift Analyze for High Performance

When a query is issued on Redshift, it breaks it into small steps, which includes the scanning of data blocks. To minimize the amount of data scanned, Redshift relies on stats provided by tables. Stats are outdated when new data is inserted in tables. ANALYZE is used to update stats of a table.

Although when there is a small change in the data in the table(i.e. When new rows are added in the table) it may not have a huge impact when there is a major change in stats, redshift starts to scan more data. So as to make the right query execution plan, Redshift Vacuum and Analyze requires knowing the stats about tables involved. Stats for table changes when new data is inserted or deleted.

Understanding When to Use Amazon Redshift Analyze

Similar to Redshift Vacuum, analyze too is a time-consuming operation. To get all the tables which need to have its stats updated you can run the query below.

SELECT "schema" + '.' + "table" FROM svv_table_info where stats_off > 10

This Redshift Vacuum and Analyze query will return all the tables whose stats are off by 10%.

Understanding How to Run Amazon Redshift Analyze

You can run the Amazon Redshift Vacuum and Analyze command using the following syntax:

Amazon Redshift Analyze syntax.
Image Source: Self

E.g. ANALYZE VERBOSE users;

Another way to improve the performance of Redshift is by re-structuring the data from OLTP to OLAP. You can create derived tables by pre-aggregating and joining the data for faster query performance. Solutions such as Hevo Data Integration Platform offer Data Modelling and Workflow Capability to achieve this in a simple and reliable manner. You can use Hevo for 14-day Free Trial.

Apart from this guide on Amazon Redshift Vacuum and Analyze, we have also discussed the right way to choose distribution keys and sort keys. Also, as part of our Amazon Redshift blog series, you can read a detailed account where we have gone deep into understanding Amazon Redshift architecture.

This completes the second section of the Amazon Redshift Vacuum and Analyze tutorial on when and how you can leverage Amazon Redshift Vacuum.

Conclusion

This article teaches you about the Amazon Redshift Vacuum and Analyze. It provides in-depth knowledge about the concepts associated with it to help you implement them and improve your query and data performance. Integrating complex data from numerous sources into a destination of your choice such as Amazon Redshift can be challenging especially for a beginner & this is where Hevo saves the day.

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline, helps you transfer data from a source of your choice in a fully automated and secure manner without having to write the code repeatedly.

Tell us about your experience of learning about Amazon Redshift Vacuum and Analyze! Share your thoughts in the comments section below!

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Sarad Mohanan
Software Engineer, Hevo Data

With around a decade of experience, Sarad has designed and developed fundamental components of Hevo. His expertise lies in building lean solutions for various software problems, mentoring fellow engineers and exploring new technologies.

All your customer data in one place.