Redshift Vacuum and Analyze: 4 Critical Aspects

on Engineering • September 25th, 2018 • Write for Hevo

Amazon Redshift is a petabyte-scale data warehouse, managing such mammoth disk space is no easy job. Unlike traditional databases which have limited disk space and performs housekeeping activity without user intervention, Redshift leaves it up to the user to perform its housekeeping activity so as not to hamper its performance. Ensuring the real-time availability of data should be one of the first things that you should work on to get the most out of your Redshift Data Warehouse. If you are dealing with a huge amount of data, then it will be an absolute necessity to guarantee accurate, consistent and latest data in the warehouse.

Real-time data integration solutions like Hevo can help you seamlessly move data from 100s of sources into Redshift in minutes. This not only guarantees data consistency and accuracy but also takes many ETL housekeeping tasks off you. Among other things, you might want to focus on Amazon Redshift Sort Keys and Amazon Redshift Distribution keys to optimize the query performance on Redshift.

For the scope of this article, we will talk about Amazon Redshift Vacuum and Analyze and how they can help optimize Redshift Performance by improving Redshift space utilization.

Table of Contents

Introduction to Amazon Redshift

Redshift Logo.
Image Source:https://www.cloud-rush.com/amazon-redshift-aws-analytics-tool/

Amazon Redshift is a fully-managed petabyte-scale cloud-based data warehouse, designed to store large scale data sets and perform insightful analysis on them in real-time.

It is highly column-oriented & designed to connect with SQL-based clients and business intelligence tools, making data available to users in real-time. Supporting PostgreSQL 8, Redshift delivers exceptional performance and efficient querying. Each Amazon Redshift data warehouse contains a collection of computing resources (nodes) organized in a cluster, each having an engine of its own and a database to it.

Amazon Redshift is known for providing consistently fast performance, even in the face of thousands of concurrent queries. Amazon Redshift Concurrency Scaling supports nearly unlimited concurrent queries and users. By leveraging Redshift’s managed storage, capacity is added to support workloads of up to 8 PB of compressed data.

For further information on Amazon Redshift, you can check the official website here.

Understanding the Need for Amazon Redshift Vacuum and Analyze

Redshift does not automatically reclaim and reuse space that is freed when you delete rows and update rows. Moreover, when data is inserted into database Redshift does not sort it on the go. Redshift reclaims deleted space and sorts the new data when VACUUM query is issued.

Also to help plan the query execution strategy, Redshift Vacuum and Analyze uses stats from the tables involved in the query like the size of the table, distribution style of data in the table, sort keys of the table etc. These stats information needs to be kept updated for better performance of queries on Redshift, this is where Analyze command plays its role. In this document below we will discuss when and how to use Amazon Redshift Vacuum and Analyze.

Simplify your Data Analysis with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ sources (including 30 Free Data Sources) to a destination like Redshift of your choice in real-time in an effortless manner. 

Get Started with Hevo for Free

Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources allows users to bring in data of different kinds in a smooth fashion without having to code a single line. 

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors: Hevo supports 100+ integrations to SaaS platforms such as Trello, files, databases, analytics, and BI tools. It supports various destinations including Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL databases to name a few.  
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Using Amazon Redshift Vaccum for High Performance

Using Amazon Redshift Vaccum to Sort Data.
Image Source: https://docs.aws.amazon.com/redshift/latest/dg/vacuum-managing-volume-of-unmerged-rows.html

When data is inserted into Redshift, it is not sorted and is written on an unsorted block. With unsorted data on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. When you run a DELETE query, redshift soft deletes the data. Similar is the case when you are performing UPDATE, Redshift performs a DELETE followed by an INSERT in the background. When vacuum command is issued it physically deletes the data which was soft-deleted and sorts the data again.

Understanding When to Use Amazon Redshift Vaccum

Sorting and Merging rows with Amazon Redshift Vacuum.
Image Source: https://docs.aws.amazon.com/redshift/latest/dg/vacuum-managing-volume-of-unmerged-rows.html

Although Vacuum improves query performance it comes at a cost of time and hits performance during its execution. It is recommended to perform Redshift Vacuum and Analyze depending on the amount of space that needs to be reclaimed and also upon unsorted data. To figure out which tables require vacuuming we can run the following query.

SELECT "schema" + '.' + "table" FROM svv_table_info where unsorted > 10

The query above will return all the tables which have unsorted data of above 10%.

Figuring out tables which have soft deleted rows is not straightforward, as Redshift Vacuum and Analyze does not provide this information directly. What it provides is the number of total rows in a table including ones that are marked for deletion(tbl_rows column in the svv_table_info table). To get an actual number of rows (excluding ones which are marked for deletion) you will simply have to run a count query on the table and figure out the number of rows which have been marked for deletion. For example, let’s consider a hypothetical table ‘users’ in ‘public’ schema. we have to calculate soft deleted rows in the users table.

SELECT tbl_rows - (SELECT count(*) FROM public.users) 
FROM svv_table_info 
WHERE "schema" = 'public' AND "table" = 'users';

Understanding How to Run Amazon Redshift Vaccum

You can issue vacuum either on a table or on the complete database.

There are following 4 options available:

  • VACUUM DELETE ONLY: A DELETE ONLY vacuum reclaims disk space and do not sort new rows. For example, you might perform a DELETE ONLY vacuum operation if you don’t need to resort rows to optimize query performance.
  • VACUUM SORT ONLY: A SORT ONLY vacuum do not reclaim disk space it just sort new rows in the table.
  • VACUUM FULL: It is a combination of DELETE ONLY and SORT ONLY vacuum. As the operation is handled internally by Redshift it is better to run VACUUM FULL rather than manually running DELETE ONLY followed by SORT ONLY vacuum. VACUUM FULL is the same as VACUUM meaning VACUUM FULL is the default vacuum operation.
  • VACUUM REINDEX: Use this for tables that use interleaved sort keys. When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Redshift automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you need to run VACUUM REINDEX afterwards to initialize the interleaved index.

To perform vacuum use the query below

Amazon Redshift Vacuum Syntax.
Image Source: Self

E.g VACUUM FULL users;

This completes the first section of Amazon Redshift Vacuum and Analyze tutorial on when and how you can leverage Amazon Redshift Vacuum.

Using Amazon Redshift Analyze for High Performance

When a query is issued on Redshift, it breaks it into small steps, which includes the scanning of data blocks. To minimize the amount of data scanned, Redshift relies on stats provided by tables. Stats are outdated when new data is inserted in tables. ANALYZE is used to update stats of a table.

Although when there is a small change in the data in the table(i.e. When new rows are added in the table) it may not have a huge impact when there is a major change in stats, redshift starts to scan more data. So as to make the right query execution plan, Redshift Vacuum and Analyze requires knowing the stats about tables involved. Stats for table changes when new data is inserted or deleted.

Understanding When to Use Amazon Redshift Analyze

Similar to Redshift Vacuum, analyze too is a time-consuming operation. To get all the tables which need to have its stats updated you can run the query below.

SELECT "schema" + '.' + "table" FROM svv_table_info where stats_off > 10

This Redshift Vacuum and Analyze query will return all the tables whose stats are off by 10%.

Understanding How to Run Amazon Redshift Analyze

You can run the Amazon Redshift Vacuum and Analyze command using the following syntax:

Amazon Redshift Analyze syntax.
Image Source: Self

E.g. ANALYZE VERBOSE users;

Another way to improve the performance of Redshift is by re-structuring the data from OLTP to OLAP. You can create derived tables by pre-aggregating and joining the data for faster query performance. Solutions such as Hevo Data Integration Platform offer Data Modelling and Workflow Capability to achieve this in a simple and reliable manner. You can use Hevo for 14-day Free Trial.

Apart from this guide on Amazon Redshift Vacuum and Analyze, we have also discussed the right way to choose distribution keys and sort keys. Also, as part of our Amazon Redshift blog series, you can read a detailed account where we have gone deep into understanding Amazon Redshift architecture.

This completes the second section of Amazon Redshift Vacuum and Analyze tutorial on when and how you can leverage Amazon Redshift Vacuum.

Conclusion

This article teaches you about the Amazon Redshift Vacuum and Analyze. It provides in-depth knowledge about the concepts associated with it to help you implement them and improve your query and data performance. Integrating complex data from numerous sources into a destination of your choice such as Amazon Redshift can be challenging especially for a beginner & this is where Hevo saves the day.

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline, helps you transfer data from a source of your choice in a fully automated and secure manner without having to write the code repeatedly. Hevo, with its strong integration with 100+ sources & BI tools, allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiff.

Tell us about your experience of learning about Amazon Redshift Vacuum and Analyze! Share your thoughts in the comments section below!

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

No-code Data Pipeline for Redshift