Amazon Redshift is a petabyte-scale data warehouse, managing such mammoth disk space is no easy job. Unlike traditional databases which have limited disk space and performs housekeeping activity without user intervention, Redshift leaves it up to the user to perform its housekeeping activity so as not to hamper its performance. Ensuring the real-time availability of data should be one of the first things that you should work on to get the most out of your Redshift Data Warehouse. If you are dealing with a huge amount of data, then it will be an absolute necessity to guarantee accurate, consistent and latest data in the warehouse.
Real-time data integration solutions like Hevo can help you seamlessly move data from 100s of sources into Redshift in minutes. This not only guarantees data consistency and accuracy but also takes many ETL housekeeping tasks off you. Among other things, you might want to focus on Amazon Redshift Sort Keys and Amazon Redshift Distribution keys to optimize the query performance on Redshift.
For the scope of this article, we will talk about Amazon Redshift Vacuum and Analyze and how they can help optimize Redshift Performance by improving Redshift space utilization.
Table of Contents
- Introduction to Amazon Redshift
- Understanding the Need for Amazon Redshift Vacuum and Analyze
- Using Amazon Redshift Vaccum for High Performance
- Using Amazon Redshift Analyze for High Performance
Introduction to Amazon Redshift
Amazon Redshift is a fully-managed petabyte-scale cloud-based data warehouse, designed to store large scale data sets and perform insightful analysis on them in real-time.
It is highly column-oriented & designed to connect with SQL-based clients and business intelligence tools, making data available to users in real-time. Supporting PostgreSQL 8, Redshift delivers exceptional performance and efficient querying. Each Amazon Redshift data warehouse contains a collection of computing resources (nodes) organized in a cluster, each having an engine of its own and a database to it.
For further information on Amazon Redshift, you can check the official website here.
Understanding the Need for Amazon Redshift Vacuum and Analyze
Redshift does not automatically reclaim and reuse space that is freed when you delete rows and update rows. Moreover, when data is inserted into database Redshift does not sort it on the go. Redshift reclaims deleted space and sorts the new data when VACUUM query is issued.
Also to help plan the query execution strategy, redshift uses stats from the tables involved in the query like the size of the table, distribution style of data in the table, sort keys of the table etc. These stats information needs to be kept updated for better performance of queries on redshift, this is where Analyze command plays its role. In this document below we will discuss when and how to use Amazon Redshift Vacuum and Analyze.
Using Amazon Redshift Vaccum for High Performance
When data is inserted into Redshift, it is not sorted and is written on an unsorted block. With unsorted data on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. When you run a DELETE query, redshift soft deletes the data. Similar is the case when you are performing UPDATE, Redshift performs a DELETE followed by an INSERT in the background. When vacuum command is issued it physically deletes the data which was soft-deleted and sorts the data again.
Understanding When to Use Amazon Redshift Vaccum
Although Vacuum improves query performance it comes at a cost of time and hits performance during its execution. It is recommended to perform vacuum depending on the amount of space that needs to be reclaimed and also upon unsorted data. To figure out which tables require vacuuming we can run the following query.
SELECT "schema" + '.' + "table" FROM svv_table_info where unsorted > 10
The query above will return all the tables which have unsorted data of above 10%.
Figuring out tables which have soft deleted rows is not straightforward, as redshift does not provide this information directly. What it provides is the number of total rows in a table including ones that are marked for deletion(tbl_rows column in the svv_table_info table). To get an actual number of rows (excluding ones which are marked for deletion) you will simply have to run a count query on the table and figure out the number of rows which have been marked for deletion. For example, let’s consider a hypothetical table ‘users’ in ‘public’ schema. we have to calculate soft deleted rows in the users table.
SELECT tbl_rows - (SELECT count(*) FROM public.users) FROM svv_table_info WHERE "schema" = 'public' AND "table" = 'users';
Understanding How to Run Amazon Redshift Vaccum
You can issue vacuum either on a table or on the complete database.
There are following 4 options available:
- VACUUM DELETE ONLY: A DELETE ONLY vacuum reclaims disk space and do not sort new rows. For example, you might perform a DELETE ONLY vacuum operation if you don’t need to resort rows to optimize query performance.
- VACUUM SORT ONLY: A SORT ONLY vacuum do not reclaim disk space it just sort new rows in the table.
- VACUUM FULL: It is a combination of DELETE ONLY and SORT ONLY vacuum. As the operation is handled internally by Redshift it is better to run VACUUM FULL rather than manually running DELETE ONLY followed by SORT ONLY vacuum. VACUUM FULL is the same as VACUUM meaning VACUUM FULL is the default vacuum operation.
- VACUUM REINDEX: Use this for tables that use interleaved sort keys. When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Redshift automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you need to run VACUUM REINDEX afterwards to initialize the interleaved index.
To perform vacuum use the query below
E.g VACUUM FULL users;
This completes the first section of Amazon Redshift Vacuum and Analyze tutorial on when and how you can leverage Amazon Redshift Vacuum.
Using Amazon Redshift Analyze for High Performance
When a query is issued on Redshift, it breaks it into small steps, which includes the scanning of data blocks. To minimize the amount of data scanned, Redshift relies on stats provided by tables. Stats are outdated when new data is inserted in tables. ANALYZE is used to update stats of a table.
Although when there is a small change in the data in the table(i.e. When new rows are added in the table) it may not have a huge impact when there is a major change in stats, redshift starts to scan more data. So as to make the right query execution plan, Redshift requires knowing the stats about tables involved. Stats for table changes when new data is inserted or deleted.
Understanding When to Use Amazon Redshift Analyze
Similar to vacuum, analyze too is a time-consuming operation. To get all the tables which need to have its stats updated you can run the query below.
SELECT "schema" + '.' + "table" FROM svv_table_info where stats_off > 10
This query will return all the tables whose stats are off by 10%.
Understanding How to Run Amazon Redshift Analyze
You can run the Amazon Redshift Analyze command using the following syntax:
E.g. ANALYZE VERBOSE users;
Another way to improve the performance of Redshift is by re-structuring the data from OLTP to OLAP. You can create derived tables by pre-aggregating and joining the data for faster query performance. Solutions such as Hevo Data Integration Platform offer Data Modelling and Workflow Capability to achieve this in a simple and reliable manner. You can use Hevo for 7-day Free Trial.
Apart from this guide on Amazon Redshift Vacuum and Analyze, we have also discussed the right way to choose distribution keys and sort keys. Also, as part of our Amazon Redshift blog series, you can read a detailed account where we have gone deep into understanding Amazon Redshift architecture.
This completes the second section of Amazon Redshift Vacuum and Analyze tutorial on when and how you can leverage Amazon Redshift Vacuum.
This article teaches you about the Amazon Redshift Vacuum and Analyze. It provides in-depth knowledge about the concepts associated with it to help you implement them and improve your query and data performance. Integrating complex data from numerous sources into a destination of your choice such as Amazon Redshift can be challenging especially for a beginner & this is where Hevo saves the day. Hevo Data, a No-code Data Pipeline, helps you transfer data from a source of your choice in a fully-automated and secure manner without having to write the code repeatedly. Hevo, with its strong integration with 100+ sources & BI tools, allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiff.
Tell us about your experience of learning about Amazon Redshift Vacuum and Analyze! Share your thoughts in the comments section below!