Why Redshift Vacuum and Analyze?
Amazon Redshift is a petabyte-scale data warehouse, managing such mammoth disk space is no easy job. Unlike traditional databases which have limited disk space and performs housekeeping activity without user intervention, Redshift leaves it up to the user to perform its housekeeping activity so as not to hamper its performance.
Redshift does not automatically reclaim and reuse space that is freed when you delete rows and update rows. Moreover, when data is inserted into database Redshift does not sort it on the go. Redshift reclaims deleted space and sorts the new data when VACUUM query is issued.
Also to help plan the query execution strategy, redshift uses stats from the tables involved in the query like the size of the table, distribution style of data in the table, sort keys of the table etc. These stats information needs to be kept updated for better performance of queries on redshift, this is where ANALYZE command plays its role. In this document below we will discuss when and how to use both VACUUM and ANALYZE.
Redshift Vacuum – In Detail
When data is inserted into Redshift, it is not sorted and is written on an unsorted block. With unsorted data on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. When you run a DELETE query, redshift soft deletes the data. Similar is the case when you are performing UPDATE, Redshift performs a DELETE followed by an INSERT in the background. When vacuum command is issued it physically deletes the data which was soft deleted and sorts the data again.
WHEN should you perform Vacuum?
Although vacuum improves query performance it comes at a cost of time and hits performance during its execution. It is recommended to perform vacuum depending on the amount of space that needs to be reclaimed and also upon unsorted data. To figure out which tables require vacuuming we can run the following query.
SELECT "schema" + '.' + "table" FROM svv_table_info where unsorted > 10
The query above will return all the tables which have unsorted data of above 10%.
Figuring out tables which have soft deleted rows is not straightforward, as redshift does not provide this information directly. What it provides is the number of total rows in a table including ones that are marked for deletion(tbl_rows column in the svv_table_info table). To get an actual number of rows (excluding ones which are marked for deletion) you will simply have to run a count query on the table and figure out the number of rows which have been marked for deletion. For example, let’s consider a hypothetical table ‘users’ in ‘public’ schema. we have to calculate soft deleted rows in the users table.
SELECT tbl_rows - (SELECT count(*) FROM public.users) FROM svv_table_info WHERE "schema" = 'public' AND "table" = 'users';
HOW should you run Vacuum?
You can issue vacuum either on a table or on the complete database.
There are following 4 options available:
- VACUUM DELETE ONLY: A DELETE ONLY vacuum reclaims disk space and do not sort new rows. For example, you might perform a DELETE ONLY vacuum operation if you don’t need to resort rows to optimize query performance.
- VACUUM SORT ONLY: A SORT ONLY vacuum do not reclaim disk space it just sort new rows in the table.
- VACUUM FULL: It is a combination of DELETE ONLY and SORT ONLY vacuum. As the operation is handled internally by Redshift it is better to run VACUUM FULL rather than manually running DELETE ONLY followed by SORT ONLY vacuum. VACUUM FULL is the same as VACUUM meaning VACUUM FULL is the default vacuum operation.
- VACUUM REINDEX: Use this for tables that use interleaved sort keys. When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Redshift automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you need to run VACUUM REINDEX afterward to initialize the interleaved index.
To perform vacuum use the query below
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX ] [ [ table_name ] [ TO threshold PERCENT ] ]
E.g VACUUM FULL users;
Redshift Analyze – In Detail
When a query is issued on Redshift, it breaks it into small steps, which includes scanning of data blocks. To minimize the amount of data scanned, Redshift relies on stats provided by tables. Stats are outdated when new data is inserted in tables. ANALYZE is used to update stats of a table.
Although when there is a small change in the data in the table(i.e. When new rows are added in the table) it may not have a huge impact when there is a major change in stats, redshift starts to scan more data. So as to make the right query execution plan, Redshift requires knowing the stats about tables involved. Stats for table changes when new data is inserted or deleted.
WHEN should you perform Analyze?
Similar to vacuum, analyze too is a time-consuming operation. To get all the tables which need to have its stats updated you can run the query below.
SELECT "schema" + '.' + "table" FROM svv_table_info where stats_off > 10
This query will return all the tables whose stats are off by 10%.
HOW should you run Analyze?
To perform analyze run the query below.
ANALYZE [ VERBOSE ] [ [ table_name [ ( column_name [, ...] ) ] ] [ PREDICATE COLUMNS | ALL COLUMNS ]
E.g. ANALYZE VERBOSE users;
Apart from this guide on Redshift Vacuum and Analyze, we have also discussed about the right way to choose distribution keys and sort keys. Also, as part of our Amazon Redshift blog series, you can read a detailed account where we have gone deep into understanding Amazon Redshift Architecture.
At Hevo, we use Amazon Redshift as one of the data warehouses and bring data from different sources in real-time. Data sources could be – databases, cloud applications, clickstreams. Let us know your experience with Redshift Vacuum and Analyze commands in the comments below.