Are you setting up AWS Redshift as your organization’s data warehouse? Or are you wondering what the guidelines and best practices to make the most out of your Redshift warehouse are? If the answer to any of these questions is yes, then you are in the right place. This article seeks to explore the various components that Redshift comprises of, and provides guidelines that can help you make the best out of the real-world Redshift cluster operations on the AWS public cloud. These AWS Redshift best practices will make your data warehousing operations a lot smoother and better.
Introduction to AWS Redshift
AWS Redshift is a very cost-effective cloud data warehouse that gives you access to high performance and high quality analytical services that can help you turn your organization into a data-driven enterprise. It costs less than $1,000/TB/Year which is roughly 10% of the traditional data warehouse TCO. It is capable of very high performance on datasets ranging from a gigabyte to a petabyte and beyond.
It uses modern data warehousing architectural approaches like columnar storage, data compression, and zone mapping to reduce the amount of I/O that is needed to perform queries and can offer significantly better performance than an OLTP data warehouse optimised for the same use case.
In order to utilize Redshift’s complete potential, you first need to set it up by consolidating all your data from different sources that you are using. These may include cloud applications, databases, or local files. To do this, you may use custom-built data pipelines or opt for a completely managed data pipeline solution that works out of the box. Custom development will require you to invest significantly in engineering resources, whereas a No-code Data Pipeline like Hevo can automate this process without any hassles.
Hevo, a No-code Data Pipeline, automates the entire process of ingesting data from various sources to Redshift in real-time. Hevo is also an official AWS Technology Partner. Hevo is currently able to integrate with hundreds of data sources ranging from SQL, NoSQL, SaaS products, etc. with the click of a button. Some of its salient features are:
- Fully automated, easy to set up data pipelines.
- Fault-tolerant replication with a zero data loss guarantee.
- On-the-fly data transformations.
- Zero maintenance and supervision.
- Near real-time data replication.
History of AWS Redshift
Launched in February 2013, Redshift is one of the fastest growing services in the AWS ecosystem. AWS took PostgreSQL and packaged that with OLAP, MPP, and Columnar storage. It also works with all the AWS dependent services such as:
- Amazon S3
- AWS KMS
- Amazon VPC
- Amazon SWF
- Amazon Route S3
- Amazon CloudWatch
- Amazon EC2
AWS Redshift Cluster Architecture
The cluster is split into two parts: one is the leader node and others are compute nodes.
There are two or more compute nodes attached to a leader node. The leader node:
- Is the SQL endpoint, so your Tableau servers, Workbench all connect to the leader node.
- Exposes PostgreSQL catalog tables.
- Stores metadata.
- Coordinates parallel SQL processing.
- Does parsing and execution of your queries. When a query is submitted to a leader node, it converts that to C++ code and pushes it down to the compute nodes for them to execute.
On the compute nodes, parallel query execution processes occur. Backup and restore processes also occur in AWS S3. AWS continuously backs up data into S3 for you. Simultaneous replication processes are also happening between the compute nodes.
Local storage is attached to the compute nodes which contains your:
- Tables (columns, blocks)
AWS Redshift Best Practices
Having understood the basics of AWS Redshift architecture, you will now learn some of the top Redshift storage best practices associated with the following parameters:
- Data Distribution
- QMR: Query Monitoring Rules
- Distribution Key
- Sorting Data
- Schema Design
- Vacuuming Tables
Performance in Redshift comes from an MPP (Massively Parallel Processing) architecture that enables the system to parallelize and distribute SQL operations taking advantage of all the resources that are available in your cluster. To leverage the parallel processing system, ideally you would want to distribute your data across the cluster. There are four ways of distributing data in Redshift:
You can distribute data using KEYS where you define a KEY such that it describes how you want to distribute the data across your compute nodes. The rows are distributed according to the values in the joining columns so that matching values from the common columns are physically stored together.
You can distribute your data into all the slices in your Redshift cluster. A copy of the entire table is distributed to every node.
In EVEN distribution, the leader node distributes the rows across the slices using a round-robin algorithm. EVEN distribution is appropriate when a table does not participate in joins or when there is no clear choice between KEY distribution and ALL distribution.
Redshift assigns an optimal distribution style based on the size of the table data. If you don’t specify a distribution style with the CREATE TABLE statement, Amazon Redshift applies AUTO distribution.
AWS Redshift Best Practices: Data Distribution
Ensure that your data is evenly distributed and your KEY is able to distribute data across your compute nodes appropriately. If you have lopsided distribution you will see that your cluster won’t perform optimally.
The underlying hardware is designed for high performance data processing using features such as locally attached storage to maximize throughput between CPUs and drives. Each compute node contains ephemeral storage disks with each node having between 1, 3, 8, and 24 disks depending on the node type. Each disk is split into two partitions:
- Local data storage accessed by compute nodes
- Mirrored/remote data accessed by remote compute nodes
AWS Redshift Best Practices: Disks
AWS exposes internal views that showcase disks for you. You can take a look at these views to understand how much of the data of the remote node is stored in your local compute nodes.
Blocks are immutable 1MB chunks, that are used to persist columnar data in Redshift. A block contains in-memory metadata on it with:
- Zone Maps information
- MVCC data
- Location of previous/next block
A full block contains between 16 and 8.4 million values.
AWS Redshift Best Practices: Blocks
Blocks are always formatted when you do an update. During an update event, AWS has to format your block and a new block has to be picked up when an update is done. You therefore need to be careful when making many updates. In this case, you have to do a vacuum and a deep copy right after an update to reclaim space and sorts. By doing this, you avoid having ghost rows inside blocks.
I/O in your database is the biggest concern for query performance. Redshift was designed to reduce I/O but you have to use these components wisely to achieve maximum throughput.
AWS Redshift Best Practices: Queries
You can have multiple Redshift clusters hitting your data in S3 through a Spectrum cluster which means you are able to increase the concurrency for your Redshift cluster.
Using a Spectrum cluster allows you to run SQL queries directly against data stored in your Amazon S3 datalake using thousands of compute nodes without having to load the data into your Redshift cluster.
Spectrum supports SQL and is ANSI compliant. You should push SQL predicates into the Spectrum layer so that your Redshift cluster is not overburdened.
QMR: Query Monitoring Rules
Redshift exposes the QMR feature which allows you to set metrics-based performance boundaries for workload management queues and concurrency, and also to specify what action to take when a query goes beyond the set boundaries.
AWS Redshift Best Practices: Query Monitoring
Monitor your inflight queries and then check if a query is taking up too much memory, too much CPU, or is returning more rows than needed. QMR allows you to kill such queries which means your data scientists cannot do runaway queries on your Redshift cluster.
Distribution keys determine where data is stored in Redshift.
AWS Redshift Best Practices: Distribution Keys
You would want to look for data that is most commonly joined by the most commonly run queries. You may also want to look at queries that compute/consume the most CPU and use that as a mechanism for identifying how you should distribute data.
In the slices on a disk, the data is sorted by a sort key. If there is no sort key, it defaults to using the insertion order.
AWS Redshift Best Practices: Sort Keys
When working with sort keys, you would want to use a sort key that is frequently used in your queries as a:
- Query predicate (date range, identifier)
- Join parameter with a hash key
The sort key allows Redshift to use zone maps to avoid reading entire blocks based on predicates e.g. a table containing a timestamp for a sort key, will skip blocks containing old data.
To optimise your data for querying, you need to co-locate data by frequently joined tables using distribution keys to avoid data transfers between nodes.
AWS Redshift Best Practices: Schema Design
Use sort keys on joined columns to allow fast merge joins and use of zone maps. This compression allows you to denormalize data without a heavy overhead or penalty in incremental storage allowing you to simplify queries and limit joins. Always denormalise commonly used join attributes onto large tables.
AWS does not automatically reclaim the unused space that is free when you delete rows and update rows. To perform an update, Redshift deletes the original row and appends the updated row so every update is effectively deleted followed by an INSERT.
Rows are marked for deletion but are not removed and the query processor will scan deleted rows as well as rows that are undeleted.
Best Practice for Vacuuming Tables
Vacuuming is therefore an important aspect of operating clusters if you intend to run them for a long time and make significant updates.
Vacuuming operations need to be regularly carried out to clean up tables after a bulk delete, a load, or a series of incremental updates with the Vacuum command either against the entire database or against individual tables. This will reclaim space, purge those deleted rows and will improve query performance.
In this post, we have looked at multiple best practices for optimizing your Redshift cluster to pull off high-performance data analytics. These practices will let you make the best out of the varied amazing features that Redshift has to offer.
Are there any other AWS Redshift best practices that make data warehousing simpler and more efficient? Please let us know in the comments.