Amazon Redshift is a fully-managed, cloud data warehouse that you can draw useful insights from. Clients rely on Amazon Redshift to analyze data and to execute complex analytical queries.
You can receive real-time insights and predictive analytics on your operational databases, data lake, data warehouse, and third-party datasets.
This article will talk about Redshift Managed Storage and Columnar Data Storage in Redshift Data Storage.
What is Amazon Redshift Managed Storage?
Redshift Managed Storage (RMS) is a separate storage tier for data warehouse data. RMS provides the capacity to scale your Amazon S3 storage up to petabytes.
RMS also enables you to scale and pay for computing and storage independently, allowing you to size your cluster based solely on computing requirements. As a tier-1 cache, it automatically employs high-performance SSD-based local storage.
It also takes advantage of optimizations, such as data block temperature, data blockage, and workload patterns, to deliver high performance while automatically scaling storage to Amazon S3 when required.
How to Use Amazon Redshift’s Managed Storage?
Suppose you are currently utilizing Amazon Redshift Dense Storage or Dense Compute nodes. In that case, you can use Elastic Resize to upgrade your existing clusters to the new compute instance RA3 Amazon Redshift Serverless.
For clusters utilizing the RA3 instance use Redshift-managed storage to store data automatically. This capability can be utilized without additional steps beyond Amazon Redshift Serverless or RA3 instances.
Columnar Data Storage
Columnar storage for database tables is crucial for optimizing the performance of analytic queries because it drastically reduces the overall disc I/O requirements. It reduces the quantity of data that must be loaded from the disc.
The following images illustrate how columnar data storage implements efficiencies and how this translates to memory retrieval efficiencies.
This first illustration demonstrates how rows of database table records are typically stored in disc blocks.
- A typical relational database table row contains the values for a single record’s fields.
- In row-wise database storage, data blocks store sequential values for each column comprising the entire row.
- If the block size is smaller than the record size, it may take more than one block to store an entire record.
- If the block size is greater than the record size, a record may be stored in less than one block, resulting in inefficient use of disc space.
- The majority of transactions in online transaction processing (OLTP) applications involve reading and writing the values for entire records, typically one record or a small number of records at a time. Row-wise storage is, therefore, optimal for OLTP databases.
The following illustration demonstrates how columnar storage stores the values for each column sequentially in disc blocks.
- Using columnar storage, each data block stores values for multiple rows in a single column. As records enter the system, Amazon Redshift transparently converts each column’s data to columnar storage.
- In this simplified example, columnar storage allows each data block to contain column field values for up to three times as many records as row-based storage.
- This indicates that reading the same number of column field values for the same number of records requires one-third fewer I/O operations than row-wise storage. In practice, the storage efficiency is even greater when utilizing tables with a very large number of columns and rows.
- Since each block contains the same data type, block data can utilize a compression scheme chosen specifically for the column data type, reducing disc space and I/O even further.
- The space savings realized when storing data on a disc are also realized when retrieving and storing the same data in memory. Since many database operations only need to access or operate on one or a small number of columns at a time, you can conserve memory by only retrieving blocks for columns that are required by a query.
- Whereas OLTP transactions typically involve most or all of the columns in a row for a small number of records, data warehouse queries typically read only a handful of columns for a vast number of rows.
- This implies that reading the same quantity of column field values for the same quantity of rows requires a fraction of the I/O operations. It utilizes a fraction of the memory needed to process row-wise blocks. In practice, when utilizing tables with many columns and rows, the efficiency gains are proportionally greater.
- For example, let’s consider a table containing 100 columns. A query that utilizes five columns will only require reading approximately five percent of the table’s data. This is repeated for millions, billions, or even trillions of records in large databases. A row-wise database, on the other hand, would also read the 95 unnecessary columns’ blocks.
- The average size of a database block ranges between 2 KB and 32 KB.
- Amazon Redshift utilizes a block size of 1 MB, which is more efficient and reduces the number of I/O requests required to perform database loading and other operations associated with a query run.
Backup and Restoration of Data
- Amazon Redshift RA3 clusters and Amazon Redshift Serverless utilize Redshift Managed Storage, which always has the most recent data copy available. The DS2 and DC2 clusters mirror the cluster’s data to ensure that the most recent copy is accessible in the event of a failure.
- On all Redshift cluster types, backups are created automatically and kept for 24 hours, and on serverless environments, recovery points are available for the past 24 hours.
- Additionally, you can create your own backups that can be kept indefinitely. The Amazon Redshift automated backups or Amazon Redshift Serverless recovery points can be converted into user backups for longer retention.
- For disaster recovery, Amazon Redshift can also asynchronously replicate your snapshots or recovery points to Amazon S3 in a different Region.
- On a DS2 or DC2 cluster, free backup storage is limited to the total storage capacity of the data warehouse cluster’s nodes and is only applicable to active data warehouse clusters.
- For instance, if your data warehouse has a total storage capacity of 8 TB, we will provide up to 8 TB of backup storage at no additional cost. Using the AWS Management Console or Amazon Redshift APIs, you can extend your backup retention period beyond a single day.
- Amazon Redshift only backs up changed data, so most snapshots consume a negligible amount of your free backup storage. When you need to restore a backup, you can access all automated backups created during your backup retention period. Once you select a backup to restore from, they will create a new data warehouse cluster and restore your data to it.
What happens to the backups if you delete your data warehouse cluster?
When deleting a data warehouse cluster, you can specify whether a final snapshot should be created. This enables a later restoration of the deleted data warehouse cluster.
Unless you delete them, all previously created manual snapshots of your data warehouse cluster will be retained and billed at standard Amazon S3 rates.
Best Practices
Following are a few best practices for designing tables and loading tables with data in Redshift Data Storage:
- Amazon Redshift stores your data on a disc according to the sort key in sorted order. When determining optimal query plans, Amazon Redshift’s query optimizer employs sort order.
- Specify AUTO for the sort key to have Amazon Redshift determine the optimal sort order.
- If the most frequently requested data is recent, specify the timestamp column as the leading column for the sort key.
- Avoid the habit of using the maximum column size for convenience. During the processing of complex queries, it may be necessary to store intermediate query results in temporary tables.
- Since temporary tables are not compressed, unnecessarily large columns consume an excessive amount of memory and temporary disc space, which can negatively impact the performance of queries.
Conclusion
In this article about Redshift Data Storage, you learned about Amazon Redshift’s managed storage and columnar data storage. You also learned the best practices for designing tables and loading data. You also read about some of the limitations of Amazon Redshift.
Getting data from or into Amazon Redshift can be a time-consuming and resource-intensive task. Instead of spending months developing and maintaining such data integrations, you can enjoy a smooth ride with Hevo Data’s 150+ plug-and-play integrations (including 40+ free sources).
Saving countless hours of manual data cleaning & standardizing, Hevo Data’s pre-load data transformations get it done in minutes via a simple drag n drop interface or your custom python scripts.
No need to go to your data warehouse for post-load transformations. You can run complex SQL transformations from the comfort of Hevo’s interface and get your data in the final analysis-ready form.
Want to take Hevo Data for a ride? Sign Up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.
Sharon is a data science enthusiast with a hands-on approach to data integration and infrastructure. She leverages her technical background in computer science and her experience as a Marketing Content Analyst at Hevo Data to create informative content that bridges the gap between technical concepts and practical applications. Sharon's passion lies in using data to solve real-world problems and empower others with data literacy.