AWS Redshift is a pioneer among fully managed data warehouse services. With its ability to scale to petabytes of data, a comprehensive Postgres-compatible querying engine, and a multitude of AWS tools to augment the core capability, Redshift provides everything a customer needs to use it as their sole data warehouse solution. All of this comes at a reasonable price, considering that the customer is relieved of all the housekeeping involved in maintaining an ultra-reliable, always-available data warehouse. However, understanding Redshift pricing is not straightforward: AWS offers a wide variety of pricing options to choose from depending on the use case and the customer’s budget constraints.
In this post, we will explore the different Redshift pricing options available, along with some best practices that can help you optimize your data warehouse costs.
Table of Contents
- Introduction to Redshift
- Factors that affect Amazon Redshift Pricing
- Effect of Node Type on Redshift Pricing
- Effect of Regions on Redshift Pricing
- On-demand vs Reserved Instance Pricing
- Optimizing Redshift ETL Cost
Introduction to Redshift
Amazon Redshift is a fully-managed petabyte-scale cloud-based data warehouse, designed to store large scale data sets and perform insightful analysis on them in real-time.
It is column-oriented and designed to connect with SQL-based clients and business intelligence tools, making data available to users in real-time. Based on PostgreSQL 8, Redshift delivers exceptional performance and efficient querying. Each Amazon Redshift data warehouse contains a collection of computing resources (nodes) organized in a cluster, each with its own engine and database.
For further information on Amazon Redshift, you can check the official site here.
Hevo Data: A Smart Alternative for Redshift ETL
Hevo Data, a No-code Data Pipeline helps to transfer data from multiple sources to Redshift. Hevo is fully-managed and completely automates the process of not only exporting data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using BI tools such as Tableau and many more.
Check out some amazing features of Hevo:
- Completely Managed Platform: Hevo is fully managed. You need not invest time and effort to maintain or monitor the infrastructure involved in executing code.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
- Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to export.
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
- Minimal Learning: With its simple and interactive UI, Hevo is easy for new customers to work with and perform operations on.
Simplify your data analysis with Hevo today! Sign up here for a 14-day free trial!
Factors that affect Amazon Redshift Pricing
Amazon Redshift Pricing is broadly affected by four factors:
- The node type that the customer chooses to build their cluster.
- The region where the cluster is deployed.
- Billing strategy – on-demand billing or a reserved pricing strategy.
- Use of Redshift Spectrum.
You can learn about these factors in depth, in the following sections:
- Effect of Node Type on Redshift Pricing
- Effect of Regions on Redshift Pricing
- On-demand vs Reserved Instance Pricing
- Amazon Redshift Spectrum Pricing
Effect of Node Type on Redshift Pricing
Redshift follows a cluster-based architecture with multiple nodes, allowing it to process data in a massively parallel fashion. (You can read more about Redshift architecture here.) This means Redshift performance is directly correlated with the specification and number of nodes that form the cluster. Redshift offers multiple kinds of nodes, from which customers can choose based on their computing and storage requirements.
- Dense compute nodes: These nodes are optimized for computing, offering SSDs of up to 2.5 TB and physical memory of up to 244 GB. Pricing also depends on the region in which your cluster is located. The lowest-spec dc2.large instance costs between $0.25 and $0.37 per hour depending on the region. A higher-spec version, dc2.8xlarge, costs anywhere from $4.80 to $7.00 per hour depending on the region.
- Dense storage nodes: These nodes offer higher storage capacity per node, but the storage hardware is HDD-based. Dense storage nodes also come in two versions: a basic version called ds2.xlarge, which offers HDDs of up to 2 TB per node, and a higher-spec ds2.8xlarge version, which offers HDDs of up to 16 TB per node. Prices range from $0.85 to $1.40 per hour for the basic version and $6 to $11 per hour for the ds2.8xlarge version.
As mentioned in the above sections, Redshift pricing varies widely depending on the node type. Another critical constraint is that a cluster can only be formed from nodes of the same type, so you will need to find the most suitable node type for your specific use case. As a rule of thumb, AWS itself recommends dense compute nodes for use cases involving less than 500 GB of data. It is possible to use previous-generation nodes for a further price reduction, but we do not recommend them since they lack the critical elastic resize feature, which means scaling could take hours on such nodes.
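To make the node-type trade-offs concrete, here is a minimal sketch of a monthly cost estimator. The hourly rates are illustrative lower bounds of the ranges quoted above, not authoritative figures; always check the AWS pricing page for your region's exact rates.

```python
# Rough monthly on-demand cost estimator for a Redshift cluster.
# Rates below are the lower bounds of the per-hour ranges quoted in
# this article and are illustrative only.
HOURLY_RATES = {
    "dc2.large": 0.25,
    "dc2.8xlarge": 4.80,
    "ds2.xlarge": 0.85,
    "ds2.8xlarge": 6.00,
}

HOURS_PER_MONTH = 730  # common billing approximation for one month


def monthly_cost(node_type: str, node_count: int) -> float:
    """Estimate the monthly on-demand cost of a cluster in USD."""
    return HOURLY_RATES[node_type] * node_count * HOURS_PER_MONTH


# A two-node dc2.large cluster: 0.25 * 2 * 730
print(monthly_cost("dc2.large", 2))  # → 365.0
```

Remember that all nodes in a cluster must be of the same type, so the estimate scales linearly with node count for a given type.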
Effect of Regions on Redshift Pricing
Since Amazon’s costs for running its data centres vary across the world, node pricing also varies over a large range depending on the region where the cluster is deployed. Let’s consider some of the factors that may affect the decision of which region to deploy the cluster in.
- While choosing regions, it may not be sensible to choose the regions with the cheapest price, because the data transfer time can vary according to the distance at which the clusters are located from their data source or targets. It is best to choose a location that is nearest to your data source.
- In specific cases, this decision may be further complicated by the mandates to follow data storage compliance, which requires the data to be kept in specific country boundaries.
- AWS deploys its features in different regions in a phased manner. While choosing regions, it would be worthwhile to ensure that the AWS features that you intend to use outside of Redshift are available in your preferred region.
In general, US-based regions offer the cheapest prices, while Asia-based regions are the most expensive.
On-demand vs Reserved Instance Pricing
Redshift offers discounts on its usual rates if the customer can commit to using the clusters for a longer duration, usually measured in years. Amazon claims savings of up to 75 per cent with reserved instance pricing. With reserved pricing, you pay the predefined amount for the term regardless of whether the cluster is active. Redshift currently offers three types of reserved pricing strategies:
- No upfront: This is offered only for a one-year duration. The customer gets a 20 per cent discount over existing on-demand prices.
- Partial upfront: The customer pays half the money upfront and the rest in monthly instalments. Amazon assures up to a 41 per cent discount on on-demand prices over one year and 71 per cent over three years. This can be purchased for a one- to three-year duration.
- Full payment upfront: Amazon claims a 42 per cent discount over a one-year period and a 75 per cent discount over three years if the customer chooses this option.
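The discount percentages quoted above can be plugged into a quick comparison of on-demand versus reserved totals. The hourly rate and node count in this sketch are illustrative assumptions; only the discount percentages come from the figures above.

```python
# Compare on-demand vs reserved-instance totals over a term, using the
# discount percentages quoted in this article (20% no-upfront for 1 year,
# 41%/71% partial upfront, 42%/75% full upfront). The $0.25/hour rate is
# an illustrative dc2.large figure, not an official price.
HOURS_PER_YEAR = 8760


def total_cost(hourly_rate: float, nodes: int, years: int,
               discount_pct: float = 0.0) -> float:
    """Total cluster cost in USD over the term, with an optional
    reserved-instance discount applied to the on-demand base."""
    base = hourly_rate * nodes * HOURS_PER_YEAR * years
    return base * (1 - discount_pct / 100)


on_demand = total_cost(0.25, 2, 3)                        # → 13140.0
full_upfront = total_cost(0.25, 2, 3, discount_pct=75)    # → 3285.0
print(on_demand, full_upfront)
```

Even with rough numbers like these, the gap between on-demand and a three-year full-upfront commitment is large enough to justify estimating expected cluster utilization before choosing a strategy.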
Even though the on-demand strategy offers the most flexibility, customers can save a significant amount of money if they are sure the cluster will be engaged over a longer period of time. Redshift’s concurrency scaling is charged at on-demand rates on a per-second basis for every transient cluster that is used. AWS provides 1 hour of free concurrency scaling credit for every 24 hours that a cluster remains active, with the free credit calculated on a per-hour basis.
Amazon Redshift Spectrum Pricing
Redshift Spectrum is a querying engine service offered by AWS that lets customers use the computing capability of Redshift clusters on data stored in S3 in various formats. This feature enables customers to add external tables to Redshift clusters and run complex read queries over them without actually loading or copying the data into Redshift.
Redshift Spectrum pricing is based on the amount of data scanned by each query and is fixed at $5 per TB of data scanned. Charges are rounded up to the nearest megabyte, with a minimum of 10 MB per query. Only read queries are charged; table creation and other DDL queries are free.
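The per-query billing rule above (round up to the nearest megabyte, 10 MB minimum, $5 per TB) can be sketched as a small cost function:

```python
import math

# Redshift Spectrum billing sketch: $5 per TB scanned, rounded up to the
# nearest MB, with a 10 MB minimum charge per query.
PRICE_PER_TB = 5.0
MB_PER_TB = 1024 * 1024
MIN_MB_PER_QUERY = 10


def spectrum_query_cost(scanned_mb: float) -> float:
    """Cost in USD of a single Spectrum read query."""
    billable_mb = max(math.ceil(scanned_mb), MIN_MB_PER_QUERY)
    return billable_mb * PRICE_PER_TB / MB_PER_TB

# A full-TB scan costs exactly $5; a 1 MB scan is billed as 10 MB.
print(spectrum_query_cost(MB_PER_TB))  # → 5.0
print(spectrum_query_cost(1) == spectrum_query_cost(10))  # → True
```

This also illustrates why columnar, compressed formats like Parquet pay off with Spectrum: the less data each query scans, the less you are billed.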
Optimizing Redshift ETL Cost
Now that we have seen the factors that broadly affect Redshift pricing, let’s look at some of the best practices that can keep the total cost of ownership down.
- Data Transfer Charges: Amazon also charges for data transfer, and these charges can put a serious dent in your budget if you are not careful. Data transfer charges apply to inter-region transfers and to every transfer involving data movement to or from locations outside AWS. It is best to keep your deployment and data in one region as far as possible. That said, this is not always practical, and customers need to factor in data transfer costs while finalizing the budget.
- Tools: In most cases, Redshift will be used with AWS Data Pipeline for data transfer. AWS Data Pipeline only works with AWS-specific data sources; for external sources, you may have to use other ETL tools, which may also cost money. As a best practice, it is better to use a fuss-free ETL tool like Hevo Data for all your ETL data transfers rather than separate tools for different sources. This can help save budget and offers a cleaner solution.
- Vacuuming Tables: Redshift requires housekeeping activities like VACUUM to be executed periodically to reclaim space after deletes. Even though it is possible to automate this on a fixed schedule, it is a good practice to run it after large queries that leave delete markers. This saves space and thereby cost.
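A periodic VACUUM like the one described above might be issued from a scheduled job. The sketch below only builds the statement; the commented-out connection details (host, database, credentials, table name) are hypothetical placeholders, and the use of psycopg2 is one common choice, not the only one.

```python
# Minimal sketch of issuing VACUUM after large deletes. DELETE ONLY
# reclaims space from deleted rows without the cost of a full re-sort.


def vacuum_statement(table: str, mode: str = "DELETE ONLY") -> str:
    """Build a VACUUM statement for the given table and mode
    (e.g. FULL, DELETE ONLY, SORT ONLY)."""
    return f"VACUUM {mode} {table};"


# Example usage via psycopg2 from a cron or Lambda job
# (uncomment and fill in real connection details):
# import psycopg2
# conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
#                         dbname="analytics", user="admin", password="...")
# with conn.cursor() as cur:
#     cur.execute(vacuum_statement("events"))

print(vacuum_statement("events"))  # → VACUUM DELETE ONLY events;
```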
- Archival Strategy: Follow a proper archival strategy that moves less-frequently-used data to a cheaper storage mechanism like S3. Use the Redshift Spectrum feature in the rare cases where this data is required.
- Data Backup: Redshift offers backup in the form of snapshots. Storage is free for backups of up to 100 per cent of the Redshift cluster’s data volume, and using automated incremental snapshots, customers can create finely-tuned backup strategies.
- Data Volume: While choosing node types, it helps to have a clear idea of the total data volume from the start. A single dc2.8xlarge node generally offers better performance than an equivalent cluster of smaller dc2.large nodes.
- Encoding Columns: AWS recommends that customers use data compression as much as possible. Encoding columns not only saves space but can also improve query performance.
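As an illustration of the column-encoding practice above, here is a sketch that generates a CREATE TABLE statement with explicit encodings. The table and column names are hypothetical, and the AZ64/ZSTD choices are illustrative; in practice, running ANALYZE COMPRESSION on real data gives Redshift's own per-column recommendations.

```python
# Build a CREATE TABLE DDL with explicit compression encodings.
# Encodings shown (AZ64 for numerics, ZSTD for text) are common
# defaults, not universal recommendations.


def create_table_ddl(table: str, columns: dict) -> str:
    """columns maps name -> (sql_type, encoding)."""
    cols = ",\n  ".join(
        f"{name} {ctype} ENCODE {encoding}"
        for name, (ctype, encoding) in columns.items()
    )
    return f"CREATE TABLE {table} (\n  {cols}\n);"


ddl = create_table_ddl("events", {
    "event_id": ("BIGINT", "AZ64"),
    "payload": ("VARCHAR(1024)", "ZSTD"),
})
print(ddl)
```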
Want to give Hevo a spin? Sign up here for a 14-day free trial and experience the feature-rich Hevo suite first hand. Have a look at our unbeatable pricing, which will help you choose the right plan for you.
Why don’t you share your experience of optimizing Redshift costs in the comments? We would love to hear from you!