AWS Redshift is a pioneer among fully managed data warehouse services. With its ability to scale to petabytes of data, a comprehensive Postgres-compatible querying engine, and a multitude of AWS tools to augment the core capability, Redshift provides everything a customer needs to use it as the sole data warehouse solution. All of this comes at a reasonable price, considering that the customer is relieved of all the housekeeping involved in maintaining an ultra-reliable and always-available data warehouse.
However, the process of understanding Redshift pricing is not straightforward. AWS offers a wide variety of pricing options to choose from depending upon the use cases and customer’s budget constraints.
At Hevo Data (Hevo helps companies load data from any data source into Amazon Redshift in real time without writing any code), we often come across customers who do not find it easy to decode Redshift pricing.
In this post, we will explore the different Redshift pricing options available, along with some best practices that can help optimize your data warehouse costs.
Redshift pricing includes both compute and storage resources. It is broadly affected by four factors.
- The node type that the customer chooses to build their cluster
- The region where the cluster is deployed
- Billing strategy – on-demand billing or a reserved pricing strategy
- Use of Redshift Spectrum
Amazon Redshift Pricing – Effect of Node Types
Redshift follows a cluster-based architecture in which multiple nodes process data in a massively parallel fashion. (You can read more on Redshift architecture here). This means Redshift performance is directly correlated with the specification and number of nodes that form the cluster. Redshift offers multiple kinds of nodes, and customers can choose among them based on their computing and storage requirements.
- Dense compute nodes – These nodes are optimized for computing and offer SSDs up to 2.5 TB and physical memory up to 244 GB. Pricing also depends on the region in which your cluster is located. The price of the lowest-spec dc2.large instance varies from $0.25 to $0.37 per hour depending on the region. The higher-spec version, dc2.8xlarge, can cost anywhere from $4.80 to $7 per hour depending on the region.
- Dense storage nodes – These nodes offer higher storage capacity per node, but the storage hardware is HDD. Dense storage nodes also come in two versions: a basic version called ds2.xlarge, which offers HDDs up to 2 TB, and a higher-spec version called ds2.8xlarge, which offers HDDs up to 16 TB per node. Price can vary from $0.85 to $1.40 per hour for ds2.xlarge and from $6 to $11 per hour for ds2.8xlarge.
As the list above shows, Redshift pricing varies over a wide range depending on the node type. Another critical constraint is that a cluster can only be formed from nodes of the same type, so you need to find the best node type for your specific use case. As a rule of thumb, AWS itself recommends dense compute nodes for use cases with less than 500 GB of data. Previous-generation nodes can lower the price further, but we do not recommend them because they lack the critical elastic resize feature, which means scaling can take hours on such nodes.
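To make the effect of node type and node count concrete, here is a minimal sketch of a monthly cost estimate. The hourly rates are simply the low end of the ranges quoted above, and the 730-hour month and the function name `monthly_cluster_cost` are assumptions made for illustration; check the current AWS price list for your region before relying on any figure.

```python
# Illustrative on-demand hourly rates in USD (low end of the ranges quoted
# in this article). Real rates depend on region and change over time.
HOURLY_RATE_USD = {
    "dc2.large": 0.25,
    "dc2.8xlarge": 4.80,
    "ds2.8xlarge": 6.00,
}

# Assumption: an average month has 8,760 / 12 = 730 hours.
HOURS_PER_MONTH = 730


def monthly_cluster_cost(node_type: str, node_count: int) -> float:
    """Estimate the on-demand monthly cost of a cluster of identical nodes."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    # A cluster uses a single node type, so cost scales linearly with count.
    return HOURLY_RATE_USD[node_type] * node_count * HOURS_PER_MONTH


for node_type, count in [("dc2.large", 4), ("dc2.8xlarge", 2)]:
    cost = monthly_cluster_cost(node_type, count)
    print(f"{count} x {node_type}: about ${cost:,.2f} per month")
```

Because a cluster cannot mix node types, the estimate only needs one rate per cluster, which keeps comparisons between candidate configurations simple.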
Amazon Redshift Pricing – Effect of Regions
Since Amazon's costs for running data centres differ across parts of the world, node pricing also varies widely depending on the region where the cluster is deployed. Let's consider some of the factors that may affect the choice of region.
- It may not be sensible to simply choose the cheapest region, because data transfer time varies with the distance between the cluster and its data sources or targets. It is best to choose a location that is nearest to your data source.
- In specific cases, this decision may be further complicated by data storage compliance mandates that require data to be kept within specific country boundaries.
- AWS deploys its features in different regions in a phased manner. While choosing regions, it would be worthwhile to ensure that the AWS features that you intend to use outside of Redshift are available in your preferred region.
In general, US-based regions offer the cheapest prices, while Asia-based regions are the most expensive.
Amazon Redshift Pricing – On-demand vs Reserved Instance Pricing
Redshift offers pricing discounts on its usual rates if the customer is able to commit to a longer duration of using the clusters. Usually, this duration is in terms of years. Amazon claims a saving of up to 75 per cent if a customer uses reserved instance pricing. When you choose reserved pricing, irrespective of whether a cluster is active or not for the particular time period, you still have to pay the predefined amount. Redshift currently offers three types of reserved pricing strategies:
- No upfront – This is offered only for a one-year duration. The customer gets a 20 per cent discount over existing on-demand prices.
- Partial upfront – The customer pays half of the amount upfront and the rest in monthly instalments. This option can be purchased for a one-year or three-year term. Amazon advertises a discount of up to 41% on on-demand prices over one year and up to 73% over three years.
- Full payment upfront – Amazon claims a 42% discount over a one-year period and a 75% discount over three years if the customer chooses this option.
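The trade-off between the two billing strategies can be sketched with some simple arithmetic. The snippet below is only an illustration: the discount table reuses the "up to" percentages quoted in the list above, the option keys and function names are hypothetical, and it assumes an on-demand cluster is billed only for the fraction of time it actually runs.

```python
HOURS_PER_YEAR = 8760

# "Up to" discounts quoted in this article, keyed by (option, term in years).
# Actual discounts vary by node type and region.
RESERVED_DISCOUNT = {
    ("no_upfront", 1): 0.20,
    ("partial_upfront", 1): 0.41,
    ("partial_upfront", 3): 0.73,
    ("all_upfront", 1): 0.42,
    ("all_upfront", 3): 0.75,
}


def on_demand_cost(hourly_rate, nodes, years, utilization=1.0):
    """On-demand cost when the cluster runs `utilization` of the time (0..1)."""
    return hourly_rate * nodes * HOURS_PER_YEAR * years * utilization


def reserved_cost(hourly_rate, nodes, option, years):
    """Reserved cost: every hour of the term is billed, used or not."""
    discount = RESERVED_DISCOUNT[(option, years)]
    return hourly_rate * nodes * HOURS_PER_YEAR * years * (1.0 - discount)


def break_even_utilization(option, years):
    """Fraction of time the cluster must run for reserved pricing to pay off."""
    return 1.0 - RESERVED_DISCOUNT[(option, years)]


# Example: two nodes at an assumed $0.25 per hour, running half the time.
print(on_demand_cost(0.25, 2, 1, utilization=0.5))
print(reserved_cost(0.25, 2, "no_upfront", 1))
print(break_even_utilization("no_upfront", 1))
```

Under these assumptions, a one-year no-upfront reservation only pays off if the cluster would otherwise run more than about 80 per cent of the time, whereas the three-year options break even at much lower utilization.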
Even though the on-demand strategy offers the most flexibility, customers can save quite a lot of money if they are sure that the cluster will be in use over a longer period of time. Redshift’s concurrency scaling is charged at on-demand rates on a per-second basis for every transient cluster that is used. AWS provides one hour of free concurrency scaling credit for every 24 hours that a cluster remains active. The free credit is calculated on a per-hour basis.
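As a rough sketch of how the concurrency scaling charge could be estimated, the function below applies the two rules stated above: one free hour earned per 24 hours of cluster activity, and per-second billing at the on-demand rate beyond that. It is a simplification that assumes all earned credits can be pooled over the period and that a single hourly rate covers the transient cluster; the function name and parameters are hypothetical.

```python
def concurrency_scaling_cost(active_hours: int,
                             scaling_seconds: int,
                             hourly_rate: float) -> float:
    """Estimate the concurrency scaling charge for a billing period.

    active_hours    -- hours the main cluster was active in the period
    scaling_seconds -- total seconds of transient cluster usage
    hourly_rate     -- assumed on-demand rate (USD/hour) of the transient cluster
    """
    # One hour (3,600 s) of free credit per full 24 hours of activity.
    free_seconds = (active_hours // 24) * 3600
    billable_seconds = max(0, scaling_seconds - free_seconds)
    # Billing is per second, so convert the hourly rate accordingly.
    return billable_seconds * hourly_rate / 3600


# A cluster active for 48 hours that used 3 hours of concurrency scaling
# at an assumed rate of $2 per hour pays for 1 hour beyond the free credit.
print(concurrency_scaling_cost(48, 3 * 3600, 2.0))
```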
Amazon Redshift Spectrum Pricing
Redshift Spectrum is a querying engine service offered by AWS that lets customers apply the computing capability of Redshift clusters to data stored in S3 in various formats. This feature enables customers to add external tables to Redshift clusters and run complex read queries over them without actually loading or copying data into Redshift.
Redshift Spectrum pricing is based on the amount of data scanned by each query and is fixed at $5 per TB of data scanned. The scanned volume is rounded up to the next megabyte, which at $5 per TB works out to roughly $0.000005 per megabyte, and there is a minimum charge of 10 MB per query. Only read queries are charged; table creation and other DDL queries are not.
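The per-query charge can be sketched as a small function. This is an illustrative calculation only: it assumes binary units (1 MB = 1,048,576 bytes and 1 TB = 1,048,576 MB), rounds the scanned volume up to a whole megabyte, and applies the 10 MB minimum; the function name `spectrum_query_cost` is hypothetical.

```python
PRICE_PER_TB_USD = 5.0
BYTES_PER_MB = 1024 ** 2   # assumption: binary megabytes
MB_PER_TB = 1024 ** 2      # assumption: binary terabytes
MIN_MB_PER_QUERY = 10


def spectrum_query_cost(bytes_scanned: int) -> float:
    """Estimate the Redshift Spectrum charge (USD) for one read query."""
    # Round up to a whole megabyte using integer arithmetic.
    scanned_mb = -(-bytes_scanned // BYTES_PER_MB)
    # Every query is billed for at least 10 MB.
    billed_mb = max(scanned_mb, MIN_MB_PER_QUERY)
    return billed_mb * PRICE_PER_TB_USD / MB_PER_TB


# Scanning exactly 1 TB costs $5; a tiny query is billed as 10 MB.
print(spectrum_query_cost(1024 ** 4))
print(spectrum_query_cost(1))
```

Since the charge depends only on bytes scanned, anything that reduces the scanned volume, such as columnar formats or partitioned data, lowers the bill directly.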
Redshift ETL – Optimizing Total Cost
Now that we have seen the factors that broadly affect Redshift pricing, let's look at some best practices that can keep the total cost of ownership down.
- Redshift pricing is not the only component of the cost of running your ETL. Amazon also charges for data transfer, and these charges can put a serious dent in your budget if you are not careful. Data transfer charges apply to inter-region transfers and to every transfer that moves data to or from locations outside AWS. It is best to keep your deployment and data in one region as far as possible. That said, this is not always practical, and customers need to factor in data transfer costs while finalizing the budget.
- The AWS ETL tools used along with Redshift can also impact the total cost. In most cases, Redshift is used with AWS Data Pipeline for data transfer. AWS Data Pipeline only works for AWS-specific data sources, so for external sources you may have to use other ETL tools, which may also cost money. As a best practice, it is better to use a fuss-free ETL tool like Hevo Data for all your ETL data transfer rather than separate tools for different sources. This can help save some budget and offer a clean solution.
Hevo for Redshift ETL:
Hevo Data is an enterprise-grade data pipeline platform that can bring data from any data source to Redshift in real time, without your having to write any code. Hevo’s unique architecture ensures that your data is streamed into Redshift consistently and with zero data loss, relieving you of all the data migration hassles.
Sign up for a 14-day free trial here to experience hassle-free data migration first hand.
- Redshift needs housekeeping activities like VACUUM to be executed periodically to reclaim space after deletes. Even though it is possible to automate this on a fixed schedule, it is good practice to run it after large delete or update operations, which leave behind rows marked for deletion. This can save space and thereby cost.
- Follow a proper archival strategy that moves less-used data to a cheaper storage mechanism like S3. Make use of the Redshift Spectrum feature in the rare cases where this data is required.
- Redshift offers backup in the form of snapshots. Storage is free for backups up to 100 per cent of the Redshift cluster data volume and using the automated incremental snapshots, customers can create finely-tuned backup strategies.
- While fixing node types, it helps to have a clear idea of the total data volume right from the start. A dc2.8xlarge node generally offers better performance than a cluster of eight dc2.large nodes.
- AWS recommends that customers use data compression as much as possible. Encoding the columns not only saves space but can also improve performance.
What are your thoughts on Redshift pricing? Let us know in the comments.