As your application gains traction, the volume of data you need to analyse grows rapidly. After a certain point, queries start taking a long time and the data becomes unmanageable on conventional databases. Consequently, you start looking for a data warehousing solution that keeps your data organised and easily accessible for reporting and analytics. Amazon Redshift, a data warehousing solution by AWS, is a strong contender for this.
Researching and deciding on the best-suited data warehouse is often not an easy task. It not only requires a significant financial commitment but, more importantly, the warehouse becomes an integral part of your analytics stack.
So to help you decide, let’s talk about Amazon Redshift and the pros and cons of using it as your primary data warehousing solution.
What is Amazon Redshift?
Amazon Redshift is a fully managed, cloud-based, petabyte-scale data warehouse service by Amazon Web Services (AWS). It is an efficient solution to collect and store all your data and enables you to analyze it using various business intelligence tools to acquire new insights for your business and customers.
Let’s look at some of the advantages of Amazon Redshift.
- Exceptionally fast – Redshift is very fast at loading data and at querying it for analytical and reporting purposes.
Redshift has a Massively Parallel Processing (MPP) architecture, which lets it load data at high speed. Using the same architecture, Redshift distributes and parallelizes your queries across multiple nodes.
Redshift also gives you the option of Dense Compute nodes, which are backed by SSDs. With these, you can run even complex queries in very little time.
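To make the idea of distributing a query across nodes concrete, here is a minimal sketch of the scatter-gather pattern: the rows are split across several simulated nodes, each node aggregates only its own share, and the partial results are combined at the end. The function names and the `(order_id, amount)` rows are hypothetical, and Python threads only illustrate the structure here; they are not a model of how Redshift actually executes queries.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Deal rows out round-robin, one bucket per simulated compute node."""
    return [rows[i::num_nodes] for i in range(num_nodes)]


def node_partial_sum(rows):
    """Work done independently on each node: aggregate only its own rows."""
    return sum(amount for _, amount in rows)


def parallel_total(rows, num_nodes=4):
    """Scatter the rows, aggregate per node in parallel, then combine."""
    buckets = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(node_partial_sum, buckets))
    # The final combine step is cheap: one number per node.
    return sum(partials)


# Hypothetical (order_id, amount) rows, purely for illustration.
orders = [(order_id, order_id * 10) for order_id in range(1, 101)]
total = parallel_total(orders)
print(total)
```

The result does not depend on how many nodes the rows are split across, which is what allows this kind of work to be spread over more nodes.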
- High Performance – As discussed in the previous point, Redshift achieves high performance through massive parallelism, efficient data compression, query optimization and distribution.
MPP enables Redshift to parallelize data loading, backup and restore operations. Furthermore, the queries you execute are distributed across multiple nodes.
Redshift is a columnar storage database, which is optimised for large and repetitive data. Columnar storage drastically reduces disk I/O, since a query reads only the columns it needs, and improves performance as a result.
Redshift lets you define a compression encoding for each column. If you don’t specify one, Redshift assigns a compression encoding automatically. Data compression reduces the storage footprint and significantly improves I/O speed. To read more about it, check out our blog Understanding Amazon Redshift Architecture.
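The following sketch illustrates why a columnar layout with repetitive values compresses well. It pivots a few hypothetical rows into columns and applies a simple run-length encoding, one of several kinds of column encoding. The data and function names are made up for illustration, and this is not Redshift's internal implementation.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical row-oriented records: (order_id, country, status)
rows = [
    (1, "US", "shipped"),
    (2, "US", "shipped"),
    (3, "US", "pending"),
    (4, "DE", "pending"),
    (5, "DE", "pending"),
    (6, "DE", "shipped"),
]

# Columnar layout: one list per column instead of one tuple per row.
order_ids, countries, statuses = (list(col) for col in zip(*rows))

encoded_countries = run_length_encode(countries)
encoded_statuses = run_length_encode(statuses)
print(encoded_countries)
print(encoded_statuses)
```

Because all values of a column sit next to each other, repeated values form long runs that shrink to a handful of pairs. A query that only needs `country` never has to read the other columns at all.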
- Horizontally Scalable – Scalability is crucial for any data warehousing solution, and Redshift does well here. Redshift is horizontally scalable. Whenever you need more storage or faster queries, just add more nodes using the AWS console or the Cluster API and the cluster will be resized.
During this process, your existing cluster will remain available for read operations so your application stays uninterrupted.
During the scaling operation, Redshift moves data in parallel between the compute nodes of the old and new clusters, which lets the transition complete smoothly and as quickly as possible.
- Massive Storage capacity – As expected from a data warehousing solution, Redshift provides massive storage capacity. A basic setup can give you storage in the petabyte range. In addition, Redshift offers Dense Storage compute nodes, which provide large amounts of storage on hard disk drives at a very low price. You can increase storage further by adding more nodes to your cluster, taking it well beyond a petabyte.
- Attractive and transparent pricing – Pricing is a very strong point in favour of Redshift: it is considerably cheaper than the alternatives or an on-premise solution. Redshift has two pricing models, pay-as-you-go and reserved instances, which gives you the flexibility to treat the cost as either an operational or a capital expense.
If your use case requires more data storage, then with a 3-year reserved instance on the Dense Storage plan, the effective price per terabyte per year can be as low as $935. Compared to traditional on-premise storage, which costs roughly $19k-$25k per terabyte, Redshift is significantly cheaper.
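As a quick back-of-the-envelope check of the figures above, the sketch below computes how many times more expensive the quoted on-premise range is than the quoted Redshift price. It takes the numbers exactly as stated and assumes they cover comparable periods, which the figures themselves do not spell out.

```python
# Figures quoted in the text, in USD per terabyte.
REDSHIFT_USD_PER_TB = 935
ON_PREM_USD_PER_TB_LOW = 19_000
ON_PREM_USD_PER_TB_HIGH = 25_000


def cost_ratio(on_prem_usd, redshift_usd=REDSHIFT_USD_PER_TB):
    """How many times more expensive the on-premise figure is."""
    return on_prem_usd / redshift_usd


low_ratio = cost_ratio(ON_PREM_USD_PER_TB_LOW)
high_ratio = cost_ratio(ON_PREM_USD_PER_TB_HIGH)
print(f"On-premise is roughly {low_ratio:.1f}x to {high_ratio:.1f}x the Redshift price")
```

On these numbers the on-premise range works out to roughly 20 to 27 times the Redshift price.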
- SQL interface – The Redshift query engine is based on ParAccel, which has the same interface as PostgreSQL. If you are already familiar with SQL, you don’t need to learn much new technology to start querying Redshift. Since Redshift uses SQL, it works with existing Postgres JDBC/ODBC drivers and readily connects to most business intelligence tools.
- AWS ecosystem – Many businesses already run their infrastructure on AWS (EC2 for servers, S3 for long-term storage, RDS for databases), and this number is constantly increasing. Redshift works very well if the rest of your infrastructure is already on AWS: you get the benefit of data locality, and the cost of data transport is comparatively low.
For a lot of businesses, S3 has become the de facto destination for cloud storage. Since Redshift is virtually co-located with S3, it can load formatted data from S3 with a single COPY command. When loading data from or unloading data to S3, Redshift uses Massively Parallel Processing, which moves data at very high speed.
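As an illustration of what such a COPY command looks like, here is a minimal sketch that assembles one as a string. The table name, bucket path and IAM role ARN are hypothetical placeholders, and the helper `build_copy_statement` is not part of any library. You would run the resulting SQL through whichever Postgres-compatible client you already use.

```python
def build_copy_statement(table, s3_path, iam_role_arn, file_format="CSV"):
    """Return a Redshift COPY statement that loads `table` from an S3 prefix.

    Values are interpolated naively for readability; validate or escape
    them before using anything like this with untrusted input.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS {file_format};"
    )


# Hypothetical names, purely for illustration.
statement = build_copy_statement(
    table="events",
    s3_path="s3://example-bucket/events/2024/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
)
print(statement)
```

Pointing COPY at a prefix that holds several files, rather than at one large file, is what gives Redshift the chance to load them in parallel.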
- Security – Redshift comes packed with various security features. There are options like VPC for network isolation, various ways to handle access control, data encryption etc.
Data encryption is available at multiple points in Redshift. To encrypt data stored in your cluster, you can enable cluster encryption when launching the cluster. To encrypt data in transit, you can enable SSL encryption. When loading data from S3, Redshift lets you use either server-side or client-side encryption. At load time, S3 handles decryption for server-side encryption, and the Redshift COPY command handles it for client-side encryption.
Redshift clusters can be launched inside your own Virtual Private Cloud (VPC). You can then define VPC security groups to restrict inbound or outbound access to your Redshift clusters.
Using the robust access control system of AWS, you can grant privileges to specific users or manage access at the level of a specific database. Additionally, you can define users and groups that have access to specific data in tables.
This section details out some of the limitations of Amazon Redshift.
- Doesn’t enforce uniqueness – There is no way in Redshift to enforce uniqueness on inserted data. You can declare primary key and unique constraints, but they are informational only and are not enforced. Hence, if you have a distributed system that writes data to Redshift, you will have to handle uniqueness yourself, either in the application layer or by using some method of data de-duplication.
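One way to handle this in the application layer is to de-duplicate a batch before it is loaded. The sketch below keeps only the most recent version of each record. The field names `event_id` and `updated_at` are hypothetical, and this is just one possible approach.

```python
def deduplicate(rows, key, version):
    """Keep one row per `key`, preferring the highest `version` value.

    On a tie, the row seen later in the batch wins.
    """
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())


# Hypothetical batch in which two writers emitted the same event_id.
batch = [
    {"event_id": "a1", "updated_at": 100, "status": "created"},
    {"event_id": "b2", "updated_at": 105, "status": "created"},
    {"event_id": "a1", "updated_at": 110, "status": "paid"},
]

unique_rows = deduplicate(batch, key="event_id", version="updated_at")
print(unique_rows)
```

This only removes duplicates within a single batch. Duplicates against rows already in the warehouse still need a separate step, for example loading into a staging table and merging from there.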
- Only S3 and DynamoDB support for parallel upload – If your data is in Amazon S3 or DynamoDB, Redshift can load it using Massively Parallel Processing, which is very fast. But for all other sources, parallel loading is not supported. You will have to use either JDBC inserts or scripts to load data into Redshift. Alternatively, you can use an ETL solution like Hevo, which can load your data into Redshift in parallel from hundreds of sources.
- Requires a good understanding of Sort and Dist keys – Sort keys and distribution keys decide how data is stored and indexed across all Redshift nodes. You therefore need a solid understanding of these concepts, and you need to set them properly on your tables for optimal performance. There can be only one distribution key per table, and changing it later means redistributing the table’s data, so you have to think carefully and anticipate future workloads before deciding on a Dist key.
You can read our blog discussing Distribution keys and Sort keys in detail.
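To see why the choice of distribution key matters, the sketch below hashes a key column to decide which slice each row lands on, then compares a high-cardinality key with a low-cardinality one. The CRC32 hash, the slice count and the sample rows are all assumptions made for illustration, and Redshift's actual hashing is not modelled here.

```python
import zlib


def slice_for(dist_key_value, num_slices):
    """Map a key value to a slice with a deterministic hash (illustrative only)."""
    digest = zlib.crc32(str(dist_key_value).encode("utf-8"))
    return digest % num_slices


def distribute(rows, dist_key, num_slices):
    """Group rows by the slice their distribution key hashes to."""
    slices = {i: [] for i in range(num_slices)}
    for row in rows:
        slices[slice_for(row[dist_key], num_slices)].append(row)
    return slices


# Hypothetical table: 100 customers, 90 of them in one country.
rows = [
    {"customer_id": i, "country": "US" if i < 90 else "DE"}
    for i in range(100)
]

by_customer = distribute(rows, dist_key="customer_id", num_slices=4)
by_country = distribute(rows, dist_key="country", num_slices=4)

print({s: len(r) for s, r in by_customer.items()})
print({s: len(r) for s, r in by_country.items()})
```

Distributing on `country` sends every "US" row to the same slice, so one slice ends up holding at least 90% of the data and doing most of the work. A key with many distinct values gives the hash a chance to spread rows more evenly.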
- Can’t be used as live app database – While Redshift is very fast at running queries on huge amounts of data and at reporting and analytics, it is not fast enough for live web apps. You will have to pull data into a caching layer or a vanilla Postgres instance to serve Redshift data to web apps.
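A caching layer of this kind can be as simple as the sketch below: results are kept in memory for a fixed time so that the web app does not hit the warehouse on every request. The `TTLCache` class and the `run_report_query` stand-in are hypothetical. A production setup would more likely use a dedicated cache or a read-optimised database.

```python
import time


class TTLCache:
    """Cache loader results per key and reload them after `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (time_loaded, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the slow query
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value


def run_report_query(report_name):
    # Hypothetical stand-in for a slow analytical query against Redshift.
    return {"report": report_name, "rows": 42}


report_cache = TTLCache(run_report_query, ttl_seconds=300)
first = report_cache.get("daily_signups")
second = report_cache.get("daily_signups")  # served from memory
print(first == second)
```

The trade-off is freshness: the app may serve data up to `ttl_seconds` old, which is usually acceptable for reporting-style data.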
- Data on Cloud – Though this is a good thing for most people, in some use cases it can be a point of concern. If you are concerned about data privacy or your data is extremely sensitive, you may not be comfortable putting it on the cloud.
Amazon Redshift is an amazing solution for data warehousing. We have given a brief overview of its pros and cons. It has some shortcomings, but it is way ahead of the alternatives. You may need to learn a few things to use it wisely, but once you get the hang of it, it works without a hassle.
We, at Hevo, are building an ETL solution which can help bring your data from various sources to Redshift in real time. You can reach out to us if you need help in setting up your Redshift clusters or connecting your data sources to Redshift instance.