What is Amazon Redshift Data Warehouse?
Amazon Redshift is a completely managed large scale data warehouse offered as a cloud service by Amazon. Completely managed in this context means that the end-user is spared of all activities related to hosting, maintaining and ensuring the reliability of an always running data warehouse. It offers a Postgres compatible querying layer and is compatible with most SQL based tools and commonly used data intelligence applications. Other than the data warehouse service, AWS also offers another service called Redshift Spectrum – which is for running SQL queries against S3 data. This service is not dealt with here since it is a fundamentally different concept.
When contemplating the usage of a third-party managed service as the backbone data warehouse, the first point of contention for a data architect would be the foundation on which the service is built, especially since the foundation has a critical impact on how the service will behave under various circumstances. This article aims to give you a detailed overview of what is Amazon Redshift, it’s features, capabilities and shortcomings. Let us dive into the details.
Amazon Redshift Architecture
This section highlights the components of AWS Redshift architecture, thereby giving you enough pointers to decide if this is favourable for your use case. The Redshift Architecture Diagram is as below:
- Redshift Cluster: Redshift uses a cluster of nodes as its core infrastructure component. A cluster usually has one leader node and a number of compute nodes. In cases where there is only one compute node, there is no additional leader node.
- Compute Nodes: Each compute node has its own CPU, memory and storage disk. Client applications are oblivious to the existence of compute nodes and never have to deal directly with compute nodes.
- Leader Node: The leader node is responsible for all communications with client applications. The leader node also manages the coordination of compute nodes. Query parsing and execution plan development is also the responsibility of the leader node. On receiving a query the leader node creates the execution plan and assigns the compiled code to compute nodes. A portion of the data is assigned to each compute node. The final aggregation of the results is performed by the leader node.
Redshift allows the users to select from two types nodes – Dense Storage nodes and Dense Compute node. Customers can select them based on the nature of their requirements – whether it is storage heavy or compute-heavy. Redshift’s cluster can be upgraded by increasing the number of nodes or upgrading individual node capacity or both.
Internally the compute nodes are partitioned into slices with each slice having a portion of CPU and memory allocated to it. The node slices will work in parallel to complete the work that is allocated by the leader node.
You can read more on Amazon Redshift architecture here.
Now that we have an idea about how Redshift architecture works, let us see how this architecture translates to performance.
Amazon Redshift Performance
The next part of completely understanding what is Amazon Redshift is to decode Redshift architecture. Redshift’s architecture allows massively parallel processing, which means most of the complex queries gets executed lightning quick. One quirk with Redshift is that a significant amount of query execution time is spent on creating the execution plan and optimizing the query.
In the case of frequently executing queries, subsequent executions are usually faster than the first execution. Query execution can be optimized considerably by using proper distribution keys and sort styles.
Data loading from flat files is also executed parallel using multiple nodes, enabling fast load times. At the time of writing this, Redshift is capable of running the standard cloud data warehouse benchmark of TPC-DS in 25 minutes on 3 TB data set using 4 node cluster.
Amazon Redshift ETL and Data Transfer
A significant part of jobs running in an ETL platform will be the load jobs and transfer jobs. Data load to Redshift is performed using the COPY command of Redshift. For executing a copy command, the data needs to be in EC2. If there is already existing data in Redshift, using this command can be problematic since it results in duplicate rows. In such cases, a temporary table may need to be used. The best method to overcome such complexity is to use a proven Data Integration Platform like Hevo, which can abstract most of these details and allow you to focus on the real business logic.
Hevo for Amazon Redshift ETL:
With Hevo Data, you can bring data from over 100+ data sources into Redshift without writing any code. Once the data source is connected, Hevo does all the heavy lifting to move your data to Redshift in real-time. Hevo is also fully managed, so you need have no concerns about maintenance and monitoring of any ETL scripts/cron jobs. This will let you focus your efforts on delivering meaningful insights from data.
Additionally, Amazon offers two services that can make things easier for running an ETL platform on AWS. AWS Glue and AWS Data Pipeline. AWS glue can generate python or scala code to run transformations considering the metadata that is residing in the Glue Data catalog.
AWS data pipeline, on the other hand, helps schedule various jobs including data transfer using different AWS services as source and target.
Both the above services support Redshift, but there is a caveat. These services are tailor-made for AWS services and do not really do a great job in integrating with non-AWS services. So if part of your data resides in on-premise setup or a non-AWS location, you can not use the ETL tools by AWS.
In those cases, it is better to use a reliable ETL tool like Hevo which has the ability to integrate with multitudes of databases, managed services, and cloud applications. Hevo will help you move your data through simple configurations and supports all the widely used data warehouses and managed services out of the box.
Amazon Redshift uses Postgres as its query standard with its own set of data types. Since the data types are Redshift proprietary ones, there needs to be a strategy to map the source data types to Redshift data types. Redshift can manage this automatically using its own logic but can surprise the user with unexpected results if the mapping logic is not carefully considered during the data transfers.
Modern ETL systems these days also have to handle near real-time data loads. Even though Redshift is a data warehouse and designed for batch loads, combined with a good ETL tool like Hevo, it can also be used for near real-time data loads.
Scaling on Redshift
One of the most critical factors which makes a completely managed data warehouse service valuable is its ability to scale. Redshift can scale quickly and customers can choose the extent of capability according to their peak workload times.
It supports two types of scaling operations:
- Classic Resizing: First is the classic resizing which allows customers to add nodes in a matter of a few hours. Classic resizing is available for all types of nodes.
- Elastic Resizing: Elastic resizing makes even faster-scaling operations possible but is available only in case of nodes except the DC1 type of nodes. That said, there is a short window of time during even the elastic resize operation where the database will be unavailable for querying.
Redshift also allows you to spin up a cluster by quickly restoring data from a snapshot. This is very helpful when customers need to add compute resources to support high concurrency.
Amazon Redshift Pricing
Redshift prices are including compute and storage pricing. Cost is calculated based on the hours of usage. With the ability to quickly restore data warehouses from EC2 snapshots, it is possible to spin up clusters only when required allowing the users to closely manage their budgets.
Redshift offers two types of nodes – Dense compute and Dense storage nodes. These nodes can be selected based on the nature of data and the queries that are going to be executed. Dense Compute nodes starts from .25$ per hour and comes with 16TB of SSD. Dense storage nodes have 2 TB HDD and start at .85 $ per hour. It is to be noted that even though dense storage comes with higher storage, they are HDDs and hence the speed of I/O operations will be compromised.
Details on Redshift pricing will not be complete without mentioning Amazon’s reserved instance pricing which is applicable for almost all of AWS services. By committing to using Redshift for a period of 1 year to 3 years, customers can save up to 75% of the cost they would be incurring in case they were to use the on-demand pricing policy.
Amazon Redshift Maintenance
As mentioned in the beginning, AWS Redshift is a completely managed service and as such does not require any kind of maintenance activity from the end-users except for small periodic activity.
Redshift internally uses delete markers instead of actual deletions during the update and delete queries. This means there is to be a housekeeping activity for archiving these rows and performing actual deletions. For Redshift, this process is called vacuuming and can only be executed by a cluster administrator. More details about this process can be found here.
Amazon Redshift Data Security
AWS Redshift provides complete security to the data stored throughout its lifecycle – irrespective of whether the data is at rest or in transit. The security is tested regularly by third-party auditors. It also enables complete security in all the auxiliary activities involved in Redshift usage including cluster management, cluster connectivity, database management, and credential management.
By default, all network communication is SSL enabled. It is possible to encrypt all the data. You can also start your cluster in a virtual private cloud for enterprise-level security.
AWS Redshift also complies with all the well-known data protection and security compliance programs like SOC, PCI, HIPAA BAA, etc.
Now that we know about the capability of Amazon Redshift in various parameters, let us try to examine the strengths and weaknesses of AWS Redshift.
Amazon Redshift Pros:
- Redshift is a completely managed service with little intervention needed from the end-user.
- It can scale up to storing a Petabyte of data.
- Scaling takes minimal effort and is limited only by the customer’s ability to pay.
- Redshift is faster than most data warehouse services available out there and it has a clear advantage when it comes to executing repeated complex queries.
- Redshift offers on-demand pricing. Together with its ability to spin up clusters from snapshots, this can help customers manage their budget better. It also provides great flexibility with respect to choosing node types for different kinds of workloads.
- It offers a complete suite of security with little effort needed from the end-user.
- Redshift undergoes continuous improvements and the performance keeps improving with every iteration with easily manageable updates without affecting data.
- AWS Data Pipeline and AWS Glue help a great deal in running a completely managed ETL system with little intervention from end-users. Redshift also integrates tightly with all the AWS Services.
Amazon Redshift Cons:
- Even though it is a completely managed service, it still needs some extent of user intervention for vacuuming.
- Redshift pricing is including computing and storage. It is not possible to separate these two. Alternatives like Snowflake enables this.
- Data load and transfer involving non-AWS services are complex in Redshift. Using a service like Hevodata can greatly improve this experience.
- Redshift is not tailor-made for real-time operations and is suited more for batch operations. Again, a platform like Hevo Data can solve this for you.
- Redshift scaling is not completely seamless and includes a small window of downtime where the cluster is not available for querying. This downtime is in the range of minutes for newer generation nodes using elastic scaling but can go to hours for previous generation nodes.
Amazon Redshift Alternatives
Redshift is not the only cloud data warehouse service available in the market. A list of the most popular cloud data warehouse services which directly competes with Redshift can be found below.
- Azure SQL Data Warehouse – Microsoft’s own cloud data warehouse service provides a completely managed service with the ability to analyze petabytes of data. Even though this is considered slower in case of complex queries, it makes complete sense for a customer already using the Microsoft stack.
- Google Big Query – Big Query offers a cheap alternative to Redshift with better pricing. Generally benchmarked as slower than Redshift, BigQuery is considered far more usable and easier to learn because of Google’s emphasis on usability. You can read a comparison – Redshift Vs BigQuery here.
- Oracle Autonomous Data Warehouse – Oracle claims ADW to be faster than Redshift, but at the moment standard benchmark tests are not available. Oracle allows customers to use their on-premise Oracle licenses to decrease the costs. For customers already spending money on Oracle infrastructure, this is a big benefit.
- Snowflake – Snowflake offers a unique pricing model with separate compute and storage pricing. The performance is comparable to Redshift or even higher in specific cases. For customers with light workloads, Snowflake’s pure on-demand pricing only for compute can turn out cheaper than Redshift. You can read a comparison – Snowflake Vs Redshift here.
When to use Amazon Redshift?
Redshift advertises itself as a know it all data warehouse service, but it comes with its own set of quirks. But, there are some specific scenarios where using Redshift may be better than some of its counterparts.
- Your ETL design involves many Amazon services and plans to use many more Amazon services in the future. Redshift with its tight integration to other Amazon services is the clear winner here.
- Your cluster will be always running near-maximum capacity and query workloads are spread across time with very little idle time. This particular use case voids the pricing advantage of most competitors in the market.
- You are completely confident in your product and anticipate a cluster running at full capacity for at least a year. This allows you to use AWS Reserved pricing and can help cut costs to a big extent.
- The data design is completely structured with no requirement or future plans for storing semi-structured on unstructured data in the warehouse.
- Complete security and compliance are needed from the very start itself and there is no scope to skip on security and save costs.
Redshift offers a strong value proposition as a data warehouse service and delivers on all counts. Amazon continuously updates it and performance improvements are clearly visible with each iteration. Tight integration with AWS Services makes it the defacto choice for someone already deep into AWS Stack. Most of the limitations addressed on the data loading front can be overcome using a Data Pipeline platform like Hevo Data (14-day free trial) in combination with Redshift, creating a very reliable, always available data warehouse service.