Are you looking for the best data warehouse between Snowflake vs Redshift ? If yes, then you have landed on the right article. Most companies with highly critical ETL workloads are switching away from self-managed on-premise data warehouses to completely managed cloud data warehouse services. A major reason for this shift is to avoid dealing with the overhead of maintaining a high availability service and to enable the workforce to focus more on the analytics part. In this post, we compare two of the most popular data warehouse services available at this point.
Snowflake vs Redshift Understanding
Snowflake is a popular completely managed data warehouse service that offers instant scaling, a unique pricing model that separates compute loads from storage usage and multi-cloud support. It works using a hybrid of shared nothing and shared disk architecture. More information on Snowflake architecture can be found here.
Alternatively, watch this 3-minute video to understand what Snowflake is about:
What makes snowflake unique from the other data warehouse services is its ability to scale instantly with minimum downtime. There is no need for the end user to select hardware or software, install, configure or manage anything to get a full-fledged data warehouse running. It is the best example of a completely managed data warehouse service at this point. Its storage mechanism is independent of the compute architecture and allows the user to exploit third-party services like AWS S3.
From an architecture point of view, Snowflake uses a concept called virtual warehouse which lies on top of the database storage service. A query services layer that sits on top of virtual warehouses manages the infrastructure, metadata, query optimization and security. This architecture allows it to build multiple virtual data warehouses over the same data. This enables different types of jobs to be run on the same data in different virtual data warehouses without affecting each other.
Amazon Redshift Features
Amazon Redshift is a completely managed large scale data warehouse service. In this context, completely managed means, the end-users do not have to worry about monitoring, troubleshooting, updating software, etc. That said there are some housekeeping activities related to data manipulation, that still require developer involvement. Here is a short introduction on Redshift:
Redshift provides a columnar data structure and provides a query layer compatible with the Postgres SQL standard. It also provides a feature called spectrum which allows users to query data stored in S3 in predefined formats like JSON or ORC. Combined with the AWS pipeline which enables users to schedule jobs using multiple AWS components for loading or processing, It offers a complete solution for building an ETL pipeline and data warehouse.
From an architecture perspective, Redshift is organized into a cluster of compute nodes with one node being assigned as a leader node. The compute nodes have their own memory, disk space, and CPU – which means it uses a shared nothing architecture. The communication between client programs, other nodes and query execution plans are all handled by the Leader Node. It’s cluster can have multiple databases and supports even OLTP transaction type workloads. That said, it is more optimized for high-performance analysis workloads. A detailed post on Redshift architecture can be found here.
Redshift offers flexible pricing plans according to the amount of data volume in use and offers a significant discount in cases where the customer can commit to a longer duration of time. Redshift pricing is inclusive of computing and storage requirements.
Snowflake vs Redshift: Parameters to Consider
Now that we have an idea about what Redshift and Snowflake are, let us go into a comparison of the two. Here are the top factors that would influence your choice on these two:
- Scope for Scale
- Data Warehouse Maintainance
- Ease of moving Data to Warehouse
- Data Storage Formats Supported
- Data Security
In addition to these, this article also highlights the use cases where either Snowflake or Redshift may be preferred. Let us dive in.
Ease of scaling is one of the most critical factors based on which a data architect will make the choice of environment.
Snowflake offers instant scaling without needing to redistribute data or interrupting users. It’s auto concurrency allows users to set a minimum and maximum cluster size and the clusters will scale automatically over this range in case of high demands.
Redshift can also scale, but not as instantly as Snowflake and takes anywhere between minutes to hours while adding new nodes to the cluster. So in case of scaling, Snowflake has an obvious advantage over Redshift. It also offers a mechanism called concurrency scaling that can increase the cluster capacity automatically when there is an increase in concurrent read query load.
Snowflake is completely automated and does not require any maintenance activity from end users.
Redshift requires the users to execute some housekeeping activities, especially after a series of updates or deletes. This is called vacuuming and can only be done by an administrator. This is because of Redshift’s architectural design which manages delete by adding a delete marker to the row. These markers are later removed after actual deletes which happens during vacuuming. Redshift’s vacuum process is documented in great detail in this post.
Since both the databases use different architectures and behave differently as per the type of queries run, it is tough to declare a clear winner when it comes to performance.
On raw query run times using unoptimized queries, Snowflake offers higher performance. Redshift query run times for unoptimized queries generally includes a long query optimization time and runs much faster if the same query is run frequently.
If you know your data, Redshift offers different ways of tuning both your queries and data structure to attain significant gains. Redshift offers SORTKEY and DISTKEY clauses which can be used while setting up the data. If used effectively, these clauses can provide a significant reduction in run times for queries involving JOINs and WHERE clauses.
Snowflake also offers a clause called PARTITION BY which helps in optimizing queries with WHERE clause, but optimization over JOIN queries is limited in it.
In some specific cases, where the query usage is minimal and scattered across larger time windows, Snowflake offers better pricing in comparison to Redshift, because of its pay as you use policy. It’s clusters are not charged while there is no query load on them and shutdowns automatically when they are idle. So for a customer with light query loads, Snowflake may offer better value.
Redshift offers very attractive pricing if the users can commit to a certain level of usage. This commit time is typically in the range of years. In this kind of usage, it’s pricing is more attractive when compared to Snowflake.
5. Data Replication
Data replication in both the warehouses comes with its own set of challenges. Redshift uses the COPY command to load data and needs the data to be present in S3. Snowflake uses the COPY INTO command to load data.
Snowflake is flexible enough to allow the use of S3, Azure storage or Snowflake storage for staging the data. Data type conversions are a challenge in both the cases and one needs to have a grasp on automatic data type enforcing rules of the warehouses to manage a perfect copy.
AWS data pipeline can ease the complexity to an extent but works only for a limited number of source configurations. When it comes to sharing data across accounts, Snowflake as an advantage over Redshift. Redshift does not have a provision to accomplish this other than by manual copying.
Snowflake enables users to exploit multiple clouds and third-party storage services. Since the compute part is separately charged, this helps the users to select storage services that give better value and use snowflake only as a compute engine. Redshift offers limited flexibility in this regard and needs you to use AWS components.
Snowflake also allows the customers to create multiple virtual warehouses on the same data enabling them to use different warehouses for different use-cases. Customers have the possibility of creating a virtual warehouse for all the routine jobs and another one for all the analytical or research workloads; both running on the same data. It also allows the sharing of data across accounts without time-consuming copying operations. Redshift spectrum provides a querying engine as a service, helping customers query data on files in S3, but the functionality is not as intuitive as a virtual data warehouse.
Easy Way to Move Data to Snowflake or Redshift
A much simpler way to move data from any data source to Snowflake or Redshift would be to use a Data Integration Platform like Hevo Data (comes with a 14-Day Free Trial).
Hevo can help you move data from 100s of data sources, in real-time, without writing any code.
Hevo brings in reliability, data security, and data accuracy – making it the right Data Pipeline Platform for your ETL needs.
Data Structure in Snowflake & Redshift
Snowflake works well with multiple data formats including JSON, Avro, and ORC and has extensive query support for JSON. It allows semi-structured data as well.
Redshift follows a pure columnar database format and is optimized for structured data storage. Redshift also can work with JSON querying, but there are major speed implications.
Both Redshift and Snowflake offer the whole suite of security and compliance. It offers different editions with varying levels of security. So if you are not particular about compliances, you may opt for a lower edition thereby saving quite a lot of money.
You have now seen the differences and capabilities of the two Data Warehouse services. Like in case of all the choices in life, there is no definite answer or framework to choose one warehouse over the other. It depends on how critical each of these factors is for the specific use case that you are trying to solve. The points below try to enumerate the scenarios where Redshift or Snowflake is to be chosen.
Amazon Redshift Use cases:
- You have a very high query load and the cluster will be running in full load for the most part of the day.
- You plan to use AWS components for different modules in your architecture and need tight integration with them.
- You anticipate a very busy cluster for a couple of years and are ready to commit a minimum amount of usage for at least 3 years to get cost savings.
- Your data is completely structured and the design does not need semi-structured data storage.
- You need the full suite of security and compliance and has no plans to compromise on security for cost savings.
The choice could be made in favour of Snowflake in case of the following conditions.
Snowflake Use Cases:
- Your query load is not high and scattered across the day with a lot of idle time for cluster.
- You would prefer to use a different storage service and use only the Snowflake compute engine.
- You want a completely automated solution with zero effort needed to maintain the service.
- You want varying levels of security during different phases and are ready to compromise on compliance in the initial stages for cost savings but want to get full-blown security when needed.
- You want to separate your routine workloads to a different virtual warehouse so that your analytical loads are not affected by a busy cluster during peak routine loads.
We are keen to know more about your use case and the warehouse you opted for. Let us know in the comments?