Snowflake is a Software-as-a-Service (SaaS) Data Warehousing solution. It uses SQL for querying the data, and like most other warehouses, it stores data in columns instead of rows to facilitate Parallel Query Execution. It stands out from its peers (like Redshift), by separating compute from Storage. It does this by using what it calls a Virtual Data Warehouse.
Thus, as your computing requirements increase, you can resize your warehouse (for example, you can go from X-Small size to 2X-Large size; see image below). And when the load is not there, you can pause the warehouse entirely. Because of this, Snowflake is able to support concurrent workloads. This works well for planned workloads (you resize your warehouse when you know the load is going to be high). However, in case of sudden spikes in your workload, the concept of automatic concurrency is implemented.
In this article, you will learn everything about Snowflake Scaling Policy and its types.
Table of Contents
To achieve Automated Concurrency, Snowflake introduced multi-cluster Data Warehouses. These are, as the name suggests, composed of multiple clusters. So when your workload increases, Snowflake can automatically employ more clusters of the warehouse. This way, a sudden increase in workload won’t lead to queued or failed queries, and consequently frustrated users. When the workload decreases, the additional compute resources are automatically shut down. Thus, the warehouse scales automatically depending on the workload. The policy that dictates this scaling is going to be the focus of this article.
Further information on Multi-Cluster Warehouses can be found here.
Before moving on, let’s have a look at the prerequisites.
Understanding the content of this article won’t require any prerequisites apart from a familiarity with data warehousing in general, and Snowflake in particular. It is assumed that you have worked with Snowflake in the past, and are familiar with terminologies used in connection to Snowflake (warehouse, cluster, etc.)
When creating a Data Warehouse in Snowflake, the image below represents the typical dialog box that you will see:
If the value for ‘Maximum Clusters’ is more than 1 (you can go up to 10), then the warehouse is multi-cluster.
There are two modes you can choose from for multi-cluster warehouses:
- Maximized: Here, all the clusters are started by default. There is no concept of scaling here. This is the mode selected when the value of ‘Minimum Clusters’ is equal to that of ‘Maximum Clusters’
- Auto Scaling: Here clusters are started and stopped dynamically by Snowflake to manage the workload. The behavior here is dictated by the Snowflake Scaling Policy. This is the mode selected when the values of ‘Minimum Clusters’ and ‘Maximum Clusters’ are different. If you have seen auto-scaling in cloud applications (like Amazon EC2) in the past, then this concept would sound familiar.
Simplify Snowflake Data Transfer with Hevo’s No-code Pipeline
Hevo Data is a No-code Data Pipeline that helps you transfer data from 100+ sources (including 40+ Free Data Sources) to Snowflake in real-time in an effortless manner. After using Hevo you can easily carry out Snowflake Create Users Tasks.Get Started with Hevo for Free
Key Features of Hevo Data:
- Fully Managed: Hevo Data is a fully managed service and is straightforward to set up.
- Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
- Connectors: Hevo supports 100+ integrations to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL databases to name a few.
- Ensure Unique Records: Hevo Data helps you ensure that only unique records are present in the tables if Primary Keys are defined.
- Multiple Sources: Hevo Data has various connectors incorporated with it, which can connect to multiple sources with ease.
- Automatic Mapping: Hevo Data automatically maps the source schema to perform analysis without worrying about the changes in the schema.
- Real-time Data Transfer: Hevo Data works on both batch as well as real-time data transfer.
- Resume from Point of Failure: Hevo Data can resume the ingestion from the point of failure if it occurs.
- Advanced Monitoring: Advanced monitoring gives you a one-stop view to watch all the activity that occurs within pipelines.
- 24/7 Support: With 24/7 Support, Hevo provides customer-centric solutions to the business use case.
Steps to load Snowflake data using Hevo Data:
- Sign up on the Hevo Data, and select Snowflake as the destination.
- Provide the user credentials and connect to the server.
- Select the database, and schema to load the data.
Snowflake Scaling Policy
As you can see, there are two possible options for the Snowflake ‘Scaling Policy’. Let’s understand them in detail.
This is the default Snowflake Scaling Policy. This policy is more expensive (in terms of credits consumed) in comparison to ‘Economy’. However, as you would have guessed, it provides better performance. The objective of this policy is to prevent or minimize queuing of queries.
A new cluster starts if one of the following is true:
- A query is queued
- The system detects that there’s one more query than what the presently-running clusters can handle
- 20 seconds have passed since the prior cluster has started
Thus, the first cluster will start immediately if either A or B is true. The next, and each successive cluster will require one of A or B, and C to be true. Thus, if you have a maximum of 5 clusters, then the total time taken to start all 5 is at least 80 seconds (one immediately, and 20 seconds gap between each successive cluster)
Every minute, the system checks the load on the Least Loaded Cluster. If in 2 to 3 consecutive checks, the system determines that the load on the Least Loaded Cluster can be handled by the others, i.e., it can be redistributed without spinning up the cluster again, then the cluster is shut down.
Thus, if you have a spiky load for 20 minutes, and have a minimum of 1 and a maximum of 5 clusters, then within the first 80 seconds, all 5 clusters will be up and running. After 1200 seconds, i.e., at the end of the spike, the number of clusters will start reducing by 1 every 2-3 minutes.
As the name suggests, this policy’s objective is to conserve credits, and hence the cost. This can, therefore, result in queries being queued up and taking longer to complete. It does spin off additional clusters, but only when the system determines that there is enough workload to keep a new cluster busy for at least 6 minutes. Therefore, enough queries will have to be queued up initially for the system to consider starting a new cluster.
The shutdown of a cluster happens similarly to that in the case of the ‘Standard’ Snowflake Scaling Policy (i.e., by checking the load on the least loaded cluster and determining if the other clusters can handle it). However, over here, instead of 2 to 3 consecutive checks, the system performs 5 to 6 consecutive checks before shutting down a cluster. Thus, the cluster shutdown time is also longer, and you should account for this when deciding between ‘Standard’ and ‘Economy’ policies.
The ‘Economy’ Snowflake Scaling Policy saves cost in the case of short spikes (where the load is not high enough for sustained use of a second cluster). The downside, though, is that the user may experience short to moderate queues, and delayed query execution.
You had a decent understanding of the limitations of warehouse resizing and realized the need for multi-cluster warehousing. Next, we saw the different modes of multi-cluster warehousing and understood that only one mode (Auto Scale) requires a Snowflake Scaling Policy. Finally, you understood the two Snowflake Scaling Policy options provided by Snowflake in detail, with a focus on their objectives, conditions for starting a new cluster, and conditions for shutting down a cluster. In case you want to export data from a source of your choice into your desired Database/destination like Snowflake, then Hevo Data is the right choice for you!Visit our Website to Explore Hevo
Hevo Data provides its users with a simpler platform for integrating data from 100+ sources for Analysis. It is a No-code Data Pipeline that can help you combine data from multiple sources. You can use it to transfer data from multiple data sources into your Data Warehouses such as Snowflake, Database, or a destination of your choice. It provides you with a consistent and reliable solution to managing data in real-time, ensuring that you always have Analysis-ready data in your desired destination.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!
Share your experience of learning about the Snowflake Scaling Policy! Let us know in the comments section below!