For decades, traditional On-Premise Data Warehouses have been tightly coupled with Data Storage and Computing, making them difficult to scale. However, today’s businesses must store and analyze massive amounts of Structured and Unstructured data from disparate services & sources, necessitating a service like Snowflake that can handle large data volumes as well as variable compute requirements for applications like visualization. One of the first things you’ll do while setting up your Snowflake environment establishes a Snowflake Virtual Warehouse.
Upon a complete walkthrough of this article, you will gain a holistic understanding of Snowflake Virtual Warehouses along with the advantages offered by its architecture. You will also learn about the steps involved in creating a Snowflake Virtual Warehouse. This article will further provide you with some key points that can help you manage the cost of your Snowflake Virtual Warehouse.
Table of Contents
What is Snowflake?
Snowflake is one of the most popular Cloud Data Warehouses that offers a plethora of features without compromising simplicity. It scales automatically, both up and down, to offer the best Performance-to-Cost ratio. The distinguishing feature of Snowflake is that it separates Compute from Storage. This is significant as almost every other Data Warehouse, including Amazon Redshift, combines the two, implying that you must consider the size for your highest workload and then incur the costs associated with it.
Snowflake requires no hardware or software to be Chosen, Installed, Configured, or Managed, making it ideal for organizations that do not want to dedicate resources to the Setup, Maintenance, and Support of In-house Servers. It allows you to store all of your data in a centralized location and size your Compute independently. For example, if you require real-time data loads for complex transformations but only have a few complex queries in your reporting, you can script a massive Snowflake Warehouse for the data load and then scale it back down after it’s finished – all in real-time. This will save you a significant amount of money without jeopardizing your solution goals.
Key Features of Snowflake
Some of the key features of Snowflake are as follows:
- Scalability: The Compute and Storage resources are separated in Snowflakes’ Multi-Cluster Shared Data Architecture. This strategy gives users the ability to scale up resources when large amounts of data are required to be loaded quickly and scale back down when the process is complete without disrupting any kind of operation.
- No Administration Required: It enables businesses to set up and manage a solution without requiring extensive involvement from Database Administrators or IT teams. It does not necessitate the installation of software or the commissioning of hardware.
- Security: Snowflake houses a wide range of security features, from how users access Snowflake to how the data is stored. To restrict access to your account, you can manage Network Policies by whitelisting IP addresses. Snowflake supports a variety of authentication methods, including Two-Factor Authentication and SSO via Federated Authentication.
- Support for Semi-Structured Data: Snowflake’s architecture enables the storage of Structured and Semi-Structured data in the same location by utilizing the VARIANT schema on the Read data type. VARIANT can store both Structured and Semi-structured data. Once the data is loaded, Snowflake automatically parses it, extracts the attributes out of it, and stores it in a Columnar Format.
What is a Snowflake Virtual Warehouse?
Snowflake Virtual Warehouse is a cluster of Database Servers that are deployed on-demand to handle user queries. It is equivalent to an MPP (Massively Parallel Processing) Server for an On-Premise database and is a dynamic cluster of virtual Database Servers made up of CPU Cores, Memory, and SSD that is kept in a hardware pool and deployed in milliseconds. This process is completely transparent to the end-user.
Snowflake Virtual Warehouse comes in a variety of T-shirt sizes ranging from Extra Small to 4XL, with each representing an increase in the hardware resources available to the user. The users always connect to Snowflake via a URL that initiates a process in the Cloud Services layer. Any SQL query that requires data access is executed on a Virtual Warehouse, and data is retrieved from Cloud Storage.
Following are the advantages of the architecture of a Snowflake Virtual Warehouse:
- Dynamic Sizing: Since the Storage and Compute hardware are completely independent, and the fast SSD storage in the Virtual Warehouse is entirely temporary, a variety of Virtual Warehouse sizes can be allocated and adjusted on the fly as needed. This means that you can start with a single XSMALL server and scale up to a 3XLARGE monster machine in milliseconds.
- Zero Contention: Any number of Virtual Warehouses (each sized to the specific demands of the task) can be deployed, and each is completely independent of the others. When compared to traditional database architecture, where multiple groups share the same machine resources, this results in Zero Resource Contention. Each team has dedicated hardware with a Snowflake Virtual Warehouse.
- Automatic Suspension: When queries on a Snowflake Virtual Warehouse is no longer running, it can automatically suspend and resume within milliseconds if a new SQL query needs to be executed. To the end-user, this entire process is transparent.
- Pay as you go: Snowflake charges per second after the first minute, unlike other Cloud-based Services that charge for a machine for an hour. This means that you only pay for the compute resources you use.
Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up Data Integration for 100+ Data Sources (including 40+ Free sources) and will let you directly load data to a Data Warehouse such as Snowflake or the Destination of your choice. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.
Get Started with Hevo for Free
Let’s look at some of the salient features of Hevo:
Sign up here for a 14-Day Free Trial!
- Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
- Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Connectors: Hevo supports 100+ Integrations to SaaS platforms FTP/SFTP, Files, Databases, BI tools, and Native REST API & Webhooks Connectors. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake, Firebolt, Data Warehouses; Amazon S3 Data Lakes; Databricks; MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
- Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
How to Determine the Size of the Snowflake Virtual Warehouse?
Snowflake Warehouse T-Shirts are available in a variety of sizes, as shown below:
An increase in T-Shirt size (XS-4XL) corresponds to a proportional increase in CPU, Memory, and Temporary Storage. You don’t have control over individual sizes, but you can change your warehouse size by selecting one of the T-Shirt sizes. Because Snowflake storage and computing are loosely coupled, you can start and stop the warehouse at any time.
You can specify the size, multi-cluster attribute (Enterprise & above), and scaling policy at the time of creation. See the list below.
Because there are a variety of workloads in any organization, you can’t choose just one or two to cover the entire workload. Also, in Snowflake, creating multiple warehouses is free; you can create as many as you want and only pay for what you use. For instance, we have the following groups in any high-level IT project:
- Development Team
- ETL Team
- Test Team
- Reporting Team
Each group of people is assigned a different amount of work. Let’s say the ETL team is constantly busy loading data, while the reporting team is only working on query reports. While the development team is busy writing code, they may try a variety of activities.
Similarly, a test team will run the entire flow from beginning to end or module by module to ensure the program’s sanity. As a result, each group of people has a different workload, necessitating the use of different warehouses.
So the best option is to set up a separate warehouse for each team with some initial sizing and monitor whether you need to adjust the size as they require.
What is Snowflake’s Scaling Policy?
You can choose between two types of scaling policies in Snowflake:
What are the 2 modes to Define Warehouse in Snowflake?
Warehouses can be provisioned in two different ways in Snowflake. It can be set to either maximized or Auto-Scale mode.
You select auto-scale mode when you enter different values for Minimum and Maximum clusters, as shown below. Snowflake uses this mode to dynamically manage the warehouse load by starting and stopping clusters as needed. The warehouse scales out/in as the number of concurrent users or query load increases or decreases.
When you set the Minimum and Maximum clusters to the same value (Should be >1), you’re choosing maximized mode. When the warehouse is started in this mode, Snowflake starts all of the clusters, ensuring that the warehouse has the most resources available. This mode is appropriate when you know your workload will have multiple concurrent users and will require a provisioned server to support them.
How to Create a Snowflake Virtual Warehouse?
Creating a Snowflake Virtual Warehouse is a fairly simple process. Follow the steps given below to do so:
- The SQL script given below demonstrates how to create a SMALL Snowflake Virtual Warehouse that will automatically suspend after 10 minutes and resume immediately once queries are executed.
create warehouse PROD_REPORTING with
warehouse_size = SMALL
auto_suspend = 600
auto_resume = true
initially_suspended = true
comment = 'Sample Reporting Warehouse';
- Once you have successfully created a Virtual Warehouse, use the following command to select a specific Virtual Warehouse:
use warehouse SAMPLE_REPORTING;
- Any SQL Statements/Queries executed after this point will be executed on the specified Virtual Warehouse. Using this method, different teams can be assigned dedicated hardware, and each user can be assigned a default warehouse.
How to Manage the Cost of Your Snowflake Virtual Warehouse?
Depending on the size of your data and the number of users tasked with the Data Warehousing and Data Management operations, Snowflake offers a set of computing clusters categorized by their sizes. While running, a Snowflake Virtual Warehouse consumes Snowflake Credits. The number of credits consumed is determined by the size of the warehouse and the duration of the Virtual Warehouse. Managing Credit Consumption to ensure that credits are used efficiently helps in managing the cost of a Snowflake Virtual Warehouse. There are some very simple things you can do to manage the cost of your Snowflake Virtual Warehouse. They are as follows:
1) Set up Resource Monitors
Snowflake Virtual Warehouse’s Compute Costs can be managed by triggering specific actions, such as sending notifications or suspending one or more warehouses. This can be done with the help of Resource Monitors. Hence, it is advisable to create Resource Monitors if you are an account administrator.
2) Query the Snowflake Database
The Account Administrator also has access to the Snowflake database, which is a read-only shared database provided by Snowflake. This database, among other things, can be queried to retrieve information about the Warehouses consuming the most credits. Once you’ve successfully determined the warehouses consuming the maximum credits, you can delve into further details using the History Tab.
3) Make Good Warehouse Auto-Suspension Choices
When you create a new Warehouse, the Auto-Suspend value is set to 600 seconds by default. This means that the Warehouse will automatically suspend after 10 minutes of inactivity. If your workload runs infrequently, you may want to reduce your default value to one minute (60 seconds). It is important to note that you should not set the Auto-Suspend value to NULL unless your query workloads necessitate a continuously running Warehouse, otherwise your consumption charges will be much higher than necessary. Since the Warehouse cache is lost every time a Warehouse is suspended, it may be cheaper and faster not to have the Warehouse Auto-Suspend if jobs are continuously executed.
What are the Misconceptions about Virtual Warehouses in Snowflake?
The following is a list of common honest mistakes that most of us make during the learning process:
- Virtual Warehouses can be Dynamically Resized: Even if you have running queries and other queries queued for them, you can only manually resize virtual warehouses at any time. Snowflake only allows you to auto-scale horizontally, not vertically (up or down) (in or out). In the Enterprise Edition, this last auto-scale mode is supported by a multi-cluster environment, in which each cluster is a virtual warehouse of the same size, processing queries in parallel, and can be started and stopped.
A size-4 (Medium) warehouse has been resized to a size-8 warehouse in the previous image (Large). Warehouses can be resized in powers of two (i.e. 1, 2, 4, 8, 16, 32,… 256, 512) from 1 (X-Small) to 512 (6X-Large). The warehouse can be self-contained or part of a larger cluster. When a multi-cluster warehouse is resized, the size of each individual warehouse is increased or decreased by the same amount.
An auto-scale of a multi-cluster warehouse can be seen in the previous image. Individual warehouses, each of the same size, are duplicated here. A size-4 (Medium) warehouse has been replicated in this case for a cluster of two warehouses of the same size.
- Resizing a Virtual Warehouse will Improve Concurrency: The key distinction between resizing a warehouse and resizing the number of clusters in a multi-cluster warehouse is that each has a distinct impact on performance or concurrency:
- Warehouse resizing boosts efficiency. Query size scales linearly with warehouse size for the most part, especially for larger, more complex queries. This feature allows for manual vertical scaling (up or down).
- Concurrency is improved with multi-cluster warehouses. They’re made to deal with queuing and performance issues caused by a large number of concurrent users and/or queries. Automatic horizontal scalability is enabled by this feature (in or out).
Any individual query will most likely be processed faster by the size-8 (Large) warehouse on the left than by any size-4 (Medium) warehouse on the right. If queries 1 and 2 are submitted at the same time, query 2 will most likely be queued and will have to wait for query 1 to be processed first. However, query 1 and query 2 can be executed in parallel by each individual warehouse in the multi-cluster warehouse.
- You may have Warehouses of Different Sizes in a Multiple-Cluster: When you create a virtual warehouse with multiple clusters in Enterprise Edition, each cluster must have the same number of nodes, which is referred to as a “size.” It’s not possible to start parallel clusters with different numbers of nodes for the same multi-cluster virtual warehouse.
On the left, there are two size-4 warehouses in a multi-cluster warehouse (Medium). The one on the right has warehouses of various sizes, which is impossible to accommodate.
- A Larger Warehouse will Run your Query Faster: Possibly not. One or two nodes may be enough for simple queries. There are a slew of other variables to consider as well. In a high-volume multi-user environment, a multi-cluster virtual warehouse, rather than a larger size, can prevent queries from being queued. Concurrency issues, not complexity, may make your query take longer to process.
Only two nodes may be required for the previous query. The remaining nodes simply waste credits and aren’t used. If the majority of the queries are similar, a size-2 (Small) warehouse could be more cost-effective and efficient.
- An Increase in Size will Improve the Performance of a Currently Slow Running Query without Restart: Manually resizing virtual warehouses is possible at any time, but it has no effect on existing queries. Any currently running queries are unaffected by the additional resources. They become available for use by any queued or newly submitted queries once they have been fully provisioned.
You can manually resize the virtual warehouse from size-2 (Small) to size-8 while query 1 is still running (Large). This change, however, has no effect on the execution of query 1, which is still served solely by the first two nodes. If the query could benefit from more nodes, it must be paused and restarted.
- A Query is Always Queued if the Current Warehouse is Busy: The number of queries that can be processed concurrently in single-cluster virtual warehouses is limited by the size and complexity of each query. The warehouse calculates and reserves the compute resources required to process each query as queries are submitted. Only if there are insufficient remaining resources in the warehouse to process a query will it be queued, awaiting resources to become available as other running queries complete.
A virtual warehouse of size 8 (Large) may determine that query 1 only requires two nodes. If some of the remaining nodes are idle, another incoming query may use them, and so on. If both query 1 and query 2 have enough resources, a single warehouse can run both queries in parallel.
Snowflake recommends multi-cluster warehouses to enable fully automated concurrency scaling. Multi-cluster warehouses provide essentially the same benefits as adding additional warehouses and redirecting queries, but without the need for manual intervention. Queries that would otherwise be queued could be automatically redirected to another cluster in the same warehouse and executed.
- A Large Warehouse can Provision fewer Nodes for very Simple Queries: In a single-cluster virtual warehouse, all nodes are started and provisioned simultaneously. Only in multi-cluster virtual warehouses can different clusters of nodes be started and stopped automatically and individually. This means that you waste all other running nodes and pay more for nothing for simple queries that may only require one or two nodes to process. A size-1 (X-Small) or size-2 (Small) virtual warehouse might suffice if you’re the only user on a Snowflake instance and only run simple queries one at a time.
A warehouse of size 8 (Large) will always start with and provision 8 nodes.
- Increase the Size of a Virtual Warehouse if you have one large File to Ingest: Increased warehouse-size does not always result in improved data loading performance. The number of files to be loaded and the size of each file has a greater impact on data loading performance than the warehouse’s size. A smaller warehouse is usually sufficient unless you’re bulk loading hundreds or thousands of files at the same time. Using a larger warehouse will use more credits and may not improve performance.
If you ingest a large number of files at once, the data ingestion with the size-8 (Large) warehouse may be faster. Otherwise, because of the volume of data, rather than data processing, a single ingestion of a large file may take longer. Per request, each COPY INTO command can take up to four nodes. With a size-8 (Large) warehouse, using the big file might be a waste of compute credits.
- Multi-cluster warehouses can auto-scale between a minimum and the maximum number of compute nodes: The minimum and maximum configuration numbers (accepted values between 1 and 10) refer to the number of single-cluster warehouses, not the number of compute nodes. Each virtual warehouse cluster must have the same number of compute nodes or resources, referred to as “size.” A warehouse can only be scaled up or down manually by resizing it. A multi-cluster warehouse’s number of warehouses, on the other hand, can be automatically scaled up or down.
The previous multi-cluster warehouse was expanded from one to three warehouses, each of the same size 4 (Medium). In auto-scale mode, the minimum is set to 1 and the maximum is set to 3 or 10 (maximum allowed value).
- A Multi-Cluster Warehouse will help Process a Complex Query Faster: Multiple queries can be processed in parallel using virtual warehouses with multiple clusters. Only the size of the query, not the number of clusters, matters when processing a single query by a single warehouse. A large virtual warehouse could be beneficial for a complex query. It makes no difference in this case if there are more clusters.
It makes no difference whether you have one size-4 (Medium) warehouse or a multi-cluster warehouse with two warehouses of the same size when running query 1. Each of these individual warehouses, each with four compute nodes, will still be assigned to the query.
- A Query can be Processed by more than one Warehouse at the same time: Only one virtual warehouse can process a query. A USE WAREHOUSE statement is always issued for the current session in various ways, and this will be the virtual warehouse that is running your current query. If you have a multi-cluster warehouse, your query may be run by a different cluster in this virtual warehouse. When the other clusters are busy executing other queries, queued queries can be redirected to another running cluster. Remember that a cluster is made up of N compute nodes that can be provisioned and used to run queries, with N being the “size” of the cluster. A virtual warehouse is made up of clusters like this. The virtual warehouse is defined as a collection of nodes for every single cluster. Our warehouse, on the other hand, is comprised of a single cluster of N nodes.
Even though it may require more resources for optimal processing, the previous query 1 can only be run by one warehouse. Additional nodes from a different warehouse cannot be allocated to the same query.
- Incoming Queries are Automatically Redirected to any Available Virtual Warehouse: Each query is executed by a single warehouse, which is either set as the current context — with USE WAREHOUSE — or set as the default for the current user. The process is always manual; there is no automatic redirection to another warehouse. Different clusters can be started in parallel for queries in the queue only in the Enterprise Edition, which has a multi-cluster virtual warehouse.
The difference between the two size-4 (Medium) warehouses on the left and the similar warehouses on the right is that the ones on the right are part of a multi-cluster warehouse, which is currently in use but can eventually scale out to accommodate more processing power. If the current context for queries 1 and 2 is the same separate warehouse, the second query will most likely be queued and executed after the first.
What are Auto-resumption and Auto-suspension?
A warehouse can be programmed to resume or suspend operations based on activity:
- Auto-suspension is turned on by default. If the warehouse is inactive for an extended period of time, Snowflake suspends it automatically.
- Auto-resume is activated by default. When any statement that requires a warehouse is submitted, Snowflake automatically resumes the warehouse, and the warehouse becomes the session’s current warehouse.
- Only the entire warehouse is affected by auto-suspend and auto-resume, not individual clusters.
This blog introduced you to Snowflake along with the salient features that it offers. Furthermore, it introduced you to Snowflake Virtual Warehouses and the pointers that you can leverage to manage its costs. As your business begins to grow, data is generated at an exponential rate across all of your company’s SaaS applications, Databases, and other sources. To meet this growing storage and computing needs of data, you would require to invest a portion of your Engineering Bandwidth to Integrate data from all sources, Clean & Transform it, and finally load it to a Cloud Data Warehouse such as Snowflake for further Business Analytics. All of these challenges can be efficiently handled by a Cloud-Based ETL tool such as Hevo Data.
Visit our Website to Explore Hevo
Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations such as Snowflake, with a few clicks. Hevo Data with its strong integration with 100+ sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!
Share your experience of learning about Snowflake Virtual Warehouse in the comments below!