This article helps you uncover the unique features of Snowflake Data Warehouse. You will also get a quick walk-through on the implementation of some of the important features such as schemaless loading of JSON/XML, Time Travel, Cloning, Data Clustering, etc., provided by Snowflake.
Snowflake Features and Capabilities
Snowflake takes care of all the database management and ensures the best performance when tables are queried. All you need to do is create tables, load the data, and query it. You do not need to create partitions or indexes as in a traditional RDBMS, or run vacuum operations as in Redshift.
The details of how Snowflake accomplishes this can be best understood by knowing the Snowflake Table structures and its principal concepts:
Overview of Snowflake Table Structures
Two principal concepts used by Snowflake to physically maintain table structures for best performance are micro-partitions and data clustering.
Micro-Partitions
Snowflake stores the data in a table in a columnar fashion, divided into a number of micro-partitions: contiguous units of storage holding roughly 50 MB to 500 MB of uncompressed data. Unlike traditional databases, these partitions aren't static and aren't defined or maintained by users; they are managed automatically and dynamically by the Snowflake data warehouse.
A table can consist of thousands or millions of micro-partitions in Snowflake depending on the table size.
Let's consider a table of student data. When loaded, this data is broken into multiple micro-partitions. Assuming 3 micro-partitions are created for this table, the data is split into 3 parts and stored in columnar fashion. Data is also sorted and stored at each micro-partition level.
Data Clustering
Data clustering is similar to the concept of a sort key available in most MPP databases. As data is loaded into Snowflake, it co-locates column data with the same values in the same micro-partition. This helps Snowflake filter efficiently during a data scan: an entire micro-partition can be skipped if the queried value doesn't fall within that micro-partition's range.
Snowflake automatically clusters data in tables, and this natural clustering is good enough for most cases, giving good performance even for big tables.
However, if manual control is needed, a clustering key can be defined on the table by the user; Snowflake then uses this key to cluster the table. This is useful mainly for very large tables.
Other Noteworthy Features of Snowflake Data Warehouse:
Time Travel
Time Travel allows you to track the change of data over time. This Snowflake feature is available to all accounts and enabled by default, free of cost. It lets you access the historical data of a table, i.e. find out how the table looked at any point within its retention period.
For standard accounts, the maximum Time Travel retention period is 1 day, given by default. In the Enterprise edition this can be extended up to 90 days.
Zero-Copy Cloning
Another important feature of the Snowflake data warehouse is Cloning. The clone feature creates an instant copy of any Snowflake object, such as databases, schemas, and tables, almost in real time and with little waiting. This is possible because of Snowflake's architecture: data is stored immutably in cloud storage such as S3, and changes are versioned and stored as metadata.
Hence a clone of an object is really a metadata operation and doesn't duplicate the underlying storage. So if you want a copy of the entire production database for testing, you can create it in seconds using the Clone feature.
Undrop
A dropped object can be restored using the Undrop command in Snowflake, as long as the object has not yet been purged by the system. Undrop restores the object to the state it was in just before it was dropped. Databases, schemas, and tables can all be undropped.
drop database testdb;
undrop database testdb;
drop table some_table;
undrop table some_table;
Undrop can restore multiple dropped versions as well. Say a table has been dropped and re-created three times: each dropped version can be undropped in reverse order of dropping, i.e. the last dropped version is restored first. However, Undrop fails if a table with the same name already exists in the system, so the existing table must be renamed before each Undrop operation.
Fail-Safe
Fail-Safe ensures historical data is protected in the event of disasters such as disk or other hardware failures. Snowflake provides 7 days of Fail-Safe protection of data, which can be recovered only by Snowflake in the event of a disaster.
The 7-day Fail-Safe period starts after the Time Travel retention period ends; hence a table with 90 days of Time Travel has 97 days of recoverability overall. However, Fail-Safe data cannot be recovered by users or through DDL; it is handled by Snowflake only during catastrophic disasters.
Snowflake provides various connectivity options, including native connectors (e.g. for Python), JDBC/ODBC drivers, a command-line tool called SnowSQL, and a web interface for managing Snowflake and querying data.
A Snowflake account can be hosted on either the Amazon AWS or Microsoft Azure cloud platform, so you can choose between AWS and Azure depending on your organization's existing cloud service provider.
Snowflake Data Warehouse Pricing
Snowflake data warehouse charges for the Storage and Compute separately.
- Data stored in Snowflake is charged based on average monthly usage per TB, or can be paid for upfront per TB to save on storage costs.
- Compute costs are separate and are charged per second of usage, depending on the size of the virtual warehouse chosen, from X-Small to 4X-Large.
- Snowflake bills compute as credits. The per-hour credit cost depends on warehouse size, from 1 credit per hour for X-Small to 128 credits per hour for 4X-Large, but note that Snowflake charges per second. For example, a Medium warehouse (4 credits per hour) running for 90 seconds consumes 4 × 90/3600 = 0.1 credits.
- Costs can vary depending on the cloud platform (AWS or Azure) and the region.
Getting Started with Snowflake
- Snowflake provides $400 free credits to try and explore the warehouse.
- Once registered, you need to set up a virtual warehouse to use for loading the data.
- Loading data into the warehouse is one of the trickiest things in Snowflake.
There are two ways to do this:
You can create databases, schemas, and tables as required and load data using the COPY INTO command, or query a file directly on cloud storage like S3 using a SELECT statement.
Alternatively, you can use an official Snowflake ETL Partner like Hevo (7-day free trial) to easily load data from a wide array of sources into Snowflake.
Common Use-Cases of Snowflake Data Warehouse:
Snowflake can be used to host a cloud data warehouse of any scale for analysis of data from different sources. It supports loading from various formats, from traditional compressed CSVs to semi-structured data such as JSON/XML, from cloud storage like S3 or Azure. If you need to load data from a variety of different sources, you could use a data integration platform like Hevo.
Common use cases for using Snowflake:
- Analytics on data available in Data Lake on Cloud Storage like S3.
- Rapid delivery of analytics to Business without worrying about managing the warehouse.
- Data Sharing: an organization can use Snowflake to quickly share data with outside parties, without transferring data or building pipelines.
- Snowflake can be used to set up an Insights-as-a-Service platform.
If you are considering setting up a data warehouse, Snowflake is a strong contender. You can check out our articles on Snowflake Architecture and Snowflake best practices for high-performance ETL. Many features work out of the box in Snowflake compared to Redshift and BigQuery, the other popular choices for data warehousing.