Snowflake Cluster: A Complete Overview

• July 27th, 2020


Are you looking to learn about the various table clustering capabilities provided by Snowflake? We have you covered. In this blog post we will show you how to leverage the Snowflake cluster facility to your advantage.

You will look at the following aspects:

What is Snowflake?

Snowflake is a SQL data warehouse that’s built for the cloud and delivered as a service to address today’s data analytics challenges. Snowflake allows users to access all their data in one place so that they can make actionable decisions.

All data in Snowflake is stored in database tables that are structured as groups of columns and rows. By default, when you create a table and insert records into a Snowflake table, Snowflake utilizes micro-partitions and data clustering in its table structure. But, as the table size grows, data in some columns may no longer cluster optimally.

Let’s take an in-depth look at what micro-partitions are and what data clustering is.

What are Micro-partitions?

All data in Snowflake tables is automatically divided into micro-partitions with each partition containing between 50 MB and 500 MB of uncompressed data.6

Each micro-partition maps to a group of rows and is organized in a columnar fashion. This size and structure allows for both optimization and efficiency in query processing. Micro-partitions allow users to perform extremely efficient DML and fine-grained pruning on very large tables, which can be composed of millions, or even hundreds of millions, of micro-partitions.

In simpler words, if a query specifies a filter predicate on a range of values that accesses 30% of the values in the range, it should ideally only scan the portion of the micro-partitions that contain the data, in this case, 30% of the values. Micro-partitions are derived automatically as data is ingested in Snowflake; they don’t need to be explicitly defined up-front or maintained by users.

What is Data Clustering?

Data clustering is a way of partitioning data to optimize read-time performance by allowing the query engine to ignore any data that is not required in the result. Typically, data stored in tables is sorted/ordered along natural dimensions (e.g. date and/or geographic regions). This “clustering” is a key factor in queries because table data that is not sorted or is only partially sorted may impact query performance, particularly on very large tables.

When you load data into a Snowflake table, metadata is collected and stored for each micro-partition created during the process. This metadata is then used to optimize queries at runtime. Snowflake leverages this clustering information to avoid unnecessary scanning of micro-partitions during querying and this significantly improves the scan efficiency of queries that reference these columns because they’ll skip a large amount of data that does not match the pre-defined filtering predicates.

When you create a clustered table in Snowflake, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The order of the specified columns determines the sort order of the data.

When data is written to a clustered table by a query job or a load job, Snowflake sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in Snowflake storage. When you submit a query that contains a clause that filters data based on the clustering columns, Snowflake uses the sorted blocks to eliminate scans of unnecessary data.

Hevo Data: An Alternative Approach to Load Data in Snowflake

Hevo Data is a No-Code Data Pipeline. It can efficiently load data in Snowflake in a few simple steps. It is a fully automated platform and it completely automates the process of data migration.

Get Started with Hevo for Free

Let’s see some unbeatable features of Hevo Data:

  1. Fully Managed: Hevo Data is a fully managed service and is straightforward to set up.
  2. Pre-Built Integrations: Hevo Data has various connectors incorporated with it, which can connect to multiple sources with ease. 
  3. Schema Management: Hevo Data automatically maps the source schema to perform analysis without worrying about the changing schema.
  4. Real-Time: Hevo Data works on the batch as well as real-time data transfer so that your data is analysis-ready always.  
  5. Advanced Monitoring: Hevo Data offers advanced monitoring that gives you a one-stop view to watch all the activities that occur within pipelines.
  6. Live Support: With 24/7 support, Hevo provides customer-centric solutions to the business use case.
Sign up here for a 14-Day Free Trial!

What is a Clustering Key?

A clustering key is a subset of columns in a table that are used to co-locate the data in the table in the same micro-partition. This is very useful for very large tables where the ordering of the column is not optimal or extensive DML operation on the table has caused the table’s natural clustering to degrade.

Clustering Partitioned Tables

In a partitioned table, data is stored in physical blocks, each of which holds one partition of data. A partitioned table maintains these properties across all operations that modify it: query jobs, Data Manipulation Language (DML) statements, Data Definition Language (DDL) statements, load jobs, and copy jobs. This requires Snowflake to maintain more metadata than an non-partitioned table. As the number of partitions increase, the amount of metadata overhead increases.

Automatic Clustering

As data is added to a clustered table, the newly inserted data can be written to blocks that contain key ranges that overlap with the key ranges in previously written blocks. These overlapping keys weaken the sort property of the table. You can use Snowflake’s Automatic Clustering feature to seamlessly and continually manage all reclustering in the background and without the need for any manual intervention. This background service monitors and evaluates your Snowflake tables to determine whether they can benefit from reclustering.

When to Use Clustering?

Snowflake supports clustering for both partitioned and non-partitioned tables.

Use clustering under the following circumstances:

  • You have fields that are accessed frequently in WHERE clauses. For example:
select * from orders 
where product = 'Kindle'
  • You have tables that contain data in the multi-terabyte (TB) range.
  • You have columns that are actively used in filter clauses and queries that aggregate data. For example, when you have queries that frequently use the date column as a filter condition, choosing the date column is a good idea. For example,
select * from orders 
where date 
between '01-01-2020' and '30-01-2020'
  • You need more granularity than partitioning alone allows. 
  • To get clustering benefits in addition to partitioning benefits, you can use the same column for both partitioning and clustering.

Snowflake Clustered Table Syntax

You can use the following syntax to create clustered tables in Snowflake.

create table <name> ... cluster by ( <expression1> [ , <expression2> ... ] )

Defining a Clustering Key for a Table

The following example creates a table with a clustering key by appending a cluster by clause to create table:

For example:

-- cluster by base columns
create or replace table t1 (c1 date, c2 string, c3 number) cluster by (c1, c2);
-- cluster by expressions
create or replace table t2 (c1 timestamp, c2 string, c3 number) cluster by (to_date(c1), substring(c2, 0, 10));
-- cluster by paths in variant columns
create or replace table t3 (t timestamp, v variant) cluster by (v:"Data":id::number);

At any point in time, you can change or add a clustering key to an existing table using the alter table clause:

For example:

-- cluster by base columns
alter table t1 cluster by (c1, c3);

-- cluster by expressions
alter table t2 cluster by (substring(c2, 5, 15), to_date(c1));

-- cluster by paths in variant columns
alter table t3 cluster by (v:"Data":name::string, v:"Data":id::number);

You can also drop the clustering key for a table using the alter table clause.

For example:

alter table t1 drop clustering key;

Next Steps

Visit our Website to Explore Hevo

Hevo Data is a No-code Data Pipeline product that will help you move data from multiple data sources to your destination. It is consistent and reliable. It has pre-built integrations from 150+ data sources.

So, give it a try! Sign Up here for a 14-day free trial.

Have any further queries? Get in touch with us in the comments section below.

No-Code Data Pipeline for Snowflake