Are you looking to learn about the table clustering capabilities provided by Snowflake? We have you covered. In this blog post, we will show you how to use Snowflake's clustering features to your advantage.

What is Snowflake?

Snowflake is a SQL data warehouse that’s built for the cloud and delivered as a service to address today’s data analytics challenges. Snowflake allows users to access all their data in one place so that they can make actionable decisions.

All data in Snowflake is stored in database tables that are structured as groups of columns and rows. By default, when you create a table and insert records into a Snowflake table, Snowflake utilizes micro-partitions and data clustering in its table structure. But, as the table size grows, data in some columns may no longer cluster optimally.

Let’s take an in-depth look at what micro-partitions are and what data clustering is.

What are Micro-partitions?

All data in Snowflake tables is automatically divided into micro-partitions, with each partition containing between 50 MB and 500 MB of uncompressed data.

Each micro-partition maps to a group of rows and is organized in a columnar fashion. This size and structure allows for both optimization and efficiency in query processing. Micro-partitions allow users to perform extremely efficient DML and fine-grained pruning on very large tables, which can be composed of millions, or even hundreds of millions, of micro-partitions.

In simpler words, if a query specifies a filter predicate that matches only 30% of the values in a column, Snowflake should ideally scan only the micro-partitions that contain those values, roughly 30% of the table, rather than every micro-partition.
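
One quick way to see this pruning at work is Snowflake's EXPLAIN command, which reports how many micro-partitions a query is expected to touch. A minimal sketch, assuming a hypothetical orders table with a date column:

-- Inspect the query plan without actually running the query.
explain
select * from orders
where date = '2020-01-15';

-- The tabular plan includes partitionsTotal (all micro-partitions in the
-- table) and partitionsAssigned (those left after pruning); a large gap
-- between the two means the filter is pruning effectively.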

Micro-partitions are derived automatically as data is ingested in Snowflake; they don’t need to be explicitly defined up-front or maintained by users.

What is Data Clustering?

Data clustering is a way of partitioning data to optimize read-time performance by allowing the query engine to ignore any data that is not required in the result. Typically, data stored in tables is sorted/ordered along natural dimensions (e.g. date and/or geographic regions).

This “clustering” is a key factor in queries because table data that is not sorted or is only partially sorted may impact query performance, particularly on very large tables.

When you load data into a Snowflake table, metadata is collected and stored for each micro-partition created during the process. This metadata is then used to optimize queries at runtime.

Snowflake leverages this clustering information to avoid unnecessary scanning of micro-partitions during querying and this significantly improves the scan efficiency of queries that reference these columns because they’ll skip a large amount of data that does not match the pre-defined filtering predicates.
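
If you want to inspect this clustering metadata yourself, Snowflake exposes the SYSTEM$CLUSTERING_INFORMATION function. A minimal sketch, again using a hypothetical orders table and its date column:

-- Returns a JSON summary of clustering statistics for the listed columns,
-- including total_partition_count, average_overlaps, and average_depth.
select system$clustering_information('orders', '(date)');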

When you create a clustered table in Snowflake, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The order of the specified columns determines the sort order of the data.

When data is written to a clustered table, whether by a bulk load or by DML, Snowflake sorts it using the values in the clustering columns and co-locates related rows in the same micro-partitions. When you submit a query that filters on the clustering columns, Snowflake uses this ordering to skip micro-partitions that cannot contain matching rows.

What is a Clustering Key?

A clustering key is a subset of columns in a table (or expressions on those columns) that is used to co-locate the table's data in the same micro-partitions. This is especially useful for very large tables where the natural ordering of the data is not optimal, or where extensive DML operations on the table have caused its natural clustering to degrade.
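
A simple way to check whether a table's clustering has degraded is to measure its clustering depth. Lower values are better; a depth of 1 means the micro-partitions for the given columns do not overlap at all. Table and column names below are illustrative:

-- Average depth of overlapping micro-partitions for the listed columns;
-- a value that keeps growing over time suggests clustering has degraded.
select system$clustering_depth('orders', '(date)');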

Clustering Partitioned Tables

In a partitioned table, data is stored in physical blocks, each of which holds one partition of data. A partitioned table maintains these properties across all operations that modify it: queries, Data Manipulation Language (DML) statements, Data Definition Language (DDL) statements, and load and copy operations.

This requires Snowflake to maintain more metadata than a non-partitioned table. As the number of partitions increases, so does the metadata overhead.

Automatic Clustering

As data is added to a clustered table, the newly inserted data can be written to blocks that contain key ranges that overlap with the key ranges in previously written blocks. These overlapping keys weaken the sort property of the table.

You can use Snowflake’s Automatic Clustering feature to seamlessly and continually manage all reclustering in the background, without any manual intervention. This background service monitors and evaluates your Snowflake tables to determine whether they would benefit from reclustering.
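
Automatic Clustering is enabled by default on any table that has a clustering key, but you can pause and resume it per table and review the credits it consumes. A minimal sketch, assuming a table named t1 with a clustering key (the seven-day window is arbitrary):

-- Pause and resume the background reclustering service for a table.
alter table t1 suspend recluster;
alter table t1 resume recluster;

-- Review recent Automatic Clustering activity and credit usage.
select *
from table(information_schema.automatic_clustering_history(
    date_range_start => dateadd('day', -7, current_timestamp()),
    table_name => 't1'));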

When to Use Clustering?

Snowflake supports clustering for both partitioned and non-partitioned tables.

Use clustering under the following circumstances:

  • You have fields that are accessed frequently in WHERE clauses. For example:
select * from orders 
where product = 'Kindle'
  • You have tables that contain data in the multi-terabyte (TB) range.
  • You have columns that are actively used in filter clauses and in queries that aggregate data. For example, if your queries frequently filter on the date column, the date column is a good clustering candidate (see the sketch after this list):
select * from orders 
where date 
between '2020-01-01' and '2020-01-30'
  • You need more granularity than partitioning alone allows. 
  • To get clustering benefits in addition to partitioning benefits, you can use the same column for both partitioning and clustering.
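
Putting these guidelines together, here is a minimal sketch of a hypothetical multi-terabyte orders table clustered on the columns that the example queries above filter on:

-- Cluster on the columns used most often in filter predicates.
create or replace table orders (
  order_id number,
  date date,
  product string,
  amount number(12,2)
) cluster by (date, product);

-- Queries like these can now prune micro-partitions on both columns.
select * from orders
where date between '2020-01-01' and '2020-01-30';

select * from orders
where product = 'Kindle';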

Snowflake Clustered Table Syntax

You can use the following syntax to create clustered tables in Snowflake.

create table <name> ... cluster by ( <expression1> [ , <expression2> ... ] )

Defining a Clustering Key for a Table

The following examples create tables with a clustering key by appending a cluster by clause to create table:

-- cluster by base columns
create or replace table t1 (c1 date, c2 string, c3 number) cluster by (c1, c2);
-- cluster by expressions
create or replace table t2 (c1 timestamp, c2 string, c3 number) cluster by (to_date(c1), substring(c2, 0, 10));
-- cluster by paths in variant columns
create or replace table t3 (t timestamp, v variant) cluster by (v:"Data":id::number);

At any point, you can change or add a clustering key on an existing table using alter table. For example:

-- cluster by base columns
alter table t1 cluster by (c1, c3);

-- cluster by expressions
alter table t2 cluster by (substring(c2, 5, 15), to_date(c1));


-- cluster by paths in variant columns
alter table t3 cluster by (v:"Data":name::string, v:"Data":id::number);

You can also drop the clustering key for a table using the alter table clause.

For example:

alter table t1 drop clustering key;
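
To confirm which clustering key (if any) is currently defined on a table, SHOW TABLES lists it in the cluster_by column of its output:

-- The cluster_by column in the output is empty once the key is dropped.
show tables like 't1';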

Next Steps

Hevo Data: An Alternative Approach to Load Data in Snowflake

Ensure Data Integrity with Hevo’s No-code Data Pipeline

Hevo is the only real-time ELT no-code data pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integrations for 150+ data sources (40+ free sources), we help you not only export data from sources and load it into destinations, but also transform and enrich your data and make it analysis-ready.

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also check out our unbeatable pricing to choose the best plan for your organization.

Have any further queries? Get in touch with us in the comments section below.

Talha
Software Developer, Hevo Data

Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent four years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves explaining complex data engineering concepts to his peers through writing. He has written many blog posts on data integration, data management, and the key challenges data practitioners face.