As data continues to drive modern business decisions, the need for engines that interoperate over open-source table formats becomes paramount. Addressing this need, Snowflake introduced the Polaris Catalog for Apache Iceberg at Snowflake Summit on June 3, 2024. The Snowflake Polaris Catalog is a new cataloging solution for data stored in the Apache Iceberg format.
It supports many platforms, including AWS, GCP, Azure, Snowflake, Confluent, and more. With Polaris, customers can continue to query data with their engines of choice, enabling them to build and run analytics on their existing systems seamlessly.
What is the Need for Polaris Catalog for Iceberg?
Before understanding the Polaris Catalog, let’s first understand Apache Iceberg. It is a high-performance, open table format built for analytics. Open table formats put SQL tables on top of data files so they can be queried easily with processing engines like Spark, Hive, etc. Catalogs sit on top of table formats as a data management layer, letting you organize your data into catalogs, namespaces, and tables.
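To make this concrete, here is a minimal sketch of querying an Iceberg table from PySpark. The catalog name, warehouse path, and table names are hypothetical, and it assumes the matching iceberg-spark-runtime package is on Spark’s classpath.

```python
# Minimal sketch: querying an Iceberg table with plain SQL via PySpark.
# The catalog name ("demo"), warehouse path, and table are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Register an Iceberg catalog backed by a warehouse location.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# The table format turns raw data files into an ordinary SQL table.
spark.sql("SELECT region, COUNT(*) FROM demo.sales.orders GROUP BY region").show()
```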
Before the Snowflake Polaris Catalog, enterprises struggled to find cross-engine-compatible options for accessing data. For example, if a data engineer set up a catalog and Iceberg in Snowflake, but the analytics engineers wanted to use the data via Trino, or the data science team wanted to access it from SageMaker notebooks, there was no option for such data sharing. Existing cataloging options for Iceberg therefore had several challenges, including:
- Limited cross-engine compatibility
- Vendor lock-in
- No cross-region data-sharing support
- No cross-engine write capability
All these limitations called for an open-source cataloging option that could cater to these basic needs. To address them, Snowflake built the Polaris Catalog and decided to open-source it for the community.
What is Polaris Catalog?
Polaris Catalog is a vendor-agnostic, interoperable, and centralized open-source catalog for Apache Iceberg. It provides the Iceberg community and enterprise users with flexibility, enhanced choice, and full control over their data. Since it is open source, users can deploy it on their own infrastructure.
As you dive into Snowflake Polaris Catalog, ensure your data is clean and ready with Hevo’s automated ETL processes. Integrate, transform, and load your data into Snowflake.
Why choose Hevo?
- Automate data flows from 150+ sources to Snowflake
- Transform and clean data with our no-code platform
- Ensure real-time synchronization for up-to-date analytics
Trusted by 2000+ companies and rated 4.3/5 on G2, Hevo complements your Snowflake ecosystem. Read how industry leaders like Harmoney achieved 100% data accuracy with Hevo.
Get Started with Hevo for Free
Core Capabilities of Snowflake Polaris Catalog
1. Cross-engine read and write interoperability
Polaris Catalog leverages Iceberg’s open REST API to implement interoperability with engines like Apache Flink, Spark, Trino, and many more. This capability lets data engineers read and write data via multiple engines while still maintaining a single, unified copy of the data.
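As a hedged illustration, this is roughly what pointing Spark at a Polaris endpoint through Iceberg’s standard REST catalog support could look like; the endpoint URI, credential, and catalog name are placeholders, not documented Polaris values.

```python
# Sketch: one Spark session talking to an Iceberg REST catalog.
# URI, credential, and warehouse below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("polaris-rest-demo")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://<polaris-endpoint>/api/catalog")
    .config("spark.sql.catalog.polaris.credential", "<client_id>:<client_secret>")
    .config("spark.sql.catalog.polaris.warehouse", "my_catalog")
    .getOrCreate()
)

# Reads and writes both go through the catalog, so Flink or Trino
# configured against the same endpoint sees the same single copy.
spark.sql("INSERT INTO polaris.analytics.events SELECT * FROM staging_events")
```

Flink and Trino expose equivalent catalog properties, so each engine connects to the same REST endpoint rather than keeping its own copy of the metadata.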
2. Centralized access across multiple engines
Polaris Catalog allows central data teams to govern data access from one place: all requests to access data are routed through the Polaris Catalog. This empowers multiple teams to work on the same datasets simultaneously and seamlessly.
3. No vendor lock-in
Because the Polaris Catalog runs agnostic of any vendor, it lets engineering teams focus on business requirements. And since the Polaris Catalog is open source, organizations can easily switch to the hosting infrastructure of their choice.
4. Seamless integration with Snowflake’s ecosystem
Polaris Catalog extends Snowflake’s Horizon capability to implement robust data governance. Horizon’s governance features, such as row access policies, column masking policies, object tagging, and sharing, work on top of the Polaris Catalog. This means that whether an Iceberg table is created in the Polaris Catalog by Snowflake or by another engine (like Flink or Spark), Snowflake Horizon’s governance features can be applied to it.
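For instance, a column masking policy could be attached to such a table from Snowflake. The sketch below uses the Snowflake Python connector with hypothetical connection details, table, and column names; the policy SQL itself is standard Snowflake syntax.

```python
# Sketch: attaching a Horizon column-masking policy to an Iceberg table.
# Connection parameters, table, and column names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# Analysts see real emails; every other role sees a masked value.
cur.execute("""
    CREATE MASKING POLICY IF NOT EXISTS email_mask AS (val STRING)
    RETURNS STRING ->
    CASE WHEN CURRENT_ROLE() IN ('ANALYST') THEN val ELSE '***' END
""")

# Bind the policy to a column on the table.
cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask")
```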
Why do we need Polaris Catalog?
Iceberg is an open-source table format for storage, initially built by Netflix to address challenges it faced with Hadoop projects. Iceberg handles how data is consistently stored and accessed, but it does not itself register schemas and tables, so it needs an external cataloging tool for that.
External cataloging tools like AWS Glue and the Snowflake catalog for Iceberg already enable access to Iceberg data, providing seamless read capabilities. They could integrate with the Snowflake ecosystem, but other engines, and Iceberg running on self-hosted infrastructure, had no open-source solution for this. Teams could access data via the open-standard REST API exposed by Iceberg, but they couldn’t integrate it with their engines of choice. The main limitation was that, until now, Snowflake’s existing Horizon catalog could only read Iceberg data, not write to it.
The introduction of the Polaris Catalog not only resolves these limitations of vendor lock-in and dependency, but also gives users a way to read and write seamlessly while maintaining consistency, atomicity, and centralized access to their data, thus building a reliable system.
How does Snowflake Polaris Catalog Work?
In a nutshell, the Polaris Catalog provides users with four capabilities.
- Manage Catalogs: Create and organize catalogs, namespaces, and tables, and manage privileges on them according to your policies (a minimal sketch follows this list).
- Set Up Connections: Establish connections from various compute and query engines to the Polaris Catalog.
- Set Up Storage: Configure the cloud storage integration that the Polaris Catalog uses to manage data.
- Manage Users: Create, organize, and manage catalog roles and the privileges needed to work with catalog tables.
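Because Polaris speaks Iceberg’s REST catalog protocol, a client such as pyiceberg can be used to create namespaces and tables. The endpoint and credentials below are placeholders, so treat this as a sketch rather than official Polaris setup steps.

```python
# Sketch: managing namespaces and tables through an Iceberg REST catalog.
# Endpoint and credential are hypothetical placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestamptzType

catalog = load_catalog(
    "polaris",
    type="rest",
    uri="https://<polaris-endpoint>/api/catalog",
    credential="<client_id>:<client_secret>",
)

# Organize data: a namespace, then a table with an explicit schema.
catalog.create_namespace("analytics")
schema = Schema(
    NestedField(1, "event_id", StringType(), required=True),
    NestedField(2, "event_time", TimestamptzType(), required=False),
)
catalog.create_table("analytics.events", schema=schema)
```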
Polaris Catalog accesses Iceberg by implementing API wrappers on top of the open REST API exposed by Iceberg. Iceberg also exposes configuration points for plugging in custom implementations that read and update the technical metadata stored in its files. Metadata files store information about the data files, such as column statistics (for example, minimum and maximum values per column). Engines use this information at read time for partition pruning and file skipping, which gives a performance boost. The metadata also tracks the table’s state: each time a data file changes, a new metadata file is written with pointers to those changes, and each such file serves as a snapshot of the table at a point in time.
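To see these snapshots in practice, pyiceberg can read a table’s metadata directly; again, the catalog settings and table name are placeholders.

```python
# Sketch: inspecting Iceberg's technical metadata with pyiceberg.
# Catalog settings and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    type="rest",
    uri="https://<polaris-endpoint>/api/catalog",
    credential="<client_id>:<client_secret>",
)
table = catalog.load_table("analytics.events")

# Every snapshot points to the table state at one moment in time.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary)

# The current snapshot references manifests whose per-column min/max
# statistics let engines prune data files at read time.
print(table.current_snapshot())
```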
Use Cases of Polaris Catalog
With the ability to integrate with various querying tools and processing engines, Polaris Catalog ensures that data remains consistent and accessible across various platforms. It can be leveraged to support a wide range of use cases. Below, we explore some key use cases where the Polaris catalog can significantly enhance data management and analytics.
Support for Multiple Querying Tools
Data platform teams often get requests from multiple teams to enable access to data from their preferred analytics tools. Some teams might want to use Snowflake, some might want to build their analytics dashboards on Superset, which requires access via Trino, while data scientists might need access via Athena for their SageMaker notebooks. Hence, it is important for cataloging tools to support multiple querying tools.
Polaris Catalog is ideal for scenarios where organizations use a variety of querying tools. With its interoperability and vendor-agnostic nature, Polaris allows seamless data querying via multiple querying engines. This ensures that teams can use their preferred tools without compromising on data consistency or performance.
Support for Various Processing Engines
Organizations often distribute workloads across processing engines based on data velocity, data volume, or the team’s skill set with a given engine. For example, data engineers might implement ETL on Snowflake, while data scientists might build their scripts on SageMaker because of its easy integration with various models.
Polaris Catalog excels in providing robust support for use cases involving diverse data processing engines. Whether using Apache Flink, Spark, or other processing engines, Polaris ensures smooth integration. It leverages Iceberg’s open REST API to facilitate read and write operations across these engines while maintaining a single, consistent copy of the data.
Cross-Platform Data Sharing
Imagine a global retail company operating across multiple regions. The data engineering team in Asia prefers Azure for its seamless integration with local partners, while the data engineering team in Europe relies heavily on GCP due to its compliance with strict data privacy laws.
Polaris Catalog enables efficient cross-platform data sharing by being vendor-agnostic. In multi-cloud environments, data often needs to be accessed and shared across different platforms such as AWS, GCP, Azure, Snowflake, Confluent, and Salesforce. Polaris simplifies this process by providing a centralized and interoperable data catalog, ensuring that data remains accessible and manageable regardless of the underlying platform.
Cross-Region Data Sharing
Consider a European e-commerce company. Due to strict data policies, they are required to keep their data in their continent/region, but what if an engineering team sitting in India sets up Trino in the Asia region and wants to access data residing in Europe?
This is possible with the Polaris Catalog because it is cloud-agnostic and can be connected to the engine of your choice. This gives engineering teams great flexibility to use the tools they need without worrying about which region or location their data resides in.
Cataloging Needs
Catalogs are a very basic need for data engineering teams. Data engineers often get requests from multiple engineering teams to check which data resides where, what the schema name is, what the table name is, etc. This becomes more difficult for enterprises that deal with a vast amount and variety of data.
For example, a company may have a diverse range of products and solutions, such as IoT devices and mobile applications. With increasing volume and variety of data, it may face challenges in cataloging, managing, and utilizing data effectively. It might want to use Iceberg for its analytics needs because of Iceberg’s high performance and well-known features. With the Polaris Catalog, it can now cater to those analytics needs, removing the hesitation to adopt Iceberg.
Polaris Catalog provides a way to define catalogs, namespaces, and tables, helping teams easily and effectively manage their data.
Impact on Industry and What’s Next?
In recent years, the data industry has shown significant interest and enthusiasm in open-source table formats and tools due to their potential to be vendor-agnostic and interoperable. Open-source technologies offer the flexibility to operate together, enhancing compatibility across different platforms. Interoperability not only reduces complexity but also lowers the risk of vendor lock-in and the associated costs.
Industry leaders and the open-source community have expressed strong support and appreciation for Snowflake’s Polaris Catalog. Major industry players have shown enthusiasm and openness for this collaboration with Snowflake indicating the importance of open standards and an interconnected data ecosystem. The enthusiastic reception of the Polaris Catalog highlights the growing need for flexible, scalable, and interoperable data solutions in the modern data landscape.
Once the Polaris Catalog is released to the community, we will have better clarity. Developers will surface limitations, bugs, and feature requests that can improve the cataloging experience for Iceberg. The initial release supports mostly the major processing engines and cloud providers.
Still, the community will be able to build new integrations for the tools and clouds they use. Community scrutiny will also help surface any security concerns and get them fixed. Enterprise users will evaluate the catalog for production workloads and identify its challenges. The Polaris Catalog also opens possibilities for enterprises and users who were considering Iceberg for their needs but were hesitant because of the limitations that existed before the catalog’s introduction, and it creates new scope for improving Iceberg itself.
All in all, as we look ahead, the adoption of Polaris Catalog is expected to open paths for further innovation and collaboration in the industry.
Want to understand the Snowflake Data Catalog? Explore our guide to see how it can help you manage and organize your data more effectively.
Frequently Asked Questions on Snowflake Polaris Catalog
1. What is Polaris Catalog Snowflake?
Polaris Catalog is a vendor-neutral, open catalog layer for Apache Iceberg. It provides the Iceberg community and enterprise users with flexibility, enhanced choice, and full control over their data. Since it is open source, users can deploy it on infrastructure of their choice.
2. How much will the Polaris Catalog cost?
Polaris Catalog is open source, so it is free to use.
3. Does Snowflake have a data catalog?
The native data catalog in Snowflake is called Horizon. Polaris Catalog acts as a technical catalog for Apache Iceberg.
4. What is the catalog in Iceberg?
Iceberg has no single native data catalog. External tools like AWS Glue and Snowflake support data cataloging for Iceberg, but not all engines can use them. Iceberg does ship with out-of-the-box catalog implementations, e.g., REST, Hive Metastore, and JDBC.
5. Is Polaris Catalog available for general use?
The Polaris Catalog is now available on GitHub and is in public preview for Snowflake customers.
Neha is an experienced Data Engineer and AWS certified developer with a passion for solving complex problems. She has extensive experience working with a variety of technologies for analytics platforms, data processing, storage, ETL and REST APIs. In her free time, she loves to share her knowledge and insights through writing on topics related to data and software engineering.