Unravel Databricks Snowflake Connector: A Comprehensive Guide 101

Databricks can reliably analyze enormous amounts of data and supports building scalable Artificial Intelligence (AI) projects and Data Warehousing capabilities. With Collaborative Notebooks, Machine Learning Runtime, and managed MLflow, Databricks provides a complete Data Science workspace. It has the unrivaled benefit of being built on a mature, distributed Big Data processing and AI-enabled platform that can integrate with practically any technology. On the other hand, Snowflake provides the scalability of a Cloud-based Data Warehouse while centralizing data access. It offers a Data Warehouse that is substantially more configurable, quicker, and easier to use than traditional Data Warehouse systems.

Customers can now use an optimized, built-in Databricks Snowflake Connector to seamlessly read data from and write data to Snowflake from Databricks. This connector combines the best of both tools and services, enabling you to benefit from the market leaders in ETL, Data Warehousing, and Machine Learning without having to worry about configuration and setup.

This article gives an overview of the key features of Databricks and Snowflake. It also highlights the need for the Databricks Snowflake Connector. Moreover, you will learn the important steps to set up the connector and use it to read, write, and query Snowflake data. At the end of the article, you will explore some of the benefits offered by the Databricks Snowflake Connector. So, read along to gain insights and understand how you can use the Databricks Snowflake Connector.

Prerequisites

To successfully set up Databricks Snowflake Connection, you need to meet the following requirements:

  • A Snowflake account.
  • A Databricks account.
  • A working understanding of Snowflake.

Introduction to Snowflake

Snowflake is a prominent Cloud Data Warehouse offered as Software-as-a-Service (SaaS). It is entirely managed, so the customer doesn’t have to worry about any of the maintenance tasks that come with setting up a Data Warehouse. Snowflake has a decoupled Storage and Computation layer, with distinct pricing for each. When compared to other Data Warehouse choices on the market, it is noted for its scalability and relative ease of use. Snowflake makes use of a proprietary SQL Database engine with a Cloud-specific architecture. 

Snowflake is a SQL-based Data Warehouse that works with both Structured and Semi-Structured Data. Its main advantage is the separation of Storage, Compute, and Metadata Services. Database Administrators (DBA) can manage logical assets from a central location using Snowflake Primitives. While the Data Warehouse is running, Snowflake can spin up more nodes (CPU and Storage) in seconds. In comparison to traditional Data Warehouse systems, it also saves a lot of money.

Key Features of Snowflake

The Snowflake platform is a SaaS-based analytical, versatile, and intuitive Data Warehouse service. It offers a Data Warehouse that is faster, easier to use, and significantly more customizable than traditional Data Warehouse solutions. Let’s discuss some of the key features of Snowflake.

  • High Processing Speed: Each Virtual Warehouse in Snowflake is an MPP (Massively Parallel Processing) compute cluster, which allows jobs to run in parallel without affecting the performance of other Warehouses.
  • Scalable: Snowflake is a secure, scalable architecture that allows you to create an unlimited number of Virtual Warehouses in a couple of minutes. You can run multiple jobs in parallel without having to worry about performance or memory management.
  • Separate Storage and Compute Layer: Snowflake employs separate storage and compute layers that can be scaled up and down independently of one another.
  • Disaster Recovery: Snowflake automatically replicates data across multiple availability zones by default, making the platform highly fault-tolerant; replication across regions can also be configured for disaster recovery.
  • Pay Per Use Model: Snowflake is available on a pay-per-use basis, which means you only pay for the time you use the service. To save money, you can suspend Warehouses once execution is finished.

To explore more about Snowflake, visit the official website here.

Introduction to Databricks

Databricks is a unified, Cloud-based data platform powered by Apache Spark. It focuses on Big Data Analytics and collaboration. With Collaborative Notebooks, Machine Learning Runtime, and managed MLflow, Databricks provides a complete Data Science workspace for Data Scientists, Data Engineers, and Business Analysts to collaborate. It features Spark SQL and DataFrames, libraries that allow you to work with structured data.

Databricks allows you to gain insights from your existing data while also assisting you in developing Artificial Intelligence solutions. It also includes Machine Learning libraries, such as PyTorch and TensorFlow, for creating and training Machine Learning models. Many enterprise customers use Databricks to run large-scale production operations across a variety of industries and use cases, including Financial Services, Healthcare, Retail, Media & Entertainment, and many more.

Key Features of Databricks

Databricks is an industry-leading solution for Data Scientists and Analysts due to its capacity to handle and transform large amounts of data. Let’s understand the key features of Databricks.

  • Optimized Spark Engine: Databricks provides the most recent versions of Apache Spark and enables you to integrate Open Source libraries effortlessly. With the global scalability and availability of the major Cloud service providers, you can instantly set up clusters in a fully managed Apache Spark environment. Clusters are set up, configured, and fine-tuned to ensure reliability and performance without the need for constant monitoring.
  • Collaborative Notebooks: With the languages and tools of your choice, you can instantly access and analyze the data, discover and share new insights, and collectively build models. You can code in any language you like, including Python, Scala, R, and SQL.
  • Delta Lake: With an open-source transactional storage layer intended for the whole data lifecycle, you can introduce data reliability and scalability to your existing Data Lake.
  • Machine Learning Capabilities: Databricks provides you with one-click access to preconfigured machine learning environments using cutting-edge frameworks like PyTorch, TensorFlow, and scikit-learn. From a central repository, you can track and share experiments, reproduce runs, and manage models collaboratively.

To learn more about Databricks, visit the website here.

Introduction to the Databricks Snowflake Connector

The Databricks Snowflake Connector for Spark is included in the Databricks Unified Analytics Platform to provide native Spark-Snowflake connectivity. Starting with Databricks Runtime 4.2, this native Snowflake Connector allows you to read data from and write data to Snowflake without loading any libraries, whereas earlier versions of Databricks required you to load the Spark Connector libraries into your Databricks clusters.

The Databricks Snowflake Connector distributes processing between Spark and Snowflake automatically, without the user having to specify which system handles which part of the work. Snowflake’s automatic Query Pushdown optimization also pushes certain queries down into the database. Before the connector, customers had to maintain sophisticated data transformation pipelines to move datasets back and forth between Spark and Snowflake, resulting in data duplication and redundancy. With the connector, Snowflake customers can also leverage Databricks APIs to build real-time pipelines using Apache Spark and Machine Learning frameworks like TensorFlow.

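To make Query Pushdown concrete, here is a minimal, illustrative Python sketch. It assumes the options connection map that is built later in this article, and the table and column names (trips, fare_amount, payment_type) are purely hypothetical. The idea is that a filter and aggregation applied to a Snowflake-backed DataFrame can be translated to SQL and executed inside Snowflake, so only the reduced result travels back to Spark.

# Illustrative sketch: "trips", "fare_amount", and "payment_type" are hypothetical names,
# and `options` is the Snowflake connection map defined later in this article.
df = spark.read \
  .format("snowflake") \
  .options(**options) \
  .option("dbtable", "trips") \
  .load()

# With Query Pushdown, the filter and aggregation below can run inside Snowflake;
# only the aggregated rows are returned to Spark.
result = df.filter("fare_amount > 20").groupBy("payment_type").count()
result.show()
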
Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source, such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ Data Sources, including 30+ Free Sources, and is an Official Snowflake Partner for Data Integration. Setting it up is a simple 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo loads the data onto the desired Data Warehouse/destination, enriches it, and transforms it into an analysis-ready form without requiring you to write a single line of code.

Its completely automated pipeline delivers data in real-time from source to destination without any loss. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

GET STARTED WITH HEVO FOR FREE

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your Data Analysis with Hevo today! 

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Steps to Set Up the Databricks Snowflake Connector

Now that you have gained a basic understanding of the Databricks Snowflake Connector, this section walks you through the steps to configure it and to read and write Snowflake data from Databricks. Follow the steps listed below.

Step 1: Set Up Databricks Snowflake Connector

Log into Databricks and make sure that you have created a cluster using Databricks Runtime 4.2 or above. Also ensure that you have created a Virtual Warehouse (compute cluster) in Snowflake.

Note: Make sure that Databricks and Snowflake accounts are set to the same region to achieve the best performance at low costs.

With this setup, you don’t need to install or load any libraries, and there is no connector configuration to manage.

Step 2: Configure the Snowflake Databricks Connection

In this step, you will connect Databricks to Snowflake using the Databricks Secrets API, which stores your Snowflake credentials securely in encrypted form. Enter the following commands in either Python or Scala to retrieve your Snowflake credentials and define the connection options.

For Python

# Use secrets DBUtil to get Snowflake credentials.
user = dbutils.secrets.get("data-warehouse", "<snowflake-user>")
password = dbutils.secrets.get("data-warehouse", "<snowflake-password>")

# snowflake connection options
options = {
  "sfUrl": "<snowflake-url>",
  "sfUser": user,
  "sfPassword": password,
  "sfDatabase": "<snowflake-database>",
  "sfSchema": "<snowflake-schema>",
  "sfWarehouse": "<snowflake-cluster>"
}

For Scala

// Use secrets DBUtil to get Snowflake credentials.
val user = dbutils.secrets.get("data-warehouse", "<snowflake-user>")
val password = dbutils.secrets.get("data-warehouse", "<snowflake-password>")


// Set options for reading and writing the data
val options = Map(
  "sfUrl" -> "<snowflake-url>",
  "sfUser" -> user,
  "sfPassword" -> password,
  "sfDatabase" -> "<snowflake-database>",
  "sfSchema" -> "<snowflake-schema>",
  "sfWarehouse" -> "<snowflake-cluster>"
)

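With the connection options defined, you can optionally verify the connection before moving on. The following Python snippet is a minimal sketch; the select current_timestamp() query is just an example, and any simple statement your Snowflake role is allowed to run would work equally well.

# Optional sanity check: run a trivial query through the connector.
# If the credentials, database, schema, and warehouse in `options` are valid,
# this returns a one-row DataFrame.
test_df = spark.read \
  .format("snowflake") \
  .options(**options) \
  .option("query", "select current_timestamp() as connected_at") \
  .load()

test_df.show()
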
Step 3: Perform ETL on Snowflake Data

Now let’s see how you can write data to and read data from Snowflake using Python and Scala. In the examples below, you create a simple dataset containing 5 values, write it to Snowflake, and then read it back and display it.

For Python

  1. Write to Snowflake
# Generate a simple dataset containing five values and write the dataset to Snowflake.
spark.range(5).write \
  .format("snowflake") \
  .options(**options) \
  .option("dbtable", "<snowflake-table>") \
  .save()
  2. Read and Display Dataset from Snowflake
# Read the data written by the previous cell back.
df = spark.read \
  .format("snowflake") \
  .options(**options) \
  .option("dbtable", "<snowflake-table>") \
  .load()

display(df)

For Scala

  1. Write to Snowflake
// Generate a simple dataset containing five values and write the dataset to Snowflake.
spark.range(5).write
  .format("snowflake")
  .options(options)
  .option("dbtable", "<snowflake-database>")
  .save()
  2. Read and Display Dataset from Snowflake
// Read the data written by the previous cell back.
val df: DataFrame = spark.read
  .format("snowflake")
  .options(options)
  .option("dbtable", "<snowflake-database>")
  .load()

display(df)

When you run the above code, the output displays the five generated values (0 through 4), though not necessarily in ascending order, since Spark reads the data in parallel.

Step 4: Query Data in Snowflake

After loading the data into Snowflake, you can run anything from simple to complex queries. Enable the Snowflake Virtual Warehouse and open a Snowflake worksheet to query the data directly. Alternatively, you can run queries from Databricks through the connector by writing a Python (or equivalent Scala) script, as shown below.

df = spark.read \
  .format("snowflake") \
  .options(**options) \
  .option("query", "select 1 as my_num union all select 2 as my_num") \
  .load()

df.show()

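If you want to persist query results back into Snowflake, the DataFrame writer works the same way in the other direction. Below is a minimal Python sketch; the target table name MY_RESULTS is an assumption, and mode("append") adds rows if the table already exists.

# Write the queried DataFrame to a Snowflake table ("MY_RESULTS" is a hypothetical name).
df.write \
  .format("snowflake") \
  .options(**options) \
  .option("dbtable", "MY_RESULTS") \
  .mode("append") \
  .save()
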
To learn how you can train a Machine Learning model in Snowflake using Databricks Snowflake Connector, refer to this link. In case you want to work in the R language, you can learn more about it over here.

Bravo! You have now seen how to set up the Databricks Snowflake Connector and use it to read, write, and query data in Snowflake.

Benefits of Databricks Snowflake Connector

Databricks and Snowflake together offer the best solution for bringing Big Data and AI together, by reducing the complexities associated with integration and optimizing price performance through automatic Query Pushdown. Customers benefit immensely from the Databricks Snowflake Integration as they get started faster with less setup complexity and are kept up to date on both products automatically.

With this optimized Connector, Spark handles the complex workloads while Snowflake processes the workloads that can be translated to SQL. This delivers performance gains and cost savings without requiring any manual tasks or ongoing maintenance. Furthermore, Snowflake’s automatic Query Pushdown feature pushes certain queries down into the database. This eliminates the uncertainty and complication of deciding what processing should take place where.

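If you want to confirm where the processing happens, you can inspect Spark’s physical plan. The sketch below simply calls explain() on a filtered, Snowflake-backed DataFrame; when pushdown applies, the scan node in the printed plan references the Snowflake relation or a generated Snowflake query rather than a plain Spark filter. The exact wording of the plan varies by connector and runtime version, and the table and column names here are assumptions.

# Hypothetical example: "orders" and "amount" are assumed names, not from this article.
orders = spark.read \
  .format("snowflake") \
  .options(**options) \
  .option("dbtable", "orders") \
  .load()

# explain() prints the physical plan, which shows whether the filter was pushed
# down to the Snowflake scan or is being applied by Spark.
orders.filter("amount > 100").explain()
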
Conclusion

This article introduced you to the key features of Databricks and the leading Data Warehouse, Snowflake. You also gained a basic understanding of the Databricks Snowflake Connector, along with its benefits and the steps to read and write data to Snowflake with it. Integrating Snowflake and Databricks through the Databricks Snowflake Connector combines a Cloud Data Warehouse with predictive Analytics tools, giving you the best of both worlds.

However, extracting complex data from a diverse set of data sources like CRMs, Streaming Services, Databases, etc. to Snowflake can be quite challenging and cumbersome. This is where an easier alternative like Hevo saves the day!

Hevo Data is an Official Snowflake Partner. It is a No-Code Data Pipeline that offers a faster way to move data from 100+ Data Sources including 30+ Free Sources, into your Data Warehouse such as Snowflake to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

VISIT OUR WEBSITE TO EXPLORE HEVO

Want to take Hevo for a spin?

SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience with Databricks Snowflake Connector in the comments section below!
