Automating pipelines is an important task in the data analytics field, and which automation tool best suits the purpose is a point of discussion in every architecture design. An ETL involves several processes, and executing them manually would be a cumbersome task. A pipeline can also have several dependencies on internal and external factors, such as the status of the previous job or the value of an environment variable, that affect the execution of the next set of tasks. This article talks about setting up the Airflow Snowflake Connection.

A fundamental example of a pipeline is online ordering. Consider that you’ve ordered a product or service online; an automated pipeline will run continuously in the backend to:

  • Check the payment status.
  • Notify the seller to pack the product on successful payment.
  • Alert the courier service to ship the order.
  • Notify the customer about their order status.
  • Update the order quantity in the backend, and so on.

Snowflake is a fully managed SaaS (Software as a Service) that offers a single platform for data warehousing, data lakes, data engineering, data science, data application development, and secure sharing and consumption of real-time and shared data. The company was founded in 2012. To meet the demanding needs of growing companies, Snowflake includes out-of-the-box capabilities such as separation of storage and compute, on-the-fly scaling of compute, data sharing, data cloning, and third-party tool support.

Apache Airflow is an open-source application for authoring, scheduling, and monitoring workflows. It’s one of the most reliable systems Data Engineers employ for orchestrating processes and pipelines. It lets you quickly see the dependencies, progress, logs, and code of your data pipelines, trigger tasks, and check their success status.

In this blog post, you will learn about Airflow and how to use the Airflow Snowflake combination for efficient ETL.

What is Airflow?

Apache Airflow is an open-source workflow automation and scheduling platform that programmatically authors, schedules, and monitors workflows. Organizations use Airflow to orchestrate complex computational workflows, create data processing pipelines, and perform ETL processes. Airflow uses DAG (Directed Acyclic Graph) to construct the workflow, and each DAG contains nodes and connectors. Nodes connect to other nodes via connectors to generate a dependency tree.
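
For illustration, here is a minimal sketch of a DAG whose nodes are placeholder tasks connected into a dependency tree. The DAG id and task ids are hypothetical, and the imports follow the Airflow 1.x style used later in this article.

import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Default arguments shared by all tasks in this (hypothetical) DAG.
args = {"owner": "Airflow", "start_date": airflow.utils.dates.days_ago(1)}

with DAG(
    dag_id="example_dependency_tree", default_args=args, schedule_interval=None
) as dag:
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    load = DummyOperator(task_id="load")

    # The >> connectors wire the nodes into a dependency tree.
    extract >> transform >> load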

Key Features of Airflow

  • Dynamic Integration: Airflow uses Python as its backend programming language, which lets you generate pipelines dynamically. Several operators, hooks, and connectors are available to create DAGs and tie tasks together into workflows (see the sketch after this list).
  • Extensible: Airflow is an open-source platform, so users can define their own custom operators, executors, and hooks. You can also extend the libraries so that they fit the level of abstraction that suits your environment.
  • Elegant User Interface: Airflow pipelines are lean and explicit, and the built-in Jinja templating engine makes parameterizing your scripts a straightforward process.
  • Scalable: Airflow is designed to scale. You can define as many dependent workflows as you want, and Airflow uses a message queue to orchestrate an arbitrary number of workers.
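
Because Airflow pipelines are defined in Python, tasks can also be generated programmatically. Below is a minimal sketch, assuming a hypothetical list of table names, that creates one task per table in a loop and chains them together; all names here are illustrative.

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

args = {"owner": "Airflow", "start_date": airflow.utils.dates.days_ago(1)}


def process_table(table_name, **context):
    # Placeholder: process a single (hypothetical) table.
    print("processing table %s" % table_name)


with DAG(
    dag_id="dynamic_pipeline_example", default_args=args, schedule_interval=None
) as dag:
    previous_task = None
    for table in ["orders", "payments", "shipments"]:
        task = PythonOperator(
            task_id="process_%s" % table,
            python_callable=process_table,
            op_kwargs={"table_name": table},
        )
        # Chain the generated tasks so each table is processed after the previous one.
        if previous_task is not None:
            previous_task >> task
        previous_task = task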

Airflow integrates easily with many modern systems for orchestration. Some of these systems are as follows:

  1. Google Cloud Platform
  2. Amazon Web Services
  3. Microsoft Azure
  4. Apache Druid
  5. Snowflake
  6. Hadoop ecosystem
  7. Apache Spark
  8. PostgreSQL, SQL Server
  9. Google Drive
  10. JIRA
  11. Slack
  12. Databricks

You can find the complete list here.

What is Snowflake?

Snowflake is completely cloud-based and a true SaaS (Software-as-a-Service) offering used to load, transform, and report on massive data volumes. Snowflake hosts its services on AWS, Azure, or GCP and provides an intuitive user interface that allows customers to perform analytics. Data is queried with SQL, and the queries run on Snowflake’s virtual warehouses.

With Snowflake, you can seamlessly run your data solution across multiple regions and Clouds for a consistent experience. Snowflake makes this possible by abstracting the complexity of the underlying Cloud infrastructure.
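
As a quick illustration of how Snowflake is queried with SQL from Python (outside of Airflow), here is a minimal sketch using the snowflake-connector-python package; every value below is a placeholder to replace with your own account details.

import snowflake.connector

# All connection values are placeholders for your own Snowflake account.
conn = snowflake.connector.connect(
    user="<YOUR_SNOWFLAKE_USERNAME>",
    password="<YOUR_SNOWFLAKE_PASSWORD>",
    account="<YOUR_SNOWFLAKE_ACCOUNT_NAME>",
    warehouse="<YOUR_SNOWFLAKE_WAREHOUSE_NAME>",
    database="<YOUR_SNOWFLAKE_DB_NAME>",
    schema="<YOUR_SNOWFLAKE_SCHEMA>",
)

try:
    cur = conn.cursor()
    # The query is executed by the virtual warehouse configured above.
    cur.execute("select current_version()")
    print(cur.fetchone()[0])
finally:
    conn.close()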

Key Features of Snowflake

  • Scalable: Snowflake provides a secure, scalable architecture that can spin up an unlimited number of independent virtual warehouses in a few minutes. You can execute several parallel jobs without worrying about performance and memory management.
  • Pay-Per-Use Model: Snowflake is billed on a pay-per-use basis; i.e., you only pay for the time you use its services. You can shut down warehouses once execution completes to save costs.
  • High Processing Speed: Each Virtual Warehouse in Snowflake is backed by an MPP (Massively Parallel Processing) cluster that executes a job in parallel without degrading other clusters’ performance.
  • Separate Storage and Compute Layers: Snowflake uses separate storage and compute layers, each of which can scale up or down without affecting the other.
  • Disaster Recovery: In Snowflake, data is replicated three times (by default) across availability zones and regions, providing a fail-safe, fault-tolerant system.

Simplify ETL to Snowflake Using Hevo’s No-code Data Pipeline

Hevo Data, an Automated No Code Data Pipeline can help you ETL your data swiftly from a multitude of sources to Snowflake in real-time without any assistance from your side. You can set up the Snowflake Destination on the fly, as part of the Pipeline creation process, or independently.

Hevo’s fault-tolerant ETL Pipeline offers you a secure option to unify data from 100+ other sources (including 40+ free sources) and store it in Snowflake or any other Data Warehouse of your choice without writing a single line of code. You can entrust us with your data transfer process and enjoy a hassle-free experience. This way, you can focus more on Data Analysis, instead of data consolidation.

Get Started with Hevo for Free

Challenges While Moving Data To Snowflake

Here are a few challenges that you might face while moving data to Snowflake:

  • Transitioning to new technology: Although Snowflake has a lot to offer in terms of features and functionality, to get the most out of your Snowflake migration, you need to know what’s new and what’s available.
  • Architecture and Pricing: Because of Snowflake’s on-demand pricing, you may need to change your usage behavior from your previous data warehouse solution. It will also take some time to get used to Snowflake’s “unique multi-cluster shared data architecture”.
  • Measuring Success: Choosing the right metrics and KPIs will help you evaluate the migration’s success and focus on the areas where you need to make the most progress.
  • Preparing in advance: Working with an experienced Snowflake managed services provider will be critical.

Hevo Data is a No-code Data Pipeline that helps you transfer data from multiple sources to Snowflake in real-time, in an effortless manner. With Hevo, you can also easily carry out tasks such as creating Snowflake users.

Prerequisites

  • Python 3.6 or later, and a working knowledge of Python.
  • An Amazon AWS account with read/write access to S3 buckets.
  • A Snowflake account with read and write access.
  • Access to Apache Airflow 1.10 or later, with dependencies installed. To install Apache Airflow, you can have a look here.
  • A basic understanding of workflows and programming languages.

Demonstrating the Working of Airflow Snowflake Operator

You can see how Airflow and Snowflake are linked in the Code view. The snowflake_conn_id parameter is defined for this purpose:

Airflow Snowflake: Airflow Code View

By navigating to “Admin” and then “Connections,” you can define the Airflow Snowflake connection.

Airflow Snowflake: Define Connection

Here is an example of an Airflow Snowflake connection:

Airflow Snowflake: List Connections

If you click on it, you can see all of the information needed to connect to the Snowflake instance. You can then return to the DAG view and run it as follows:

Airflow Snowflake: DAG View

You can also look at your Snowflake instance to see how the Airflow Snowflake Integration looks from the Snowflake side:

Airflow Snowflake: Snowflake View

What makes Hevo’s Data Integration Experience Unique?

Integrating data can be a tiresome task without the right set of tools. Hevo’s Data Integration platform empowers you with everything you need to have a smooth Data Collection, Processing, and integration experience. Our platform has the following in store for you!

  • Built-in Connectors: Support for 100+ Data Sources, including Databases, SaaS Platforms, Files & More, to destinations like Snowflake. Native Webhooks & REST API Connector available for Custom Sources.
  • Exceptional Security: A Fault-tolerant Architecture that ensures Zero Data Loss.
  • Built to Scale: Exceptional Horizontal Scalability with Minimal Latency for Modern-data Needs.
  • Data Transformations: Best-in-class & Native Support for Complex Data Transformations at your fingertips. Code & No-code Flexibility designed for everyone.
  • Smooth Schema Mapping: Fully-managed Automated Schema Management for incoming data with the desired destination.
  • Blazing-fast Setup: Straightforward interface for new customers to work on, with minimal setup time.

Setting Up Airflow Snowflake Integration

Here are the steps you’ll follow to set up the Airflow Snowflake Integration:

Step 1: Connection to Snowflake

  1. In this step of the Airflow Snowflake Integration, you have to create a connection to Snowflake in Airflow. On the Admin page of Apache Airflow, click on Connections, and in the dialog box, fill in the details as shown below (assuming Snowflake uses AWS as its cloud provider). A quick way to verify the saved connection is sketched after the screenshot below.
Conn Id: <CONNECTION_ID>
Conn Type: Snowflake
Host: <YOUR_SNOWFLAKE_HOSTNAME>
Schema: <YOUR_SNOWFLAKE_SCHEMA>
Login: <YOUR_SNOWFLAKE_USERNAME>
Password: <YOUR_SNOWFLAKE_PASSWORD>
Extra: {
    "account": "<YOUR_SNOWFLAKE_ACCOUNT_NAME>",
    "warehouse": "<YOUR_SNOWFLAKE_WAREHOUSE_NAME>",
    "database": "<YOUR_SNOWFLAKE_DB_NAME>",
    "region": "<YOUR_SNOWFLAKE_HOSTED_REGION>"
}
Airflow Snowflake: Connection Details
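
Once the connection is saved, a quick way to verify it is to run a trivial query through the Snowflake hook. This is only a sketch; it assumes the connection was saved with the id snowflake_conn, which is the connection id used in the DAG examples below.

from airflow.contrib.hooks.snowflake_hook import SnowflakeHook

# Assumes an Airflow connection named "snowflake_conn" exists, as created above.
hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
result = hook.get_first("select current_version()")
print("Connected to Snowflake version: %s" % result[0])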

Step 2: Creation of DAG

DAG stands for Directed Acyclic Graph, and it represents the collection of tasks that you want to run. Each task can run on a different worker at a different point in time. A DAG contains several operators that perform tasks on the workers, such as the PythonOperator for Python tasks, the BashOperator for Bash tasks, and so on.

To create a DAG for Airflow Snowflake Integration that will perform operations on Snowflake, you’ll need to use the Snowflake operator and Snowflake hooks provided by Airflow:

Snowflake Operators are used when you want to perform a task without expecting output. These operators can execute create, insert, merge, update, delete, copy into, and truncate operations, as an output result is not required in such cases.

Snowflake Hooks are used when you expect a result from a query. Hooks are mostly used with select queries, as they extract the Snowflake result and pass it to Python for further processing.

Let’s create a sample DAG to automate the tasks in Airflow Snowflake Integration:

  1. To create a DAG for the Airflow Snowflake Integration, start by organizing the Python imports using the following code.
import logging
from datetime import datetime, timedelta
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.hooks.snowflake_hook import SnowflakeHook
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
  2. Initializing a DAG object for the Airflow Snowflake Integration is very simple, as it requires only a DAG id and the default parameters with a schedule interval. Airflow provides many optional parameters for additional functionality; you can refer to the complete list of parameters here.
args = {"owner": "Airflow", "start_date": airflow.utils.dates.days_ago(2)}

dag = DAG(
    dag_id="snowflake_connector3", default_args=args, schedule_interval=None
)
  3. Create queries and Python functions that run statements against Snowflake and get a row count for the Airflow Snowflake Integration:
query1 = [
    """select 1;""",
    """show tables in database abcd_db;""",
]


def count1(**context):
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    result = dwh_hook.get_first("select count(*) from abcd_db.public.test3")
    logging.info("Number of rows in `abcd_db.public.test3`  - %s", result[0])
  4. Create the DAG with the SnowflakeOperator and PythonOperator to incorporate the queries and functions created above for the Airflow Snowflake Integration.
with dag:
    query1_exec = SnowflakeOperator(
        task_id="snowfalke_task1",
        sql=query1,
        snowflake_conn_id="snowflake_conn",
    )

    count_query = PythonOperator(task_id="count_query", python_callable=count1)
query1_exec >> count_query
  5. Now that you have created the tasks, connect them using the (>>) operator to create a pipeline for the Airflow Snowflake Integration.
Airflow Snowflake: Connect DAG to form pipeline

After completing the above script, you need to place it in the DAGs folder under your Airflow home directory for the Airflow Snowflake Integration. After a refresh, the DAG will appear on the user interface and will look as shown:

Airflow Snowflake: Interface

Below is the complete example of the DAG for the Airflow Snowflake Integration:

import logging

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.hooks.snowflake_hook import SnowflakeHook
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

args = {"owner": "Airflow", "start_date": airflow.utils.dates.days_ago(2)}

dag = DAG(
    dag_id="snowflake_automation", default_args=args, schedule_interval=None
)

snowflake_query = [
    """create table public.test_employee (id number, name string);""",
    """insert into public.test_employee values(1, “Sam”),(2, “Andy”),(3, “Gill”);""",
]


def get_row_count(**context):
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    result = dwh_hook.get_first("select count(*) from public.test_employee")
    logging.info("Number of rows in `public.test_employee`  - %s", result[0])

with dag:
    create_insert = SnowflakeOperator(
        task_id="snowfalke_create",
        sql=snowflake_query ,
        snowflake_conn_id="snowflake_conn",
    )

    get_count = PythonOperator(task_id="get_count", python_callable=get_row_count)

create_insert >> get_count

In the above DAG, the Snowflake operator creates a table and inserts data into it. The Snowflake hook is then used to query the table created by the operator and return the result to the Python operator, which logs the result to the console, completing the Airflow Snowflake Integration.
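
Note that the import paths above follow the Airflow 1.10 module layout. If you are running Airflow 2.x, the Snowflake operator and hook are distributed in the separate apache-airflow-providers-snowflake package, and the equivalent imports would look roughly like this (a sketch; the rest of the DAG remains the same):

# Airflow 2.x style imports; requires the apache-airflow-providers-snowflake package.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator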

How is Airflow Snowflake Connection Beneficial?

Here are a few reasons why Airflow Snowflake Integration is beneficial:

  • User-friendly monitoring interface: Airflow has a monitoring and management interface that gives you a rapid overview of the status of various tasks and lets you trigger and clear tasks or DAG runs. Snowflake databases are extremely user-friendly, allowing users to organize data in any way they see fit. Snowflake is built to be a highly responsive platform that runs at peak performance without the need for constant monitoring by an expert.
  • Airflow pipelines are configuration-as-code (Python) and hence allow for dynamic pipeline generation. This makes it possible to write code that generates pipeline instances on the fly, rather than processing data in a linear or static manner.
  • Airflow has a variety of operators set up to run code, so you can automate your queries or Python code. Airflow is written in Python and provides an operator for almost every database. Its PythonOperator makes migrating Python code to production a breeze.
  • Maximum Server Capacity: While traditional data warehouses required considerable investments in servers and other infrastructure, Snowflake provides significantly more capacity without the need to upgrade hardware. Everything is cloud-based with Snowflake, so you may deploy it on a small scale that can later be scaled up or down depending on your company’s needs.
  • Many firms deal with sensitive data, which must be securely protected. Snowflake’s backend has IP whitelisting, which restricts data access to only trusted, authorized users. Snowflake provides high-quality data security thanks to two-factor authentication, SSO authentication, and AES 256 encryption, as well as the fact that it encrypts data in transit and at rest.
  • Recovering from a disaster: Some businesses are concerned about not having physical access to the servers where their data is housed in the case of a disaster. Snowflake databases have disaster recovery measures in place, ensuring that your data is replicated and accessible across several data centers in the event of a disaster.

Conclusion

In this blog, you have learned about Airflow and Snowflake and how to use the Airflow Snowflake combination for efficient ETL. Airflow with Snowflake helps you automate data transfer by forming an automated ETL pipeline. This article also discussed the different features, benefits, and use cases of Airflow and Snowflake before diving into the steps involved in establishing a connection from Airflow to Snowflake. However, the manual approach to connecting Airflow to Snowflake can be complex and time-consuming. Moreover, it is error-prone, and a significant amount of technical expertise is required to implement it successfully.

Snowflake has a list of tools that can be integrated with it; simply access its tools page and select the platform you need. Hevo Data is a good data tool to integrate with Snowflake, as it helps you create efficient datasets and transforms your data into insightful, actionable leads.

Visit our Website to Explore Hevo

Hevo Data, with its strong integration with 100+ Sources & BI tools, allows you to not only export data from sources & load it into destinations such as Snowflake, but also transform & enrich your data & make it analysis-ready, so that you can focus on your key business needs and perform insightful analysis using BI tools. In short, Hevo can help you store your data securely in Snowflake.

Give Hevo Data a try and Sign Up for a 14-day free trial today. Hevo offers plans & pricing for different use cases and business needs!

Vishal Agrawal
Freelance Technical Content Writer, Hevo Data

Vishal has a passion for the data realm and applies analytical thinking and a problem-solving approach to untangle the intricacies of data integration and analysis. He delivers in-depth, well-researched content ideal for solving problems pertaining to the modern data stack.

Get Started with Hevo