Apache Airflow Databricks Integration: 2 Easy Steps

on Apache Airflow, Data Integration, Databricks, Python, Tutorials • November 11th, 2021


With ever-growing data volumes, more and more organizations are adopting Cloud solutions, as they provide on-demand scaling of both compute and storage resources without any extra investment in infrastructure on your part.

Well-established Cloud Data Warehouses offer scalability and manageability, while Cloud Data Lakes offer flexible storage for all types of data formats, including Unstructured Data. Data Lakehouses like Databricks are Cloud platforms that combine the functionalities of both these solutions, and the Airflow Databricks Integration becomes a must for efficient Workflow Management.

Databricks is a scalable Cloud Data Lakehouse solution with better metadata handling, a high-performance query engine, and optimized access to numerous built-in Data Science and Machine Learning tools. To efficiently manage, schedule, and run jobs with multiple tasks, you can use the Airflow Databricks Integration.

Using this robust integration, you can describe your workflow in a Python file and let Airflow handle the management, scheduling, and execution of your Data Pipelines. In this article, you will learn how to successfully set up the Apache Airflow Databricks Integration for your business.


What is Databricks?


Databricks is a flexible Cloud Data Lakehouse tool that allows you to prepare and process data, train models in a self-service manner, and manage the complete Machine Learning lifecycle from experimentation to production.

Databricks is built on top of Apache Spark, a fast and general engine for large-scale data processing, and it provides reliable and best-in-class performance. Adding to its ease of use, it also supports programming languages such as Python, R, Java, and SQL.

Databricks is also integrated with leading Cloud Service Providers like Amazon Web Services, Microsoft Azure, and Google Cloud Platform. This allows you to start using Databricks from the Cloud Storage Platform of your choice, giving you more control over your data, since it remains in your own Cloud account and data sources.

Acting as a one-stop solution for your Data Science, Machine Learning, and business teams, it provides features such as MLflow for managing the complete ML lifecycle, BI Reporting on Delta Lake for real-time business analytics, and Databricks Workspaces: a collaborative workspace in which different teams can interact and work at the same time.

What is Apache Airflow?


Apache Airflow is an elegant Workflow Engine that can effortlessly manage, schedule, and monitor your workflows programmatically. It is a popular Open-Source tool preferred by Data Engineers for orchestrating workflows or pipelines.

Airflow represents Data Pipelines as Directed Acyclic Graphs (DAGs) of tasks, where you can seamlessly track your pipelines’ dependencies, progress, logs, code, triggered tasks, and success status.

Airflow’s modular architecture is built to scale with your workloads. Since workflows are defined in Python, you can instantiate Data Pipelines dynamically, create workflows, build ML models, transfer data, and more. Its easy-to-use UI helps you keep tabs on the status and logs of completed and ongoing tasks.
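
To give a sense of what “defined in Python” looks like, here is a minimal sketch of a two-task DAG; the DAG name, schedule, and bash commands are purely illustrative, assuming Airflow 2.x:

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

# A minimal two-task pipeline: "extract" must succeed before "load" runs
with DAG(
        dag_id='example_pipeline',
        schedule_interval='@daily',
        start_date=days_ago(1),
) as dag:
    extract = BashOperator(task_id='extract', bash_command='echo "extracting"')
    load = BashOperator(task_id='load', bash_command='echo "loading"')

    extract >> load  # encode the dependency between the two tasks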

With Robust Integrations like Airflow Databricks Integration, you can easily execute your tasks on various Cloud Platforms.     

Simplify Databricks ETL and Analysis with Hevo’s No-code Data Pipeline

Hevo Data is a No-code Data Pipeline that offers a fully-managed solution to set up data integration from 100+ data sources (including 40+ free data sources) and will let you directly load data to Databricks or a Data Warehouse/Destination of your choice.

It automates your data flow in minutes without requiring you to write a single line of code. Its Fault-Tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.

Its completely automated Data Pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Get Started with Hevo for Free

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Connectors: Hevo supports 100+ Integrations to SaaS platforms, Files, Databases, BI tools, and Native REST API & Webhooks Connectors. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; Databricks; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.  
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (Including 40+ Free Sources) that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Prerequisites

  • An active Databricks account.
  • A working Python 3 installation (the steps below use Python 3.8 with pipenv).
  • Working knowledge of Python.

Steps to Set up Apache Airflow Databricks Integration

In the Airflow Databricks Integration, each ETL Pipeline is represented as a DAG, where dependencies are encoded into the DAG by its edges, i.e., a downstream task is only scheduled if the upstream task completes successfully.

Each task in Airflow is an instance of an Operator class and is executed as a small Python script. In the example given below, spark_jar_task will only be triggered after notebook_task has completed successfully.

notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    …)

spark_jar_task = DatabricksSubmitRunOperator(
    task_id='spark_jar_task',
    …)

# spark_jar_task runs only after notebook_task completes successfully
notebook_task.set_downstream(spark_jar_task)
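
Equivalently, Airflow also supports the bitshift syntax for declaring the same dependency, which reads left to right:

notebook_task >> spark_jar_task  # spark_jar_task runs after notebook_task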

For setting up the Apache Airflow Databricks Integration, you can follow the 2 easy steps:

A) Configure the Airflow Databricks Connection

To begin setting up the Apache Airflow Databricks Integration, follow the simple steps given below:

  • Step 1: Open a terminal and run the following commands to start installing the Airflow Databricks Integration. Here, version 2.1.0 of apache-airflow is installed (commands to start the Airflow webserver and scheduler are shown after this list).
mkdir airflow
cd airflow
pipenv --python 3.8
pipenv shell
export AIRFLOW_HOME=$(pwd)
pipenv install apache-airflow==2.1.0
pipenv install apache-airflow-providers-databricks
mkdir dags
airflow db init
airflow users create --username admin --firstname <firstname> --lastname <lastname> --role Admin --email your@email.com
  • Step 2: Open your Databricks workspace in the browser. Navigate to User Settings and click on the Access Tokens tab.
  • Step 3: Click on the Generate New Token button and save the token for later use.
  • Step 4: Go to your Airflow UI, click on the Admin option at the top, and then click on the “Connections” option from the dropdown menu. A list of all your current connections will be displayed.
  • Step 5: By default, the databricks_conn_id parameter is set to “databricks_default,” so for this example the same connection name is used. Type the URL of your Databricks Workspace into the Host field. In the Extra field, enter the token you saved earlier as a JSON object, e.g. {"token": "<your-access-token>"} (a command-line alternative is sketched after this list).
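
With the connection saved, make sure Airflow’s webserver and scheduler are running before moving on, since the UI and the DAG runs in the next section depend on them. A minimal sketch, assuming the same pipenv shell and Airflow 2.1.0 setup from Step 1 (the workspace URL and token are placeholders you should replace with your own):

# start the Airflow UI (available at http://localhost:8080 by default)
airflow webserver --port 8080

# in a second terminal, start the scheduler so that DAG runs are actually executed
airflow scheduler

# optional alternative to Step 5: create the Databricks connection from the CLI
airflow connections add databricks_default \
    --conn-type databricks \
    --conn-host https://<your-workspace>.cloud.databricks.com \
    --conn-extra '{"token": "<your-access-token>"}'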

B) Creating a DAG

Airflow defines an operator named DatabricksSubmitRunOperator for a smooth Airflow Databricks Integration. This operator calls the Create and trigger a one-time run (POST /jobs/runs/submit) API endpoint to submit the job specification and trigger a run.

You can also use the DatabricksRunNowOperator, but it requires an existing Databricks job and uses the Trigger a new job run (POST /jobs/run-now) API endpoint. For simplicity, this example uses the DatabricksSubmitRunOperator.
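
If you do have a pre-created job, a hypothetical DatabricksRunNowOperator task (placed inside a DAG definition like the one below) could look like this sketch; the job_id and notebook_params values are placeholders:

from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

run_existing_job = DatabricksRunNowOperator(
    task_id='run_existing_job',
    databricks_conn_id='databricks_default',   # connection configured in the previous section
    job_id=12345,                              # ID of an existing Databricks job (placeholder)
    notebook_params={'run_date': '{{ ds }}'},  # optional parameters forwarded to the run
)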

For creating a DAG, you need:

  • A cluster configuration (cluster version and size).
  • A Python script specifying the job.

In this example, AWS keys stored as Airflow Variables are passed into the environment variables of the Databricks cluster so that it can access files from Amazon S3. After running the following code, your Airflow DAG will successfully call over to your Databricks account and run a job based on a script you have stored in S3.

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

with DAG(
        dag_id='my_first_databricks_operator',
        default_args=default_args,
        schedule_interval=None,
        start_date=days_ago(1),
) as dag:
    # Spec of the new (one-off) cluster that Databricks will create for this run
    my_cluster = {
        'spark_version': '8.2.x-scala2.12',
        'node_type_id': 'r3.xlarge',
        'aws_attributes': {'availability': 'ON_DEMAND'},
        'num_workers': 1,
        'spark_env_vars': {'AWS_SECRET_ACCESS_KEY': Variable.get("AWS_SECRET_ACCESS_KEY"),
                           'AWS_ACCESS_KEY_ID': Variable.get("AWS_ACCESS_KEY_ID"),
                           'AWS_REGION': 'us-east-1'}
    }

    # Job specification: run the PySpark script stored in S3 on the cluster above
    task_params = {
        'new_cluster': my_cluster,
        'spark_python_task': {
            'python_file': 's3://some-wonderful-bucket/my_pyspark_script.py',
        },
    }

    # Submit the job specification above as a one-time run on Databricks
    first_databricks_job = DatabricksSubmitRunOperator(
        task_id='my_first_job_job',
        json=task_params,
    )

    # With a single task, no explicit dependencies need to be set
    first_databricks_job
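
Once this file is saved into the dags folder created in Step 1 (for example as dags/databricks_dag.py; the filename is illustrative), the DAG can be triggered from the Airflow UI or, as a quick sketch, from the command line:

# confirm Airflow has picked up the new DAG
airflow dags list

# trigger a run of the DAG defined above
airflow dags trigger my_first_databricks_operator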

Conclusion

In this article, you have learned how to effectively set up your Airflow Databricks Integration. This fluid integration combines the optimized Spark engine offered by Databricks with the scheduling features of Airflow.

Airflow provides you with a powerful Workflow Engine to orchestrate your Data Pipelines. Standard Python features empower you to write code for dynamic Pipeline generation. Setting up the Airflow Databricks Integration allows you to use the Databricks Runs Submit API to trigger your Python scripts and start the computation on the Databricks platform.

Managing and monitoring jobs on Databricks becomes efficient and smooth using Airflow. However, as your business grows, massive amounts of data are generated at an exponential rate.

Effectively handling all this data across the applications used by various departments in your business can be a time-consuming and resource-intensive task. You would need to devote a portion of your Engineering Bandwidth to Integrate, Clean, Transform, and Load your data into a Data Warehouse or a destination of your choice for further Business Analysis.

This can be effortlessly automated with a Cloud-Based ETL Tool like Hevo Data.

Visit our Website to Explore Hevo

Hevo Data is a No-code Data Pipeline that assists you in seamlessly transferring data from a vast collection of sources into a Data Lake like Databricks, Data Warehouse, or a Destination of your choice to be visualized in a BI Tool. It is a secure, reliable, and fully automated service that doesn’t require you to write any code!

If you are using Databricks as a Data Lakehouse and Analytics platform in your business and searching for a stress-free alternative to Manual Data Integration, then Hevo can effectively automate this for you.

Hevo, with its strong integration with 100+ Data Sources & BI tools (including 40+ free sources), allows you to not only export & load data but also transform & enrich your data & make it analysis-ready.

Want to Take Hevo for a ride? Sign Up for a 14-day free trial and simplify your Data Integration process. Check out the pricing details to get a better understanding of which plan suits you the most.

Share with us your experience of setting up Airflow Databricks Integration. Let us know in the comments section below!  
