Airflow and Azure Data Factory are both wonderful tools for workflow orchestration, and building & monitoring your ETL pipelines.
But have you ever wondered if you could use them together? Azure Airflow deployment overcomes the native integration challenges and lets you create DAG runs that execute your Azure Data Factory pipeline jobs.
This guide will help you understand the prerequisites for deploying an Azure Airflow environment and the steps you need to integrate Airflow on Azure. We’ll talk about the advantages you gain when you combine Airflow and Azure, and walk through a process to build your own operator with the PythonOperator that connects Airflow to Azure. Let’s dive right in.
What is Airflow?
If you work with large teams or on big projects, you have probably recognized the importance of workflow management. It’s essential to keep track of activities and not go haywire in a sea of tasks. Workflow management tools help you address those concerns by organizing your workflows, campaigns, projects, and tasks. They coordinate not only your actions but also the way you manage them.
Apache Airflow is one such Open-Source Workflow Management tool to improve the way you work. It is used to programmatically author, schedule, and monitor your existing tasks. Comprising a systemic workflow engine, Apache Airflow can:
- Schedule and run your core jobs.
- Manage your data pipelines.
- Order jobs based on their dependencies.
- Allocate scarce resources.
- Track the state of jobs and recover from failure.
Today’s Apache Airflow grew out of the original “Airflow” project, which started at Airbnb in 2014 to manage the company’s complex workflows. It is written in Python and uses Python scripts to define workflow orchestration. Since 2016, when Airflow joined the Apache Incubator, more than 200 companies have benefitted from Airflow, including Airbnb, Yahoo, PayPal, Intel, Stripe, and many more.
Key Features of Apache Airflow
- Easy to Use: If you are already familiar with standard Python scripts, you know how to use Apache Airflow. It’s as simple as that.
- Open Source: Apache Airflow is open-source, which means it’s available for free and has an active community of contributors.
- Dynamic: Airflow pipelines are defined in Python, so you can write code that generates and instantiates pipelines dynamically.
- Extensible: You can easily define your own operators and extend libraries to fit the level of abstraction that works best for your environment.
- Elegant: Airflow pipelines are simple and to the point. Scripts are parameterized using the Jinja templating engine.
- Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers, so you can scale it out as your workload grows.
- Robust Integrations: Airflow can readily integrate with your commonly used services like Google Cloud Platform, Amazon Web Services, Microsoft Azure, and many other third-party services.
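To make the “Dynamic” point concrete, here is a minimal sketch of a DAG file that creates one task per table in a plain Python loop. The table names, the `export_table` helper, and the DAG ID are hypothetical, and the code assumes Airflow 2.4 or later (where `DAG` accepts `schedule`). The import guard is only there so the helper can be exercised on a machine without Airflow installed.

```python
from datetime import datetime

try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:  # Airflow not installed; the helper below still works
    DAG = None

# Hypothetical table names, used only for illustration.
TABLES = ["customers", "orders", "payments"]


def export_table(table_name):
    """Placeholder task body: a real task would copy or transform the table."""
    message = f"exporting {table_name}"
    print(message)
    return message


if DAG is not None:
    with DAG(
        dag_id="dynamic_export_example",  # hypothetical DAG ID
        start_date=datetime(2024, 1, 1),
        schedule=None,  # no schedule: trigger the DAG manually
        catchup=False,
    ) as dag:
        # One PythonOperator per table, generated in plain Python.
        for table in TABLES:
            PythonOperator(
                task_id=f"export_{table}",
                python_callable=export_table,
                op_kwargs={"table_name": table},
            )
```

Adding a table to the list adds a task to the DAG the next time Airflow parses the file, with no other changes needed.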
What is Azure Data Factory (ADF)?
Microsoft Azure Data Factory is a fully managed cloud service within Microsoft Azure to build ETL pipelines. It enables organizations to ingest, prepare, and transform their data from different sources, whether on-premises or cloud data stores.
Using Azure Data Factory (ADF), your business can create and schedule data-driven workflows (called pipelines) and complex ETL processes. Azure Data Factory transforms your data using native compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database, which can then be pushed to data stores such as Azure Synapse Analytics for business intelligence (BI) applications to consume.
With ADF, you can also use its rich graphical interface to monitor your data flows and use automation tools for routine tasks. For more information on Azure Data Factory, read Azure Data Factory vs Databricks: 4 Critical Key Differences.
Key Features of Azure Data Factory
- Easy-to-Use: Azure Data Factory enables your organization to prepare data, build ETL & ELT pipelines, and monitor pipelines code-free. It also features intelligent mapping tools that can automatically copy your data from source to target.
- Cost-Effective: Azure Data Factory is a fully managed serverless cloud service that lets you pay only for what you need with the ability to scale on demand.
- Built-in Connectors: Azure Data Factory offers more than 90 built-in connectors for data warehouses like Amazon Redshift and Google BigQuery, business data warehouses like Oracle Exadata and Teradata, SaaS applications like Salesforce, Marketo, and ServiceNow, and all Azure data services.
- Effective Graphical User Interface: Since the introduction of Azure Data Factory V2, Microsoft has given users a whole new browser-based user interface with drag and drop functionality. This provides Azure Data Factory (ADF) an upper hand over other ETL platforms that are either scripting-based or UI-based.
Azure Airflow Symphony: Why Use Airflow on Azure Data Factory?
Azure Airflow integration is a perfect harmony to build and orchestrate your data pipelines. Along with the ease of monitoring and building ADF pipelines, Azure Airflow integration allows you to create multiple pipelines across multiple teams, and structure their dependencies smoothly.
The evident problem with ADF, as most users point out, is that most of its in-built connections are with Azure’s native services. This means integrations with services outside of Azure are harder to implement. Azure Data Factory’s orchestration capabilities are also more limited than Airflow’s, and it becomes complex to manage when you use custom packages and dependencies.
Using Airflow on Azure overcomes these problems, giving your company complete Airflow orchestration capabilities beyond what ADF can provide. When your business combines Apache Airflow and Azure, your teams can work effectively in a variety of scenarios. If your business teams, for example, prefer to work individually or lack the time or scale to share their intentions with others, Azure Airflow can help them execute the following operations without a hitch:
- Perform operational checkpoints.
- Create ADF data pipelines.
- Keep an eye on the entire execution process.
How to Deploy Airflow on Azure?
Here we discuss some considerations to take into account before designing your Apache Airflow Azure deployment.
One of the easiest ways to run your Airflow components is to use Azure’s managed container services. As an example, you can run Airflow webserver and scheduler components using Azure Container Instances (ACI) or Azure Kubernetes Service (AKS) for your requirements.
To run the Airflow metastore conveniently, you can use Azure SQL Database. To host your Airflow DAGs, Azure File Storage is a good option. One benefit of Azure File Storage is that its volumes can be mounted directly into the containers running in App Service and ACI. It’s also easy to access the data using supporting applications such as Azure Storage Explorer.
This creates the following setup for your Airflow Azure deployment:
- App Service for the Airflow webserver
- ACI for the Airflow scheduler
- Azure SQL database for the Airflow metastore
- Azure File Storage for storing DAGs
- Azure Blob Storage for data and logs
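One way to picture how these components plug into Airflow is through the environment variables you would pass to the webserver and scheduler containers. The sketch below only builds that set of variables. The server name, credentials, mount path, and ODBC driver name are hypothetical placeholders, and the `AIRFLOW__DATABASE__…` key assumes Airflow 2.3 or later. Support for SQL Server as a metadata backend varies by Airflow version, so check the documentation for the release you deploy.

```python
from urllib.parse import quote_plus


def build_airflow_env(sql_server, database, user, password,
                      dags_mount_path, log_folder):
    """Return Airflow settings as environment variables for the containers."""
    # SQLAlchemy URL for SQL Server via pyodbc; the driver name is an assumption.
    conn = (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{sql_server}:1433/{database}"
        "?driver=ODBC+Driver+18+for+SQL+Server"
    )
    return {
        "AIRFLOW__CORE__EXECUTOR": "LocalExecutor",
        # Path where the Azure file share holding the DAGs is mounted.
        "AIRFLOW__CORE__DAGS_FOLDER": dags_mount_path,
        "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN": conn,
        # Ship task logs to Blob Storage; the folder name should start with "wasb".
        "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
        "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER": log_folder,
        "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID": "wasb_default",
    }


# Example with placeholder values only.
env = build_airflow_env(
    sql_server="myserver.database.windows.net",
    database="airflow",
    user="airflow_user",
    password="p@ss/word",
    dags_mount_path="/opt/airflow/dags",
    log_folder="wasb-airflow-logs",
)
for key, value in env.items():
    print(f"{key}={value}")
```

The password is URL-encoded because characters such as `@` or `/` would otherwise break the connection string.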
Designing the Network
The next consideration in Azure Airflow deployment is to design network connectivity between your Airflow and Azure components.
The Airflow webserver requires access to the internet so that your teams can reach it remotely. The Airflow metastore and Airflow scheduler, on the other hand, need private access to avoid potential threats.
Azure’s App Service makes it easy to expose your Airflow webserver as a web application, including a firewall that prevents unwanted access. For the Airflow scheduler and metastore, you can create a virtual network (VNet) with a private subnet.
Scaling with the CeleryExecutor
Once you line up your Airflow Azure services and their respective network connections, you will want to improve the scalability of your Azure Airflow deployment. You can do this by switching from the LocalExecutor to the CeleryExecutor.
The LocalExecutor is designed for small to medium-sized workloads and allows for parallelization on a single machine. The CeleryExecutor, on the other hand, is well suited to production deployments and is one of the ways to scale out the number of Airflow workers.
The CeleryExecutor runs workers in separate compute processes, which here run as individual container instances on Azure Container Instances. Since there is no seamless native integration for this in Azure Airflow, you can use open-source tools like RabbitMQ or Redis to relay jobs between the scheduler and the workers.
In this setup, Azure Container Instances (ACI) runs a Redis or RabbitMQ instance as the message broker that passes tasks to workers after they have been scheduled.
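The switch to the CeleryExecutor comes down to a few Airflow settings, sketched here as environment variables. The broker host name and the result backend URL are hypothetical placeholders; the result backend usually points at the same database as the metastore, prefixed with `db+`.

```python
def celery_executor_env(broker_host, result_backend_url, broker_port=6379):
    """Settings that switch Airflow from the LocalExecutor to the CeleryExecutor."""
    return {
        "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
        # Redis database 0 acts as the message broker between scheduler and workers.
        "AIRFLOW__CELERY__BROKER_URL": f"redis://{broker_host}:{broker_port}/0",
        # Where Celery stores task state, typically the metadata database.
        "AIRFLOW__CELERY__RESULT_BACKEND": result_backend_url,
    }


# Example with placeholder values only.
celery_env = celery_executor_env(
    broker_host="airflow-redis",
    result_backend_url="db+postgresql://airflow:secret@metastore-host/airflow",
)
for key, value in celery_env.items():
    print(f"{key}={value}")
```

The scheduler and every worker container need the same broker and result backend values, otherwise workers will never see the tasks the scheduler queues.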
Although we have presented a workable arrangement, please keep in mind that this is not a production-ready setup. Any production-ready solution will still require extra steps, such as setting up proper firewalls and access restrictions, a strong approach to logging and auditing, tracking metrics, and raising alarms.
At the application level, we suggest investigating the corresponding Azure services, such as Azure Log Analytics and App Insights. At the Airflow level, you should also consider how you want to secure Airflow (e.g., using Airflow’s RBAC mechanism).
Azure Airflow Hooks and Operators
Airflow provides Azure-specific hooks and operators that let you integrate Apache Airflow with Azure cloud services.
Azure Airflow Hooks
Airflow hooks are interfaces that enable you to interact with external systems like S3, HDFS, MySQL, PostgreSQL, etc. As an alternative to using a hook, your software team can also call the ADF API directly to run a pipeline or perform other operations.
If you would like to learn in detail about Airflow hooks, and the process of using them, visit our helpful guide here- Airflow Hooks Explained 101: A Complete Guide.
Airflow provides an Azure Data Factory hook to interact with and execute ADF pipelines. There are many more hooks that let you link Airflow with Azure’s various storage services (e.g., Blob, File Share, and Data Lake Storage). Explore the list in the table below:
| Service | Description | Hook | Applications |
|---|---|---|---|
| Azure Blob Storage | Blob storage service | WasbHook (Windows Azure Storage Blob) | Uploading/downloading files |
| Azure Container Instances | Managed service for running containers | AzureContainerInstanceHook | Running and monitoring containerized jobs |
| Azure Cosmos DB | Multi-model database service | AzureCosmosDBHook | Inserting and retrieving database documents |
| Azure Data Lake Storage | Data lake storage for big-data analytics | AzureDataLakeHook | Uploading/downloading files to/from Azure Data Lake Storage |
| Azure File Storage | NFS-compatible file storage service | AzureFileShareHook | Uploading/downloading files |
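As a small illustration of how a hook from this table is used, the sketch below uploads a file to Blob Storage only if the blob is not there yet. It relies on two `WasbHook` methods, `check_for_blob` and `load_file`. The helper name `upload_if_missing`, the container and file names, and the connection ID are assumptions for the example, and the hook is passed in as an argument so the logic can be tried without an Azure account.

```python
def upload_if_missing(hook, local_path, container_name, blob_name):
    """Upload a local file to Blob Storage unless the blob already exists.

    `hook` is expected to behave like WasbHook, which provides
    check_for_blob(container_name, blob_name) and
    load_file(file_path, container_name, blob_name).
    Returns True if the file was uploaded, False if it was already present.
    """
    if hook.check_for_blob(container_name, blob_name):
        return False
    hook.load_file(local_path, container_name, blob_name)
    return True


def make_wasb_hook(conn_id="wasb_default"):
    """Create the real hook; imported lazily so the helper works without Airflow."""
    from airflow.providers.microsoft.azure.hooks.wasb import WasbHook

    return WasbHook(wasb_conn_id=conn_id)
```

Inside a task you would call something like `upload_if_missing(make_wasb_hook(), "/tmp/report.csv", "raw", "report.csv")`, with the connection details stored in an Airflow connection rather than in the DAG code.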
Azure Airflow Operators
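Older releases of Airflow’s Azure integration shipped hooks but no operator for Azure Data Factory. Newer releases of the apache-airflow-providers-microsoft-azure package include an AzureDataFactoryRunPipelineOperator, so check what your version offers. If you are on an older version, or simply want full control over the call, you can develop and use an operator of your own using the PythonOperator.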
To do so, follow these steps:
Step 1: Create an ADF Pipeline
In the Azure portal, open your resource group and create a new Data Factory resource. You can do so by clicking on “add resource” and searching for Data Factory.
Once the resource has been created, click on it to see an overview of the current runs. Next, select ‘Author and Monitor’ to build your own pipeline.
You can copy data from a REST API and create a Copy Activity pipeline using the option “Copy from REST or HTTP using OAuth”. When you’ve built your pipeline, you can run it by entering the parameters.
Step 2: Connect App with Azure Active Directory
To make your ADF pipeline available in Apache Airflow, you must first register an app with Azure Active Directory in order to obtain a Client ID and Client Secret (API Key) for your Data Factory.
To begin, navigate to Azure Active Directory and choose ‘App registrations’ to view a list of registered apps. If you have established a resource group, you will find an app with the same name registered.
Click on the app to find your Client ID under the Essentials tab. To obtain a Client Secret (API Key), click on Certificates & secrets in the left pane, and then click New client secret to make one. This will be used to connect to Data Factory from Airflow.
Now, head to the Access Control (IAM) settings of your Data Factory, choose Add role assignment, search for your app by its name or Client ID, and assign it a role that is allowed to run pipelines, such as Data Factory Contributor.
Click Save to apply the changes. You should now be able to establish an Azure Airflow connection.
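To see what the Client ID and Client Secret are used for, here is a sketch of the two HTTP requests that sit behind an ADF pipeline run: an OAuth 2.0 client-credentials token request, and a POST to the Data Factory `createRun` endpoint. The code only builds the URLs and form body and sends nothing. The tenant, subscription, resource group, factory, and pipeline names are placeholders you would replace with your own.

```python
from urllib.parse import urlencode

# Scope for the Azure Resource Manager API, which hosts the ADF endpoints.
ARM_SCOPE = "https://management.azure.com/.default"


def token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) for the OAuth 2.0 client-credentials token call."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": ARM_SCOPE,
    })
    return url, body


def create_run_url(subscription_id, resource_group, factory_name, pipeline_name):
    """Return the ADF REST endpoint that starts a pipeline run (HTTP POST)."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory_name}"
        f"/pipelines/{pipeline_name}"
        "/createRun?api-version=2018-06-01"
    )


# Example with placeholder values only.
url, body = token_request("my-tenant-id", "my-client-id", "my-client-secret")
print(url)
print(create_run_url("my-subscription-id", "my-resource-group",
                     "my-data-factory", "my-pipeline"))
```

In Airflow itself you would normally store these values in a connection and let the Azure Data Factory hook handle both calls, rather than making them by hand.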
Step 3: Build a DAG Run for ADF Job
Lastly, you can define a DAG that runs your ADF job. You can pass ADF parameters to the DAG run, which are then handed to the pipeline when it executes.
A DAG for an ADF job typically uses an Airflow connection, named azure_data_factory_conn in this guide, to connect to your Azure instance and Azure Data Factory. For your use cases the connection name and settings might differ, and you’ll have to define them accordingly.
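As a minimal sketch of such a DAG, the code below wraps `AzureDataFactoryHook.run_pipeline` in a PythonOperator. It assumes Airflow 2.4 or later and a recent apache-airflow-providers-microsoft-azure release, where the hook takes `azure_data_factory_conn_id` (older releases may name this argument differently). The pipeline, resource group, and factory names, the `run_date` parameter, and the DAG ID are hypothetical. The import guard and the optional `hook` argument only exist so the callable can be tried without Airflow or Azure.

```python
from datetime import datetime

try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:  # lets the callable be tested without Airflow installed
    DAG = None

ADF_CONN_ID = "azure_data_factory_conn"


def run_adf_pipeline(pipeline_name, resource_group, factory_name,
                     parameters=None, hook=None):
    """Start an ADF pipeline run and return its run ID."""
    if hook is None:
        # Imported lazily; requires apache-airflow-providers-microsoft-azure.
        from airflow.providers.microsoft.azure.hooks.data_factory import (
            AzureDataFactoryHook,
        )
        hook = AzureDataFactoryHook(azure_data_factory_conn_id=ADF_CONN_ID)
    run = hook.run_pipeline(
        pipeline_name,
        resource_group_name=resource_group,
        factory_name=factory_name,
        parameters=parameters or {},
    )
    return run.run_id


if DAG is not None:
    with DAG(
        dag_id="adf_pipeline_example",  # hypothetical DAG ID
        start_date=datetime(2024, 1, 1),
        schedule=None,  # no schedule: trigger the DAG manually
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_adf_pipeline",
            python_callable=run_adf_pipeline,
            op_kwargs={
                # Placeholder names; replace with your own resources.
                "pipeline_name": "copy_rest_to_blob",
                "resource_group": "my-resource-group",
                "factory_name": "my-data-factory",
                # op_kwargs is templated, so {{ ds }} becomes the run date.
                "parameters": {"run_date": "{{ ds }}"},
            },
        )
```

The task returns the ADF run ID, which Airflow stores as an XCom value, so a downstream task could use it to poll the run’s status.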
We hope this blog piece clarified the concepts of Azure Airflow deployment and the steps to achieve it. We also shared some considerations for deploying the Azure Airflow environment, and tips to make it a production-ready and scalable solution.
Today, a plethora of organizations rely on Airflow and Azure Data Factory for orchestrating their business processes. Data is sent into and retrieved from a number of systems, and it becomes important to consolidate data into one source of truth. Migrating data from Airflow and other Data Sources into a Cloud Data Warehouse or a destination of your choice for further Business Analytics is a good solution and this is where Hevo comes in.
Hevo Data with its strong integration with 100+ Sources & BI tools such as Airflow, allows you to not only export data from sources & load data in the destinations, but also transform & enrich your data, & make it analysis-ready so that you can focus only on your key business needs and perform insightful analysis using BI tools.
Hevo lets you migrate your data from your database, SaaS Apps to any Data Warehouse of your choice like Amazon Redshift, Snowflake, Google BigQuery, or Firebolt within minutes with just a few clicks.
Why not try Hevo and see the magic for yourself? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also check out our unbeatable pricing and make a decision on your best-suited plan.
If you have any questions on Apache Airflow Azure integration, do let us know in the comment section below. Also, share any other topics you’d want us to cover. We’d be happy to know your opinions.