Airflow is a task automation tool. It helps organizations schedule tasks so that they run at the right time, which relieves employees of repetitive manual work. When using Airflow, you will often want to access it and perform some tasks from other tools.

This means that you will need a way to connect to Airflow from other tools. Kubernetes is one option: an open-source system built on 15 years of Google's experience running production workloads, combined with the best ideas and practices from the community.

This blog talks about the different steps involved in installing Airflow on Kubernetes in a seamless fashion. It also gives a brief introduction to Airflow and Kubernetes before diving into the benefits of leveraging Airflow Kubernetes operator and setup steps.

What is Apache Airflow?

Apache Airflow is an open-source workflow automation and scheduling platform that programmatically authors, schedules, and monitors workflows. Organizations use Airflow to orchestrate complex computational workflows, create data processing pipelines, and perform ETL processes.

Apache Airflow uses DAGs (Directed Acyclic Graphs) to construct workflows. Each DAG contains nodes (tasks) and connectors (dependencies); nodes connect to other nodes via connectors to form a dependency tree.
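
To make the dependency idea concrete, here is a minimal sketch in plain Python that uses the standard library's graphlib module rather than Airflow itself. The task names (extract, transform, load) and the execution_order helper are hypothetical; the point is only that a DAG lets a scheduler derive a valid run order from declared dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical three-task pipeline: each key lists the tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph must be acyclic, a circular dependency (for example, two tasks that each wait on the other) makes TopologicalSorter raise graphlib.CycleError instead of returning an order.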

Key Features of Apache Airflow

  • Dynamic Integration: Airflow uses Python as the backend programming language to generate dynamic pipelines. Several operators, hooks, and connectors are available to create DAGs and tie them together into workflows.
  • Extensible: Airflow is an open-source platform, and so it allows users to define their own custom operators, executors, and hooks. You can also extend the libraries so that they fit the level of abstraction that suits your environment.
  • Elegant User Interface: Airflow uses Jinja templates to create pipelines, and hence the pipelines are lean and explicit. Parameterizing your scripts is a straightforward process in Airflow.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers, so you can define as many dependent workflows as you need.

What is Kubernetes?

Kubernetes has made a name for itself as an open-source container orchestration system that automates software deployment, scaling, and management. In layman's terms, Kubernetes groups the containers that make up an application into logical units for easy discovery and management.

Whether you are testing locally or running a global enterprise, Kubernetes' flexibility grows with you, so you can deliver your applications easily and consistently no matter how sophisticated your needs are.

Key Features of Kubernetes

Here are a few key features of Kubernetes that make it an indispensable tool for your workplace:

  • Storage Orchestration: With Kubernetes, you can automatically mount the storage system of your choice. You can pick from network storage systems like Gluster, NFS, Cinder, Ceph, Flocker, or iSCSI; local storage; or public cloud providers such as AWS or GCP.
  • Batch Execution: Along with its comprehensive set of services, Kubernetes can also manage your CI and batch workloads, replacing containers that fail, if needed.
  • Load Balancing and Service Discovery: With Kubernetes, you no longer need to modify your application to use an unfamiliar service discovery mechanism. Kubernetes gives pods their own IP addresses and a single DNS name for a set of pods, and it can load balance across them.
  • Automatic Rollbacks and Rollouts: Kubernetes gradually rolls out changes to your application or its configuration, while keeping track of application health to make sure that it doesn’t kill all of your instances simultaneously. If something goes haywire, Kubernetes will roll back the change for you.
  • Configuration and Secret Management: You can deploy and update secrets and application configuration without rebuilding your image and without exposing the secrets in your stack configuration.
  • Horizontal Scaling: Kubernetes allows you to scale your application down and up with a simple command, with a user interface, or automatically based on CPU usage.
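
The CPU-based automatic scaling mentioned above can be illustrated with the proportional rule given in the Kubernetes documentation for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below is a simplified illustration of that rule, not the autoscaler's actual implementation. The function name and the min/max defaults are assumptions, and details such as tolerances and stabilization windows are ignored.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to observed vs. target CPU utilization."""
    raw = math.ceil(current_replicas * (current_cpu / target_cpu))
    # Clamp to the configured bounds (hypothetical defaults).
    return max(min_replicas, min(max_replicas, raw))


# Pods are running hotter than the target, so the count grows.
print(desired_replicas(4, current_cpu=90, target_cpu=60))
# Pods are mostly idle, so the count shrinks.
print(desired_replicas(4, current_cpu=30, target_cpu=60))
```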

Why do you need to run Airflow on Kubernetes?

Ever since its inception, Airflow's USP has been its flexibility. Airflow is known for offering a wide range of integrations, from HBase and Spark to services on various Cloud providers. Airflow also provides easy extensibility via its plug-in framework.

However, one limitation of the project is that Airflow users are limited to the clients and frameworks that are present on the Airflow worker at the moment of execution. A single organization may have various Airflow workflows ranging from application deployments to Data Science pipelines.

This difference in use cases can create dependency management issues, since the teams behind these workflows might rely on vastly different libraries.

This is where Kubernetes comes into the picture. You can use Kubernetes to allow users to launch arbitrary Kubernetes configurations and pods. Airflow users can now have complete autonomy over their run-time environments, secrets, and resources. This allows you to turn Airflow into an “any job you desire” workflow orchestrator.   

Benefits of Airflow Kubernetes Operator

Here are a few benefits offered by the Airflow Kubernetes Pod Operator:

  • Flexibility in Dependency and Configuration Management: Custom Docker images enable users to ensure task configuration, environment, and dependencies are consistent and idempotent, overcoming challenges with static Airflow workers and complex dependency management.
  • Enhanced Deployment Flexibility: Airflow’s plugin API allows engineers to test new features in Directed Acyclic Graphs (DAGs) easily. With Docker containers, any task runnable in a container can be accessed through the same operator without the need for additional Airflow code maintenance.
  • Improved Security with Kubernetes Secrets: DevOps engineers can securely manage sensitive data like database passwords and API keys by leveraging Kubernetes secrets with the Airflow Kubernetes operator. This ensures that sensitive information is isolated and only accessible to authorized pods, enhancing overall security posture.
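
As a rough illustration of the first and third benefits, the sketch below builds a Kubernetes Pod manifest as a plain Python dictionary: the task runs in a custom image, and a password is injected from a Kubernetes secret through the standard valueFrom.secretKeyRef field instead of being written into the manifest. The build_pod_manifest helper, the pod name, the image, and the secret name and key are all hypothetical. This is not the operator's own code.

```python
import json


def build_pod_manifest(name, image, command, secret_env):
    """Build a Pod manifest; secret_env maps env var name -> (secret name, secret key)."""
    env = [
        {"name": var, "valueFrom": {"secretKeyRef": {"name": secret, "key": key}}}
        for var, (secret, key) in secret_env.items()
    ]
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": "base", "image": image, "command": command, "env": env}
            ],
        },
    }


# All names below are placeholders chosen for illustration.
manifest = build_pod_manifest(
    name="example-task",
    image="python:3.11-slim",
    command=["python", "-c", "print('hello')"],
    secret_env={"DB_PASSWORD": ("airflow-secrets", "db-password")},
)
print(json.dumps(manifest, indent=2))
```

The manifest only names the secret and key; the value itself stays in the cluster and is resolved when the pod starts.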

How does the Airflow Kubernetes Operator work?

The Airflow Kubernetes operator leverages the Kubernetes Python client to create a request that gets processed by the API Server. Next, Kubernetes will launch your pod with whatever specifications you’ve defined. Following this, the images will be loaded with all the necessary environment variables, dependencies, and secrets, enacting a single command.

Once the job is launched, the operator only has to monitor its health and track its logs. Users then have the option of gathering logs locally on the scheduler or sending them to any distributed logging service currently running in their Kubernetes cluster.
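
The launch-then-monitor flow described above can be sketched roughly as follows. This is not the operator's code: read_phase is a hypothetical callable standing in for a status lookup through the Kubernetes API, and the canned phase sequence is illustrative. Pending, Running, Succeeded, and Failed are standard Kubernetes pod phases.

```python
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for_pod(read_phase, max_polls=10):
    """Poll read_phase() until the pod reaches a terminal phase.

    Returns the final phase and the list of phases observed along the way.
    """
    seen = []
    for _ in range(max_polls):
        phase = read_phase()
        seen.append(phase)
        if phase in TERMINAL_PHASES:
            return phase, seen
        # A real monitor would sleep between polls; omitted to keep the sketch fast.
    raise TimeoutError(f"pod did not finish within {max_polls} polls")


# Canned phases standing in for responses from the API server.
canned = iter(["Pending", "Running", "Running", "Succeeded"])
final_phase, history = wait_for_pod(lambda: next(canned))
print(final_phase, history)
```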

Understanding Airflow Kubernetes Setup Configuration

Here are the steps you can follow for Airflow Kubernetes installation on your system in a seamless fashion:

Kubernetes Configuration

  • Step 1: For configuration, you need a Kubernetes deployment whose pod runs both the scheduler and webserver containers. It looks like this:
apiVersion: apps/v1  # extensions/v1beta1 Deployments were removed in Kubernetes 1.16
kind: Deployment
metadata:
  name: airflow
  namespace: airflow-example
spec:
  replicas: 1
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      name: airflow
  template:
    metadata:
      labels:
        name: airflow
    spec:
      serviceAccountName: airflow
      containers:
      - name: webserver
        ...
      - name: scheduler
        ...
      volumes:
      ...
  ...
  • Step 2: You also need a service whose external IP is mapped to Airflow's web server as follows:
apiVersion: v1
kind: Service
metadata:
  name: airflow  # Kubernetes object names must be lowercase
spec:
  type: LoadBalancer
  ports:
  - port: 8080
  selector:
    name: airflow
  • Step 3: Next, you need a service account and a Role that permits creating and deleting pods. Together, these give the Airflow scheduler the permissions it needs to spin up worker pods:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow
  namespace: airflow-example
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: airflow-example
  name: airflow
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
  • Step 4: Here’s what two persistent volumes for storing logs and DAGs would look like for Kubernetes configuration:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: airflow-dags
spec:
  accessModes:
  - ReadOnlyMany
  capacity:
    storage: 2Gi
  hostPath:
    path: /airflow-dags/
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: airflow-dags
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 2Gi
  • Step 5: Next, create the Airflow config file as a Kubernetes config map and link it to the pod. Make sure that Postgres is handled through a separate deployment. Secrets such as the Postgres password can be created as Kubernetes secrets, and you can use a Kubernetes config map to supply additional environment variables.
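
One way to see how the config file and environment variables in Step 5 relate: Airflow lets any configuration option be overridden by an environment variable named AIRFLOW__{SECTION}__{KEY}. The sketch below converts a small config fragment into that form using only the standard library. The cfg_to_env helper and the sample values are illustrative assumptions, not the configuration used by this setup.

```python
import configparser

# Illustrative fragment only; a real airflow.cfg contains many more options.
AIRFLOW_CFG = """
[core]
executor = KubernetesExecutor

[kubernetes]
namespace = airflow-example
"""


def cfg_to_env(cfg_text):
    """Map each [section] key = value pair to an AIRFLOW__SECTION__KEY variable."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": value
        for section in parser.sections()
        for key, value in parser[section].items()
    }


env_vars = cfg_to_env(AIRFLOW_CFG)
for name, value in env_vars.items():
    print(f"{name}={value}")
```

Name/value pairs like these could be placed in the data section of a config map and exposed to the webserver and scheduler containers as environment variables.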

Building the Docker Image

  • Step 1: The core of the Docker image build is a pip install of Airflow and its Kubernetes extra:
RUN pip install --upgrade pip

RUN pip install apache-airflow==1.10.10
RUN pip install 'apache-airflow[kubernetes]'
  • Step 2: You also need a script that runs either the webserver or the scheduler, depending on the argument passed to the container. You can use a bootstrap.sh file for this:
#!/bin/bash
# Start the Airflow component named by the first argument.
if [ "$1" = "webserver" ]
then
    exec airflow webserver
fi

if [ "$1" = "scheduler" ]
then
    exec airflow scheduler
fi
  • Step 3: Next, add the script to the Dockerfile and set it as the entrypoint:
COPY bootstrap.sh /bootstrap.sh
RUN chmod +x /bootstrap.sh
ENTRYPOINT ["/bootstrap.sh"]
  • Step 4: You can then build and push the image with the following commands:
docker build -t <image-repo-url:tag> .
docker push <image-repo-url:tag>

Deploying Airflow on Kubernetes

  • Step 1: You can deploy the Airflow pods in the following two modes:
    • Use Git to pull DAGs from a repository.
    • Use a persistent volume to store DAGs.
  • Step 2: Now, to set up the pods, you will have to run a deploy.sh script that can perform the following operations:
    • Delete existing deployments and pods if present within your namespace.
    • Convert the templatized configuration under the templates directory into Kube config files under the build directory.
    • Generate new deployments, pods, and any other Kube resources.

Here’s the code snippet for the same:

export IMAGE=<IMAGE REPOSITORY URL>
export TAG=<IMAGE_TAG>
cd airflow-kube-setup/scripts/kube
./deploy.sh -d persistent_mode
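
The template-conversion step that deploy.sh performs can be pictured with a small sketch. The actual templates and their placeholder syntax are not shown in this post, so the fragment below, the ${IMAGE} and ${TAG} placeholders, and the render helper are assumptions for illustration only. It uses Python's string.Template to substitute the exported IMAGE and TAG values.

```python
from string import Template

# Hypothetical template fragment; the real templates are not shown in this post.
DEPLOYMENT_TEMPLATE = Template("""\
containers:
- name: webserver
  image: ${IMAGE}:${TAG}
  args: ["webserver"]
""")


def render(template, env):
    """Substitute ${IMAGE} and ${TAG}; raises KeyError if either value is missing."""
    return template.substitute(IMAGE=env["IMAGE"], TAG=env["TAG"])


# Placeholder values standing in for the exported IMAGE and TAG variables.
rendered = render(
    DEPLOYMENT_TEMPLATE,
    {"IMAGE": "registry.example.com/airflow", "TAG": "1.10.10"},
)
print(rendered)
```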

Deployment Verification

This Airflow Kubernetes Setup copies all the examples into the DAGs by default. You can just run one of them and check if everything is working fine.

  • Step 1: Extract the Airflow URL by running kubectl get services.
  • Step 2: Next, log in to Airflow with the username airflow and the password airflow. You can change these credentials in airflow-test-init.sh.
  • Step 3: After logging in, pick one of the DAGs listed.
  • Step 4: Open your terminal and run kubectl get pods --watch to observe the worker pods as they are created.
  • Step 5: Click the Trigger DAG option to trigger one of the jobs. You can then see the tasks running in the Graph view, and in your terminal you can see new pods being created and then shut down once their tasks complete.

Modification and Maintenance

  • Step 1: Now that you’ve deployed Airflow on Kubernetes, you no longer need to run the deploy script every time. You can use basic kubectl commands to inspect, restart, or delete pods. Here are a few examples:
kubectl get pods --watch
kubectl logs <POD_NAME> -c <CONTAINER_NAME>
kubectl exec -it <POD_NAME> --container webserver -- /bin/bash

Conclusion

This blog talks about Airflow Kubernetes Operator and configuration in detail. It also gives a brief introduction to the key features of Kubernetes and Airflow before diving into the Airflow Kubernetes Configuration setup.

Extracting complex data from a diverse set of data sources to carry out an insightful analysis can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 150+ Data Sources including Databases or SaaS applications into a destination of your choice or a Data Warehouse to be visualized in a BI tool.

While Airflow provides a good solution for ETL, Hevo takes away the complexity of hard-coding and maintaining the pipelines required with Airflow. Hevo is fully automated and hence does not require you to code.

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also check out our pricing to choose the best plan for your organization.

Suraj Poddar
Principal Frontend Engineer, Hevo Data

Suraj has over a decade of experience in the tech industry, with a significant focus on architecting and developing scalable front-end solutions. As a Principal Frontend Engineer at Hevo, he has played a key role in building core frontend modules, driving innovation, and contributing to the open-source community. Suraj's expertise includes creating reusable UI libraries, collaborating across teams, and enhancing user experience and interface design.
