Airflow is a task automation tool. It helps organizations schedule their tasks so that they run at the right time, which relieves employees from doing them repetitively. When using Airflow, you will often want to access it and perform tasks from other tools, so you need a way to connect to Airflow from those tools. Kubernetes is an open-source system built on 15 years of experience of running production workloads at Google, combined with the best ideas and practices from the community.
This blog talks about the different steps involved in installing Airflow on Kubernetes in a seamless fashion. It also gives a brief introduction to Airflow and Kubernetes before diving into the benefits of leveraging Airflow Kubernetes operator and setup steps.
Table of Contents
- What is Apache Airflow?
- What is Kubernetes?
- Why do you need to run Airflow on Kubernetes?
- Benefits of Airflow Kubernetes Operator
- How does the Airflow Kubernetes Operator work?
- Understanding Airflow Kubernetes Setup Configuration
What is Apache Airflow?
Apache Airflow is an open-source workflow automation and scheduling platform that lets you programmatically author, schedule, and monitor workflows. Organizations use Airflow to orchestrate complex computational workflows, create data processing pipelines, and perform ETL processes. Airflow models each workflow as a DAG (Directed Acyclic Graph) made up of nodes and connectors. Nodes connect to other nodes via connectors to form a dependency tree.
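To make the node-and-connector idea concrete, here is a minimal sketch in plain Python that uses only the standard library (`graphlib`, Python 3.9+). It is not Airflow code, and the task names are hypothetical. It only shows how a dependency graph determines a valid run order and why the graph has to be acyclic.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL workflow: each key is a task (node) and its value is the
# set of upstream tasks it depends on (the connectors pointing into it).
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_order(dag):
    """Return one valid run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(dag).static_order())


def is_valid_dag(dag):
    """A workflow is only schedulable if its dependency graph has no cycle."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


order = execution_order(workflow)
print(order)
print(is_valid_dag({"a": {"b"}, "b": {"a"}}))  # a cycle, so this prints False
```

In this sketch `extract` always runs first and `load` always runs last, while `transform` and `validate` may run in either order because neither depends on the other.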
Key Features of Apache Airflow
- Dynamic Integration: Airflow uses Python as the backend programming language to generate dynamic pipelines. Several operators, hooks, and connectors are available to create DAGs and tie them together into workflows.
- Extensible: Airflow is an open-source platform, so it allows users to define their own custom operators, executors, and hooks. You can also extend the libraries so that they fit the level of abstraction that suits your environment.
- Elegant Pipelines: Airflow pipelines are lean and explicit. Parameterizing your scripts is straightforward because Airflow uses the Jinja templating engine.
- Scalable: Airflow is designed to scale out. You can define as many dependent workflows as you want, and Airflow uses a message queue to orchestrate an arbitrary number of workers.
What is Kubernetes?
Kubernetes has made a name for itself as an open-source container orchestration system that automates the deployment, scaling, and management of containerized applications. In layman's terms, Kubernetes groups the containers that make up an application into logical units for easy discovery and management. Whether you are testing locally or running a global enterprise, Kubernetes' flexibility grows with you to deliver your applications easily and consistently, no matter how sophisticated your needs are.
Key Features of Kubernetes
Here are a few key features of Kubernetes that make it an indispensable tool for your workplace:
- Storage Orchestration: With Kubernetes, you can automatically mount the storage system of your choice. You can pick from network storage systems like Gluster, NFS, Cinder, Ceph, Flocker, and iSCSI; local storage; or public cloud providers such as AWS or GCP.
- Batch Execution: Along with its comprehensive set of services, Kubernetes can also manage your CI and batch workloads, replacing containers that fail, if needed.
- Load Balancing and Service Discovery: With Kubernetes, you no longer need to modify your application to use an unfamiliar service discovery mechanism. Kubernetes gives pods their own IP addresses and a single DNS name for a set of pods, and it can load balance across them.
- Automatic Rollbacks and Rollouts: Kubernetes gradually rolls out changes to your application or its configuration, while keeping track of application health to make sure that it doesn’t kill all of your instances simultaneously. If something goes haywire, Kubernetes will roll back the change for you.
- Configuration and Secret Management: You can deploy and update secrets and application configuration without rebuilding your image and without exposing the secrets in your stack configuration.
- Horizontal Scaling: Kubernetes allows you to scale your application up and down with a simple command, with a user interface, or automatically based on CPU usage.
Why do you need to run Airflow on Kubernetes?
Ever since its inception, Airflow’s USP has been its flexibility. Airflow is known for offering a wide range of integrations, from HBase and Spark to services on various Cloud providers. Airflow also provides easy extensibility via its plug-in framework. However, one limitation of the project is that Airflow users are limited to the clients and frameworks that are present on the Airflow worker at the moment of execution. A single organization may have various Airflow workflows ranging from application deployments to Data Science pipelines. This difference in use cases can create dependency management issues, because different teams might rely on vastly different libraries for their workflows.
This is where Kubernetes comes into the picture. You can use Kubernetes to allow users to launch arbitrary Kubernetes configurations and pods. Airflow users can now have complete autonomy over their run-time environments, secrets, and resources. This allows you to turn Airflow into an “any job you desire” workflow orchestrator.
Benefits of Airflow Kubernetes Operator
Here are a few benefits offered by the Airflow Kubernetes Pod Operator:
- Flexibility of Dependencies and Configurations: For operators that have to run within static Airflow workers, dependency management can become difficult. If a developer wishes to run one task that needs NumPy and another that needs SciPy, the developer would either have to offload the task to an external machine (which might cause bugs if that external machine gets modified in an untracked manner) or maintain both dependencies within all Airflow workers. Custom Docker images let users ensure that a task’s configuration, environment, and dependencies are completely idempotent.
- Increased Flexibility for Deployments: Airflow’s plugin API has always been a significant boon to engineers wishing to test new features in their DAGs. On the downside, whenever a developer wanted to create a new Airflow operator, they had to develop an entirely new plugin from scratch. Now, any task that can run in a Docker container is accessible through the same operator, with no extra Airflow code to maintain.
- Usage of Kubernetes Secrets to add an Extra Layer of Security: Handling sensitive data is a core responsibility for a DevOps engineer. At every juncture, Airflow users wish to isolate database passwords, API keys, and login credentials on a need-to-know basis. With the Airflow Kubernetes operator, users can store all the sensitive data in Kubernetes Secrets. This means that the Airflow workers never need access to this information and can simply request that pods be built with the specific secrets that they need.
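The benefits above can be illustrated with a small sketch of the kind of pod definition a task might request. This is plain Python that builds a Kubernetes Pod manifest as a dictionary. It is not the operator's own code, and the image, secret, and key names are hypothetical. The point to notice is that the task picks its own image and refers to a secret only by name and key, so the secret value never appears on the Airflow side.

```python
import json


def secret_env(env_name, secret_name, key):
    """Env entry that points at a Kubernetes Secret key instead of holding the value."""
    return {
        "name": env_name,
        "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
    }


def build_pod_manifest(name, namespace, image, command, env):
    """Assemble a minimal Pod manifest (apiVersion v1) as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",  # a task pod runs once and exits
            "containers": [
                {"name": "base", "image": image, "command": command, "env": env}
            ],
        },
    }


manifest = build_pod_manifest(
    name="load-report",                       # hypothetical task name
    namespace="airflow-example",
    image="example.com/team/report-job:1.0",  # hypothetical image with its own dependencies
    command=["python", "load_report.py"],     # hypothetical entry point
    env=[secret_env("DB_PASSWORD", "airflow-secrets", "postgres-password")],
)
print(json.dumps(manifest, indent=2))
```

Because each task names its own image, a NumPy job and a SciPy job can ship separate images instead of sharing one worker environment.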
Simplify Data Analysis with Hevo’s No-code Data Pipeline
A fully managed No-code Data Pipeline platform like Hevo helps you integrate data from 100+ data sources (including 40+ Free Data Sources) to a destination of your choice in real-time in an effortless manner. While Airflow provides a good solution for ETL, Hevo takes away all the complexity of hard coding and maintaining the pipelines required with Airflow. Hevo, with its minimal learning curve, can be set up in just a few minutes, allowing users to load data without having to compromise performance. Its strong integration with umpteen sources gives users the flexibility to bring in data of different kinds smoothly, without having to code a single line.
Check Out Some of the Cool Features of Hevo:
- Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
- Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
- Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
- Connectors: Hevo supports 100+ Integrations from sources to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Amazon Redshift, Firebolt, Snowflake Data Warehouses; Databricks, Amazon S3 Data Lakes, SQL Server, TokuDB, DynamoDB databases to name a few.
- 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources, that can help you scale your data infrastructure as required.
- 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
How does the Airflow Kubernetes Operator work?
The Airflow Kubernetes operator uses the Kubernetes Python client to create a request that is processed by the API Server. Kubernetes then launches your pod with whatever specifications you’ve defined. The image is loaded with all the necessary environment variables, dependencies, and secrets, and runs a single command. Once the job is launched, the operator only has to monitor the health of the pod and track its logs. Users then have the option of gathering logs locally to the scheduler or sending them to any distributed logging service currently in their Kubernetes cluster.
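The monitoring part of this flow can be sketched as a simple polling loop. The code below is an illustration under stated assumptions, not the operator's actual implementation. `get_phase` is a hypothetical stand-in for a status call to the API Server, and the phase names are the standard Kubernetes pod phases.

```python
import time

# Kubernetes pod phases that mean the pod has finished, one way or the other.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for_pod(get_phase, poll_seconds=0.0, max_polls=100):
    """Poll get_phase() until the pod reaches a terminal phase.

    Returns the final phase and the list of distinct phases seen in order.
    """
    seen = []
    for _ in range(max_polls):
        phase = get_phase()
        if not seen or seen[-1] != phase:
            seen.append(phase)
        if phase in TERMINAL_PHASES:
            return phase, seen
        time.sleep(poll_seconds)
    raise TimeoutError(f"pod did not finish after {max_polls} polls")


# Stand-in for repeated status requests to the API Server.
fake_phases = iter(["Pending", "Pending", "Running", "Running", "Succeeded"])
final_phase, history = wait_for_pod(lambda: next(fake_phases))
print(final_phase, history)
```

A real client would also stream the container logs while the pod is running. That part is left out here to keep the sketch focused on the launch-then-monitor pattern.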
Understanding Airflow Kubernetes Setup Configuration
Here are the steps you can follow for Airflow Kubernetes installation on your system in a seamless fashion:
- Airflow Kubernetes Setup: Kubernetes Configuration
- Airflow Kubernetes Setup: Building the Docker Image
- Airflow Kubernetes Setup: Deploying Airflow on Kubernetes
- Airflow Kubernetes Setup: Deployment Verification
- Airflow Kubernetes Setup: Modification and Maintenance
Airflow Kubernetes Setup: Kubernetes Configuration
- Step 1: For configuration, you need a Kubernetes deployment whose pod runs both the scheduler and webserver containers. It looks like this:
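    # apps/v1 replaces extensions/v1beta1, which Kubernetes 1.16 stopped serving for Deployments
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: airflow
      namespace: airflow-example
    spec:
      replicas: 1
      # apps/v1 requires a selector that matches the pod template labels
      selector:
        matchLabels:
          name: airflow
      template:
        metadata:
          labels:
            name: airflow
        spec:
          serviceAccountName: airflow
          containers:
          - name: webserver
            ...
          - name: scheduler
            ...
          volumes:
            ...
    ...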
- Step 2: You also need a service whose external IP is mapped to Airflow’s web server as follows:
    apiVersion: v1
    kind: Service
    metadata:
      # Kubernetes object names must be lowercase
      name: airflow
    spec:
      type: LoadBalancer
      ports:
      - port: 8080
      selector:
        name: airflow
- Step 3: Next, you need a service account that, once bound to a Role through a RoleBinding, is allowed to spin up and delete pods. These permissions are what let the Airflow scheduler spin up the worker pods:
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: airflow
      namespace: airflow-example
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: airflow-example
      name: airflow
    rules:
    - apiGroups: [""] # "" indicates the core API group
      resources: ["pods"]
      verbs: ["get", "list", "watch", "create", "update", "delete"]
    - apiGroups: ["batch", "extensions"]
      resources: ["jobs"]
      verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
    ---
- Step 4: You need two persistent volumes, one for storing DAGs and one for storing logs. Here’s what the volume and its claim look like for DAGs; the one for logs would be defined in the same way:
    kind: PersistentVolume
    apiVersion: v1
    metadata:
      name: airflow-dags
    spec:
      accessModes:
      - ReadOnlyMany
      capacity:
        storage: 2Gi
      hostPath:
        path: /airflow-dags/
    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: airflow-dags
    spec:
      accessModes:
      - ReadOnlyMany
      resources:
        requests:
          storage: 2Gi
- Step 5: Next, create the Airflow config file as a Kubernetes ConfigMap linked to the pod. You also need to make sure that Postgres is handled through a separate deployment. Secrets such as the Postgres password can be created as Kubernetes Secrets, and a Kubernetes ConfigMap can supply any additional environment variables.
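One detail that helps with Step 5: Airflow reads any configuration option from an environment variable named `AIRFLOW__{SECTION}__{KEY}`, so entries in a ConfigMap can override the config file. The sketch below generates such names in plain Python. The option values are illustrative assumptions rather than settings taken from this setup.

```python
def airflow_env_name(section, key):
    """Build the environment variable name Airflow checks for a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def to_env(settings):
    """Turn {(section, key): value} into a flat {ENV_NAME: "value"} mapping."""
    return {airflow_env_name(s, k): str(v) for (s, k), v in settings.items()}


# Illustrative values only; adjust them to your own deployment.
settings = {
    ("core", "executor"): "KubernetesExecutor",
    ("core", "load_examples"): True,
    ("kubernetes", "namespace"): "airflow-example",
}

env = to_env(settings)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

The resulting pairs could go into the `data` section of a ConfigMap, while anything sensitive, such as the database connection string, belongs in a Kubernetes Secret instead.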
Airflow Kubernetes Setup: Building the Docker Image
- Step 1: The core step in building the Docker image is a pip install of Airflow together with its Kubernetes extra:
    RUN pip install --upgrade pip
    # Pin the version on the extras install too, so pip cannot pull in a different Airflow release
    RUN pip install 'apache-airflow[kubernetes]==1.10.10'
- Step 2: You also need a script that starts either the webserver or the scheduler, depending on the argument passed to the container. You can use a bootstrap.sh file for this:
    #!/bin/sh
    # Start the Airflow component named by the first argument
    if [ "$1" = "webserver" ]
    then
        exec airflow webserver
    fi
    if [ "$1" = "scheduler" ]
    then
        exec airflow scheduler
    fi
- Step 3: Next, add the script to the Dockerfile and make it the entrypoint:
    COPY bootstrap.sh /bootstrap.sh
    RUN chmod +x /bootstrap.sh
    ENTRYPOINT ["/bootstrap.sh"]
- Step 4: You can then build and push the image with the following commands:
    docker build -t <image-repo-url:tag> .
    docker push <image-repo-url:tag>
Airflow Kubernetes Setup: Deploying Airflow on Kubernetes
- Step 1: You can deploy the Airflow pods in either of the following two modes:
  - Use Git to pull the DAGs from a repository.
  - Use a persistent volume to store the DAGs.
- Step 2: Now, to set up the pods, run a deploy.sh script that performs the following operations:
  - Delete existing deployments and pods if present within your namespace.
  - Render the templatized configuration under templates into Kube config files under build.
  - Create new deployments, pods, and any other Kube resources.
Here’s the code snippet for the same:
    export IMAGE=<IMAGE REPOSITORY URL>
    export TAG=<IMAGE_TAG>
    cd airflow-kube-setup/scripts/kube
    ./deploy.sh -d persistent_mode
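The article does not show the contents of deploy.sh, so the following is only a sketch of what its template-rendering operation could look like. It assumes `${NAME}` placeholders and a templates/build directory layout, both of which are hypothetical, and it uses Python's standard `string.Template` rather than whatever the real script uses.

```python
import tempfile
from pathlib import Path
from string import Template


def render_templates(template_dir, build_dir, values):
    """Render every *.yaml file under template_dir into build_dir."""
    build_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(template_dir.glob("*.yaml")):
        # substitute() raises KeyError if a placeholder has no value,
        # so a missing IMAGE or TAG fails loudly instead of deploying garbage.
        text = Template(src.read_text()).substitute(values)
        dest = build_dir / src.name
        dest.write_text(text)
        written.append(dest)
    return written


# Demo in a throwaway directory with one hypothetical template file.
with tempfile.TemporaryDirectory() as tmp:
    templates = Path(tmp) / "templates"
    templates.mkdir()
    (templates / "airflow.yaml").write_text("image: ${IMAGE}:${TAG}\n")
    outputs = render_templates(
        templates,
        Path(tmp) / "build",
        {"IMAGE": "example.com/airflow", "TAG": "1.10.10"},
    )
    rendered = outputs[0].read_text()

print(rendered)
```

The rendered files would then be handed to kubectl to create the resources.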
Airflow Kubernetes Setup: Deployment Verification
This Airflow Kubernetes setup copies all the example DAGs into the DAGs folder by default. You can run one of them to check that everything is working.
- Step 1: Extract the Airflow URL by running kubectl get services.
- Step 2: Next, log in to Airflow with the username airflow and the password airflow. You can change these values in airflow-test-init.sh.
- Step 3: Once you are logged in, pick one of the DAGs listed.
- Step 4: Open up your terminal and run kubectl get pods --watch to observe when the worker pods are created.
- Step 5: Click on the Trigger DAG option to trigger one of the jobs. You can then see the tasks running in the Graph view. In your terminal, you can also see new pods being created and then shut down once their tasks complete.
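If you would rather check the result programmatically than watch the terminal, you can summarize the JSON that `kubectl get pods -o json` prints. The sketch below works on a hand-written sample so that it runs without a cluster, and the pod names in it are hypothetical.

```python
import json
from collections import Counter


def summarize_pods(pod_list_json):
    """Count pods by phase from the output of `kubectl get pods -o json`."""
    items = json.loads(pod_list_json).get("items", [])
    return Counter(item["status"]["phase"] for item in items)


# Hand-written sample in the shape of a Kubernetes pod list.
sample = json.dumps({
    "kind": "List",
    "items": [
        {"metadata": {"name": "airflow-main"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "example-task-one"}, "status": {"phase": "Succeeded"}},
        {"metadata": {"name": "example-task-two"}, "status": {"phase": "Running"}},
    ],
})

counts = summarize_pods(sample)
print(dict(counts))
```

Against a real cluster, the input could come from running `kubectl get pods -n airflow-example -o json` and passing its standard output to `summarize_pods`.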
Airflow Kubernetes Setup: Modification and Maintenance
- Step 1: Now that you’ve deployed Airflow on Kubernetes, you no longer need to run the deploy script every single time. You can use basic kubectl commands to watch pods, read their logs, or open a shell inside a container. Here’s the code snippet for the same:
    kubectl get pods --watch
    kubectl logs <POD_NAME> -c <CONTAINER_NAME>
    kubectl exec -it <POD_NAME> --container webserver -- /bin/bash
This blog talks about Airflow Kubernetes Operator and configuration in detail. It also gives a brief introduction to the key features of Kubernetes and Airflow before diving into the Airflow Kubernetes Configuration setup.
Extracting complex data from a diverse set of data sources to carry out an insightful analysis can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 100+ Data Sources, including Databases and SaaS applications, into a destination of your choice or a Data Warehouse to be visualized in a BI tool. While Airflow provides a good solution for ETL, Hevo takes away all the complexity of hard coding and maintaining the pipelines required with Airflow. Hevo is fully automated and hence does not require you to code.
Want to take Hevo for a spin? Sign up and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.