Apache Airflow is a popular open-source platform for workflow management. It allows you to create workflows using standard Python, making it possible for anyone with Python knowledge to deploy a workflow. This is an improvement over platforms that rely on the command line or XML for workflow deployments. Airflow provides several simple operators that let you execute tasks on cloud platforms like AWS, GCP, and Azure, among others. Airflow uses Directed Acyclic Graphs (DAGs) to orchestrate workflows. DAGs can run on external triggers or on a schedule (hourly, daily, etc.). Tasks are defined in Python, and Airflow manages their execution and scheduling.
In this article, we will walk through the user interface of the web view of Apache Airflow and understand the important items. Before getting started, let’s have a look at the prerequisites.
You need to install Apache Airflow on your machine; installation can be more involved on a Windows or Mac machine. Please go through the Quickstart Guide for the steps. Once you are done, open the web view (http://localhost:8080) and follow along.
Note: If you are running Airflow in Docker on a Windows machine, make sure to define AIRFLOW_UID in a .env file in the same folder as your docker-compose.yaml file. Leaving AIRFLOW_UID undefined only produces a warning, but it can lead to unhealthy workers in some cases.
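Following the note above, a one-line command is enough to create the file. This is a sketch based on Airflow's official Docker quickstart, which suggests using your host user id (or a fixed value such as 50000, the default user of the official image):

```shell
# Write the host user id into a .env file next to docker-compose.yaml,
# so files created inside the containers are owned by your host user.
# On Windows/WSL, a fixed value like AIRFLOW_UID=50000 also works.
echo "AIRFLOW_UID=$(id -u)" > .env
cat .env
```

After this, `docker compose up` should start without the AIRFLOW_UID warning.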
What is Airflow?
Apache Airflow is an open-source workflow automation and scheduling platform for programmatically authoring, scheduling, and monitoring workflows. Organizations use Airflow to orchestrate complex computational workflows, create data processing pipelines, and perform ETL processes. Airflow uses DAGs (Directed Acyclic Graphs) to construct workflows; each DAG contains nodes and connectors, and nodes connect to other nodes via connectors to form a dependency tree.
Key Features of Apache Airflow
- Dynamic Integration: Airflow uses Python as its backend programming language to generate dynamic pipelines. Several operators, hooks, and connectors are available to create DAGs and tie them together into workflows.
- Extensible: As an open-source platform, Airflow lets users define their own custom operators, executors, and hooks. You can also extend the libraries so that they fit the level of abstraction that suits your environment.
- Elegant: Airflow pipelines are lean and explicit, and parameterizing your scripts is straightforward thanks to the built-in Jinja templating engine.
- Scalable: Airflow is designed to scale. You can define as many dependent workflows as you want, and Airflow uses a message queue to orchestrate an arbitrary number of workers.
A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate data from 100+ data sources (including 40+ Free Data Sources) to a destination of your choice in real-time in an effortless manner. Hevo, with its minimal learning curve, can be set up in just a few minutes, allowing users to load data without having to compromise performance. Its strong integration with a wide range of sources provides users with the flexibility to bring in data of different kinds, in a smooth fashion, without having to code a single line.
Check Out Some of the Cool Features of Hevo:
- Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
- Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
- Connectors: Hevo supports 100+ Integrations to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Amazon Redshift, Firebolt, Snowflake Data Warehouses; Databricks, Amazon S3 Data Lakes; MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL databases to name a few.
- 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources, that can help you scale your data infrastructure as required.
- 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Simplify your Data Analysis with Hevo today!
Understanding Airflow User Interface
Here are the components present in the Airflow User Interface:
Airflow User Interface Components: DAGs
As soon as you open the web view and sign in, you will be greeted by the Airflow User Interface that looks like the following:
By default, the first menu item (DAGs) is selected. It, as the name suggests, lists all the DAGs, active and paused. Let’s examine the row corresponding to one DAG:
As you can see, there is a toggle switch to pause or activate the DAG, followed by the name of the DAG (with its tags underneath) and its owner. Next comes the status of all previous DAG runs; each circle represents one status, which you can reveal by hovering over it:
As you can see, this particular DAG had 2 successful runs and no queued, running, or failed runs. Next comes scheduling information: the DAG's schedule, its last run time, and its next run time. After that comes the status of tasks from recent DAG runs. Here too, each circle represents one state (queued, scheduled, skipped, etc.); there are more task states, hence more circles. Again, you can reveal the state corresponding to a circle by hovering over it. Finally, the Action buttons let you Run or Delete a DAG.
If you click on a DAG, you will see a screen like the one below:
As you can see, the Tree view has opened up, showing the DAG and its individual tasks. The columnar view beside the tree represents DAG runs (each column corresponds to one run), and each square in a column represents a task. The color-code legend is provided at the top. Items with black borders represent scheduled runs, whereas those without borders represent manually triggered runs.
The Graph view helps visualize the tasks and dependencies in your DAG, along with their current status for a specific run (which you can choose from a dropdown).
The border around each node in the graph indicates the status of that task in that particular run (with the color legend provided at the top). If you click on a task, you can see more details (including logs and a historical view of instances) and perform some actions on it. For example, you can mark a task as successful, thereby allowing downstream dependent tasks to run. You can mark not only the current task instance as successful, but also past/future instances, or upstream/downstream tasks.
The Calendar view gives a broader picture of your DAG's performance over time. The several gradients of green and red (as seen in the legend) let you determine the fraction of successful/failed runs on a particular day.
As the name suggests, the Gantt view shows a Gantt chart displaying the duration of each task in a particular run (the run can be selected from a dropdown).
The Code view displays the Python code used to deploy the DAG. The code cannot be edited here, but it provides insight into what is happening in the DAG.
Some other DAG views are briefly discussed below:
- Task Duration: This shows a line chart of the time taken by each task to execute. The X-axis represents the time of the DAG run.
- Task Tries: Again, a line chart showing the number of tries for each task. The X-axis, again, represents the time of the DAG run.
- Landing Times: Again, a line chart. As described by Airflow’s author, it is the number of hours between the job completion time and the time when the job should have started. See this StackOverflow thread.
- Details: As the name suggests, shows details related to the DAG (scheduled interval, concurrency, etc.)
Airflow User Interface Components: Security
Airflow follows a role-based access control (RBAC) system for managing users. The Security tab essentially helps you review and manage RBAC.
Below are the various options within this tab of the Airflow User Interface:
- List Users: View and manage users and their roles.
- List Roles: View and manage the roles and the permissions associated with them.
- User’s Statistics: See login-related statistics of users.
- Base Permissions: View a list of base permissions (like can_read, can_edit, etc.).
- Views/Menus: View a list of all the views and menu items in the Airflow User Interface.
- Permissions on Views/Menus: View a list of permissions on each view/menu. For example, the Admin role has 3 permissions on the DAGs menu.
Airflow User Interface Components: Browse
This tab provides additional views related to DAGs, jobs, and tasks. It also provides details related to the Service Level Agreement (SLA) misses. You can view logs (list of events that occurred in your environment), triggers, and rescheduled tasks. You can also see a graphical view of the cross-DAG dependencies in the ‘DAG Dependencies’ tab.
Airflow User Interface Components: Admin
As the name suggests, this tab is for all the administrator-related stuff. It isn’t specific to a DAG. Here’s a guide to the options within this tab:
- Variables: Helps you manage Airflow variables.
- Configurations: Shows contents of the airflow.cfg file, unless disabled by the admin.
- Connections: This shows all the Airflow connections stored in your environment.
- Plugins: View the plugins defined in your Airflow environment.
- Providers: Lists the provider packages in your environment, which enable third-party integrations (Google, Amazon, HTTP, Sendgrid, MySQL, etc.).
- Pools: This allows you to view/manage pools.
- XComs: Shows a list of all XComs (cross communications) and allows you to delete them.
Airflow User Interface Components: Docs
As you would have guessed, this tab provides links to external resources, including the Airflow website, GitHub repo, and API reference.
The Airflow Web UI can get a little overwhelming if you are just starting. This article will hopefully help you navigate through it, and get your tasks done faster. We saw the different tabs in the Airflow Web UI and explored the different views within the DAG tab in detail.
Extracting complex data from a diverse set of data sources to carry out an insightful analysis can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from Databases or SaaS applications into your Data Warehouse or a tool to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.
Want to take Hevo for a spin?
Sign up and experience the feature-rich Hevo suite firsthand. You can also have a look at our unbeatable pricing, which will help you choose the right plan for your business needs.
I hope you liked this article. Thanks for reading.