Metrics for Airflow Monitoring and Best Practices

February 21st, 2022

Apache Airflow is a well-known Open-Source Workflow Management application. It allows you to create workflows using standard Python, allowing anyone with a minimal understanding of the language to deploy one. This is a significant improvement over earlier platforms, which relied on the Command Line or XML to deploy workflows. Airflow includes a set of ready-made Operators that allow you to run your processes on Cloud platforms like AWS, Google Cloud Platform, Azure, and others. Airflow orchestrates workflows using Directed Acyclic Graphs (DAGs). DAGs can be run on a schedule (hourly, daily, etc.) or via external triggers. Airflow controls the execution and scheduling of the tasks, which are written in Python.

Apache Airflow Monitoring has recently gained a lot of traction in enterprises that deal with significant amounts of Data Collection and Processing, making Airflow one of the most widely used data monitoring solutions.

In this article, you will get to know everything about Airflow Monitoring and understand the important terms and mechanisms related to Airflow Monitoring.

What is Apache Airflow?

Apache Airflow is an Open-source process automation and scheduling application that allows you to programmatically author, schedule, and monitor workflows. In organizations, Airflow is used to organize complex computing operations, create Data Processing Pipelines, and run ETL processes. The workflow is made up of nodes and connectors in Apache Airflow’s DAG (Directed Acyclic Graph). Connecting nodes with connectors forms a Dependency Tree.

Apache Airflow is a workflow authoring, scheduling, and monitoring application. It’s one of the most reliable systems that Data Engineers employ for orchestrating processes or Pipelines. You can quickly see the dependencies, progress, logs, code, triggered tasks, and success status of your Data Pipelines.

Key Features of Apache Airflow

  • Versatile: Users can design their own unique Operators, Executors, and Hooks because Airflow is an open-source platform. You can also tailor the libraries to your specific needs by changing the level of abstraction.
  • User Interface: Airflow’s pipelines are defined with Jinja templating, resulting in lean and expressive pipeline code, and its web UI lets you inspect them at a glance. Parameterizing your scripts in Airflow is a simple process.
  • Dynamic Integration: Airflow uses Python as the backend programming language for creating dynamic pipelines. To generate DAG and connect it to form processes, you can use a variety of operators, hooks, and connectors.
  • Scalable: Airflow has been designed to scale endlessly. You can set up as many dependent workflows as you want. Airflow creates a message queue to orchestrate an arbitrary number of workers.

To get further information on Apache Airflow, check out the official website here.

What are Callbacks?

Acting on changes in the state of a single task, or across all tasks in a DAG, via Task Callbacks is an important part of Airflow Monitoring and Logging. For example, you might want to be notified when particular jobs fail, or have the last task in your DAG trigger a callback when it succeeds.

There are four types of events that can trigger a Callback:

Name                   Description
-------------------    ------------------------------------------
on_success_callback    Invoked when the task succeeds
on_failure_callback    Invoked when the task fails
sla_miss_callback      Invoked when a task misses its defined SLA
on_retry_callback      Invoked when the task is up for retry

Simplify ETL Using Hevo’s No-code Data Pipeline

Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up Data Integration for 100+ Data Sources (including 40+ Free sources) and will let you directly load data from sources to a Data Warehouse or the Destination of your choice. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data. 

Get Started with Hevo for Free

Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer. 
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Connectors: Hevo supports 100+ Integrations to SaaS platforms, FTP/SFTP, Files, Databases, BI tools, and Native REST API & Webhooks Connectors. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake, and Firebolt Data Warehouses; Amazon S3 Data Lakes; Databricks; and MySQL, SQL Server, TokuDB, MongoDB, and PostgreSQL Databases, to name a few.
  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
  • Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-Day Free Trial!

Understanding the Basics of Apache Airflow Monitoring

1) Logging

The majority of Airflow’s Runs are planned without the need for Manual Intervention. This emphasizes the need for keeping track of what transpired during a run. Fortunately, Airflow has some excellent logging features built in. This allows developers to pinpoint the source of problems in their environments.

All logging in Airflow is implemented with Python’s standard logging package. By default, Airflow writes log files from the Webserver, Scheduler, and Workers executing tasks to the local filesystem. This means that when a user opens a log file in the Web UI, the action issues a GET request to retrieve its contents. For Cloud installations, Airflow has community-contributed handlers for logging to Cloud storage on AWS, Google Cloud, and Azure.

To access your logs using Airflow’s UI, go to the Tree View and click the “View Log” button on the task you’re interested in.

[Image: Task log view in the Airflow UI]
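Because Airflow’s logging is the standard Python logging package, messages you emit from your own task code land in the same per-task log files shown in the UI. A minimal sketch (the transform function is illustrative):

```python
import logging

# Task code runs under the "airflow.task" logger hierarchy, so messages
# logged here end up in the per-task log file shown in the Web UI.
log = logging.getLogger("airflow.task")

def transform(rows):
    log.info("Transforming %d rows", len(rows))
    return [r.upper() for r in rows]
```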

2) Lineage

In terms of Airflow features, Data Lineage is quite new. However, a lot of recent work has gone into improving Lineage Support and making it a lot easier to use. This tool can assist you in tracking data’s origins, what happens to it, and where it goes over time. This provides an Audit Trail, as well as the ability to assess Airflow’s adherence to your Data Governance policies and debug Data flows.

When you have a lot of data to read and write into storage, this capability comes in handy. Each job requires the user to define the input and output Data Sources, and Apache Atlas generates a graph displaying the link between the various Data Sources. However, it can be inconvenient to use. Airflow’s documentation can be used to learn more about how it operates. There are some Third-party tools (including ours) that make integrating this capability straightforward if you don’t want all the problems.

3) User Interface

A) Tree View

The Tree View allows you to drill down into a specific DAG at a lower level. You can see how the DAG’s tasks are organized and the status of each connected run. This allows you to track the state of Runs and Tasks over time. The Graph View is a little easier on the eyes, but the Tree View is recommended since you can examine multiple Runs at once and discover problems more quickly.

[Image: Tree View in the Airflow UI]

While this View is useful, it gets tough to manage when the DAG is large and there are numerous Runs, especially when the Status colors and boundaries are difficult to distinguish.

To make things a little easier, the TaskInstance and DagRun state colors in Airflow’s Webserver can be customized.

To do so, create an airflow_local_settings.py file and place it on $PYTHONPATH or inside the $AIRFLOW_HOME/config folder (NOTE: $AIRFLOW_HOME/config is added to $PYTHONPATH when Airflow is installed). After that, according to Airflow’s instructions, you need to add the following to the airflow_local_settings.py file:

STATE_COLORS = {
  "queued": 'darkgray',
  "running": '#01FF70',
  "success": '#2ECC40',
  "failed": 'firebrick',
  "up_for_retry": 'yellow',
  "up_for_reschedule": 'turquoise',
  "upstream_failed": 'orange',
  "skipped": 'darkorchid',
  "scheduled": 'tan',
}

You can change the colors to suit your team’s tastes. Simply restart the Webserver to see the modifications take effect.

B) DAGs View

The DAGs View displays a list of DAGs in your environment as well as shortcuts to other built-in Airflow Monitoring tools. Here you can see the names of your DAGs, their owners, the statuses of recently conducted runs and tasks, and some fast actions.

[Image: DAGs View in the Airflow UI]

If you’re dealing with many teams in this setting, you should tag the pipeline to make Airflow Monitoring easier. In your DAG file, you can add Team Tags to the DAG Object. Here’s an example of what that could look like, according to Airflow’s documentation:

dag = DAG(
  dag_id='example_dag_tag',
  schedule_interval='0 0 * * *',
  tags=['example']
)

This filter is then stored as a cookie, and you can apply it in the DAGs View by typing the tag into the filter input form at the top.

[Image: Filtering DAGs by tag in the DAGs View]

C) Code View

While the code for your pipeline technically lives in source control, you can examine the user code through Airflow’s UI, which can help you uncover logic flaws provided you have the appropriate context.

[Image: Code View in the Airflow UI]

How to Check the Health Status in Airflow?

HTTP Checks and CLI Checks are two methods for checking the health of components in Airflow. Due to the nature of the component being checked and the technologies used to monitor the deployment, only some checks are accessible over HTTP.

For example, when running on Kubernetes, use a Liveness Probe (livenessProbe property) with CLI checks; the Scheduler deployment then restarts the process when the check fails. For the Webserver, you can use the Health Check Endpoint to configure the Readiness Probe (readinessProbe property).
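The Webserver’s /health endpoint returns a small JSON document reporting the status of the metadatabase and the scheduler. A sketch of evaluating it (the helper name is ours, and the sample payload reflects the typical shape, not live output):

```python
import json

def airflow_is_healthy(body: str) -> bool:
    # /health returns e.g. {"metadatabase": {"status": "healthy"},
    #                       "scheduler": {"status": "healthy", ...}}
    health = json.loads(body)
    return all(component.get("status") == "healthy" for component in health.values())

# In a readiness probe you would fetch http://<webserver>:8080/health
# and feed the response body to this check.
```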

For further information on Health Status, check out the official documentation here.

What are the available Airflow Monitoring Services?

Airflow Monitoring services from prominent providers strive to boost pipeline performance. The following are basic Airflow Monitoring services that cover a variety of critical factors:

  • Management: Management monitoring services provide high-level management and Airflow Monitoring of DAG Nodes, Servers, and Scheduler Logs, ensuring that all Airflow pipelines are operational.
  • Tracking: Airflow allows users to keep track of their information. All of the crucial details about the data, such as its origins and how it is handled, are continuously tracked.
  • Sensors: Sensors in Airflow allow users to trigger a job based on a pre-condition that the user must set.
  • Security: Airflow services aim to provide customers with a high-security platform that does not necessitate the deployment of any additional security software.

What are the Types of Airflow Monitoring?

1) Basic Monitoring

A BI development team’s primary priority should be the Accuracy and Performance of its Airflow jobs (DagRuns). Airflow provides excellent support for basic job monitoring:

  • Missed SLAs: Airflow can send an email with all SLA misses for a given scheduling interval.
  • EmailOperator: can send out emails when a specified point in a DAG is reached.
  • Notifications: can be sent to popular online platforms such as Slack or HipChat.
  • Airflow Dashboard: can be manually refreshed to see which jobs have just succeeded or failed.
  • PythonOperator: can inspect a task’s output, make better-informed decisions based on the context in which it runs, and branch to a specific path of execution for notification purposes.

This form of Airflow Monitoring is particularly beneficial for tracking execution problems, but in some circumstances, such as Data Quality Monitoring, more advanced methods are required. This is where Datadog and other excellent services like it come in handy.

2) Data Quality Monitoring

The number of orders that are closed within an interval in online retail is not constant across all intervals. There are typically some peaks between breakfast and when people first arrive at work, a minor decline over lunchtime, and a much larger surge after they get home, when they make purchases based on email brochures they received or suggestions from coworkers.

Let’s use a hypothetical dataset and a 1-hour period for ingesting orders. You may find substantial changes in the number of orders actually ingested per hour.

When you use a simple threshold rule with a minimum and maximum threshold, it’s clear that the allowable error band for any particular day might be rather large; in fact, it may completely defeat the purpose of having one in the first place. 

In the case of highly variable data such as this, when you didn’t ingest as much data from a source system as usual but still want to put a very tight error boundary around it to detect potential data quality issues, Airflow offers an integration with Datadog: the DatadogHook, available as of version 1.8. Datadog is a helpful service that can receive all types of Metrics from whichever source system you choose, including an Airflow system set up to execute ETL.

You can use the hook by creating a PythonOperator that calls a Python function.
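A sketch of that pattern, assuming the Datadog provider package is installed; the metric name and connection id below are illustrative. The countable part is a plain function, and the hook call that ships the value to Datadog is shown in comments:

```python
def orders_ingested(records):
    # The value we want to track per scheduling interval.
    return len(records)

# Inside the callable of a PythonOperator you would then send it along:
# from airflow.providers.datadog.hooks.datadog import DatadogHook
# hook = DatadogHook(datadog_conn_id="datadog_default")
# hook.send_metric("etl.orders_ingested", orders_ingested(rows), tags=["env:prod"])
```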

Datadog is fantastic at detecting data quality concerns and showing data trends on a dashboard of your choosing. 

It can help you in Post Mortems, which are used to determine the true cause of specific incidents. As part of this process, you can capture everything about the incident in a post-mortem document, which is a wonderful way to convey the full impact of ETL Failures, Data Quality Concerns, and other issues. If you use Datadog services to track data quality, there’s a tool called “Notebooks” that can help you supplement these post-mortem documents.

What are the Benefits of an Airflow Monitoring Dashboard?

  • You’d have a dashboard where you could examine operational and dataset metadata Performance Metrics and Trends. 
  • You’d be able to create elaborate warnings based on these patterns, allowing you to anticipate potential SLA Failures. 
  • Your Airflow Monitoring capabilities would be extensible and efficient if you could centralize your Metadata, Metric Visualization, Logs, and Alerts in one place.

The ideal Airflow Monitoring Dashboard would address the three challenges listed below (discussed in detail later in this article):

  • No Data Awareness
  • Limited Monitoring and Alerting Capabilities
  • Complex Integration with Operational Workflows

How to set up an Airflow Monitoring System?

1) Prometheus

Prometheus is a popular tool for storing Metrics and alerting on them. You’ll use Prometheus as the main storage for your Metrics because it’s often used to collect Metrics from other sources like RDBMSes and Web servers. Because Airflow doesn’t have a native Prometheus integration, you’ll use the Prometheus StatsD Exporter to collect Metrics and convert them into a Prometheus-readable format. It bridges StatsD and Prometheus by converting StatsD Metrics into Prometheus Metrics using custom mapping rules.

You’ll need to create two files to set up the statsd-exporter: prometheus.yml and mapping.yml, which you can find in our GitHub repository linked below. Save these files in a new folder called “.prometheus” in your Airflow directory. After you’ve copied the two files into the folder, open an Ubuntu terminal and type the following command:

docker run --name=prom-statsd-exporter \
  -p 9123:9102 \
  -p 8125:8125/udp \
  -v $PWD/mapping.yml:/tmp/mapping.yml \
  prom/statsd-exporter \
    --statsd.mapping-config=/tmp/mapping.yml \
    --statsd.listen-udp=:8125 \
    --web.listen-address=:9102

You’re good to go if you see the following line in the output:

level=info ts=2021-10-04T04:38:30.408Z caller=main.go:358 msg="Accepting Prometheus Requests" addr=:9102
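As an aside, here is an illustration of the kind of rule mapping.yml contains (a generic example, not the exact file from the repository): the exporter matches dotted StatsD names emitted by Airflow and relabels them into Prometheus metrics.

```yaml
mappings:
  # Collapse per-DAG duration metrics like "airflow.dagrun.duration.success.my_dag"
  # into one Prometheus metric carrying a dag_id label.
  - match: "airflow.dagrun.duration.success.*"
    name: "airflow_dagrun_duration_success"
    labels:
      dag_id: "$1"
```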

Keep this terminal open and type a new command in a new Ubuntu terminal:

docker run --name=prometheus \
  -p 9090:9090 \
  -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --log.level=debug \
    --web.listen-address=:9090

If you see the following line in your output, your Prometheus has been correctly set up:

level=info ts=2021-10-02T21:09:59.717Z caller=main.go:794 msg="Server is ready to receive web requests."

2) StatsD

StatsD is a popular tool for gathering and aggregating Metrics from a variety of sources, and Airflow includes built-in support for shipping its Metrics to a StatsD server. Once configured, Airflow will send Metrics to the StatsD server, where you will be able to visualize them. To enable StatsD, open the airflow.cfg file and look for the [metrics] section. The following variables will be visible:

[metrics]
statsd_on = False
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow

Set statsd_on = True. You also have the option of changing the port on which the Metrics are emitted. Now launch a DAG in Airflow and type the following command in another terminal:

nc -l -u localhost 8125

This command simply returns the Airflow Metrics that StatsD has exposed and captured. Since Metrics are collected at regular intervals, the output will be continuous.
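What the nc command prints are lines in the StatsD wire format, name:value|type. A small sketch that emits one such datagram the same way a StatsD client does (the metric name and helper are illustrative):

```python
import socket

def send_statsd_counter(name, value, host="127.0.0.1", port=8125):
    # StatsD line protocol: "<metric>:<value>|c" for a counter
    # (gauges use |g, timers |ms).
    payload = f"{name}:{value}|c".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is fire-and-forget, which is why StatsD adds so little overhead.
        sock.sendto(payload, (host, port))
    return payload
```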

3) Grafana

Grafana is one of the best visualization tools for Metrics. It has native Prometheus support, which you can use to set up a Cluster Airflow Monitoring Dashboard. Here you can see all the important Metrics such as Scheduler Heartbeat, Dagbag Size, Number of Queued/Running Tasks, and currently running DAGs aggregated by task. You can also see DAG-related Metrics such as Successful DAG Execution Period, Unsuccessful DAG Execution Period, DAG Execution Dependency-Check Time, and DAG Execution Schedule Delay. To set up Grafana on your system, enter the following command in your Ubuntu terminal:

docker run -d --name=grafana -p 3000:3000 grafana/grafana

After a successful setup, you can visit http://localhost:3000/ to access the Grafana Dashboard. On the dashboard, open the Configuration menu (the gear icon below the + icon) to create a new Data Source and select Prometheus. In the URL section, type:

host.docker.internal:9090

Then create a new Dashboard and give it any name you like. You will be taken to the Edit Panel.

The Metrics Browser allows you to enter the Metrics you want to monitor. To check the Scheduler Heartbeat, try entering the following query:

rate(airflow{host="airflowStatsD", 
instance="host.docker.internal:9123", job="airflow", metric="scheduler_heartbeat"}[1m])

A Chart (in this case a Time Series) is generated for the above query; the rate is computed over a 1-minute window (indicated by [1m]).

You can also format the chart via the panel options by giving it an X-axis name and a Title.

What are some metrics you should monitor?

The following is a list of some of the most critical areas to keep an eye on while Airflow Monitoring, which can also help you troubleshoot and detect resource bottlenecks:

  • Health Checks for the Scheduler, Webserver, Workers, and other custom processes to see if they’re up and running
  • How long they stay online
  • How many Workers are active
  • Whether your customized Metrics and settings are reflected in the emitted Metrics
  • DAG Parsing Time and the number of active DAGs
  • Trends in Pool Utilization
  • Trends in Job Execution Status (started/completed)
  • Trends in Executor Task Status (running/queued/open slots)
  • Operator-by-operator Execution Status (success/failure)
  • Status of Task Instances (successes/failures)
  • Trends in time spent on critical Tasks and Sensors
  • The average time DAGs take to reach an end state
  • DAGs that regularly miss their Schedules
  • Trends in time DAGs spend on Dependency Checks

It’s critical to keep track of these KPIs at various levels, including the overall organization, individual tasks, and the DAG level. You should also keep track of your individual operators and tasks that you believe are more likely to fail or require more resources.

Challenges in Airflow Monitoring

Apache Airflow excels at orchestration, which is exactly what it was designed to do. However, without some finessing, Airflow Monitoring can be difficult to handle.

The truth is that keeping track of Airflow can be time-consuming. When something goes wrong, you’re thrown back and forth between Airflow’s UI, Operational Dashboards, Python code, and pages of logs (it’s advisable to have more than one monitor to manage all of this). That’s why, in Airflow’s 2020 user survey, “logging, monitoring, and alerting” tied for second place. Keeping track of Airflow is difficult mostly because of three interconnected reasons:

1) No Data Awareness

Airflow knows your Data Pipelines intimately. It understands all there is to know about your Tasks, including their statuses and how long they take to complete. It is aware of the Execution Process. However, it has no knowledge of the data flowing through your DAGs.

There are a lot of things that can go wrong with your data that aren’t shown by execution metadata.

  • If, for whatever reason, your data source fails to supply any data, the Airflow Webserver’s UI would be all green, but your data consumers’ warehouse would be full of stale data.
  • If data is supplied but one or more columns are blank, Airflow will claim that everything is fine, but your data consumers are working with incomplete data.
  • If the data is complete but a transformation behaves unexpectedly, the job would not fail, but erroneous data will be delivered.
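A minimal guard against the first two failure modes, written as a plain check you could run at the end of an ingestion task (the thresholds and column names are hypothetical):

```python
def check_ingestion(rows, required_columns):
    # Fail loudly instead of letting stale or incomplete data pass silently.
    if not rows:
        raise ValueError("Source delivered zero rows; downstream data would go stale")
    for column in required_columns:
        if all(row.get(column) in (None, "") for row in rows):
            raise ValueError(f"Column {column!r} is entirely blank")
    return len(rows)
```

Raising inside the task turns a silent data problem into a visible task failure, which Airflow’s existing alerting (callbacks, emails) can then pick up.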

You might be able to set some alerts based on Run and Task duration to help you know when something is wrong. However, you wouldn’t have the flexibility needed to cover all of your blind spots, and you’d still have to spend time determining the source of the problem. This leads us to the following point.

2) Limited Monitoring and Alerting Capabilities

Airflow is ideal for Task Orchestration, as previously indicated. That’s exactly why it was created. Understandably, the Airflow community does not place a high priority on developing a full-featured Airflow Monitoring and observability solution. It strays too far from the project’s intended scope. However, it couldn’t be completely devoid of Airflow Monitoring capabilities, so there are some surrounding the Pipeline and Tasks themselves.

Airflow gives you a high-level view of your Operational Metadata, such as Run and Task States, as well as the ability to set up simple alerting and get logs. While this is excellent, it lacks the data context described in the first point. As a result, you’ll need to create Operational Dashboards to visualize Metrics over time and see how your data evolves. To pull metadata about your datasets, you’ll need to add data quality listeners to your DAGs (Deequ, Great Expectations, Cluster Policies, Callbacks, and so on). Once you have metadata and trends to work with, you can generate Custom Alerts. That brings us to our last point.

3) Complex Integration with Operational Workflows

You now have a lot of moving parts just to keep track of your Airflow environments. You have Email Alerts, the Airflow UI’s operational metadata and logs, and different Dashboards for Metrics Reporting. If your Airflow environments are limited in scope, this approach may work for you, but if you’re working with hundreds of DAGs across several teams, it’s an issue.

You won’t be able to see the health of your Airflow environments through a single pane of glass. Different teams will use different Dashboards, and Notifications that don’t reach your organization’s preferred channel may go ignored. Because of this Operational Debt, it will be difficult for your engineers to detect issues early and prioritize fixes before your data SLAs are breached.

Conclusion

This article has given you an understanding of Apache Airflow and its key features, along with a deep understanding of Airflow Monitoring. You are now ready to start monitoring your Airflow Tasks. In case you want to integrate Data into your desired Database/destination, then Hevo Data is the right choice for you!

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage Data transfer between a variety of sources and destinations with a few clicks. Hevo with its strong integration with 100+ sources & BI tools allows you to not only export Data from your desired Data sources & load it to the destination of your choice, but also transform & enrich your Data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools. 

While Airflow is a good solution for Data Integration, it requires a lot of Engineering Bandwidth & Expertise. This can be challenging, resource-intensive & costly in the long run. Hevo offers a much simpler, scalable, and economical solution that allows people to create Data Pipelines without any code in minutes & without depending on Engineering teams.

Want to take Hevo for a spin? Sign Up for a 14-day free trial. You may also have a look at the pricing, which will assist you in selecting the best plan for your requirements.

Share your experience of understanding the concept of Airflow Monitoring in the comment section below!
