In today’s data-driven world, enterprises extensively use data pipelines to enable quick data exploration for essential business insights.
Pipelines built around extract, load, and transform (ELT) processes pose a challenge to businesses. A single error can lead to problems such as low data quality, reduced consumer confidence, and tedious maintenance.
To mitigate such challenges, companies can use tools like Apache Airflow to monitor, schedule, and stop data pipelines.
In this article, you will learn how to stop or kill Airflow tasks via the user interface of Apache Airflow.
Prerequisites
Understanding of data pipelines and workflows.
What are DAGs?
DAG stands for directed acyclic graph, a term from computer science and mathematics. It is a directed graph without any cycles: if you start at any node and follow the edges, it is impossible to return to that same node.
Because every edge has a direction, you can move along it only one way. DAGs lend themselves to topological sorting, which arranges the nodes so that each node comes before the nodes its edges point to.
For example, a spreadsheet can be modeled as a DAG: each cell is a vertex, and an edge connects two cells when the formula in one cell references the other. Directed acyclic graphs are widely used in circuit design, Bayesian networks, compiler design, and scheduling.
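The spreadsheet example can be sketched in a few lines of Python. This is only an illustration, assuming Python 3.9+ for the standard-library graphlib module; the cell names and formulas are hypothetical and not taken from any real workbook.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical spreadsheet: each cell maps to the cells its formula references.
cell_dependencies = {
    "A1": set(),         # constant value
    "A2": set(),         # constant value
    "B1": {"A1", "A2"},  # =A1+A2
    "C1": {"B1"},        # =B1*2
}

# static_order() yields each cell only after every cell it depends on.
evaluation_order = list(TopologicalSorter(cell_dependencies).static_order())
print(evaluation_order)  # B1 comes after A1 and A2, and C1 comes last


def is_acyclic(dependencies):
    """Return True if the dependency mapping contains no cycle."""
    try:
        list(TopologicalSorter(dependencies).static_order())
        return True
    except CycleError:
        return False


# A circular reference (A1 -> B1 -> A1) is not a DAG and cannot be ordered.
print(is_acyclic({"A1": {"B1"}, "B1": {"A1"}}))  # False
```

A scheduler can use the same idea to decide the order in which dependent units of work should run.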
In Apache Airflow, DAGs are used to plan, design, structure, and implement complex workflows. A DAG is a collection of the tasks you want to execute, organized in a way that reflects the relationships and dependencies between them.
DAGs are defined in Python code and placed in the DAG_FOLDER. Apache Airflow executes the code in each file to build DAG objects dynamically. There is no upper limit: you can create as many DAGs as you want, each with an arbitrary number of tasks. Ideally, however, each DAG should represent a single logical data pipeline or workflow.
What are Airflow Tasks?
In Apache Airflow, a task is the basic unit of execution, and tasks are arranged into DAGs. Upstream and downstream dependencies set between tasks decide the order in which they are executed.
An upstream task is one that must reach a particular state before its dependent task can be executed. In contrast, a downstream task is one that cannot be executed until its upstream task reaches that state.
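The upstream and downstream relationship can be modeled with a small, plain-Python sketch. It does not use Airflow itself; the task names, the state strings, and the runnable_tasks helper are hypothetical, and the rule it applies (a task waits until all of its upstream tasks have succeeded) is just one possible policy.

```python
# Hypothetical pipeline: each task maps to the upstream tasks it depends on.
upstream = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}


def runnable_tasks(upstream, states):
    """Return tasks that have no state yet and whose upstream tasks all succeeded."""
    return [
        task
        for task, parents in upstream.items()
        if states.get(task) is None
        and all(states.get(parent) == "success" for parent in parents)
    ]


print(runnable_tasks(upstream, {}))                      # ['extract']
print(runnable_tasks(upstream, {"extract": "success"}))  # ['transform']
print(runnable_tasks(upstream, {"extract": "failed"}))   # []
```

In the last call nothing is runnable, because the downstream tasks are still waiting on an upstream task that never reached the required state.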
There is no limit on the number of tasks that can be added to a single DAG. Users can set concurrency limits, such as the number of concurrent DAG runs for a particular DAG and the maximum number of tasks that run in parallel.
There are three types of Apache Airflow tasks: operators, sensors, and TaskFlow-decorated functions. Operators are predefined task templates that can be combined to build most DAGs, while sensors are a special subclass of operators that are entirely about waiting for an external event to happen.
A TaskFlow-decorated @task, on the other hand, is a custom Python function packaged as a task. Internally, all of these are subclasses of BaseOperator; operators and sensors are ready-made templates that you can call in a DAG file.
How to Stop or Kill Airflow Tasks?
Apache Airflow has no concept of data input or output; it focuses primarily on the flow of work. Through code and its visual interface, you can manage workflow processes, trigger and schedule tasks, and check their current status.
It is also helpful in systematically creating parts of machine learning workflows. The code can be reused for various models or datasets to solve complex problems and get meaningful insights.
Apache Airflow tasks are structured in the form of DAGs, but there are some scenarios where you might need to stop or kill the tasks in a DAG. There are two ways to do so from the UI:
- Using DAGs Screen
- Setting the Airflow Task to a Failed State
Method 1: Using DAGs Screen
- Go to the DAGs screen, where you can see the currently running tasks.
- Click on the running icon under the Recent Tasks section.
- Airflow will automatically run the search query with the appropriate filters for the selected DAG ID and state. Alternatively, you can do this manually by going to Task Instances under the Browse tab.
- The task will be displayed on the task instances screen.
- Select the task you want to stop and delete it.
Method 2: Setting the Airflow Task to a Failed State
- Select the task you want to kill.
- Mark the task as “Failed”.
- All the subsequent tasks will also be marked as failed.
- Click on the Okay button.
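The same state change can also be requested programmatically instead of through the UI. The sketch below only builds the HTTP request and does not send it. It assumes an Airflow 2.x deployment with the stable REST API enabled and uses its updateTaskInstancesState endpoint; the base URL, DAG ID, run ID, and task ID are placeholders, authentication is omitted, and the payload fields should be checked against the API reference for your Airflow version.

```python
import json
from urllib.request import Request


def build_mark_failed_request(base_url, dag_id, dag_run_id, task_id):
    """Build (but do not send) a request that marks one task instance as failed."""
    payload = {
        "dry_run": False,  # True would only report which task instances would change
        "task_id": task_id,
        "dag_run_id": dag_run_id,
        "include_upstream": False,
        "include_downstream": False,
        "include_future": False,
        "include_past": False,
        "new_state": "failed",
    }
    return Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/updateTaskInstancesState",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_mark_failed_request(
    "http://localhost:8080", "example_dag", "manual__2024-01-01", "transform"
)
print(request.get_method(), request.full_url)
```

To actually send it, you would add the credentials your deployment requires and pass the request to urllib.request.urlopen, or use any HTTP client you prefer.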
Limitations
Ideally, tasks and DAGs should not be stopped or deleted in Apache Airflow, because stopping a task once its execution has started is a tedious process. If you stop a DAG and clear the task from the UI, the tasks already running in the executor will not stop. When a task is in the running state, you can click on CLEAR, which calls the job.kill() function on the task. This function sets the task’s status to shut_down, and the task is then moved to up_for_retry immediately.
Conclusion
In this article, you learned how to stop or kill Apache Airflow tasks. Apache Airflow is not an ELT tool itself, but it can manage and organize ELT pipelines via DAGs to help enterprises with the different steps involved in batch jobs. With Apache Airflow, businesses can monitor the progress and current status of their workflows, giving them insight into areas they need to improve.
Hevo Data automates your data transfer process, allowing you to focus on other aspects of your business such as analytics and customer management. Hevo supports 150+ data sources (including 60+ free sources) and 15+ destinations, loading data into a destination for real-time analysis with transparent pricing and hassle-free data replication. Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
FAQs
1. What is a task in Airflow?
A task in Airflow represents a unit of work or operation to be executed, such as running a script or querying a database.
2. What is Airflow used for?
Airflow automates, schedules, and monitors workflows, enabling efficient management of data pipelines and complex processes.
3. What is the difference between operator and task in Airflow?
An operator defines what a task does (e.g., a Python function), while a task is an instance of an operator in a workflow.
Vidhi is a data science enthusiast with two years of experience in the field. She specializes in writing about data, software architecture, and integration, leveraging her profound understanding of these domains to create insightful and tailored content. She stays updated with the latest industry trends and technologies, ensuring her content remains relevant and valuable for her audience. Through her work, she aims to empower data professionals with the knowledge and tools they need to succeed in an ever-evolving landscape.