A step forward from earlier platforms that relied on the command line or XML to deploy workflows, Apache Airflow, a popular open-source workflow management tool, lets users develop workflows in standard Python code.
To work effectively with Apache Airflow, you need to know how Airflow Tasks and Airflow Task Instances work. A Task in Apache Airflow is the most basic unit of execution and is represented as a node in the DAG graph, while an Airflow Task Instance is a particular run of a Task. We'll discuss both in detail later.
In this blog post, we will walk through the basics of Airflow Tasks and dig a little deeper into how Airflow Task Instances work, with examples. But before we continue, here is a brief overview of Apache Airflow. Let's begin.
What is Apache Airflow?
Apache Airflow is an open-source workflow management platform for building and managing data engineering pipelines.
Written in Python and built on standard Python features, Airflow enables its users to schedule data processing for engineering pipelines efficiently. The platform works as a building block, allowing its users to stitch together the modern data stack.
Apache Airflow’s major features are as follows:
- Extensibility: It’s easy to define operators and extend libraries to fit the level of abstraction that suits your business requirements.
- Dynamic in nature: Because pipelines are configured as code, Airflow allows dynamic pipeline generation and lets users restart from the point of failure without rerunning the entire workflow.
- Sleek in design: Airflow pipelines are straightforward and easy to maintain, and the rich scheduling semantics enable users to run pipelines regularly.
- Scalable as the business grows: Having a modular design, Airflow provides a general-purpose orchestration framework with a manageable set of features to learn.
What are Airflow Tasks?
In its documentation, Apache Airflow defines a Task as “a unit of work within a DAG.” So, when we look at a DAG graph, the nodes actually represent Tasks.
A Task is written in Python and typically represents the execution of an Operator. Each Task instantiates an operator with specific values for that particular piece of work. For instance, a “PythonOperator” runs a Python callable, while a “BashOperator” runs a Bash command.
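As a rough sketch of what this looks like in practice (the DAG name, task IDs, and function below are illustrative, not from the original post), the two operators could be instantiated like this:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def print_hello():
    # Arbitrary Python callable to run as a task
    print("hello from a PythonOperator")

with DAG('operator_examples', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # PythonOperator executes the callable defined above
    hello_task = PythonOperator(task_id='print_hello', python_callable=print_hello)
    # BashOperator executes a Bash command
    echo_task = BashOperator(task_id='echo_hi', bash_command='echo "hi from a BashOperator"')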
And, if we dig deeper into the hows and the whys, we also come across the concept of “Relations between Tasks,” which simply describes whether one task is upstream or downstream of another. Consider the code example given below:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    task_1 = DummyOperator(task_id='task_1')
    task_2 = DummyOperator(task_id='task_2')
    task_1 >> task_2  # Define dependencies
The code above defines a DAG with two tasks and a dependency from task_1 to task_2. Logically, we can say task_1 is upstream of task_2, and task_2 is downstream of task_1.
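The bitshift operators are just shorthand; the same dependency can be declared in several equivalent ways. A minimal sketch, reusing task_1 and task_2 from above (any one of these lines declares the same relationship):

# task_1 runs before task_2, expressed four equivalent ways
task_1 >> task_2
task_2 << task_1
task_1.set_downstream(task_2)
task_2.set_upstream(task_1)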
Concept Refresh: A DAG (Directed Acyclic Graph) is a representation of tasks and their dependencies, showing the execution flow from one Task to another. A DAG Run is an instantiation of the DAG at a point in time. Every DAG Run is tied to a specific execution date, while the DAG itself might or might not have a schedule.
Types of Task
There are three basic kinds of Task:
- A TaskFlow-decorated @task, which is a custom Python function packaged up as a Task.
- Operators, predefined task templates that you can string together quickly to build most parts of your DAGs.
- Sensors, a special subclass of Operators which are entirely about waiting for an external event to happen.
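A minimal sketch showing all three kinds side by side (the DAG name, task IDs, file path, and function here are hypothetical, purely for illustration):

from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG('task_kinds_demo', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:

    @task
    def extract():
        # TaskFlow-decorated @task: a plain Python function packaged as a Task
        return {"rows": 42}

    # Operator: a predefined task template
    notify = BashOperator(task_id='notify', bash_command='echo "extraction done"')

    # Sensor: a special Operator that waits for an external event,
    # here the appearance of a file (the path is hypothetical)
    wait_for_file = FileSensor(task_id='wait_for_file', filepath='/tmp/input.csv')

    wait_for_file >> extract() >> notify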
What are Airflow Task Instances?
An Airflow Task Instance represents a specific run of a Task and is identified by the combination of a DAG, a Task, and a point in time (the execution date). Each Task Instance also has a state that indicates where it is in its lifecycle.
Task Instance Lifecycle
A Task Instance moves through various states during its lifecycle, such as running, success, failed, and skipped. The full set of states is listed below:
| State | Description |
| --- | --- |
| none | No dependencies have been met yet, and the Task has not been queued for execution. |
| scheduled | The scheduler has determined that the Task’s dependencies are met and it should run. |
| queued | The Task has been assigned to an Executor and is awaiting a worker. |
| running | The Task is being executed on a worker (or a local/synchronous executor). |
| success | The Task completed successfully and without faults. |
| shutdown | The Task was externally requested to shut down while it was running. |
| restarting | The Task was externally requested to restart while it was running. |
| failed | The Task encountered an error during execution and was unable to complete. |
| skipped | The Task was skipped because of branching, LatestOnly, or something similar. |
| upstream_failed | An upstream task failed, and the Trigger Rule required it to succeed. |
| up_for_retry | The Task failed but has retries remaining, so it will be rescheduled. |
| up_for_reschedule | The Task is a Sensor that is in reschedule mode. |
| sensing | The Task is using a Smart Sensor. |
| deferred | The Task has been deferred until a trigger fires. |
| removed | The Task has vanished from the DAG since the run began. |
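Whether a failed Task ends up in failed or up_for_retry depends on the retry settings on its operator. A minimal, hypothetical sketch (the DAG and task IDs are illustrative):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('retry_demo', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # If the command fails, this task instance moves to up_for_retry
    # twice (waiting 5 minutes each time) before it is finally marked failed.
    flaky_task = BashOperator(
        task_id='flaky_task',
        bash_command='exit 1',  # always fails, purely for illustration
        retries=2,
        retry_delay=timedelta(minutes=5),
    )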
To sum up, a Task is defined in a DAG, and both the Task and the DAG are written in Python code. An Airflow Task Instance, on the other hand, is associated with a DAG Run and an execution date. Task Instances are “instantiated” and are the runnable entities.
Now, to better understand Airflow Task Instances, let’s look at an example. Consider the sample DAG below (the same one used earlier):
with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    task_1 = DummyOperator(task_id='task_1')
    task_2 = DummyOperator(task_id='task_2')
    task_1 >> task_2  # Define dependencies
When the DAG is enabled, the scheduler responds by creating DAG Runs. For the sample code above, DAG Runs will be created for every date from the start date up to the current date, with execution dates (execution_date) of 2016-01-01, 2016-01-02, and so on.
Concept Refresh: The scheduler creates DAG Runs at the scheduled interval and instantiates the Airflow Task Instances for each DAG Run.
For each DAG Run, an instance of task_1 and an instance of task_2 will be present. The execution date (execution_date) of each instance is determined by the execution date of that particular DAG Run.
Note: Every task_2 will be downstream of task_1.
On a side note, we can also think of task_1 for 2016-01-01 as the logical predecessor, or upstream, of task_1 for 2016-01-02, and of the DAG Run for 2016-01-01 as upstream of the DAG Run for 2016-01-02.
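This “previous run” relationship can even be enforced: setting depends_on_past=True on a task makes each task instance wait for the same task’s instance in the previous DAG Run to succeed. A minimal sketch, with a hypothetical DAG and task ID:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('daily_load', start_date=datetime(2016, 1, 1), schedule_interval='@daily') as dag:
    # The 2016-01-02 instance of this task will not run until
    # its 2016-01-01 instance has succeeded.
    load = BashOperator(
        task_id='load',
        bash_command='echo "loading for {{ ds }}"',
        depends_on_past=True,
    )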
Example on How to Create a Task Instance
To create a task instance, you must first define a task within a DAG (Directed Acyclic Graph). Below is an example using the BashOperator to execute a simple bash command:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Define the DAG the task will belong to
define_dag = DAG(
    'example_dag',
    default_args={'start_date': datetime(2021, 1, 1)},
    schedule_interval=None
)

# A task instance of this task is created for each run of the DAG
task_instance = BashOperator(
    task_id='example_task',
    bash_command='echo "Hello from Airflow!"',
    dag=define_dag
)
Now, to execute a task instance, you can use the Airflow CLI command airflow tasks test. This runs a single task instance of a DAG for a given execution date without recording state in the metadata database.
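For the example_dag defined above, the invocation would look roughly like this (the date is just an illustrative execution date):

airflow tasks test example_dag example_task 2021-01-01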
Conclusion
In this blog post, we covered the basics of Apache Airflow Task Instances and also talked about Airflow Tasks and how they work. To dig deeper, you can click on the topics given below; they will refresh your concepts and provide a deeper understanding of Airflow Tasks, Airflow Task Instances, DAGs, and DAG Runs.
- About Airflow Tasks
- About Airflow Task Instances
- 2 Easy Ways to Trigger Airflow DAGs in Apache Airflow
- How to Stop or Kill Airflow Tasks
Hevo can help you integrate data from 150+ sources, such as SaaS applications and databases, and load it into a destination like your Redshift Data Warehouse to be analyzed in real time and visualized in a BI tool. It will make your life easier and data migration hassle-free, and it’s user-friendly, reliable, and secure.
Sign up for a 14-day free trial and see the difference!
Frequently Asked Questions
1. What is a task instance in Airflow?
A task instance in Airflow represents a specific run of a task defined in a Directed Acyclic Graph (DAG) for a particular execution date. It contains information about the task’s state (e.g., success, failure, running), start and end times, and any logs or metadata associated with that execution.
2. What are Airflow instances?
Airflow instances generally refer to the operational components of an Airflow deployment, which include:
- DAG instances: Instances of the defined DAGs that represent scheduled runs.
- Task instances: Specific runs of tasks within those DAGs.
- Scheduler: The component responsible for scheduling the tasks.
- Workers: The instances executing the tasks.
3. How to run a task in Airflow?
Trigger it manually from the Airflow UI by selecting the DAG and choosing the task, then clicking on “Run” for the desired execution date.
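From the command line, you can also trigger the whole DAG and let the scheduler run its tasks; a minimal sketch using the hypothetical example_dag from earlier:

airflow dags trigger example_dag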
Yash is a Content Marketing professional with over three years of experience in data-driven marketing campaigns. He has expertise in strategic thinking, integrated marketing, and customer acquisition. Through comprehensive marketing communications and innovative digital strategies, he has driven growth for startups and established brands.