Apache Airflow Task Instances: A Complete 101 Guide

February 23rd, 2022

A step forward from the previous platforms that rely on the Command Line or XML to deploy workflows, Apache Airflow — a popular open-source workflow management tool — allows users to develop workflows using standard Python code.

To understand Apache Airflow well, you need to know how Airflow Tasks and Airflow Task Instances work. A Task is the most basic unit of execution in Apache Airflow and is represented as a node in the DAG graph. An Airflow Task Instance, in turn, is a particular run of a Task. We’ll discuss both in detail later.

In this blog post, we will go through the basics of Airflow Tasks and dig a little deeper into how Airflow Task Instances work, with examples. But before we continue, let’s briefly learn more about Apache Airflow.

What is Apache Airflow?

Apache Airflow is an open-source workflow management platform/tool to manage data engineering pipelines.

Written in Python and configured with standard Python code, Airflow enables its users to schedule data processing for engineering pipelines efficiently. The Airflow platform works as a building block, allowing its users to stitch together the modern data stack.

Apache Airflow’s major features are as follows:

  • Extensibility: It’s easy to define your own operators and extend libraries to fit the level of abstraction that suits your business requirements.
  • Dynamic in nature: Pipelines are configured as code, which allows dynamic pipeline generation. Failed tasks can also be rerun from the point of failure without restarting the entire workflow.
  • Sleek in design: Airflow pipelines are straightforward and easy to maintain, and the rich scheduling semantics enable users to run pipelines regularly.
  • Scalable as the business grows: Having a modular design, Airflow provides a general-purpose orchestration framework with a manageable set of features to learn.

What are Airflow Tasks?

In its documentation, Apache Airflow defines a Task as “a unit of work within a DAG.” So, if we look at a DAG graph, the nodes represent Tasks.

A Task is written in Python and represents the execution of an Operator. Each Task instantiates an operator and supplies the values for that operator’s parameters. For instance, a “PythonOperator” runs Python code, and a “BashOperator” runs a Bash command.

If we dig deeper into the hows and the whys, we also come across the concept of “Relations between Tasks,” which simply describes whether a task is upstream or downstream of another. Consider the code example given below:

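from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    task_1 = DummyOperator(task_id='task_1')  # task_id must be passed by keyword
    task_2 = DummyOperator(task_id='task_2')
    task_1 >> task_2  # Define dependencies: task_1 runs before task_2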

The code above represents a DAG with two tasks and one dependency, which runs from task_1 to task_2. Hence, we can say that task_1 is upstream of task_2, and task_2 is downstream of task_1.
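To make the upstream/downstream idea concrete, here is a minimal sketch in plain Python of how a `>>` operator can record a dependency between two tasks. `ToyTask` and its attributes are hypothetical names invented for this illustration; this is not Airflow's implementation, only a toy model of the relationship described above.

```python
class ToyTask:
    """Hypothetical stand-in for an operator; not part of Airflow."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream_ids = set()
        self.downstream_ids = set()

    def __rshift__(self, other):
        # self >> other means: self must finish before other starts.
        self.downstream_ids.add(other.task_id)
        other.upstream_ids.add(self.task_id)
        # Returning the right-hand task allows chaining: a >> b >> c
        return other


task_1 = ToyTask("task_1")
task_2 = ToyTask("task_2")
task_1 >> task_2

print(task_1.downstream_ids)  # tasks that run after task_1
print(task_2.upstream_ids)    # tasks that must run before task_2
```

Storing the relationship on both sides means either task can answer "what runs before me?" or "what runs after me?" without scanning the whole graph.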

Concept Refresh: A DAG (Directed Acyclic Graph) is a representation of tasks arranged by their dependencies. Each Task is shown in the diagram as a node, with the execution flow running from one Task to another. A DAG Run is an instantiation of the DAG at a point in time. Every DAG Run has an execution date, but the DAG itself might or might not have a schedule.

What are Airflow Task Instances?

An Airflow Task Instance represents “a specific run of a Task” and is characterized by the combination of “a DAG, a task, and a point in time.” Each Airflow Task Instance also has a state that indicates which stage of its lifecycle it is in, such as running, success, failed, or skipped. The possible states are listed below:

  • none: The Task has not yet been queued for execution because its dependencies have not yet been met.
  • scheduled: The scheduler has determined that the Task’s dependencies are met and that it should run.
  • queued: The Task has been assigned to an Executor and is awaiting a worker.
  • running: The Task is being executed on a worker (or on a local/synchronous executor).
  • success: The Task finished running without errors.
  • shutdown: The Task was externally requested to shut down while it was running.
  • restarting: The Task was externally requested to restart while it was running.
  • failed: The Task encountered an error during execution and was unable to complete.
  • skipped: The Task was skipped because of branching, LatestOnly, or something similar.
  • upstream_failed: An upstream Task failed, and the Trigger Rule says that it was required.
  • up_for_retry: The Task failed but still has retry attempts left, so it will be rescheduled.
  • up_for_reschedule: The Task is a Sensor that is in reschedule mode.
  • sensing: The Task is a Smart Sensor.
  • deferred: The Task has been deferred to a trigger.
  • removed: The Task has vanished from the DAG since the run began.
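The states above can be read as a small state machine. The following sketch, in plain Python, walks the ideal path from none to success and shows how a task that raised an error lands in either up_for_retry or failed. `ToyState`, `HAPPY_PATH`, `walk_happy_path`, and `state_after_error` are hypothetical names used only for this illustration, and the retry rule is a simplified assumption (the first attempt does not count as a retry); Airflow's scheduler handles many more cases.

```python
from enum import Enum


class ToyState(Enum):
    """Hypothetical enum mirroring a few of the state names listed above."""
    NONE = "none"
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    UP_FOR_RETRY = "up_for_retry"


# Ideal flow: each state maps to the state that follows it.
HAPPY_PATH = {
    ToyState.NONE: ToyState.SCHEDULED,
    ToyState.SCHEDULED: ToyState.QUEUED,
    ToyState.QUEUED: ToyState.RUNNING,
    ToyState.RUNNING: ToyState.SUCCESS,
}


def walk_happy_path():
    """Return the state names visited when nothing goes wrong."""
    state = ToyState.NONE
    visited = [state.value]
    while state in HAPPY_PATH:
        state = HAPPY_PATH[state]
        visited.append(state.value)
    return visited


def state_after_error(attempts_made, max_retries):
    """Pick the state for a task whose latest attempt raised an error.

    Assumption: max_retries retries allow max_retries + 1 attempts in total.
    """
    if attempts_made <= max_retries:
        return ToyState.UP_FOR_RETRY
    return ToyState.FAILED


print(walk_happy_path())
print(state_after_error(attempts_made=1, max_retries=2).value)
print(state_after_error(attempts_made=3, max_retries=2).value)
```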

To sum up, a Task is defined in a DAG, and both the Task and the DAG are written in Python code. An Airflow Task Instance, on the other hand, is associated with a DAG Run and an execution date. Airflow Task Instances are “instantiated” and are runnable entities.

Now, to better understand Airflow Task Instances, let’s take a look at an example. Go through the sample DAG definition given below:

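from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    task_1 = DummyOperator(task_id='task_1')  # task_id must be passed by keyword
    task_2 = DummyOperator(task_id='task_2')
    task_1 >> task_2  # Define dependencies: task_1 runs before task_2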

When a DAG is enabled, the scheduler responds by creating DAG Runs. For the sample code above, which relies on the default daily schedule, DAG Runs will be created for every day from the start date up to the current date. The execution dates (execution_date) will be 2016-01-01, 2016-01-02, ..., 2022-02-23.
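As a rough illustration of that sequence of execution dates, the sketch below enumerates one date per day between a start date and an end date using only the standard library. `daily_execution_dates` is a hypothetical helper written for this example, and the daily interval and inclusive end date are assumptions taken from the description above; the real scheduler decides which runs to create.

```python
from datetime import date, timedelta


def daily_execution_dates(start, end):
    """Return one date per day from start to end, both inclusive."""
    total_days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(total_days + 1)]


dates = daily_execution_dates(date(2016, 1, 1), date(2022, 2, 23))
print(dates[0], dates[1], dates[-1])
print(len(dates), "DAG Runs in this toy model")
```

Each date in the list stands for one DAG Run of my_dag in this simplified model.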

Concept Refresh: The scheduler creates a DAG Run at each scheduled interval and creates the Airflow Task Instances that belong to that DAG Run.

Each DAG Run will contain one instance of task_1 and one instance of task_2. The execution date (execution_date) of each instance is the execution date of its DAG Run.

Note: In every DAG Run, task_2 will be downstream of task_1.

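On a side note, we can also look at task_1 for 2016-01-01 as the logical previous instance of task_1 for 2016-01-02, and at the DAG Run for 2016-01-01 as the previous run of the DAG Run for 2016-01-02. Airflow's documentation calls this relationship previous and next rather than upstream and downstream, which describe dependencies between different tasks inside one DAG Run.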
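The relationship between DAG Runs and Task Instances can be sketched with plain tuples. In the example below, every task instance is identified by a (DAG, task, execution date) triple, and the "previous" instance of a task is the same task one day earlier. The names `task_instances` and `previous_instance` are hypothetical and exist only for this illustration, and the one-day step assumes a daily schedule; Airflow stores task instances in its metadata database rather than in a list.

```python
from datetime import date, timedelta
from itertools import product

dag_id = "my_dag"
task_ids = ["task_1", "task_2"]
# Three daily execution dates, one per toy DAG Run.
execution_dates = [date(2016, 1, 1) + timedelta(days=i) for i in range(3)]

# One toy task instance per (execution date, task) pair.
task_instances = [
    (dag_id, task_id, execution_date)
    for execution_date, task_id in product(execution_dates, task_ids)
]


def previous_instance(instance):
    """Return the same task's key one day earlier, or None if there is none."""
    dag, task, execution_date = instance
    candidate = (dag, task, execution_date - timedelta(days=1))
    return candidate if candidate in task_instances else None


for instance in task_instances:
    print(instance, "previous:", previous_instance(instance))
```

Two tasks and three DAG Runs give six task instances, and only the instances of the first run have no previous instance.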

Conclusion

In this blog post, we learned the basics of Apache Airflow Task Instances and also talked about Airflow Tasks and how they work. To better understand Airflow Task Instances, you can explore the topics given below. They will refresh your concepts and provide a deeper understanding of Airflow Tasks, Airflow Task Instances, DAGs, and DAG Runs.

  1. About Airflow Tasks
  2. About Airflow Task Instances
  3. 2 Easy Ways to Trigger Airflow DAGs in Apache Airflow