DAG scheduling may seem difficult at first, but it really isn’t. If you are a data professional who wants to learn more about data scheduling and how to run a DAG in Airflow, you’re in the right place. But why do you need to trigger Airflow DAGs in the first place?

The answer lies in processing the correct data at the right time. Airflow gives users the flexibility to schedule their DAG Runs, and with the help of the Airflow Scheduler, users can ensure that future as well as past DAG Runs are performed at the right time and in the correct order.

In this guide, we will discuss, in detail, the concepts of scheduling and DAG Runs in Airflow, and then examine two methods to trigger Airflow DAGs. Let’s begin.

Two Methods to Trigger Airflow DAGs

This section of the blog post will discuss the two ways available to trigger Airflow DAGs:

  1. Trigger Airflow DAGs on a Schedule
  2. Trigger Airflow DAGs Manually

Trigger Airflow DAGs on a Schedule

To trigger Airflow DAGs on a schedule, you first need to specify the “start_date” and “schedule_interval” parameters. Second, you need to upload the DAG file to your environment.

To specify the scheduling parameters, set the “schedule_interval” parameter to define how often the DAG runs, and set the “start_date” parameter to define when Airflow should begin scheduling the DAG’s tasks.

Note: Airflow schedules DAG Runs based on the minimum start_date for the tasks in your DAG, at the interval defined by the “schedule_interval” argument of the DAG.

Look at the scheduled DAG example given below. It shows a DAG that runs periodically: every hour, starting from 15:00 on April 5, 2021. Given these parameters, Airflow schedules the first DAG Run at 16:00 on April 5, 2021, once the first hourly interval has elapsed.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2 import path (use airflow.operators.bash_operator on Airflow 1.10)

with DAG(
    dag_id='example_dag_schedule',
    # At 15:00 on 5 April, 2021
    start_date=datetime(2021, 4, 5, 15, 0),
    # At minute 0 of every hour
    schedule_interval='0 * * * *') as dag:

    # Output the current date and time
    t1 = BashOperator(
        task_id='date',
        bash_command='date')

Trigger Airflow DAGs Manually

When you trigger an Airflow DAG manually, Airflow creates a DAG Run outside the regular schedule. That means you can independently trigger a DAG that already runs on its own schedule, and Airflow will execute it right away without affecting the scheduled runs. To trigger a DAG manually, optionally with a run configuration, you can use the following two methods:

  1. Trigger Airflow DAGs using the Airflow UI, and
  2. Trigger Airflow DAGs using the Google Cloud CLI (gcloud)

Method 1: Trigger Airflow DAGs manually using the Airflow UI in Google Cloud Composer:

Step 1: In the Google Cloud console, open the Environments page.

Step 2: In the Airflow webserver column, click the Airflow link for your environment.

Step 3: On the Airflow web interface, find your DAG on the DAGs page. In the Links column for the DAG, click the “Trigger Dag” button.

Step 4: Optionally, specify a DAG run configuration (see the sketch after these steps).

Step 5: Click the “Trigger” button.
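If you provide a run configuration in Step 4, your tasks can read it at runtime through the dag_run object in the task context. The snippet below is a minimal sketch, assuming Airflow 2 import paths; the DAG id and the target_date key are hypothetical names used only for illustration.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def print_run_conf(**context):
    # dag_run.conf holds the JSON passed when the run was triggered
    # (an empty dict if no configuration was supplied).
    conf = context["dag_run"].conf or {}
    # 'target_date' is a hypothetical key used only for this illustration.
    print(f"target_date: {conf.get('target_date', 'not provided')}")


with DAG(
    dag_id='example_dag_with_run_conf',   # hypothetical DAG id
    start_date=datetime(2021, 4, 5),
    schedule_interval=None,               # triggered manually only
) as dag:

    PythonOperator(
        task_id='print_run_conf',
        python_callable=print_run_conf,
    )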

Method 2: Trigger Airflow DAGs manually using the Google Cloud CLI (gcloud):

For Airflow 1.10.*, the CLI command, “trigger_dag” is used to trigger the DAG Run. Look at the code given below:

  gcloud composer environments run ENVIRONMENT_NAME \
    --location LOCATION \
    trigger_dag -- DAG_ID

For Airflow 2, the CLI command, “dags trigger” is used to trigger the DAG Run. Look at the code given below:

  gcloud composer environments run ENVIRONMENT_NAME \
    --location LOCATION \
    dags trigger -- DAG_ID

Note: Replace the following placeholders:

  1. “ENVIRONMENT_NAME” with the name of the environment.
  2. “LOCATION” with the Compute Engine region where the environment is located.
  3. “DAG_ID” with the name of the DAG.

External Triggers

DAG Runs can also be created manually through the CLI by running the airflow dags trigger command (airflow trigger_dag on Airflow 1.10), where you can define a specific run_id. DAG Runs created externally to the scheduler are associated with the trigger’s timestamp and are displayed in the UI alongside scheduled DAG Runs.
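Beyond the CLI, externally triggered runs can also be created programmatically. The sketch below is a hedged example, assuming an Airflow 2 deployment with the stable REST API and basic authentication enabled; the host, credentials, DAG id, run_id, and configuration values are placeholders, not real settings.

import requests

# Placeholder values; adjust for your own deployment.
AIRFLOW_HOST = "http://localhost:8080"
DAG_ID = "example_dag_schedule"

# POST /api/v1/dags/{dag_id}/dagRuns creates an externally triggered DAG Run.
response = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),  # basic auth must be enabled for the stable API
    json={
        "dag_run_id": "manual__example_run",    # custom run_id
        "conf": {"target_date": "2022-02-16"},  # optional run configuration
    },
)
response.raise_for_status()
print(response.json())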

What To Keep in Mind

  • The first DAG Run is created on the basis of the minimum start_date for the tasks in your DAG.
  • The scheduler process creates subsequent DAG Runs on the basis of your DAG’s schedule_interval, sequentially.
  • Keep the DAG Run’s state in mind when clearing a set of tasks’ states in the hope of getting them to re-run, as the run’s state defines whether the scheduler should look into triggering tasks for that run.

To unblock tasks, you can:

  • Clear (as in delete the status of) individual task instances from the task instances dialog in the UI. You can choose whether to include the past/future and the upstream/downstream dependencies. A confirmation window shows the set of task instances you are about to clear. All task instances associated with the DAG can also be cleared.
  • The CLI command airflow clear -h lists many options for clearing task instance states, including specifying date ranges, targeting task_ids with a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (failed or success).
  • Clearing a task instance does not delete the task instance record. Instead, it updates max_tries and sets the current task instance state to None.
  • The UI can be used to mark task instances as successful. This is usually done to fix false negatives, or, for instance, when the fix has been applied outside of Airflow.
  • The airflow backfill CLI subcommand has a --mark_success flag and allows selecting subsections of the DAG as well as specifying date ranges.

Trigger Rules

By default, Airflow waits for all of a task’s upstream (direct parent) tasks to succeed before it runs that task.

However, this is just the default behavior, and you can pass the trigger_rule argument to a task to control it. A short example follows the list below. The options for trigger_rule are:

  • all_success (default): All upstream tasks have succeeded
  • all_failed: All upstream tasks are in the state of failed or upstream_failed.
  • all_done: All upstream tasks are done with their execution
  • all_skipped: All upstream tasks are in a skipped state
  • one_failed: Failure of at least one upstream task (does not wait for all upstream tasks to be done)
  • one_success: At least one upstream task has succeeded (does not wait for all upstream tasks to be done)
  • one_done: At least one upstream task succeeded or failed
  • none_failed: All upstream tasks have not failed or upstream_failed – that is, all upstream tasks have succeeded or been skipped
  • none_failed_min_one_success: At least one upstream task has succeeded and all upstream tasks have not failed or upstream_failed.
  • none_skipped: No upstream task is in a skipped state – that is, all upstream tasks are in success, failed, or upstream_failed state
  • always: No dependencies at all,  run this task at any time
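As referenced above, here is a minimal sketch of the trigger_rule argument in use, assuming Airflow 2 import paths; the DAG id, task names, and commands are made up for this illustration. The cleanup task uses the all_done rule, so it runs once both upstream tasks finish, even though one of them deliberately fails:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id='example_trigger_rule',      # hypothetical DAG id
    start_date=datetime(2021, 4, 5),
    schedule_interval='@daily',
) as dag:

    extract = BashOperator(task_id='extract', bash_command='echo extract')
    # This task fails on purpose to demonstrate the trigger rule.
    transform = BashOperator(task_id='transform', bash_command='exit 1')

    # Runs once both upstream tasks are done, regardless of success or failure.
    cleanup = BashOperator(
        task_id='cleanup',
        bash_command='echo cleanup',
        trigger_rule=TriggerRule.ALL_DONE,  # the plain string 'all_done' also works
    )

    [extract, transform] >> cleanup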

The Concept of Scheduling in Airflow

Scheduling is one of the key features of Apache Airflow: it lets developers schedule tasks and have run instances created for a DAG at a set interval. The Airflow scheduler’s job is to monitor and stay in sync with all DAG objects, use a repetitive process to store DAG parsing results, and continuously check for active tasks to be triggered.

But how does Scheduling work in Airflow?

For example, if you run a DAG with a schedule_interval of one day and the data interval starts on 2022-02-16, the run will be triggered soon after 2022-02-16T23:59. In other words, the run instance is triggered once its scheduled period has ended.
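As a minimal sketch of that behavior, assuming Airflow 2 import paths (the DAG id is made up for illustration), a daily DAG with a start_date of 2022-02-16 gets its first run only after that day’s interval has fully elapsed:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='example_daily_schedule',    # hypothetical DAG id
    start_date=datetime(2022, 2, 16),
    schedule_interval='@daily',
) as dag:

    # The run whose logical date is 2022-02-16 is triggered shortly after
    # 2022-02-16T23:59, i.e. once that day's interval has ended.
    t1 = BashOperator(task_id='date', bash_command='date')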

The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To start it, run the command given below; “airflow.cfg” contains all the configuration options you need to know about.

airflow scheduler

The Concept of DAG Runs in Airflow

DAG stands for Directed Acyclic Graph. A DAG is a representation of tasks arranged in a hierarchy, showing each task and the flow of execution from one task to another.

On the other hand, a DAG Run works as an extension — or as defined in the Apache documentation, “an instantiation” — of the DAG in time.

A DAG may or may not have a schedule, which informs how its DAG Runs are created. By default, the schedule is set through the “schedule_interval” DAG argument, which can be treated as a cron expression; a usage sketch follows the table. Some cron presets are as follows:

preset     | meaning                                                         | cron
None       | Don’t schedule; use for exclusively “externally triggered” DAGs |
@once      | Schedule once and only once                                     |
@hourly    | Run once an hour at the beginning of the hour                   | 0 * * * *
@daily     | Run once a day at midnight                                      | 0 0 * * *
@weekly    | Run once a week at midnight on Sunday morning                   | 0 0 * * 0
@monthly   | Run once a month at midnight of the first day of the month     | 0 0 1 * *
@quarterly | Run once a quarter at midnight on the first day                 | 0 0 1 */3 *
@yearly    | Run once a year at midnight of January 1                        | 0 0 1 1 *
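As referenced above, here is a minimal sketch showing how these values are supplied to a DAG, assuming Airflow 2 import paths; the DAG id is made up for illustration. A cron preset and its equivalent cron expression are interchangeable values for schedule_interval:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='example_cron_preset',        # hypothetical DAG id
    start_date=datetime(2022, 2, 16),
    schedule_interval='@daily',          # cron preset
    # schedule_interval='0 0 * * *',     # equivalent cron expression
) as dag:

    # Output the current date and time once per scheduled interval
    t1 = BashOperator(task_id='date', bash_command='date')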

Conclusion

In this guide, we learned in detail about the different ways to trigger Airflow DAGs, and we also covered the basics of scheduling, triggers, and DAG Runs. If you want to get more familiar with the basics of scheduling data pipeline processes in Airflow, the two articles below can help:

  1. Scheduling & Triggers
  2. DAG Runs

Although it’s easy for a trained professional to create workflow pipelines in Apache Airflow, the need of the hour is a solution that helps professionals with a non-technical background, and Hevo Data is here to help!


Want to take Hevo for a spin? Sign up for a 14-day free trial. You may also have a look at our pricing, which will help you select the best plan for your requirements.


Yash Arora
Former Content Manager, Hevo Data

Yash is a Content Marketing professional with experience in data-driven marketing campaigns. He has expertise in strategic thinking, integrated marketing, and customer acquisition, and has driven growth for startups and established brands through comprehensive marketing communications and digital strategies.
