DAG scheduling may seem difficult at first, but it really isn’t. If you are a data professional who wants to learn more about scheduling and how to run a DAG in Airflow, you’re in the right place. But why do you need to trigger Airflow DAGs in the first place?
The answer lies in processing the correct data at the right time. Airflow lets users schedule their DAG Runs flexibly, and the Airflow Scheduler ensures that future as well as past DAG Runs are performed at the right time and in the correct order.
In this guide, we will examine two methods to trigger Airflow DAGs and discuss, in detail, the concepts of scheduling and DAG Runs in Airflow. Let’s begin.
Two Methods to Trigger Airflow DAGs
This section of the blog post will discuss the two ways available to trigger Airflow DAGs:
- Trigger Airflow DAGs on a Schedule
- Trigger Airflow DAGs Manually
Trigger Airflow DAGs on a Schedule
To trigger Airflow DAGs on a schedule, you first need to specify the “start_date” and “schedule_interval” parameters. You then need to upload the DAG file to your environment.
The “schedule_interval” parameter defines how often the DAG runs, and the “start_date” parameter defines when Airflow should start scheduling the DAG’s tasks.
Note: Airflow schedules DAG Runs based on the minimum start date of the DAG’s tasks and on the “schedule_interval” parameter, which is an argument of the DAG.
Look at the example given below. It shows a DAG that runs periodically: every hour, starting from 15:00 on April 5, 2021. Because Airflow triggers a DAG Run only after the interval it covers has ended, the first DAG Run is scheduled at 16:00 on April 5, 2021.
from datetime import datetime

from airflow import DAG
# On Airflow 2, import BashOperator from airflow.operators.bash instead
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id='example_dag_schedule',
    # At 15:00 on 5 April, 2021
    start_date=datetime(2021, 4, 5, 15, 0),
    # At minute 0 of every hour
    schedule_interval='0 * * * *',
) as dag:
    # Output the current date and time
    t1 = BashOperator(
        task_id='date',
        bash_command='date',
        dag=dag)

    t1
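To make the timing concrete, here is a minimal sketch, using only the Python standard library, of how the start date and hourly interval from the DAG above map to run times. It assumes the behaviour described above, where a run is triggered once the interval it covers has ended. The helper name first_runs is made up for illustration and is not part of Airflow.

```python
from datetime import datetime, timedelta


def first_runs(start_date, interval, count):
    """Return (interval_start, trigger_time) pairs for the first `count` runs.

    Illustrative only: each run covers one interval and is triggered
    when that interval ends.
    """
    runs = []
    for i in range(count):
        interval_start = start_date + i * interval
        trigger_time = interval_start + interval
        runs.append((interval_start, trigger_time))
    return runs


# Same values as the DAG above: hourly, starting 15:00 on 5 April 2021
runs = first_runs(datetime(2021, 4, 5, 15, 0), timedelta(hours=1), 3)
for interval_start, trigger_time in runs:
    print(interval_start.isoformat(), "->", trigger_time.isoformat())
```

The first pair shows the interval that starts at 15:00 being triggered at 16:00, which matches the first DAG Run described above.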
Trigger Airflow DAGs Manually
When you trigger an Airflow DAG manually, Airflow performs an additional DAG Run. That means that if you manually trigger a DAG that is already running on its own schedule, Airflow will go ahead and run that DAG without affecting the scheduled DAG Runs. To trigger an Airflow DAG manually, optionally with a run configuration, you can use the following two methods:
- Trigger Airflow DAGs using the Airflow UI, and
- Trigger Airflow DAGs using the gcloud command-line tool
Method 1: Trigger Airflow DAGs manually using the Airflow UI in Google Cloud Console (GCC):
Step 1: In GCC, open the Environments page.
Step 2: In the Airflow webserver column, follow the Airflow link for your environment.
Step 3: In the Airflow web interface, go to the DAGs page. In the Links column for your DAG, click the “Trigger Dag” button.
Step 4: Optionally, specify the DAG Run configuration.
Step 5: Click on the “Trigger” button.
Method 2: Trigger Airflow DAGs manually using the gcloud command-line tool:
For Airflow 1.10.*, the CLI command “trigger_dag” is used to trigger the DAG Run. Look at the command given below:
gcloud composer environments run ENVIRONMENT_NAME \
    --location LOCATION \
    trigger_dag -- DAG_ID
For Airflow 2, the CLI command “dags trigger” is used to trigger the DAG Run. Look at the command given below:
gcloud composer environments run ENVIRONMENT_NAME \
    --location LOCATION \
    dags trigger -- DAG_ID
Note: Replace the placeholders as follows:
- “ENVIRONMENT_NAME” with the name of the environment.
- “LOCATION” with the Compute Engine region where the environment is located.
- “DAG_ID” with the name of the DAG.
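If you need to issue these commands from a script, the two variants can be assembled programmatically. The sketch below is only an illustration: the function name composer_trigger_command and the example values are hypothetical, and it does nothing more than build the argument list shown above.

```python
import shlex


def composer_trigger_command(environment_name, location, dag_id, airflow_major=2):
    """Build the gcloud argument list that triggers a DAG Run."""
    # Airflow 1.10.* uses "trigger_dag"; Airflow 2 uses "dags trigger"
    subcommand = ["trigger_dag"] if airflow_major == 1 else ["dags", "trigger"]
    return [
        "gcloud", "composer", "environments", "run", environment_name,
        "--location", location,
        *subcommand,
        "--", dag_id,
    ]


# Placeholder values, not real resources
cmd = composer_trigger_command("my-environment", "us-central1", "example_dag_schedule")
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string means it can be passed to subprocess.run(cmd, check=True) without any shell quoting.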
External Triggers
DAG Runs can also be created manually through the CLI by running an airflow trigger_dag command (airflow dags trigger in Airflow 2), where you can define a specific run_id. DAG Runs created externally to the scheduler are associated with the trigger’s timestamp and are displayed in the UI alongside scheduled DAG Runs.
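As a rough sketch of what such an external trigger could look like from a script, the helper below builds an Airflow 2 style command with an explicit run_id and an optional run configuration. The function name manual_trigger_args and the run_id naming scheme are assumptions made for this example; the -r and -c short options are the run ID and JSON configuration options as I understand them, so check airflow dags trigger -h on your version.

```python
import json
from datetime import datetime, timezone


def manual_trigger_args(dag_id, conf=None, run_id=None, now=None):
    """Build an `airflow dags trigger` argument list (Airflow 2 syntax)."""
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: tie the run_id to the trigger timestamp
    run_id = run_id or "manual__" + now.isoformat()
    args = ["airflow", "dags", "trigger", "-r", run_id]
    if conf is not None:
        # The run configuration is passed as a JSON string
        args += ["-c", json.dumps(conf)]
    args.append(dag_id)
    return args


fixed_now = datetime(2022, 2, 16, 12, 0, tzinfo=timezone.utc)
args = manual_trigger_args("example_dag_schedule", conf={"source": "manual"}, now=fixed_now)
print(args)
```

On Airflow 1.10.*, the subcommand would be airflow trigger_dag instead of airflow dags trigger.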
What To Keep in Mind
- The first DAG Run is created on the basis of the minimum start_date for the tasks in your DAG.
- The scheduler process creates subsequent DAG Runs sequentially, on the basis of your DAG’s schedule_interval.
- When clearing the state of a set of tasks in the hope of getting them to re-run, keep the DAG Run’s state in mind too, as it defines whether the scheduler should look into triggering tasks for that run.
To unblock tasks, you have several options:
- Clear (that is, delete the status of) individual task instances from the task instances dialog in the UI. You can choose whether to include past/future runs and upstream/downstream dependencies. A confirmation window then shows the set you are about to clear. You can also clear all task instances associated with the DAG.
- Use the CLI command airflow clear -h (airflow tasks clear -h in Airflow 2), which has lots of options for clearing task instance states. These include specifying date ranges, targeting task_ids with a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (failed or success).
- Note that clearing a task instance does not delete the task instance record. Instead, it updates max_tries and sets the current task instance state to None.
- Mark task instances as successful through the UI. This is usually done to fix false negatives or, for instance, when the fix has been applied outside of Airflow.
- Use the airflow backfill CLI subcommand, which has a --mark_success flag and allows selecting subsections of the DAG as well as specifying date ranges.
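The clearing options listed above can be combined in a single command. The following sketch builds such a command in Airflow 2 syntax. The function name clear_command is hypothetical, and the short options used here (-t, -s, -e, -u, -d, -f) are the ones I understand the clear subcommand to accept, so treat them as assumptions and confirm them against the -h output of your version.

```python
def clear_command(dag_id, task_regex=None, start_date=None, end_date=None,
                  upstream=False, downstream=False, only_failed=False):
    """Build an `airflow tasks clear` argument list (Airflow 2 syntax)."""
    args = ["airflow", "tasks", "clear"]
    if task_regex:
        args += ["-t", task_regex]  # match task_ids by regular expression
    if start_date:
        args += ["-s", start_date]  # start of the date range
    if end_date:
        args += ["-e", end_date]    # end of the date range
    if upstream:
        args.append("-u")           # include upstream relatives
    if downstream:
        args.append("-d")           # include downstream relatives
    if only_failed:
        args.append("-f")           # only clear failed task instances
    args.append(dag_id)
    return args


clear_cmd = clear_command(
    "example_dag_schedule",
    task_regex="^date$",
    start_date="2021-04-05",
    end_date="2021-04-06",
    downstream=True,
    only_failed=True,
)
print(" ".join(clear_cmd))
```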
Trigger Rules
By default, Airflow waits for all upstream tasks (direct parents) of a task to be successful before it runs that task.
However, this is just the default behavior, and you can pass the trigger_rule argument to a task to control it. The options for trigger_rule are:
- all_success (default): All upstream tasks have succeeded
- all_failed: All upstream tasks are in a failed or upstream_failed state
- all_done: All upstream tasks are done with their execution
- all_skipped: All upstream tasks are in a skipped state
- one_failed: At least one upstream task has failed (does not wait for all upstream tasks to be done)
- one_success: At least one upstream task has succeeded (does not wait for all upstream tasks to be done)
- one_done: At least one upstream task succeeded or failed
- none_failed: No upstream task has failed or is upstream_failed, that is, all upstream tasks have succeeded or been skipped
- none_failed_min_one_success: No upstream task has failed or is upstream_failed, and at least one upstream task has succeeded
- none_skipped: No upstream task is in a skipped state, that is, all upstream tasks are in a success, failed, or upstream_failed state
- always: No dependencies at all, run this task at any time
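To see how these rules differ in practice, the sketch below evaluates each rule against a list of upstream task states. It is a simplified model written for this article, not Airflow’s own implementation: the function name rule_satisfied is made up, states are plain strings, and one_failed is assumed to count upstream_failed as a failure.

```python
FINISHED = {"success", "failed", "upstream_failed", "skipped"}
FAILED = {"failed", "upstream_failed"}


def rule_satisfied(trigger_rule, upstream_states):
    """Simplified check of a trigger rule against upstream task states."""
    states = list(upstream_states)
    all_done = all(s in FINISHED for s in states)
    any_failed = any(s in FAILED for s in states)  # assumption: upstream_failed counts
    any_success = "success" in states
    checks = {
        "all_success": all(s == "success" for s in states),
        "all_failed": all(s in FAILED for s in states),
        "all_done": all_done,
        "all_skipped": all(s == "skipped" for s in states),
        "one_failed": any_failed,
        "one_success": any_success,
        "one_done": any(s in ("success", "failed") for s in states),
        "none_failed": all_done and not any_failed,
        "none_failed_min_one_success": all_done and not any_failed and any_success,
        "none_skipped": all_done and "skipped" not in states,
        "always": True,
    }
    return checks[trigger_rule]


# A skipped parent blocks the default rule but not none_failed
print(rule_satisfied("all_success", ["success", "skipped"]))
print(rule_satisfied("none_failed", ["success", "skipped"]))
```

In a real DAG, the rule is selected by passing, for example, trigger_rule="none_failed" to the operator that defines the task.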
The Concept of Scheduling in Airflow
One of the key features of Apache Airflow, scheduling lets developers run tasks and create DAG Run instances at a scheduled interval. The Airflow scheduler’s job is to monitor and stay in sync with all DAG objects, periodically store DAG parsing results, and continuously look for active tasks to trigger.
But how does Scheduling work in Airflow?
For example, if you run a DAG with a “schedule_interval” of 1 day and the run stamp is set to 2022-02-16, the task will trigger soon after 2022-02-16T23:59. In other words, the run is triggered once the period it covers has ended.
The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To start it, run the command given below. The scheduler takes its configuration from “airflow.cfg”.
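The arithmetic behind this example can be written out in a few lines. This is only a sketch of the rule stated above, with a hypothetical helper name trigger_time. It assumes that the run stamped with a given date covers one full interval starting at that date.

```python
from datetime import datetime, timedelta


def trigger_time(run_date, interval):
    """Earliest moment the run stamped `run_date` can be triggered."""
    # The run covers [run_date, run_date + interval) and fires once that period ends
    return run_date + interval


when = trigger_time(datetime(2022, 2, 16), timedelta(days=1))
print(when.isoformat())  # 2022-02-17T00:00:00
```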
airflow scheduler
The Concept of DAG Runs in Airflow
DAG stands for Directed Acyclic Graph. A DAG represents a set of tasks ordered by their dependencies, showing the flow of execution from one task to another.
A DAG Run, on the other hand, is, as defined in the Apache documentation, “an instantiation” of the DAG in time.
Every DAG Run belongs to a DAG, but a DAG might or might not have a schedule. The schedule is set through “schedule_interval”, which is a DAG argument and can be given as a cron expression. Some examples of cron presets are as follows:
preset | meaning | cron |
---|---|---|
None | Don’t schedule, use for exclusively “externally triggered” DAGs | |
@once | Schedule once and only once | |
@hourly | Run once an hour at the beginning of the hour | 0 * * * * |
@daily | Run once a day at midnight | 0 0 * * * |
@weekly | Run once a week at midnight on Sunday morning | 0 0 * * 0 |
@monthly | Run once a month at midnight of the first day of the month | 0 0 1 * * |
@quarterly | Run once a quarter at midnight on the first day | 0 0 1 */3 * |
@yearly | Run once a year at midnight of January 1 | 0 0 1 1 * |
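The table above can be expressed as a small lookup. The sketch below simply restates the presets that have a cron equivalent. The names CRON_PRESETS and to_cron are made up for illustration, and this is not how Airflow resolves presets internally.

```python
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@quarterly": "0 0 1 */3 *",
    "@yearly": "0 0 1 1 *",
}


def to_cron(schedule):
    """Return the cron expression for a preset, or the input unchanged."""
    # None and "@once" have no cron equivalent, so they pass through as-is
    return CRON_PRESETS.get(schedule, schedule)


print(to_cron("@daily"))
print(to_cron("*/5 * * * *"))
```

Either form, the preset or the cron string, can be given as the “schedule_interval” value of a DAG.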
Conclusion
In this guide, we learned in detail about the different ways to trigger Airflow DAGs. We also covered the basics of scheduling, triggers, and DAG Runs. If you want to get more familiar with the basics of scheduling data pipeline processes in Airflow, the following two articles can help:
- Scheduling & Triggers
- DAG Runs
Yash is a Content Marketing professional with over three years of experience in data-driven marketing campaigns. He has expertise in strategic thinking, integrated marketing, and customer acquisition. Through comprehensive marketing communications and innovative digital strategies, he has driven growth for startups and established brands.