DAG scheduling may seem difficult at first, but it really isn’t. And, if you are a data professional who wants to learn more about data scheduling and how to trigger Airflow DAGs, you’re at the right place.
But why did the need for data scheduling surface in the first place? The answer lies in the need to process the correct data at the right time. With its flexibility, Airflow lets users schedule their DAG Runs, and with the help of the Airflow Scheduler, users can ensure that future as well as past DAG Runs are performed at the right time and in the correct order.
In this guide, we will discuss the concepts of scheduling and DAG Runs in Airflow in detail, and then examine two methods to trigger Airflow DAGs. Let’s begin.
Table of Contents
- What is Apache Airflow?
- The Concept of Scheduling in Airflow
- The Concept of DAG Runs in Airflow
- Two Methods to Trigger Airflow DAGs
- Conclusion
What is Apache Airflow?
Apache Airflow is an open-source workflow management platform used to build and manage data engineering pipelines.
Written in Python, Airflow lets users define their pipelines as Python code and schedule data processing efficiently. The platform works as a building block that allows users to stitch together the many technologies present in today’s data landscapes.
Some key features of Apache Airflow are as follows:
- It’s Extensible: It’s easy to define your own operators and extend libraries to fit the level of abstraction that suits your business requirements.
- It’s Dynamic: Configured as code (Python), Airflow pipelines allow dynamic pipeline generation and enable users to restart from the point of failure without rerunning the entire workflow.
- It’s Sleek: Airflow pipelines are straightforward, and their rich scheduling semantics enable users to run pipelines at regular intervals.
- It’s Scalable: Airflow has a modular design. In itself, Airflow is a general-purpose orchestration framework with a manageable set of features to learn.
The Concept of Scheduling in Airflow
One of the apex features of Apache Airflow, scheduling helps developers schedule tasks and assign DAG Run instances at a scheduled interval. By definition, the Airflow Scheduler’s job is to monitor and stay in sync with all DAG objects, employ a repetitive process to store DAG parsing results, and continuously check for active tasks to be triggered.
But how does Scheduling work in Airflow?
For example, if you run a DAG with a “schedule_interval” of 1 day and its start date is set to 2022-02-16, the first run will be triggered soon after “2022-02-16T23:59”. In other words, an instance is triggered only once its schedule period has ended.
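As a minimal sketch of this behaviour (the DAG id and task name below are hypothetical, and the BashOperator import matches the one used later in this guide), the DAG uses a one-day “schedule_interval” with a start date of 2022-02-16, so the scheduler fires its first run only after that full day has passed:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
        dag_id='example_daily_schedule',  # hypothetical DAG id
        # The first schedule period starts on 2022-02-16
        start_date=datetime(2022, 2, 16),
        # Run once per day; the first run fires shortly after 2022-02-16T23:59
        schedule_interval=timedelta(days=1)) as dag:

    # Print the current date once the daily interval has elapsed
    BashOperator(task_id='print_date', bash_command='date')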
The Airflow Scheduler is built to run continuously in an Airflow production environment. To start the scheduler, run the command given below; the “airflow.cfg” file contains all of its configuration options.
airflow scheduler
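For reference, these are a few of the scheduler-related settings you will typically find in the [scheduler] section of “airflow.cfg” (exact names and defaults can vary between Airflow versions; the values below are only illustrative):
[scheduler]
# How often (in seconds) the scheduler heartbeats to signal that it is alive
scheduler_heartbeat_sec = 5
# Minimum interval (in seconds) between re-parsing the same DAG file
min_file_process_interval = 30
# How often (in seconds) to scan the DAGs folder for new files
dag_dir_list_interval = 300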
The Concept of DAG Runs in Airflow
DAG stands for Directed Acyclic Graph. A DAG is a pictorial representation of tasks and their order of execution: each task is a node in the graph, and the edges show the flow of execution from one task to another.
On the other hand, a DAG Run works as an extension — or as defined in the Apache documentation, “an instantiation” — of the DAG in time.
Every DAG Run is associated with a schedule, but a DAG itself might or might not have one. The schedule is set through the “schedule_interval” DAG argument, which accepts a cron expression or one of the cron presets listed below:
preset | meaning | cron
---|---|---
None | Don’t schedule; use for exclusively “externally triggered” DAGs | 
@once | Schedule once and only once | 
@hourly | Run once an hour at the beginning of the hour | 0 * * * *
@daily | Run once a day at midnight | 0 0 * * *
@weekly | Run once a week at midnight on Sunday morning | 0 0 * * 0
@monthly | Run once a month at midnight of the first day of the month | 0 0 1 * *
@quarterly | Run once a quarter at midnight on the first day | 0 0 1 */3 *
@yearly | Run once a year at midnight of January 1 | 0 0 1 1 *
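As a small illustration (the DAG ids below are hypothetical), the two definitions that follow are equivalent: the “@daily” preset is shorthand for the cron expression “0 0 * * *”:
from datetime import datetime
from airflow import DAG

# Using a cron preset
dag_preset = DAG(
    dag_id='daily_with_preset',
    start_date=datetime(2021, 4, 5),
    schedule_interval='@daily')

# Using the equivalent cron expression
dag_cron = DAG(
    dag_id='daily_with_cron',
    start_date=datetime(2021, 4, 5),
    schedule_interval='0 0 * * *')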
Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources and loads the data onto the desired Data Warehouse, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.
Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.
Get Started with Hevo for Free
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely easy for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!
Two Methods to Trigger Airflow DAGs
This section of the blog post will discuss the two ways available to trigger Airflow DAGs:
- Trigger Airflow DAGs on a Schedule: While creating a DAG in Airflow, you also specify a schedule; the scheduler then triggers the DAG automatically according to it.
- Trigger Airflow DAGs Manually: It’s possible to trigger a DAG manually via the Airflow UI or by running a CLI command.
Trigger Airflow DAGs on a Schedule
To trigger Airflow DAGs on a schedule, you first need to specify the “start_date” and “schedule_interval” parameters, and then upload the DAG file to your environment.
The “schedule_interval” parameter defines how often the DAG will run, and the “start_date” parameter defines when Airflow should start scheduling the DAG’s tasks.
Note: Airflow schedules the first DAG Run at the end of the first schedule interval, that is, at “start_date” plus one “schedule_interval”.
Look at the example given below. It shows a DAG that runs periodically, every hour, starting at 15:00 on April 5, 2021. According to these parameters, Airflow schedules the first DAG Run at 16:00 on April 5, 2021.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
        dag_id='example_dag_schedule',
        # At 15:00 on 5 April, 2021
        start_date=datetime(2021, 4, 5, 15, 0),
        # At minute 0 of every hour
        schedule_interval='0 * * * *') as dag:

    # Output the current date and time
    t1 = BashOperator(
        task_id='date',
        bash_command='date')
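To make the note above concrete, the first run time can be worked out with plain datetime arithmetic; this is just an illustration of the rule, not Airflow code:
from datetime import datetime, timedelta

start_date = datetime(2021, 4, 5, 15, 0)    # 15:00 on April 5, 2021
schedule_interval = timedelta(hours=1)      # '0 * * * *' means every hour

# Airflow runs a DAG at the *end* of its schedule interval,
# so the first DAG Run is created at start_date + schedule_interval.
first_run = start_date + schedule_interval
print(first_run)                            # 2021-04-05 16:00:00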
Trigger Airflow DAGs Manually
When you trigger an Airflow DAG manually, Airflow creates an additional DAG Run for it. That means that if you manually trigger a DAG that already runs on its own schedule, Airflow will execute that run immediately without affecting the scheduled DAG Runs. To trigger a DAG manually (optionally with a run configuration), you can use the following two methods:
- Trigger Airflow DAGs Using Airflow UI, and
- Trigger Airflow DAGs Using gcloud
Method 1: Trigger Airflow DAGs manually using the Airflow UI in GCC:
Step 1: In GCC, open the Environments page.
Step 2: In the Airflow webserver column, find the Airflow link for your environment and click on it.
Step 3: On the Airflow web interface, go to the DAGs page. In the Links column for your DAG, click the “Trigger Dag” button.
Step 4: Optionally, specify the DAG run configuration (a sketch of how a task can read this configuration follows these steps).
Step 5: Click the “Trigger” button.
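If you do pass a run configuration in Step 4, your tasks can read it at runtime. The sketch below (hypothetical DAG and task names, Airflow 2-style imports) shows one way to do this with a PythonOperator; the configuration is empty for regularly scheduled runs:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def print_conf(**context):
    # The JSON passed via "Trigger DAG w/ config" (or the CLI's --conf flag)
    # is available on the DagRun object; it is empty for scheduled runs.
    conf = context['dag_run'].conf or {}
    print(f'Run configuration: {conf}')

with DAG(
        dag_id='example_manual_trigger',  # hypothetical DAG id
        start_date=datetime(2021, 4, 5),
        # No schedule: this DAG only runs when triggered externally
        schedule_interval=None) as dag:

    PythonOperator(task_id='print_conf', python_callable=print_conf)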
Method 2: Trigger Airflow DAGs manually using gcloud in GCC:
For Airflow 1.10.*, the “trigger_dag” CLI command is used to trigger the DAG Run. Look at the command given below:
gcloud composer environments run ENVIRONMENT_NAME \
    --location LOCATION \
    trigger_dag -- DAG_ID
For Airflow 2, the “dags trigger” CLI command is used to trigger the DAG Run. Look at the command given below:
gcloud composer environments run ENVIRONMENT_NAME \
    --location LOCATION \
    dags trigger -- DAG_ID
Note: Replace the parameters as follows:
- “ENVIRONMENT_NAME” with the name of the environment.
- “LOCATION” with the Compute Engine region where the environment is located.
- “DAG_ID” with the name of the DAG.
Conclusion
In this guide, we learned in detail about the different ways to trigger Airflow DAGs, and we also covered the basics of scheduling, triggers, and DAG Runs. If you want to get more familiar with scheduling data pipeline processes in Airflow, the two articles below can help:
- Scheduling & Triggers
- DAG Runs
Although it’s easy to create workflow pipelines in Apache Airflow for a trained professional, the need of the hour is to create a solution that will help a professional with a non-technical background — and Hevo Data is here to help!
Visit our Website to Explore Hevo
Hevo Data will automate your data transfer process, allowing you to focus on other aspects of your business like Analytics and Customer Management. The platform allows you to transfer data for free from 100+ sources like Salesforce to Cloud-based Data Warehouses like Snowflake, Google BigQuery, and Amazon Redshift. It will provide you with a hassle-free experience and make your work life much easier.
Want to take Hevo for a spin? Sign Up for a 14-day free trial. You may also have a look at the pricing, which will help you select the best plan for your requirements.