Automation plays a key role in improving production rates and work efficiency across industries. Airflow is used by many Data Engineers and Developers to programmatically author, schedule, and monitor workflows. However, manually maintaining and scaling Airflow, along with handling security and authorization for its users, is a daunting task. This is where AWS Apache Airflow comes in.
Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed service that makes it easy to run Apache Airflow on AWS and to build workflows for Extract-Transform-Load (ETL) jobs and Data Pipelines.
What is Airflow?
Airflow allows organizations to write workflows as Directed Acyclic Graphs (DAGs) in standard Python, so anyone with basic knowledge of the language can deploy one. Airflow helps organizations schedule their tasks by specifying the plan and frequency of each workflow, and it provides an interactive interface along with a range of tools to monitor workflows in real time.
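For a sense of what this looks like, here is a minimal sketch of an Airflow 2.x DAG; the DAG ID, task names, and schedule are purely illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for pulling data from a source system
    print("extracting data")


def load():
    # Placeholder for writing results to a target store
    print("loading data")


# A DAG is just a Python object: give it an ID, a start date, and a schedule
with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator declares the dependency: extract runs before load
    extract_task >> load_task
```

Once this file is placed in the DAGs folder, the scheduler picks it up and runs it on the declared schedule.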
Apache Airflow has gained a lot of popularity among organizations dealing with significant amounts of Data Collection, Processing, and Analysis. Many tasks that IT experts would otherwise perform manually every day can instead be triggered automatically as Airflow workflows, reducing the time and effort required for collecting data from various sources, processing it, uploading it, and finally creating reports.
What are Managed Workflows for Apache Airflow (MWAA)?
Amazon Managed Workflows for Apache Airflow is a fully managed service in the AWS Cloud for deploying and rapidly scaling open-source Apache Airflow projects. With Amazon Managed Workflows for Apache Airflow, you can author, schedule, and monitor workflows using Airflow within AWS without having to set up and maintain the underlying infrastructure.
Amazon MWAA automatically scales Airflow's workflow execution capacity to meet your needs and integrates with AWS security services to provide fast, secure access to your data. Amazon MWAA creates an environment using your Amazon VPC and the DAG code and supporting files stored in your Amazon S3 bucket.
Airflow allows workflows to be written as Directed Acyclic Graphs (DAGs) using the Python programming language. Airflow workflows fetch input from sources like Amazon S3 storage buckets using Amazon Athena queries and perform transformations on Amazon EMR clusters. The output data can be used to train Machine Learning Models on Amazon SageMaker.
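As a rough illustration, the Athena step of such a pipeline could be expressed with the community-maintained Amazon provider package for Airflow (recent versions ship an AthenaOperator); the query, database, and S3 locations below are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
# AthenaOperator ships with the apache-airflow-providers-amazon package
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

with DAG(
    dag_id="athena_to_s3",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run an Athena query over data in S3 and write the results back to S3
    run_query = AthenaOperator(
        task_id="run_query",
        query="SELECT * FROM sales WHERE sale_date = '{{ ds }}'",  # hypothetical table
        database="analytics",                                      # hypothetical database
        output_location="s3://my-results-bucket/athena/",          # hypothetical bucket
    )
```

Downstream tasks, such as an EMR transformation or a SageMaker training job, would be added to the same DAG with the corresponding operators from the provider package.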
Key Features of AWS Apache Airflow
- Automatic Airflow Setup: You can quickly set up Apache Airflow in an Amazon MWAA environment; Amazon MWAA provisions Apache Airflow for you using the same open-source code and Airflow User Interface (UI).
- Built-in Security: As discussed, Airflow Workers and Schedulers run in Amazon MWAA's Amazon VPC, and your data is automatically encrypted using AWS Key Management Service (KMS).
- Scalability: It is very easy to scale Airflow within MWAA. You can automatically scale Airflow Workers by specifying the minimum and maximum number of workers, and the autoscaling component adds or removes workers to meet demand (see the sketch after this list).
- Built-in Authentication: MWAA enables role-based authentication and authorization for your Airflow Web Server by defining the access control policies in AWS Identity and Access Management (IAM).
- AWS Integration: Deploying Airflow within AWS opens the door to open-source integrations with various AWS services such as Amazon Athena, AWS Batch, Amazon DynamoDB, AWS DataSync, Amazon EMR, Amazon EKS, AWS Glue, Amazon Redshift, Amazon SageMaker, Amazon S3, etc.
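For instance, the worker limits mentioned above can be adjusted on an existing environment through the MWAA API. Here is a minimal boto3 sketch; the environment name and region are placeholders, and the caller is assumed to have the relevant mwaa:UpdateEnvironment permission:

```python
import boto3

# MWAA exposes its management API through the standard AWS SDKs
mwaa = boto3.client("mwaa", region_name="us-east-1")

# Raise the autoscaling ceiling for a hypothetical environment named "my-airflow-env";
# MWAA adds or removes workers between these bounds based on the task load.
mwaa.update_environment(
    Name="my-airflow-env",
    MinWorkers=1,
    MaxWorkers=10,
)
```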
AWS Apache Airflow Architecture
- The Apache Airflow Scheduler and Workers are AWS Fargate containers that connect to the private subnets in the Amazon VPC for your environment.
- The Airflow metadata database is managed by AWS, and the Scheduler and Worker Fargate containers access it through a privately secured VPC endpoint.
- Other AWS services such as Amazon CloudWatch, Amazon S3, Amazon SQS, Amazon ECR, and AWS KMS sit outside the Amazon MWAA architecture, but they can still be accessed from the Apache Airflow Scheduler(s) and Workers in the Fargate containers.
The Airflow Web Server can be accessed in two ways: over the Internet or from within your Amazon VPC. To access it over the Internet, select the Public Network Apache Airflow Access Mode; to access it only from within the Amazon VPC, select the Private Network Apache Airflow Access Mode. Either way, authentication and authorization for your Airflow Web Server are controlled by the access control policies defined in AWS Identity and Access Management (IAM).
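For example, an IAM principal that is allowed to call the MWAA CreateWebLoginToken action can request a short-lived token for the Airflow UI. A minimal boto3 sketch, with the environment name and region as placeholders:

```python
import boto3

mwaa = boto3.client("mwaa", region_name="us-east-1")

# IAM decides who may request a token for the hypothetical "my-airflow-env" environment
response = mwaa.create_web_login_token(Name="my-airflow-env")

# The hostname and token together grant temporary access to the Airflow Web Server UI
print(response["WebServerHostname"])
print(response["WebToken"])
```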
AWS Apache Airflow Integrations
As discussed in the previous sections, deploying Airflow within AWS opens the door to open-source integrations with various AWS services as well as hundreds of built-in and community-created operators and sensors. The community-created operators and plugins for Apache Airflow simplify connections to AWS services such as Amazon S3, Amazon Redshift, Amazon EMR, AWS Glue, Amazon SageMaker, Amazon Athena, etc. You can also use these community-driven operators to connect to services on other Cloud platforms.
To provide flexibility in performing Data Processing Tasks, Apache Airflow fully supports integration with AWS services and popular third-party tools such as Apache Hadoop, Hive, Presto, and Spark. On top of that, Amazon MWAA maintains compatibility with the Amazon MWAA API.
Getting Started with AWS Apache Airflow
To start using Amazon Managed Workflows for Apache Airflow, follow the steps below.
Create an Airflow Environment Using Amazon MWAA
- To create an Airflow environment, open the Amazon MWAA console and click on “Create environment”. You will be prompted to name the environment and select the Apache Airflow version to use.
Upload your DAGs and Plugins to S3
- The next step requires you to upload your DAGs and plugins to S3. To do so, select the S3 bucket where your code and files should be stored (a scripted version of these uploads is sketched after this list).
- Then, select the folder where your DAG code will be uploaded. Note that the S3 bucket name must start with airflow-.
- In addition to that, you can also specify a plugin file and a requirements file to be uploaded to S3.
- The plugins file (ZIP) contains the plugins used by your DAGs.
- The requirements file describes the Python dependencies required to run your DAGs.
- For the plugins and requirements files, select the S3 object version to use.
- Click on “Next” to configure the advanced settings. In the “Networking” section, you can choose the type of network (Public network or Private network) for web server access. For this demonstration, a Public network is chosen.
- You can now allow MWAA to create a VPC Security Group based on the selected web server access.
- Next up, you need to configure the “Environment class”. Based on the number of DAGs, you are given a suggestion for which class to use; however, you can change the class at any time.
- Coming to encryption, data at rest is always encrypted. Optionally, you can select a customer managed key in AWS Key Management Service (KMS).
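Returning to the upload step above, the same DAG, plugin, and requirements uploads can be scripted. A minimal boto3 sketch, assuming a hypothetical bucket named airflow-my-environment-bucket and local files dag.py, plugins.zip, and requirements.txt:

```python
import boto3

s3 = boto3.client("s3")
# Hypothetical bucket; as noted above, the name must start with "airflow-"
bucket = "airflow-my-environment-bucket"

# DAG code goes under the folder configured as the DAG path for the environment
s3.upload_file("dag.py", bucket, "dags/dag.py")

# Optional plugins archive and Python dependencies file
s3.upload_file("plugins.zip", bucket, "plugins.zip")
s3.upload_file("requirements.txt", bucket, "requirements.txt")
```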
Monitor your Environment
- With MWAA, you can monitor your environment with Amazon CloudWatch. To do so, environment performance metrics need to be published to CloudWatch Metrics, an option that is enabled by default.
- In addition to environment metrics, you can also send Airflow Logs to CloudWatch Logs. To do so, specify the log level and the Airflow components that should send their logs to CloudWatch Logs. For the purposes of this demonstration, log level INFO is used.
- Finally, you need to configure the permissions to be used by your environment to access your DAGs, write logs, and run DAGs. Select “Create a new role” and click on the “Create environment” button. The new Airflow environment is now ready to use.
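The console walkthrough above also has an API equivalent. Here is a rough boto3 sketch of creating an environment with public web server access and task logs sent to CloudWatch Logs at the INFO level; every name, ARN, subnet, and version below is a placeholder, and the execution role and S3 bucket are assumed to already exist:

```python
import boto3

mwaa = boto3.client("mwaa", region_name="us-east-1")

mwaa.create_environment(
    Name="my-airflow-env",                                    # hypothetical environment name
    AirflowVersion="2.2.2",                                   # pick a version supported by MWAA
    SourceBucketArn="arn:aws:s3:::airflow-my-environment-bucket",
    DagS3Path="dags",                                         # folder holding the DAG code
    ExecutionRoleArn="arn:aws:iam::123456789012:role/my-mwaa-execution-role",
    NetworkConfiguration={
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # two private subnets
        "SecurityGroupIds": ["sg-cccc3333"],
    },
    WebserverAccessMode="PUBLIC_ONLY",                        # or "PRIVATE_ONLY"
    LoggingConfiguration={
        "TaskLogs": {"Enabled": True, "LogLevel": "INFO"},    # send task logs to CloudWatch Logs
    },
)
```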
Conclusion
As the complexity of your Data Pipelines increases, it becomes necessary to orchestrate the overall process as a series of sub-tasks. Apache Airflow is used by many Developers and Data Engineers to programmatically automate and manage workflows. With AWS Apache Airflow, you can avoid the common challenges involved in running your own Airflow environments.
Amazon Managed Workflows for Apache Airflow is a fully managed service that makes it easy to run Apache Airflow on AWS.
Share your experience of working with AWS Apache Airflow in the comments section below.
Raj, a data analyst with a knack for storytelling, empowers businesses with actionable insights. His experience, from Research Analyst at Hevo to Senior Executive at Disney+ Hotstar, translates complex marketing data into strategies that drive growth. Raj's Master's degree in Design Engineering fuels his problem-solving approach to data analysis.