Airflow is a workflow management tool that helps represent data engineering pipelines as Python code. Airflow represents workflows as Directed Acyclic Graphs, or DAGs. It provides an intuitive user interface that enables users to configure workflow tasks, schedule them, and monitor them.

Airflow is generally used to fetch data from various sources, transform it, and then push it to different destinations. To do this, Airflow provides a number of hooks and operators to fetch and transform data. This post talks about the Airflow S3 Hook and how to use it.

Prerequisites

To successfully set up the Airflow S3 Hook, you need to meet the following requirements:

  • Python 3.6 or above.
  • Airflow installed and configured.
  • An AWS account with S3 permissions.

Understanding Airflow Hooks


The purpose of Airflow Hooks is to facilitate integration with external systems. Hooks help Airflow users connect to different data sources and targets, and they are used for fetching data as well as pushing it. Airflow Hooks also serve as the building blocks for implementing Airflow operators, which abstract the full functionality of extracting, transforming, and loading data for various source and destination combinations.

Hooks help users avoid the boilerplate code that they would otherwise have to implement to connect to various systems. Airflow provides a number of built-in hooks to interface with systems like MySQL, PostgreSQL, S3, etc. Airflow also provides an interface for developing custom hooks, in case the built-in hooks are not enough for you.
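A custom hook is usually implemented by subclassing BaseHook and reading its credentials from an Airflow connection rather than hard-coding them. The snippet below is a minimal sketch rather than a definitive implementation; the MyServiceHook class and the my_service_default connection ID are hypothetical names used only for illustration.

from airflow.hooks.base import BaseHook


class MyServiceHook(BaseHook):
    """Hypothetical hook for an external service, configured via an Airflow connection."""

    def __init__(self, my_conn_id="my_service_default"):
        super().__init__()
        self.my_conn_id = my_conn_id

    def get_conn(self):
        # Look up the connection stored in Airflow instead of hard-coding credentials.
        conn = self.get_connection(self.my_conn_id)
        # Build and return whatever client object the external system requires.
        return {"host": conn.host, "login": conn.login, "password": conn.password}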

The Airflow S3 Hook provides methods to retrieve keys and buckets, check for the presence of keys, list all keys, load a file to S3, download a file from S3, etc. You can find a complete list of all the functionalities supported by the S3 Hook here.
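A few of the commonly used methods are sketched below. The bucket and key names are placeholders, and the exact signatures may vary slightly between versions of the Amazon provider package, so treat this as an illustrative sketch rather than a reference.

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

s3 = S3Hook(aws_conn_id="S3_default")

# Check whether a bucket or a key exists.
bucket_exists = s3.check_for_bucket("my-bucket")
key_exists = s3.check_for_key("data/report.csv", bucket_name="my-bucket")

# List keys under a prefix.
keys = s3.list_keys(bucket_name="my-bucket", prefix="data/")

# Upload a local file and read an object back as a string.
s3.load_file("/tmp/report.csv", key="data/report.csv", bucket_name="my-bucket", replace=True)
content = s3.read_key("data/report.csv", bucket_name="my-bucket")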

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports Amazon S3 and 100+ other Data Sources, including 40+ Free Sources. Setting it up is a simple 3-step process: select the data source, provide valid credentials, and choose the destination.

Hevo loads the data onto the desired Data Warehouse/destination in real-time, enriches the data, and transforms it into an analysis-ready form without you having to write a single line of code. Its completely automated, fault-tolerant, and scalable pipeline architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

GET STARTED WITH HEVO FOR FREE

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your Data Analysis with Hevo today! 

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Steps to Set Up Airflow S3 Hook

The goal of this post is to help the reader get familiar with the concept of Airflow Hooks and build their first DAG using the Airflow S3 Hook. You will cover the following points in this article:

  • Work with the Airflow UI.
  • Configure the Airflow S3 Hook and its connection parameters.
  • Use the Airflow S3 Hook to implement a DAG.

Follow the steps below to get started with Airflow S3 Hook:

Step 1: Set up Airflow S3 Hook

Once Airflow is installed, you can use the command below to start the Airflow webserver:

airflow webserver -p 8080

Then, access localhost:8080 in your favorite browser to view the Airflow UI.
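Note that the S3 Hook ships with the Amazon provider package rather than with Airflow core. If the provider is not already present in your environment, it can typically be installed with pip (assuming a pip-based Airflow installation):

pip install apache-airflow-providers-amazon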

Airflow S3 Hook - Airflow UI

Step 2: Set Up the Airflow S3 Hook Connection

Click the Admin tab in the Airflow user interface and click Connections as shown below:

Airflow S3 Hook - Create Airflow S3 Hook Connection

Set up the S3 connection object by clicking the ‘+’ button. Add a relevant name, and ensure you select Connection Type as S3. You will have to use the name of this connection in your code, so keep a note of the name entered here.

Airflow S3 Hook - Add Airflow S3 Hook Connection Details

In the Extra section, add your AWS secret credentials in the format below. Leave the rest of the fields blank.

{"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}

Step 3: Implement the DAG

Use the below code snippet to implement the DAG.

import logging
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


# Change this to the connection name you created in the Airflow UI, if needed.
AWS_S3_CONN_ID = "S3_default"


def s3_extract():
    source_s3_key = "YOUR_S3_KEY"
    source_s3_bucket = "YOUR_S3_BUCKET"
    dest_file_path = "/home/user/airflow/data/s3_extract.txt"
    source_s3 = S3Hook(aws_conn_id=AWS_S3_CONN_ID)
    # Fetch the S3 object for the given key and bucket, then stream its
    # contents to the local destination file via boto3's download_fileobj.
    s3_object = source_s3.get_key(source_s3_key, source_s3_bucket)
    with open(dest_file_path, "wb") as dest_file:
        s3_object.download_fileobj(dest_file)
    logging.info("Downloaded s3://%s/%s to %s", source_s3_bucket, source_s3_key, dest_file_path)


with DAG(
    dag_id="s3_extract",
    start_date=datetime(2022, 2, 12),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:

    t1 = PythonOperator(
        task_id="s3_extract_task",
        python_callable=s3_extract,
    )

The first part of the DAG contains the required import statements. The core of the DAG is the s3_extract function, which uses the Airflow S3 Hook to initialize a connection to AWS, fetches the object using its key and bucket name, and then downloads it using the download_fileobj method provided by the underlying boto3 S3 client.

The DAG is configured to run this extract every day, starting from the specified start_date.
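While developing, you can also execute the task once without involving the scheduler by using the airflow tasks test command, which runs a single task for a given date and prints its logs to the console (the DAG and task IDs here are the ones from the snippet above):

airflow tasks test s3_extract s3_extract_task 2022-02-12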

Now save the DAG as s3_extract_dag.py in the DAG directory. The DAG directory is specified by the dags_folder parameter in the airflow.cfg file located in your Airflow home directory.
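If you are not sure where the DAG directory is on your machine, recent Airflow 2.x versions let you print the configured value from the command line:

airflow config get-value core dags_folder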

Step 4: Run the DAG

After saving the file in the DAG directory, the Airflow scheduler will pick it up the next time it scans the dags_folder and register it as a DAG. If you are working with a fresh installation and have not yet initialized the metadata database, run the command below first:

airflow db init
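You can then confirm that the new DAG has been parsed and registered by listing the DAGs from the command line:

airflow dags list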

Go to the Airflow UI and find the DAG.

Airflow S3 Hook - Find the DAG

You can then trigger the DAG using the ‘play’ button in the top right corner.

Airflow S3 Hook - Run the DAG
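The same run can also be started from the command line, which is handy when testing (the dag_id is the one defined in the snippet above):

airflow dags trigger s3_extract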

Once it has executed successfully, head to the directory configured for saving the file, and you will find the downloaded file there.

That concludes your effort to use the Airflow S3 Hook to download a file. Notice how easy it was: the Airflow S3 Hook abstracted away all the boilerplate code and provided simple methods you can call to download the file.

Challenges faced with Airflow S3 Hooks

While the above Airflow S3 Hook example may appear easy, it does not reflect the actual requirements that ETL developers face in production. Let us now look at some of the typical challenges involved in executing such a requirement in production.

  • The above method works well for a single one-off extract. However, S3 is used as a data lake in many ETL pipelines, and downloading a file is rarely just about fetching it and saving it to disk. There are additional complexities, such as recognizing when a file has arrived and acting on it, checking for duplicate files, etc. In such cases, more complex logic has to be implemented in the form of Airflow sensors and operators (see the sketch after this list).
  • While Airflow provides a set of built-in operators and hooks, more often than not they are not sufficient, especially for organizations that use many SaaS offerings.
  • The DAG definition still has to be written as code. There are ETL tools like Hevo that can help you avoid coding altogether and let you focus only on the business logic.
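For example, waiting for a file to land in S3 is typically handled with the S3KeySensor that ships with the same Amazon provider package. The task below is a minimal sketch; the bucket, key, and connection ID are placeholders, and the import path assumes a recent version of the provider.

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Define this task inside a `with DAG(...)` block, alongside the extract task.
wait_for_file = S3KeySensor(
    task_id="wait_for_input_file",
    bucket_name="my-bucket",
    bucket_key="data/input.csv",
    aws_conn_id="S3_default",
    poke_interval=60,        # check S3 every 60 seconds
    timeout=6 * 60 * 60,     # give up after 6 hours
)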

Conclusion

Airflow S3 Hooks provide an excellent way of abstracting all the complexities involved in interacting with S3 from Airflow. It also provides a clean way of configuring credentials outside the code through the use of connection configuration. The intuitive user interface helps to configure the periodic runs and monitor them. That said, it still needs you to write code since DAG definitions have to be written as code.

However, extracting complex data from a diverse set of data sources like CRMs, project management tools, streaming services, and marketing platforms can be quite challenging. If you are looking for a no-code way of extracting data from S3 and transforming it, check out Hevo. Hevo Data is a No-Code Data Pipeline that offers a faster way to move data from Amazon S3 and 100+ other Data Sources (including 40+ Free Sources) into your Data Warehouse, to be visualized in a BI tool. Hevo is fully automated and hence does not require you to write any code.

VISIT OUR WEBSITE TO EXPLORE HEVO

Want to take Hevo for a spin?

SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience with Airflow S3 Hook in the comments section below!

Talha
Software Developer, Hevo Data

Talha is a seasoned Software Developer, currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. With a significant tenure at Flipkart prior to his current role, he has brought innovative solutions to the space of data connectivity and software development.
