Organizations today process and transform large amounts of data with ETL (extract, transform, and load) pipelines. Loading and transforming data at that scale is time-consuming, and smaller projects often do not need to process vast amounts of data.
Instead, you can use micro ETL with the help of AWS Lambda to get relevant data quickly. With AWS Lambda, a time-based event can trigger a function that extracts, transforms, and saves the data into a central repository.
In this article, you will learn to create a micro ETL data pipeline with AWS Lambda functions. Data Pipeline and AWS Lambda are also discussed briefly here.
Prerequisites
Basic understanding of the need for data migration
What is Data Pipeline?
Data Pipeline is a series of steps implemented in a specific order to process and transfer data from one system to another. The first step in the Data Pipeline is to extract data from the source as input. In data pipelining, each step’s output serves as the next input.
The Data Pipeline process consists of three main elements: the data source, the processing steps, and the final destination. A Data Pipeline allows users to transfer data from source to destination, with some modifications applied along the data flow.
A Data Pipeline is an umbrella term for data movement from one place to another, including ETL and ELT processes. However, it is essential to observe that the Data Pipeline doesn’t necessarily mean that a transformation is carried out on the data.
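The idea that each step's output serves as the next step's input can be sketched in a few lines of Python. This is only an illustration: the sample rows and the step functions `cast_prices` and `keep_expensive` are hypothetical and not part of any AWS service or of the sample project.

```python
from functools import reduce


def extract():
    # Hypothetical source data standing in for a real extract step.
    return [
        {"region": "London", "price": "500000"},
        {"region": "Leeds", "price": "210000"},
    ]


def cast_prices(rows):
    # Processing step: convert the price strings to integers.
    return [{**row, "price": int(row["price"])} for row in rows]


def keep_expensive(rows):
    # Processing step: keep only rows above an arbitrary threshold.
    return [row for row in rows if row["price"] > 300000]


def run_pipeline(source, steps):
    # Feed the output of each step into the next one, in order.
    return reduce(lambda data, step: step(data), steps, source)


result = run_pipeline(extract(), [cast_prices, keep_expensive])
print(result)
```

Passing an empty list of steps returns the source rows unchanged, which mirrors the point above that a pipeline does not have to transform the data it moves.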
What is AWS Lambda?
Introduced in 2014, AWS Lambda is a serverless computing service that allows you to run code for any application or backend service without managing servers. AWS Lambda handles administrative tasks such as provisioning CPU, memory, and other resources on its own. It can connect with more than 200 AWS services and SaaS applications.
AWS Lambda users write functions, which are self-contained applications written in one of the supported languages and runtimes, and upload them to AWS Lambda, which then executes them quickly and flexibly.
Lambda Functions can be used to do anything from serving web pages to processing data streams to calling APIs and integrating with other AWS services.
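As a minimal sketch of what such a function looks like in Python, the handler below receives the trigger payload as `event` and runtime metadata as `context`. The `name` key in the event and the response shape are assumptions made for this example only.

```python
import json


def lambda_handler(event, context):
    # `event` carries the trigger payload; `context` carries runtime metadata.
    # The "name" key is a hypothetical field used only for this illustration.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


if __name__ == "__main__":
    # Local smoke test; in AWS, the Lambda runtime calls lambda_handler itself.
    print(lambda_handler({"name": "ETL"}, None))
```

Because the handler is a plain Python function, you can call it locally with a dictionary before deploying it.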
Getting Started with Data Pipeline Using Lambda
If you are building a data lake, an analytics pipeline, or a simple data feed, you often need small amounts of data to be processed and refreshed regularly. In this article, you will build and deploy a micro extract, transform, and load (ETL) pipeline to handle this requirement. You will also configure a reusable Python environment to build and deploy micro ETL pipelines using your data source.
Micro ETL processes work seamlessly with the serverless architecture. Therefore, we will use the AWS Serverless Application Model (SAM) in this article.
You need a local environment to inspect the data, experiment, and deploy the ETL process with the AWS SAM CLI (Command Line Interface). The deployment consists of a time-based event, which triggers the AWS Lambda function. This function collects the data, transforms it, and stores it in an Amazon S3 bucket, as shown in the below diagram.
Prerequisites
Download the Code
Download the code from GitHub with the below command.
git clone https://github.com/aws-samples/micro-etl-pipeline.git
Set Up the Environment
The GitHub code comes with a preconfigured Conda environment, so you do not need to spend time installing the dependencies one by one. A Conda environment is a directory containing a specific collection of Conda packages. You can use the environment.yml file to get the same dependencies.
- Create the environment using the below code.
conda env create -f environment.yml
- Activate the environment.
conda activate aws-micro-etl
Analyze Data with Jupyter Notebook
- In this article, you will run the Jupyter notebook locally.
- After activating the environment, you can launch your Jupyter notebook.
- Launch it from the command line with the below command.
jupyter notebook
- It will open a browser window with a Jupyter dashboard in the root project folder.
- Select the aws_mini_etl_sample.ipynb file.
- This Jupyter notebook contains a sample micro ETL process. The process uses publicly available data from HM Land Registry, which contains the average price by property type series.
- The notebook covers the following functional scenarios.
- How to support partial requests and therefore fetch a small part of a larger file.
- How to inspect and manipulate the data to achieve the right outcome.
- How to support file types other than CSV.
- The easiest way to save a CSV file directly into the S3 bucket.
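The last scenario, saving a CSV file into an S3 bucket, can be sketched as follows. This is not the notebook's code: it uses the standard-library `csv` module instead of pandas, and the helper names `rows_to_csv` and `save_csv_to_s3` are hypothetical. The only AWS call it relies on is the S3 client's `put_object` with `Bucket`, `Key`, and `Body`.

```python
import csv
import io


def rows_to_csv(rows, columns):
    # Serialize a list of dicts to CSV text held in memory.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def save_csv_to_s3(s3_client, bucket, key, rows, columns):
    # Upload the CSV as a single object and return its S3 URI.
    body = rows_to_csv(rows, columns).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"


# Usage sketch (needs boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   save_csv_to_s3(boto3.client("s3"), "my-bucket", "sample.csv",
#                  [{"region": "London", "price": 500000}], ["region", "price"])
```

Passing the client in as an argument keeps the function easy to exercise locally with a stand-in object before pointing it at a real bucket.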
Inspect the Function
- The downloaded code consists of an additional folder called micro-etl-app, containing ETL processes defined with the AWS SAM template, ready to deploy as a Lambda function.
- AWS SAM provides the syntax for expressing functions, APIs, databases, and event source mappings.
- Define the application and model it by using YAML with a few lines per resource. AWS SAM transforms and expands the AWS SAM syntax into the AWS CloudFormation syntax, enabling you to build serverless applications faster.
- The AWS SAM app consists of the below files.
- template.yml: It contains the configuration to build and deploy the Lambda function.
- app/app.py: It contains the application code taken from the Jupyter notebook.
- app/requirements.txt: It lists the Python libraries needed for the Lambda function to run.
- The file template.yml consists of the details for deploying and building the ETL process, such as permissions, schedule rules, variables, and more.
- For this type of micro-application, it is essential to allocate the right amount of memory and an appropriate timeout to avoid latency issues or resource restrictions. The memory and timeout settings for the Lambda function are defined under the Globals statement, as shown below.
Globals:
  Function:
    Timeout: 20
    MemorySize: 256
- Other necessary settings are defined inside the Properties statement, such as the environment variables, which allow you to control settings like the URL to fetch without changing the code.
Environment:
  Variables:
    Url: 'http://publicdata.landregistry.gov.uk/market
    S3Bucket: !Ref Bucket
    LogLevel: INFO
    Filename: 'avg-price-property-uk.csv'
- The cron event is defined under the Events statement and triggers the Lambda function every day at 8 a.m. (schedule expressions are evaluated in UTC).
Events:
  UpdateEvent:
    Type: Schedule
    Properties:
      Schedule: cron(0 8 * * ? *)
- The initial section of the app.py file contains required dependencies, environment variables, and other supporting statements. The main code is inside the Lambda handler.
# Imports
import ...
import ...
import ...
# Environment variables
ONE = ...
TWO = ...
THREE = ...
# Lambda function handler
def lambda_handler(event, context):
    # Code
    #
- The app.py file contains comments that explain each statement. The first statement uses the requests library to fetch the last 2,000,000 bytes of the data source file defined in the URL environment variable.
res = requests.get(URL, headers=range_header(-2000000), allow_redirects=True)
- With the skiprows parameter, the second statement creates a pandas DataFrame directly from the source stream, removing the first row. It removes the first row because it is difficult to precisely fetch the beginning of a row using byte-range. The statement then assigns predefined column headers that are missing as part of the initial chunk of the file.
df = pd.read_csv(io.StringIO(res.content.decode('utf-8')), engine='python', error_bad_lines=False, names=columns, skiprows=1)
- The last file in the application is requirements.txt, which the AWS SAM CLI uses to build and package the dependencies needed for the Lambda function to work correctly. You can use additional libraries in your application, but you must list them in requirements.txt.
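The byte-range fetch and the dropped first row described above can be illustrated with a small standard-library sketch. The `range_header` function here is a hypothetical reconstruction of the helper that app.py calls, not the project's actual code, and `parse_tail` uses the `csv` module rather than pandas so that it runs without extra dependencies. The column names and sample bytes are made up for the example.

```python
import csv


def range_header(start, end=None):
    # Build an HTTP Range header. A negative start is a suffix range:
    # "bytes=-2000000" asks the server for the last 2,000,000 bytes.
    if start < 0:
        return {"Range": f"bytes={start}"}
    if end is None:
        return {"Range": f"bytes={start}-"}
    return {"Range": f"bytes={start}-{end}"}


def parse_tail(chunk, columns):
    # The chunk usually starts in the middle of a row, so drop the first line.
    lines = chunk.decode("utf-8").splitlines()[1:]
    # The header row is not part of the tail, so column names are supplied.
    # Rows with an unexpected number of fields are skipped.
    return [
        dict(zip(columns, row))
        for row in csv.reader(lines)
        if len(row) == len(columns)
    ]


# Made-up tail of a CSV file; the first line is a partial row.
sample_tail = b"-01,Leeds,210000\n2020-03-01,York,250000\n"
rows = parse_tail(sample_tail, ["date", "region", "price"])
print(range_header(-2000000))
print(rows)
```

If you run the original read_csv statement on a recent pandas release, note that the error_bad_lines argument was deprecated in pandas 1.3 and later removed; on_bad_lines='skip' is its replacement.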
Build and Deploy ETL
To build and deploy the ETL process, follow the below steps.
- Step 1: Go to the micro-etl-app folder from the command line.
- Step 2: Run sam build to let the AWS SAM CLI process the template file and bundle the application code with any function dependencies.
- Step 3: Run the 'sam deploy --stack-name my-micro-etl --guided' command to deploy the process and save the parameters for future deployments.
- Step 4: Invoke the Lambda function and inspect the log simultaneously from the command line by using the below command.
aws lambda invoke --function-name FUNCTION_ARN out --log-type Tail --query 'LogResult' --output text | base64 -d
- Step 5: The base64 -d option shown above works on Linux. On macOS, you can use the base64 -D command instead.
- Step 6: Alternatively, you can invoke the Lambda function from the Lambda console and inspect its CloudWatch log group, named /aws/lambda/<function name>.
- Step 7: The URL for the generated file in the S3 bucket is shown on the log’s final line. It should look like the below output.
## FILE PATH
s3://micro-etl-bucket-XXXXXXXX/avg-price-property-uk.csv
- Step 8: You can use the AWS CLI to copy the file locally, inspect its content, and verify that it contains only rows from the range defined in app.py:
aws s3 cp s3://micro-etl-bucket-xxxxxxxx/avg-price-property-uk.csv local_file.csv
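If the base64 flag differs on your platform, the log tail from Step 4 can also be decoded in Python. The sketch below assumes a response dictionary shaped like the one that boto3's Lambda `invoke` call returns when `LogType="Tail"` is set, where `LogResult` holds the base64-encoded log. The helper names are hypothetical, and the sample log text reuses the lines shown in Step 7.

```python
import base64
import re


def decode_log_tail(invoke_response):
    # LogResult holds the base64-encoded tail of the execution log.
    encoded = invoke_response.get("LogResult", "")
    return base64.b64decode(encoded).decode("utf-8")


def find_s3_paths(log_text):
    # Pick out anything in the log that looks like an S3 URI.
    return re.findall(r"s3://\S+", log_text)


# Stand-in for a real response, built from the sample output in Step 7.
sample_log = "## FILE PATH\ns3://micro-etl-bucket-XXXXXXXX/avg-price-property-uk.csv\n"
fake_response = {"LogResult": base64.b64encode(sample_log.encode("utf-8")).decode("ascii")}

log_text = decode_log_tail(fake_response)
print(find_s3_paths(log_text))

# Usage sketch against AWS (needs boto3 and credentials):
#   import boto3
#   response = boto3.client("lambda").invoke(FunctionName="FUNCTION_ARN", LogType="Tail")
#   print(decode_log_tail(response))
```

Searching for the S3 URI instead of taking the last line keeps the sketch working even when the runtime appends its own summary lines after the function's output.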
Clean Up Resources
- You can delete the resources from the command line to avoid future charges.
aws cloudformation delete-stack --stack-name my-micro-etl
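The above command removes all the resources used in this article, including the S3 bucket. Note that CloudFormation can delete an S3 bucket only when it is empty, so you may need to empty the bucket first.
- Deactivate the Conda environment using the below command.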
conda deactivate
Conclusion
In this article, you learned to create a micro ETL pipeline for data refresh. Organizations use ETL pipelines to fetch data from a particular source, transform it into a specific format, and then store that data in a warehouse. With AWS, micro ETL processes are easy to build and deploy, and they provide a cost-effective mechanism for regularly transferring and managing a small amount of data.
Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of desired destinations with a few clicks. With its strong integration with 150+ sources (including 50+ free sources), Hevo Data allows you to not only export data from your desired data sources and load it to the destination of your choice, such as AWS Redshift, but also transform and enrich your data to make it analysis-ready, so that you can focus on your key business needs and perform insightful analysis using BI tools.
Want to take Hevo for a spin?
Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Skand is a dedicated Customer Experience Engineer at Hevo Data, specializing in MySQL, Postgres, and REST APIs. With three years of experience, he efficiently troubleshoots customer issues, contributes to the knowledge base and SOPs, and assists customers in achieving their use cases through Hevo's platform.