Organizations today process and transform large amounts of data with ETL (extract, transform, and load) pipelines. But loading and transforming big data is time-consuming, and smaller projects often do not need to process vast amounts of data at all.
Instead, you can use micro ETL with the help of AWS Lambda to get the relevant data immediately. With AWS Lambda functions, you can trigger time-based events that extract, transform, and save the data into a central repository.
In this article, you will learn to create a micro ETL Data Pipeline with AWS Lambda functions. Data Pipelines and AWS Lambda are also discussed briefly.
What is Data Pipeline?
A Data Pipeline is a series of steps implemented in a specific order to process and transfer data from one system to another. The first step in the Data Pipeline is to extract data from the source as input. In data pipelining, each step's output serves as the input to the next step.
The Data Pipeline process consists of three main elements: the data source, the processing steps, and the final destination. A Data Pipeline allows users to transfer data from source to destination, applying modifications along the way.
A Data Pipeline is an umbrella term for data movement from one place to another, including ETL and ELT processes. However, it is important to note that a Data Pipeline does not necessarily mean that a transformation is carried out on the data.
Hevo Data, a Fully-managed Data Pipeline platform, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 150+ Data Sources straight into a Data Warehouse like AWS Redshift. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!
GET STARTED WITH HEVO FOR FREE
Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!
What is AWS Lambda?
Launched in 2014, AWS Lambda is a serverless computing service that allows you to run code for any application or backend service without managing servers. AWS Lambda handles administrative tasks such as CPU utilization, memory, and other resources on its own. It can connect with more than 200 AWS services and SaaS applications.
The term "serverless" computing refers to the fact that you don't need to provision or manage servers to run these functions. AWS Lambda is a fully managed service that handles all of your infrastructure requirements.
AWS Lambda users write functions, which are self-contained applications written in one of the supported languages and runtimes, and upload them to AWS Lambda, which then executes them quickly and flexibly.
Lambda Functions can be used to do anything from serving web pages to processing data streams to calling APIs and integrating with other AWS services.
AWS manages AWS Lambda’s entire infrastructure layer. Customers don’t have much control over how the system works, but they also don’t have to worry about updating the underlying machines, avoiding network contention, and so on—AWS handles all of that.
Each Lambda function runs in its own container. Lambda packages a function into a new container and runs it on an AWS-managed multi-tenant cluster of machines. Each function's container is assigned its required RAM and CPU capacity before the function begins to run. When the function completes, the billed compute is the RAM allocated at the start multiplied by the length of time the function was active.
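For example, with the 256 MB memory setting and 20-second timeout used later in this article, a run that lasts the full 20 seconds would be billed roughly as sketched below (compute charge only; real Lambda pricing also adds a small per-request fee):
# Illustrative billing arithmetic for one Lambda invocation (assumed numbers)
memory_gb = 256 / 1024        # 256 MB allocated, expressed in GB
duration_seconds = 20         # how long the function ran
gb_seconds = memory_gb * duration_seconds
print(gb_seconds)             # 5.0 GB-seconds of billable compute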
It also integrates with a variety of other AWS services, including API Gateway, DynamoDB, and RDS, and forms the foundation for AWS Serverless solutions.
AWS Lambda allows users to execute their code only when it is needed. Therefore, AWS Lambda can automatically scale from a few daily requests to thousands per second. Users must set the triggers in AWS Lambda functions to run the code.
Key Features of AWS Lambda
- Lambda extensions: Lambda extensions enhance the Lambda functions by combining them with your selected tools for security, monitoring, observability, and governance.
- Integration: AWS Lambda can integrate with various AWS services like S3, DynamoDB, API Gateway, and more for developing functional applications.
- Code Signing: Code Signing adds trust and integrity controls that ensure only unmodified code published by authorized developers is deployed to your Lambda functions.
- Function Blueprint: A Lambda function blueprint includes sample code that shows how to use the Lambda function with other AWS services or third-party applications. Blueprints also contain preconfigured settings for Python and Node.js.
- Reduced Expenses: With AWS Lambda, you only pay for the resources you use. The pay-as-you-go model prevents additional costs of unused time or storage.
- Functions defined as Container Images: AWS allows users to use their favorite container image tooling, processes, and dependencies for developing, testing, and deploying their Lambda functions.
What is a Micro ETL Pipeline?
A micro ETL pipeline is a short process that you can schedule to handle small amounts of data. At times, you only need to ingest, transform, and load a subset of a larger dataset, without provisioning expensive and complex computational resources. Micro ETL processes are helpful when you deal with small data feeds that need to be refreshed regularly, such as daily currency exchange rates, five-minute weather measurements, hourly stock availability for a small category of products, and more.
Getting Started with Data Pipeline Using Lambda
Whether you are building a data lake, an analytics pipeline, or a simple data feed, you will often need to process and refresh small amounts of data. In this article, you will build and deploy a micro extract, transform, and load (ETL) pipeline to handle this requirement. You will also configure a reusable Python environment to build and deploy micro ETL pipelines using your own data source.
Micro ETL processes work seamlessly with the serverless architecture. Therefore, we will use the AWS Serverless Application Model (SAM) in this article.
You need a local environment to inspect the data, experiment, and deploy the ETL process with the AWS SAM CLI (Command Line Interface). The deployment consists of a time-based event that triggers an AWS Lambda function, which collects, transforms, and stores the data in an Amazon S3 bucket.
Follow the below steps to build a Data Pipeline using AWS Lambda:
Prerequisites
- A basic understanding of the need for data migration.
- An AWS account with permission to create Lambda functions, S3 buckets, and CloudFormation stacks.
- The AWS SAM CLI and AWS CLI installed and configured locally.
- Conda (to recreate the provided Python environment) and Git (to clone the sample repository).
Data Pipeline Lambda: Download the Code
Download the code from GitHub with the below command.
git clone https://github.com/aws-samples/micro-etl-pipeline.git
Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo's automated, No-code platform empowers you with everything you need for a smooth data replication experience.
Check out what makes Hevo amazing:
- Fully Managed: Hevo requires no management and maintenance as it is a fully automated platform.
- Data Transformation: Hevo provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Faster Insight Generation: Hevo offers near real-time data replication so you have access to real-time insight generation and faster decision making.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (with 40+ free sources) that can help you scale your data infrastructure as required.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-day free trial!
Data Pipeline Lambda: Setup the Environment
The GitHub code comes with a preconfigured Conda environment, so you do not need to waste time installing the dependencies. A Conda environment is a directory containing a specific collection of Conda packages. You can use the environment.yml file to get the same dependencies.
- Create the environment using the below code.
conda env create -f environment.yml
- Activate the environment.
conda activate aws-micro-etl
Data Pipeline Lambda: Analyze Data with Jupyter Notebook
- In this article, you will run the Jupyter notebook locally.
- After activating the environment, you can launch your Jupyter notebook.
- Run the below command.
jupyter notebook
- This will open a browser window with the Jupyter dashboard in the root project folder.
- Select the aws_mini_etl_sample.ipynb file.
- The above Jupyter notebook contains a sample micro ETL process. The ETL process leverages publicly available data from the HM Land Registry, containing the average price by property type series.
- The notebook demonstrates several functional scenarios (see the sketch after this list):
- The possibility to support partial requests and therefore fetch only a small part of a larger file.
- The ability to inspect and manipulate data to achieve the right outcome.
- Support for file types other than CSV.
- A simple way to save a CSV file directly into an S3 bucket.
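Here is a hypothetical exploration of the partial-request scenario, roughly what you would run in the notebook. The real data URL is truncated in this article, so a placeholder is used:
import requests

url = 'http://publicdata.landregistry.gov.uk/...'  # replace with the full data source URL

# Check the total file size before deciding how many trailing bytes to fetch.
head = requests.head(url, allow_redirects=True)
print('Full file size (bytes):', head.headers.get('Content-Length'))

# Fetch only the last ~2 MB of the file instead of downloading it entirely.
res = requests.get(url, headers={'Range': 'bytes=-2000000'}, allow_redirects=True)
print('Status code:', res.status_code)  # 206 means the server honored the byte range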
Data Pipeline Lambda: Inspect the Function
- The downloaded code includes an additional folder called micro-etl-app, which contains the ETL process defined with an AWS SAM template, ready to be deployed as a Lambda function.
- AWS SAM provides the syntax for expressing functions, APIs, databases, and event source mappings.
- Define the application and model it by using YAML with a few lines per resource. AWS SAM transforms and expands the AWS SAM syntax into the AWS CloudFormation syntax, enabling you to build serverless applications faster.
- The AWS SAM app consists of the below files.
- template.yml: It consists of the configuration to build and deploy the Lambda function.
- app/app.py: It consists of the application’s code from the Jupyter notebook.
- app/requirements.txt: It consists of the Python libraries needed for the Lambda function to run.
- The file template.yml consists of the details for deploying and building the ETL process, such as permissions, schedule rules, variables, and more.
- For this type of micro-application, it is essential to allocate the right amount of memory and an appropriate timeout to avoid latency issues or resource restrictions. Under the Globals statement, the memory and timeout settings for the Lambda function are defined, as shown below.
Globals:
  Function:
    Timeout: 20
    MemorySize: 256
- Other necessary settings are defined inside the Properties statement, such as the environment variables, which allow you to control settings like the URL to fetch without redeploying the code.
Environment:
  Variables:
    Url: 'http://publicdata.landregistry.gov.uk/market
    S3Bucket: !Ref Bucket
    LogLevel: INFO
    Filename: 'avg-price-property-uk.csv'
- The definition of a cron event is under the Events statement, triggering the Lambda function every day at 8 a.m. (UTC).
Events:
  UpdateEvent:
    Type: Schedule
    Properties:
      Schedule: cron(0 8 * * ? *)
- The initial section of the app.py file contains required dependencies, environment variables, and other supporting statements. The main code is inside the Lambda handler.
# Imports (requests, pandas, and io are used later in this article;
# os and boto3 are assumed for environment variables and S3 access)
import io
import os

import boto3
import pandas as pd
import requests

# Environment variables (names defined in template.yml)
URL = os.environ['Url']
S3_BUCKET = os.environ['S3Bucket']
LOG_LEVEL = os.environ['LogLevel']
FILENAME = os.environ['Filename']

# Lambda function handler
def lambda_handler(event, context):
    # Fetch, transform, and store the data (see the statements below)
    ...
- app.py contains comments that explain each statement. The first statement in the app.py file uses the requests library to fetch the last 2,000,000 bytes of the data source file defined in the URL environment variable.
res = requests.get(URL, headers=range_header(-2000000), allow_redirects=True)
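The range_header helper is not reproduced in this article; a minimal sketch of what it presumably does, assuming it simply builds an HTTP Range header that requests the trailing bytes of the file:
def range_header(last_bytes):
    # Build an HTTP Range header for a suffix request, e.g. range_header(-2000000)
    # returns {'Range': 'bytes=-2000000'}, asking for the last 2,000,000 bytes.
    return {'Range': f'bytes={last_bytes}'}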
- The second statement creates a pandas DataFrame directly from the source stream, using the skiprows parameter to remove the first row. The first row is removed because it is difficult to fetch the precise beginning of a row with a byte range. The statement then assigns the predefined column headers, which are missing from this chunk of the file.
df = pd.read_csv(io.StringIO(res.content.decode('utf-8')), engine='python', error_bad_lines=False, names=columns, skiprows=1)
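The remaining statements of the handler are not reproduced in this article, but they filter the DataFrame and upload the result to the S3 bucket. A hedged sketch, assuming a date-based filter (the actual column name and cutoff live in the repository's app.py) and the S3_BUCKET and FILENAME environment variables shown earlier:
# Keep only the rows in the desired range (column name and cutoff are assumptions).
df_filtered = df[df['Date'] >= '2021-01-01']

# Serialize the filtered DataFrame to CSV in memory and upload it to the bucket.
csv_buffer = io.StringIO()
df_filtered.to_csv(csv_buffer, index=False)
boto3.client('s3').put_object(
    Bucket=S3_BUCKET,
    Key=FILENAME,
    Body=csv_buffer.getvalue().encode('utf-8'),
)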
- The last file in the application is requirements.txt, which the AWS SAM CLI uses to build and package the dependencies needed for the Lambda function to work correctly. You can use additional package libraries in your application, but you must define them in requirements.txt. A plausible set of contents is sketched below.
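The article does not list the file's contents, but given the libraries used in app.py it presumably contains at least the following (boto3 can usually be omitted because the Lambda Python runtime already bundles it):
requests
pandas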
Data Pipeline Lambda: Build and Deploy ETL
To build and deploy the ETL process, follow the below steps.
- Step 1: Go to the micro-etl-app from the command line.
- Step 2: Run sam build to let the AWS SAM CLI process the template file and bundle the application code with its dependencies.
- Step 3: Run the sam deploy --stack-name my-micro-etl --guided command to deploy the process and save the parameters for future deployments.
- Step 4: Invoke the Lambda function and inspect its log from the command line by using the below command (a boto3 alternative is sketched after these steps).
aws lambda invoke --function-name FUNCTION_ARN out --log-type Tail --query 'LogResult' --output text | base64 -d
- Step 5: The base64 -d flag is available on Linux. On macOS, use the base64 -D command instead.
- Step 6: You can also invoke the Lambda function from the Lambda console and inspect its CloudWatch log group, named /aws/lambda/<function name>.
- Step 7: The URL for the generated file in the S3 bucket is shown on the log’s final line. It should look like the below output.
## FILE PATH
s3://micro-etl-bucket-XXXXXXXX/avg-price-property-uk.csv
- Step 8: You can use the AWS CLI to download the file and verify that it contains only rows from the range defined in app.py:
aws s3 cp s3://micro-etl-bucket-xxxxxxxx/avg-price-property-uk.csv local_file.csv
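As an alternative to the CLI invocation in Step 4, you can invoke the function and decode its execution log with boto3. A minimal sketch, assuming FUNCTION_ARN is replaced with the ARN (or name) output by sam deploy:
import base64

import boto3

# Invoke the deployed Lambda function and request the tail of its execution log.
client = boto3.client('lambda')
response = client.invoke(
    FunctionName='FUNCTION_ARN',  # replace with your function's ARN or name
    LogType='Tail',               # returns the last 4 KB of the log, base64-encoded
)

# Decode and print the log so you can confirm the file was written to S3.
print(base64.b64decode(response['LogResult']).decode('utf-8'))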
Data Pipeline Lambda: Clean Up Resources
- You can delete the resources from the command line to avoid future charges.
aws cloudformation delete-stack --stack-name my-micro-etl
The above command removes all the resources used in this article, including the S3 bucket.
- Deactivate the Conda environment using the below command.
conda deactivate
Conclusion
In this article, you learned how to create a micro ETL pipeline that refreshes data on a schedule. Organizations use ETL pipelines to fetch data from a particular source, transform it into a specific format, and then store that data in a warehouse. With AWS, micro ETL processes are easy to build and deploy, providing a cost-effective mechanism for regularly transferring and managing small amounts of data.
Visit our website to explore Hevo.
Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of desired destinations with a few clicks. With its strong integration with 150+ sources (including 50+ free sources), Hevo allows you to not only export data from your desired data sources and load it to the destination of your choice, such as AWS Redshift, but also transform and enrich your data to make it analysis-ready, so that you can focus on your key business needs and perform insightful analysis using BI tools.
Want to take Hevo for a spin?
Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.