Working with AWS Batch Job: Comprehensive Guide to Kickoff Pipeline Jobs 101

on AWS, AWS Batch, AWS Batch Jobs, Docker • April 8th, 2022

AWS Batch is a widely used AWS service that lets you create and run pipeline jobs periodically or on demand. You can build, configure, and launch jobs from the AWS Batch console, and you can also run jobs from pre-built or custom Docker images. With a single Docker image, you can kick off multiple pipeline jobs, either periodically or on specific time schedules.

In this article, you will learn about AWS Batch and how to create and kick off AWS Batch pipeline jobs using Docker images.

Prerequisites

A fundamental understanding of data pipelines.

What is AWS Batch?

Introduced by AWS in 2017, AWS Batch is a fully managed batch processing platform that allows you to build and execute batch computing workloads on the AWS Cloud. Since AWS Batch is a fully managed service, it enables you to run batch computing workloads of any scale asynchronously across several servers. In other words, AWS Batch maintains the infrastructure for you, thereby saving you the time and effort of installing, administering, monitoring, and scaling your batch computing processes.

Furthermore, AWS Batch automatically allocates compute resources and optimizes workload distribution based on the quantity and scale of the workloads. Because AWS Batch provisions and manages the infrastructure needed for batch processing, you do not have to install or operate separate batch computing software.

How to Initiate and Launch Pipeline Jobs with AWS Batch

With AWS Batch, you can run pipeline jobs without installing and configuring any batch computing tools or server clusters, so you can spend more time evaluating data and addressing problems. The steps below walk through initiating and launching a pipeline job with AWS Batch.

1. Prerequisites

To kick off pipeline jobs with AWS Batch, you have to satisfy a few prerequisites. If this is your first time using AWS Batch, make sure you have a valid job queue and compute environment. The official AWS Batch documentation explains how to create both. In addition, you need a working Docker environment to build and register the Docker image that you will use in later steps to create pipeline jobs. You should also have the AWS CLI (Command Line Interface) installed and configured so that you can run commands against AWS services. Refer to the AWS CLI documentation to learn how to install and configure it.
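
Before moving on, you may want to confirm these prerequisites from a terminal. The following is a minimal sketch, assuming a bash shell; the helper names have_tool and check_tools are illustrative and not part of any AWS tooling. If both tools are found, it lists the compute environments and job queues that already exist in your configured account and region.

```shell
#!/bin/bash
# Succeeds if the named command-line tool is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Prints one status line per tool and fails if any tool is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if have_tool "$tool"; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

if check_tools docker aws; then
  # Read-only calls: list what already exists in the configured account and region.
  aws batch describe-compute-environments \
    --query 'computeEnvironments[].[computeEnvironmentName,state,status]' \
    --output text || echo "could not list compute environments; check credentials and region"
  aws batch describe-job-queues \
    --query 'jobQueues[].[jobQueueName,state,status]' \
    --output text || echo "could not list job queues; check credentials and region"
fi
```

An empty list from either call suggests that the corresponding resource still has to be created before you continue.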

2. Building the Fetch and Run Docker image

The fetch & run Docker image is built around a simple script that reads a few environment variables, uses the AWS CLI to download a job script (or zip file), and then executes it. To get the source, visit the “aws-batch-helpers” GitHub repository and download the code as a zip file, or clone the repository to get the most recent version. Then, navigate to the “fetch-and-run” folder, which contains two files: Dockerfile and fetch_and_run.sh.
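
To make the mechanism more concrete, here is a simplified sketch of the kind of logic such an entry-point script performs. It is not the actual fetch_and_run.sh from the repository. The function names are illustrative, zip handling is left out, and the sketch assumes that the first argument is the script's name and that the remaining arguments are passed through to the script. That assumption matches the command used later in this guide, where myjob.sh reads 60 as its first argument.

```shell
#!/bin/bash
# Simplified illustration only; the real fetch_and_run.sh differs in detail
# and also handles zip files, which this sketch leaves out.

# Check the two environment variables that tell the container what to fetch.
validate_request() {
  case "${BATCH_FILE_TYPE:-}" in
    script|zip) ;;
    *) echo "BATCH_FILE_TYPE must be 'script' or 'zip'" >&2; return 1 ;;
  esac
  case "${BATCH_FILE_S3_URL:-}" in
    s3://?*) ;;
    *) echo "BATCH_FILE_S3_URL must be an s3:// URL" >&2; return 1 ;;
  esac
}

# Download the job script, drop the first argument (the script's name),
# and run the script with the remaining arguments.
fetch_and_run_script() {
  local target
  target="$(mktemp)" || return 1
  aws s3 cp "$BATCH_FILE_S3_URL" "$target" || return 1
  chmod u+x "$target" || return 1
  if [ "$#" -gt 0 ]; then shift; fi
  "$target" "$@"
}

# Do nothing unless a URL is set, so reading this file has no side effects.
if [ -n "${BATCH_FILE_S3_URL:-}" ]; then
  validate_request || exit 1
  if [ "$BATCH_FILE_TYPE" = "script" ]; then
    fetch_and_run_script "$@"
  fi
fi
```

Keeping the validation separate from the download means a misconfigured job fails early with a readable message instead of a failed S3 call.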

  • First, build the “fetch-and-run” Docker image by running the command given below from inside the “fetch-and-run” folder. The trailing dot tells Docker to use the current folder as the build context.
docker build -t awsbatch/fetch_and_run .
  • The command prints each build step as it runs.
  • To confirm that the Docker image was built successfully, list your local images with the command given below. The newly created “awsbatch/fetch_and_run” image should appear in the output.
docker images
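
The download and build steps above can also be sketched as one script. This is a hedged example: the repository URL assumes the project is hosted under the awslabs organization on GitHub, and the run helper and build_fetch_and_run function are illustrative names. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/bash
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Clone the helpers repository (assumed URL), build the image from the
# fetch-and-run folder, and list the resulting image.
build_fetch_and_run() {
  run git clone https://github.com/awslabs/aws-batch-helpers.git &&
    run cd aws-batch-helpers/fetch-and-run &&
    run docker build -t awsbatch/fetch_and_run . &&
    run docker images awsbatch/fetch_and_run
}

build_fetch_and_run
```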

3. Creating an ECR repository

In the next step, create an Amazon ECR repository, which lets you store, manage, and delete Docker images. You will store the newly created “fetch-and-run” Docker image there and set access permissions so that AWS Batch can retrieve it while executing pipeline jobs.

  • First, navigate to the ECR console and click on “Create Repository.”
  • Then, enter “awsbatch/fetch_and_run” as the name of the ECR repository and click on “Next Step.”
  • The ECR repository is now created, and the console shows the commands for building, tagging, and pushing a Docker image to it.
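
If you prefer the command line to the console, the same repository can be created with the AWS CLI. The sketch below assumes the us-east-1 region used elsewhere in this guide; the run helper and create_repository function are illustrative names. It prints the command by default and runs it only when DRY_RUN=0 is set.

```shell
#!/bin/bash
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

REPO_NAME="awsbatch/fetch_and_run"
REGION="${REGION:-us-east-1}"   # assumption: the region used elsewhere in this guide

# Create the repository and print only its URI, which the push step needs.
create_repository() {
  run aws ecr create-repository \
    --repository-name "$REPO_NAME" \
    --region "$REGION" \
    --query repository.repositoryUri \
    --output text
}

create_repository
```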

Next, push the Docker image to the newly created “awsbatch/fetch_and_run” repository. Run the following commands to authenticate Docker with ECR, tag the image, and push it. Replace the AWS account number and region in the commands with your own.

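aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 012345678901.dkr.ecr.us-east-1.amazonaws.com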
docker tag awsbatch/fetch_and_run:latest 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest
docker push 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest
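
The commands above hard-code the account number and region. As a small sketch of how the image URI is assembled, the following script builds it from its parts so that you only change the values in one place. The function names ecr_registry and ecr_image_uri are illustrative, and 012345678901 is the placeholder account ID from this guide. The docker commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Registry host for an account ID and region.
ecr_registry() {
  echo "$1.dkr.ecr.$2.amazonaws.com"
}

# Full image URI: registry host, repository name, and tag.
ecr_image_uri() {
  echo "$(ecr_registry "$1" "$2")/$3:$4"
}

# Placeholder account ID from this guide. In a configured session you could use:
#   ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
ACCOUNT_ID="${ACCOUNT_ID:-012345678901}"
REGION="${REGION:-us-east-1}"

IMAGE_URI="$(ecr_image_uri "$ACCOUNT_ID" "$REGION" awsbatch/fetch_and_run latest)"
run docker tag awsbatch/fetch_and_run:latest "$IMAGE_URI"
run docker push "$IMAGE_URI"
```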

4. Creating a Pipeline Job Script and Uploading It to S3

  • Next, create the job script that the “fetch and run” image you registered in ECR will download and execute. Create a file called “myjob.sh” with the following sample content.
#!/bin/bash
# Print the start time, the arguments, and the environment of the job.
date
echo "Args: $@"
env
echo "This is my simple test job!."
# AWS Batch sets these variables inside the job's container.
echo "jobId: $AWS_BATCH_JOB_ID"
echo "jobQueue: $AWS_BATCH_JQ_NAME"
echo "computeEnvironment: $AWS_BATCH_CE_NAME"
# Sleep for the number of seconds given as the first argument.
sleep $1
date
echo "bye bye!!"
  • After saving the file, upload the script to your S3 bucket by executing the following command.
aws s3 cp myjob.sh s3://<bucket>/myjob.sh

5. Creating an IAM Role

Because the fetch and run image downloads the job script from Amazon S3 when it runs as an AWS Batch job, the job needs permission to read from S3. You grant that permission by creating an IAM role that the job can assume.

  • Navigate to the IAM console and choose Roles. Then, click on Create New Role. In the “Select type of trusted entity” section, choose AWS service.
  • In the list of services, select “Elastic Container Service.”
  • In the “Select your use case” section, select Elastic Container Service Task, and click on “Next: Permissions.”
  • You are redirected to the Attach Policy page. In the search bar, type “AmazonS3ReadOnlyAccess.” Then select the “AmazonS3ReadOnlyAccess” policy checkbox and click on “Next: Review.”
  • Enter batchJobRole as the name of the new role and choose Create Role. The console then displays the new role’s details.
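
The console steps above correspond roughly to two AWS CLI calls: one that creates the role with a trust policy for ECS tasks, and one that attaches the managed read-only S3 policy. The following is a sketch under those assumptions; the run helper is an illustrative name and the trust policy is written to a temporary file. The aws commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

ROLE_NAME="batchJobRole"
TRUST_POLICY_FILE="$(mktemp)"

# Trust policy that lets ECS tasks assume the role, matching the
# "Elastic Container Service Task" use case chosen in the console.
cat > "$TRUST_POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

run aws iam create-role \
  --role-name "$ROLE_NAME" \
  --assume-role-policy-document "file://$TRUST_POLICY_FILE"
run aws iam attach-role-policy \
  --role-name "$ROLE_NAME" \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```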

6. Creating a Job Definition

You have now created all of the resources needed to build a pipeline job in AWS Batch. Next, pull them together in a job definition that you can use to run one or more AWS Batch jobs.

  • Navigate to the AWS Batch console and choose the Job Definitions menu on the left side panel.
  • The “Create a job definition” section appears on the right side.
  • In the Job Definition field, enter “fetch_and_run.”
  • For the job role, select the batchJobRole role that you created in the previous step so that the job can read the script from S3.
  • In the Container image field, enter the URI of the image in the ECR repository. In this example, the URI is 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run.
  • Leave the Command field blank. For the vCPUs and Memory fields, enter 1 and 500 (MiB), respectively.
  • After filling in all the necessary fields, click on Create job definition.
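
For reference, a job definition like this one can also be registered from the AWS CLI. The sketch below is one possible equivalent, not the only way to do it. The role ARN is assembled from the placeholder account ID and the batchJobRole role created in step 5, and the helper names run and container_properties_json are illustrative. The aws command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Container properties: image URI, job role, 1 vCPU, and 500 MiB of memory.
# Arguments: image URI, role ARN.
container_properties_json() {
  cat <<EOF
{
  "image": "$1",
  "jobRoleArn": "$2",
  "resourceRequirements": [
    { "type": "VCPU", "value": "1" },
    { "type": "MEMORY", "value": "500" }
  ]
}
EOF
}

# Placeholder values from this guide; replace them with your own.
IMAGE_URI="012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run"
ROLE_ARN="arn:aws:iam::012345678901:role/batchJobRole"

run aws batch register-job-definition \
  --job-definition-name fetch_and_run \
  --type container \
  --container-properties "$(container_properties_json "$IMAGE_URI" "$ROLE_ARN")"
```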

7. Running a Pipeline Job

In this step, you submit and run a job that uses the fetch and run image to download and execute the job script.

  • In the AWS Batch console, click on the Jobs menu in the left side panel and select Submit Job.
  • In the Job name field, enter “script_test.”
  • Then, select the newly created fetch_and_run job definition from the dropdown menu in the Job definition field.
  • In the Job Queue field, select first-run-job-queue (or the job queue you created as a prerequisite) from the dropdown menu.
  • In the Command section, enter “["myjob.sh","60"]” and click on Validate command. The first element is the name of the script, and 60 is the argument that myjob.sh passes to sleep.
  • Next, add the following keys and values in the Environment Variables section.
    • Key=BATCH_FILE_TYPE, Value=script
    • Key=BATCH_FILE_S3_URL, Value=s3://<bucket>/myjob.sh. Don’t forget to use the correct URL for your file.
  • After filling in all the necessary fields, click on the “Submit Job” button.
  • Finally, check the job’s status in the console. A status of “SUCCEEDED” confirms that the job was submitted and ran to completion.
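
The same submission can be sketched with the AWS CLI, which is convenient when you want to script it. In this hedged example, the command and the two environment variables are passed as container overrides. The helper names run and container_overrides_json are illustrative, and s3://<bucket>/myjob.sh is the placeholder URL from step 4 that you must replace with your own. The aws command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Overrides for one job. Arguments: S3 URL, script name, sleep seconds.
container_overrides_json() {
  cat <<EOF
{
  "command": ["$2", "$3"],
  "environment": [
    { "name": "BATCH_FILE_TYPE", "value": "script" },
    { "name": "BATCH_FILE_S3_URL", "value": "$1" }
  ]
}
EOF
}

# Placeholder URL from this guide; replace <bucket> with your bucket name.
JOB_SCRIPT_URL="${JOB_SCRIPT_URL:-s3://<bucket>/myjob.sh}"

run aws batch submit-job \
  --job-name script_test \
  --job-queue first-run-job-queue \
  --job-definition fetch_and_run \
  --container-overrides "$(container_overrides_json "$JOB_SCRIPT_URL" myjob.sh 60)"

# After a real submission, the returned jobId can be polled, for example:
#   aws batch describe-jobs --jobs "$JOB_ID" --query 'jobs[0].status' --output text
```

Because only the overrides change from job to job, the same function can be reused to submit other scripts against the same job definition.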

By following the steps above, you created and executed a pipeline job using AWS Batch.

Conclusion

In this article, you learned about AWS Batch and how to create and kick off pipeline jobs with it. The article focused on creating a single job definition and a single job. However, you can run as many jobs as you need with the same job definition by uploading each job’s script to Amazon S3 and calling “SubmitJob” with the appropriate environment variables.
