Azure Batch Job Processing: 2 Critical Methods

on Microsoft Azure • May 20th, 2022

Microsoft Azure is a widely used Cloud Computing platform offered by Microsoft for application development. It provides many Cloud services, including analytics, compute, networking, and storage. Azure Batch is a part of Microsoft’s Azure Cloud offering that lets you run large-scale parallel batch workloads.

This blog covers the different aspects and methods of scheduling and running an Azure Batch Job. It also sheds light on the key features, benefits, and working of Azure Batch processing before diving into the nitty-gritty of Azure Batch Job processing methods.

What is Azure Batch?

Azure Batch Job: Azure Batch Architecture

Azure Batch was designed to run batch computing jobs across any number of virtual machines, which act as scalable nodes. It is a good fit for situations where a memory-intensive or CPU-intensive process can be split into multiple tasks that run independently and in parallel.

Here are a few examples of workloads that can be executed by leveraging Azure Batch Jobs:

  • Deep Learning
  • Image Rendering and Processing
  • Engineering Simulations
  • Execution of Software Tests
  • ETL (Extract-Transform-Load Process)

A few key concepts you need to be acquainted with before diving into Azure Batch Job deployment methods are listed below:

  • Azure Batch Pool: A compute pool is a collection of nodes that your application runs on. It is the top-level component of Azure Batch, and it controls scaling, the number of nodes, node allocation, installation of applications on the nodes, monitoring, and data distribution.
  • Azure Batch Node: An Azure Batch node is a single Azure Virtual Machine that processes a portion of your workload in Azure Batch. You choose the size of the node, which determines its memory capacity and number of CPU cores. You can use nodes to run any script or executable supported by the installed system environment.

Key Features of Azure Batch Processing

Here are a few key features of Azure Batch Job Processing:

  • Azure Batch lets you fully configure the nodes yourself, and it supports Docker configurations.
  • It lets you run large-scale parallel workloads at a low cost by leveraging low-priority VMs.
  • Azure Batch pools can be auto-scaled to provide more nodes as your requirements grow. For this, Azure Batch evaluates a formula that you define, for instance, one that increases the number of compute nodes if more than X tasks are queued.
  • With Azure Batch, you can run many types of nodes: Windows or Linux nodes, GPU instances, and Docker containers.
  • It also lets you monitor your jobs interactively with Batch Explorer or Application Insights.
  • You can create pools and execute jobs through an intuitive code interface in Python or R (the latter through the doAzureParallel package).
  • Azure Batch can easily integrate with Data Lake Storage and Blob storage to fetch data for any given task.
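
To make the auto-scaling rule above more concrete, here is a minimal shell sketch of the decision such a formula expresses: add a node when more than a threshold of tasks are queued, up to a cap. This is not Batch's own autoscale formula syntax. The function name `target_nodes`, its parameters, and the one-node step are all hypothetical and chosen only for illustration.

```shell
# Hypothetical illustration of an autoscale rule; NOT Batch's formula language.
# Usage: target_nodes <queued_tasks> <threshold> <current_nodes> <max_nodes>
# The ( ) function body runs in a subshell, so the variables stay local.
target_nodes() (
  queued="$1"; threshold="$2"; current="$3"; max="$4"
  target="$current"
  # Scale out by one node when the queue is longer than the threshold.
  if [ "$queued" -gt "$threshold" ]; then
    target=$((current + 1))
  fi
  # Never go above the configured ceiling.
  if [ "$target" -gt "$max" ]; then
    target="$max"
  fi
  echo "$target"
)
```

For example, `target_nodes 12 10 2 5` asks for one more node than the current two, while a queue of five tasks would leave the pool size unchanged.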

Simplify ETL Using Hevo’s No-Code Data Pipeline

Hevo Data, a Fully-managed Data Ingestion solution, can help you automate, simplify & enrich your ingestion process in a few clicks. With Hevo’s out-of-the-box connectors and blazing-fast Data Pipelines, you can extract & aggregate data from 100+ Data Sources (including 40+ Free Sources) such as Azure straight into your Data Warehouse, Database, or any destination.

Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!

What are Azure Batch Jobs and Tasks?

In Azure Batch, a task is a unit of computation, and a job is a collection of these tasks. A job manages how its tasks perform computation on the compute nodes in a pool.

A job specifies the pool in which the work is to be executed. You can create a new pool for every job or use one pool for several jobs. For jobs that belong to a job schedule, you can create a pool for each job, or use one pool for all jobs linked with that schedule.

Job Priority

You can also assign an optional priority to each job you create. The Batch service uses the job’s priority value to determine the order of scheduling (for all tasks within the job) in each pool.

To update the priority of a job, you can call the ‘Update the properties of a job’ operation (Batch REST) or modify the CloudJob.Priority property (Batch .NET). Priority values range from -1000 (lowest priority) to 1000 (highest priority).

In the same pool, higher-priority jobs have scheduling precedence over lower-priority jobs. However, tasks in lower-priority jobs that are already running won’t be preempted by tasks in a higher-priority job. Jobs that have the same priority level have an equal chance of being scheduled, and the ordering of task execution isn’t defined.
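
As a rough illustration of these rules, the shell sketch below checks that a priority falls inside the documented range and orders a list of jobs the way scheduling precedence would. Both helper names (`is_valid_priority`, `by_priority`) and the sample job IDs are hypothetical. The sketch does not model the fact that already-running tasks are not preempted, and the relative order of jobs with equal priority in its output carries no meaning.

```shell
# Hypothetical helpers; they only mirror the rules described in the text.

# Succeeds (exit status 0) when $1 is an integer between -1000 and 1000.
is_valid_priority() {
  case "$1" in
    ''|-|*[!0-9-]*|?*-*) return 1 ;;  # empty, lone "-", non-digits, or "-" not in front
  esac
  [ "$1" -ge -1000 ] && [ "$1" -le 1000 ]
}

# Reads "job_id priority" lines on stdin and prints them highest priority first.
by_priority() {
  sort -k2,2nr
}
```

Piping lines such as `nightly 0`, `urgent 500`, and `cleanup -10` through `by_priority` lists `urgent` first, since its tasks would be scheduled ahead of the others in the same pool.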

Job Constraints

You can use job constraints to specify certain limits for your jobs:

  • You can set the maximum number of task retries as a constraint, including whether a task is never retried or always retried. Retrying a task means that if the task fails, it is added back to the queue to be executed again.
  • You can also specify a maximum wallclock time, so that if a job runs longer than this limit, the job and all of its tasks are terminated.
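
The retry constraint can be pictured with a small shell sketch. It assumes the convention that a maximum retry count of 0 means a task is never retried and -1 means it is retried without limit. The function name `should_retry` is hypothetical, and the sketch only models the decision, not how Batch requeues the task.

```shell
# Hypothetical model of the retry decision for a failed task.
# Usage: should_retry <retries_so_far> <max_task_retry_count>
# Assumes 0 = never retry and -1 = retry without limit.
should_retry() (
  retries="$1"; max="$2"
  if [ "$max" -eq -1 ]; then
    echo "retry"      # no limit configured
  elif [ "$retries" -lt "$max" ]; then
    echo "retry"      # still under the limit, so requeue the task
  else
    echo "give-up"    # limit reached (always the case when max is 0)
  fi
)
```

With a limit of 3, a task that has already been retried twice is queued again, while one that has been retried three times is not.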

What makes Hevo’s ETL Process Best-In-Class

Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo’s automated, No-code platform empowers you with everything you need to have for a smooth data replication experience.

Check out what makes Hevo amazing:

  • Fully Managed: Hevo requires no management and maintenance as it is a fully automated platform.
  • Data Transformation: Hevo provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Faster Insight Generation: Hevo offers near real-time data replication so you have access to real-time insight generation and faster decision making. 
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (with 40+ free sources) that can help you scale your data infrastructure as required.
  • Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Working with Azure Pipeline Trigger Batch

Since Azure Batch is a non-visual tool, you don’t have an actual user interface to deal with. Once you create a Batch account in Microsoft Azure, you can create compute pools:

  • For starters, Azure Batch assumes that there is data somewhere that needs to be processed. Usually, this data is located in Azure Data Lake Store or Azure Blob Storage. Therefore, your first task is to make sure that your data is uploaded to one of these storage services.
  • Once you’ve uploaded the data, you can start creating a compute pool. A compute pool is a pool of one or more compute nodes to which you can assign jobs. While setting up the compute pool, you’ll be asked for the name of the pool and what kind of nodes the pool should contain (which software is installed, which OS, linked to which Azure storage, etc.).
  • As a result, all the compute nodes within a compute pool are identical. They are Microsoft Azure VMs that can be configured to your requirements:
    • You can base them on an Azure VM image, a Docker image, or a custom image.
    • These nodes can be either Windows or Linux nodes.
    • They can be either low-priority or dedicated nodes.

Once you’ve set up the compute pool and its nodes, you can assign work to it. Work takes the form of Jobs and Tasks. After you submit a job, Azure Batch automatically allocates its tasks to the available nodes. Each node can take on one or more tasks at a time, depending on the number of cores the VM has. Finally, once all the tasks are completed, the job is marked as complete and the compute nodes are ready to execute another job.

Azure Batch Job: Azure Pipeline Trigger Batch

Running and Scheduling Azure Batch Jobs

Here are the different methods you can deploy to schedule and run Azure Batch Jobs:

Running your Azure Batch Job with Azure Portal

Here are the steps involved in running your Azure Batch Job with Azure Portal:

Step 1: Creating a Batch Account

  • In the Microsoft Azure portal, you can choose ‘Create a resource’. Next, type in “batch service” in the search box, then choose Batch service.
Azure Batch Job: Creating a Batch Account
  • Select the Create option and then from the Resource Group field, you can choose the ‘Create New’ option and type in a name for your resource group.
  • Next, enter a value for your Account Name. This name must be unique within the Azure location you pick. It can contain only lowercase letters and numbers, and it must be between 3 and 24 characters long.
  • Under the Storage Account option, you can click on the ‘Select a storage account’ option and then select an existing storage account or make a new one.
  • You can leave all the other settings as-is. Next, you need to select the ‘Review+create’ option followed by the ‘Create’ option to make the Batch account. When the Deployment Succeeded message pops up, go to the Batch account that you made. 
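
If you plan to script account creation later, the naming rule from the steps above is easy to check up front. The following is a minimal shell sketch of that check. The function name `is_valid_account_name` is hypothetical, and the check covers only the length and character rules, not uniqueness within the Azure location.

```shell
# Hypothetical helper: succeeds when $1 is 3 to 24 characters long and
# contains only lowercase letters and digits. Uniqueness is NOT checked.
is_valid_account_name() {
  # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]{3,24}$'
}
```

A name such as `mybatchaccount` passes, while `MyBatch` (uppercase letters) and `my-batch` (a hyphen) are rejected.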

Step 2: Generating a Pool of Compute Nodes

  • In the Batch account, select Pools > Add and enter a Pool ID called mypool.
  • Under Operating System, use the settings listed below:
Setting     Value
----------  ------------------------------
Image Type  Marketplace
Publisher   microsoftwindowsserver
Offer       windowsserver
Sku         2019-datacenter-core-smalldisk
  • Next, you can scroll down to enter the Scale and Node Size settings. The suggested node size provides a good balance of performance vs cost for this example.
Setting                 Value
----------------------  --------------
Node pricing tier       Standard_A1_v2
Target dedicated nodes  2
  • You can keep the defaults for the remaining settings, and then select OK to generate the pool.

Step 3: Creating a Job

  • Now that you have a pool, you can create a job to run on it. A Batch job is simply a logical group of one or more tasks. A job consists of settings common to the tasks, such as the pool to run tasks on and the priority. The jobs won’t have tasks until you make them.
  • In the Batch Account view, select Jobs > Add.
  • Type in a Job ID called myjob. Similarly, in the Pool option, select mypool.
  • You can keep the defaults for the remaining settings, and select OK. 

Step 4: Creating Tasks

  • Next, select the job to open the Tasks page. You’ll be creating sample tasks here to run in the job. Generally, you create multiple tasks that Batch queues and distributes to run on the compute nodes. In this example, you create two identical tasks. Each task runs a command line that displays the Batch environment variables on a compute node and then waits 90 seconds.
  • When you leverage Azure Batch Jobs, you’ll need to specify your script or app on the command line. Azure Batch Jobs provides various ways to deploy scripts and apps to compute nodes.
  • To create the first task, you first need to select the Add option. Next, you’ll have to enter a Task ID called mytask.
  • In the Command line field, enter the command cmd /c "set AZ_BATCH & timeout /t 90 > NUL". Keep the defaults for the remaining settings, and select the Submit button.
  • Repeat the steps mentioned above to create a second task. You just need to enter a different Task ID, such as mytask2, while using the same command line.

Step 5: Viewing Task Output

  • The example tasks you’ve created will be completed in a couple of minutes. To look at the output of a completed task, you simply need to select the task and choose the file stdout.txt to look at the standard output of the task. The contents will resemble the following example:
Azure Batch Job: Viewing Task Output
  • The contents show the Azure Batch environment variables that are set on the node. When you create your own Batch jobs and tasks, you can reference these environment variables in task command lines, and in the scripts and apps run by the command lines.

Step 6: Cleaning up Resources

  • You are charged for the pool while the nodes are running, even if no jobs are scheduled. If you don’t need the pool anymore, you can delete it.
  • In the account view, choose the Pools option and select the name of the pool you want to delete. Then, select the Delete button.
  • When you delete the pool, all task output on the nodes is deleted with it.

Running your Azure Batch Job with Azure CLI

Here are the steps involved in running your Azure Batch Job with Azure CLI:

Step 1: Creating a Resource Group

  • You can create a resource group with the ‘az group create’ command. An Azure resource group is a logical container into which Azure resources are deployed and managed.
  • The following example creates a resource group called QuickstartBatch-rg in the eastus2 location.
az group create \
    --name QuickstartBatch-rg \
    --location eastus2

Step 2: Creating a Storage Account

  • You can connect an Azure storage account with your Batch account. Although the storage account isn’t essential to the quickstart, it can be leveraged to deploy applications and store output and input data for most real-world workloads. You can use the ‘az storage account create’ command to create a storage account in your resource group.
az storage account create \
    --resource-group QuickstartBatch-rg \
    --name mystorageaccount \
    --location eastus2 \
    --sku Standard_LRS

Step 3: Creating a Batch Account

  • You can create a Batch account with the ‘az batch account create’ command. This step is essential because you need an account to create compute resources (pools of compute nodes) and Batch jobs.
  • The following example creates a Batch account called mybatchaccount in QuickstartBatch-rg and links it to the storage account you created.
az batch account create \
    --name mybatchaccount \
    --storage-account mystorageaccount \
    --resource-group QuickstartBatch-rg \
    --location eastus2
  • To create and manage compute pools and jobs, you need to authenticate with Batch. Log in to the account with the ‘az batch account login’ command. After you log in, your az batch commands use this account context:
az batch account login \
    --name mybatchaccount \
    --resource-group QuickstartBatch-rg \
    --shared-key-auth

Step 4: Generating a Pool of Compute Nodes

  • Now that you have a Batch account, you need to create a sample pool of Linux compute nodes with the az batch pool create command. The following example creates a pool called mypool of 2 Standard_A1_v2 nodes running Ubuntu 16.04 LTS. The recommended node size provides a good balance of performance versus cost for this quick example:
az batch pool create \
    --id mypool --vm-size Standard_A1_v2 \
    --target-dedicated-nodes 2 \
    --image canonical:ubuntuserver:16.04-LTS \
    --node-agent-sku-id "batch.node.ubuntu 16.04"
  • Batch creates the pool immediately, but it takes a few minutes to allocate and start the compute nodes. During this time, the pool is in the resizing state. The pool is ready to run tasks when the allocation state is steady and all the nodes are running. To see the status of the pool, run the az batch pool show command. This command shows all the properties of the pool, and you can query for specific properties. The following command gets the allocation state of the pool:
az batch pool show --pool-id mypool \
    --query "allocationState"
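
Rather than re-running that command by hand, you could poll until the pool leaves the resizing state. The sketch below is one way to do that, under a few assumptions: the helper names `wait_for_steady` and `mypool_state` are hypothetical, the default of 60 polls at 10-second intervals is arbitrary, and `--output tsv` is used so that the state comes back without JSON quotes.

```shell
# Hypothetical polling helper, not part of the Azure CLI.
# Usage: wait_for_steady <state_command> [max_polls] [delay_seconds]
# <state_command> is any command or function that prints the allocation state.
wait_for_steady() (
  get_state="$1"; max_polls="${2:-60}"; delay="${3:-10}"
  i=1
  state=""
  while [ "$i" -le "$max_polls" ]; do
    state="$("$get_state")"
    if [ "$state" = "steady" ]; then
      echo "steady after $i poll(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up in state: $state" >&2
  return 1
)

# State command for the pool created above (assumes you are logged in to Batch).
mypool_state() {
  az batch pool show --pool-id mypool --query "allocationState" --output tsv
}

# Example call, left commented out so that nothing runs against Azure by accident:
# wait_for_steady mypool_state
```

Passing the state command as an argument keeps the loop independent of the Azure CLI, so it can be tried out with a stub before pointing it at a real pool.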

Step 5: Creating a Job

  • Now that you have a pool, you need to create a job to run on it. A Batch job is a logical group of one or more tasks. A job includes settings common to the tasks, such as the priority and the pool to run the tasks on. Create a Batch job with the az batch job create command.
  • The following example creates a job called myjob on the pool mypool. Initially, the job contains no tasks:
az batch job create \
    --id myjob \
    --pool-id mypool

Step 6: Creating Tasks

  • Next, use the az batch task create command to create some tasks to run in the job. In this example, you create four identical tasks. Each task runs a command line that displays the Batch environment variables on a compute node, and then waits for 90 seconds.
  • When you use Azure Batch, this command line is where you specify your script or app. Azure Batch offers several ways to deploy scripts and apps to compute nodes.
  • The following bash script creates four parallel tasks (mytask1 to mytask4):
for i in {1..4}
do
   az batch task create \
    --task-id mytask$i \
    --job-id myjob \
    --command-line "/bin/bash -c 'printenv | grep AZ_BATCH; sleep 90s'"
done
  • The command output shows the settings for each of the tasks. Batch distributes the tasks to the compute nodes.

Step 7: Viewing Task Status

  • After you create a task, Azure Batch queues it to run on the pool. Once a node is available to execute it, the task runs.
  • You can use the az batch task show command to look at the status of the Batch tasks. The following example shows details about mytask1 running on one of the pool nodes.
az batch task show \
    --job-id myjob \
    --task-id mytask1
  • The command output includes many details, but take note of the exitCode of the task command line and the nodeId. An exitCode of 0 indicates that the task command line completed successfully. The nodeId indicates the ID of the pool node on which the task ran.
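
In a script, the exit code can be turned into a short status message. The sketch below shows one way to interpret it. The function name `describe_exit_code` is hypothetical. Two further points are assumptions you should verify against your own command output: that the exit code is absent (shown as null or None) until the task has finished, and that the JMESPath query in the comment matches where exitCode appears.

```shell
# Hypothetical helper that interprets a task exit code.
# Usage: describe_exit_code <exit_code>
# Assumption: an empty value, "null", or "None" means the task has not finished yet.
describe_exit_code() {
  case "$1" in
    ''|null|None) echo "not finished" ;;
    0)            echo "succeeded" ;;
    *)            echo "failed (exit code $1)" ;;
  esac
}

# Possible use against a real task. The query path is an assumption, so check it
# against the JSON printed by az batch task show before relying on it:
# code="$(az batch task show --job-id myjob --task-id mytask1 \
#     --query "executionInfo.exitCode" --output tsv)"
# describe_exit_code "$code"
```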

Step 8: Viewing Task Output

  • To list the files generated by a task on a compute node, you can utilize the az batch task file list command. Here’s a code snippet for the same:
az batch task file list \
    --job-id myjob \
    --task-id mytask1 \
    --output table
  • This is what the output looks like:
Name        URL                                                                                         Is Directory      Content Length
----------  ------------------------------------------------------------------------------------------  --------------  ----------------
stdout.txt  https://mybatchaccount.eastus2.batch.azure.com/jobs/myjob/tasks/mytask1/files/stdout.txt  False                  695
certs       https://mybatchaccount.eastus2.batch.azure.com/jobs/myjob/tasks/mytask1/files/certs       True
wd          https://mybatchaccount.eastus2.batch.azure.com/jobs/myjob/tasks/mytask1/files/wd          True
stderr.txt  https://mybatchaccount.eastus2.batch.azure.com/jobs/myjob/tasks/mytask1/files/stderr.txt  False                     0
  • To download one of the output files to a local directory, use the az batch task file download command. In this example, the task output is in stdout.txt.
az batch task file download \
    --job-id myjob \
    --task-id mytask1 \
    --file-path stdout.txt \
    --destination ./stdout.txt
  • You can view the contents of stdout.txt in an editor. The contents show the Azure Batch environment variables that are set on the node. When you create your own Batch jobs, you can reference these environment variables in task command lines, and in the scripts and apps run by the command lines. For instance:
AZ_BATCH_TASK_DIR=/mnt/batch/tasks/workitems/myjob/job-1/mytask1
AZ_BATCH_NODE_STARTUP_DIR=/mnt/batch/tasks/startup
AZ_BATCH_CERTIFICATES_DIR=/mnt/batch/tasks/workitems/myjob/job-1/mytask1/certs
AZ_BATCH_ACCOUNT_URL=https://mybatchaccount.eastus2.batch.azure.com/
AZ_BATCH_TASK_WORKING_DIR=/mnt/batch/tasks/workitems/myjob/job-1/mytask1/wd
AZ_BATCH_NODE_SHARED_DIR=/mnt/batch/tasks/shared
AZ_BATCH_TASK_USER=_azbatch
AZ_BATCH_NODE_ROOT_DIR=/mnt/batch/tasks
AZ_BATCH_JOB_ID=myjob
AZ_BATCH_NODE_IS_DEDICATED=true
AZ_BATCH_NODE_ID=tvm-257509324_2-20180703t215033z
AZ_BATCH_POOL_ID=mypool
AZ_BATCH_TASK_ID=mytask1
AZ_BATCH_ACCOUNT_NAME=mybatchaccount
AZ_BATCH_TASK_USER_IDENTITY=PoolNonAdmin
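
As a small illustration of referencing these variables from a script, the sketch below builds a one-line summary from three of them. The function name `describe_task` and the fallback values are hypothetical. The fallbacks are there only so the script can also be tried outside a Batch node, where the variables are not set.

```shell
# Hypothetical task script fragment: summarizes the Batch task it runs in.
# Falls back to placeholder values when the variables are not set,
# for example when the script is tried out on a local machine.
describe_task() (
  job="${AZ_BATCH_JOB_ID:-unknown-job}"
  task="${AZ_BATCH_TASK_ID:-unknown-task}"
  workdir="${AZ_BATCH_TASK_WORKING_DIR:-$PWD}"
  echo "job=$job task=$task workdir=$workdir"
)
```

On a node running mytask1 from the example above, this would report the job ID, the task ID, and the task working directory listed in the output.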

Step 9: Cleaning up Resources

  • You are charged for pools while the nodes are running, even if no jobs are scheduled. When you no longer need a pool, you can delete it with the az batch pool delete command. When you delete the pool, all task output on the nodes is deleted.
az batch pool delete --pool-id mypool
  • When you no longer need them, you can use the az group delete command to remove the resource group, the Batch account, the pools, and all related resources. You can delete the resources as follows:
az group delete --name QuickstartBatch-rg

Conclusion

This blog talks about the salient aspects of Azure Batch jobs and the methods you can use to schedule and run Batch jobs on Microsoft Azure.

Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of desired destinations with a few clicks. Hevo Data, with its strong integration with 100+ sources (including 40+ free sources), allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin?

Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the pricing that will help you choose the right plan for your business needs.