In a world increasingly dominated by data, it is more important than ever that data professionals build avenues for connecting traditional data warehouses with today's modern machine learning and AI platforms.

Data pipelines are one such avenue. Building pipeline solutions that support both traditional and modern data stores has become integral to efficiently managing and controlling the data organizations and corporations produce every day.

Within a pipeline, activities handle the different phases and operations of the overall process, letting you carry out tasks without having to intervene at every step.

This article introduces Azure Data Factory and covers everything you need to know about building pipeline activities on the Microsoft platform. Read along to learn about the different types of Azure Data Factory Activities and the steps to set them up using the UI and JSON!

What are Data Pipelines/Activities?

A pipeline is a logical grouping of activities that together perform a task. It lets you manage the activities as a single unit: instead of deploying and scheduling each activity individually, you deploy and schedule the pipeline to carry out the task.

The activities in a pipeline can be described as the actions performed on your data to yield your desired result. For example, when a pipeline is created to perform an ETL job, multiple activities are used to extract data, transform the data, and subsequently load the transformed data into the data warehouse.

An activity uses input and output datasets. A dataset simply identifies your data within different data stores, such as tables, files, folders, and documents. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the activity's output. An activity can take zero or more input datasets and produce one or more output datasets. Azure Data Factory has three groupings of activities, which are described in the next section of this piece.
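
To make this concrete, below is a minimal sketch of a JSON dataset definition for a delimited text file in Azure Blob Storage. The dataset, linked service, container, and file names are hypothetical placeholders, and the exact type and properties depend on the data store and format you actually use.

{
    "name": "InputOrdersCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                "fileName": "orders.csv"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}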

Azure Data Factory Activities

As mentioned in the previous section, Azure Data Factory Activities can be broadly categorized into the following three groups:

Data Movement

Data Movement in Azure Data Factory refers to moving data from a source data store to a sink data store. Data Factory does this with the Copy activity, which can read data from any supported source store, whether on-premises or in the cloud, and write it to a sink. The Copy activity executes on an integration runtime, and different types of integration runtimes can be used for different copy scenarios.

When copying data from the source to the sink, the Copy activity reads the data from the source, performs serialization/deserialization, compression/decompression, column mapping, and other operations, and finally writes the data to the sink/destination data store.

Azure Data Factory supports a wide range of data stores and data formats, including Azure Blob Storage, Azure Cognitive Search index, Azure Cosmos DB (SQL API), Azure Data Explorer, Amazon RDS for SQL Server, Amazon Redshift, Google BigQuery, HBase, Hive, Cassandra, MongoDB, Amazon S3, FTP, and more.
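
For illustration, here is a minimal JSON sketch of a Copy activity that reads from a source dataset and writes to a sink dataset. The activity and dataset names are hypothetical, and the source and sink types shown (a delimited text source and an Azure SQL sink) must be replaced with types that match the stores you actually use.

{
    "name": "CopyOrdersToSql",
    "type": "Copy",
    "inputs": [
        { "referenceName": "InputOrdersCsv", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "OrdersSqlTable", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}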

Data Transformation

Data Transformation activities in Azure Data Factory help you turn raw data into useful predictions and insights at scale. Each transformation executes in a compute environment, can be added to a pipeline on its own or chained with other activities, and provides detailed monitoring information for each run.

Whichever method you think is best suited to your organization is supported, and each of these methods is discussed below.

  • Mapping Data Flows: These are visually designed data transformations in Azure Data Factory that let you build graphical transformation logic without writing any code. Once a data flow is complete, it can be executed as an activity within a pipeline.
  • Data Wrangling: Cloud-scale data wrangling in Azure Data Factory is enabled through Power Query. It lets you prepare data without writing code, integrates with Power Query Online, and makes Power Query M functions available.
  • HDInsight Hive Activity: This executes Hive queries in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster. It is used to process and analyze structured data.
  • HDInsight Pig Activity: This executes Pig queries in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
  • HDInsight MapReduce Activity: This executes MapReduce programs in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
  • HDInsight Streaming Activity: This executes Hadoop Streaming programs in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
  • HDInsight Spark Activity: This executes Spark programs in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
  • ML Studio (Classic) Activities: This service is used to create pipelines that utilize a published ML Studio (Classic) web service for predictive analysis. Note that support for Machine Learning Studio (Classic) ends on 31 August 2024, so it is recommended that you move these workloads to Azure Machine Learning by then.
  • Stored Procedure Activity: The SQL Server Stored Procedure activity is used to invoke a stored procedure in one of the supported data stores from a Data Factory pipeline. Supported data stores include Azure SQL Database, Azure Synapse Analytics, etc.
  • Data Lake Analytics U-SQL Activity: This is used to run a U-SQL script on an Azure Data Lake Analytics cluster.
  • Azure Synapse Notebook Activity: This is used in a pipeline to run a Synapse notebook in the Azure Synapse workspace.
  • Databricks Notebook Activity: This is used in a pipeline to run a Databricks notebook in your Azure Databricks workspace (a minimal JSON sketch is shown after this list).
  • Databricks Jar Activity: This is used in a pipeline to run a Spark Jar in the Azure Databricks cluster.
  • Databricks Python Activity: This is used in a pipeline to run a Python file in the Azure Databricks cluster. 
  • Custom Activity: You can create a custom activity with your own data processing logic and use it in a pipeline, which lets you transform data in ways not directly supported by Data Factory. For example, you can run custom .NET code or R scripts.
  • Compute Environments: You create a linked service for a compute environment and then use it to define a transformation activity. Azure Data Factory supports two types of compute environments: On-Demand, which is fully managed by the service, and Bring Your Own, where you register your own compute environment as a linked service.
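
As referenced in the Databricks Notebook item above, here is a minimal JSON sketch of one transformation activity, a Databricks Notebook activity. The activity name, notebook path, parameter, and linked service name are hypothetical placeholders; the linked service must point to your own Azure Databricks workspace.

{
    "name": "TransformOrdersNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/transform-orders",
        "baseParameters": {
            "inputPath": "raw/orders"
        }
    }
}
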
Simplify your Data Analysis with Hevo’s No-code Data Pipeline

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integrations to 150+ Data Sources (40+ free sources), we help you not only export data from sources & load it to destinations but also transform & enrich your data & make it analysis-ready.

Start for free now!

Data Control

Below is a list of the control activities supported in Azure Data Factory, along with a description of each one's basic function.

  • Append Variable Activity: This is used to add a value to an existing array variable.
  • Execute Pipeline Activity: This is used when you want a Data Factory pipeline to invoke another pipeline.
  • Filter Activity: This is used to apply a filter expression to an input array.
  • For Each Activity: This defines a repeating control flow in your pipeline. It executes specified activities in a loop by iterating over a collection (a minimal JSON sketch is shown after this list).
  • Get Metadata Activity: This is used to retrieve the metadata of any data in Azure Data Factory.
  • If Condition Activity: This evaluates an expression to either true or false and then executes one of two sets of activities depending on the result.
  • Lookup Activity: This is used to read or look up a record/table name/value from an external source.
  • Set Variable Activity: This is used to set the value of an existing variable.
  • Until Activity: This is used to execute a set of activities in a loop until the condition associated with the activity evaluates to true. A timeout value can also be specified for the Until activity.
  • Wait Activity: This is used when you want the pipeline to wait for a particular time before executing subsequent activities in the pipeline.
  • Web Activity: This can be used to call a custom REST endpoint from a pipeline.
  • Webhook Activity: This can be used to call an endpoint and pass a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
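
As referenced in the For Each item above, here is a minimal JSON sketch of a ForEach activity that iterates over a pipeline parameter and runs an inner activity for each item. The pipeline parameter, activity names, and the inner Wait activity are hypothetical placeholders chosen only to keep the example self-contained.

{
    "name": "ProcessEachFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@pipeline().parameters.fileNames",
            "type": "Expression"
        },
        "isSequential": true,
        "activities": [
            {
                "name": "WaitPerFile",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 5 }
            }
        ]
    }
}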

Creating Azure Data Factory Activities with UI

A pipeline can be created on Azure Data Factory through the User Interface (UI) by carrying out the following steps:

  1. Navigate to the Author tab in Data Factory Studio.
  2. Click the plus sign, then point to Pipeline in the menu that appears.
  3. From the submenu, select Pipeline again, and the pipeline editor will be displayed.
Azure Data Factory Activities: Factory Resources

The pipeline editor displays an Activities pane, where you select the activities to include in the pipeline; an editor canvas, where activities appear when added to the pipeline; a pipeline configuration pane showing Parameters, Variables, general Settings, and Output; and a Pipeline Properties pane that displays the pipeline name and an optional description, as shown in the image below.

Configurations

Creating Azure Data Factory Activities with JSON

Pipelines can also be defined on Azure Data Factory using JSON format. A sample JSON file defining a pipeline is shown below.

{
    "name": "PipelineName",
    "properties":
    {
        "description": "pipeline description",
        "activities":
        [
        ],
        "parameters": {
        },
        "concurrency": <your max pipeline concurrency>,
        "annotations": [
        ]
    }
}
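
As a concrete illustration, here is a minimal sketch of the same structure with a single Wait activity filled in. The pipeline and activity names are hypothetical, and the Wait activity simply pauses for 30 seconds before any downstream activities would run.

{
    "name": "WaitPipeline",
    "properties": {
        "description": "Pauses for 30 seconds",
        "activities": [
            {
                "name": "WaitThirtySeconds",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 30 }
            }
        ],
        "parameters": {},
        "annotations": []
    }
}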

Conclusion

This article has given you an overview of how pipeline activities are created and operated in Microsoft Azure Data Factory. It first explained what pipelines and activities are to give you a better understanding of the concept before narrowing the discussion down to Data Factory specifically.

Azure Data Factory is great for performing data integrations. However, at times, you need to transfer this data from multiple sources to your Data Warehouse for analysis. Building an in-house solution for this process could be an expensive and time-consuming task. Hevo Data, on the other hand, offers a No-code Data Pipeline that can automate your data transfer process, hence allowing you to focus on other aspects of your business like Analytics, Customer Management, etc. This platform allows you to transfer data from 150+ sources to Cloud-based Data Warehouses like Snowflake, Google BigQuery, Amazon Redshift, etc. It will provide you with a hassle-free experience and make your work life much easier.

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also check out our unbeatable pricing to choose the best plan for your organization.

Arsalan Mohammed
Research Analyst, Hevo Data

Arsalan is a research analyst at Hevo and a data science enthusiast with over two years of experience in the field. He completed his B.Tech in Computer Science with a specialization in Artificial Intelligence and enjoys sharing the knowledge he has acquired with data practitioners. His interest in data analysis and architecture has driven him to write nearly a hundred articles on various topics related to the data industry.