Connecting traditional data warehouses with modern AI and machine learning platforms is essential in today’s data-driven world. Data pipelines are crucial in managing and controlling the vast amounts of data organizations generate daily. With Azure Data Factory Activities, you can automate various phases of the pipeline process, streamlining tasks and reducing manual effort.
In this article, we’ll explore everything you need to know about Azure Data Factory Activities and how to set them up using both the UI and JSON.
What are Data Pipelines/Activities?
A pipeline is a logical grouping of activities that together perform a task. It lets you manage the activities as a unit instead of individually: rather than deploying and scheduling each activity separately, you deploy the pipeline to carry out your scheduled task.
The activities in a pipeline can be described as the actions performed on your data to yield your desired result. For example, when a pipeline is created to perform an ETL job, multiple activities are used to extract data, transform the data, and subsequently load the transformed data into the data warehouse.
An activity uses input and output datasets. A dataset simply identifies your data within different data stores, such as tables, files, folders, and documents. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the activity's output. An activity can take zero or more input datasets and produce one or more output datasets. Azure Data Factory has three groupings of activities, which are described and explained further in the next section of this piece.
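To make this concrete, here is a minimal, illustrative sketch of a dataset definition in Data Factory JSON. It assumes a hypothetical Azure Blob Storage linked service named AzureBlobStorageLS and a hypothetical container and folder; the exact type properties depend on your data store and file format.
{
    "name": "InputBlobDataset",
    "properties": {
        "description": "Illustrative example only; replace the linked service, container, and folder with your own",
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLS",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "folderPath": "raw"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
A dataset like this can then be referenced as the input or output of an activity in a pipeline.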
Azure Data Factory Activities
As has been alluded to in the previous section, the Azure Data Factory Activities can be broadly categorized into the following three groups:
Data Movement
- Data movement in Azure Data Factory refers to pulling data from a source and landing it in a data store. Data Factory uses the Copy activity to move data from a source data store to a sink data store, whether those stores are located on-premises or in the cloud. The Copy activity is executed on an integration runtime, and different types of integration runtimes can be used for different copy scenarios.
- When copying data from the source to the sink, the Copy activity reads the data from the source, performs serialization/deserialization, compression/decompression, column mapping, and other operations, and finally writes the data to the sink/destination data store.
Azure Data Factory supports a wide range of data stores and data formats, including Azure Blob Storage, Azure Cognitive Search index, Azure Cosmos DB (SQL API), Azure Data Explorer, Amazon RDS for SQL Server, Amazon Redshift, Google BigQuery, HBase, Hive, Cassandra, MongoDB, Amazon S3, FTP, and more.
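As an illustration, below is a minimal sketch of a Copy activity definition in pipeline JSON. The dataset names InputBlobDataset and OutputSqlDataset are hypothetical placeholders, and the source and sink types must match the stores you actually use.
{
    "name": "CopyBlobToSql",
    "type": "Copy",
    "description": "Illustrative example only; dataset names and source/sink types are placeholders",
    "inputs": [
        { "referenceName": "InputBlobDataset", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "OutputSqlDataset", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}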
Hevo is the only real-time ELT no-code data pipeline platform that cost-effectively automates data pipelines. With support for sources like Azure Blob Storage and MySQL on Microsoft Azure and destinations like Azure Synapse Analytics, Hevo helps you not only export data from sources and load it into destinations, but also transform and enrich your data to make it analysis-ready.
Hevo offers industry-leading features such as:
- Data Transformation: Hevo provides a simple interface for perfecting, modifying, and enriching the data you want to transfer using a drag-and-drop feature.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Live Support: The Hevo team is available 24/7 to provide exceptional customer support through chat, email, and support calls.
Try Hevo today and experience seamless data integration.
Data Transformation
Data transformation activities in Azure Data Factory let you process raw data at scale to derive useful predictions and insights. Transformation activities run in a computing environment and can be added to a pipeline individually or chained with other activities, and Data Factory provides detailed information on each transformation activity.
Azure Data Factory supports several transformation methods, so you can use whichever best suits your organization; each of these methods is discussed below.
- Mapping Data Flows: These are visually designed data transformations in Azure Data Factory that let you build graphical data transformation logic without writing any code. Once a data flow is complete, it can be run as an activity within a pipeline.
- Data Wrangling: Cloud-scale data wrangling in Azure Data Factory is enabled with the aid of Power Query. It allows you to prepare data without writing code, integrates with Power Query Online, and makes Power Query M functions available.
- HDInsight Hive Activity: This is used to execute Hive queries in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster. It is used to process and analyze structured data.
- HDInsight Pig Activity: This is used to execute Pig queries in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
- HDInsight MapReduce Activity: This is used to execute MapReduce programs in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
- HDInsight Streaming Activity: This is used to execute Hadoop Streaming programs in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
- HDInsight Spark Activity: This is used to execute Spark programs in a pipeline on your own or on-demand Windows/Linux-based HDInsight cluster.
- ML Studio (Classic) Activities: These are used to create pipelines that utilize a published ML Studio (Classic) web service for predictive analysis. Note that support for Machine Learning Studio (Classic) will end on 31 August 2024, and it is recommended that you move these workloads to Azure Machine Learning by then.
- Stored Procedure Activity: This is used to invoke a stored procedure in one of the supported data stores from a Data Factory pipeline, such as Azure SQL Database, Azure Synapse Analytics, or a SQL Server database.
- Data Lake Analytics U-SQL Activity: This is used to run a U-SQL script on an Azure Data Lake Analytics cluster.
- Azure Synapse Notebook Activity: This is used in a pipeline to run a Synapse notebook in the Azure Synapse workspace.
- Databricks Notebook Activity: This is used in a pipeline to run a Databricks notebook in the Azure Databricks workspace (a JSON sketch of this activity appears after this list).
- Databricks Jar Activity: This is used in a pipeline to run a Spark Jar in the Azure Databricks cluster.
- Databricks Python Activity: This is used in a pipeline to run a Python file in the Azure Databricks cluster.
- Custom Activity: You can create a custom activity with your own data processing logic to meet needs that Data Factory does not support directly, letting you transform your data in ways not available out of the box. For example, you can configure a custom activity to run custom .NET code or R scripts.
- Compute Environments: Linked services can be created for the compute environment and then used to define a transformation activity. There are two types of compute environments supported by Azure Data Factory namely On-Demand and Bring Your Own. On-Demand is fully managed by the service while Bring Your Own involves you registering your computing environment as a linked service.
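For example, here is a minimal, hedged sketch of the Databricks Notebook activity mentioned above, expressed in pipeline JSON. The linked service name AzureDatabricksLS, the notebook path, and the parameter are hypothetical and would be replaced with your own values.
{
    "name": "RunTransformationNotebook",
    "type": "DatabricksNotebook",
    "description": "Illustrative example only; linked service, notebook path, and parameters are placeholders",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/transform-sales-data",
        "baseParameters": {
            "inputPath": "raw/sales"
        }
    }
}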
Data Control
Below is a list of the control activities supported in Azure Data Factory, along with a description of each one's basic function.
- Append Variable Activity: This is used to add a value to an existing array variable.
- Execute Pipeline Activity: This is used when you want a Data Factory pipeline to invoke another pipeline.
- Filter Activity: This is used to apply a filter expression to an input array.
- For Each Activity: This defines a repeating control flow in your pipeline. It executes specified activities in a loop by iterating over a collection.
- Get Metadata Activity: This is used in the retrieving of the metadata of any data found in the Data Factory.
- If Condition Activity: This is used to evaluate an expression to true or false and then run one of two sets of activities accordingly (a sketch combining this activity with Lookup and Wait appears after this list).
- Lookup Activity: This is used to read or look up a record/table name/value from an external source.
- Set Variable Activity: This is used to set the value of an existing variable.
- Until Activity: This is used to execute a set of activities in a loop until the condition associated with the activity evaluates to true; a timeout value can also be specified for the Until activity.
- Wait Activity: This is used when you want the pipeline to wait for a particular time before executing subsequent activities in the pipeline.
- Web Activity: This can be used to call a custom REST endpoint from a pipeline.
- Webhook Activity: This can be used to call an endpoint and pass a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity in the pipeline.
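To illustrate how control activities fit together, here is a hedged sketch of an If Condition activity in pipeline JSON. It assumes a preceding Lookup activity named LookupRowCount whose first row contains a hypothetical cnt column, and it runs a Wait activity on either branch.
{
    "name": "CheckRowCount",
    "type": "IfCondition",
    "description": "Illustrative example only; the referenced Lookup activity and column name are placeholders",
    "typeProperties": {
        "expression": {
            "value": "@greater(int(activity('LookupRowCount').output.firstRow.cnt), 0)",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "name": "WaitBeforeLoad",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 30 }
            }
        ],
        "ifFalseActivities": [
            {
                "name": "ShortWait",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 1 }
            }
        ]
    }
}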
Creating Azure Data Factory Activities with UI
A pipeline can be created on Azure Data Factory through the User Interface (UI) by carrying out the following steps:
- Navigate to the Author tab in Azure Data Factory Studio.
- Click the plus sign, then hover over Pipeline in the menu that appears.
- From the submenu, select Pipeline, and the pipeline editor will be displayed.
The pipeline editor displays an Activities pane where you can select the activities to include in the pipeline, an editor canvas where activities appear when added to the pipeline, a pipeline configuration pane that shows Parameters, Variables, general Settings, and Output, and a pipeline Properties pane that displays the pipeline name and an optional description, as shown in the image below.
Creating Azure Data Factory Activities with JSON
Pipelines can also be defined on Azure Data Factory using JSON format. A sample JSON file defining a pipeline is shown below.
{
    "name": "PipelineName",
    "properties":
    {
        "description": "pipeline description",
        "activities":
        [
        ],
        "parameters": {
        },
        "concurrency": <your max pipeline concurrency>,
        "annotations": [
        ]
    }
}
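To show how the skeleton above is filled in, here is a minimal, illustrative pipeline that contains a single Wait activity; the pipeline and activity names and the concurrency value are placeholders only.
{
    "name": "WaitDemoPipeline",
    "properties":
    {
        "description": "Illustrative pipeline with a single Wait activity",
        "activities":
        [
            {
                "name": "WaitTenSeconds",
                "type": "Wait",
                "typeProperties": {
                    "waitTimeInSeconds": 10
                }
            }
        ],
        "parameters": {
        },
        "concurrency": 1,
        "annotations": [
        ]
    }
}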
Conclusion
The article has given you an overview of how pipeline activities are created and operated on Microsoft Azure Data Factory. It first explained the meaning of a pipeline/activity to give you a better understanding of the concept before narrowing it down to Data Factory specifically.
Azure Data Factory is great for performing data integrations. However, at times, you need to transfer this data from multiple sources to your Data Warehouse for analysis. Building an in-house solution for this process could be an expensive and time-consuming task.
Hevo Data, on the other hand, offers a no-code data pipeline that can automate your data transfer process, allowing you to focus on other aspects of your business, such as analytics, customer management, etc. Sign up for Hevo’s 14-day free trial and experience seamless data migration.
FAQs
1. What does an Azure Data Factory do?
Azure Data Factory orchestrates and automates data movement, transformation, and integration across different data sources.
2. How many types of activities are there in ADF?
Azure Data Factory (ADF) has three types of activities: data movement activities, data transformation activities, and control activities.
3. What is the structure of Azure Data Factory?
Azure Data Factory’s structure includes:
- Pipelines: Group activities for data processes.
- Activities: Tasks like data movement or transformation.
- Datasets: Define data structures.
- Linked Services: Connections to data sources.
- Triggers: Start pipelines.
- Integration Runtimes: Compute infrastructure for execution.
Arsalan is a research analyst at Hevo and a data science enthusiast with over two years of experience in the field. He completed his B.Tech in Computer Science with a specialization in Artificial Intelligence and enjoys sharing the knowledge he has acquired with data practitioners. His interest in data analysis and architecture has driven him to write nearly a hundred articles on various topics related to the data industry.