As organizations generate more data every day, data professionals need reliable ways to connect traditional data warehouses with modern machine learning and AI platforms.
Data pipelines are one such way. Building pipelines that support both traditional and modern data stores has become integral to managing and controlling the data that organizations produce daily. Within a pipeline, activities handle the individual phases and operations of the overall process, so tasks run without you having to intervene at every step.
This article introduces Azure Data Factory and covers what you need to know about building pipeline activities on the Microsoft platform. Read along to learn how Azure Data Factory Activities manipulate data and how to set them up using the UI and JSON!
What are Data Pipelines/Activities?
A pipeline is a logical grouping of activities that together perform a task. It lets you manage the activities as a unit: instead of deploying and scheduling each activity individually, you deploy and schedule the pipeline.
The activities in a pipeline can be described as the actions performed on your data to yield your desired result. For example, when a pipeline is created to perform an ETL job, multiple activities are used to extract data, transform the data, and subsequently load the transformed data into the data warehouse.
An activity uses input and output datasets. A dataset simply identifies your data within a data store, such as a table, file, folder, or document. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output of the activity. An activity can take zero or more input datasets and produce one or more output datasets. Azure Data Factory has three groupings of activities, which are described in the next section.
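To make the idea of a dataset more concrete, here is a minimal sketch of a delimited-text dataset that points at a file in Azure Blob storage. The names `InputBlobDataset` and `AzureBlobStorageLinkedService`, the `input` container, and the `sales.csv` file are all hypothetical. The property names follow the Data Factory JSON format as we understand it, so check them against the current documentation before use.

```json
{
    "name": "InputBlobDataset",
    "properties": {
        "description": "Illustrative sketch only; every name here is hypothetical",
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "sales.csv"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```

An activity refers to a dataset like this by name, while the linked service it references holds the connection details for the store.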
Hevo Data, an Automated No-code Data Pipeline can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 100+ data sources straight into your Data Warehouse or any Databases. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!
Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!
Azure Data Factory Activities
As mentioned in the previous section, Azure Data Factory Activities can be broadly categorized into three groups: data movement, data transformation, and control.
Azure Data Factory Activities: Data Movement
Data Movement in Azure Data Factory means copying data from a source data store to a sink data store. Data Factory uses the Copy activity for this, and it can write data from any supported source store to any supported sink, whether located on-premises or in the cloud. The Copy activity is executed on an integration runtime, and different types of integration runtime can be used for different copy scenarios.
When copying data from the source to the sink, the Copy activity reads the data from the source, performs serialization/deserialization, compression/decompression, column mapping, and other operations on it, and finally writes the data to the sink data store.
Azure Data Factory supports many data stores and formats, including Azure Blob storage, Azure Cognitive Search index, Azure Cosmos DB (SQL API), Azure Data Explorer, Amazon RDS for SQL Server, Amazon Redshift, Google BigQuery, HBase, Hive, Cassandra, MongoDB, Amazon S3, and FTP. For a comprehensive list of supported data stores and formats, or a general overview of the Copy activity, see the Azure Data Factory documentation.
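A Copy activity along the lines described in this section might look like the sketch below: it reads a delimited-text file from Blob storage, maps two columns, and writes to an Azure SQL table. The activity, dataset, and column names (`CopyBlobToSql`, `InputBlobDataset`, `OutputSqlDataset`, `order_id`, and so on) are assumptions made for illustration, and the source and sink types shown are just one possible pairing.

```json
{
    "name": "CopyBlobToSql",
    "description": "Illustrative sketch only; dataset and column names are hypothetical",
    "type": "Copy",
    "inputs": [
        { "referenceName": "InputBlobDataset", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "OutputSqlDataset", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
                "type": "AzureBlobStorageReadSettings",
                "recursive": true
            }
        },
        "sink": {
            "type": "AzureSqlSink"
        },
        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                { "source": { "name": "order_id" }, "sink": { "name": "OrderId" } },
                { "source": { "name": "amount" }, "sink": { "name": "Amount" } }
            ]
        }
    }
}
```

The `inputs` and `outputs` arrays name the source and sink datasets, and the `translator` block carries the column mapping that the Copy activity applies between reading and writing.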
Azure Data Factory Activities: Data Transformation
Data transformation activities in Azure Data Factory help you turn raw data into predictions and insights at scale. Each transformation activity executes in a computing environment, and it can be added to a pipeline individually or chained with another activity.
You can transform data natively in Azure Data Factory with data flows (mapping data flows and data wrangling), where you do not have to write code, or through external activities such as the HDInsight Hive activity and the HDInsight Pig activity, where you hand-code transformations and manage the external computing environment yourself.
Azure Data Factory supports whichever method best suits your organization, and each is discussed below.
- Mapping Data Flows: These are visually designed data transformations in Azure Data Factory that let you build graphical data transformation logic without writing any code. Once a data flow is complete, it can be used as an activity within pipelines.
- Data Wrangling: Cloud-scale data wrangling in Azure Data Factory is enabled by Power Query. It lets you prepare data without writing code, integrates with Power Query Online, and makes Power Query M functions available.
- HDInsight Hive Activity: This executes Hive queries in a pipeline on your own or an on-demand Windows/Linux-based HDInsight cluster. It is used to process and analyze structured data.
- HDInsight Pig Activity: This executes Pig queries in a pipeline on your own or an on-demand Windows/Linux-based HDInsight cluster.
- HDInsight MapReduce Activity: This executes MapReduce programs in a pipeline on your own or an on-demand Windows/Linux-based HDInsight cluster.
- HDInsight Streaming Activity: This executes Hadoop Streaming programs in a pipeline on your own or an on-demand Windows/Linux-based HDInsight cluster.
- HDInsight Spark Activity: This executes Spark programs in a pipeline on your own or an on-demand Windows/Linux-based HDInsight cluster.
- ML Studio (Classic) Activities: These are used to create pipelines that call a published ML Studio (Classic) web service for predictive analysis. Support for Machine Learning Studio (Classic) is scheduled to end on 31 August 2024, and it is recommended that you move to Azure Machine Learning by that date.
- Stored Procedure Activity: SQL Server Stored Procedure activity is used to invoke a stored procedure in any of the available data stores in a Data Factory pipeline. The data store may include Azure SQL Database, Azure Synapse Analytics, etc.
- Data Lake Analytics U-SQL Activity: This is used to run a U-SQL script on an Azure Data Lake Analytics cluster.
- Azure Synapse Notebook Activity: This is used in a pipeline to run a Synapse notebook in the Azure Synapse workspace.
- Databricks Notebook Activity: This is used in a pipeline to run a Databricks notebook in the Azure Databricks workspace.
- Databricks Jar Activity: This is used in a pipeline to run a Spark Jar in the Azure Databricks cluster.
- Databricks Python Activity: This is used in a pipeline to run a Python file in the Azure Databricks cluster.
- Custom Activity: You can create a custom activity with your own data processing logic and use it in a pipeline. This lets you transform data in ways that Data Factory does not support directly. For example, a custom activity can run custom .NET code or R scripts.
- Compute Environments: You create a linked service for the compute environment and then use it when defining a transformation activity. Azure Data Factory supports two types of compute environment: On-Demand and Bring Your Own. An On-Demand environment is fully managed by the service, while Bring Your Own means you register your own computing environment as a linked service.
For more information on data transformation activities, see the Azure Data Factory documentation.
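As a rough sketch of how transformation activities can be chained, the pipeline below runs a Databricks notebook and then, only if the notebook succeeds, invokes a stored procedure. The pipeline, activity, and linked service names, the notebook path, and the stored procedure name are all hypothetical. It also assumes that linked services for Azure Databricks and Azure SQL Database already exist in the factory.

```json
{
    "name": "TransformSalesPipeline",
    "properties": {
        "description": "Illustrative sketch only; names, paths, and parameters are hypothetical",
        "activities": [
            {
                "name": "CleanSalesNotebook",
                "type": "DatabricksNotebook",
                "linkedServiceName": {
                    "referenceName": "AzureDatabricksLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "notebookPath": "/Shared/clean_sales",
                    "baseParameters": { "run_date": "2024-01-01" }
                }
            },
            {
                "name": "RefreshSalesSummary",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    {
                        "activity": "CleanSalesNotebook",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "linkedServiceName": {
                    "referenceName": "AzureSqlLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "storedProcedureName": "usp_refresh_sales_summary",
                    "storedProcedureParameters": {
                        "RunDate": { "value": "2024-01-01", "type": "String" }
                    }
                }
            }
        ]
    }
}
```

The `dependsOn` entry is what chains the two activities. Each activity's `linkedServiceName` points at the compute environment or data store it runs against.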
Azure Data Factory Activities: Data Control
Below is a list of the control activities supported in Azure Data Factory, with a brief description of each.
- Append Variable Activity: This is used to add a value to an existing array variable.
- Execute Pipeline Activity: This is used when you want a Data Factory pipeline to invoke another pipeline.
- Filter Activity: This is used to apply a filter expression to an input array.
- For Each Activity: This defines a repeating control flow in your pipeline. It executes specified activities in a loop by iterating over a collection.
- Get Metadata Activity: This is used to retrieve the metadata of any data in Azure Data Factory.
- If Condition Activity: This branches the pipeline based on a condition. It runs one set of activities when the condition evaluates to true and another set when it evaluates to false.
- Lookup Activity: This is used to read or look up a record/table name/value from an external source.
- Set Variable Activity: This is used to set the value of an existing variable.
- Until Activity: This executes a set of activities in a loop until the condition associated with the activity evaluates to true. A timeout value can also be specified for the Until Activity.
- Wait Activity: This is used when you want the pipeline to wait for a particular time before executing subsequent activities in the pipeline.
- Web Activity: This can be used to call a custom REST endpoint from a pipeline.
- Webhook Activity: This can be used to call an endpoint and pass it a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
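To show how a few of these control activities fit together, the sketch below loops over an array parameter with For Each, records each item with Append Variable, and then uses If Condition to run a Wait activity only when at least one item was processed. The pipeline, parameter, variable, and activity names are hypothetical, and the expressions are simple examples rather than a recommended pattern.

```json
{
    "name": "ControlFlowSketch",
    "properties": {
        "description": "Illustrative sketch only; names and values are hypothetical",
        "parameters": {
            "fileNames": { "type": "Array", "defaultValue": [ "a.csv", "b.csv" ] }
        },
        "variables": {
            "processedFiles": { "type": "Array" }
        },
        "activities": [
            {
                "name": "ForEachFile",
                "type": "ForEach",
                "typeProperties": {
                    "isSequential": true,
                    "items": {
                        "value": "@pipeline().parameters.fileNames",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "RecordFile",
                            "type": "AppendVariable",
                            "typeProperties": {
                                "variableName": "processedFiles",
                                "value": "@item()"
                            }
                        }
                    ]
                }
            },
            {
                "name": "CheckAnyProcessed",
                "type": "IfCondition",
                "dependsOn": [
                    {
                        "activity": "ForEachFile",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "typeProperties": {
                    "expression": {
                        "value": "@greater(length(variables('processedFiles')), 0)",
                        "type": "Expression"
                    },
                    "ifTrueActivities": [
                        {
                            "name": "PauseBeforeNextStep",
                            "type": "Wait",
                            "typeProperties": { "waitTimeInSeconds": 30 }
                        }
                    ]
                }
            }
        ]
    }
}
```

The loop runs sequentially here so that items are appended to the variable in a predictable order.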
Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo’s automated, No-code platform empowers you with everything you need to have for a smooth data replication experience.
Check out what makes Hevo amazing:
- Fully Managed: Hevo requires no management and maintenance as it is a fully automated platform.
- Data Transformation: Hevo provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Faster Insight Generation: Hevo offers near real-time data replication so you have access to real-time insight generation and faster decision making.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (with 40+ free sources) that can help you scale your data infrastructure as required.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Creating Azure Data Factory Activities with UI
A pipeline can be created on Azure Data Factory through the User Interface (UI) by carrying out the following steps:
- Navigate to the Author tab found in a Data Factory Studio.
- Click on the plus sign, then point to Pipeline from the menu that comes up.
- From the submenu, select pipeline again and a pipeline editor will be displayed.
The pipeline editor displays four areas: an Activities pane, where you select the activities to include in the pipeline; an editor canvas, where activities appear once added; a pipeline configuration pane, which shows Parameters, Variables, general Settings, and Output; and a pipeline Properties pane, which shows the pipeline name and an optional description.
Creating Azure Data Factory Activities with JSON
Pipelines can also be defined in Azure Data Factory in JSON format. An excerpt from a sample JSON pipeline definition is shown below.
"description": "pipeline description",
"concurrency": <your max pipeline concurrency>,
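The two properties above sit inside a larger pipeline definition. As a minimal sketch of the overall shape, a complete definition could look like the following. The pipeline name, the single Wait activity, and the concurrency value of 1 are placeholders chosen only so that the example is valid JSON.

```json
{
    "name": "PipelineName",
    "properties": {
        "description": "pipeline description",
        "activities": [
            {
                "name": "PlaceholderWait",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 1 }
            }
        ],
        "parameters": {},
        "concurrency": 1,
        "annotations": []
    }
}
```

The `activities` array holds the activity definitions, `parameters` declares values that can be passed in at run time, and `concurrency` caps how many runs of the pipeline can execute at once.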
This article has given you an overview of how pipeline activities are created and operated in Microsoft Azure Data Factory. It first explained what pipelines and activities are, to give you a better understanding of the concept, before narrowing down to Data Factory specifically.
Azure Data Factory is great for performing data integrations. However, at times, you need to transfer this data from multiple sources to your Data Warehouse for analysis. Building an in-house solution for this process could be an expensive and time-consuming task. Hevo Data, on the other hand, offers a No-code Data Pipeline that can automate your data transfer process, hence allowing you to focus on other aspects of your business like Analytics, Customer Management, etc. This platform allows you to transfer data from 100+ sources to Cloud-based Data Warehouses like Snowflake, Google BigQuery, Amazon Redshift, etc. It will provide you with a hassle-free experience and make your work life much easier.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your views on Azure Data Factory Activities in the comments section!