As data collection within organizations proliferates, developers are automating data movement through Data Ingestion techniques. However, implementing complex Data Ingestion pipelines can be tedious and time-consuming for developers.

To overcome such issues, Microsoft developed Azure Data Factory, which helps organizations build cost-effective Data Ingestion, ELT (Extract, Load, Transform), and ETL (Extract, Transform, Load) processes with a simple Graphical User Interface.

You can also monitor your data pipelines and schedule them using Azure Data Factory's scheduling features. Azure Data Factory can ingest data from both cloud and on-premises services.

This article discusses Data Ingestion with Azure Data Factory in detail. It also briefly explains Data Ingestion and Azure Data Factory.

Prerequisites

A basic understanding of data integration

What is Data Ingestion?

Data Ingestion moves data from one or more sources to a destination for further processing and analysis. Usually, Data Ingestion is leveraged to bring data from disparate sources like SaaS applications into a Data Lake, Data Warehouse, or other storage to consolidate it.

Data Ingestion has the following benefits: 

  • Data is Easily Available: Organizations use Data Ingestion processes to collect data from different sources and move it to a unified environment so that data can be easily accessed and further analyzed. 
  • Data is Simplified: Due to the advancement in Data Ingestion techniques like ETL (Extract, Transform, Load), data can be quickly transformed into various predefined formats and then sent to the centralized storage.
  • Saves Time: Before Data Ingestion tools, developers or engineers manually performed the Data Ingestion process, which was time-consuming. However, now data engineers can perform Data Ingestion by using no-code solutions to expedite the process.

What is Azure Data Factory?

Developed in 2015, Azure Data Factory is a managed cloud service built for extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects. Azure Data Factory allows you to visually integrate data sources with more than 90 built-in, maintenance-free connectors. It enables users to prepare data easily, construct ETL and ELT processes, and monitor pipelines with code-free services.

Organizations often store unorganized or raw data in relational and non-relational forms. However, this data often lacks the context needed for meaningful insights. Azure Data Factory provides a platform where all of this data can be transformed and stored in a centralized location that organizations can then use to gain meaningful insights.

Key Features of Azure Data Factory

  • Scalability: Azure Data Factory was developed to handle large amounts of data. It consists of in-built features like parallelism and time slicing that can transfer gigabytes of data in the cloud within a few hours.
  • Built-in Connectors: Azure Data Factory consists of more than 90 built-in connectors to access data from different sources like Amazon Redshift, Google BigQuery, Oracle Exadata, Teradata, Salesforce, Marketo, ServiceNow, and more.
  • Orchestrate, Monitor, and Manage Pipeline Performance: Managing data pipelines becomes difficult and time-consuming as the data landscape changes. With Azure Data Factory, you can monitor your data pipelines by setting up alerts. These alerts appear in Azure alert groups and notify users of data pipeline problems.

Understanding Data Ingestion Azure Data Factory

Basic concepts in Data Ingestion with Azure Data Factory

  • Connectors or Linked Services: A linked service contains the configuration settings for a specific data source, including server name, database name, files, folders, credentials, and more. Each data flow can use one or more linked services, depending on the nature of the job.
  • Pipelines: A pipeline is a logical grouping of activities. A Data Factory consists of one or more pipelines, and each pipeline contains one or more activities.
  • Triggers: Triggers hold the scheduling configuration for pipelines, with settings like start and end dates, execution frequency, and more. Triggers are not an essential part of Data Ingestion with Azure Data Factory, but they are needed when your pipelines have to run on a schedule.
  • Activities: Activities are actions such as data movement, transformations, or control-flow operations. Activity configurations contain settings like a database query, parameters, script location, stored procedure name, and more.

Data Ingestion Azure Data Factory: Azure Data Factory with Azure Functions

Microsoft Azure Functions is a cloud-based service that lets you run event-triggered code in a scalable way without managing the complete application infrastructure. In this method, the data is processed with custom Python code wrapped into an Azure Function.

The Azure function is called with the Azure Function activity in Azure Data Factory. This Azure Function activity is excellent for lightweight data transformations. It enables you to run your Azure functions in an Azure Data Factory or Synapse pipeline. Synapse is an analytics service in Azure that allows you to perform data integration, enterprise data warehousing, and big data analytics.

Azure Data Factory with Azure Functions
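
To give a concrete picture of the "custom Python code wrapped into an Azure Function," here is a minimal, hypothetical sketch of an HTTP-triggered function that performs a lightweight transformation. The function name and payload shape are illustrative assumptions, not part of the Azure Data Factory setup.

import json
import logging

import azure.functions as func  # Azure Functions Python worker library


def main(req: func.HttpRequest) -> func.HttpResponse:
    """Hypothetical lightweight transformation invoked by the ADF Azure Function activity."""
    logging.info("Azure Function activity called this function.")

    try:
        records = req.get_json()  # ADF can pass a JSON body to the function
    except ValueError:
        return func.HttpResponse("Expected a JSON body.", status_code=400)

    # Example transformation: keep only rows with a name and normalize the 'name' field.
    cleaned = [
        {**row, "name": row.get("name", "").strip().lower()}
        for row in records
        if row.get("name")
    ]

    return func.HttpResponse(
        json.dumps({"row_count": len(cleaned), "rows": cleaned}),
        mimetype="application/json",
    )

In the classic Python programming model, a main function like this lives in __init__.py next to a function.json binding definition.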

To run the Azure function, you need to create a linked service connection. Then, use the linked service with an activity that specifies the Azure function you want to execute.

Creating an Azure Function Activity with UI

Follow the below steps for using an Azure function activity in a pipeline.

  • Step 1: Initially, you need to create your first Azure Data Factory using the Azure portal.
  • Step 2: Expand the Azure Function section of the pipeline Activities pane and drag an Azure Function activity onto the pipeline canvas.
  • Step 3: Select the new Azure Function activity on the canvas if it is not already selected, and go to its Settings tab to edit its details, as shown below.
Azure Function activity with UI: steps 1, 2, and 3
  • Step 4: If you do not have an Azure Function linked service defined, select New to create it. Choose your existing Azure Function App URL and provide a function key in the Azure Function linked service pane (a sketch of how such a call uses the function key follows these steps).
Azure Function activity with UI: step 4
  • Step 5: After selecting the Azure Function linked service, complete the configuration by providing the function name and other details.
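
For context on what the activity does behind the scenes: an HTTP-triggered Azure Function is invoked over HTTPS with the function key passed as a header or query parameter. The sketch below is an illustrative, hypothetical call from plain Python (the URL and payload are assumptions); in practice, the Azure Function activity handles this for you once the linked service is configured.

import requests  # third-party HTTP client, used here only for illustration

# Hypothetical values; in ADF these come from the linked service and activity settings.
function_app_url = "https://<your-function-app>.azurewebsites.net/api/clean_records"
function_key = "<function key from the Azure portal>"

payload = [{"name": "  Alice "}, {"name": ""}, {"name": "Bob"}]

response = requests.post(
    function_app_url,
    headers={"x-functions-key": function_key},  # the key can also go in a ?code= query parameter
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())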

You can read more about the Azure Function linked service and the Azure Function activity.

Data Ingestion Azure Data Factory: Azure Data Factory with Custom Component Activity

In this method, the data is processed with custom Python code wrapped into an executable, which is then invoked with an Azure Data Factory Custom Component activity. This approach is best suited for large amounts of data.

When the built-in activities cannot transform or process data the way you need, you can create a Custom activity with your own data movement or transformation logic and use it in a pipeline. The Custom activity runs your code on an Azure Batch pool of virtual machines.

Azure Custom Component activity
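
As an illustration of the "custom Python code wrapped into an executable," the following is a minimal, hypothetical script that a Custom activity could run on the Azure Batch pool. The file name, arguments, and column logic are assumptions; ADF only needs the command that launches the script in the activity's Settings tab.

# transform.py - hypothetical script executed by the ADF Custom activity on Azure Batch
import argparse
import csv


def transform(input_path: str, output_path: str) -> None:
    """Read a CSV file, drop rows with an empty 'id' column, and write the result."""
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row.get("id"):  # keep only rows that have an id
                writer.writerow(row)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Lightweight batch transformation")
    parser.add_argument("--input", required=True, help="path to the input CSV")
    parser.add_argument("--output", required=True, help="path for the cleaned CSV")
    args = parser.parse_args()
    transform(args.input, args.output)

With a script like this, the command configured in Step 5 below could look something like python transform.py --input input.csv --output cleaned.csv.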

Adding Custom Activities to the Pipeline with UI

Follow the below steps to add custom activity to a pipeline.

  • Step 1: Initially, you need to create your first Azure Data Factory using the Azure portal.
  • Step 2: Search for the custom activity in the pipeline Activities pane and drag a custom activity to the pipeline canvas.
  • Step 3: Select the new Custom activity on the canvas if it is not already selected.
  • Step 4: Click on the Azure Batch tab to select or create a new Azure Batch linked service that will execute the custom activity.
Add custom activities to the pipeline: steps 1 to 4
  • Step 5: Click on the Settings tab and specify the command to be executed on Azure Batch.
Add custom activities to the pipeline: step 5

You can read more about Azure Batch linked service.

Data Ingestion Azure Data Factory: Azure Data Factory with Azure Databricks Notebook

In this method, the data is transformed by a Python notebook running in an Azure Databricks cluster. This method leverages the full power of Azure Databricks and is used for distributed data processing at scale.

Azure Data Factory with Azure Databricks notebook
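
To illustrate the kind of transformation such a notebook might contain, here is a minimal PySpark sketch. The storage paths, column names, and widget names are illustrative assumptions; in a real pipeline, the ADF Databricks Notebook activity can pass them in as base parameters.

# Hypothetical cell in an Azure Databricks notebook run by the ADF Notebook activity.
# 'spark' and 'dbutils' are provided automatically by the Databricks runtime.
from pyspark.sql import functions as F

# Read the paths passed from Azure Data Factory as notebook parameters (assumed names).
input_path = dbutils.widgets.get("input_path")    # e.g. abfss://container@account.dfs.core.windows.net/raw/
output_path = dbutils.widgets.get("output_path")  # e.g. .../prepared/

raw_df = spark.read.option("header", "true").csv(input_path)

# Example distributed transformation: trim names, drop empty rows, and de-duplicate.
prepared_df = (
    raw_df
    .withColumn("name", F.trim(F.col("name")))
    .filter(F.col("name") != "")
    .dropDuplicates(["id"])
)

prepared_df.write.mode("overwrite").parquet(output_path)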

Creating an Azure Databricks Linked Service

Initially, you need to create your first Azure Data Factory using the Azure portal.

Follow the below steps to create an Azure Databricks linked service.

  • Step 1: Go to the Manage tab in the left panel on the home page.
Azure Databricks linked service: step 1
  • Step 2: Click on the Linked services under the Connections tab. Then click on New. 
Azure Databricks linked service: step 2
  • Step 3: In the New linked service pane, click on Compute, then on Azure Databricks, and click Continue.
Azure Databricks linked service: step 3
  • Step 4: Follow the below steps in the New Linked service.
    • Step 4.1: Enter AzureDatabricks_LinkedService for the name.
    • Step 4.2: Select the Databricks workspace that you will use to run your notebook.
    • Step 4.3: Select New job cluster under Select cluster.
    • Step 4.4: The Databricks workspace URL will be auto-populated.
    • Step 4.5: Generate an access token from the Azure Databricks workspace.
    • Step 4.6: Select the Cluster version you want to use.
    • Step 4.7: Select Standard_D3_v2 under the General Purpose (HDD) category for the Cluster node type.
    • Step 4.8: Enter 2 for Workers and then click Create. Workers are simply the number of worker nodes in the cluster.
Azure Databricks linked service: step 4

You can follow the further instructions for Creating a pipeline, Triggering a pipeline, and Monitoring a pipeline.

What Makes Hevo’s Data Ingestion Process Unique

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integrations with 150+ Data Sources (40+ free sources), we help you not only export data from sources and load it into destinations but also transform and enrich your data and make it analysis-ready.

Start for free now!

Data Ingestion Azure Data Factory: Consuming data in Azure Machine Learning

Azure Data Factory pipelines store your prepared data in cloud storage such as Azure Blob Storage or Azure Data Lake.

You can consume the prepared data for machine learning models by invoking an Azure Machine Learning pipeline from your Data Factory pipeline. Alternatively, you can create an Azure Machine Learning datastore and dataset that read the prepared data directly.

  • Using the Data Factory Pipeline to Call an Azure Machine Learning Pipeline

This method is suitable for Machine Learning Operations (MLOps) workflows.

Each time the Data Factory pipeline runs:

  • The prepared data is saved to a location in your cloud storage.
  • Data Factory calls an Azure Machine Learning pipeline and passes it the data location. When the machine learning pipeline is called, the data location and run ID are sent as parameters.
  • The ML pipeline can then create an Azure Machine Learning datastore and dataset from the data location (a sketch of this step follows the figure below). You can read more about the Execute Machine Learning Pipeline activity.
Azure Machine Learning pipeline called from Azure Data Factory
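
As a hedged sketch of the step described above, the script below shows how an Azure Machine Learning pipeline step could receive the data location and run ID as arguments and register a dataset from them. The argument names, datastore name, and dataset name are assumptions for illustration.

# Hypothetical script run as a step in the Azure ML pipeline that ADF executes.
import argparse

from azureml.core import Dataset, Datastore, Run

parser = argparse.ArgumentParser()
parser.add_argument("--data_path", required=True, help="relative path to the prepared data, passed by ADF")
parser.add_argument("--adf_run_id", required=True, help="Data Factory run ID, passed by ADF")
args = parser.parse_args()

run = Run.get_context()
ws = run.experiment.workspace

# Assumes a datastore named 'prepared_data_store' was registered earlier (e.g., the ADLS Gen2 datastore).
datastore = Datastore.get(ws, "prepared_data_store")

# Create and register a tabular dataset pointing at the location ADF just wrote.
dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, args.data_path)])
dataset = dataset.register(
    workspace=ws,
    name=f"prepared-data-{args.adf_run_id}",
    create_new_version=True,
)

print(f"Registered dataset for ADF run {args.adf_run_id}: {dataset.name}")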

Once the data is available as a datastore or dataset, it can be used to train a machine learning model. The training process might be part of the same ML pipeline that is called from Azure Data Factory, or it might be a separate process, such as experimentation in a Jupyter notebook.

  • Read Data Directly from the Storage

If you do not want to create an ML pipeline, you can access the data directly from the storage account where the prepared data is saved, using an Azure Machine Learning datastore or dataset.

The Python code below demonstrates how to create a datastore that connects to Azure Data Lake Storage Gen2. You can read more about datastores.

import os
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()
adlsgen2_datastore_name = '<ADLS gen2 storage account alias>'  # set ADLS Gen2 storage account alias in AML

subscription_id = os.getenv("ADL_SUBSCRIPTION", "<ADLS account subscription ID>")  # subscription id of ADLS account
resource_group = os.getenv("ADL_RESOURCE_GROUP", "<ADLS account resource group>")  # resource group of ADLS account

account_name = os.getenv("ADLSGEN2_ACCOUNTNAME", "<ADLS account name>")  # ADLS Gen2 account name
tenant_id = os.getenv("ADLSGEN2_TENANT", "<tenant id of service principal>")  # tenant id of service principal
client_id = os.getenv("ADLSGEN2_CLIENTID", "<client id of service principal>")  # client id of service principal
client_secret = os.getenv("ADLSGEN2_CLIENT_SECRET", "<secret of service principal>")  # the secret of service principal

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name,  # ADLS Gen2 account name
    filesystem='<filesystem name>',  # ADLS Gen2 filesystem
    tenant_id=tenant_id,  # tenant id of service principal
    client_id=client_id,  # client id of service principal
    client_secret=client_secret)  # the secret of service principal
Create a dataset to reference the file you want to use in your machine learning task.

The below code creates a Tabular dataset from a CSV file called prepared-data.csv. You can read more about dataset types and accepted file formats.

from azureml.core import Workspace, Datastore, Dataset
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig

# retrieve the prepared data via the AML datastore registered above
datastore = Datastore.get(ws, adlsgen2_datastore_name)
datastore_path = [(datastore, '/data/prepared-data.csv')]

prepared_dataset = Dataset.Tabular.from_delimited_files(path=datastore_path)

You can now use prepared_dataset to reference your prepared data. Read more about training models with Azure Machine Learning.
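
Since Experiment and AutoMLConfig are already imported in the snippet above, a hedged sketch of how prepared_dataset could feed a training run looks like this. The experiment name, target column, and compute target are assumptions; adjust them to your workspace.

# Hypothetical automated ML run that trains on the prepared dataset (continues the code above).
automl_config = AutoMLConfig(
    task="classification",                  # assumed task type
    training_data=prepared_dataset,         # the tabular dataset created above
    label_column_name="<target column>",    # assumed label column in prepared-data.csv
    primary_metric="AUC_weighted",
    compute_target="<your AML compute cluster>",  # assumed compute target name
    experiment_timeout_hours=1,
)

experiment = Experiment(ws, "prepared-data-automl")  # assumed experiment name
run = experiment.submit(automl_config, show_output=True)
run.wait_for_completion()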

Conclusion

In this article, you learned about Data Ingestion with Azure Data Factory. Data Ingestion with Azure Data Factory can follow three methods: Azure Functions, the Custom Component activity, and Azure Databricks notebooks.

Azure Data Factory allows organizations to build and manage data pipelines through a Graphical User Interface (GUI), which ultimately increases productivity in creating, executing, and triggering data pipelines.

Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of desired destinations with a few clicks.

Hevo Data, with its strong integration with 150+ sources (including 40+ free sources), allows you to not only export data from your desired data sources and load it to the destination of your choice, but also transform and enrich your data to make it analysis-ready, so that you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also, check out our unbeatable pricing to choose the best plan for your organization.

Manjiri Gaikwad
Technical Content Writer, Hevo Data

Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.