Azure Synapse Data Ingestion Simplified 101

Divyansh Sharma • Last Modified: March 28th, 2023


Bringing together the benefits of Enterprise Data Warehousing and Big Data Analytics on the cloud, Microsoft Azure Synapse Analytics is a limitless analytics service with unmatched time to insight. It features deep integration with Azure Machine Learning and Power BI, letting you query files effectively without requiring any other service. Azure Synapse merges all Azure data resources into one shared space, giving your Data Engineers, Analysts, and Data Scientists flexibility and a single shared view.

Your road to Data Processing and Big Data Analytics begins with the fundamental process of ingesting data. This is called Data Ingestion. Azure Synapse Data Ingestion helps you bring data from various sources to Azure Synapse through various means, so you can analyze and gain insights to propel your business forward. 

This guide covers one of the fundamental and most important aspects of getting started with Azure Synapse Analytics –Azure Synapse Data Ingestion. We’ll go through the Azure Synapse Data Ingestion process in-depth, including techniques for doing ingestion, tools, types, and so forth. Continue reading to learn how to get started with Data Ingestion in Azure Synapse.


What Is Azure Synapse?


Current trends in Modern Data Warehousing Solutions show a drastic shift toward cloud storage and processing. Businesses today use a number of cloud-hosted applications like CRM, HR, and ERP systems, which enable scalability, consolidation, and a single point of access for teams.

Azure Synapse is an industry-leading solution that runs Enterprise Data Warehousing workloads in the cloud. It is an end-to-end analytics platform that offers features like Data Ingestion, Data Warehousing, and Big Data Analytics in a single service. Because its architecture separates compute from storage and scales each independently, Azure Synapse can scale instantaneously in ways that traditional systems such as Teradata, Netezza, or Exadata cannot.

This degree of seamless integration and mixing of workloads in a single service is made possible by its tight interaction with Power BI and Azure Machine Learning. Using Azure Synapse, your data professionals get a unified experience for Azure Synapse Data Ingestion, Data Preparation, Data Management, Data Warehousing, Big Data, and AI tasks. Moreover, your users can effortlessly query data on their own terms, using serverless or dedicated resources, at scale.

Business Benefits of Using Azure Synapse

  • Unparalleled Performance: Azure Synapse Analytics offers the best relational database performance by using Massively Parallel Processing (MPP) and automatic in-memory caching. Most of your data operations like Azure Synapse Data Ingestion or Data Processing will be executed in record time.
  • Limitless Scalability: Azure Synapse can scale to meet your growing data demands by adding resources incrementally. It also provides low-cost storage options for staging and production data.
  • Cost Savings: When you use Azure Synapse, you incur low costs for implementing and maintaining it. Azure Synapse lets you only pay for what you use, without the need for complicated reconfiguration.
  • Managed Infrastructure: Forget about data center administration and operations that are essential to run your organization. By leveraging Azure Synapse Analytics, you can reallocate important resources to where they are required and also focus on competitive insights.
  • Best-In-Class Security: Azure Synapse secures your data with the most advanced security and privacy standards that are sure to give hackers and pesky bugs a hard time. It features column and row-level security with dynamic data masking to protect your data.
Simplify Azure Database ETL Using Hevo’s No-Code Data Pipeline 

Hevo Data, an Automated No Code Data Pipeline, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 150+ Data Sources straight into your Data Warehouse or any Databases. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!

Get Started with Hevo for Free

Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!

What Is Data Ingestion?


Data Ingestion is the process of importing data from one or more sources and transferring it to a common destination (target) for analysis. The destination or target can be a document store, database, Data Warehouse, Data Mart, etc. Integral to the Extract, Transform, and Load (ETL) process, a simple Data Ingestion pipeline may comprise one or more Data Transformations that filter or enrich data before writing it to the destination.

To gain in-depth information on Data Ingestion, we have a separate guide for you here – What is Data Ingestion? 10 Critical Aspects. You can also visit these helpful resources for more information about Data Transformation and ETL. 

Azure Synapse offers built-in functionality to import data into a table in the Azure Synapse Data Explorer pool. Once ingested, the data becomes available to users for querying.

Azure Synapse Data Ingestion Using Data Explorer Tool

The Azure Synapse Data Explorer data management service is responsible for ingesting data from your Data Sources. It does this using the following steps:

Step 1: Pull data from external sources in batches or in streaming mode and read requests from a pending Azure queue.

Step 2: Optimize batch data inflow for ingestion throughput.

Step 3: Validate initial data and perform transformation where needed.

Step 4: Use additional data processing like schema matching, structuring, indexing, encoding, and compressing if required.

Step 5: Store data according to the set retention policy.

Step 6: Ingest data and commit it into the engine where it can be queried by the user.

Before you set up Azure Synapse Data Ingestion, there are a few things to consider: the supported file formats, ingestion properties, and data ingest permissions.

Azure Synapse Data Ingestion: Supported File Formats

Listed below are the file formats supported by Azure Synapse Data Explorer for Data Ingestion.

Note: Make sure to properly format data before ingesting it using the Azure Synapse Data Ingestion service. You can check your file formats using online validators like CSVLint for .csv files and JSONLint for .json files.

| Format | Extension | Description |
| --- | --- | --- |
| ApacheAvro | .avro | A row-based file format that uses JSON to declare data structures. Supported codecs: null, deflate, and snappy. |
| Avro | .avro | A row-based file format that uses JSON to declare data structures (legacy implementation). Supported codecs: null, deflate, and snappy. |
| CSV | .csv | A plain text file containing comma-separated values. |
| JSON | .json | An open-standard, lightweight text file for data interchange, containing JSON objects delimited by newlines (\n). |
| MultiJSON | .multijson | A text file with a JSON array of property bags. |
| ORC | .orc | A file with optimized row columnar storage. |
| Parquet | .parquet | A file with column-oriented data for efficient data storage. |
| PSV | .psv | A text file with pipe-separated values. |
| RAW | .raw | A file containing unprocessed, unencrypted, and uncompressed data, ingested as a single string value. |
| SCsv | .scsv | A text file with semicolon-separated values. |
| SOHsv | .sohsv | A text file with SOH-separated values (SOH is ASCII code point 1). |
| TSV/TSVE | .tsv | A text file with tab-separated values (TSVE allows backslash escaping). |
| TXT | .txt | A standard text document containing plain text. |
| W3CLOGFILE | .log | A log file created and maintained by a web server, standardized by the W3C. |
Table Source: Microsoft Docs

Azure Synapse Data Ingestion: Properties and Permissions

Ingestion properties determine how data will be ingested using the Data Explorer. These properties can be added to your ingestion command followed by the “with” keyword. Here are a few of them:

| Ingestion Property | Description | Example |
| --- | --- | --- |
| ingestionMapping | A string value indicating how to map data from the source file to the table's columns. (Deprecates avroMapping, csvMapping, and jsonMapping.) | with (format="json", ingestionMapping = '[{"column":"rownumber", "Properties":{"Path":"$.RowNumber"}}, {"column":"rowguid", "Properties":{"Path":"$.RowGuid"}}]') |
| ingestionMappingReference | A string value indicating how to map data from the source file to the table's columns using a named mapping policy object. (Deprecates avroMappingReference, csvMappingReference, and jsonMappingReference.) | with (format="csv", ingestionMappingReference = "Mapping1") |
| creationTime | A datetime value to assign as the creation time of the ingested data. | with (creationTime="2022-07-07") |
| extend_schema | A boolean value that extends the schema of the table. If the original table schema is (a:string, b:int), a valid schema extension would be (a:string, b:int, c:datetime, d:string). | with (extend_schema=true) |
| format | Specifies the data format (see the supported formats above). | with (format="json") |
| ingestIfNotExists | A string value that prevents ingestion if the table already has data tagged with an "ingest-by:" tag of the same value. | with (ingestIfNotExists='["Part0001"]', tags='["ingest-by:Part0001"]') |
| ignoreFirstRecord | A boolean value to disregard the first record of every file during ingestion. | with (ignoreFirstRecord=true) |
| recreate_schema | A boolean value to recreate the schema of the table. This property takes precedence over extend_schema if both are applied. | with (recreate_schema=true) |
| tags | A list of tags to associate with the ingested data. | with (tags='["Tag1", "Tag2"]') |
| validationPolicy | A JSON string defining the validations to run during ingestion. | with (validationPolicy='{"ValidationOptions":1, "ValidationImplications":1}') |
Information Source: Microsoft Docs
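
To make this concrete, here is a minimal sketch of an ingest command that appends properties after the "with" keyword, run through the azure-kusto-data Python SDK. The endpoint, database, table, blob URL, and mapping name are placeholders rather than values from this article:

```python
# A minimal sketch, assuming the azure-kusto-data Python SDK and an Azure CLI
# login; the endpoint, database, table, blob URL, and mapping name are all
# placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<pool-name>.<workspace-name>.kusto.azuresynapse.net"  # Query endpoint
)
client = KustoClient(kcsb)

# Ingestion properties follow the "with" keyword as key=value pairs.
command = (
    ".ingest into table Sales ('https://mystorage.blob.core.windows.net/data/sales.csv') "
    'with (format="csv", ingestionMappingReference="Mapping1", ignoreFirstRecord=true)'
)
client.execute_mgmt("MyDatabase", command)  # control commands run via execute_mgmt
```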

Ingestion permissions define authorization levels for users. Azure Data Explorer offers several role-based authorizations for finer-grained control of data.

  • A Database Administrator has complete control over the database and can conduct any action. 
  • A Database User has the ability to read and create tables in the database. 
  • A Database Viewer can read data and metadata. 
  • A Database Ingestor can insert data into all existing tables in the database, but cannot query it.
  • A Function Admin can modify or delete a function and can give admin powers to another principal. 
  • A Table Admin can do anything within the scope of a specific table.

Likewise, there are a few more roles that can be assigned to any user or user group. These roles provide divided permissions to perform different tasks and help ensure security. You can find more information on Azure role-based authorizations and permissions on this page – Azure Synapse Access Control.
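
As a small, hedged illustration, roles can be granted through management commands. The sketch below assigns the Database Ingestor role using the azure-kusto-data Python SDK; the endpoint, database, and user principal are made-up examples:

```python
# A hedged sketch: granting the Database Ingestor role with a management
# command; the endpoint, database, and user principal are examples only.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<pool-name>.<workspace-name>.kusto.azuresynapse.net"
)
client = KustoClient(kcsb)

# The principal can now insert data into all existing tables, but not query them.
client.execute_mgmt(
    "MyDatabase",
    ".add database MyDatabase ingestors ('aaduser=etl-user@contoso.com') 'ETL service account'",
)
```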

Batching vs Streaming In Data Ingestion

Azure Synapse Data Ingestion can be executed in one of two ways: batching or streaming.

Batch-based Azure Synapse Data Ingestion groups user data and moves it to the destination at scheduled intervals. Batching ensures high ingestion throughput and is the default way of ingesting data. Data is batched according to the ingestion properties described in the table above. By default, a batch is sealed after 5 minutes, 1000 items, or a total batch size of 1 GB, whichever comes first. The data size limit for a single batch ingestion command is 4 GB.
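
If the defaults don't fit your workload, these thresholds can be tuned via the database's ingestion batching policy. Here is a minimal sketch, assuming the azure-kusto-data Python SDK; the endpoint, database, and threshold values are illustrative only:

```python
# A minimal sketch, assuming the azure-kusto-data SDK; the endpoint, database,
# and threshold values below are examples, not recommended settings.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<pool-name>.<workspace-name>.kusto.azuresynapse.net"
)
client = KustoClient(kcsb)

# Seal a batch after 2 minutes, 500 items, or 512 MB, whichever comes first.
batching_policy = (
    '{"MaximumBatchingTimeSpan": "00:02:00", '
    '"MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 512}'
)
client.execute_mgmt(
    "MyDatabase",
    f".alter database MyDatabase policy ingestionbatching '{batching_policy}'",
)
```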

Streaming refers to a continuous process of Data Ingestion that happens in real time or near real time. Data is retrieved from the streaming source, processed quickly, and later moved to column-store extents for further use.

Azure Synapse Data Ingestion Methods and Tools

Azure Synapse Data Explorer offers various ingestion techniques, each with its own set of target scenarios. The upcoming sections will discuss ingestion using managed pipelines, connectors & plugins, programmable ingestion via SDKs, tools, and direct ingestion.

Ingestion Using Managed Pipelines

Managed pipelines relieve organizations from the hassle of maintaining and running in-house Data Pipelines. They also offer a huge benefit of pre-configured connectors and scalability so that businesses can focus on their key projects without wasting time on managing Data Pipeline infrastructure.

Azure Synapse Data Explorer offers two managed Data Pipelines to help you perform Data Ingestion and Transformation on your data sets. These are as follows:

  • Azure Event Hub: Azure Event Hub is a fully managed, real-time Data Ingestion service for big data. It can retrieve and process millions of events per second, both inbound and outbound. Real-world examples where Azure Event Hub finds its use include IoT scenarios where millions of sensors send and receive data, or electric vehicles performing thousands of computations to analyze free space and the position and speed of nearby vehicles.
  • Azure Synapse Pipelines: Azure Synapse Analytics also offers Synapse Pipelines to construct end-to-end data workflows for enterprise Data Processing and Data Ingestion. Using Azure Synapse Pipelines, you can create and schedule data-driven workflows with support for over 90 sources. Azure Synapse Pipelines can also transform and enrich your data to provide actionable insights that give you a better perspective and boost your daily operations.

Programmatic Data Ingestion Using SDKs

The second method to ingest data using Azure Synapse Data Explorer is through SDKs. Programmatic Data Ingestion using SDKs reduces ingestion expenses and minimizes storage transactions during and after the ingestion process. 

To get started with programmatic ingestion, follow these steps:

Step 1: Configure programmatic ingestion by visiting your Synapse Studio. Head to the left pane, select Manage, and then click on Data Explorer pools.

Step 2: From the available list, choose the Data Explorer pool you want to use.

Image: Choosing a Data Explorer pool in Synapse Studio (Source: Microsoft Docs)

Step 3: Click on the Data Explorer pool name; a new screen appears listing its details, including the Query and Data Ingestion endpoints.

Image: Query and Data Ingestion endpoints of a Data Explorer pool (Source: Microsoft Docs)

Step 4: Use these endpoints when connecting your desired SDK. Supported options include the Python SDK, .NET SDK, Java SDK, Node SDK, Go SDK, and REST API.
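
To illustrate, here is a minimal sketch of queued (batch) ingestion using the Python SDK (azure-kusto-ingest). The ingestion endpoint, database, table, and file name are placeholders, and Azure CLI authentication is just one of several supported options:

```python
# A minimal sketch of queued ingestion with the Python SDK
# (azure-kusto-data / azure-kusto-ingest); the ingestion endpoint, database,
# table, and file are placeholders to replace with your own values.
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import IngestionProperties, QueuedIngestClient

# Use the Data Ingestion endpoint found in Step 3.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://ingest-<pool-name>.<workspace-name>.kusto.azuresynapse.net"
)
client = QueuedIngestClient(kcsb)

props = IngestionProperties(
    database="MyDatabase",
    table="Sales",
    data_format=DataFormat.CSV,
    ignore_first_record=True,  # skip the CSV header row
)
client.ingest_from_file("sales.csv", ingestion_properties=props)
```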

Using Kusto Query Language To Ingest Data

Kusto Query Language (KQL) is a powerful language to simplify the Azure Synapse Data Ingestion process. The benefit of using KQL is that you can bypass the Data Management services. However, this method is only suitable for exploration and prototyping. This strategy should not be used in production or high-volume applications.

KQL offers a bunch of control commands to help you streamline Data Ingestion: 

  • ".ingest inline" to ingest data into a table by pushing the data embedded inline in the command text.
  • ".set", ".append", ".set-or-append", or ".set-or-replace" to ingest the results of a query into a table, creating, appending to, or replacing it as needed.
  • ".ingest into" to ingest data into a table from one or more cloud storage files.

More information about KQL and its supported control commands can be found here – Ingesting Sample Data and Analyzing With Simple Query.
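
To give you a feel for these commands, the hedged sketch below runs ".ingest inline" and ".set-or-append" from the Python SDK; the database, tables, and sample data are invented for illustration:

```python
# A hedged sketch of the ".ingest inline" and ".set-or-append" commands run
# through the Python SDK; the database, tables, and sample row are invented,
# and an Events table is assumed to already exist.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<pool-name>.<workspace-name>.kusto.azuresynapse.net"
)
client = KustoClient(kcsb)
db = "MyDatabase"

# .ingest inline: push a small, embedded test record straight into a table.
client.execute_mgmt(db, '.ingest inline into table Events <| 1,"click","2023-01-01"')

# .set-or-append: create the target table if needed, then append query results.
client.execute_mgmt(db, '.set-or-append ClickEvents <| Events | where EventType == "click"')
```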

Data Ingestion Tools

Azure Synapse Data Ingestion offers one-click ingestion, a tool specifically designed to ingest data quickly and efficiently. One-click ingestion can ingest data from a wide variety of sources and file formats, create database tables, map tables, and suggest a schema that is easy to adjust.

One-click ingestion is available in Azure Synapse through the one-click ingestion wizard. To access and use the wizard, visit the following page: Access the one-click wizard.

Azure Synapse Ingestion Process

We’ve already briefed you on the process of how Azure Synapse ingests data. Once you’ve decided on the best Data Ingestion method for your needs, you need to perform the following steps:

  • Set Retention Policy: The retention policy governs the mechanism that automatically removes data from tables or materialized views. When you don't set a retention policy on a table, it is inherited from the database's retention policy. Setting a retention strategy is critical for clusters that continuously ingest data, in order to keep expenses under control.
  • Create a Table: You need to create a table beforehand to be able to ingest data. To create a table, you can either use the ".create table" command or the one-click ingestion tool (see the sketch after this list).
  • Create Schema Mapping: Schema mapping is a cardinal component of Data Ingestion and Transformation. It establishes semantic relationships that bind source data fields to target columns in the destination table. Mapping also allows you to merge data from multiple sources into a single table based on defined classes and attributes.
  • Set Update Policy: The update policy enables lightweight processing for scenarios that usually require complex processing at ingest time. It tells Azure Data Explorer to automatically append data to a target table whenever new data is inserted into the source table. This step is optional.
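
As promised above, here is a hedged sketch of the table-creation and retention-policy steps executed as management commands from the Python SDK; all names, schema columns, and the 30-day period are examples:

```python
# A hedged sketch of the "create a table" and "set retention policy" steps;
# the endpoint, database, table schema, and retention period are examples.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<pool-name>.<workspace-name>.kusto.azuresynapse.net"
)
client = KustoClient(kcsb)
db = "MyDatabase"

# Create the target table before ingesting any data.
client.execute_mgmt(db, ".create table Sales (Id: int, Region: string, Amount: real)")

# Keep data for 30 days so a continuously ingesting cluster stays affordable.
retention_policy = '{"SoftDeletePeriod": "30.00:00:00", "Recoverability": "Enabled"}'
client.execute_mgmt(db, f".alter table Sales policy retention '{retention_policy}'")
```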

Conclusion

Azure Synapse Analytics accelerates time to insight with its powerful workload management features and its combination of data engineering, machine learning, and business intelligence offerings in a single unified service. It is a cost-effective solution that provides separate storage and compute resources for better analysis and data exploration, especially for enterprise needs.

While Azure Synapse offers seamless integration with Microsoft technologies, there are limitations on how you can integrate third-party Data Sources and start ingesting data from them in Azure Synapse. Azure Synapse is designed primarily to cater to the needs of enterprises. For conventional business intelligence requirements, you can consider alternatives like Google BigQuery, Snowflake, Amazon Redshift, or Firebolt.

Thinking about how to set up and ingest data into one of these Data Warehouses?

Worry not!

Hevo Data can connect your frequently used databases and SaaS applications like Microsoft Azure for MariaDB, MySQL, PostgreSQL, MS SQL Server, Microsoft Advertising, Salesforce, Mailchimp, Asana, Trello, Zendesk, and 150+ other data sources to a Data Warehouse with a few simple clicks. It can not only export data from sources and load data to destinations, but also transform & enrich your data and make it analysis-ready, so that you can focus only on your key business needs.

Using Hevo is simple, and you can set up a Data Pipeline in minutes without worrying about any errors or maintenance aspects. Hevo also supports advanced data transformation and workflow features to mold your data into any form before loading it to the target database. We are happy to announce that Hevo has launched Azure Synapse as a destination.

Visit our Website to Explore Hevo

Why not try Hevo for yourself? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite firsthand.

Have more ideas on Microsoft Azure products or features you would like us to cover? Drop a comment below to let us know. 
