Your organization stores its transactional workloads in Azure SQL Database to maintain data integrity and consistency during database operations. However, Azure SQL Database is not well suited if your organization relies on advanced analytics, machine learning, or handling large volumes of semi-structured and unstructured data. In addition, Azure SQL Database does not support distributed transactions, database mirroring, or failover clustering.

Given these limitations, migrating to a modern analytical platform like Databricks is a sound choice. Moving from Azure SQL to Databricks gives you a cloud-based unified analytics engine that handles data processing at scale using Apache Spark. With its interactive workspace, you can analyze data using Python, R, Scala, and SQL and easily share your findings with others.

Let’s look at the process for migrating from Azure SQL Database to Databricks. 

Why Migrate to Databricks?

Here are a few reasons to move data from Azure SQL Database to Databricks:

  • Delta Lake: An open-format storage layer that brings your transactional, analytical, and AI data together in one place, so you can work across multiple data formats without manual conversion.
  • Databricks SQL Service: You can process BI and SQL workloads on the data lake using a dedicated SQL-native workspace and built-in connectors for BI tools. Since Databricks SQL is a separate environment, you can query the data lake directly within the Databricks platform without affecting your data science workloads (see the example below).
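For instance, here is a minimal sketch of querying a Delta table through a Databricks SQL warehouse from Python with the open-source databricks-sql-connector package. The hostname, HTTP path, access token, and table name are placeholder assumptions, not values from this guide:

```python
# Minimal sketch: query a Delta table through a Databricks SQL warehouse.
# Requires: pip install databricks-sql-connector
# All connection details and the table name below are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse_id>",
    access_token="<personal_access_token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM my_catalog.my_schema.orders LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```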

Azure SQL: A Brief Overview

Azure SQL Database is a relational database platform-as-a-service (PaaS) built on the Microsoft SQL Server engine in the Azure cloud. It is part of the Azure SQL family, which also includes Azure SQL Managed Instance and SQL Server on Azure Virtual Machines (VMs).

With Azure SQL Database, you can build a high-performance data storage layer for modern cloud applications in Azure that processes both relational and non-relational data. You can automatically scale compute resources up or down according to your needs using the Hyperscale service tier.

Additionally, Azure SQL Database offers advanced query processing capabilities, such as intelligent query processing and high-performance in-memory technologies. 

Databricks: A Brief Overview

Databricks is an Apache Spark-based data intelligence and analytics platform. Built on a lakehouse architecture, it allows you to execute all your analytical workloads at scale.

With Mosaic AI, Databricks helps you build, scale, and deploy generative artificial intelligence (AI) and machine learning (ML) models. Using Delta Sharing with Unity Catalog, you can govern shared structured and unstructured data, AI and ML models, dashboards, and flat files across multiple cloud platforms.

With Databricks’ automated cluster scaling, you can cost-effectively utilize computing resources to execute multiple tasks within a job. Automated monitoring allows you to scan the workloads to identify anomalies, track resource usage, and ensure applications function correctly. 

Solve your data replication problems and connect Azure SQL to Databricks with Hevo’s reliable, no-code, automated pipelines with 150+ connectors.
Get your free trial right away!

Methods for Data Migration from Azure SQL to Databricks

To transfer your data from Azure SQL Database to Databricks, you can employ Hevo Data or a manual CSV Export/Import method. 

Method 1: How to Connect Azure SQL to Databricks Using CSV Export/Import Method

In this migration, you first export the data from the Azure SQL Database as a CSV file using Microsoft command-line utilities such as bcp or Invoke-Sqlcmd piped to Export-Csv. Then, download the CSV file from Azure Cloud Shell and load it into Databricks using the add data UI.

Step 1: Export Data From Azure SQL Database As a CSV File 

Before you begin, ensure the following prerequisites are ready:

  • An active Microsoft Azure account.
  • Access to Azure Cloud Shell.
  • Access your Azure SQL Database from Azure Portal or create one via Azure Cloud Shell.
  • Obtain the Azure SQL database’s server name, database name, username, and password. 

You can use any of the following ways to extract data from the Azure SQL database as a CSV file:

  1. Using BCP Utility
  • Log into the Azure Portal with your Azure credentials.
  • Click the Cloud Shell icon from the top-right corner of the Azure Portal.
  • Choose the Bash environment.
  • Run the following command to export data from Azure SQL as a CSV file:
bcp <schema>.<object_name> out ./data/filename.csv -t "," -w -S <AzureSQLServer_name>.database.windows.net -U <username> -P <password> -d <database_name>

You can replace the placeholders with your actual information. 

  2. Using the Invoke-Sqlcmd and Export-Csv Utilities
  • Log into the Azure Portal.
  • Click the Cloud Shell icon from the top navigation of the Azure Portal.
  • Choose the PowerShell environment.
  • Execute the following command to export data from Azure SQL as a CSV file:
Invoke-Sqlcmd -Query "SELECT * FROM <schema>.<object_name>" -ServerInstance "<AzureSQLServer-name>.database.windows.net" -Database "<database_name>" -Username "<username>" -Password "<password>" -IgnoreProviderContext | Export-Csv -Path ./data/filename.csv -Delimiter "," -Encoding UTF8

You can replace the placeholders with your actual database information. 

Step 2: Prepare Your CSV File to Import

Before importing your CSV file into Databricks, ensure your data is clean, consistent, and correctly formatted. To preprocess your CSV file, do the following:

  • Remove invalid or irrelevant data, like empty rows or columns, to clean and format your CSV file.
  • Check that all rows contain the same number of fields.
  • Verify that all the values in a column properly align with their corresponding columns. For example, if a column should contain numeric values, then check that it contains only numbers, not strings or dates.

For more information about how Databricks handles CSV formatting, read about the format options in the Databricks documentation.
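As a quick illustration, here is a minimal pandas sketch of these checks. The file path and the order_amount column are hypothetical examples, not values taken from your database:

```python
# Minimal sketch of basic CSV sanity checks with pandas before uploading to Databricks.
# The file path and the 'order_amount' column are hypothetical examples.
import pandas as pd

df = pd.read_csv("./data/filename.csv")

# Drop rows and columns that are completely empty.
df = df.dropna(how="all").dropna(axis=1, how="all")

# Rows with too many fields make read_csv fail; rows with too few show up as NaNs,
# so flag any remaining missing values for review.
print(df.isna().sum())

# Check that a column expected to be numeric really is numeric.
if not pd.api.types.is_numeric_dtype(df["order_amount"]):
    raise ValueError("order_amount contains non-numeric values")

df.to_csv("./data/filename_clean.csv", index=False)
```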

Step 3: Import CSV File Data to Databricks Using Add Data UI

Before you start, verify the following prerequisites are in place:

  • Create your Databricks workspace.
  • Configure the Databricks cluster to process your data.

Here are the steps to load data into Databricks using the add data UI:

  1. To access the UI, click New > Add data in the sidebar of your Databricks workspace.
  2. Choose Create or modify table to load the CSV file into a Delta table.
  3. Import your file by clicking the file browser button, or simply drag and drop it onto the upload area.

Your Databricks workspace now has a new Delta table containing the data exported from your Azure SQL Database.
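If you prefer to load the file programmatically rather than through the UI, here is a minimal PySpark sketch you could run in a notebook attached to a cluster. The upload path and table name are placeholder assumptions:

```python
# Minimal sketch: load the exported CSV into a Delta table from a Databricks notebook.
# The path and table name below are placeholders; `spark` is predefined in notebooks.
df = (
    spark.read
    .option("header", "true")       # first row holds column names
    .option("inferSchema", "true")  # let Spark infer column types
    .csv("/FileStore/exports/filename.csv")
)

# Write the data as a managed Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("my_schema.azure_sql_export")

# Quick sanity check on the row count.
print(spark.table("my_schema.azure_sql_export").count())
```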

Limitations of Azure SQL to Databricks Migration Using CSV Export/Import Method

  • Limited Data Volume: The CSV Export/Import method suits smaller datasets. When working with large data volumes, CSV files can lead to performance issues during data migration and processing. 
  • High Complexity: You must write custom code to connect with the Azure SQL database and export the data into CSV format. Manual interventions are required to prepare your CSV file before uploading it into Databricks.  

Method 2: How to Integrate Azure SQL to Databricks Using Hevo Data

Hevo Data is a no-code, real-time ELT platform that helps you cost-effectively automate data pipelines that are flexible to your preferences. With the integration of 150+ data sources, Hevo Data allows you to extract and load data and transform it to conduct in-depth analysis.

Let’s look into some of the key features of Hevo Data:

  • Data Transformation: Hevo Data provides analyst-friendly transformation options like Python-based scripts or drag-and-drop blocks. It lets you clean, prepare, and transform data before transferring it to your destination.
  • Incremental Data Load: Hevo Data enables real-time data migration between a source and destination and optimizes bandwidth utilization on both ends of the data pipeline.
  • Auto-Schema Mapping: Hevo Data’s auto-mapping feature automatically identifies the incoming data format and copies it to the destination schema, eliminating manual schema management. Based on your data replication requirements, you can select Full or Incremental mappings. 

Let’s get started with how to load Azure SQL data into Databricks using Hevo Data.

Step 1: Set Up Azure SQL as Your Source

Check that the following prerequisites are in place before you start the setup:

  • Microsoft SQL Server version is 2008 or higher.
  • Whitelist Hevo’s IP addresses.
  • Grant SELECT, ALTER DATABASE, and VIEW CHANGE TRACKING permissions to the database user.
  • Enable the Change Tracking mechanism (a sketch of the required T-SQL appears after this list).
  • Obtain the Azure SQL Server database’s hostname and port number.
  • You must be assigned the Pipeline Administrator, Team Administrator, or Team Collaborator role in Hevo.
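Change Tracking and the grants above are applied with T-SQL. Below is a minimal sketch that runs those statements through pyodbc; the connection details, table, and user names are placeholders, and you should confirm the exact grants against Hevo's documentation:

```python
# Minimal sketch: enable Change Tracking and grant permissions via pyodbc.
# Connection details, table, and user names are placeholders; run with a privileged login.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<database>;"
    "UID=<admin_user>;PWD=<password>",
    autocommit=True,  # ALTER DATABASE cannot run inside a transaction
)
cursor = conn.cursor()

# Enable Change Tracking at the database level (retention values are illustrative).
cursor.execute(
    "ALTER DATABASE [<database>] SET CHANGE_TRACKING = ON "
    "(CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON)"
)

# Enable Change Tracking on each table you want to replicate.
cursor.execute("ALTER TABLE dbo.orders ENABLE CHANGE_TRACKING")

# Grant the permissions listed above to the user Hevo connects as.
cursor.execute("GRANT SELECT, VIEW CHANGE TRACKING ON SCHEMA::dbo TO [<hevo_user>]")
cursor.execute("GRANT ALTER ON DATABASE::[<database>] TO [<hevo_user>]")

conn.close()
```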

Here are the steps to configure the Azure SQL Database as your source in Hevo:

  1. Sign in to your Hevo account and click PIPELINES from the Navigation Bar.
  2. Navigate to the Pipelines List View and click the + CREATE button.
  3. Choose Azure SQL Server as your source type in the Select Source Type page.
  4. Enter the required information in the Configure your Azure SQL Server Source page.
Azure SQL to Databricks: Configuring your Azure SQL Server Source Page
  5. Click TEST CONNECTION, and then click TEST & CONTINUE to finish the source configuration.

To learn more about source setup, read the Azure SQL Server Documentation in Hevo.

Step 2: Set Up Databricks as Your Destination

Hevo Data provides the following two options to configure Databricks as the destination:

  1. The Databricks Partner Connect (Hevo Data’s recommended method)
  2. The Databricks Credentials

Before you begin, verify that the following prerequisites are in place:

  • Access to a cloud service account on AWS, Azure, or Google Cloud Platform.
  • Access to Databricks workspace in your cloud service account or set up a new one.
  • If IP access lists are enabled for your workspace, the Databricks workspace must allow connections from your region’s Hevo IP addresses.
  • Obtain your Databricks workspace’s URL, which takes the form https://<deployment name>.cloud.databricks.com (a quick way to verify it is shown after this list).
  • You must have the Team Collaborator role or any administrator role other than Billing Administrator in Hevo.
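As a quick sanity check, the following minimal sketch calls the Databricks REST API's cluster list endpoint to confirm that the workspace URL and a personal access token (both placeholders here) are valid:

```python
# Minimal sketch: verify the Databricks workspace URL and a personal access token
# by listing clusters through the REST API. Both values below are placeholders.
import requests

workspace_url = "https://<deployment name>.cloud.databricks.com"
token = "<personal_access_token>"

response = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()  # raises if the URL or token is wrong
print(response.json())
```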

Here are the steps to configure Databricks as the destination:

  1. Log in to your Hevo account or set up one.
  2. Go to the Navigation Bar and select the DESTINATIONS option.
  3. From the Destinations List View, click the + CREATE button.
  4. In the Add Destination page, select Databricks as your destination type.
  5. Specify the required details in the Configure your Databricks Destination page. 
Azure SQL to Databricks: Configuring your Databricks Destination Page
  6. Click TEST CONNECTION.
  7. Click SAVE & CONTINUE to complete the configuration.

For more information about the destination configuration, read the Databricks destination documentation in Hevo.

Use Cases of Azure SQL to Databricks Migration

  • Hugging Face Transformers: An open-source library that offers APIs and pipelines for integrating existing pre-trained machine learning models into your workflow, which you can then fine-tune to improve performance (a short example follows this list).
  • Legacy Dashboards: The Databricks UI includes an advanced SQL editor that lets you build legacy dashboards, combining visualizations and text boxes to derive valuable insights from your data.
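For instance, here is a minimal sketch of running a pre-trained Hugging Face pipeline inside a Databricks notebook; the task and input text are illustrative:

```python
# Minimal sketch: run a pre-trained Hugging Face model in a Databricks notebook.
# Requires: pip install transformers (plus a backend such as torch). Task and text are illustrative.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # downloads a default pre-trained model
print(sentiment("The migration from Azure SQL to Databricks went smoothly."))
```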

Conclusion

Choosing the right method for Azure SQL to Databricks migration depends on your business requirements, technical expertise, and dataset size. 

In this article, you explored two methods for migrating data from Azure SQL to Databricks. The CSV Export/Import method involves multiple manual steps, which can be time-consuming and difficult to implement for large datasets.

Alternatively, you can utilize Hevo Data, which allows you to automate a data pipeline between Azure SQL and Databricks in minutes. This will update your tables in Databricks in real-time without manual intervention. Explore the process to integrate Azure PostgreSQL with Databricks for optimized data management and analysis.

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also, check out our unbeatable pricing to choose the best plan for your organization.

Share your experience of Azure SQL to Databricks integration in the comments section below!

Frequently Asked Questions (FAQs)

Q. I am running an SQLAlchemy script to connect to the Azure SQL database. Whenever the cluster starts, I must manually add the Azure Databricks cluster’s IP address to the Azure SQL database firewall. How can I automate this?

A. A common approach is to enable the Allow Azure services and resources to access this server setting on the Azure SQL server’s firewall so that any Azure-hosted service, including your Databricks cluster, can connect. Alternatively, you can script the firewall rule creation (for example, with the Azure CLI) so that the cluster’s current public IP is added automatically whenever the cluster starts.
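A minimal sketch of the scripted approach is shown below; it assumes the Azure CLI is installed and authenticated on the machine running it, and the resource group and server names are placeholders:

```python
# Minimal sketch: add the current outbound public IP to the Azure SQL server firewall.
# Assumes the Azure CLI is installed and logged in; resource group and server names are placeholders.
import subprocess
import requests

# Look up the cluster's outbound public IP (any IP-echo service works).
public_ip = requests.get("https://api.ipify.org", timeout=30).text

subprocess.run(
    [
        "az", "sql", "server", "firewall-rule", "create",
        "--resource-group", "<resource_group>",
        "--server", "<sql_server_name>",
        "--name", f"databricks-{public_ip}",
        "--start-ip-address", public_ip,
        "--end-ip-address", public_ip,
    ],
    check=True,
)
```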

Veeresh Biradar
Senior Customer Experience Engineer

Veeresh is a skilled professional specializing in JDBC, REST API, Linux, and Shell Scripting. With a knack for resolving complex issues and implementing Python transformations, he plays a crucial role in enhancing Hevo's data integration solutions.