Databricks Connect to SQL Server Simplified: 2 Easy Methods

on Databricks, ETL, ETL Tools, ETL Tutorials, Microsoft Azure, SQL Server • November 5th, 2021

Integrating SQL Server & Databricks

Do you want to analyze your Microsoft SQL Server data in Databricks? Are you finding it challenging to set up the SQL Server Databricks Integration? If so, you’ve arrived at the right place! This article will answer all your questions and take the strain out of finding a truly efficient setup.

Upon a complete walkthrough of this article, you will gain a good understanding of Microsoft SQL Server and Databricks, along with the salient features they offer. The article will also discuss the two most efficient methods you can leverage to set up Databricks Connect to SQL Server. Read along to learn the steps required to establish this connection.


Prerequisites

  • Basic understanding of JDBC Drivers.

Introduction to SQL Server

SQL Server is a Relational Database Management System developed by Microsoft that houses support for a wide range of business applications including Transaction Processing, Business Intelligence, and Data Analytics. Microsoft SQL Server, like any other RDBMS software, is based on SQL, a standardized programming language used by Database Administrators (DBAs) and other IT professionals to administer databases and query the data they contain.

Microsoft SQL Server is primarily based on a Row-based table structure that connects related data items in distinct tables to one another, eliminating the need to store the same data redundantly in multiple places. To ensure Data Accuracy, the Relational Model offers referential integrity and other integrity constraints. These checks are part of a broader adherence to the ACID (Atomicity, Consistency, Isolation, and Durability) properties, which are designed to ensure that database transactions are processed reliably.

Introduction to Databricks


Databricks is one of the most popular Cloud-based Data Engineering platforms, used to handle and manipulate vast amounts of data as well as to explore it using Machine Learning models. A recent addition to Azure, it is the newest Big Data offering on the Microsoft Cloud. It is readily available to businesses and helps them realize the full potential of their Data, ELT Procedures, and Machine Learning.

This Apache Spark-based Big Data platform houses Distributed Systems, which means the workload is automatically dispersed across multiple processors and scales up and down according to business requirements. For complex tasks, this increased efficiency translates into real savings in both time and cost. Resources (such as the number of compute clusters) are easily managed, and it only takes a few minutes to get started, as with all other Azure tools.

Key Features of Databricks

Some of the key features of Databricks are as follows:

  • Databricks provides the users with an Interactive Workspace which enables members from different teams to collaborate on a complex project.
  • While Azure Databricks is best suited for large-scale projects, it can also be leveraged for smaller projects such as development and testing. Databricks can be utilized as a one-stop shop for all your analytics needs.
  • Databricks is powerful as well as cost-effective. In recent years, using Big Data technology has become a necessity for many firms to capitalize on the data-centric market. Databricks is incredibly adaptable and simple to use, making distributed analytics much more accessible.

Ways to set up Databricks Connect to SQL Server

Method 1: Using Custom Code to Connect Databricks to SQL Server

This method represents a more manual approach in which you create a SQL database, upload your data to Databricks, and then connect the two with custom code to load and analyze it. Here, we’ll be using an Azure-based Databricks setup, but you can apply the same process to all Cloud versions of Databricks as per your deployment. Furthermore, the procedure remains universal for all Local and Cloud-based SQL Server Database deployments.

Method 2: Connecting SQL Server to Databricks using Hevo Data

A fully managed, No-code Data Pipeline platform like Hevo Data helps you load data from Microsoft SQL Server & Azure SQL Database (among 100+ Data Sources, including 40+ free sources) to Databricks in real time, in an effortless manner. Hevo, with its minimal learning curve, can be set up in a matter of minutes, letting users load data without compromising performance. Its strong integration with various sources such as databases, files, and analytics engines gives users the flexibility to bring in data of all kinds as smoothly as possible, without having to write a single line of code.

Sign up here for a 14-day Free Trial!

Methods for Building Databricks Connect to SQL Server

There are multiple ways to set up Databricks Connect to SQL Server, but we have handpicked two of the easiest methods to do so:


Method 1: Using Custom Code to Connect Databricks to SQL Server

Follow the steps given below to set up Databricks Connect to SQL Server by writing custom ETL scripts. This is an easy six-step process that begins with creating a SQL Server Database on Azure. In case you already have a SQL Server Database deployed locally or on another Cloud platform such as Google Cloud, you can jump directly to Step 4 to connect your database.

Note: Here, we are using a Databricks setup deployed on Azure for tutorial purposes. However, you can apply the same procedure for connecting a SQL Server Database to Databricks deployed on other Clouds such as AWS and GCP. The process and drivers involved remain universal.

Step 1: Create a New SQL Database

  • Log in to the Microsoft Azure portal using the appropriate credentials.
  • You can either click directly on the Create a Resource button or navigate to the left-side vertical menu bar and select SQL Databases to get redirected to the Create SQL Database screen.
  • Fill in the required fields like Database Name and choose your desired server. Once you have done that, click on the Review + Create button to create your SQL Database on Azure.

Step 2: Upload the Desired File to Databricks Cluster

  • Navigate to the left side menu bar on your Azure Databricks Portal and click on the Data Icon.
  • Now, click on the Add Data button.
  • Browse to the file that you wish to upload to the Azure Databricks Cluster and then click on the Select button to upload it.
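Optionally, once you have a notebook attached to the cluster (you will create one in the next step), you can confirm the upload with a quick check. This is a minimal sketch, assuming the file was uploaded through the UI, which typically places it under /FileStore/tables:

// List the default upload location to verify that the file is present.
display(dbutils.fs.ls("/FileStore/tables"))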

Step 3: Load the Data

  • Click on the Workspace icon present on the left side of the home page.
  • Click on the Create option and then select Notebook to write your own custom code.
  • Now, provide a unique name to the Notebook and select Scala as the preferred language in which you wish to write the custom code.
  • By default, the Cluster’s name is pre-populated if you are working with a single cluster. In case you have created multiple clusters, you can select the desired cluster from the drop-down menu.
  • Finally, click on the Create button.

Step 4: Create the JDBC URL and Properties

  • Firstly, you need to create a JDBC URL containing the information associated with your Local SQL Server deployment or your SQL Database on Azure or any other Cloud platform. You further need to add details such as the Port Number, User, and Password. You can refer to the following piece of code to do so:
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
You can replace the Hostname, Port, and Database parameters as per your deployment; a complete cell that defines these variables is sketched at the end of this step. Once you’ve added your DB-specific details, your URL string will look like this:

Example: val url = "jdbc:sqlserver://azsqlhevoserver.database.windows.net:1433;database=azsqlhevodb;user=Rakesh;password=*******"
  • Now it’s time to create a Properties object to hold the connection credentials:
import java.util.Properties
val myproperties = new Properties()
myproperties.put("user", "Rakesh")
myproperties.put("password", "******")

Step 5: Check the Connectivity to the SQL Server Database

The code given below registers the JDBC driver class that Spark will use when connecting to the SQL Server database:

val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
myproperties.setProperty("Driver", driverClass)
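
Note that setting the driver property alone does not open a connection. To actually verify connectivity, you can run a trivial query through the JDBC URL and Properties created in Step 4. The following is a minimal sketch, where connectivity_test is just an illustrative alias:

// Reading a trivial query succeeds only if Spark can reach the database,
// so this doubles as a connectivity check.
val testDf = spark.read.jdbc(jdbcUrl, "(SELECT 1 AS ok) connectivity_test", myproperties)
testDf.show()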

Step 6: Read & Display the Data

  • To read the content of the file that you uploaded in Step 2, you can create a DataFrame. The columns in the first row of the file will be used as the DataFrame’s column names if the header option is set to true. You can tell Spark to automatically infer the schema of each column by setting inferSchema to true.
val mydf = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("File") // replace "File" with your uploaded file's path, typically under /FileStore/tables
  • Lastly, to display the data, you can simply use the display function:
display(mydf)
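
To complete the flow described at the start of this method, you can also load the DataFrame into the SQL database created in Step 1, reusing the JDBC URL and Properties from Step 4. The following is a minimal sketch, where dbo.mytable is a hypothetical table name:

// Write the DataFrame to SQL Server over JDBC; "overwrite" replaces the
// table if it already exists (use "append" to add rows instead).
mydf.write
  .mode("overwrite")
  .jdbc(jdbcUrl, "dbo.mytable", myproperties)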

Once you follow all the above steps in the correct sequence, you will have successfully connected Databricks to SQL Server.

Limitations of Writing Custom Code to Set up Databricks Connect to SQL Server

There are a few limitations to using manual ETL scripts to connect Databricks to SQL Server. Some of them are listed below:

  • Manually writing ETL Scripts requires significant technical bandwidth.
  • If you need the data to be transferred in real-time, writing custom scripts to accomplish this can be tricky, as it can lead to a compromise in Data Accuracy and Consistency.
  • This approach is suitable for a one-time bulk insert. In most cases, you will need to execute a continuous load process to ensure that the destination always receives the latest data. In that case, you will need logic to handle the duplicate data in real-time.

Method 2: Connecting SQL Server to Databricks using Hevo Data


Using Hevo Data would be a much superior alternative to the previous method, as it automates the ETL process, allowing your developers to focus on BI rather than on coding complex ETL pipelines. Hevo is a No-code Data Pipeline that helps you transfer data from Microsoft SQL Server, Azure SQL Database, and even a SQL Server Database on Google Cloud (among 100+ other Data Sources) to Databricks & lets you visualize it in a BI tool.

Hevo is fully managed and completely automates the process of loading data from your desired source and enriching the data, and transforming it into an analysis-ready form without having to write a single line of code. In addition, its fault-tolerant architecture ensures that the data is handled securely and consistently with zero data loss.

It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using various BI tools such as Power BI, Tableau, etc.

Get Started with Hevo for free

Check Out What Makes Hevo Amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Conclusion

This blog introduced you to two methods that can be used to set up Databricks Connect to SQL Server. It also briefed you about SQL Server and Databricks along with their features. If you want to integrate data from various data sources such as SQL Server into your desired Database/destination like Databricks and seamlessly visualize it in a BI tool of your choice, Hevo Data is the right choice for you! It will help simplify the ETL and management process of both the data sources and destinations.

Visit our Website to Explore Hevo

Hevo Data provides its users with a simpler platform for integrating data from 100+ Data Sources like SQL Server to Databricks for analysis. It is a No-code Data Pipeline that can help you combine data from multiple sources. You can use it to transfer data from multiple data sources into your Data Warehouse, Database, or a destination of your choice. It also provides you with a consistent and reliable solution to manage data in real time, ensuring that you always have analysis-ready data in your desired destination.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand. You can also have a look at our unbeatable pricing, which will help you choose the right plan for your business needs!

Share your preferred approach for setting up Databricks Connect to SQL Server. Let us know in the comments below!
