The importance of using data in sectors like Data Science, Machine Learning, etc. grows as the amount of data sources, and data types in an organization expand. Converting raw data into a clean and reliable form is a key step for extracting meaningful insights out of it. ETL (Extract, Transform, and Load) is a Data Engineering process that involves extracting data from various sources, transforming it into a specific format, and loading it to a centralized location (majorly a Data Warehouse). One of the best ETL Pipelines is provided by Databricks ETL.
Databricks is a platform that helps you extract value out of massive amounts of data and make use of it to help data teams in solving the world’s toughest problems like the life sciences industry to find cures faster, the entertainment industry to deliver more personalized content, etc.
This article commences with an easy introduction to ETL (Extract, Transform and Load) and furthermore elaborates on Databricks ETL in specific. It provides a step-by-step guide for 2 different methods to perform ETL operations. The first method involves using Azure Databricks ETL, which is a completely manual process. While the second method uses Hevo Data’s No-Code Data Pipeline Solution.
Table of Contents
- Introduction to ETL
- Introduction to Databricks
- Methods to Set Up Databricks ETL
- Method 1: Extract, Transform, and Load using Azure Databricks ETL
- Limitations of using Azure Databricks ETL
- Method 2: Setting Up Databricks ETL using Hevo Data
- Advantages of using Hevo Data
Introduction to ETL
ETL, in simple terms, refers to the 3 phases (Extract, Transform, and Load) of Data Integration that are used to combine data from various sources. It is most frequently used to construct a Data Warehouse. Data is first extracted from a source system, transformed into an analyzable format, and then loaded into a specific location (majorly a Data Warehouse).
The first stage in this process is to extract data from heterogeneous sources such as SaaS applications, APIs (Application Programming Interface), Transactional Databases, etc. Some of the data types are semi-structured JSON server logs, while others are structured outputs of widely used systems. The extraction can be done in a variety of ways such as:
- Partial Extraction: This is the simplest method of extracting data from multiple sources. In Partial Extraction, the sources notify the user when the record(s) have been updated or added.
- Complete Extraction: Some systems are unable to determine which data has been modified at all. In this instance, the only way to get the data out of the system is to do a thorough extract. This method necessitates having a copy of the previous extract in the same format so that the changes may be identified.
In this stage, the raw data gathered from the sources is transformed into a specific (analyzable) format that can be used by various applications. Data is cleansed, mapped, and converted to a specified schema, to meet operational requirements. Several types of transformations are used in this process to assure data quality and integrity. Data is generally loaded into a Staging Database which ensures a quick rollback in case something goes wrong. Moreover, you can prepare audit reports for regulatory compliance during this stage.
In this stage, data is transferred from the Staging Database to a targeted destination (a Data Warehouse in most cases). This method might be straightforward or complex, depending on the application’s requirements. ETL tools or custom code can be used to complete each of these processes.
For more information on ETL, click here.
Introduction to Databricks
Databricks is a cloud-based platform that aims to accomplish the Data Lakehouse tasks in a single unified platform. Databricks sits on top of your existing data lake and can connect to AWS S3 and Google Cloud Storage, among other cloud storage options.
Databricks is built for Data Scientists, Data Engineers, and Data Analysts to assist them in combining the concepts of Data Science and Data Engineering across various cloud platforms. This connection makes the process of preparing data, experimenting, and deploying Machine Learning applications much easier.
The DataBricks technology is quicker than Apache Spark and claims to speed up development and innovation. It also increases security by standardizing the pipeline for producing Machine Learning tools. With just a few clicks, data processing clusters may be created and deployed. Moreover, it has several built-in Data Visualization options like Graphs, Bar Charts, etc.
Layers in Databricks Architecture
There are several built-in tools for Data Science, BI Reporting, and MLOps. However, the 2 important layers in Databricks are:
- Delta Lake: Delta Lake is a Storage Layer that helps Data Lakes be more reliable. Delta Lake integrates streaming and batch data processing while providing ACID (Atomicity, Consistency, Isolation, and Durability) transactions and scalable metadata handling. Furthermore, it is fully compatible with Apache Spark APIs and runs on top of your existing data lake.
- Delta Engine: The Delta Engine is a query engine that is optimized for efficiently processing data stored in the Delta Lake.
For more information on Databricks, click here.
Methods to Set Up Databricks ETL
The first method is an entirely manual process and demands high-end programming expertise. Generally, you need to integrate Databricks ETL with a Cloud Platform such as Azure, Google Cloud Platform, etc. for extracting, transforming, and loading data to a specific location. For ease of explanation, you will be using Azure Databricks in this article.
Hevo provides a hassle-free solution that helps you set up Databricks ETL without any intervention in an effortless manner for free. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. With Hevo Data, you can have everything at your fingertips irrespective of your technical background i.e. you do not need any technical expertise anymore for Databricks ETL.
Methods to Set Up Databricks ETL
Now, that you have a basic understanding of ETL, let’s have a look at the methods that are available to set up Databricks ETL. There are majorly 2 methods to set up Databricks ETL:
- Method 1: Extract, Transform, and Load using Azure Databricks ETL
- Method 2: Setting Up Databricks ETL using Hevo Data
Let’s walk through these methods in detail, one by one.
Method 1: Extract, Transform, and Load using Azure Databricks ETL
In this method, you will use Azure Synapse Connector for Azure Databricks ETL to load data to Azure Databricks. Before starting with the process, you have a few prerequisites to go through. Following are the prerequisites of setting up Azure Databricks ETl:
- An Azure account.
- Create an Azure Synapse with admin privileges. Also, create a server-level firewall rule, and a master key. You can refer to this link for more information.
- Create an Azure Blob Storage account and make sure to retrieve the access key for the same. You can refer to this link for more information.
- Create an Azure Data Lake Storage Gen2 Storage account with administrative privileges. You can refer to this link for more information.
- Create a service principal that can access Azure AD Application resources. You can refer to this link for more information.
- Basic understanding of Scala language.
Now, follow the steps below to set up Azure Databricks ETL:
- Step 1: Create an Azure Databricks ETL Service
- Step 2: Create a Spark Cluster in Azure Databricks ETL
- Step 3: Create a Notebooks in Azure Databricks ETL Workspace
- Step 4: Extract Data from the Storage Account
- Step 5: Transform the Extracted Data
- Step 6: Load Transformed Data into Azure Synapse
Step 1: Create an Azure Databricks ETL Service
Follow the steps below to create an Azure Databricks ETL Service:
- Log in to your Azure portal and select Create a resource > Analytics > Azure Databricks as shown in the image below.
- This will lead you to a new window, where you will be asked for a couple of parameters like Workspace Name, Subscription, Resource Group, Location, and Pricing Tier. Fill up the details and select Pin to dashboard > Create.
Step 2: Create a Spark Cluster in Azure Databricks ETL
Follow the steps below to create a Spark Cluster in Azure Databricks ETL:
- Go to the Azure Databricks ETL Service that you created in your last step, then select Launch Workshop > New Cluster.
- This will redirect you to a new window where you will be asked for a couple of information like Cluster Name, Cluster-Mode, etc. as shown in the image below.
NOTE: Ensure to select the “Terminate after specific minutes of inactivity” checkbox.
- Once you have filled up all the necessary details, click on the “Create Cluster” button.
Step 3: Create a Notebooks in Azure Databricks ETL Workspace
In this step, you will simply create a file system in the Azure Data Lake Storage Gen2 account. Furthermore, you will run a couple of code snippets to configure the storage account. Follow the steps given below for the same:
- Go to the Azure Databricks ETL Service that you create before, and select Launch Workspace > Workspace > Create > Notebook as shown in the image below.
- This will open up a Create Notebook dialog box where you will have to fill up a couple of parameters like Name, Language, and Cluster. For name, you can give an appropriate name as per your choice. However, select “Scala” for language and the Spark Cluster that you created in Step 2 for the cluster parameter. Once done, click on the “Create” button.
- Following is the Session Configuration code snippet. Copy-paste it into your Azure Databricks ETL Notebook.
val appID = "<appID>" val secret = "<secret>" val tenantID = "<tenant-id>" spark.conf.set("fs.azure.account.auth.type", "OAuth") spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") spark.conf.set("fs.azure.account.oauth2.client.id", "<appID>") spark.conf.set("fs.azure.account.oauth2.client.secret", "<secret>") spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<tenant-id>/oauth2/token") spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
- Following is the Account Configuration code snippet. Similar to the last code snippet, copy-paste it into your Azure Databricks ETL Notebook.
NOTE: Replace the placeholder values for app-id, secret, tenant-id, and storage-account-name with the values you acquired while completing the prerequisites. Also, replace the placeholder value for file-system-name with the name of your choice.
val storageAccountName = "<storage-account-name>" val appID = "<app-id>" val secret = "<secret>" val fileSystemName = "<file-system-name>" val tenantID = "<tenant-id>" spark.conf.set("fs.azure.account.auth.type." + storageAccountName + ".dfs.core.windows.net", "OAuth") spark.conf.set("fs.azure.account.oauth.provider.type." + storageAccountName + ".dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") spark.conf.set("fs.azure.account.oauth2.client.id." + storageAccountName + ".dfs.core.windows.net", "" + appID + "") spark.conf.set("fs.azure.account.oauth2.client.secret." + storageAccountName + ".dfs.core.windows.net", "" + secret + "") spark.conf.set("fs.azure.account.oauth2.client.endpoint." + storageAccountName + ".dfs.core.windows.net", "https://login.microsoftonline.com/" + tenantID + "/oauth2/token") spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true") dbutils.fs.ls("abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/") spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
- Once done, press the SHIFT + ENTER combination keys to execute the code.
- Ingest data into your storage account for initiating the Azure Databricks ETL process. You can refer to this link for the same.
Step 4: Extract Data from the Storage Account
In this step, you will be extracting data from the Azure Data Lake Storage Gen2 account that you created in your previous steps. Copy-paste the code given below into your Azure Databricks ETL Notebook that you created earlier.
val df = spark.read.json("abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/small_radio_json.json")
NOTE: Make sure to replace the parameters given inside the parenthesis with your values. Press the SHIFT + ENTER combination keys to execute the code.
Step 5: Transform the Extracted Data
In this step, you will be transforming the data (i.e. retrieving specific columns that are necessary and dumping all the irrelevant data) that you extracted in the last step. Copy-paste the code snippet given below in your Azure Databricks ETL Notebook.
val specificColumnsDf = df.select("col_x", "col_y", "col_z", "col_n", "col_m") specificColumnsDf.show()
NOTE: Make sure to change the parameters given inside the parenthesis with the column names that you want to retrieve. Also, you can add more columns, if need be. Press the SHIFT + ENTER combination keys to execute the code.
Step 6: Load Transformed Data into Azure Synapse
Follow the steps below to load the data that you transformed in the last step into Azure Synapse:
- Execute the code given below to configure the access from Azure Databricks ETL to Azure Storage account.
val blobStorage = "<blob-storage-account-name>.blob.core.windows.net" val blobContainer = "<blob-container-name>" val blobAccessKey = "<access-key>"
- Now, specify a temporary folder. This is necessary to transfer data between Azure Databricks ETL and Azure Synapse because the Azure Synapse connector uses Azure Blob Storage that requires temporary storage to upload data. Execute the code given below for the same.
NOTE: Use the credentials you must have acquired while completing the prerequisites for this method.
val tempDir = "wasbs://" + blobContainer + "@" + blobStorage +"/tempDirs"
- Execute the code below to store the Azure Blog Storage access keys (that you must have acquired while going through the prerequisites of this method) in the configuration.
val acntInfo = "fs.azure.account.key."+ blobStorage sc.hadoopConfiguration.set(acntInfo, blobAccessKey)
- Now, to connect to the Azure Synapse instance, you need to provide the values that you must have acquired while creating an Azure Synapse Analytics service as a prerequisite. Execute the code given below after replacing the necessary parameters’ values:
//Azure Synapse related settings val dwDatabase = "<database-name>" val dwServer = "<database-server-name>" val dwUser = "<user-name>" val dwPass = "<password>" val dwJdbcPort = "1433" val dwJdbcExtraOptions = "encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30;" val sqlDwUrl = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser+";password=" + dwPass + ";$dwJdbcExtraOptions" val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser+";password=" + dwPass
- Execute the code snippet given below to transfer the transformed data as a table in Azure Synapse>
NOTE: You can replace the name of the table with a name of your choice.
spark.conf.set( "spark.sql.parquet.writeLegacyFormat", "true") renamedColumnsDF.write.format("com.databricks.spark.sqldw").option("url", sqlDwUrlSmall).option("dbtable", "SampleTable") .option( "forward_spark_azure_storage_credentials","True").option("tempdir", tempDir).mode("overwrite").save()
That’s it! You have now successfully performed ETL operations using Azure Databricks ETL.
Limitations of using Azure Databricks ETL
Although the above-mentioned method works perfectly for any data that you want to transfer, it has some drawbacks. Following are the limitations of using Azure Databricks ETL:
- It is a manual process that requires high-end technical and programming proficiency. A simple mistake during this process could lead to major setbacks.
- It requires you to integrate Azure and Databricks. This is a time-consuming process in itself, making the whole thing an extremely tedious and technical task.
- You can only move data between Azure Databricks and Azure Synapse. If you need to transfer data to some other Cloud platform, then you need to integrate Databricks with that particular platform.
With that being said, there is a better solution to all these problems of yours. The solution comes as Hevo Data and its Databricks’ connector. For more information, keep reading!
Method 2: Setting Up Databricks ETL using Hevo Data
Hevo Data, a No-code Data Pipeline helps to Load Data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo loads the data onto the desired Data Warehouse, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.
Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.Get Started with Hevo for free
Advantages of using Hevo Data
Listed below are the advantages of using Hevo Data over any other platform:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and your data volume grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
The article introduced you to ETL (Extract, Transform and Load) in general and further elaborated on Databricks ETL. It also provided a comprehensive guide on how to perform ETL operations using Azure Databricks ETL and Hevo Data. Furthermore, it displayed how Hevo Data is a better choice for any individual/company over any other platform.
With the complexity involved in Manual Integration, businesses are leaning towards Automated Integration. It is hassle-free, easy to operate, and does not require any technical background. In such a case Hevo Data is the right choice for you. You can also check out the 100+ data sources integration options that Hevo Data provide to its users.Visit our Website to Explore Hevo
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience of setting up Databricks ETL in the comments section below!