Databricks is a web-based data analytics platform with powerful Machine Learning features. It is a one-stop platform for all your data needs: it stores your data, extracts insights using Spark SQL, builds predictive models using SparkML, and connects to Business Intelligence tools such as Tableau, Power BI, QlikView, and others.
When working with Databricks, users will want to extract insights from the data for decision making. This requires them to create Machine Learning models that use algorithms to analyze the data. In some cases, the users will be required to write code and generate visualizations from the data.
This may be hard for non-technical users and those without coding knowledge. Databricks Notebooks come to the rescue for these users. A Databricks Notebook is a web-based interface to a document containing runnable code, narrative text, and visualizations. Databricks Notebooks empower users with little coding knowledge to build complex datasets and Machine Learning models. In this article, we will discuss Databricks Notebooks in more detail.
Prerequisites
- A Databricks Account.
- Working Knowledge of Databricks.
Understanding Databricks Notebooks
Databricks Notebooks give less technical data users a way of running data processing code: the same Notebook used for interactive exploration can later be run in production. Databricks Notebooks make it easy for all users to process data using code and Machine Learning models, keep ETL orchestration straightforward and visual, and are also a good way of modularizing Data Pipelines.
Databricks Notebooks are the preferred way of running data processing code in Databricks for users with little or no programming knowledge. A Notebook eliminates friction and reduces the complexity of running code in the Cloud.
Thus, Databricks Notebook users can deliver value quickly without experiencing any engineering bottlenecks.
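As a quick sketch of what that modularization can look like in practice, one Notebook can call another as a pipeline step with dbutils.notebook.run. The notebook path and parameters below are hypothetical, purely for illustration:

# Run a child notebook as one step of a pipeline (the path is a made-up example).
# The second argument is a timeout in seconds; the third passes parameters that
# the child notebook can read with dbutils.widgets.get().
result = dbutils.notebook.run("/Shared/ingest_orders", 600, {"run_date": "2023-01-01"})
print(result)  # whatever the child notebook returned via dbutils.notebook.exit()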
Two Key Databricks Notebook Operations
Let us discuss some of the common operations involved when working with a Databricks Notebook.
A) Creating a Databricks Notebook
You can use the Create button to create new Notebooks in your default folder. Follow the steps given below:
Step 1: Click the “Create” button from the sidebar and choose “Notebook” from the menu. The Create Notebook dialogue will appear.
Step 2: Give the Notebook a name and choose its default language.
Step 3: If you have any clusters running, they will be displayed in the Cluster drop-down.
Choose the cluster that you need to attach to the new Databricks Notebook.
Step 4: Click the “Create” button.
If you need to create a Databricks Notebook in a specific folder, follow the steps given below:
- Step 1: Click the “Workspace” icon from the sidebar.
- Step 2: Click the drop-down button to the right of any folder and choose “Create”, then “Notebook”.
- Step 3: Alternatively, click the drop-down icon in your user folder or at the workspace level, and choose “Create”, then “Notebook”.
- Step 4: Complete the Create Notebook dialogue as described in the previous section.
B) Importing a Databricks Notebook
Databricks allows you to import a Notebook from a file or a URL. It also allows you to import ZIP archives of Notebooks that have been exported from a Databricks workspace.
To import a Databricks Notebook, follow the steps given below:
Step 1: Click the “Workspace” icon from the sidebar.
Step 2: Click the drop-down button to the right of any folder and choose “Import”.
Step 3: Alternatively, click the drop-down button in the user folder or at the workspace level and choose “Import”.
Step 4: Navigate to the location of the file with the Notebooks in the Databricks workspace or simply specify the URL.
Step 5: Click the “Import” button.
If you select a single Databricks Notebook, it is imported into your current folder. However, if you choose a ZIP or DBC archive, its folder structure is recreated in the current folder and every Databricks Notebook in it is imported.
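Imports can also be scripted rather than done through the UI. Below is a minimal sketch that calls the Databricks Workspace Import REST API (/api/2.0/workspace/import) from Python using the requests library; the workspace URL, token, file name, and destination path are placeholders to replace with your own values:

import base64
import requests

host = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder access token

# Read the notebook source and base64-encode it, as the Import API expects.
with open("my_notebook.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Users/me@example.com/my_notebook",  # destination path in the workspace
        "format": "SOURCE",                           # SOURCE, DBC, JUPYTER, or HTML
        "language": "PYTHON",                         # required when format is SOURCE
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()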
Hevo Data is a No-code Data Pipeline that offers a fully-managed solution to set up data integration from 100+ Data Sources (including 40+ free data sources) and will let you directly load data to Databricks or a Data Warehouse/Destination of your choice. It will automate your data flow in minutes without writing any line of code. Its Fault-Tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.
Get Started with Hevo for Free
Its completely automated Data Pipeline delivers data in real-time from source to destination without any loss. Its fault-tolerant and scalable architecture ensures that data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
Check out some of the cool features of Hevo:
- Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
- Connectors: Hevo supports 100+ Integrations to SaaS platforms, Files, Databases, BI tools, and Native REST API & Webhooks Connectors. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; Databricks; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.
- Real-Time Data Transfer: Hevo provides real-time data migration, so you always have analysis-ready data.
- 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (Including 40+ Free Sources) that can help you scale your data infrastructure as required.
- 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!
Building Visualizations in Notebooks
You can use Databricks Notebooks to visualize your data. This is possible using the display and displayHTML functions. Let’s discuss how to use these two functions to create visualizations.
A) Using the display Function
You can use the display function to create different types of visualizations from different data types. For instance, to visualize data stored in a dataframe, you can use the function with the following syntax:
display(<dataframe-name>)
Suppose you have a Spark DataFrame named dia_df containing data about diamonds. You can group the data by diamond colour and calculate the average price per colour as follows:
from pyspark.sql.functions import avg

# Load the diamonds sample dataset that ships with Databricks.
dia_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")

# Average price per colour, rendered by display() as an interactive table.
display(dia_df.select("color", "price").groupBy("color").agg(avg("price")))
It will return a table showing diamonds’ colour against their average price.
B) Using the displayHTML Function
This is another function that can help you to visualize your data in a Databricks Notebook. Start by creating a new Notebook to be the console for executing code to process and visualize data.
To begin, you need to create a dashboard. First, give the dashboard a title:
displayHTML("""<font size="6" color="red" face="sans-serif"><center>Sample Dashboard</center></font>""")
Now source the data. Here, you will use the bike-sharing sample dataset available under /databricks-datasets on DBFS. Let’s read the data:
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("dbfs:/databricks-datasets/bikeSharing/data-001/day.csv")
df.registerTempTable("mytable")
You have now read the data and registered a temporary table named mytable. If you run the Databricks notebook, you will see the loaded data.
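For instance, a quick way to confirm the load is to preview the DataFrame, or the temporary table, with display:

display(df)                                            # preview the bike-sharing rows
display(spark.sql("SELECT * FROM mytable LIMIT 10"))   # or query the temporary table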
You can now start to visualize the data. Run the code given below to aggregate the data by season for the temperature, humidity, and wind speed fields:
display(spark.sql("SELECT season, MAX(temp) AS temperature, MAX(hum) AS humidity, MAX(windspeed) AS windspeed FROM mytable GROUP BY season ORDER BY season"))
The code will return the results in a tabular format. However, your goal is to visualize the data using graphs or charts. Click the chart icon below the results to see all of the chart types that Databricks supports, including Quantile plots, Box plots, Histograms, Pivot tables, and Quantile-Quantile (Q-Q) plots.
Working with Widgets in Databricks
Input widgets enable you to add parameters to your Dashboards and Notebooks. The Widget API supports calls for creating different types of widgets, removing them, and getting bound values.
Types of Widgets
There are 4 different types of widgets for Databricks Notebooks, each illustrated in the sketch after this list:
- Text: Enables you to input a value into a text box.
- Dropdown: Enables you to select a value from a list of available values.
- Combobox: It is a combination of text and dropdown. It enables you to choose a value from the available list or input one in the text box.
- Multiselect: Enables you to select one or more values from a list of available values.
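Here is a minimal sketch that creates one widget of each type; the widget names, default values, and choices are arbitrary examples:

# One widget of each type: text, dropdown, combobox, and multiselect.
dbutils.widgets.text("run_date", "2023-01-01", "Run date")
dbutils.widgets.dropdown("season", "1", ["1", "2", "3", "4"], "Season")
dbutils.widgets.combobox("city", "London", ["London", "Paris", "Nairobi"], "City")
dbutils.widgets.multiselect("metrics", "temp", ["temp", "hum", "windspeed"], "Metrics")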
The Widget API
The Widget API is consistent across R, Python, and Scala. Widgets are managed through the Databricks utilities (dbutils.widgets) interface. A dropdown widget, for example, takes a name, a default value, a list of choices, and an optional label:
dbutils.widgets.dropdown("A1", "1", [str(x) for x in range(1, 10)])
dbutils.widgets.dropdown("A2", "1", [str(x) for x in range(1, 10)], "A sample widget")
Here is how you can create a simple dropdown widget:
dbutils.widgets.dropdown("A", "1", [str(X) for x in range(1, 10)])
You can use the get() method to access the current value of a widget:
dbutils.widgets.get("A")
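A common pattern is to feed a widget’s value straight into a query. The sketch below is illustrative only: it assumes the “A” dropdown created above and the mytable temporary table registered in the visualization section.

season = dbutils.widgets.get("A")   # widget values are always returned as strings
display(spark.sql(f"SELECT * FROM mytable WHERE season = {season}"))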
The following commands can help you to remove a widget from your notebook:
dbutils.widgets.remove("A")
The following commands can help you to remove all widgets from your notebook:
dbutils.widgets.removeAll()
And that is how to use a Databricks Notebook!
Conclusion
In this article, you have learned about Databricks Notebooks, their key operations, how to build visualizations in them, and how to work with Widgets.
As your firm grows and attracts more customers, tremendous volumes of data start generating at an exponential rate. Efficiently handling this massive amount of data across numerous applications used in your business can be a challenging and resource-intensive task.
You would need to devote a portion of your Engineering Bandwidth to Integrate, Clean, Transform, and Load your data into a Data Lake like Databricks, a Data Warehouse, or a Destination of your choice for further Business Analysis. This can be effortlessly automated by a Cloud-Based ETL Tool like Hevo Data.
Visit our Website to Explore Hevo
Hevo Data, a No-code Data Pipeline, assists you in seamlessly transferring data from 100+ Data Sources into a Data Lake like Databricks, a Data Warehouse, or a Destination of your choice, to be visualized in a BI Tool. It is a secure, reliable, and fully automated service that doesn’t require you to write any code!
Want to Take Hevo for a spin? Sign Up for a 14-day free trial and simplify your Data Integration process. Check out the pricing details to get a better understanding of which plan suits you the most.
Share with us your experience of working with Databricks Notebooks. Let us know in the comments section below!