Databricks Spark: Ultimate Guide for Data Engineers in 2022

on Apache Spark, Databricks • April 13th, 2022


Databricks is an Enterprise Software company that was founded by the creators of Apache Spark. It is known for combining the best of Data Lakes and Data Warehouses in a Lakehouse Architecture. Apache Spark is a fast, general-purpose Cluster Computing System. It provides High-Level APIs for Application Development in Python, Scala, Java, and R, along with an engine that runs those applications across a cluster.

This blog talks in detail about how you can leverage Databricks Spark for your business use case and improve your workflow efficiency. It also gives a brief introduction to Databricks and Apache Spark before diving into the Databricks Spark DataFrames and Datasets.


What is Databricks?


Databricks is a Cloud-based Data platform powered by Apache Spark. It primarily focuses on Big Data Analytics and Collaboration. With Databricks’ Machine Learning Runtime, managed MLflow, and Collaborative Notebooks, you get a complete Data Science Workspace in which Business Analysts, Data Scientists, and Data Engineers can collaborate. Databricks houses the DataFrames and Spark SQL libraries that allow you to interact with Structured data.

With Databricks, you can easily gain insights from your existing data, and the platform also assists you in developing Artificial Intelligence solutions. Databricks includes Machine Learning libraries for creating and training Machine Learning Models, such as TensorFlow, PyTorch, and many more. Various enterprise customers use Databricks to conduct large-scale production operations across a wide range of use cases and industries, including Healthcare, Media and Entertainment, Financial Services, and Retail.

Key Features of Databricks

Databricks has carved a name for itself as an industry-leading solution for Data Analysts and Data Scientists owing to its ability to transform and handle large amounts of data. Here are a few key features of Databricks:

  • Delta Lake: Databricks houses an Open-source Transactional Storage Layer meant to be used for the whole Data Lifecycle. You can use this layer to bring in Data Scalability and Reliability to your existing Data Lake.
  • Optimized Spark Engine: Databricks gives you access to the most recent versions of Apache Spark. You can also effortlessly integrate various Open-source libraries with Databricks. Backed by the availability and scalability of multiple Cloud service providers, you can easily set up Clusters and build a fully managed Apache Spark environment. Databricks allows you to configure, set up, and fine-tune Clusters without having to monitor them constantly to ensure peak performance and reliability.
  • Machine Learning: Databricks offers you one-click access to preconfigured Machine Learning environments with cutting-edge frameworks like TensorFlow, Scikit-Learn, and PyTorch. From a central repository, you can share and track experiments, manage models collaboratively, and reproduce runs.
  • Collaborative Notebooks: Armed with the tools and the language of your choice, you can instantly access and analyze your data, collectively build models, and discover and share new actionable insights. Databricks allows you to code in the language of your choice, including Scala, R, SQL, and Python.

What is Apache Spark?


Apache Spark can use Hadoop in two ways: for Storage and for Processing. Since Apache Spark has its own Cluster Management, it typically uses Hadoop for storage only. Hadoop is widely used by enterprises to examine large data volumes because it is based on a simple programming model (MapReduce), which makes it a fault-tolerant, flexible, scalable, and cost-effective computing solution. The main concern with Hadoop is maintaining speed when processing voluminous Datasets, in terms of both Program Execution Time and Query Response Time.

Spark was introduced to speed up Hadoop’s computational processing. It was originally developed at UC Berkeley’s AMPLab and is now maintained by the Apache Software Foundation. Apache Spark is known in the market as a Distributed General-Purpose Computing Engine that can be leveraged to analyze and process large data files from multiple sources such as S3, Azure, and HDFS.

What are the Benefits of Apache Spark?

Here are a few key benefits of leveraging Apache Spark for your business use case:

  1. Speed
  2. Real-time Stream Processing
  3. Supports Multiple Workloads
  4. Increased Usability
  5. Advanced Analytics

1. Speed

Apache Spark processes data in memory using Resilient Distributed Datasets (RDDs), which greatly reduces the time spent on disk I/O operations compared to MapReduce. According to the Spark project, it can run workloads up to 100x faster than Hadoop MapReduce in memory and up to 10x faster on disk. In a 2014 sorting benchmark, Apache Spark also sorted 100 TB of data 3x faster than Hadoop MapReduce on one-tenth of the machines.

2. Real-time Stream Processing

With Spark’s Language-Integrated API, you can manipulate and process data in real time as it arrives, whereas Hadoop’s MapReduce can only process data that is already stored in Hadoop Clusters.

3. Supports Multiple Workloads

Apache Spark allows you to easily develop parallel applications with over 80 high-level operators to choose from.

4. Increased Usability

Apache Spark also supports a vast array of programming languages to write your scalable applications. On top of its user-friendliness, you can also reuse the code for Batch Processing, running ad-hoc queries within the stream state or joining streams against historical data.  

5. Advanced Analytics

Apache Spark can assist in performing complex analytics, including Graph Processing and Machine Learning. Spark’s libraries, such as MLlib (for Machine Learning), Spark SQL & DataFrames, Spark Streaming, and GraphX, have greatly helped businesses handle sophisticated problems. With Apache Spark, you also get better speed for analytics, because Spark keeps data in the RAM of the servers, where it can be accessed quickly.

Simplify Databricks ETL using Hevo’s No-code Data Pipelines

A fully managed No-code Data Pipeline platform like Hevo helps you integrate data from 100+ data sources (including 40+ Free Data Sources) to a destination of your choice, such as Databricks, in real-time and in an effortless manner. With its minimal learning curve, Hevo can be set up in just a few minutes, allowing users to load data without having to compromise performance. Its strong integration with numerous sources gives users the flexibility to bring in data of different kinds smoothly, without having to write a single line of code.


Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.

What are Spark Interfaces?

Apache Spark offers a suite of Web User Interfaces (UIs) for monitoring the resource consumption and status of your Spark Cluster. The interfaces discussed here, however, are its programming interfaces. There are three key Apache Spark Interfaces that you should know about: Dataset, DataFrame, and Resilient Distributed Dataset.

  • Dataset: This Spark Interface is a combination of RDD and DataFrame. It offers the typed interface that is found in RDDs while offering the convenience of the DataFrame. The Dataset API can be leveraged using the Scala and Java languages.
  • DataFrame: This interface bears resemblance to the DataFrames in R language and the pandas Python library. The DataFrame API is available in the Python, R, Java, and Scala languages.
  • Resilient Distributed Dataset: The RDD is the first Apache Spark abstraction. It is an interface to a sequence of data objects, consisting of one or more types, that are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While RDDs were the original data structure for Apache Spark, you should focus mainly on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Python, Java, and Scala languages.

What is Apache Spark as a Service?

Databricks hosts its optimized version of Apache Spark as a Service across multiple clouds. It comes with a set of built-in applications that help you access and analyze data faster. Apache Spark as a Service brings together Spark’s capabilities for operating on Big Data: working with streaming data, performing graph computation, offering SQL on Hadoop, and Machine Learning.

Spark as a Service helps eliminate the infrastructure challenges and ramps up the process by doing away with most of the effort and cost involved. There are already various providers that offer Spark as a Service, which makes this framework fast and easy to deploy. This solution works great for short-term data analytics projects that can be set up quickly with a high return on investment.

Main Advantages of using Spark as a Service

The primary advantages of leveraging Spark as a Service for your business operation are as follows:

  • Lower costs.
  • Spark as a Service offers an easy way to access Spark data.
  • You don’t need any specialized coding skills to start using Spark as a Service. It can therefore be used by both business and technical users.

How to Create a Basic Spark Application?

To write your first Apache Spark application, you add code to the cells of a Databricks Notebook. This example uses Python. For additional information, you can refer to the Apache Spark Quick Start Guide. Here is the code snippet:

# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))
Databricks Spark: Path

The next command uses spark, the SparkSession that is available in every Databricks notebook, to read the README.md text file and create a DataFrame called textFile:

textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")

If you wish to count the number of lines in your text file, you can apply the count action to the DataFrame:

textFile.count()

Here is what it would look like:

Databricks Spark: textfile.count()

A thing to notice here is that the second command, which read the text file, did not generate any output, while the third command, which performed the count, did. This is because the former is a transformation while the latter is an action. Transformations are lazy and run only when an action is invoked. This allows Spark to optimize for performance (for instance, by running a filter before a join), as opposed to running commands serially.
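The difference between lazy transformations and actions can be sketched with a plain-Python analogy. This is not Spark code: the names read_lines, count, and log are hypothetical, and a generator merely stands in for a lazily evaluated DataFrame.

```python
# Plain-Python analogy for Spark's lazy transformations (not Spark code).
# All names below are hypothetical and exist only for this illustration.
log = []


def read_lines(lines):
    """'Transformation': returns a lazy generator; nothing is read yet."""
    for line in lines:
        log.append(f"read {line!r}")
        yield line


def count(rows):
    """'Action': consumes the generator, which triggers the actual work."""
    return sum(1 for _ in rows)


text_file = read_lines(["# README", "", "Welcome to Databricks"])
work_before_action = len(log)  # no line has been read at this point
num_lines = count(text_file)   # the action forces evaluation
work_after_action = len(log)

print(work_before_action, num_lines, work_after_action)  # prints: 0 3 3
```

Just as in the notebook, defining text_file produces no visible work; only the count call makes the lines get read.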

How to Use Apache Spark Databricks DataFrames?

Here are a few ways in which you can use Databricks Spark DataFrames:

Databricks Spark DataFrames: Loading Data

You can easily start working with Databricks Spark DataFrames by using an example Databricks dataset that can be found in the /databricks-datasets folder which can be accessed within the Databricks Workspace. Say, if you want to access the file that compares the city population against the Median Sales prices of homes, you can load this file:

/databricks-datasets/samples/population-vs-price/data_geo.csv

Since the sample notebook is a SQL notebook, the next few commands will use the %python magic command. Here is the code snippet for the same:

%python
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
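To make the three steps above more concrete, here is a rough plain-Python analogy of what the header option, schema inference, and dropna() do. It is only a sketch: the sample rows are made up, the infer helper is hypothetical, and Spark infers one type per column rather than per value as this simplified version does.

```python
import csv
import io

# Made-up sample rows for illustration only (not taken from data_geo.csv).
SAMPLE_CSV = """City,State Code,2014 Population estimate,2015 median sales price
Alphaville,AA,1000,
Betatown,BB,2000,150.5
"""


def infer(value):
    """Simplified stand-in for schema inference: int, then float, else string."""
    if value == "":
        return None  # treat an empty field as a missing value
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


# Like header="true": the first line supplies the column names.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = [{name: infer(value) for name, value in row.items()} for row in reader]

# Like dropna(): keep only rows in which no value is missing.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

print(len(rows), len(complete_rows))
```

The first sample row has no median sales price, so only the second row survives the filtering step.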

Databricks Spark DataFrames: Viewing a DataFrame

Once you’ve created the ‘data’ DataFrame, you can now access the data through the standard Spark commands such as take(). For instance, you can use the command data.take(10) to look at the first 10 rows of the data DataFrame.

%python
data.take(10)

Here is what that looks like:

Databricks Spark: data.take(10)

If you wish to look at this Databricks Spark data in a tabular format, you can use the Databricks display() command as opposed to exporting the data to a third-party tool.

%python
display(data)

This is what it looks like:

Databricks Spark: databricks.display()

Databricks Spark DataFrames: Running SQL Queries

Before you proceed to issue SQL queries, you need to save your ‘data’ Databricks Spark DataFrame either as a temporary view or as a table:

%python
# Register table so it is accessible via SQL Context
data.createOrReplaceTempView("data_geo")

Next, in a new cell, simply specify a SQL query to list the 2015 median sales price organized by state:

select `State Code`, `2015 median sales price` from data_geo
Databricks Spark: Sales Price organized by State

Similarly, you can query the population estimate for the state of Washington:

select City, `2014 Population estimate` from data_geo where `State Code` = 'WA';
Databricks Spark: Population for Washington
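The same two queries can also be expressed with the DataFrame API instead of SQL. The sketch below assumes the data DataFrame loaded earlier; the function names are hypothetical, while select() and filter() are standard PySpark DataFrame methods.

```python
def median_prices_by_state(data):
    """DataFrame form of:
    select `State Code`, `2015 median sales price` from data_geo
    """
    return data.select("State Code", "2015 median sales price")


def population_for_state(data, state_code="WA"):
    """DataFrame form of:
    select City, `2014 Population estimate` from data_geo
    where `State Code` = 'WA'
    """
    return (
        data.filter(data["State Code"] == state_code)
        .select("City", "2014 Population estimate")
    )
```

In a Python cell of the notebook, you could then call display(population_for_state(data)) to render the result as a table.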

Databricks Spark DataFrames: Visualizing Data

An added benefit of utilizing the Databricks Spark display() command is that you can quickly view this data with a vast multitude of embedded visualizations. You can click the down arrow next to the graph icon to show a list of visualization types:

Databricks Spark: Plot Options

Next, select the map icon to create a Map Visualization of the sale price SQL query from the previous section.

Databricks Spark: Map Visualization of Sales Price

How to Use Apache Spark Databricks Datasets?

Here are the steps you can follow to use Databricks Spark Datasets:

Databricks Spark Datasets: Creating a Sample Dataset

You can create Datasets in two ways: dynamically, or by reading from a JSON file using SparkSession. First, for primitive types in demos or examples, you can create Datasets directly within a Scala Notebook or in your sample Spark application. For instance, here’s a way you can create a Dataset of 100 integers in a single notebook, using the spark variable to produce a Dataset[Long]. In a Python Notebook, the same spark.range() call returns a DataFrame, since the typed Dataset API is available only in Scala and Java.

// range of 100 numbers to create a Dataset.
val range100 = spark.range(100)
range100.collect()
Databricks Spark: range100.collect()

Databricks Spark Datasets: Loading a Sample Dataset

A more common way is to read a data file from an external data source, such as a local filesystem, object storage, HDFS, NoSQL, or an RDBMS. Spark supports various formats such as CSV, Parquet, Text, JSON, ORC, and many more. To read a JSON file, you can use the SparkSession variable spark.

An easy way to start working with Datasets is to utilize an example Databricks dataset that can be found in the /databricks-datasets folder that can be accessed in the Databricks workspace.

val df = spark.read.json("/databricks-datasets/samples/people/people.json")

When reading your JSON file, Spark has no idea about the structure of your data. This means that it doesn’t know how you wish to organize your data into a type-specific JVM object. Therefore, it tries to infer the schema from the JSON file and creates a DataFrame = Dataset[Row] of generic Row objects.

You can also convert your DataFrame into a Dataset that reflects a Scala class by defining a domain-specific Scala case class and converting the DataFrame into that type:

// First, define a case class that represents a type-specific Scala JVM object
case class Person (name: String, age: Long)
// Read the JSON file. Upon reading JSON, Spark first creates a generic
// DataFrame = Dataset[Row]. Calling .as[Person] then explicitly converts it
// into a Dataset[Person], a type-specific collection of Person objects.
val ds = spark.read.json("/databricks-datasets/samples/people/people.json").as[Person]
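Since the typed Dataset API is not available in Python, the idea behind converting generic rows into type-specific objects can only be sketched by analogy. The plain-Python example below is not Spark code: the records are made up, and the Person dataclass merely mirrors the Scala case class above.

```python
import json
from dataclasses import dataclass


@dataclass
class Person:
    """Plays the role of the Scala case class Person(name, age)."""
    name: str
    age: int


# Made-up JSON records for illustration (not the contents of people.json).
raw_lines = ['{"name": "Ada", "age": 36}', '{"name": "Linus", "age": 28}']

# Generic rows: field access by string key, similar in spirit to Dataset[Row].
generic_rows = [json.loads(line) for line in raw_lines]

# Typed objects: field access by attribute, similar in spirit to .as[Person].
# Person(**row) raises a TypeError if a record has missing or extra fields.
people = [Person(**row) for row in generic_rows]

oldest = max(people, key=lambda p: p.age)
print(oldest.name)
```

The benefit mirrors the Scala case: once the records are typed, code refers to p.age rather than row["age"], so a misspelled field name is easier to catch.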

You can do the same with IoT device information that has been captured in a JSON file: simply define a case class, read the JSON file, and convert the DataFrame into a Dataset[DeviceIoTData].

Similar to the Person example, the following code snippet defines a case class that describes each device record. To access the file that contains the IoT data, load the file /databricks-datasets/iot/iot_devices.json:

// define a case class that represents the device data.
case class DeviceIoTData (
  battery_level: Long,
  c02_level: Long,
  cca2: String,
  cca3: String,
  cn: String,
  device_id: Long,
  device_name: String,
  humidity: Long,
  ip: String,
  latitude: Double,
  longitude: Double,
  scale: String,
  temp: Long,
  timestamp: Long
)
// read the JSON file and create the Dataset from the case class DeviceIoTData
// ds is now a collection of JVM Scala objects DeviceIoTData
val ds = spark.read.json("/databricks-datasets/iot/iot_devices.json").as[DeviceIoTData]

Databricks Spark Datasets: Viewing a Sample Dataset

Once you’ve loaded the JSON data and converted it into a Dataset of your type-specific JVM objects, you can view it just as you would view a DataFrame. To see the data in a tabular format, use the display() command. Alternatively, use standard Spark commands such as take() and foreach() together with println():

// display the dataset table just read in from the JSON file
display(ds)
// Using the standard Spark commands take() and foreach(), print the first
// 10 rows of the Dataset.
ds.take(10).foreach(println(_))

This is what it looks like:

Databricks Spark: Sample Dataset

Conclusion

This blog talks about the different aspects of leveraging Apache Spark with Databricks Datasets and DataFrames. It also gives a brief introduction to the features of Apache Spark and Databricks.


Extracting complex data from a diverse set of data sources can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 100+ Data Sources like Databases or SaaS applications into your Data Warehouses such as Databricks to be visualized in a BI tool of your choice. Hevo is fully automated and hence does not require you to code.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
