With the huge volumes of Big Data generated today, the need for Data Processing tools is on the rise. Databricks is a Data Processing and Data Engineering platform created by Apache Spark team members.
With Databricks, it is easy for you to improve the quality of your data and extract insights from it. These insights can help you make sound decisions about running your business.
Databricks Clusters are collections of Computation Resources and Configurations that you can use to run your data workloads.
When using Databricks, you will need a number of resources and a set of configurations to run your Data Processing operations.
A Databricks Cluster makes this easy for you. It brings together computation resources and configurations to help you run your Data Science, Data Engineering, and Data Analytics workloads, like Streaming Analytics, ETL Pipelines, Machine Learning, and Ad-hoc Analytics. In this article, we will be discussing Databricks Clusters in detail.
What is Databricks?
Let us start by answering the main question: what is Databricks? Databricks, developed by the creators of Apache Spark, is a Web-based platform and a one-stop product for all Data requirements, such as Storage and Analysis.
What are Databricks Clusters?
A Databricks Cluster is a combination of computation resources and configurations on which you can run jobs and notebooks. Some of the workloads that you can run on a Databricks Cluster include Streaming Analytics, ETL Pipelines, Machine Learning, and Ad-hoc analytics.
The workloads are run as commands in a notebook or as automated tasks. There are two types of Databricks Clusters:
- All-purpose Clusters: These types of Clusters are used to analyze data collaboratively via interactive notebooks. They are created using the CLI, UI, or REST API. An All-purpose Cluster can be terminated and restarted manually. They can also be shared by multiple users to do collaborative tasks interactively.
- Job Clusters: These types of clusters are used for running fast and robust automated tasks. They are created when you run a job on a new Job Cluster, and Databricks terminates the Cluster once the job ends. A Job Cluster cannot be restarted. A minimal API sketch of creating both Cluster types follows after this list.
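To make the distinction concrete, here is a minimal sketch of creating each Cluster type through the Databricks REST API with Python. The workspace URL, token, notebook path, Spark version, and node type below are placeholder assumptions; adjust them to your workspace before running anything.

```python
import requests

# Placeholders: substitute your workspace URL and a personal access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. All-purpose Cluster: created explicitly and kept alive until you terminate it.
all_purpose_spec = {
    "cluster_name": "interactive-analysis",
    "spark_version": "11.3.x-scala2.12",   # example runtime; pick one available in your workspace
    "node_type_id": "i3.xlarge",           # example node type; cloud-provider specific
    "num_workers": 2,
}
resp = requests.post(f"{DATABRICKS_HOST}/api/2.0/clusters/create",
                     headers=HEADERS, json=all_purpose_spec)
print(resp.json())  # returns the new cluster_id

# 2. Job Cluster: defined inline as "new_cluster" in a one-time run submission;
#    Databricks creates it for the run and terminates it when the job ends.
job_run_spec = {
    "run_name": "nightly-etl",
    "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 4,
    },
    "notebook_task": {"notebook_path": "/Repos/etl/nightly"},  # hypothetical notebook
}
resp = requests.post(f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
                     headers=HEADERS, json=job_run_spec)
print(resp.json())  # returns the run_id
```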
Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. Hevo supports 150+ Data Sources (including 40+ Free Data Sources) and helps load data to Databricks or the desired Data Warehouse/destination.
What types of Clusters are there in Databricks?
Databricks supports three Cluster modes: Standard, High Concurrency, and Single Node. The Cluster mode is set to Standard by default.
Standard Clusters
For a single user, a Standard cluster is ideal. Workloads written in Python, SQL, R, and Scala can all be run on standard clusters.
High Concurrency Clusters
A High Concurrency cluster is a managed cloud resource. The advantage of High Concurrency clusters is fine-grained resource sharing, which delivers maximum resource utilization and low query latencies.
Workloads written in SQL, Python, and R can be run on High Concurrency clusters. These clusters gain their performance and security by running user code in separate processes, which is not possible in Scala.
Table access control is also only available on High Concurrency clusters.
Set Cluster Mode to High Concurrency to create a High Concurrency cluster.
Single Node clusters
Spark jobs run on the driver node in a Single Node cluster, which has no workers.
To execute Spark jobs in a Standard cluster, at least one Spark worker node is required in addition to the driver node.
Set the Cluster Mode to Single Node to make a single node cluster.
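When creating clusters programmatically, the cluster mode is expressed through Spark configuration properties and custom tags rather than a single "mode" field. The fragments below are a sketch based on the Databricks cluster-mode documentation; the exact property values are assumptions that may differ between releases, so verify them against your workspace.

```python
# Standard mode needs no special spark_conf; High Concurrency and Single Node do.

# High Concurrency cluster: shared, table access control available, no Scala.
high_concurrency_fragment = {
    "spark_conf": {
        "spark.databricks.cluster.profile": "serverless",
        "spark.databricks.repl.allowedLanguages": "sql,python,r",
    },
    "custom_tags": {"ResourceClass": "Serverless"},
}

# Single Node cluster: no workers; Spark runs locally on the driver node.
single_node_fragment = {
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```

These fragments would be merged into the same cluster specification (name, Spark version, node type) shown earlier.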
What are Cluster Node Types?
A cluster consists of one driver node and zero or more worker nodes.
Although the driver and worker nodes can use different cloud provider instance types, by default, the driver and worker nodes use the same instance type. Various instance types are appropriate for various use cases, such as memory-intensive or compute-intensive workloads.
Driver node
The driver node keeps track of the state of all notebooks in the cluster. The driver node also runs the Apache Spark master, which coordinates with the Spark executors and maintains the SparkContext.
By default, the driver node uses the same node type as the worker nodes. If you plan to collect() a large amount of data from the Spark workers and analyze it in the notebook, choose a larger driver node type with more memory.
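The PySpark sketch below illustrates why the driver node's memory matters: collect() pulls the entire result set back to the driver. The DataFrame here is a synthetic placeholder; in a Databricks notebook the SparkSession is already available as `spark`.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically in a Databricks notebook

df = spark.range(0, 100_000_000)  # hypothetical large DataFrame

# collect() materializes every row in the driver node's memory, so the driver
# node type must be sized for the result. Prefer limiting or aggregating first.
preview = df.limit(1000).collect()   # small, driver-safe sample
row_count = df.count()               # aggregate computed on the workers
```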
Worker node
The Spark executors and other services required for the clusters’ proper functioning are run by Databricks worker nodes. When you use Spark to distribute your workload, all of the distributed processing takes place on worker nodes.
Because Databricks only has one executor per worker node, the terms executor and worker are interchangeable in the Databricks architecture.
GPU instance types
Databricks supports clusters accelerated with graphics processing units (GPUs) for computationally demanding tasks that require high performance, such as deep learning.
What are Cluster size and Autoscaling?
You can provide a fixed number of workers for a Databricks cluster or a minimum and a maximum number of workers for the cluster when you create it.
When you specify a fixed-size cluster, Databricks guarantees that your cluster has the specified number of workers. When you specify a range for the number of workers, Databricks determines the number of workers needed to complete your task. This is known as Autoscaling.
With autoscaling, Databricks dynamically reallocates workers based on the job’s requirements. Because certain parts of your pipeline may be more computationally demanding than others, Databricks automatically adds extra workers during those phases of your job (and removes them when they’re no longer needed).
Because you don’t have to provision the cluster to match a workload, autoscaling makes it easier to achieve high cluster utilization.
This is especially true for workloads with changing requirements over time (such as exploring a dataset over the course of a day), but it can also be true for a one-time, shorter workload with unknown provisioning requirements. As a result, autoscaling has two advantages:
- When compared to a constant-size under-provisioned cluster, workloads can run faster.
- When compared to a statically-sized cluster, autoscaling clusters can save money.
Autoscaling can provide one or both of these benefits depending on the cluster’s constant size and workload. When the cloud provider terminates instances, the cluster size can drop below the minimum number of workers chosen.
In this case, Databricks tries to re-provision instances on a regular basis in order to keep the minimum number of workers.
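The two sizing options map directly onto the cluster specification: a fixed count uses `num_workers`, while autoscaling uses an `autoscale` range. This is a minimal sketch; the Spark version and node type are placeholder assumptions.

```python
# Fixed-size cluster: Databricks keeps exactly this many workers.
fixed_size_spec = {
    "cluster_name": "fixed-size",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
}

# Autoscaling cluster: Databricks picks a worker count between min and max
# based on load, adding workers during demanding phases and removing them later.
autoscaling_spec = {
    "cluster_name": "autoscaling",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
```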
What are the types of Autoscaling?
According to Databricks, cluster node autoscaling is available in two flavors: standard and optimized. See the Databricks blog post on Optimized Autoscaling for more information on the advantages of this technique.
Optimized autoscaling is used by automated (job) clusters all of the time. The workspace configuration determines which type of autoscaling is used on all-purpose clusters.
All-purpose clusters in the Premium plan benefit from optimized autoscaling (or, for customers who subscribed to Databricks before March 3, 2020, the Operational Security package). All all-purpose clusters on the Standard plan have standard autoscaling enabled.
Optimized autoscaling
- Scales up in two steps from minimum to maximum.
- Scales down even if the cluster is not idle, by looking at the shuffle file state.
- Reduces the number of nodes by a percentage.
- If a job cluster has been underutilized for the last 40 seconds, it scales down.
- If the cluster has been underutilized for the last 150 seconds on all-purpose clusters, it scales down.
- The Spark configuration property “spark.databricks.aggressiveWindowDownS” specifies, in seconds, how often a cluster makes down-scaling decisions. Increasing the value makes the cluster scale down more slowly. The maximum allowed value is 600.
Standard autoscaling
- Starts by adding 8 nodes. After that, it scales up exponentially, but it may take many steps to reach the maximum. The “spark.databricks.autoscaling.standardFirstStepUp” Spark configuration property allows you to customize the first step (both autoscaling properties are illustrated in the sketch after this list).
- Only scales down when the cluster is completely idle and has been idle for the previous 10 minutes.
- Scales down exponentially, starting with 1 node.
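Both autoscaling-related properties mentioned above are ordinary Spark configuration entries passed through `spark_conf` in the cluster specification. The values below are illustrative assumptions; only the property relevant to your cluster’s autoscaling flavor applies.

```python
tuned_autoscaling_spec = {
    "cluster_name": "tuned-autoscaling",
    "spark_version": "11.3.x-scala2.12",   # placeholder runtime
    "node_type_id": "i3.xlarge",           # placeholder node type
    "autoscale": {"min_workers": 2, "max_workers": 20},
    "spark_conf": {
        # Optimized autoscaling: how often (in seconds) down-scaling decisions
        # are made; larger values scale down more slowly (maximum 600).
        "spark.databricks.aggressiveWindowDownS": "120",
        # Standard autoscaling: size of the first scale-up step (default 8 nodes).
        "spark.databricks.autoscaling.standardFirstStepUp": "4",
    },
}
```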
What are Pools in Databricks Clusters?
You can attach a cluster to a pool of idle instances for the driver and worker nodes to speed up cluster startup time. Instances from the pools are used to form the cluster. If a pool’s idle resources are insufficient to create the requested driver or worker nodes, the pool expands by allocating new instances from the instance provider. When an attached cluster ends, the instances it used are returned to the pools and can be reused by another cluster.
If you choose a pool for worker nodes but not for the driver node, the pool from the worker node configuration is passed on to the driver node.
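In the cluster specification, pools are referenced by ID: `instance_pool_id` supplies worker nodes and the optional `driver_instance_pool_id` supplies the driver. This is a sketch with placeholder pool IDs; note that when a pool is used, the node type comes from the pool rather than from `node_type_id`.

```python
pooled_cluster_spec = {
    "cluster_name": "pool-backed",
    "spark_version": "11.3.x-scala2.12",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "instance_pool_id": "<worker-pool-id>",        # placeholder: pool for worker nodes
    "driver_instance_pool_id": "<driver-pool-id>", # placeholder: pool for the driver node
    # If driver_instance_pool_id is omitted, the worker pool is used for the driver too.
}
```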
How to Create a Databricks Cluster?
In this section, we will be discussing the two approaches to creating an all-purpose Cluster in Databricks:
A) Using the Create Button
This is the easiest way of creating a Cluster. Follow the steps given below:
Step 1: Click the “Create” button from the sidebar and choose “Cluster” from the menu. The Create Cluster page will be shown.
Step 2: Give a name to the Cluster. Note that there are a number of configuration options (Cluster mode, Databricks Runtime version, node types, autoscaling, and so on) that you must fill in.
Step 3: Click “Create Cluster”.
You will see the progress indicator as the Cluster is being created. Once the Cluster is created, the progress indicator will turn into a green-filled circle. This is an indication that the Cluster is running and you can attach a notebook to it and start running commands and queries.
B) Using the Cluster UI
You can also create Databricks Clusters using the Cluster UI. Follow the steps given below:
Step 1: Click the “Compute” icon from the sidebar.
Step 2: Click “Create Cluster”.
Step 3: Follow steps 2 and 3 in the section for using the Create button. Your Cluster will then be created.
How to Manage Databricks Clusters?
Different activities are involved in managing the Clusters. Let’s discuss them:
A) Displaying Clusters
To see all the Databricks Clusters in your workspace, click the “Compute” icon from the sidebar. The Clusters will be displayed in two tabs: All-Purpose Clusters and Job Clusters.
The following details will be shown for each Cluster:
- Cluster Name
- State
- Number of Nodes
- Databricks Runtime Version
- Type of Driver and Worker Nodes
- Cluster Creator or Job Owner
The All-Purpose Clusters tab also shows the number of notebooks that have been attached to the Cluster.
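If you prefer to inspect Clusters programmatically, the Clusters API exposes the same details the Compute page shows. A minimal sketch, assuming placeholder workspace URL and token:

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# List clusters and print roughly the same fields the UI displays.
clusters = requests.get(f"{DATABRICKS_HOST}/api/2.0/clusters/list",
                        headers=HEADERS).json().get("clusters", [])
for c in clusters:
    print(c["cluster_name"],
          c["state"],
          c.get("num_workers"),          # absent on autoscaling clusters
          c["spark_version"],
          c["node_type_id"],
          c.get("creator_user_name"))
```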
B) Filtering the Cluster List
To filter the Cluster list in your Databricks workspace, use the buttons and filter field located at the top right.
- To see the Clusters that you have created in your account, choose “Created by Me”.
- To see the Clusters that only you can access (if you have cluster access control enabled), choose “Accessible by me”.
- To filter the Clusters by a string present in any field, enter the string in the “Filter” text box.
How to Configure a Databricks Cluster?
There are different configuration options available to you when creating and editing Databricks Clusters. Let’s discuss them:
A) Cluster Policy
A Cluster policy uses a set of rules to limit the ability to configure Clusters. The rules limit the attributes or attribute values that are available during Cluster creation. Policies have Access Control Lists (ACLs) that limit their use to particular users and groups, thus limiting which policies you can choose during Cluster creation.
To configure a Cluster policy, click the “Policy” dropdown button and choose the policy.
You will find that there are different Cluster policies that you can choose from:
- If you have the Cluster Create Permission, choose the “Unrestricted” policy option and create Clusters that are fully configurable. This policy doesn’t limit any attributes or attribute values.
- If you have both the Cluster Create Permission and access to Cluster policies, you can choose the Unrestricted policy or any of the policies you have access to.
- If you only have access to Cluster policies, choose one of the policies that you have access to.
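For orientation, here is a hypothetical cluster policy definition, the JSON-style rule set an admin supplies when creating a policy. The attribute paths and rule types follow the documented policy definition format, but the specific values are illustrative assumptions.

```python
policy_definition = {
    # Pin the worker node type and hide the field in the create-cluster form.
    "node_type_id": {"type": "fixed", "value": "i3.xlarge", "hidden": True},
    # Allow autoscaling, but cap the maximum number of workers.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Force idle clusters to auto-terminate after an hour.
    "autotermination_minutes": {"type": "fixed", "value": 60},
    # Tag every cluster created under this policy for cost attribution.
    "custom_tags.team": {"type": "fixed", "value": "data-engineering"},
}
```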
B) Cluster Mode
There are three Cluster Modes in Databricks: Standard, High Concurrency, and Single Node. The default Cluster mode is Standard.
A Standard Cluster is good for a single user. It can run workloads written in languages such as SQL, Python, Scala, and R.
A High Concurrency Databricks Cluster is a managed Cloud resource. These Clusters are good for sharing, as they provide minimum query latencies and maximum resource utilization. They can run workloads written in R, SQL, and Python. In these Clusters, security and performance are provided by running user code in separate processes, which is not possible in Scala.
Note that it’s only High Concurrency Clusters that support table access control. To create this type of Cluster, choose “High Concurrency” for Cluster-Mode.
A Single Node Cluster doesn’t have workers and runs Spark jobs on the driver node. To create one, choose “Single Node” for the Cluster-Mode.
Conclusion
In this article, you have learned about Databricks Clusters and how to create, configure, and manage Databricks Clusters.
Extracting complex data from a diverse set of data sources and loading it to your desired destination such as Databricks Clusters can be quite challenging and cumbersome. This is where an easier alternative like Hevo saves your day!
Want to take Hevo for a spin? SIGN UP for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Nicholas Samuel is a technical writing specialist with a passion for data, with more than 14 years of experience in the field. With his skills in data analysis, data visualization, and business intelligence, he has delivered over 200 blogs. In his early years as a systems software developer at Airtel Kenya, he developed applications using Java, the Android platform, and web applications with PHP. He also performed Oracle database backups, recovery operations, and performance tuning. Nicholas was also involved in projects that demanded in-depth knowledge of Unix system administration, specifically with HP-UX servers. Through his writing, he intends to share the hands-on experience he gained to make the lives of data practitioners better.