Building Machine Learning applications requires companies to coordinate several tasks that are often costly and time-consuming. As a result, companies look for automation systems that can simplify the execution of mundane tasks. Databricks is a Data Analytics platform and enterprise software that provides companies with a pre-configured Machine Learning environment and helps them boost the Data Analysis process.
Companies use various platforms for their daily business activities, and all these platforms store valuable business data. To analyze this data, companies run their tasks as Jobs that execute code on Databricks clusters. To work with Databricks Jobs from third-party tools or other external sources, companies need to use the Databricks Jobs API.
Databricks Jobs API allows businesses to do several tasks, including ETL tasks, on a given schedule, reducing the manual efforts required while working with data-related processes. In this article, you will learn about Databricks and the basic operations of Databricks Jobs API. It also introduces you to the fundamental elements included in the Databricks Workspace and some best practices you can follow while using Databricks Jobs API.
Prerequisites
- Understanding of Cloud Data Engineering.
- An active Databricks account.
- An idea of API requests.
Introduction to Databricks
Databricks is a Cloud-based Data Engineering and Big Data processing platform that unifies data to handle analytics and AI workloads. As companies collect large volumes of data for performing analytics tasks, data architects at Databricks designed a LakeHouse platform, which combines the reliability and governance of Data Warehouses with the flexibility of Data Lakes to perform SQL Analytics, BI, Data Science, and Machine Learning.
Introduction to Databricks Workspace
Databricks Workspace is an environment for accessing all the Databricks assets. The workspace organizes objects into folders while providing access to data and computational resources. Below is the list of Databricks Workspace assets:
1) Clusters
A Cluster provides a set of computation resources and configurations for executing a particular process. It helps users run ETL pipelines, streaming analytics, Ad-hoc Analytics, Machine Learning, and other use cases.
2) Notebooks
A Notebook is a web-based interface for developers to run code, generate visualizations, and write narrative text. Commands in a notebook run in sequence and can refer to the output of one or more previously run commands.
3) Jobs
A Job is another way of running code on a Databricks cluster, besides notebooks. By default, a job consists of a single task that can be scheduled or run interactively using the notebook UI (user interface). If you enable the orchestration of multiple tasks in a workspace (recommended), a job can consist of multiple tasks, and Databricks manages the task orchestration, as in production pipelines. This ability simplifies the creation, management, monitoring, and error reporting for a job.
Here are some important limitations while configuring a job:
- A job can be created only in a Data Science and Engineering workspace or a Machine Learning workspace.
- A workspace is limited to 1000 concurrent job runs.
- At any given hour, the number of jobs a workspace can create is limited to 5000.
4) Libraries
A Library makes third-party or locally built code available to notebooks and jobs running on a cluster. Databricks allows users to install libraries in three modes: workspace, cluster, and notebook. Each mode is scoped to its respective environment for a given session.
5) Data
Data is imported into the Databricks workspace so that notebooks and clusters can operate on it. Small data files on a local machine can be uploaded to a distributed file system such as the Databricks File System (DBFS) for analysis. For large data files, Databricks supports a wide variety of Apache Spark data sources, such as Avro files and Hive tables.
6) Repos
Repos provide a repository-level integration with Git to support best practices for data science code development. Organizations sync algorithms created in Databricks notebooks with a remote Git repository to collaborate and version control. With Databricks Repos, developers can leverage Git functionality to clone repositories, manage branches, push or pull changes, and visually compare differences in a commit.
7) Models
MLflow in Databricks offers ML engineers an integrated experience to track and secure ML models. Model in Databricks refers to a model registered in MLflow Model Registry, which is a centralized model store to manage the end-to-end machine learning life cycle of models.
Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, external connectors for REST API, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources, including external connectors for Magento REST APIs, and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.
Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Operations in Databricks Jobs API
A Job is one of the workspace assets that runs a task in a Databricks cluster. A job can be configured using the UI, the CLI (command line interface), or the Databricks Jobs API. The Databricks Jobs API allows you to create, edit, and delete jobs, with a maximum permitted request size of 10MB.
The Databricks Jobs API follows the guiding principles of representational state transfer (REST) architecture. Requests to the Databricks REST API can be authenticated with either a Databricks personal access token or a password. Databricks recommends Jobs API 2.1, the more recent release, which supports jobs with multiple tasks, and it provides guidance for updating existing API clients to this version. The steps that will help you understand basic operations in Databricks Jobs API are listed below:
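As a rough illustration of token-based authentication, the sketch below builds (but does not send) a Jobs API 2.1 request using only the Python standard library. The workspace host and the token are hypothetical placeholders, the helper name `build_jobs_request` is introduced here for illustration, and reading the 10MB request limit as binary megabytes is an assumption.

```python
import json
import urllib.request

API_PREFIX = "/api/2.1/jobs"  # path prefix of Jobs API 2.1
# 10MB request limit from the text; treating MB as 1024 * 1024 bytes is an assumption.
MAX_REQUEST_BYTES = 10 * 1024 * 1024


def build_jobs_request(host, token, operation, payload=None):
    """Build, without sending, an authenticated Jobs API request.

    A payload produces a POST with a JSON body; no payload produces a GET.
    """
    url = f"https://{host}{API_PREFIX}/{operation}"
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    if data is not None and len(data) > MAX_REQUEST_BYTES:
        raise ValueError("request body exceeds the 10MB limit")
    method = "GET" if data is None else "POST"
    request = urllib.request.Request(url, data=data, method=method)
    # A personal access token is passed as a bearer token.
    request.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


# Hypothetical host and token, for illustration only.
req = build_jobs_request("example.cloud.databricks.com", "dapi-EXAMPLE", "list")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send the request; the sketch stops before that so it can be inspected without a live workspace.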
1) Creating a New Job
To create a new job, users send a request to the server. The Databricks Jobs API uses the HTTP POST method for this, with a request body that follows the schema below:
Field | Data Type | Description |
name | String | An optional name for the job. The default name is ‘Untitled.’ |
tasks | Array | A list of task specifications. |
job_clusters | Array | A list of job cluster specifications, each identified by a unique name. |
email_notifications | Object | The email addresses to be notified for a given event like job success or failure. |
timeout_seconds | Integer | An optional timeout applied to each run of the job. By default, there is no time limit. |
schedule | Object | A schedule that triggers runs of the job for a given time zone. |
max_concurrent_runs | Integer | An optional maximum number of runs of the same job that may execute concurrently. |
format | String | The type of job, which takes one of two values, either ‘SINGLE_TASK’ or ‘MULTI_TASK.’ |
access_control_list | Array | A list of permissions to be set on the job. |
A POST request to the Databricks Jobs API results in one of the following four responses:
- 200: This response code explains that the job was successfully created.
- 400: This response code shows that the request was malformed. It consists of an error code and a human-readable error message that displays the cause of the error.
- 401: This response code denotes that the request was unauthorized.
- 500: This response code reveals that the request was not handled correctly due to a server error.
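The four response codes above can be turned into a small lookup, sketched here in Python. The helper name `describe_response` is hypothetical, and the assumption that error bodies carry `error_code` and `message` fields is based on the description of the 400 response above.

```python
# Meanings of the response codes listed above.
RESPONSE_MEANINGS = {
    200: "job created successfully",
    400: "malformed request",
    401: "unauthorized request",
    500: "server error while handling the request",
}


def describe_response(status, body=None):
    """Return a one-line, human-readable summary of a Jobs API response."""
    meaning = RESPONSE_MEANINGS.get(status, "unexpected status")
    if status == 200:
        return f"{status}: {meaning}"
    # Assumed shape of an error body: {"error_code": ..., "message": ...}
    body = body or {}
    code = body.get("error_code", "UNKNOWN")
    message = body.get("message", "")
    return f"{status}: {meaning} [{code}] {message}".rstrip()


print(describe_response(200))
print(describe_response(400, {"error_code": "INVALID_PARAMETER_VALUE",
                              "message": "name is too long"}))
```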
For instance, if you want to create a job that runs a JAR task at 10:15 pm each night, you can follow the below-mentioned steps:
- To create a job, send a ‘post’ HTTP request to the jobs create endpoint of your workspace server (api/2.1/jobs/create in Jobs API 2.1).
- The body of the ‘post’ request is taken from the JSON parameters in the ‘create-job.json’ file.
- Based on the requirement, the desired parameters can be configured in the ‘create-job.json’ file, which follows the request schema described above.
- If the specified parameters are correct, response code ‘200’ is returned together with the ID of the newly created job.
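The original request body for this example is not reproduced here, so the following is only a sketch of what the contents of ‘create-job.json’ could look like for a JAR task that runs at 10:15 pm each night. The job name, JAR path, main class, time zone, and cluster values are all placeholders, and the helper `build_create_job_payload` is hypothetical. The schedule uses a Quartz cron expression, whose fields are seconds, minutes, hours, day of month, month, and day of week.

```python
import json


def build_create_job_payload(name, jar_path, main_class, cluster):
    """Assemble a request body for creating a nightly JAR job."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "nightly_jar",
                "new_cluster": cluster,
                "libraries": [{"jar": jar_path}],
                "spark_jar_task": {"main_class_name": main_class},
            }
        ],
        "schedule": {
            # second 0, minute 15, hour 22, every day: 10:15 pm each night
            "quartz_cron_expression": "0 15 22 * * ?",
            "timezone_id": "UTC",  # placeholder time zone
        },
        "max_concurrent_runs": 1,
        "format": "MULTI_TASK",
    }


# Placeholder values, for illustration only.
payload = build_create_job_payload(
    name="Nightly model training",
    jar_path="dbfs:/my-jar.jar",
    main_class="com.example.ProcessData",
    cluster={"spark_version": "7.3.x-scala2.12",
             "node_type_id": "i3.xlarge",
             "num_workers": 2},
)
print(json.dumps(payload, indent=2))
```

The printed JSON could be saved as ‘create-job.json’ and sent as the body of the ‘post’ request.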
2) Listing All the Jobs
A list of all jobs can be retrieved with a ‘get’ HTTP request to the jobs list endpoint (api/2.1/jobs/list in Jobs API 2.1).
The example ‘get’ request relies on a .netrc file for automatic login to the server and on jq, a command-line JSON processor, to extract fields from the JSON response.
If the ‘get’ request is valid, the response contains the list of jobs in the workspace.
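As a Python counterpart to the jq step, the sketch below extracts the ID and name of each job from a list response. The sample response is made up for illustration, the helper `summarize_jobs` is hypothetical, and the assumed response shape (a `jobs` array whose entries carry `job_id` and `settings.name`) should be checked against the API reference for your version.

```python
def summarize_jobs(response):
    """Return (job_id, name) pairs from a jobs list response."""
    summary = []
    for job in response.get("jobs", []):
        name = job.get("settings", {}).get("name", "Untitled")
        summary.append((job["job_id"], name))
    return summary


# Made-up response, for illustration only.
sample_response = {
    "jobs": [
        {"job_id": 1, "settings": {"name": "Nightly model training"}},
        {"job_id": 2, "settings": {}},
    ],
    "has_more": False,
}

for job_id, name in summarize_jobs(sample_response):
    print(job_id, name)
```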
3) Updating and Resetting Jobs
Based on the requirements of organizations, the number of tasks in a job can change. A job can be changed partially or completely; these are called the ‘update’ and ‘reset’ operations in Databricks Jobs API, respectively.
A ‘reset’ operation sends a ‘post’ HTTP request whose body, here stored in the ‘reset-job.json’ file, carries the job ID and the complete new settings, which overwrite all existing settings of the job. For instance, a reset can be used to make job 2 identical to job 1.
Use the ‘update’ operation for changes to a specific part of a job, such as adding, changing, or removing individual settings. For instance, an update can remove the libraries from job 1 and add email notifications to it.
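The difference between the two operations can be modelled locally, as in the sketch below. This is an illustration of the semantics described above, not the server's implementation: `apply_reset` and `apply_update` are hypothetical helpers, the job settings are made up, and the top-level `libraries` field follows the single-task job format as an assumption (in multi-task jobs, libraries are attached to individual tasks).

```python
import copy


def apply_reset(current_settings, new_settings):
    """Reset: the new settings replace the existing ones completely."""
    return copy.deepcopy(new_settings)


def apply_update(current_settings, new_settings=None, fields_to_remove=None):
    """Update: remove the named top-level fields, then add or replace others."""
    updated = copy.deepcopy(current_settings)
    for field in fields_to_remove or []:
        updated.pop(field, None)
    updated.update(copy.deepcopy(new_settings or {}))
    return updated


# Made-up settings, for illustration only.
job1 = {"name": "job1", "libraries": [{"jar": "dbfs:/my-jar.jar"}],
        "timeout_seconds": 3600}
job2 = {"name": "job2", "max_concurrent_runs": 3}

# Reset: job 2 becomes identical to job 1.
job2 = apply_reset(job2, job1)

# Update: drop the libraries of job 1 and add email notifications.
job1 = apply_update(
    job1,
    new_settings={"email_notifications": {"on_failure": ["ops@example.com"]}},
    fields_to_remove=["libraries"],
)

# Corresponding request bodies for the reset and update operations.
reset_body = {"job_id": 2, "new_settings": job2}
update_body = {
    "job_id": 1,
    "new_settings": {"email_notifications": {"on_failure": ["ops@example.com"]}},
    "fields_to_remove": ["libraries"],
}
print(job1)
print(job2)
```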
4) Deleting a Job and Task
Deleting a job is carried out by a ‘post’ HTTP request that removes not only the details of the job but also its historical runs. No action occurs if a delete operation is performed on a job that has already been removed. For any other invalid parameter values, an error message is returned.
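A delete request only needs to identify the job. The sketch below builds such a request body and rejects obviously invalid IDs before anything is sent; the helper `build_delete_payload` is hypothetical, and the rule that a job ID must be a positive integer is an assumption made for this illustration.

```python
def build_delete_payload(job_id):
    """Build the body of a delete request for the given job ID."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(job_id, bool) or not isinstance(job_id, int) or job_id <= 0:
        raise ValueError(f"invalid job_id: {job_id!r}")
    return {"job_id": job_id}


print(build_delete_payload(7))
```

The resulting body would be sent with a ‘post’ request to the jobs delete endpoint (api/2.1/jobs/delete in Jobs API 2.1).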
Best Practices for Databricks Jobs API
A few best practices for Databricks Jobs API are listed below:
Cluster Configuration
The cluster configuration is an essential consideration when operationalizing a job. Databricks recommends that developers use new clusters so that each task runs in a fully isolated environment. A task that runs on a new cluster is treated as a Data Engineering workload, whereas a task that runs on an existing all-purpose cluster is treated as an analytics workload.
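In a task specification, the two options correspond to the `new_cluster` and `existing_cluster_id` fields. The sketch below builds a notebook task with exactly one of them; the helper `make_task`, the notebook path, and the cluster values are placeholders, and requiring exactly one of the two fields is a simplification made for this illustration.

```python
def make_task(task_key, notebook_path, new_cluster=None, existing_cluster_id=None):
    """Build a notebook task that runs on a new or an existing cluster."""
    if (new_cluster is None) == (existing_cluster_id is None):
        raise ValueError("specify exactly one of new_cluster or existing_cluster_id")
    task = {"task_key": task_key,
            "notebook_task": {"notebook_path": notebook_path}}
    if new_cluster is not None:
        # Recommended: a fresh, fully isolated cluster for this task.
        task["new_cluster"] = new_cluster
    else:
        # Reuses an all-purpose cluster that already exists.
        task["existing_cluster_id"] = existing_cluster_id
    return task


# Placeholder values, for illustration only.
isolated = make_task(
    "etl", "/Users/someone@example.com/etl",
    new_cluster={"spark_version": "7.3.x-scala2.12",
                 "node_type_id": "i3.xlarge",
                 "num_workers": 2},
)
shared = make_task("adhoc", "/Users/someone@example.com/adhoc",
                   existing_cluster_id="0000-000000-example")
print(sorted(isolated), sorted(shared))
```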
Notebook
The total output of all notebook cells and the output of an individual cell are subject to size limits of 20MB and 8MB, respectively. If either limit is exceeded, the job run is canceled and marked as failed. To find cells that are near or beyond a limit, Databricks recommends running the notebook against an all-purpose cluster and using the notebook autosave technique.
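The two limits can be expressed as a simple check, sketched below. It reads the 20MB limit as applying to the combined output of all cells and the 8MB limit to any single cell; treating a megabyte as 1024 * 1024 bytes is an assumption, and the helper `exceeds_output_limits` is hypothetical.

```python
MB = 1024 * 1024            # assumption: binary megabytes
TOTAL_LIMIT = 20 * MB       # combined output of all notebook cells
CELL_LIMIT = 8 * MB         # output of a single cell


def exceeds_output_limits(cell_output_sizes):
    """Return True if a run with these cell output sizes (bytes) would fail."""
    too_large_total = sum(cell_output_sizes) > TOTAL_LIMIT
    too_large_cell = any(size > CELL_LIMIT for size in cell_output_sizes)
    return too_large_total or too_large_cell


print(exceeds_output_limits([7 * MB, 7 * MB]))   # within both limits
print(exceeds_output_limits([9 * MB]))           # one cell above 8MB
print(exceeds_output_limits([7 * MB] * 3))       # 21MB in total
```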
Streaming
Streaming jobs should not have maximum concurrent runs set to greater than one. As a streaming task runs continuously, it should always be the final task in a job, and retries should not be configured for it.
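These recommendations can be checked against a job's settings before the job is created. The sketch below is one possible way to do so: the helper `check_streaming_job` and the sample settings are hypothetical, and it assumes the multi-task format in which each task has a `task_key`, an optional `depends_on` list, and an optional `max_retries` value.

```python
def check_streaming_job(settings, streaming_task_key):
    """Return a list of problems found with a streaming job's settings."""
    problems = []
    if settings.get("max_concurrent_runs", 1) > 1:
        problems.append("max_concurrent_runs is greater than 1")
    for task in settings.get("tasks", []):
        upstream = [dep["task_key"] for dep in task.get("depends_on", [])]
        # The streaming task must be final: nothing may depend on it.
        if streaming_task_key in upstream:
            problems.append(f"task '{task['task_key']}' runs after the streaming task")
        if task["task_key"] == streaming_task_key and task.get("max_retries", 0) != 0:
            problems.append("retries are configured for the streaming task")
    return problems


# Made-up settings, for illustration only.
settings = {
    "max_concurrent_runs": 2,
    "tasks": [
        {"task_key": "prepare"},
        {"task_key": "stream", "depends_on": [{"task_key": "prepare"}],
         "max_retries": 3},
        {"task_key": "report", "depends_on": [{"task_key": "stream"}]},
    ],
}
for problem in check_streaming_job(settings, "stream"):
    print(problem)
```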
Library Dependencies
The Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any other libraries that conflict with them. When dealing with library dependencies, a good rule of thumb is to list Spark and Hadoop as provided dependencies when building JARs for jobs.
Conclusion
In this article, you learnt about Databricks and the basic operations of Databricks Jobs API. You also read about Databricks Workspace and how Databricks Jobs API helps companies automate some of their processes by accessing Databricks with code. Businesses face an enormous number of tasks while handling Data Science and Machine Learning processes. Combining such tasks into jobs helps enterprises reduce manual intervention. Databricks Jobs API groups a large number of smaller tasks so that they run automatically and seamlessly. With the help of Databricks Jobs API, organizations can also run multiple tasks in parallel to execute the entire process efficiently.
Companies need to analyze their business data stored in multiple data sources. The data needs to be loaded to the Data Warehouse to get a holistic view of the data. Hevo Data is a No-code Data Pipeline solution that helps to transfer data from 100+ sources to the desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.
Amit Kulkarni specializes in creating informative and engaging content on data science, leveraging his problem-solving and analytical thinking skills. He excels in delivering AI and automation solutions, developing generative chatbots, and providing data-driven AI & ML solutions. Amit holds a Master's degree and a Bachelor's degree in Electrical Engineering, consistently achieving distinction in his studies.