Docker is a platform that enables system administrators and developers to build and ship distributed applications. It uses OS-level virtualization to deliver applications in packages known as containers. Docker provides developers with an easy and lightweight way to package code and its dependencies into Docker images, which can then be run as Docker containers. Databricks, in turn, uses clusters to bring computation resources and configurations together.

When you create a Databricks cluster, Databricks Container Services gives you the option to specify a Docker image. This comes with several benefits, including full control over the installed libraries, a golden container environment that never changes, and the ability to integrate Docker CI/CD pipelines with Databricks. In this article, you will learn the steps to set up Databricks Docker Integration.

Prerequisites

  • Databricks Runtime 6.1 or above. 
  • Databricks Container Services should be enabled in your Databricks workspace. 
  • A recent Docker daemon.

Introduction to Databricks


Databricks is a Cloud-based Data Engineering platform that is used for processing and transforming large volumes of data and exploring it with the help of Machine Learning models. It is also available as a first-party Azure service (Azure Databricks), making it a key Big Data processing tool in the Microsoft Cloud. Databricks allows organizations to realize the full potential of their data by bringing their data, Machine Learning, and ELT processes together.

Databricks uses a distributed system, meaning that it automatically divides workloads across different processors and scales up or down depending on demand. This helps you save time and money when running large workloads.

Key Features of Databricks

A few key features of Databricks are listed below:

  • Collaborative Notebooks: Databricks comes with tools that allow you to access and analyze data using the language of your choice, such as Scala, R, SQL, or Python. You can build models and discover and generate insights for your company. 
  • Interactive Workspace: Databricks offers a user-friendly and interactive workspace environment that makes it easier for teams to collaborate and manage complex projects.
  • Machine Learning: Databricks provides pre-configured Machine Learning environments that ship with powerful frameworks and libraries such as TensorFlow, scikit-learn, and PyTorch.

Introduction to Docker


Docker is a containerization platform. It provides developers with a way to package their applications by combining the application source code with the libraries and dependencies needed to run the application in a production environment. Containers simplify and speed up the delivery of distributed applications, and they are very popular today because organizations have shifted to cloud-native development. 

Although developers don’t strictly need Docker to create containers, Docker makes it easier, safer, and simpler to create, deploy, and manage them. It provides a set of commands for performing various operations and a single API for automating repetitive work. 

Key Features of Docker

Docker helps developers create distributed applications in a simplified manner. A few features of Docker are listed below:

  • Easy and Faster Configuration: Docker allows users to configure the system in a hassle-free manner. The code deployment time is less, and the infrastructure is not linked with the environment of the application.
  • Swarm: Docker comes with its own clustering and scheduling tool known as Swarm. It exposes the standard Docker API as its frontend, so tools that already talk to a Docker daemon can also control a Swarm. With the help of Swarm, users can manage a cluster of Docker hosts as a single virtual host.
  • Application Isolation: Docker containers allow applications to run in isolated environments. Containers are independent of one another, which allows users to run any application in a separate container without affecting other applications.


Steps to Set Up Databricks Docker Integration


Now that you have a basic understanding of Databricks and Docker, this section will show you how to specify a Docker image when creating a Databricks cluster and walk you through the steps to set up Databricks Docker Integration. The steps to integrate Databricks and Docker are listed below:

Step 1: Create your Base

Databricks has a set of minimum requirements that an image must meet to start a cluster successfully. Databricks recommends building your Docker image from a base image that Databricks has already built and tested. The following example uses the 9.x tag because the image will target a cluster running Databricks Runtime 9.0 or above:

FROM databricksruntime/standard:9.x

To add more Python libraries, such as the latest versions of urllib3 and pandas, use the container-specific version of pip. If you are using databricksruntime/standard:9.x, add the following:

RUN /databricks/python3/bin/pip install urllib3
RUN /databricks/python3/bin/pip install pandas

If you are using databricksruntime/standard:8.x or lower, use this instead:

RUN /databricks/conda/envs/dcs-minimal/bin/pip install urllib3
RUN /databricks/conda/envs/dcs-minimal/bin/pip install pandas

You can choose to build your Docker image from scratch or use the minimal image provided by Databricks, which is available at databricksruntime/minimal.

If you choose to build your own base image for Databricks Docker Integration, make sure it meets the following minimum requirements:

  • JDK 8u191 as Java on the system PATH
  • iproute2 (ubuntu iproute)
  • bash
  • coreutils (ubuntu coreutils, alpine coreutils)
  • procps (ubuntu procps, alpine procps)
  • Ubuntu or Alpine Linux
  • sudo (ubuntu sudo, alpine sudo)
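
Once your Dockerfile is ready, build and tag the image locally before pushing it to a registry. The commands below are only a minimal sketch: the image name databricks-custom and the 9.x tag are placeholder values, and the optional verification step relies on the container-specific pip path shown above for Databricks Runtime 9.x images.

# Build the custom image from the Dockerfile in the current directory
docker build -t databricks-custom:9.x .

# Optionally, check that the extra Python libraries made it into the image
docker run --rm --entrypoint /databricks/python3/bin/pip databricks-custom:9.x list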

Step 2: Push your Base Image

It’s now time to push your base image to a Docker registry. The following registries are supported:

  • Docker Hub (without auth or basic auth)
  • Amazon Elastic Container Registry (ECR) with IAM 
  • Azure Container Registry (with basic auth)

Other Docker registries that support no authentication or basic authentication are also expected to work. 
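
As an illustration, the commands below tag the image built in Step 1 and push it to Docker Hub and to Amazon ECR. The account ID, region, repository name, and image name are placeholder values to replace with your own:

# Tag the local image for Docker Hub and push it
docker tag databricks-custom:9.x <your-dockerhub-username>/databricks-custom:9.x
docker push <your-dockerhub-username>/databricks-custom:9.x

# For Amazon ECR, authenticate the Docker CLI against your registry first
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws-account-id>.dkr.ecr.<region>.amazonaws.com

# Then tag the image for the ECR repository and push it
docker tag databricks-custom:9.x <aws-account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:9.x
docker push <aws-account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:9.x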

Step 3: Start the Databricks Docker Cluster

To start the Databricks Docker cluster, you can use either the UI or the API. 

The steps to start the Databricks Docker cluster using the UI are listed below:

  • Choose a Databricks Runtime version that supports Databricks Container Services. 
  • Choose “Use your own Docker container”. 
  • Enter your Custom Docker Image in the field for “Docker Image URL”. 
  • Select the type of authentication to be used. 

The steps to start the Databricks Docker cluster via the API are listed below:

  • Generate a Databricks API token. 
  • Start your Databricks Docker cluster using the Clusters API 2.0 and your custom Docker base image, as shown in the code below:
curl -X POST -H "Authorization: Bearer <your-token>" https://<databricks-instance>/api/2.0/clusters/create -d '{
  "cluster_name": "<your-cluster_name>",
  "num_workers": 0,
  "node_type_id": "i3.xlarge",
  "docker_image": {
    "url": "databricksruntime/standard:latest",
    "basic_auth": {
      "username": "<your-docker-registry-username>",
      "password": "<your-docker-registry-password>"
    }
  },
  "spark_version": "7.3.x-scala2.12",
  "aws_attributes": {
    "availability": "ON_DEMAND",
    "instance_profile_arn": "arn:aws:iam::<your-aws-account-number>:instance-profile/<iam-role>"
  }
}'
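
The create call returns a cluster_id in its JSON response. If you want to confirm that the cluster starts up correctly with your custom image, you can poll its state through the same Clusters API 2.0, for example:

# Check the state of the new cluster; the response includes a "state" field
# such as PENDING, RUNNING, or TERMINATED
curl -X GET -H "Authorization: Bearer <your-token>" \
  "https://<databricks-instance>/api/2.0/clusters/get?cluster_id=<cluster-id>"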

The requirements for the basic_auth field depend on the type of Docker image. You should not add the basic_auth field if you are using a public Docker image. For private Docker images, you must add this field with a service principal ID and password (as the username and password). 

Don’t add the basic_auth field for Amazon ECR images. Instead, launch your cluster using an instance profile that has permission to pull Docker images from the ECR repository in which the image is stored. 
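
For example, when the image is pulled from ECR using an instance profile, the docker_image block of the cluster specification contains only the image URL; the values below are placeholders:

"docker_image": {
  "url": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:9.x"
}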

For Azure Container Registry, the basic_auth field should be set to the ID and password of a service principal. 

The following example shows an IAM role with permission to pull any image. The repository is specified by <repository-name> in the code below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "ecr:ListImages",
        "ecr:BatchGetImage"
      ],
      "Resource": [ "<repository-name>" ]
    }
  ]
}

That’s it! You have completed the Databricks Docker Integration.

Conclusion

  • In this article, you learnt about Databricks, Docker, and the steps to set up Databricks Docker Integration, along with the benefits the integration offers.
  • It gives developers full control over the system libraries they want to install.
  • With the help of Databricks Docker Integration, companies can deliver faster and more scalable Continuous Integration solutions.
  • Docker and Databricks are widely used tools that make the development job a lot easier.
  • Share your experience of learning about Databricks Docker Integration in the comments section below!
Nicholas Samuel
Technical Content Writer, Hevo Data

Nicholas Samuel is a technical writing specialist with a passion for data, having more than 14 years of experience in the field. With his skills in data analysis, data visualization, and business intelligence, he has delivered over 200 blogs. In his early years as a systems software developer at Airtel Kenya, he developed applications using Java and the Android platform, as well as web applications with PHP. He also performed Oracle database backups, recovery operations, and performance tuning. Nicholas was also involved in projects that demanded in-depth knowledge of Unix system administration, specifically with HP-UX servers. Through his writing, he intends to share the hands-on experience he gained to make the lives of data practitioners better.
