Setting Up Databricks Docker Integration: 3 Easy Steps

By Nicholas Samuel | Published: November 11, 2021


Delivering fast development solutions to businesses in a fast-paced environment requires Continuous Integration and Business Analytics to be leveraged together with Machine Learning environments. Databricks is a cloud platform that brings together Data Science, Data Engineering, and Business Intelligence. It provides a platform where you can store your data, and its tools let you analyze that data and extract meaningful insights for decision-making. This makes Databricks a very useful platform at a time when organizations are generating huge volumes of Big Data. Companies use Databricks Docker Integration to create custom deep learning environments on clusters with GPU devices.

Docker is a platform that enables system administrators and developers to build distributed applications. It uses OS-level virtualization to deliver applications in packages known as containers. Docker gives developers an easy and lightweight way to package code and its dependencies into Docker images, built from Dockerfiles, that can be run inside Docker containers. Databricks, in turn, uses clusters to bring computation resources and configurations together.

When creating a Databricks cluster, Databricks Container Services gives you the option to specify a Docker image. This comes with many benefits, including full control over the installed libraries, a golden container environment that will never change, and the ability to integrate Docker CI/CD pipelines with Databricks. In this article, you will learn the steps to set up Databricks Docker Integration.


Prerequisites

  • Databricks Runtime 6.1 or above. 
  • Databricks Container Services should be enabled in your Databricks workspace. 
  • A recent Docker daemon installed locally (see the quick check below).
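
If you are unsure about the last prerequisite, a quick way to confirm that the Docker daemon and CLI are available on your machine is:

# Prints client and server (daemon) versions if Docker is installed and running
docker version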

Introduction to Databricks


Databricks is a Cloud-based Data Engineering platform that is used for processing and transforming large volumes of data and exploring it with the help of Machine Learning models. It was recently added to Azure, making it the latest Big Data processing tool in the Microsoft Cloud. Databricks allows organizations to achieve the full potential of merging their data, Machine Learning, and ELT processes.

Databricks uses a distributed system, meaning that it automatically divides workloads across different processors and scales up and down depending on demand. This saves you time and money when running large workloads.

Key Features of Databricks

A few key features of Databricks are listed below:

  • Collaborative Notebooks: Databricks comes with tools that allow you to access and analyze data using the language of your choice, such as Scala, R, SQL, or Python. You can build models, discover insights, and generate value for your company. 
  • Interactive Workspace: Databricks offers a user-friendly and interactive workspace environment that makes it easier for teams to collaborate and manage complex projects.
  • Machine Learning: Databricks provides pre-configured Machine Learning environments equipped with powerful frameworks and libraries such as TensorFlow, Scikit-Learn, and PyTorch.

To learn more about Databricks, click here.

Introduction to Docker


Docker is a containerization platform. It provides developers with a way to package their applications by combining the application source code with the libraries and dependencies needed to run the application in a production environment. Containers simplify and speed up the delivery of distributed applications, and they are very popular today because organizations have shifted to cloud-native development. 

Although developers don’t need Docker to create containers, Docker makes it easier, safer, and simpler to create, deploy, and manage the containers. It supports a set of commands for performing various operations and has a single API for work-saving automation. 

Key Features of Docker

Docker helps developers create distributed applications in a simplified manner. A few features of Docker are listed below:

  • Easy and Faster Configuration: Docker allows users to configure the system in a hassle-free manner. Code deployment takes less time, and the infrastructure is not tied to the environment of the application.
  • Swarm: Docker comes with its own clustering and scheduling tool known as Swarm. It uses the Docker API as a frontend, which lets various tools control it. With the help of Swarm, users can manage a cluster of Docker hosts as a single virtual host.
  • Application Isolation: Docker containers allow applications to run in isolated environments. Each container is independent of the others, so users can run any application in its own container without affecting other applications, as shown in the short sketch after this list.
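
As a quick, minimal illustration of this isolation (using the public ubuntu:20.04 image; the container names are just examples), the commands below start two containers from the same image and show that a file created in one is not visible in the other:

# Start two containers from the same image; each gets its own isolated
# filesystem, process tree, and network namespace.
docker run -d --name app-one ubuntu:20.04 sleep infinity
docker run -d --name app-two ubuntu:20.04 sleep infinity

# A file created in one container is not visible in the other.
docker exec app-one touch /tmp/only-in-app-one
docker exec app-two ls /tmp   # the file does not appear here

# Clean up
docker rm -f app-one app-two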

To learn more about Docker, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 30+ free data sources), and setting it up is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Steps to Set Up Databricks Docker Integration


Now that you have a basic understanding of Databricks and Docker, this section will show you how to specify a Docker image when creating a Databricks cluster and walk you through setting up Databricks Docker Integration. The steps to integrate Databricks and Docker are listed below:

Step 1: Create your Base

Databricks has a set of minimum requirements to start a cluster successfully. Databricks recommends building your Docker base from a base image that it has already created and tested. The following example uses the 9.x tag because the image targets a cluster running Databricks Runtime 9.0 or above:

FROM databricksruntime/standard:9.x

To add more Python libraries, such as the latest versions of urllib3 and pandas, use the container-specific version of pip. If you are using databricksruntime/standard:9.x, add the following:

RUN /databricks/python3/bin/pip install urllib3
RUN /databricks/python3/bin/pip install pandas

If you are using databricksruntime/standard:8.x or lower, use this:

RUN /databricks/conda/envs/dcs-minimal/bin/pip install urllib3
RUN /databricks/conda/envs/dcs-minimal/bin/pip install pandas
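
Putting these pieces together, a complete Dockerfile for a Databricks Runtime 9.x cluster might look like the minimal sketch below (the extra libraries are only examples; adjust them to your needs):

# Example Dockerfile for Databricks Container Services (Runtime 9.x)
FROM databricksruntime/standard:9.x

# Install additional Python libraries with the container-specific pip
RUN /databricks/python3/bin/pip install urllib3 pandas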

You can choose to build your Docker image from scratch or use the minimal image provided by Databricks, which is available at databricksruntime/minimal.

If you choose to build your own base image for Databricks Docker Integration, make sure it meets the following minimum requirements (a rough sketch follows the list):

  • JDK 8u191 as Java on the system PATH
  • iproute2 (ubuntu iproute)
  • bash
  • coreutils (ubuntu coreutils, alpine coreutils)
  • procps (ubuntu procps, alpine procps)
  • Ubuntu or Alpine Linux
  • sudo (ubuntu sudo, alpine sudo)
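
If you do build your own base, a rough, untested sketch of an Ubuntu-based Dockerfile that installs the packages listed above might look like the following (the package names and the JAVA_HOME path are assumptions for Ubuntu 20.04 and may need adjusting):

FROM ubuntu:20.04

# Install the minimum packages Databricks Container Services expects
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
      openjdk-8-jdk-headless \
      iproute2 \
      bash \
      coreutils \
      procps \
      sudo \
    && rm -rf /var/lib/apt/lists/*

# Put Java 8 on the system PATH
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV PATH=$JAVA_HOME/bin:$PATH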

Step 2: Push your Base Image

It’s now time to push the base image to a Docker registry. The following registries support this:

  • Docker Hub (without auth or basic auth)
  • Amazon Elastic Container Registry (ECR) with IAM 
  • Azure Container Registry (with basic auth)

You can also use other Docker registries that support basic auth or no auth. 
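
The exact commands depend on your registry, but for Docker Hub the workflow typically looks like the example below (the image name and tag are placeholders):

# Build the image from your Dockerfile
docker build -t <your-docker-hub-username>/databricks-custom:9.x .

# Log in and push the image to Docker Hub
docker login
docker push <your-docker-hub-username>/databricks-custom:9.x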

Step 3: Start the Databricks Docker Cluster

To start the Databricks Docker cluster, you can use either the UI or the API. 

The steps to start the Databricks Docker cluster using the UI are listed below:

  • Choose a Databricks Runtime version that supports Databricks Container Services. 
  • Choose “Use your own Docker container”. 
  • Enter your Custom Docker Image in the field for “Docker Image URL”. 
  • Select the type of authentication to be used. 

The steps to start the Databricks Docker cluster via the API are listed below:

  • Generate the API token. 
  • Start your Databricks Docker cluster using the Clusters API 2.0 and your custom Docker base with the code given below:
curl -X POST -H "Authorization: Bearer <your-token>" https://<databricks-instance>/api/2.0/clusters/create -d '{
  "cluster_name": "<your-cluster_name>",
  "num_workers": 0,
  "node_type_id": "i3.xlarge",
  "docker_image": {
    "url": "databricksruntime/standard:latest",
    "basic_auth": {
      "username": "<your-docker-registry-username>",
      "password": "<your-docker-registry-password>"
    }
  },
  "spark_version": "7.3.x-scala2.12",
  "aws_attributes": {
    "availability": "ON_DEMAND",
    "instance_profile_arn": "arn:aws:iam::<your-aws-account-number>:instance-profile/<iam-role>"
  }
}'

The requirements for the basic_auth field depend on the type of Docker image you use. Do not add the basic_auth field if you are using a public Docker image. However, you must add this field for private Docker images, supplying the registry credentials, such as a service principal ID and password, as the username and password. 

Don’t add the basic_auth field for Amazon ECR images. You should launch your cluster using an instance profile with permissions to pull Docker images from the Docker repository in which the image is stored. 

For Azure Container Registry, the basic_auth field should be set to the ID and password of a service principal. 
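
For example, the docker_image block of the Clusters API payload for a private Azure Container Registry image might look like the sketch below (the registry name, image tag, and credential placeholders are assumptions):

"docker_image": {
  "url": "<your-registry>.azurecr.io/databricks-custom:9.x",
  "basic_auth": {
    "username": "<service-principal-application-id>",
    "password": "<service-principal-secret>"
  }
}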

The following example shows an IAM policy that grants permission to pull images. The repository is specified by <repository-name> in the code given below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "ecr:ListImages",
        "ecr:BatchGetImage"
      ],
      "Resource": [ "<repository-name>" ]
    }
  ]
}

That’s it! You have completed the Databricks Docker Integration.

Conclusion

In this article, you learned about Databricks, Docker, and the steps to set up Databricks Docker Integration. The integration gives developers full control over the system libraries they want to install, and it helps companies deliver faster, more scalable Continuous Integration solutions for businesses. Docker and Databricks are widely used tools that make the development job a lot easier.

Visit our Website to Explore Hevo

Companies need to analyze their business data stored in multiple data sources. The data needs to be loaded to a Data Warehouse to get a holistic view of it. Hevo Data is a No-code Data Pipeline solution that helps you transfer data from 100+ sources to your desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Databricks Docker Integration in the comments section below!

Nicholas Samuel
Technical Content Writer, Hevo Data

Skilled in freelance writing within the data industry, Nicholas is passionate about unraveling the complexities of data integration and data analysis through informative content for those delving deeper into these subjects. He has written more than 150 blogs on databases, processes, and tutorials that help data practitioners solve their day-to-day problems.
