Setting Up Databricks GitHub Integration: 2 Easy Methods

on Data Integration, Databricks, Databricks Notebooks, Git, Github, Version Control System • November 22nd, 2021

Companies from every sector use Data Analysis and Big Data to make data-driven business decisions. Large volumes of data flow from the source systems to the Data Warehouse or any Analytics tool to process and generate insights from it. Enterprises need a fast, reliable, scalable, and easy-to-use workspace for Data Engineers, Data Analysts, and Data Scientists. Databricks is a Cloud-based Data Engineering tool that is widely used by companies to process and transform massive quantities of data and explore the data.

When multiple Developers work on the Databricks Notebooks, there is a need for controlling the versions and collaborating efficiently. GitHub is a version control tool used by Developers to keep their Software Development life cycle hassle-free. Databricks GitHub Integration allows Developers to maintain version control of their Databricks Notebooks directly from the notebook workspace.

Databricks GitHub Integration optimizes your workflow and lets Developers access the history panel of notebooks from the UI (User Interface). Multiple Developers working on the same notebook can collaborate and maintain version control. In this article, you will learn the steps to set up Databricks GitHub Integration. You will also read about a few benefits of using Databricks GitHub Integration and how it helps Developers in optimizing their workflows and collaborating with other Developers.

Prerequisites

  • An active Databricks account.
  • An active GitHub account.

Introduction to GitHub

GitHub is a web-based code hosting platform for version control and Software Development collaboration. Microsoft, itself one of the most active users of GitHub, acquired the company for $7.5 billion in 2018. GitHub is widely used by companies, coding communities, individuals, and teams to collaborate with other people and maintain version control of their projects. GitHub is built on Git, an open-source distributed version control system.

Apart from version control, GitHub supports forking, pull requests, issues, branching, and commits, allowing Developers to specify, discuss, and review changes with their teams effectively. GitHub is available both as a SaaS application and as an on-premise version of the software. The Enterprise versions come with a diverse range of third-party apps and services. GitHub provides various integration services for Continuous Integration, code performance, code review automation, Error Monitoring, and Task Management for Project Management.

Key Features of GitHub

GitHub helps Developers maintain version control and boost the Software Development lifecycle. A few features of GitHub are listed below:

  • Integrations: GitHub provides integrations with many 3rd party tools and software to sync data, streamline the workflow and manage projects. It also supports integration with various code editors that allow Developers to manage the repository directly from the editor and commit changes.
  • Code Safety: GitHub uses specialized technologies to find and evaluate flaws in the code. Development teams from all around the world collaborate to safeguard the software supply chain.
  • Version Control: With the help of GitHub, Developers can easily maintain different versions of their code effectively on the Cloud. It eliminates the need to maintain a copy of every project version on local storage. 
  • Skill Showcasing: Developers can create new repositories and upload their projects to showcase their knowledge and experience. This helps companies learn more about a Developer and helps both sides at the time of hiring.

To learn more about GitHub, click here.

Introduction to Databricks 

Databricks is a Data Analytics platform and enterprise software developed by the creators of Apache Spark for Data Engineering, Machine Learning, and Collaborative Data Science. It offers a Workspace environment where Data Engineers, Data Scientists, Business Analysts, and Data Analysts can access all the Databricks assets. Developers use Databricks as a web-based platform to work with Spark. It provides automated cluster management, Collaborative Notebooks, the Machine Learning Runtime, and managed MLflow.

Databricks helps ease the process of data preparation for experimentation and Machine Learning application deployment. It collects data from multiple sources and uses Spark SQL and Spark MLlib for predictive Analytics and valuable insights. Delta Lake is an Open Format Storage Layer offered by Databricks that handles scalable Metadata and unifies streaming and batch data processing. Databricks supports integration with Cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform for easy data flow.

Key Features of Databricks

A few key features of Databricks are listed below:

  • Collaborative Notebooks: Databricks supports many languages such as Python, Scala, R, and SQL that allow users to access data, analyze it, explore and discover new insights. It helps Developers build new Machine Learning models using IPython-style notebooks.
  • Dashboards: Databricks comes with a collection of reports that let you visually access data in tabular or CSV format. The dashboards are fully customizable, which makes the environment user-friendly and boosts the workflow. You can also explore visual insights about your data by changing the parameters of queries.
  • Delta Lake: Databricks provides an Open Format Storage Layer where you can introduce data reliability and scalability to your existing Data Lake. 

To learn more about Databricks, click here.

Ways to Integrate Databricks GitHub

Method 1: Manually Integrating Databricks GitHub

In this method, you learn to set up Databricks GitHub Integration by manually generating an Access Token in GitHub, saving it in Databricks, and linking a Databricks Notebook to a GitHub repo.

Method 2: Setting Up Databricks GitHub Using Hevo Data

A fully managed, No-code Data Pipeline platform like Hevo Data helps you load data from 100+ Data Sources (including 40+ free sources) to Databricks in real-time, in an effortless manner. Hevo, with its minimal learning curve, can be set up in a matter of minutes, so users can start loading data without compromising performance. Its strong integration with various sources such as databases, files, analytics engines, etc. gives users the flexibility to bring in data of all different kinds in a way that’s as smooth as possible, without having to write a single line of code.

Methods to Set Up Databricks GitHub Integration 

Now that you have an overview of Databricks and GitHub, this section walks through the steps to set up Databricks GitHub Integration so that you can use GitHub for version control of your Databricks notebooks. The 2 methods to integrate Databricks GitHub are listed below:

Method 1: Manually Integrating Databricks GitHub

The steps to manually set up Databricks GitHub Integration using Access Token are listed below:

Step 1: Getting an Access Token From GitHub

  • Log in to your GitHub account here.
  • Click on your profile photo at the top right corner of the screen, then click on the “Settings” option.
  • Next, click on “Developer settings” in the side navigation bar.
  • Now, select the “Personal access tokens” option. It will open the GitHub access token settings.
  • Here, click on the “Generate new token” button to create a new personal access token for Databricks GitHub Integration.
  • Enter a description for the access token and set the expiration date according to your convenience.
  • Check the “repo” option under “Select scopes”.
  • Click on the “Generate token” button.
  • Copy the generated access token. GitHub displays it only once, so store it somewhere safe.
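
Before pasting the token into Databricks, it can be worth checking that it works and carries the “repo” scope. The snippet below is a minimal sketch, not part of the official setup. It assumes a classic personal access token (GitHub reports the scopes of classic tokens in the `X-OAuth-Scopes` response header, but not those of fine-grained tokens), and the function names are hypothetical helpers chosen for illustration.

```python
import urllib.request
from typing import Optional, Set

# GitHub REST API endpoint that returns the authenticated user.
GITHUB_API_USER_URL = "https://api.github.com/user"


def build_token_check_request(token: str) -> urllib.request.Request:
    """Build (but do not send) a request authenticated with the token."""
    return urllib.request.Request(
        GITHUB_API_USER_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def parse_scopes(header_value: Optional[str]) -> Set[str]:
    """Split an X-OAuth-Scopes value such as 'repo, gist' into a set."""
    if not header_value:
        return set()
    return {scope.strip() for scope in header_value.split(",") if scope.strip()}


def token_has_repo_scope(token: str) -> bool:
    """Send the request and report whether the 'repo' scope is granted.

    Raises urllib.error.HTTPError (for example 401) if the token is invalid.
    """
    with urllib.request.urlopen(build_token_check_request(token)) as response:
        return "repo" in parse_scopes(response.headers.get("X-OAuth-Scopes"))
```

Calling `token_has_repo_scope("<your token>")` performs one network request to GitHub. The two helper functions above it are pure and can be tried without a network connection.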

Step 2: Saving GitHub Access Token to Databricks

  • Log in to your Databricks account here.
  • Navigate to your workspace, then click on the “Settings” option located at the bottom left of the screen. Then click on the “User Settings” option.
  • Switch to the “Git Integration” tab.
  • If you have previously entered credentials, click on the “Change settings” button.
  • Here, open the “Git Provider” drop-down and select the “GitHub” option.
  • The text field for entering the access token will appear. Paste the GitHub access token into the “Token” field.
  • Next, enter your GitHub username or E-Mail address into the “Git provider username or email” text field.
  • Now, click on the “Save” button.
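
The same credentials can also be stored programmatically. The sketch below is only an illustration and rests on an assumption: that your workspace exposes the Databricks Git Credentials REST API (`POST /api/2.0/git-credentials`) and that you have a Databricks personal access token to authenticate with. The function name and the example workspace URL are hypothetical. The UI steps above remain the method this article describes.

```python
import json
import urllib.request


def build_git_credential_request(
    workspace_url: str,
    databricks_token: str,
    github_username: str,
    github_token: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that stores GitHub credentials.

    Assumes the Git Credentials API is available in the workspace.
    """
    payload = {
        "git_provider": "gitHub",
        "git_username": github_username,
        "personal_access_token": github_token,
    }
    return urllib.request.Request(
        workspace_url.rstrip("/") + "/api/2.0/git-credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keep both tokens out of notebooks and source control, for example by reading them from environment variables.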

Step 3: Linking Notebook to GitHub

  • Go to your Databricks notebook and click on the “Revision History” button located at the top right corner of the notebook. It will open the history panel.
  • You will see the Git status bar displaying “Git: Not linked”.
  • Now, click on the text “Git: Not linked”.
  • The dialog box for Git preferences will open up.
  • Now, click on the “Link” radio option.
  • The “Link” text field will become active. Paste the URL of the GitHub repository, copied from the address bar of your GitHub repo, into it.
  • Then, click on the “Branch” drop-down option and select a branch or type the name of a new branch.
  • Now, in the “Path in Git Repo” text field, provide the path where you want your file in the repository.
  • Then, click on the “Save” button.
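
To see how the three values entered in the dialog (repository URL, branch, and path) fit together, the sketch below composes the address at which the saved notebook should appear on GitHub. It is a hypothetical helper for illustration only. It assumes a repository hosted on github.com and uses GitHub's usual `/blob/<branch>/<path>` URL layout.

```python
from urllib.parse import urlparse


def notebook_url_on_github(repo_url: str, branch: str, path_in_repo: str) -> str:
    """Compose the GitHub URL of a file from repo URL, branch, and path."""
    parsed = urlparse(repo_url)
    if parsed.netloc != "github.com":
        raise ValueError(f"Not a github.com repository URL: {repo_url}")

    # The first two path segments are the owner and the repository name.
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"Expected https://github.com/<owner>/<repo>: {repo_url}")

    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]

    return f"https://github.com/{owner}/{repo}/blob/{branch}/{path_in_repo.lstrip('/')}"
```

For example, with the placeholder values `https://github.com/example-org/example-repo`, `main`, and `notebooks/etl.py`, the helper returns `https://github.com/example-org/example-repo/blob/main/notebooks/etl.py`.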

Method 2: Setting Up Databricks GitHub Using Hevo Data

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Salesforce, Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 30+ free data sources like GitHub), and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline delivers data in real-time from source to destination without any loss. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

That’s it! You have completed the Databricks GitHub Integration.

Benefits of Databricks GitHub Integration

A few benefits of using Databricks GitHub Integration for version control are listed below:

  • Databricks GitHub Integration allows Developers to save their Databricks Notebooks on GitHub from a UI panel in the notebook.
  • Databricks GitHub Integration syncs your history with the Git repo every time the Developer reopens the history panel.
  • Developers can create a new branch or work on any existing branch of the repo from Databricks.

Conclusion 

In this article, you learnt about Databricks, GitHub, and the steps to set up Databricks GitHub Integration. You also read about some of the key benefits of using Databricks GitHub Integration and how it helps Developers collaborate easily on Data Analysis, building new Machine Learning models, or other activities in Databricks Notebooks. Databricks and GitHub are widely used platforms for Data Analysis, team collaboration, and version control. Integrating Databricks with GitHub saves time and optimizes the overall process.

Visit our Website to Explore Hevo

GitHub stores the version history of a project along with other essential information that is useful for companies when analyzed. Hevo Data is a No-code Data Pipeline that can help you transfer data from GitHub for free to your desired Data Warehouse. It fully automates the process to load and transform data from 100+ sources to a destination of your choice without writing a single line of code.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Databricks GitHub Integration in the comments section below!
