As a data engineer, you hold all the cards to make data easily accessible to your business teams. Your team just requested a MongoDB to Databricks connection on priority. We know you don’t wanna keep your data scientists and business analysts waiting to get critical business insights. As the most direct approach, you can go straight for the MongoDB connector. Or, hunt for a no-code tool that fully automates & manages data integration for you while you focus on your core objectives.
Well, look no further. With this article, get a step-by-step guide to connecting MongoDB to Databricks effectively and quickly, delivering data to your marketing team.
Replicate Data from MongoDB to Databricks Using MongoDB Connector
To start replicating data from MongoDB to Databricks, you need to use the MongoDB connector, which is built on the Scala driver.
- Step 1: Create a Databricks cluster and add the MongoDB connector to it as a library. Once the cluster is created, open its Libraries page, click the Install New option, and select Maven as the library source. Based on your Databricks runtime version, enter the matching MongoDB Connector for Spark package coordinates:
- org.mongodb.spark:mongo-spark-connector_2.12:3.0.0: for Databricks runtime 7.0.0 and above.
- org.mongodb.spark:mongo-spark-connector_2.11:2.3.4: for Databricks runtime 5.5 LTS and 6.x.
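The version mapping above can be written down as a small lookup. This is only a sketch: `ConnectorCoordinates` and `forRuntime` are hypothetical names made up for illustration, not part of Databricks or the MongoDB connector, and the helper only knows the two coordinates listed in this article.

```scala
object ConnectorCoordinates {
  // Maven coordinates exactly as listed in this article.
  val ForRuntime7Plus = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.0"
  val ForRuntime5And6 = "org.mongodb.spark:mongo-spark-connector_2.11:2.3.4"

  // Matches strings such as "7.3", "7.0.0" or "5.5 LTS" and captures major and minor.
  private val VersionPattern = """^(\d+)\.(\d+).*""".r

  // Hypothetical helper: returns the coordinate for a runtime version string,
  // or None when the version is not covered by the two cases above.
  def forRuntime(runtimeVersion: String): Option[String] =
    runtimeVersion.trim match {
      case VersionPattern(major, minor) =>
        (major.toInt, minor.toInt) match {
          case (m, _) if m >= 7 => Some(ForRuntime7Plus)
          case (6, _)           => Some(ForRuntime5And6)
          case (5, n) if n >= 5 => Some(ForRuntime5And6)
          case _                => None
        }
      case _ => None
    }
}
```

Calling `ConnectorCoordinates.forRuntime("5.5 LTS")` yields the Scala 2.11 coordinate. A runtime older than 5.5 returns `None`, because this article does not list a package for it.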
- Step 2: Upload your data to a new MongoDB Atlas instance. To do so, create a new MongoDB Atlas cluster and select basic settings such as the cloud provider and region. After creating the instance, add a database user and select the type of authorization. Next, whitelist the external IP addresses of your Databricks cluster so that it can access your MongoDB Atlas data. You can now upload your data to the MongoDB Atlas instance.
- Step 3: Configure the Databricks cluster with your MongoDB connection URI. You can retrieve the connection URI from the MongoDB Atlas UI: click the Connect button and then the Connect Your Application option. Since Databricks is built on the Spark engine and Spark is written in Scala, select the Scala driver, version 2.2 or above. Copy the connection string that Atlas displays.
On the cluster details page for your Databricks cluster, select the Configuration tab and click the Edit button. Under Advanced Options, select the Spark configuration tab and update the Spark Config using the connection string you copied in the previous step:
spark.mongodb.output.uri <connection-string>
spark.mongodb.input.uri <connection-string>
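If you would rather generate the two Spark Config entries than type them by hand, a minimal sketch could look like the following. `MongoSparkConfig` and `sparkConfigLines` are hypothetical names used only for illustration. The prefix check assumes the standard `mongodb://` and `mongodb+srv://` URI schemes and does nothing beyond catching obvious copy-paste mistakes.

```scala
object MongoSparkConfig {
  // Hypothetical helper: builds the two Spark Config entries shown above,
  // one per line, from the connection string copied out of Atlas.
  def sparkConfigLines(connectionString: String): Seq[String] = {
    val uri = connectionString.trim
    // Basic sanity check only; it does not validate credentials or hosts.
    require(
      uri.startsWith("mongodb://") || uri.startsWith("mongodb+srv://"),
      "expected a connection string starting with mongodb:// or mongodb+srv://"
    )
    Seq(
      s"spark.mongodb.output.uri $uri",
      s"spark.mongodb.input.uri $uri"
    )
  }
}
```

Passing your own connection string to `MongoSparkConfig.sparkConfigLines` returns the two lines to paste into the Spark Config box. Any other kind of string fails with an `IllegalArgumentException`.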
- Step 4: There are a number of ways in which the Spark Connector can be configured to read from MongoDB. This example uses an options map. Let’s say there is a “cars” collection included in the “manufacturer” database.
import com.mongodb.spark._

// Read the "cars" collection of the "manufacturer" database into a DataFrame.
val cars = spark.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("database", "manufacturer")
  .option("collection", "cars")
  .load()

// Notebook output:
// cars: org.apache.spark.sql.DataFrame = [_id: struct<oid: string>, carId: int ... 3 more fields]
One of MongoDB’s most attractive features is its flexible schema, which lets you store documents with a variety of different structures in the same collection. As a result, a collection has no predefined schema. Since DataFrames and Datasets require a schema, the Spark Connector randomly samples documents from the collection to infer one, and the inferred schema is then assigned to the DataFrame.
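To build intuition for sampling-based inference, here is a toy illustration in plain Scala. It is not the connector’s actual algorithm. Documents are modelled as simple maps, and the names `SchemaInferenceSketch`, `inferSchema`, `sampleSize` and `seed` are all assumptions made for this sketch, as is the rule that conflicting types fall back to "string".

```scala
object SchemaInferenceSketch {
  // Assumption for this sketch: a document is a flat field-name -> value map.
  type Doc = Map[String, Any]

  // Map a runtime value to a coarse type name.
  def typeName(value: Any): String = value match {
    case _: Int     => "int"
    case _: Long    => "long"
    case _: Double  => "double"
    case _: Boolean => "boolean"
    case _: String  => "string"
    case _          => "unknown"
  }

  // Toy rule: when two sampled documents disagree on a field's type, use "string".
  def merge(a: String, b: String): String = if (a == b) a else "string"

  // Infer field -> type from a random sample of the documents.
  def inferSchema(docs: Seq[Doc], sampleSize: Int, seed: Long): Map[String, String] = {
    val sample = new scala.util.Random(seed).shuffle(docs).take(sampleSize)
    sample.foldLeft(Map.empty[String, String]) { (schema, doc) =>
      doc.foldLeft(schema) { case (acc, (field, value)) =>
        val t = typeName(value)
        acc.updated(field, acc.get(field).map(merge(_, t)).getOrElse(t))
      }
    }
  }
}
```

In this toy, a field that appears only in documents outside the sample is missing from the inferred schema.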
If you know your document structure, you can assign the Schema explicitly and avoid the need for sampling queries, which is a nice convenience. The following example shows how to define a DataFrame’s schema using a case class.
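// Describe the expected document structure as a case class.
case class Car(carId: Int, carRating: Double, timestamp: Long)

import spark.implicits._

// Convert the DataFrame into a typed Dataset[Car].
val carsDS = cars.as[Car]
carsDS.cache()
carsDS.show()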
Using the MongoDB connector is a great way to replicate data from MongoDB to Databricks effectively. It is optimal for the following scenarios:
- You need access to advanced functional programming concepts such as higher-kinded types, path-dependent types, type classes, currying, and multiple parameter lists.
- You want to automate data workflows with custom scripts that spell out how to complete each stage of the workflow, as with the MongoDB connector in this scenario. Anyone proficient in the chosen programming language can run these scripts.
In the following scenarios, using the MongoDB connector might be cumbersome and not a wise choice:
- Pipeline Management: Managing data pipelines across several environments (development, staging, production, etc.) can become expensive. A pipeline needs regular maintenance, its settings need to be kept up to date, and the data must stay in sync.
- Time Consuming: If you plan to export your data frequently, the connector method might not be the best choice, since creating instances and clusters, writing custom queries, and mapping and uploading the data all take time.
How about you focus on more productive tasks than repeatedly writing custom ETL scripts? This sounds good, right?
In these cases, you can...
Automate the Data Replication process using a No-Code Tool
You can use automated pipelines to avoid such challenges. Here are the benefits of leveraging a no-code tool:
- Automated pipelines allow you to focus on core engineering objectives while your business teams can directly work on reporting without any delays or data dependency on you.
- Automated pipelines provide a beginner-friendly UI that frees up the engineering team’s bandwidth from tedious data preparation tasks.
For instance, here’s how Hevo, a cloud-based ETL tool, makes MongoDB to Databricks data replication ridiculously easy:
Step 1: Configure MongoDB as a Source
Authenticate and Configure your MongoDB Source.
Step 2: Configure Databricks as a Destination
In the next step, we will configure Databricks as the destination.
Step 3: All Done to Set Up Your ETL Pipeline
Once your MongoDB to Databricks ETL pipeline is configured, Hevo collects new and updated data from MongoDB at the configured pipeline frequency and replicates it into Databricks. Depending on your needs, you can adjust the pipeline frequency within the limits listed below.
Data Replication Frequency
|Default Pipeline Frequency|Minimum Pipeline Frequency|Maximum Pipeline Frequency|Custom Frequency Range (Hrs)|
|---|---|---|---|
|1 Hr|15 Mins|24 Hrs|1-24|
In a matter of minutes, you can complete this No-Code & automated approach of connecting MongoDB to Databricks using Hevo and start analyzing your data.
Hevo offers 150+ plug-and-play connectors (including 40+ free sources). It efficiently replicates your data from MongoDB to Databricks, databases, data warehouses, or a destination of your choice in a completely hassle-free and automated manner. Hevo’s fault-tolerant architecture ensures that the data is handled securely and consistently with zero data loss. It also enriches the data and transforms it into an analysis-ready form without you having to write a single line of code.
Hevo’s reliable data pipeline platform enables you to set up zero-code and zero-maintenance data pipelines that just work. Here’s what allows Hevo to stand out in the marketplace:
- Fully Managed: You don’t need to dedicate time to building your pipelines. With Hevo’s dashboard, you can monitor all the processes in your pipeline, thus giving you complete control over it.
- Data Transformation: Hevo provides a simple interface to cleanse, modify, and transform your data through drag-and-drop features and Python scripts. It can accommodate multiple use cases with its pre-load and post-load transformation capabilities.
- Faster Insight Generation: Hevo offers near real-time data replication, so you have access to real-time insight generation and faster decision-making.
- Schema Management: With Hevo’s auto schema mapping feature, all your mappings are automatically detected and mapped to the destination schema.
- Scalable Infrastructure: With the increase in the number of sources and volume of data, Hevo can automatically scale horizontally, handling millions of records per minute with minimal latency.
- Transparent pricing: You can select your pricing plan based on your requirements. Different plans are clearly put together on its website, along with all the features it supports. You can adjust your credit limits and spend notifications for any increased data flow.
- Live Support: The support team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Take our 14-day free trial to experience a better way to manage data pipelines. Get started for Free with Hevo!
What Can You Achieve by Migrating Your Data from MongoDB to Databricks?
Here’s a little something for the data analyst on your team. We’ve mentioned a few core insights you could get by replicating data from MongoDB to Databricks. Does your use case make the list?
- Aggregating data on individual product interactions for any event.
- Finding the customer journey within the product.
- Integrating transactional data from different functional groups (sales, marketing, product, human resources) and finding answers. For example:
  - Which development features were responsible for an app outage in a given duration?
  - Which product categories on your website were most profitable?
  - How does the failure rate in individual assembly units affect inventory turnover?
Summing It Up
The MongoDB connector is the right path for you when your team needs data from MongoDB only once in a while. However, as the data demands of your product or marketing channels grow, a custom ETL solution becomes necessary. You can free your engineering bandwidth from these repetitive and resource-intensive tasks by choosing from Hevo’s 150+ plug-and-play integrations. Visit our Website to Explore Hevo
Saving countless hours of manual data cleaning and standardizing, Hevo’s pre-load data transformations get it done in minutes via a simple drag-and-drop interface or your custom Python scripts. No need to go to your data warehouse for post-load transformations. You can simply run complex SQL transformations from the comfort of Hevo’s interface and get your data in the final analysis-ready form.
Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.
Share your experience of replicating data from MongoDB to Databricks! Let us know in the comments section below!