Organizations might want to move their data from different sources to a destination of their choice for multiple reasons. For instance, a data warehouse unifies data from disparate sources to provide a single source of truth that enables informed, data-driven decisions. Similarly, a data lake supports all forms of data (structured, semi-structured, and unstructured) and helps businesses make use of their data as and when they require it. In this blog post, we’re going to discuss one such scenario: migrating your data from MongoDB Atlas to Databricks.

MongoDB Atlas is a managed cloud database service for MongoDB, but it may have limitations in terms of scalability and performance for certain workloads. Databricks, on the other hand, offers a highly scalable and distributed data processing environment that can handle large volumes of data and perform complex computations efficiently. If an organization’s data and analytics requirements have grown exponentially, it may opt for a MongoDB Atlas to Databricks integration.

On that note, let’s dive right into the article and look at three simple ways to move data from MongoDB to Databricks.

Method 1: Using the MongoDB Connector for Spark in Your Databricks Cluster

Step 1: Install the MongoDB Connector for Spark in your Databricks cluster–

  • Head to the Libraries page in your Databricks workspace.
  • Click on the Install New button.
  • Select Maven as the source.
  • Enter org.mongodb.spark in the Group ID field.
  • Enter mongo-spark-connector_2.12 in the Artifact ID field.
  • In the Version field, enter the version of the MongoDB Connector for Spark that you want to install (see the example coordinate below).
  • Finally, hit the Install button to complete the installation.
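
For reference, the full Maven coordinate of the connector looks something like this; the version shown here is only an example, so pick one that matches your cluster’s Spark and Scala versions.

org.mongodb.spark:mongo-spark-connector_2.12:3.0.1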

Step 2: Once the MongoDB Connector for Spark is installed, create a new notebook in Databricks and import these libraries. Note that the connector’s Scala package is com.mongodb.spark, even though its Maven group ID is org.mongodb.spark.

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.functions._

Step 3: Next, create the connection string for your MongoDB Atlas cluster. You can copy it from the Connect dialog in the Atlas UI.
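
As an illustration, the connection string will look roughly like the placeholder below; the username, password, and cluster host are stand-ins for your own values.

val connectionString = "mongodb+srv://<username>:<password>@<cluster-host>.mongodb.net/"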

Step 4: Once the connection string is in place, use the following code to load the data from MongoDB into a Spark DataFrame. Replace myDatabase and myCollection with the database and collection you want to read; spark refers to the SparkSession that Databricks notebooks provide by default.

val readConfig = ReadConfig(Map("uri" -> connectionString, "database" -> "myDatabase", "collection" -> "myCollection"))
val df = MongoSpark.load(spark, readConfig)

Step 5: Execute the following code to write the DataFrame to a table in Databricks.

df.write.saveAsTable("myTable")
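
Keep in mind that saveAsTable fails by default if the table already exists. If you expect to rerun the notebook, you may want to specify a save mode explicitly, for example:

df.write.mode("overwrite").saveAsTable("myTable")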

Step 6: Finally, run the notebook, and your MongoDB Atlas and Databricks data will be integrated.

Method 2: Using Kafka to Build an In-House Data Pipeline

Follow these steps to build an in-house pipeline that migrates your MongoDB data to Databricks using Apache Kafka–

Step 1: First, you’ll need to create a Kafka topic.
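
You can create the topic with Kafka’s AdminClient (or with the kafka-topics CLI). Here is a minimal sketch, assuming a broker running at localhost:9092 as in the pipeline code further below; the partition count and replication factor are illustrative values.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

val adminProps = new Properties()
adminProps.put("bootstrap.servers", "localhost:9092")
val admin = AdminClient.create(adminProps)

// Create "my-topic" with 1 partition and a replication factor of 1.
admin.createTopics(Collections.singletonList(new NewTopic("my-topic", 1, 1.toShort))).all().get()
admin.close()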

Step 2: Next, write data to the Kafka topic from MongoDB Atlas after configuring a Kafka producer.

Step 3: You’ll also need to configure a Kafka consumer to read data from the topic and then write it to Databricks.

Step 4: Finally, start the producer and the consumer.

Your pipeline is now up and running, and data will be continuously streamed from MongoDB Atlas to Databricks.

You can use code along the lines of the Scala sketch below to build a pipeline with Kafka and stream your data from MongoDB Atlas to Databricks. It assumes that collection is a MongoCollection[Document] handle obtained from the MongoDB Java driver, and that writeToDatabricks is a placeholder for your own sink logic (a sample implementation follows the code)–

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import scala.collection.JavaConverters._

// The Kafka topic that carries the MongoDB documents.
val topic = "my-topic"

// Configure a Kafka producer to write data to the topic from MongoDB Atlas.
val producerProps = new Properties()
producerProps.put("bootstrap.servers", "localhost:9092")
producerProps.put("key.serializer", classOf[StringSerializer].getName)
producerProps.put("value.serializer", classOf[StringSerializer].getName)
val producer = new KafkaProducer[String, String](producerProps)

// Configure a Kafka consumer to read data from the topic and write it to Databricks.
val consumerProps = new Properties()
consumerProps.put("bootstrap.servers", "localhost:9092")
consumerProps.put("group.id", "mongodb-to-databricks")
consumerProps.put("key.deserializer", classOf[StringDeserializer].getName)
consumerProps.put("value.deserializer", classOf[StringDeserializer].getName)
val consumer = new KafkaConsumer[String, String](consumerProps)
consumer.subscribe(Collections.singletonList(topic))

// Write each document from MongoDB Atlas to the topic as JSON.
// `collection` is a MongoCollection[Document] obtained from the MongoDB Java driver.
for (document <- collection.find().asScala) {
  producer.send(new ProducerRecord[String, String](topic, document.toJson))
}

// Read data from the topic and write it to Databricks.
// `writeToDatabricks` is a placeholder for your sink logic (see the sketch after this block).
while (true) {
  val records = consumer.poll(Duration.ofMillis(100))
  for (record <- records.asScala) {
    writeToDatabricks(record.value)
  }
}
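
The writeToDatabricks helper above is not a real API; it stands in for whatever sink you choose, and in production you would more likely use Spark Structured Streaming with its Kafka source. Purely as an illustration, assuming the consumer loop runs where a SparkSession named spark is available (for example, a Databricks notebook), a very simple implementation could append each JSON record to a Delta table. Define it before starting the consumer loop; the table name here is hypothetical.

// Hypothetical sink: parse one JSON string and append it to a Delta table.
def writeToDatabricks(json: String): Unit = {
  import spark.implicits._
  spark.read.json(Seq(json).toDS()).write.format("delta").mode("append").saveAsTable("mongodb_events")
}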

You must have realized by now that these manual methods of moving data from MongoDB Atlas to Databricks are quite cumbersome and require a lot of engineering bandwidth. Opting for a fully-managed, automated data pipeline can save your team’s time and effort as well as your business’s expenses. On that note, let’s check out this simpler, time-saving method in the next section.

Method 3: Using an Automated Data Pipeline

Using third-party data pipelines, like Hevo Data, you can migrate your MongoDB Atlas data to Databricks seamlessly, and without writing a single line of code.

The benefits of this method are many–

  • Fully-managed: You do not have to write a custom ETL script on your own or waste any engineering bandwidth to set up an in-house pipeline.
  • Easy Data Transformation Options: Automated data pipelines like Hevo offer a drag-and-drop console for data transformations. For those who would like to perform complex transformations, there is a Python console as well.
  • Near Real-time Data Replication: One of the more important perks of using an automated data pipeline is that it streams data in near real-time and with very low latency.
  • Automated Schema Management: Automated data pipelines use the auto schema mapping feature to detect and handle the incoming source schema and create the same in the destination.
  • Highly Scalable: Automated data pipelines are highly scalable and handle your organization’s growing volumes of data without compromising data security and integrity.

Hevo Data provides all these features and makes it easy for you to replicate data from MongoDB Atlas to Databricks in no time.

Step 1: Configure MongoDB Atlas as Your Source

Step 2: Configure Databricks as Your Destination

And that is all! Within minutes, your pipeline will be set up and automatically start streaming data in near real-time.

What Can You Achieve by Replicating Data from MongoDB Atlas to Databricks?

  • If an organization wants to leverage the advanced analytics features of Databricks on their MongoDB data, migrating this data to Databricks is the best option.
  • Databricks offers seamless integration with various data sources. If an organization needs to combine their MongoDB data with data from other sources, and perform complex transformations, migrating to Databricks can enable better integration and data consolidation.
  • Databricks has a rich ecosystem of tools and libraries for data analytics, including support for popular programming languages like Python, R, and Scala. Organizations that want to take advantage of this ecosystem might choose to migrate their MongoDB Atlas data to Databricks.

Conclusion

The feature-rich Databricks platform makes it a great choice for organizations that want to unify their data under one roof and make prompt data-driven decisions. And using an automated data pipeline takes away the stress of managing an in-house pipeline or executing a custom ETL script without errors.

You can enjoy a smooth ride with Hevo Data’s 150+ plug-and-play integrations (including 50+ free sources), such as MongoDB Atlas to Databricks. Hevo Data is helping many customers make data-driven decisions through its no-code data pipeline solution for several such integrations.

Hevo Data’s pre-load data transformations for connecting MongoDB Atlas to Databricks save countless hours of manual data cleaning and standardizing, getting the job done in minutes via a simple drag-and-drop interface or your custom Python scripts. There’s no need to rely on Databricks for post-load transformations, either: you can simply run complex SQL transformations from the comfort of Hevo Data’s interface and get your data into its final, analysis-ready form. Sign up for a 14-day free trial today and check out our unbeatable pricing to choose a plan that fits your requirements best.

Former Content Marketing Specialist, Hevo Data

Anwesha is experienced in curating content and executing content marketing strategies through a data-driven approach. She has more than 5 years of experience in writing about ML, AI, and data science.
