Setting up Debezium MongoDB Connector: 3 Critical Aspects

on Data Integration, Database Management Systems, Debezium, MongoDB, Tutorials • January 28th, 2022


Databases are the backbone of applications in the digital world, so it is essential to record every update made to them, whether an insertion, a deletion, or a modification of a record, and use that information to build mission-critical applications. Many applications use Debezium to keep track of these changes. Debezium is an Open-Source Distributed Platform that uses the CDC approach. CDC stands for Change Data Capture, which allows Real-Time Replication of the changes in databases such as MongoDB.

MongoDB is a widely used database for Unstructured Data in several Organizations. You can track the modifications in your MongoDB database by setting up the Debezium MongoDB Connector. It continuously records and checks MongoDB Replica Sets or the MongoDB Sharded Cluster for changes in databases.

In this article, you will learn how to efficiently set up the Debezium MongoDB Connector for your MongoDB Databases.


What is MongoDB?


MongoDB is a Document-Oriented NoSQL Database that can store Unstructured Data like images, text, audio, video, and more. Since data is stored as documents of key-value pairs, queries in MongoDB are very fast thanks to simple indexing. Today, MongoDB is a popular alternative to the traditional approach of Relational Database Management Systems, driven by the growth of unstructured data in the digital world.

What is Debezium and Debezium MongoDB Connector?


Debezium is an Open-Source Distributed Platform created mainly to stream events. It is a CDC (Change Data Capture) tool that helps transform traditional Databases into Event Streams. Built on top of Kafka, Debezium is one of the most popular solutions for tracking changes in databases so that other applications can run Real-Time tasks. In Debezium, changes are recorded at the row level (at the document level for MongoDB), and events are generated from these changes. To record such changes, Debezium relies on various connectors, one for each type of database being tracked.

The Debezium MongoDB Connector monitors MongoDB Replica Sets or MongoDB Sharded Clusters for changes in databases. A Replica Set is a group of MongoDB processes that maintain copies of the same data, while a Sharded Cluster distributes MongoDB data across different servers. The changes in these Replica Sets or Sharded Clusters are stored as events in a Kafka topic, which applications can consume individually. Because the data is replicated across different servers, MongoDB provides a high level of fault tolerance if any database server is lost.
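For a feel of what lands in the Kafka topic, here is a heavily simplified sketch of a Debezium MongoDB change event payload. The field values are illustrative only, and the real envelope carries additional schema and metadata fields; note that for the MongoDB connector, `after` is the changed document serialized as a JSON string:

```json
{
  "payload": {
    "after": "{\"_id\": 1456, \"first_name\": \"Sarah\"}",
    "source": {
      "connector": "mongodb",
      "name": "dbserver1",
      "rs": "debezium",
      "db": "inventory",
      "collection": "customers"
    },
    "op": "c",
    "ts_ms": 1643328000000
  }
}
```

The `op` field encodes the kind of change (for example, `c` for create), and the `source` block tells consumers which Replica Set, database, and collection the event came from.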

Simplify MongoDB ETL Using Hevo’s No-code Data Pipeline

Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up Data Integration for 100+ Data Sources (Including 40+ Free sources) and will let you directly load data from sources like MongoDB to a Data Warehouse or the Destination of your choice. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data. 

Get Started with Hevo for Free

Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer. 
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Connectors: Hevo supports 100+ Integrations to SaaS platforms such as WordPress, FTP/SFTP, Files, Databases, BI tools, and Native REST API & Webhooks Connectors. It supports various destinations including Data Warehouses such as Google BigQuery, Amazon Redshift, Snowflake, and Firebolt; Amazon S3 Data Lakes; and Databases such as Databricks, MySQL, SQL Server, TokuDB, DynamoDB, MongoDB, and PostgreSQL, to name a few.
  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
  • Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-Day Free Trial!

Why use Debezium over Poll-based CDC?

Debezium is a log-based data capturing tool. The advantage of a log-based change data capturing tool over simply polling for updated records becomes clear when you look at the features a log-based tool such as Debezium provides.

Let’s start with the basic difference between straightforward CDC polling and the log-based data capturing approach. With polling-based (or query-based) CDC, queries need to be run continuously to keep track of new insert or update operations performed on rows of a table. Log-based CDC, on the other hand, captures the changes from the database’s log files (e.g., MongoDB’s oplog).

The top 5 benefits of log-based CDC over polling-based CDC are as follows:

1. All Data Changes are Captured

The database’s log provides a complete list of all data changes in the exact order in which they were applied, so you get the full history of record changes in the database through log-based CDC. With poll-based CDC, intermediary data changes might get missed. For example, a record may change twice between two polls, and the intermediate state would never be captured by the poll-based CDC.
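The difference can be sketched with a tiny shell simulation. This is illustrative only: the file /tmp/cdc_demo_log.txt stands in for a database log such as MongoDB’s oplog, and the record values are made up.

```shell
# Three changes to the same record are appended to a log, as an oplog would record them.
printf 'INSERT id=1 qty=5\nUPDATE id=1 qty=7\nUPDATE id=1 qty=3\n' > /tmp/cdc_demo_log.txt

# A poll only observes the record's current state, i.e. the last write:
tail -n 1 /tmp/cdc_demo_log.txt

# Log-based CDC replays every change in order, including the qty=7 update
# that happened between two polls:
cat /tmp/cdc_demo_log.txt
```

The poll sees only `UPDATE id=1 qty=3`, while the log preserves all three changes, including the intermediate `qty=7` state.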

2. Low Delays of Events While Avoiding CPU Load

Since intermediary updates can be missed with poll-based CDC, increasing the polling frequency might seem like a solution, as it reduces the chance of missing updates. This works to some extent, but polling too frequently causes performance issues because it loads the database with queries. With log-based CDC, you can capture data changes in near real-time without paying the price of the CPU load caused by frequently executing polling queries.

3. No Change in Data Model Required

Polling-based CDC requires some means of recognizing the records that have changed since the last poll. Every captured table needs an indicator column, such as LAST_UPDATE_TIMESTAMP, that keeps track of changed rows. This extra column may be undesirable in some cases, and the overhead of updating the timestamps correctly must be maintained across all tables. Log-based CDC requires no such change to your data model.

4. Deletes can be Captured

With polling, it is not possible to identify records that have been deleted since the last poll. This matters in use cases like replication, where the source database and the replication target must hold the same records: if a record has been deleted in the source database, the corresponding record on the sink side must also be deleted. Log-based CDC captures delete operations just like any other change.

5. Old Record State and Metadata can be Captured

With polling, you’ll only get the current state of a row, whereas log-based CDC, depending on the source database’s capabilities, also lets you leverage the old record state.

Besides old record state access, the log-based approach also allows you to track changes in the schema.

How to set up Debezium MongoDB Connector?

The Debezium MongoDB Connector works with Replica Sets or Sharded Clusters because it relies on MongoDB’s replication mechanism. It does not work with a standalone server, since a standalone MongoDB deployment has no Operation Log (oplog) from which data changes can be read. Therefore, the connector is connected to a Replica Set. A Replica Set consists of primary, secondary, and arbiter nodes: there is exactly one primary node, one or more secondary nodes, and at most one arbiter node in a replica.

The Arbiter Node does not hold any data and only takes part in the replica. The Primary Node receives all write operations and records every change to the databases in its oplog. The Secondary Nodes replicate the primary’s oplog and apply the operations to their own datasets. When the connector starts for the first time, it performs an initial sync, reading a consistent snapshot of all data in the Replica Set. After that, it starts reading the Replica Set’s oplog for events such as inserts, updates, and deletes on records.

As the Debezium MongoDB Connector processes the oplog, it records the position in the oplog where each event originated. If the connector stops, on restart it resumes reading the oplog from the last recorded position. The connector knows about every Replica Set in the deployment, and since each Replica Set has its own independent oplog, it tries to use a separate task for each one. You can limit the maximum number of tasks; if there are fewer tasks than Replica Sets, the connector assigns multiple Replica Sets to each task.

The Debezium MongoDB Connector consists of the following features:

  • At-least-once delivery to Kafka: The Debezium MongoDB Connector ensures that every change event is delivered to the Kafka topic at least once. If the connector restarts, it may re-deliver some records, so duplicate records can appear in the Kafka topic.
  • Supports one task: When you use the Debezium MongoDB source connector, it supports only one task.

To get started with the Debezium MongoDB Connector, you can go through the following aspects:

1) Installing Debezium MongoDB Connector

  • Step 1: Install the Debezium MongoDB Connector from Confluent Hub. To install the latest version of the Connector, go to your Confluent Platform installation directory and run the command given below:
confluent-hub install debezium/debezium-connector-mongodb:latest
  • Step 2: You can also install a specific version by replacing the keyword “latest” with the required version number as shown in the following command:
confluent-hub install debezium/debezium-connector-mongodb:0.9.4

For the complete list of configuration properties, you can refer to MongoDB Configuration Properties.

2) Configuring Replication Mechanism

Since the Debezium MongoDB Connector captures data changes from a Replica Set, you have to start the container with at least one running replica server. If the Replica Set has not been initiated yet, you must initiate it and add servers to it.

You have to start the container with the name of the Replica Set in the REPLICASET environment variable, and link each of its servers under the alias mongoN (where N = 1, 2, 3, etc.), one alias per server in the replica. Follow the simple steps given below to start configuring the Replication Mechanism:

  • Step 1: Use the following command to start the initiator container. This example assumes three MongoDB servers running in containers named data1, data2, and data3.
docker run -it --name mongo-init --rm -e REPLICASET=rs0 --link data1:mongo1 --link data2:mongo2 --link data3:mongo3 debezium/mongo-initiator

After initiating the Replica Set, you have to now initiate the Sharded Cluster.

The MongoDB Sharded Cluster has the following properties:

  • A separate Replica Set that acts as the configuration server of the cluster.
  • One or more shards.
  • One or more mongos routers through which clients connect and send requests to the shards.

To capture changes from a Sharded Cluster, the connector is configured with the host addresses of the configuration server Replica Set.

An initiator container can also add a Replica Set to the cluster as a shard by linking to the MongoDB routers.

Consider three MongoDB servers, shardA1, shardA2, and shardA3, running in containers, and two MongoDB routers, router1 and router2. The initiator will ensure that the three servers are correctly initiated as a Replica Set named shardA and that this Replica Set is added to the cluster as a shard.

  • Step 2: Therefore, use the following command to initiate a replica.
docker run -it --name mongo-init --rm -e REPLICASET=shardA --link shardA1:mongo1 --link shardA2:mongo2 --link shardA3:mongo3 --link router1 --link router2 debezium/mongo-initiator
  • Step 3: To add more shard Replica Sets, run an additional initiator container for each one, as shown in the following command.
docker run -it --name mongo-init --rm -e REPLICASET=shardB --link shardB1:mongo1 --link shardB2:mongo2

When the connector connects to the Replica Set that acts as the configuration server of the Sharded Cluster, it discovers information about each Replica Set used as a shard.

The connector then starts a separate task to capture changes for each Replica Set. If shards are removed from the clusters or new shards are added, the connector automatically updates the task.

3) QuickStart your Debezium MongoDB Connector

After installing the Debezium MongoDB Connector, you can easily Quick Start your connector. In this tutorial, the Confluent Platform is used, so you do not need the Docker images for Kafka and Kafka Connect. Confluent is a Data Streaming Platform that enables you to manage large amounts of data from many sources, and it builds on Kafka’s features to provide high performance.

  • Step 1: To pick up the new connector plugin, you need to restart the local Confluent Connect service. Hence, use the below command.
confluent local services connect stop && confluent local services connect start
Using CONFLUENT_CURRENT: /Users/username/Sandbox/confluent-snapshots/var/confluent.NuZHxXfq
Starting Zookeeper
Zookeeper is [UP]
Starting Kafka
Kafka is [UP]
Starting Schema Registry
Schema Registry is [UP]
Starting Kafka REST
Kafka REST is [UP]
Starting Connect
Connect is [UP]
  • Step 2: To check that the MongoDB plugin has been installed correctly, list the loaded connector plugins and filter for mongodb.
curl -sS localhost:8083/connector-plugins | jq '.[].class' | grep mongodb
"io.debezium.connector.mongodb.MongoDbConnector"
  • Step 3: If you are not using the Confluent Platform, you can run MongoDB using Docker with the below commands.
# Create the MongoDB data directory
mkdir -p path/to/project/data/db

# Pull the Docker image
docker pull mongo

# Run the container, where `mongodb` is the name assigned to the container
docker run --name mongodb -v $(pwd)/data/db:/data/db -p 27017:27017 -d mongo --replSet debezium

# Initialize the MongoDB replica set
docker exec -it mongodb mongo --eval 'rs.initiate({_id: "debezium", members: [{_id: 0, host: "localhost:27017"}]})'

# Start a mongo shell in the running container
docker exec -it mongodb mongo

# Create a user profile
use admin
db.createUser(
  {
    user: "debeziumUser",
    pwd: "dfskskdjf",
    roles: ["dbOwner"]
  }
)

# Insert a record
use inventory
db.customers.insert([
  { _id: 1456, first_name: 'Sarah', last_name: 'Connor', email: 'sarahconnor@sample.com' }
]);

# View records
db.customers.find().pretty();
  • Step 4: To start the Debezium MongoDB connector, create a JSON file named register-mongodb.json and store the following configuration in it.
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "tasks.max": "1",
    "mongodb.hosts": "debezium/localhost:27017",
    "mongodb.name": "dbserver1",
    "mongodb.user": "debeziumUser",
    "mongodb.password": "dfskskdjf"
  }
}
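Kafka Connect rejects malformed configuration files, so it is worth sanity-checking the JSON before POSTing it. A quick sketch using python3’s built-in json.tool (assuming python3 is available on your machine):

```shell
# Write the connector configuration to the register-mongodb.json file used in the next step.
cat > register-mongodb.json <<'EOF'
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "tasks.max": "1",
    "mongodb.hosts": "debezium/localhost:27017",
    "mongodb.name": "dbserver1",
    "mongodb.user": "debeziumUser",
    "mongodb.password": "dfskskdjf"
  }
}
EOF

# Fail fast if the file is not valid JSON.
python3 -m json.tool register-mongodb.json > /dev/null && echo "register-mongodb.json is valid JSON"
```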
  • Step 5: You can Start the Debezium MongoDB Connector with the command given below.
# Start MongoDB connector
curl -i -X POST -H "Accept:application/json" -H  "Content-Type:application/json" http://localhost:8083/connectors/ -d @register-mongodb.json
  • Step 6: Producers are the ones that create or write events to Kafka, whereas consumers are those that request to read these events. To start a Kafka consumer, use the following command.
confluent local services kafka consume dbserver1.inventory.customers --from-beginning
  • Step 7: Go back to the MongoDB shell and run the following queries to Insert or Update records in the database. As you run the queries in MongoDB, the Kafka consumer terminal displays the corresponding change messages.
db.customers.insert([
 { _id : 2401, first_name : 'John', last_name : 'Connor', email : 'johnconnor@sample.com' }
 ]);
  • Step 8: You can Delete the Connector and stop the service with the command given below.
curl -X DELETE localhost:8083/connectors/inventory-connector
confluent local stop
  • Step 9: Stop the MongoDB Container using the command given below.
docker stop mongodb

Which Companies are using Debezium?

A wide range of companies and organizations use Debezium in production. For example, Airwallex, Auto Trader UK, Bolt, Flipkart, GoHealth, Reddit, OYO, Shopify, Zomato, Vimeo, and many other organizations use Debezium.

Conclusion

In this article, you have learned how to easily set up the Debezium MongoDB connector. The Confluent Platform is used here, thereby eliminating the need for installing ZooKeeper and Kafka Connect separately. MongoDB provides redundancy and high availability through Replica Sets and Sharded Clusters, and the Debezium MongoDB connector can capture changes from both. You can use the Docker image or the Confluent Platform to connect Debezium to the MongoDB server.

For holistic-business performance analysis, you need to consolidate Structured, Semi-Structured & Unstructured Data from MongoDB and all the other applications used across your business. As your firm grows, it becomes a time-consuming and resource-intensive task to effectively handle and process exponentially rising data. Ultimately, this requires you to invest a portion of your Engineering Bandwidth to Integrate Data from all sources, Clean & Transform it, and finally, Load it to a Cloud Data Warehouse or a destination of your choice for further Business Analytics. Alternatively, all of these challenges can be comfortably solved by a Cloud-Based ETL tool such as Hevo Data.    

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline can seamlessly transfer data from a vast sea of 100+ sources such as MongoDB, MongoDB Atlas, Confluent Cloud & Kafka to a Data Warehouse or a Destination of your choice to be visualized in a BI Tool. It is a reliable, completely automated, and secure service that doesn’t require you to write any code!  

If you are using MongoDB as your NoSQL Database Management System and searching for a no-fuss alternative to Manual Data Integration, then Hevo can effortlessly automate this for you. Hevo, with its strong integration with 100+ sources & BI tools(Including 40+ Free Sources), allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiffy.

Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify your Data Integration process. Do check out the pricing details to understand which plan fulfills all your business needs.

Tell us about your experience of setting up Debezium MongoDB Connector! Share your thoughts with us in the comments section below.
