Databases are the backbone of all applications in the digital world. Therefore, it is essential to record every minor update to databases for further using the information to build mission-critical applications.

These updates or changes can be insertion or deletion, modification of the record, the search of the record, etc. Many applications use Debezium to keep track of all these changes in databases. Debezium is an Open-Source Distributed Platform that makes use of the CDC approach.

CDC stands for Change Data Capture, which allows the Real-Time Replication of the changes on databases such as MongoDB.

MongoDB is a widely used database for Unstructured Data in several Organizations. You can track the modifications in your MongoDB database by setting up the Debezium MongoDB Connector.

What is MongoDB?

debezium mongodb - mongodb logo
Image Source

MongoDB is an Unstructured Database that can store data like images, text, audio, video, and more. Since the data is stored in keys and values, fetching queries in MongoDB is very fast due to simple indexing.

Today, MongoDB is a perfect alternative to the traditional approach of Relational Database Management Systems due to the increase in unstructured data in the digital world. 

What is Debezium and Debezium MongoDB Connector?

debezium mongodb - debezium logo
Image Source

Debezium is an Open-Source Distributed Platform created mainly to stream events. It is a CDC(Change Data Capture) tool that helps transform traditional Databases into Event Streams. Built on top of Kafka, Debezium is one of the most popular solutions for tracking changes in databases to enable other applications to run Real-Time tasks.

In Debezium, the changes are recorded at the row level, and events are generated from these changes. To record such changes, Debezium relies on various connectors based on the databases being tracked. 

The Debezium MongoDB Connector monitors MongoDB Replica Sets or the MongoDB Sharded Cluster for changes in databases. Replica Sets are the MongoDB processes that maintain the same data as the MongoDB databases.

And Sharded Clusters are different servers through which MongoDB data is distributed. The changes in these Replica Sets or Sharded Clusters are created and stored as events in Kafka topic, which applications can use individually. Due to different servers in the MongoDB database, replication provides a high level of fault tolerance if any database server is lost.

Why use Debezium over Poll-based CDC?

Debezium is a log-based data capturing tool. The advantage of using a log-based change data capturing tool over simply polling for updated records can be understood by looking at the advantages and features a log-based data capturing tool such as Debezium provides. 

Let’s start with the basic difference between straightforward CDC polling and the log-based data capturing approach. With polling-based(or query-based) CDC queries need to be run continuously to keep track of new insert or update operations performed on rows of the table. On the other hand, Log-based CDC captures the changes in the database’s log files ( e.g. MongoDB’s op log).

The top 5 benefits of log-based CDC over polling-based CDC are as follows:

1. All Data Changes are Captured

  • The database’s log provides you with all data changes and a complete list in the exact order of application.
  • You get the complete history of record changes in the database through log-based CDC. With poll-based CDC intermediary data changes might get missed.
  • For example, a change in the record may happen between two polls and this record would not be captured by the poll-based CDC.

2. Low Delays of Events While Avoiding CPU Load

  • Now you understand that with poll-based CDC intermediary updates might get missed so improving polling speed can be the solution. Increasing the frequency of polling seeks to reduce the possibility of missing intermediary updates.
  • This method works to some extent but polling too frequently results in performance issues as it loads the database with queries.
  • With Log-based CDC you can capture the data changes in near real-time without settling up the price of CPU load on executing polling queries frequently.

3. No Change in Data Model Required

  • To recognize the records that have been updated since the last poll, the polling-based CDC requires some measures.
  • All the captured tables in the database require to have some indicator column like LAST_UPDATE_TIMESTAMP which keeps track of changed rows.
  • This is an extra column that might be undesirable in some cases. An overhead to update timestamps correctly needs to be maintained in all the tables.

 4. Deletes can be Captured

  • In polling, identification of any records that have been deleted since the last poll is not possible.
  • This often happens in use cases like replication where the source database and the replication target require the same record.
  • This means that if a record has been deleted in the source database, then the record on the sink side must also be deleted.

5. Old Record State and Metadata can be Captured

With polling, you’ll get the data of current row stats only, whereas based on the source database’s capabilities, Log-based CDC allows you to leverage the old record state as well.

Besides old record state access, the log-based approach also allows you to track changes in the schema.

Connect MongoDB to BigQuery
Connect MongoDB to Snowflake
Connect Google Analytics to Redshift

How to set up Debezium MongoDB Connector?

It work with the set of replicas or clusters as it uses the replication mechanism. It does not work on a standalone server due to the absence of an Operation Log (oplog) in MongoDB for monitoring data changes in databases.

Therefore, the connector is connected with the Replica Set. The Replica Set consists of primary, secondary, and arbitrary nodes. However, there is only one primary node, many secondary nodes, and one arbitrary node in replica.

The Arbitrary Node does not hold any data and is only a part of the replica. But, the Primary Node is used to receive all the write operations, and it also stores all the changes to databases in the oplog. And the Secondary Node replicates the primary’s oplog and stores all operations in the primary node to its dataset.

When the connector starts, it will perform the initial sync of all data in the Replica Sets. Initial sync moves data from one to other nodes in a replica. As a result, it will start reading the Replica Set’s oplog with events like insert, update, and delete on the records.

When the Debezium MongoDB Connector processes the oplog it will record the position in the oplog where the event originated. If the connector stops, it will restart from the same position where it had read the oplog last.

The Debezium MongoDB connector has every detail of the Replica Set. Since each replica has its independent oplog, it will try to use a separate task. The Debezium MongoDB connector can limit the maximum number of tasks for every replica. If the tasks are fewer, the connector will assign multiple replicas to each task.

It Consists of the following features:

  • At least one delivery to Kafka: The Debezium MongoDB Connector ensures that the records are delivered to the Kafka topic. Whenever there are some duplicate records in the Kafka topic, the connector will restart.
  • Supports one task: When you use the Debezium MongoDB source connector, it supports only one task.

To get started with the Debezium MongoDB Connector, you can go through the following aspects:

1) Installing Debezium MongoDB Connector

  • Step 1: Install the Debezium MongoDB Connector. You can get the Confluent Platform through the Confluent directory. To install the latest version of the Connector, you must go to the Confluent Platform installation directory and run the command given below:
confluent-hub install debezium/debezium-connector-mongodb:latest
  • Step 2: You can also install a specific version by replacing the keyword “latest” with the required version number as shown in the following command:
confluent-hub install debezium/debezium-connector-mongodb:0.9.4

For the complete list of the configuration property, you can refer to MongoDB Configuration Property.

2) Configuring Replication Mechanism

Since the Debezium MongoDB Connector can capture data changes from one Replica Set, you have to start the container with at least one running replica server to initiate the replica. If the replica is not initiated, you have to initiate the replica and add servers to it.

You have to start the container with the name of the Replica Set in the environment variable. Here, you use the name MONGO_n (where n=1,2,3, etc.), where ‘n’ is the number of servers connected to the replica. Follow the simple steps given below to start configuring the Replication Mechanism:

  • Step 1: Use the following command to start the container.
docker run -it --name mongo-init --rm -e REPLICASET=rs0 --link data1:mongo1 --link data2:mongo2 --link data3:mongo3 debezium/mongo-initiator

After initiating the Replica Set, you have to now initiate the Sharded Cluster.

The MongoDB Sharded Cluster has the following properties:

  • The separate replica acts as a configuration server of the cluster.
  • One or more shards.
  • One or more mongos or routers through which the client can connect and request the shards.
  • The Sharded Cluster consists of the configured connector with the host address of the configuration server Replica Set.

The container can be added as a shard between two MongoDB routers in the Replica Set.

Consider three MongoDB servers, shard1, shard2, shard3,  running in a container, and two MongoDB routers as router1 and router2. All three routers will ensure that they are correctly initiated as the Replica Set named shardA is added to the container as a shard.

  • Step 2: Therefore, use the following command to initiate a replica.
docker run -it --name mongo-init --rm -e REPLICASET=shardA --link shardA1:mongo1 --link shardA2:mongo2 --link shardA3:mongo3 --link router1 --link router2 debezium/mongo-initiator
  • Step 3: To add more shard Replica Sets, you need to run more containers. You can run it with the following command.
docker run -it --name mongo-init --rm -e REPLICASET=shardB --link shardB1:mongo1 --link shardB2:mongo2

When the connector is connected to the above Replica Set, it acts as a configuration server for a Sharded Cluster. It also discovers information about each Replica Set used as a shard.

The connector then starts a separate task to capture changes for each Replica Set. If shards are removed from the clusters or new shards are added, the connector automatically updates the task.

3) QuickStart your Debezium MongoDB Connector

After installing the Debezium MongoDB Connector you can easily Quick Start your connector. In this tutorial, you do not need the Docker images as the Confluent Platform is used here.

Confluent is a Data Streaming Platform that enables you to manage large amounts of data from many sources, and it has Kafka’s features that provide high performance. 

  • Step 1: To add a new connector plugin, you require Restarting the Local Confluent Connect. Hence, use the below command.
confluent local services connect stop && confluent local services connect start
Using CONFLUENT_CURRENT: /Users/username/Sandbox/confluent-snapshots/var/confluent.NuZHxXfq
Starting Zookeeper
Zookeeper is [UP]
Starting Kafka
Kafka is [UP]
Starting Schema Registry
Schema Registry is [UP]
Starting Kafka REST
Kafka REST is [UP]
Starting Connect
Connect is [UP]
  • Step 2: To check if the MongoDB plugin has been installed correctly, you have to use the plugin loader.
curl -sS localhost:8083/connector-plugins | jq '.[].class' | grep mongodb
"io.debezium.connector.mongodb.MongoDbConnector"
  • Step 3: If you are not using the Confluent Platform, you can Download MongoDB using Docker with the below commands
# Create the MongoDB data directory
mkdir -p path/to/project/data/db

# Pull the Docker image
docker pull mongo

# Run the container, where `mongodb` is the name assigned to the conatiner
docker run --name mongodb -v $(pwd)/data/db:/data/db -p 27017:27017 -d mongo  --replSet debezium

# Start a new bash process in the running container
docker exec -it mongodb bash

# Start the mongo process
mongo

# Initialize MongoDB replica set

docker exec -it mongodb mongo --eval 'rs.initiate({_id: "debezium", members:[{_id: 0, host: "localhost:27017"}]})'

# Create a user profile
use admin
db.createUser(
{
user: "debeziumUser",
pwd: "dfskskdjf",
roles: ["dbOwner"]
}
)

# Insert a record
use inventory
db.customers.insert([
{ _id : 1456, first_name : 'Sarah', last_name : 'Connor', email : 'sarahconnor@sample.com' }
]);

# View records
db.customers.find().pretty();
  • Step 4: To start the Debezium MongoDB connector, you need to Create a JSON file and store the following configuration.
{
 "name": "inventory-connector",
 "config": {
     "connector.class" : "io.debezium.connector.mongodb.MongoDbConnector",
     "tasks.max" : "1",
     "mongodb.hosts" : "debezium/localhost:27017",
     "mongodb.name" : "dbserver1",
     "mongodb.user" : "debeziumUser",
     "mongodb.password" : "dfskskdjf",
     }
 }
  • Step 5: You can Start the Debezium MongoDB Connector with the command given below.
# Start MongoDB connector
curl -i -X POST -H "Accept:application/json" -H  "Content-Type:application/json" http://localhost:8083/connectors/ -d @register-mongodb.json
  • Step 6: Producers are the ones that create or write events for Kafka, whereas consumers are those that request to read these events. To Start Kafka Consumer, use the following command.
confluent local services kafka consume dbserver1.inventory.customers --from-beginning
  • Step 7: Start the MongoDB Server and run the following queries to Insert or Update records in the database. When you write the queries here in MongoDB, the Kafka consumer terminal displays the messages of these queries.
db.customers.insert([
 { _id : 2401, first_name : 'John', last_name : 'Connor', email : 'johnconnor@sample.com' }
 ]);
  • Step 8: You can Delete the Connector and stop the service with the command given below.
curl -X DELETE localhost:8083/connectors/inventory-connector
confluent local stop
  • Step 9: Stop the MongoDB Container using the command given below.
docker stop mongodb

Which Companies are using Debezium?

A wide range of companies and organizations are using Debezium in their production. For example, companies such as Airwallex, Auto Trader UK, Bolt, Flipkart, GoHealt, Reddit, OYO, Shopify, Zomato, Vimeo, and many other organizations are using Debezium.

Learn More About:

Conclusion

In this article, you have learned how to set up the Debezium MongoDB connector easily. The Confluent Platform is used here, thereby eliminating the need for installing ZooKeeper and Kafka Connect.

Debezium MongoDB connector provides high redundancy and high availability using Sharded Clusters. You can use the Docker image or the Confluent platform to connect Debezium to the MongoDB server.

For holistic-business performance analysis, you need to consolidate Structured, Semi-Structured & Unstructured Data from MongoDB and all the other applications used across your business.

Looking for a better way to manage your work? Get started with a free Hevo trial. Try for free

Manjiri Gaikwad
Technical Content Writer, Hevo Data

Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.