
MongoDB CDC: How to Set Up Real-time Sync

MongoDB is a cross-platform, document-oriented database that belongs to the class of NoSQL databases. NoSQL databases differ from traditional SQL databases in that they do not enforce a strict table structure. This flexibility and a mature querying layer make MongoDB a popular choice for modern application development. In most cases, other databases are also integrated with the application to serve analytics and reporting use cases. As a result, data from MongoDB needs to be relayed to those databases at regular intervals to keep them up to date. In time-critical reporting or analytics use cases, this syncing needs to run in real time. There are also use cases where companion applications need to react immediately to database changes. All these requirements are usually handled through the architecture paradigm called change data capture (CDC). This post covers the basics of setting up MongoDB CDC to copy data to other target databases.

Methods to Set Up Change Data Capture (CDC) in MongoDB

Changes in MongoDB data can be captured in three ways:

  1. Using a Timestamp Column (Manual Approach)
  2. Using MongoDB change stream functionality (Manual Approach)
  3. Using Hevo – A Cloud-based Automated ETL Platform (Automated Approach)

MongoDB CDC: Using a Timestamp Column

The simplest way to implement MongoDB change data capture is to keep a timestamp field in each collection that changes with every insert and update operation. Developers then implement a script that polls the collections and generates the corresponding commands for the target database. In MongoDB, the _id field (an ObjectId) encodes the time at which the document was created, but there is no default mechanism to track updates to a document. To accomplish this, a dedicated timestamp field needs to be created and updated through application logic every time a document changes.
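The polling idea can be sketched in plain JavaScript as follows. This is a minimal illustration, not a complete sync script: the field name `updated_at` and the helper names `buildPollFilter`, `nextWatermark`, and `objectIdCreatedAt` are assumptions made for this example, and the application is assumed to set `updated_at` on every write.

```javascript
// Hypothetical field name: the application must set it on every insert and update.
const TIMESTAMP_FIELD = "updated_at";

// Build the query filter that selects documents changed since the last sync.
function buildPollFilter(lastSyncedAt) {
  return { [TIMESTAMP_FIELD]: { $gt: lastSyncedAt } };
}

// After a batch is processed, advance the watermark to the newest timestamp seen.
// If the batch is empty, the watermark stays where it was.
function nextWatermark(lastSyncedAt, docs) {
  return docs.reduce(
    (latest, doc) => (doc[TIMESTAMP_FIELD] > latest ? doc[TIMESTAMP_FIELD] : latest),
    lastSyncedAt
  );
}

// The first 4 bytes (8 hex characters) of an ObjectId hold the creation time
// in seconds since the Unix epoch. This covers creation only, not updates.
function objectIdCreatedAt(hexId) {
  return new Date(parseInt(hexId.substring(0, 8), 16) * 1000);
}
```

In the mongo shell, the filter would be passed to something like `db.employee.find(buildPollFilter(lastSyncedAt))`, and the returned documents would then be fed to `nextWatermark` before the next polling round. Note that this approach cannot observe deletes, since a deleted document no longer carries a timestamp to poll.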

As you may have already guessed, this is not always practical and creates a large overhead in the application logic. A better method is to use MongoDB’s change stream implementation, which is explained in the next section.

MongoDB CDC: Using Change Streams

Change streams work by listening to the operation log (oplog), which records all writes to the database. Change streams only work if MongoDB is running as a replica set. A replica set is a group of mongod instances that maintain the same data set, which helps applications maintain high availability in case of failures. Replication is accomplished through the oplog, and change streams rely on the same oplog. Let’s begin by enabling a replica set for the Mongo database. Shut down your current MongoDB instance and restart it as a replica set.

  1. To start the MongoDB server as a replica set, use the command below.

    mongod --dbpath ./mongodata/db --replSet "rs"

    ./mongodata/db is the folder in which MongoDB stores data.

  2. Go to the MongoDB Shell and type the below command to start the replication.

    rs.initiate()

    After this step, every write operation in MongoDB will emit a change event. We will need a simple script to listen to these change events and act accordingly. For now, let’s consider a small JavaScript snippet that prints all the changes happening in the database.

  3. First, create a connection to the MongoDB instance. In the example below, organization_db is our MongoDB database and employee is the collection in which the data is stored.

    conn = new Mongo("mongodb://localhost:27017/demo?replicaSet=rs");
    db = conn.getDB("organization_db");
    collection = db.employee;

  4. Create a change stream cursor. This cursor will help us listen to the change streams.

    const changeCursor = collection.watch();
  5. Create a function to get the streams.

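    function getStream(cursor) {
      // Keep polling until the server closes the cursor.
      while (!cursor.isExhausted()) {
        if (cursor.hasNext()) {
          // Print the change event itself, not the cursor object.
          const changedValue = cursor.next();
          print(JSON.stringify(changedValue));
        }
      }
    }

    // Start listening; this call blocks while the cursor stays open.
    getStream(changeCursor);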

    This function listens to the streams and prints the changes in JSON format. For copying data from MongoDB to another database, the developer will need to write code to transform these events to target database commands.
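The transformation step mentioned above could look roughly like the sketch below. It maps insert, update, and delete change events to parameterized SQL statements. The function name `changeEventToSql`, the flat target table with an `id` primary key, and the PostgreSQL-style `$1` placeholders are all assumptions made for this illustration. The event fields it reads (`operationType`, `documentKey`, `fullDocument`, `updateDescription.updatedFields`) are part of MongoDB's change event format.

```javascript
// Translate one change event into a parameterized SQL statement.
// Assumes a hypothetical flat target table whose primary key column is `id`.
// Field names are interpolated as column names without escaping, so this
// sketch is only safe for trusted, flat documents.
function changeEventToSql(event, table) {
  const id = event.documentKey ? String(event.documentKey._id) : null;

  switch (event.operationType) {
    case "insert": {
      // Separate MongoDB's _id from the remaining fields.
      const { _id, ...fields } = event.fullDocument;
      const columns = ["id", ...Object.keys(fields)];
      const placeholders = columns.map((_, i) => `$${i + 1}`);
      return {
        text: `INSERT INTO ${table} (${columns.join(", ")}) VALUES (${placeholders.join(", ")})`,
        values: [id, ...Object.values(fields)],
      };
    }
    case "update": {
      // Only changed fields are applied; removedFields and nested
      // "a.b" paths are ignored in this sketch.
      const updated = event.updateDescription.updatedFields;
      const columns = Object.keys(updated);
      if (columns.length === 0) return null;
      const assignments = columns.map((c, i) => `${c} = $${i + 1}`);
      return {
        text: `UPDATE ${table} SET ${assignments.join(", ")} WHERE id = $${columns.length + 1}`,
        values: [...Object.values(updated), id],
      };
    }
    case "delete":
      return { text: `DELETE FROM ${table} WHERE id = $1`, values: [id] };
    default:
      // Other event types (replace, drop, invalidate, ...) are not handled here.
      return null;
  }
}
```

Each returned object could be handed to a SQL client that accepts a statement text and a values array. A production version would also need to handle nested documents, removed fields, and type mapping between the two databases.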

This is an oversimplified version of how to implement a MongoDB change stream-based synchronization. In reality, there are multiple pain points to using this approach to frequently sync data from MongoDB to other databases. A few of the downsides are mentioned below.

MongoDB CDC Using Custom Code: Limitations

  1. The stream will break if the connection to the MongoDB cluster is lost, which mostly happens during a timeout or network error. In such cases, the script needs logic to resume listening. MongoDB provides a resume token, which is the _id of the last change event received; passing it back through the resumeAfter option when reopening the stream fetches the changes after that point. This logic also needs to be built into the code to make the operation reliable.
  2. Using this custom code approach to copy data to different target databases needs expert developers who have knowledge about both the source and target databases. There will be a learning curve before implementing the script.
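The resume logic described in the first limitation can be sketched as below. To keep the example runnable without a database, the stream is supplied by a caller-provided function `openStream(options)`, which is a hypothetical stand-in that returns an iterable of change events; in a real script it might wrap a call such as `collection.watch([], options)`. The names `consumeWithResume`, `handleEvent`, and `maxRetries` are likewise assumptions for this illustration.

```javascript
// Consume change events and reopen the stream after a failure, resuming
// from the last event that was handled successfully.
function consumeWithResume(openStream, handleEvent, maxRetries = 3) {
  let resumeToken = null; // _id of the last handled change event
  let failures = 0;       // consecutive failures since the last handled event

  for (;;) {
    try {
      // On the first attempt there is no token, so the stream starts from "now".
      const options = resumeToken ? { resumeAfter: resumeToken } : {};
      for (const event of openStream(options)) {
        handleEvent(event);
        resumeToken = event._id; // remember where to resume from
        failures = 0;
      }
      return resumeToken; // the stream ended without an error
    } catch (err) {
      failures += 1;
      if (failures > maxRetries) throw err; // give up after repeated failures
    }
  }
}
```

This sketch keeps the token in memory only. For the sync to survive a restart of the script itself, the token would also need to be persisted, for example alongside the data written to the target database.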

An Easier Approach to Implement MongoDB Change Data Capture

Compared to the above approaches, which involve creating custom scripts, a better way is to use a completely automated cloud-based ETL tool like Hevo Data for MongoDB CDC. This lets developers focus on the application and business logic without thinking about the complexities of implementing a reliable synchronization process. Hevo’s easy-to-use interface helps synchronize data between MongoDB and various target data warehouses (BigQuery, Redshift, Snowflake) and databases (PostgreSQL, MySQL) with minimal time to production.

Change data capture for MongoDB can be achieved in just 3 steps:

  1. Authenticate and connect your MongoDB data source
  2. Select Change Data Capture as your replication mode
  3. Point to the destination where you want to move data

That is all! Hevo will now take care of automatically loading data from MongoDB to your destination in real-time. Additionally, Hevo can handle some of the most complex MongoDB ETL challenges out-of-the-box, making data load from MongoDB a cakewalk for you. Here are the advantages: 

  1. Automatic Schema Handling: MongoDB is a schema-less database. This takes away predictability from the ETL process, as you cannot know in advance which fields will appear in the data. Hevo dynamically detects the schema as it extracts MongoDB documents and updates the previously encountered schema to incorporate new collections and fields. Hevo also creates the required tables and columns in the destination database as and when the schema changes.
  2. Splitting Nested JSON: If you are looking to move data from MongoDB to a data warehouse for analytics, naturally, you might want to split nested JSON and flatten them out. Hevo provides a way for you to handle this easily using a data transformation layer. You can read more about Hevo’s data transformation layer here. 
  3. Detailed Logging and Monitoring: Clearly, the data being moved from MongoDB to your destination is critical for your business. Any leakage or break in the flow of data can cause irretrievable damage. To handle this, Hevo comes with granular activity logs that allow you to monitor the flow of data in real-time.
  4. Timely Notifications: Hevo sends email notifications to the user whenever any incompatibilities are detected in the MongoDB source. This allows users to take the necessary actions in a timely fashion and ensures that the MongoDB data in the destination is always up to date.
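To make the idea of splitting nested JSON (item 2 above) concrete, here is a minimal JavaScript sketch of what flattening a document into column-like keys can look like. It is only an illustration of the concept and is not Hevo's implementation; the function name `flattenDocument` and the underscore separator are assumptions.

```javascript
// Flatten nested objects into a single level, joining key paths with "_".
// Arrays and Date values are kept as-is rather than expanded.
function flattenDocument(doc, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(doc)) {
    const column = prefix ? `${prefix}_${key}` : key;
    const isPlainObject =
      value !== null &&
      typeof value === "object" &&
      !Array.isArray(value) &&
      !(value instanceof Date);
    if (isPlainObject) {
      flattenDocument(value, column, out); // recurse into the nested document
    } else {
      out[column] = value;
    }
  }
  return out;
}
```

For example, a document with a nested `address` object containing a `city` field would yield a flat key named `address_city`, which maps naturally to a warehouse column.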

Closing Note:

The first approach discussed in this blog is practical only in a limited number of scenarios because of the hard requirement to have a timestamp field that changes with every write operation. MongoDB’s change streams provide an excellent way of capturing writes without the developer having to write complex code to listen to the operation log. They abstract all of this away by providing a few easy-to-use functions. Even with the change stream functionality, it is the responsibility of the developer to write code to process the streams and translate them to target database commands. A solution like Hevo goes one step further and allows developers to carry out this process by specifying only a few details about the source MongoDB instance and the target database. Explore the complete feature set of Hevo here.
