MongoDB is a cross-platform, document-oriented database that belongs to the class of NoSQL databases. NoSQL databases differ from traditional SQL databases in that they do not enforce a strict table schema. This flexibility, combined with a mature querying layer, makes MongoDB a popular choice for modern application development.
In most cases, other databases are also integrated with the application to serve analytics and reporting use cases. This leads to a situation where data from MongoDB needs to be relayed to those databases at regular intervals to keep them up to date. In time-critical reporting or analytics use cases, this syncing needs to happen in real time.
There are other use cases as well, where companion applications need to react immediately to database changes. All of these requirements are usually handled through the architecture paradigm called change data capture (CDC). This post covers the basics of setting up MongoDB CDC to copy data to other target databases.
MongoDB CDC: Using a Timestamp Column
The simplest way to implement MongoDB change data capture is to have a timestamp field in collections that changes with every insert and update operation. This requires developers to implement a script that polls the collections and creates the corresponding commands for the target database. In the case of MongoDB, the default _id field (an ObjectId) already encodes the time at which the document was created.
But there is no default mechanism to track updates to a document. To accomplish this, a dedicated timestamp field needs to be created and updated through application logic every time a document changes.
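To make this concrete, below is a minimal polling sketch in mongo shell JavaScript. The updated_at field name, the employee collection, and the checkpoint handling are illustrative assumptions, not MongoDB built-ins.

// Minimal polling sketch. Assumes application logic maintains an
// updated_at Date field on every insert and update (hypothetical name).
let lastSyncTime = ISODate("1970-01-01T00:00:00Z"); // checkpoint; persist this in practice

function pollChanges() {
  const changed = db.employee.find({ updated_at: { $gt: lastSyncTime } })
                             .sort({ updated_at: 1 });
  changed.forEach(doc => {
    // Translate the changed document into a target-database command here.
    print(JSON.stringify(doc));
    lastSyncTime = doc.updated_at; // advance the checkpoint
  });
}

pollChanges(); // run on a schedule (for example, via cron) to pick up new changes

Note that a polling approach like this cannot observe deletes at all, which is one more reason to prefer the change stream approach described below.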
As you may have already guessed, this is not always practical and adds a huge overhead to the application logic. A better method is to use MongoDB's change stream implementation, which is explained in the next section.
MongoDB CDC: Using Change Streams
MongoDB replication is accomplished through an operation log (oplog), and change streams rely on the same oplog. Let's begin the process by enabling a replica set for the MongoDB database. Shut down your current MongoDB instance and restart it with a replica set.
- To start the MongoDB server with a replica set, use the below command.
mongod --dbpath ./mongodata/db --replSet "rs"
Here, ./mongodata/db is the folder in which MongoDB stores its data.
Change streams work by listening to the oplog, which records all writes to the database at the storage level. Change streams only work if a replica set is enabled in MongoDB. A replica set is a group of MongoDB instances that maintain the same data set, providing redundancy and high availability in case of failures.
- Go to the MongoDB shell and run the below command to initiate the replica set.
rs.initiate()
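As an optional sanity check (not part of the original steps), rs.status() reports the replica set state; look for myState: 1, which indicates this node has become the primary.

rs.status() // confirms the replica set is up before relying on change streams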
After this step, every write operation in MongoDB will emit a change event. We will need a simple script to listen to these change streams and act accordingly. For now, let's consider a small JavaScript snippet that prints all the changes happening in the database.
- First, create a connection to the MongoDB instance. In the below example, organization_db is our MongoDB database and employee is the collection in which the data is stored.
conn = new Mongo("mongodb://localhost:27017/organization_db?replicaSet=rs");
db = conn.getDB("organization_db");
collection = db.employee;
- Create a change stream cursor. This cursor will help us listen to the change streams.
const changeCursor = collection.watch();
- Create a function to get the streams.
function getStream() {
  while (!changeCursor.isExhausted()) {
    if (changeCursor.hasNext()) {
      const changedValue = changeCursor.next(); // the next change event
      print(JSON.stringify(changedValue));      // print the event itself, not the cursor
    }
  }
}
getStream();
This function listens to the stream and prints each change in JSON format. To copy data from MongoDB to another database, the developer needs to write code that transforms these events into target-database commands.
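As an illustration of that translation step, the sketch below maps change events onto rough SQL statements for a hypothetical target table named employee. The quoting, escaping, and type mapping are deliberately simplified; a real pipeline would use the target database's driver and proper parameter binding.

// Illustrative only: map a change stream event to a rough SQL statement.
function toTargetCommand(event) {
  const id = JSON.stringify(event.documentKey._id);
  switch (event.operationType) {
    case "insert": {
      const doc = event.fullDocument; // the complete inserted document
      const cols = Object.keys(doc).join(", ");
      const vals = Object.values(doc).map(v => JSON.stringify(v)).join(", ");
      return `INSERT INTO employee (${cols}) VALUES (${vals})`;
    }
    case "update": {
      // updateDescription.updatedFields lists only the fields that changed
      const sets = Object.entries(event.updateDescription.updatedFields)
        .map(([k, v]) => `${k} = ${JSON.stringify(v)}`)
        .join(", ");
      return `UPDATE employee SET ${sets} WHERE _id = ${id}`;
    }
    case "delete":
      return `DELETE FROM employee WHERE _id = ${id}`;
    default:
      return null; // replace, drop, invalidate, etc. need their own handling
  }
}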
This is an oversimplified version of how to implement a MongoDB change stream-based synchronization. In reality, there are multiple pain points to using this approach to frequently sync data from MongoDB to other databases. A few of the downsides are mentioned below.
MongoDB CDC Using Custom Code: Limitations
- The stream will break if the connection to the MongoDB cluster is lost. This happens mostly during a timeout or network error. In such cases, the script needs logic to resume listening. MongoDB provides a resume token, which is the _id of the last change event received, and watch() accepts it to fetch the changes after that point; see the sketch after this list. This logic also needs to be built into the code to make the operation reliable.
- Using this custom-code approach to copy data to different target databases requires expert developers who know both the source and target databases. There is also a learning curve before a reliable MongoDB CDC script can be implemented.
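Here is a minimal resume sketch, assuming the same collection handle as above. The resumeAfter option is a real watch() parameter; in a production pipeline the token would be persisted durably (for example, to a file or a checkpoint table) rather than kept in memory.

// Re-open the change stream from the last processed event after a failure.
let resumeToken = null; // persist this durably in a real pipeline

function listenWithResume() {
  const options = resumeToken ? { resumeAfter: resumeToken } : {};
  const cursor = collection.watch([], options);
  while (!cursor.isExhausted()) {
    if (cursor.hasNext()) {
      const event = cursor.next();
      resumeToken = event._id; // the resume token for this event
      print(JSON.stringify(event));
    }
  }
}

// After a connection error, calling listenWithResume() again resumes
// from resumeToken instead of missing the intervening changes.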
Compared to the above approaches, which involve writing custom scripts, a better way is to use a completely automated cloud-based ETL tool like Hevo Data for MongoDB CDC. This lets developers focus on the application and business logic without worrying about the complexities of implementing a reliable synchronization process. Hevo's easy-to-use interface helps synchronize data between MongoDB and various target data warehouses (BigQuery, Redshift, Snowflake) and databases (PostgreSQL, MySQL) with the least time to production.
Change data capture for MongoDB can be achieved in just 3 steps:
- Authenticate and connect your MongoDB data source
- Select Change Data Capture as your replication mode
- Point to the destination where you want to move data
That is all! Hevo will now take care of automatically loading data from MongoDB to your destination in real-time. Additionally, Hevo can handle some of the most complex MongoDB ETL challenges out-of-the-box, making data load from MongoDB a cakewalk for you. Here are the advantages:
- Automatic Schema Handling: MongoDB is a schema-less database. This takes away the predictability from the ETL process, as you would not know the concrete number of fields that will come in the data. Hevo dynamically detects the schema as it extracts MongoDB documents and updates the previously encountered schema to incorporate new collections and fields. Hevo also creates the required tables and columns in the destination database as and when the schema changes.
- Splitting Nested JSON: If you are looking to move data from MongoDB to a data warehouse for analytics, you will naturally want to split nested JSON documents and flatten them out. Hevo provides a way to handle this easily using a data transformation layer. You can read more about Hevo's data transformation layer here.
- Detailed Logging and Monitoring: Clearly, the data being moved from MongoDB to your destination is critical for your business. Any leakage or break in the flow of data can cause irretrievable damage. To handle this, Hevo comes with granular activity logs that allow you to monitor the flow of data in real-time.
- Timely Notifications: Hevo sends email notifications to the user whenever any incompatibilities are detected in the MongoDB source. This allows users to take the necessary actions in a timely fashion and ensures that the MongoDB data in the destination is always up to date.
Closing Note
The first approach to MongoDB CDC discussed in this post is practical only in a limited number of scenarios because of the hard requirement of a timestamp field that changes with every write operation. MongoDB's change streams provide an excellent way of capturing writes without the developer having to write complex code to listen to the operation log; they abstract all of this away behind a few easy-to-use functions.
Even with the change stream functionality, it is the developer's responsibility to write code that processes the streams and translates them into target-database commands. A solution like Hevo goes one step further and lets developers carry out this process by specifying only a few details about the source MongoDB instance and the target database. Explore the complete feature set of Hevo here.
Hevo Data, with its strong integration with 150+ sources & BI tools, allows you to not only export data from sources & load data into destinations, but also transform & enrich your data & make it analysis-ready, so that you can focus only on your key business needs and perform insightful analysis using BI tools. You can easily extract data from MongoDB using Hevo to track the performance of your business and optimize it further to increase revenue.
Give Hevo Data a try and sign up for a 14-day free trial today. Hevo offers plans & pricing for different use cases and business needs; check them out!