Integrating Elasticsearch and MongoDB: Made Easy

on Data Integration • October 29th, 2020

In this blog, we will discuss ElasticSearch, MongoDB, and how you can connect MongoDB to ElasticSearch to index and search massive datasets.

Hevo, A Simpler Alternative to Integrate your Data for Analysis

Hevo offers a faster way to move data from databases or SaaS applications into your data warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support call.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.

You can try Hevo for free by signing up for a 14-day free trial.

Introduction to ElasticSearch

ElasticSearch is an open-source tool designed to index data and provide near real-time search. It is a distributed search engine capable of indexing very large datasets. The basic concepts of ElasticSearch are NRT (Near Real-Time), Cluster, Node, Index, Type, Document, Shards & Replicas.

ElasticSearch can be used as a search and analytics engine for all types of data: numerical, textual, geospatial, structured, and unstructured. ElasticSearch is generally used in a stack known as ELK (ElasticSearch, LogStash, and Kibana) and is known for its speed, scalability, REST API, and distributed nature.

Use of ElasticSearch

Its distributed nature, speed, scalability, and ability to index any kind of document make ElasticSearch useful in almost any context. Common use cases include:

  1. Application search
  2. Website search
  3. Enterprise search
  4. Logging and log analytics
  5. Infrastructure metrics and container monitoring
  6. Application performance monitoring
  7. Geospatial data analysis and visualization
  8. Security analytics
  9. Business analytics

Introduction to MongoDB

MongoDB is an open-source NoSQL database that uses a document-oriented data model to store data and supports a NoSQL query language to query it. MongoDB is widely used among organizations and is one of the most powerful NoSQL databases in the market.

NoSQL means it does not use the concept of rows and columns to store data; instead, it stores data as documents and maintains collections of documents. Each document consists of a set of key-value pairs, which allows the database to scale horizontally. Documents are stored in a format known as BSON (Binary JSON).
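To make the document model concrete, here is a sketch of the kind of key-value document MongoDB stores (the field names are purely illustrative). JSON is shown for readability; on the wire MongoDB encodes the same structure as BSON:

```python
import json

# An illustrative MongoDB-style document: nested key-value pairs,
# arrays, and mixed types in a single record, with no fixed schema.
person = {
    "_id": "person-001",   # in real MongoDB this would be an ObjectId
    "name": "Jane Doe",
    "age": 34,
    "skills": ["python", "search"],
    "address": {"city": "Berlin", "zip": "10115"},
}

# Round-trip through JSON for readability; MongoDB itself stores BSON,
# a binary encoding of the same key-value structure.
serialized = json.dumps(person)
restored = json.loads(serialized)
print(restored["address"]["city"])  # Berlin
```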

MongoDB allows you to modify schemas without any downtime, and it is highly flexible, letting you combine and store data of multiple types without compromising on powerful indexing options, data access, or validation rules.

Integrate ElasticSearch and MongoDB

MongoDB is used for storage, and ElasticSearch is used to perform full-text indexing over the data. Hence, the combination of MongoDB for storing and ElasticSearch for indexing is a common architecture that many organizations follow.

There are various tools available that you can use to replicate the data from MongoDB to ElasticSearch for indexing. Let’s have a look at some of the top plugins or tools to copy or synchronize data from MongoDB to ElasticSearch.

MongoDB River Plugin

ElasticSearch-River-MongoDB is a plugin used to synchronize the data between ElasticSearch and MongoDB. 

In MongoDB, whenever a document is inserted, updated, or deleted, the operation is recorded in the Operation Log (oplog) collection as a rolling record. The River plugin monitors the oplog collection and automatically syncs those operations with ElasticSearch based on its configuration. Once the data is synced, the corresponding indexes in ElasticSearch are updated automatically.
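The mechanics the plugin relies on can be sketched in a few lines: treat the oplog as an ordered stream of insert/update/delete operations and replay each one against the search index. This is a simplified in-memory model of that idea, not the plugin's actual code:

```python
# Simplified model of oplog-based sync: each oplog entry is replayed,
# in order, against a dict standing in for the ElasticSearch index.
def apply_oplog(index, oplog):
    for op in oplog:
        if op["op"] == "i":                    # insert a new document
            index[op["doc"]["_id"]] = op["doc"]
        elif op["op"] == "u":                  # update: merge changed fields
            index[op["_id"]].update(op["changes"])
        elif op["op"] == "d":                  # delete the document
            index.pop(op["_id"], None)
    return index

oplog = [
    {"op": "i", "doc": {"_id": 1, "name": "Alice"}},
    {"op": "u", "_id": 1, "changes": {"name": "Alice B."}},
    {"op": "i", "doc": {"_id": 2, "name": "Bob"}},
    {"op": "d", "_id": 2},
]
index = apply_oplog({}, oplog)
print(index)  # {1: {'_id': 1, 'name': 'Alice B.'}}
```

Because the oplog is a strictly ordered record of operations, replaying it from any checkpoint always converges the index to the current state of the collection.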


For the source code of River Plugin, you can refer to the GitHub link here – elasticsearch-river-mongodb

Steps to use Mongo River Connector

This plugin requires MongoDB as the source and ElasticSearch as the target to migrate and sync data between these two sources.

  1. To install the plugin, execute the below command from the ElasticSearch installation directory – 
bin/plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.9
  2. Check the compatibility of the connector with the ElasticSearch version here.
  3. Create the indexing river with the below curl syntax – 
curl -XPUT 'http://localhost:9200/_river/mongodb/_meta' -d '{
    "type": "mongodb", 
    "mongodb": { 
      "db": "DATABASE_NAME", 
      "collection": "COLLECTION", 
      "gridfs": true
    }, 
    "index": { 
      "name": "ES_INDEX_NAME", 
      "type": "ES_TYPE_NAME" 
    }
  }'

Example – 

 curl -XPUT 'http://localhost:9200/_river/mongodb/_meta' -d '{ 
    "type": "mongodb", 
    "mongodb": { 
      "db": "testmongo", 
      "collection": "person"
    }, 
    "index": {
      "name": "mongoindex", 
      "type": "person" 
    }
  }'
  4. To view the indexed data in ElasticSearch, open the following URL – 
http://localhost:9200/mongoindex/person/_search?pretty
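Beyond browsing all documents, you can pass a query in the request body of the `_search` endpoint. A basic full-text match query against the synced index might look like the fragment below (the `name` field is illustrative; use a field that exists in your collection):

```json
{
  "query": {
    "match": {
      "name": "john"
    }
  }
}
```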

To know more about MongoDB-River Plugin usage, you can check the Github repo here.

LogStash

LogStash is an open-source tool from the ELK stack used to unify data from multiple sources and normalize it before sending it to one or more destinations. LogStash reads data from a source, modifies it using filters, and then outputs the data to a destination.

As LogStash is part of the ELK stack, it has excellent capabilities to connect with ElasticSearch. You can use LogStash to take input from MongoDB via the MongoDB input plugin and output to ElasticSearch. With the help of filters, you can modify the data in transit to ElasticSearch (if required).


To get started with LogStash and to read more about LogStash, you can look here.

Steps to use LogStash

  1. To connect ElasticSearch and MongoDB via LogStash, you need the “logstash-input-mongodb” input plugin.
  2. Navigate to the LogStash installation directory and run the following commands – 
cd /usr/share/logstash
bin/logstash-plugin install logstash-input-mongodb
  3. Once the installation is successful, create a configuration file that takes MongoDB as input and ElasticSearch as output. A sample configuration file looks like this – 
input {
        mongodb {
                uri => 'mongodb://username:password@xxxx-00-00-nxxxn.mongodb.net:27017/xxxx?ssl=true'
                placeholder_db_dir => '/opt/logstash-mongodb/'
                placeholder_db_name => 'logstash_sqlite.db'
                collection => 'users'
                batch_size => 5000
        }
}
filter {

}
output {
        stdout {
                codec => rubydebug
        }
        elasticsearch {
                action => "index"
                index => "mongo_log_data"
                hosts => ["localhost:9200"]
        }
}
  4. Once the configuration file is successfully set up, execute the below command to start the pipeline.
bin/logstash -f /etc/logstash/conf.d/mongodata.conf
  5. This command starts fetching data from the MongoDB collection and pushes it to ElasticSearch for indexing. In ElasticSearch, an index named “mongo_log_data” will be created.

Mongo Connector

Mongo Connector is an open-source, real-time sync tool from MongoDB Labs, built in Python, that allows you to copy documents from MongoDB to target systems.

On startup, it connects MongoDB to the target systems and copies the data. Afterward, it regularly checks for updates and continuously applies them to the target system to keep everything in sync. Mongo Connector creates a pipeline from a MongoDB cluster to target systems such as ElasticSearch and Solr.

To sync data to ElasticSearch, MongoDB needs to run in replica-set mode. Once the initial sync is complete, the connector tails the MongoDB oplog (Operation Log) to keep everything in sync in real time.


To know more about Mongo Connector, you can look at the official page here – mongo-connector

Steps to use Mongo Connector

  1. Download the ElasticSearch DocManager. A DocManager is a lightweight, simple-to-write class that defines a limited number of CRUD operations for the target system. To download the DocManager for ElasticSearch, follow the guide here – 

Elastic 1.x doc manager: https://github.com/mongodb-labs/elastic-doc-manager

Elastic 2.x doc manager: https://github.com/mongodb-labs/elastic2-doc-manager

  2. Install Mongo Connector based on the ElasticSearch version you’re using. The following matrix will help you pick the correct installation command – 
ElasticSearch Version | Installation Command
Elasticsearch 1.x | pip install 'mongo-connector[elastic]'
Amazon Elasticsearch 1.x Service | pip install 'mongo-connector[elastic-aws]'
Elasticsearch 2.x | pip install 'mongo-connector[elastic2]'
Amazon Elasticsearch 2.x Service | pip install 'mongo-connector[elastic2-aws]'
Elasticsearch 5.x | pip install 'mongo-connector[elastic5]'
  3. Mongo Connector uses the MongoDB oplog to replicate operations, so a replica set must be running before startup. To create a one-node replica set, start mongod with a replica-set name and then run rs.initiate() from the mongo shell – 
mongod --replSet myDevReplSet
rs.initiate()
  4. Once the replica set is up and running, you can invoke the connector as – 
mongo-connector -m <mongodb server hostname>:<replica set port> -t <replication endpoint URL, e.g. http://localhost:8983/es> -d <name of doc manager, e.g., elasticsearch_doc_manager>
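The DocManager mentioned in step 1 is essentially a small class exposing CRUD hooks that Mongo Connector calls for each replicated operation. The sketch below models that shape with an in-memory store; the method names are modeled loosely on the real interface, but this is an illustration, not the actual Elastic doc manager:

```python
# Illustrative DocManager-style class: mongo-connector invokes hooks
# like these for each operation it reads from the MongoDB oplog.
class InMemoryDocManager:
    def __init__(self):
        self.docs = {}

    def upsert(self, doc, namespace, timestamp):
        # Insert or replace a document in the target "index".
        self.docs[doc["_id"]] = doc

    def remove(self, document_id, namespace, timestamp):
        # Delete a document from the target.
        self.docs.pop(document_id, None)

    def search(self, start_ts, end_ts):
        # A real doc manager would query the target system here.
        return list(self.docs.values())

dm = InMemoryDocManager()
dm.upsert({"_id": 1, "name": "Alice"}, "testmongo.person", 0)
dm.remove(1, "testmongo.person", 1)
print(dm.search(0, 2))  # []
```

Because all target-specific logic lives behind these few hooks, swapping ElasticSearch for Solr (or any other system) only requires a different doc manager, not changes to the connector itself.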

To know more about MongoDB-Connector ElasticSearch usage, follow the Github guide here.

Conclusion

In this blog post, we have discussed how easily you can connect ElasticSearch and MongoDB for continuous indexing and searching of the documents. However, if you’re looking for a more straightforward solution, you can use Hevo Data – a No Code Data pipeline that you can use to build an ETL pipeline in an instant. Hevo integrates with 100+ sources including SaaS applications, databases, BI tools, etc.

Sign-up for a free trial here!
