How to Set Up a MongoDB Aggregation Pipeline? | A Comprehensive Guide 101

on Data Aggregation, Database Management Systems, MongoDB • April 14th, 2022

MongoDB is a general-purpose NoSQL database that stores its data in JSON-like documents. These documents provide a flexible and dynamic schema while maintaining simplicity, unlike relational databases, which store data in tables with fixed schemas. Since NoSQL databases can scale horizontally, they are suitable for real-time and big data applications.

This article talks about the different steps you can follow to set up a MongoDB Aggregation Pipeline seamlessly. It also gives a brief introduction to MongoDB and its key features, and covers the operators and stages of the Aggregation Pipeline, along with examples, limitations, and best practices.

What is MongoDB?

MongoDB is a NoSQL open-source document-oriented database developed for storing and processing high volumes of data. Compared to conventional relational databases, MongoDB uses collections and documents instead of tables consisting of rows and columns. Collections consist of several documents, and documents are the basic units of data, made up of key-value pairs.

Introduced in February 2009, the MongoDB database is designed, maintained, and managed by MongoDB, Inc. under the SSPL (Server Side Public License). Organizations such as Facebook, Nokia, eBay, Adobe, and Google use it for efficiently handling and storing their exponentially growing data. It offers drivers and libraries for programming languages such as C, C++, C#, Go, Java, Node.js, Perl, PHP, Python, Ruby, Scala, and Swift, including Motor (an asynchronous Python driver) and Mongoid (an object-document mapper for Ruby).

Key Features of MongoDB

With constant efforts from the online community, MongoDB has evolved over the years. Some of its eye-catching features are:

  • High Data Availability & Stability: MongoDB’s Replication Feature provides multiple servers for disaster recovery and backup. Since several servers store the same data or shards of data, MongoDB provides greater data availability & stability. This ensures all-time data access and security in case of server crashes, service interruptions, or even good old hardware failure. 
  • Accelerated Analytics: Ad-hoc queries may need to consider thousands to millions of variables. MongoDB indexes BSON documents and uses the MongoDB Query Language (MQL), which lets you adjust ad-hoc queries in real time. MongoDB supports field queries, range queries, and regular expression searches, along with user-defined functions.
  • Indexing: MongoDB offers a wide range of index types, including indexes with language-specific sort orders, that support complex access patterns to datasets and help queries perform well. For evolving query patterns and application requirements, indexes can also be created on demand.
  • Horizontal Scalability: With the help of Sharding, MongoDB provides horizontal scalability by distributing data across multiple servers using the Shard Key. Each shard in a MongoDB Cluster stores part of the data, thereby acting as a separate database. This collection of databases allows efficient handling of growing volumes of data. Queries are routed by mongos instances, which direct each query to the correct shard based on the Shard Key.
  • Load Balancing: Real-time Replication and Sharding contribute toward large-scale Load Balancing. Ensuring top-notch Concurrency Controls and Locking Protocols, MongoDB can effectively handle multiple concurrent read and write requests for the same data.  
  • Aggregation: Similar to the SQL Group By clause, MongoDB can easily batch process data and present a single result even after executing several other operations on the group data. MongoDB’s Aggregation framework consists of 3 types of aggregations i.e. Aggregation Pipeline, Map-Reduce Function, and Single-Purpose Aggregation methods.

Replicate Data From MongoDB in Minutes Using Hevo’s No-Code Data Pipeline

Hevo Data, a Fully-managed Data Pipeline platform, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 100+ Data Sources (including 40+ free sources) such as MongoDB straight into your Data Warehouse or any Databases. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!

Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!

What is the MongoDB Aggregation Pipeline?

When dealing with a database management system, any time you extract data from the database you need to execute an operation called a query. However, queries only return the data that already exists in the database. Therefore, to analyze your data to zero in on patterns or other information about the data – instead of the data itself – you’ll often need to perform another kind of operation called an aggregation.

MongoDB allows you to perform aggregation operations through a mechanism called MongoDB Aggregation Pipelines. These are essentially built as a sequential series of declarative data operations called stages.

Each stage can then inspect and transform the documents as they pass through the pipeline, putting the transformed data results into the subsequent stages for further processing. Documents from a chosen collection get into a pipeline and go through each stage, where the output coming from each stage becomes the input for the next stage, and the final result is obtained at the end of the pipeline.

Stages can help you perform operations like:

  • Sorting: You can reorder the documents based on a chosen field.
  • Filtering: This operation resembles queries, where the list of documents can be narrowed down through a set of criteria. 
  • Grouping: With this operation, you can process multiple documents together to generate a summarized result. 
  • Transforming: Transforming refers to the ability to modify the structure of documents. This means you can rename or remove certain fields, or perhaps group or rename fields within an embedded document for legibility.    
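As a rough illustration of the four operations above, the sketch below builds one pipeline that filters, groups, sorts, and transforms. The "orders" collection and its fields (status, customer, amount) are hypothetical and not taken from this article. The pipeline is only sent to the server when the code runs inside the mongo shell, where `db` is defined.

```javascript
// Hypothetical "orders" collection; the field names are illustrative only.
const pipeline = [
  { $match: { status: "shipped" } },                            // filtering
  { $group: { _id: "$customer", total: { $sum: "$amount" } } }, // grouping
  { $sort: { total: -1 } },                                     // sorting
  { $project: { _id: 0, customer: "$_id", total: 1 } }          // transforming
];

// Each stage is an object with exactly one key: the stage name.
const stageNames = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageNames.join(" -> "));

// Inside the mongo shell, `db` exists and the pipeline can be executed.
if (typeof db !== "undefined") {
  printjson(db.orders.aggregate(pipeline).toArray());
}
```

The order of the array is the order of execution, so each stage only sees what the previous stage emitted.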

What are the Operators in MongoDB Aggregation Pipeline?

MongoDB provides you with an exhaustive list of operators that you can use across various aggregation stages. Each of these operators can be used to construct expressions for use in the aggregation pipeline stages. Operator expressions are similar to functions that use arguments. Generally, these expressions use an array of arguments and have the following format:

{ <operator> : [ <argument1>, <argument2>, ... ] }

However, if an operator accepts a single argument, you can omit the array and use the following format:

{ <operator> : <argument> }
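To make the two formats concrete, here is a toy evaluator written in plain JavaScript. It is only a sketch of how an operator expression can be read, not MongoDB's implementation, and it handles just three operators ($add, $multiply, $toUpper). The sample document and the `evaluate` helper are hypothetical.

```javascript
// Toy evaluator for three operators; NOT MongoDB's implementation.
const operators = {
  $add: (args) => args.reduce((sum, n) => sum + n, 0),
  $multiply: (args) => args.reduce((product, n) => product * n, 1),
  $toUpper: (args) => String(args[0]).toUpperCase()
};

function evaluate(expr, doc) {
  // Strings starting with "$" are field paths that read from the document.
  if (typeof expr === "string" && expr.startsWith("$")) {
    return doc[expr.slice(1)];
  }
  // Numbers and plain strings are literals.
  if (expr === null || typeof expr !== "object") {
    return expr;
  }
  const [name] = Object.keys(expr);
  const raw = expr[name];
  // A single argument may be written without the surrounding array.
  const args = Array.isArray(raw) ? raw : [raw];
  return operators[name](args.map((arg) => evaluate(arg, doc)));
}

const doc = { brand: "apple", price: 100, qty: 3 };
const arrayForm = evaluate({ $multiply: ["$price", "$qty"] }, doc);
const singleForm = evaluate({ $toUpper: "$brand" }, doc);
const nested = evaluate({ $add: [{ $multiply: ["$price", "$qty"] }, 5] }, doc);
console.log(arrayForm, singleForm, nested);
```

The nested call shows that an argument can itself be an operator expression, which is how larger expressions are composed.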

Here are a few different operators you can choose from:

  • Comparison Expression Operators: These return a boolean, except for $cmp, which returns a number.
  • Arithmetic Expression Operators: These operators will perform mathematical operations on numbers. 
  • Array Expression Operators: With Array Expression Operators, you can perform operations on arrays.
  • Boolean Expression Operators: These operators evaluate their argument expressions as booleans and return a boolean as a result. 
  • Literal Expression Operators: The $literal operator returns a value as-is, without parsing it as an expression.
  • Conditional Expression Operators: With Conditional Expression Operators, you can build conditional statements.
  • Custom Aggregation Expression Operators: You can use custom aggregation expression operators to define custom aggregation functions.
  • Object Expression Operators: These allow you to merge or split documents. 
  • Date Expression Operators: Date expression operators return date components or objects of a given date object.
  • Text Expression Operators: These operators allow you to access per-document metadata related to an aggregation.
  • String Expression Operators: These operators work on strings. Most of them have well-defined behavior only for strings of ASCII characters.
  • Trigonometry Expression Operators: These operators can perform trigonometric operations on numbers.
  • Type Expression Operators: You can use these operators to perform operations on the data type.
  • Variable Expression Operators: These operators can define variables for use within the scope of a subexpression and return the result of that subexpression.
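The sketch below combines operators from a few of these categories in a single $project stage. It assumes documents with title, author, likes, and tags fields, as in the "posts" examples later in this article. The output field names (isPopular, doubledLikes, and so on) are made up for illustration.

```javascript
// Field names on the right-hand side are assumed to exist in the documents.
const projectStage = {
  $project: {
    isPopular: { $gt: ["$likes", 5] },               // comparison
    doubledLikes: { $multiply: ["$likes", 2] },      // arithmetic
    tagCount: { $size: "$tags" },                    // array
    label: {                                         // conditional
      $cond: [{ $gte: ["$likes", 5] }, "popular", "normal"]
    },
    byline: { $concat: ["$title", " by ", "$author"] } // string
  }
};

const computedFields = Object.keys(projectStage.$project);
console.log(computedFields);

// Inside the mongo shell, `db` exists and the stage can be executed.
if (typeof db !== "undefined") {
  printjson(db.posts.aggregate([projectStage]).toArray());
}
```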

7 Key MongoDB Aggregation Pipeline Stages

Every stage of the MongoDB Aggregation Pipeline transforms the documents as they pass through it. However, a stage does not necessarily produce exactly one output document for each input document: some stages produce several documents, while others filter documents out.

MongoDB offers the db.collection.aggregate() method in the mongo shell, along with the aggregate database command, to run an aggregation pipeline. A stage can appear multiple times within a pipeline, with the exception of the $out, $merge, and $geoNear stages.

$match

This MongoDB Aggregation Pipeline stage filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. For every input document, the output is either zero documents (no match) or one document (a match).

$group

With this MongoDB Aggregation Pipeline stage, you can group input documents by a specified identifier expression and apply the accumulator expressions, if any, to every group. $group consumes all input documents and outputs one document for each distinct group. The output documents contain only the identifier field and, if specified, the accumulated fields.

$project

This MongoDB Aggregation Pipeline stage can reshape every document in the stream, for instance, by adding new fields or getting rid of existing fields. For every input document, the output is one document.
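As a small sketch of this reshaping, the stage below drops _id, keeps title, and exposes likes under a new name. The sample documents and the likeCount field name are hypothetical. The `map` call is a plain-JavaScript imitation of the stage's effect, so the example runs without a server.

```javascript
// Hypothetical sample documents.
const posts = [
  { _id: 1, title: "my first blog", author: "John", likes: 4 },
  { _id: 2, title: "hello city", author: "Ruth", likes: 3 }
];

// Drop _id, keep title, and add likeCount computed from likes.
const projectStage = { $project: { _id: 0, title: 1, likeCount: "$likes" } };

// Plain-JavaScript imitation of what the stage does to each document.
const projected = posts.map((doc) => ({
  title: doc.title,
  likeCount: doc.likes
}));
console.log(projected);

// Inside the mongo shell, `db` exists and the stage can be executed.
if (typeof db !== "undefined") {
  printjson(db.posts.aggregate([projectStage]).toArray());
}
```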

$sort

With $sort, you can reorder the document stream by a specified sort key. The documents themselves are unmodified; only their order changes. For every input document, the output for this MongoDB Aggregation Pipeline stage is a single document.

$skip

$skip allows you to skip the first n documents, where n is the specified skip number, and passes the remaining documents unmodified to the pipeline. For every input document, the output for this MongoDB Aggregation Pipeline stage is either zero documents (for the first n documents) or one document (after the first n documents).

$limit

This MongoDB Aggregation Pipeline stage allows you to pass the first n documents unmodified to the pipeline, where n is the specified limit. For every input document, the output is either one document (for the first n documents) or zero documents (after the first n documents).
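A common way to combine $sort, $skip, and $limit is pagination. The sketch below assumes a page size of 2 and asks for the second page. The sample documents, `pageSize`, and `pageNumber` are hypothetical, and the array operations merely imitate the three stages in plain JavaScript.

```javascript
// Hypothetical sample documents.
const posts = [
  { title: "a", likes: 4 },
  { title: "b", likes: 7 },
  { title: "c", likes: 3 },
  { title: "d", likes: 9 },
  { title: "e", likes: 1 }
];

const pageSize = 2;
const pageNumber = 2; // 1-based

const pipeline = [
  { $sort: { likes: -1 } },                // most liked first
  { $skip: (pageNumber - 1) * pageSize },  // drop the earlier pages
  { $limit: pageSize }                     // keep one page
];

// Plain-JavaScript imitation of the three stages.
const sorted = [...posts].sort((a, b) => b.likes - a.likes);
const page = sorted
  .slice(pipeline[1].$skip)
  .slice(0, pipeline[2].$limit);
console.log(page);

// Inside the mongo shell, `db` exists and the pipeline can be executed.
if (typeof db !== "undefined") {
  printjson(db.posts.aggregate(pipeline).toArray());
}
```

Sorting comes first because $skip and $limit only make sense for pagination when the order is fixed.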

$unwind

This MongoDB Aggregation Pipeline stage deconstructs an array field from the input documents and outputs one document for every element. Every output document contains the same fields as the input, but the array field is replaced by a single element value. For every input document, $unwind outputs n documents, where n is the number of array elements, which can be zero for an empty array.
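The following sketch imitates $unwind with `flatMap` in plain JavaScript, under the default behavior described above. The sample documents, including the one with an empty tags array, are hypothetical.

```javascript
// Hypothetical sample documents; the last one has an empty array.
const posts = [
  { title: "my first blog", tags: ["angular", "react", "python"] },
  { title: "hello city", tags: ["vue", "react"] },
  { title: "untagged draft", tags: [] }
];

const unwindStage = { $unwind: "$tags" };

// One output document per array element; an empty array yields none.
const unwound = posts.flatMap((doc) =>
  doc.tags.map((tag) => ({ ...doc, tags: tag }))
);
console.log(unwound);

// Inside the mongo shell, `db` exists and the stage can be executed.
if (typeof db !== "undefined") {
  printjson(db.posts.aggregate([unwindStage]).toArray());
}
```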

For more information on MongoDB Aggregation Pipeline stages, you can give MongoDB Aggregation Pipeline Stages a read. 

What Makes Hevo’s ETL Process Best-In-Class

Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo’s automated, No-code platform empowers you with everything you need to have for a smooth data replication experience.

  • Exceptional Security: A Fault-tolerant Architecture that ensures secure, consistent access with Zero Data Loss.
  • Built to Scale: Exceptional Horizontal Scalability with Minimal Latency for Modern-data Needs.
  • Built-in Connectors: Support for 100+ Data Sources, including Databases, SaaS Platforms, Files & More. Native Webhooks & REST API Connector available for Custom Sources.
  • Data Transformations: Best-in-class & Native Support for Complex Data Transformation at your fingertips. Code & No-code Flexibility designed for everyone.
  • Smooth Schema Mapping: Fully-managed Automated Schema Management for incoming data with the desired destination.
  • Blazing-fast Setup: Straightforward interface for new customers to work on, with minimal setup time.

How to Run the MongoDB Aggregation Pipeline?

Step 1: Setting up the Connection

  • First, you need to open a connection to the database by using the command ‘mongo’ (‘mongosh’ in newer MongoDB versions). When you see ‘>’ in the prompt, you are ready to run database commands.
MongoDB Aggregation Pipeline: Running Pipeline Step 1

Step 2: Creating Database

  • You can first create a test database ‘testdb’ by leveraging the ‘use’ command.
MongoDB Aggregation Pipeline: Creating a Test Database
  • If the database already exists, the above command switches to it; otherwise, MongoDB creates the database when you first store data in it.
  • You can now create a collection called ‘products’ inside this database by using the ‘createCollection’ method.
MongoDB Aggregation Pipeline: Creating a Collection
  • Now, you can insert the test documents into the collection with the ‘insertMany’ method.
MongoDB Aggregation Pipeline: Inserting Test Documents
  • The output confirms that the documents have been inserted successfully. You can also check the documents in the collection with the ‘find’ method.
MongoDB Aggregation Pipeline: Find Command for Pipeline
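The steps above could look roughly like the sketch below. The sample documents are hypothetical: their names and prices are invented, and only the brand, price, and available fields are implied by the walkthrough. The database calls run only inside the mongo shell, where `db` is defined.

```javascript
// Hypothetical sample documents; names and prices are invented.
const sampleProducts = [
  { name: "Phone A", brand: "Samsung", price: 800, available: true },
  { name: "Tablet A", brand: "Samsung", price: 400, available: true },
  { name: "Phone B", brand: "Apple", price: 900, available: true },
  { name: "Laptop B", brand: "Apple", price: 1000, available: false }
];
console.log(sampleProducts.length + " documents prepared");

if (typeof db !== "undefined") {
  // `use testdb` is a shell helper; getSiblingDB is its scriptable form.
  const testdb = db.getSiblingDB("testdb");
  // Fails if the collection already exists.
  testdb.createCollection("products");
  printjson(testdb.products.insertMany(sampleProducts));
  printjson(testdb.products.find().toArray());
}
```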

Step 3: Creation of Aggregation Pipeline

  • In the above collection, say you want to find out the total sales amount for Samsung and Apple. In the first stage, use $match to keep only the documents whose available field is true. In the second stage, use $group to group the remaining documents by brand and calculate the total price for each brand with $sum.
MongoDB Aggregation Pipeline: Creating Aggregation Pipeline Group
  • You can add one more stage, $sort, to order the totals from the highest to the lowest. In $sort, 1 refers to ascending order and -1 refers to descending order.
MongoDB Aggregation Pipeline: Sorting the Price
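A minimal sketch of the three-stage pipeline described above follows. The sample documents are hypothetical (invented names and prices), the available field is assumed to be a boolean, and totalPrice is an illustrative output field name. The loop imitates the pipeline in plain JavaScript so the result can be checked without a server.

```javascript
// Hypothetical sample documents; names and prices are invented.
const products = [
  { name: "Phone A", brand: "Samsung", price: 800, available: true },
  { name: "Tablet A", brand: "Samsung", price: 400, available: true },
  { name: "Phone B", brand: "Apple", price: 900, available: true },
  { name: "Laptop B", brand: "Apple", price: 1000, available: false }
];

const pipeline = [
  { $match: { available: true } },                               // stage 1
  { $group: { _id: "$brand", totalPrice: { $sum: "$price" } } }, // stage 2
  { $sort: { totalPrice: -1 } }                                  // stage 3
];

// Plain-JavaScript imitation: filter, sum per brand, sort descending.
const totals = {};
for (const p of products.filter((item) => item.available === true)) {
  totals[p.brand] = (totals[p.brand] || 0) + p.price;
}
const expected = Object.entries(totals)
  .map(([brand, totalPrice]) => ({ _id: brand, totalPrice }))
  .sort((a, b) => b.totalPrice - a.totalPrice);
console.log(expected);

// Inside the mongo shell, `db` exists and the pipeline can be executed.
if (typeof db !== "undefined") {
  printjson(db.products.aggregate(pipeline).toArray());
}
```

With this sample data, the unavailable Apple document is filtered out before grouping, so it does not contribute to the Apple total.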

Examples of MongoDB Aggregation Pipelines

If you consider this test “posts” collection:

{
   "title" : "my first blog",
   "author" : "John",
   "likes" : 4,
   "tags" : ["angular", "react", "python"]
},
{
   "title" : "my second blog",
   "author" : "John",
   "likes" : 7,
   "tags" : ["javascript", "ruby", "vue"]
},
{
   "title" : "hello city",
   "author" : "Ruth",
   "likes" : 3,
   "tags" : ["vue", "react"]
}
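To try the examples in this section yourself, the collection can be loaded along the lines of the sketch below. It uses the three documents shown above; the insert runs only inside the mongo shell, where `db` is defined, and assumes that the current database is one you can safely write to.

```javascript
// The three test documents from above.
const posts = [
  { title: "my first blog", author: "John", likes: 4,
    tags: ["angular", "react", "python"] },
  { title: "my second blog", author: "John", likes: 7,
    tags: ["javascript", "ruby", "vue"] },
  { title: "hello city", author: "Ruth", likes: 3,
    tags: ["vue", "react"] }
];
console.log(posts.map((post) => post.title));

// Inside the mongo shell, insert the documents into "posts".
if (typeof db !== "undefined") {
  printjson(db.posts.insertMany(posts));
}
```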

$group

This is what $group would look like on this:

db.posts.aggregate([
    // Group posts by author and collect each author's titles into an array
    { $group: { _id: "$author", titles: { $push: "$title" } } }
])

The output for this command would be as follows (the order of the groups is not guaranteed unless you add a $sort stage):

{
    "_id" : "Ruth",
    "titles" : [
        "hello city"
    ]
},
{
    "_id" : "John",
    "titles" : [
        "my first blog",
        "my second blog"
    ]
}

$match

This is what the command would look like for $match:

db.posts.aggregate([
    // Keep only the posts written by John
    { $match: { author: "John" } }
])

This is what the result would look like for this command:

{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652e"),
    "title" : "my first blog",
    "author" : "John",
    "likes" : 4.0,
    "tags" : [
        "angular",
        "react",
        "python"
    ]
},
{
    "_id" : ObjectId("5c58e5bf186d4fe7f31c652f"),
    "title" : "my second blog",
    "author" : "John",
    "likes" : 7.0,
    "tags" : [
        "javascript",
        "ruby",
        "vue"
    ]
}

$sum

For this example set, we can execute this command as follows:

db.posts.aggregate([
   { $group: { _id: "$author", total_likes: { $sum: "$likes" } } }
])

This is what the output of this command would look like:

{
   "_id" : "Ruth",
   "total_likes" : 3
},
{
   "_id" : "John",
   "total_likes" : 11
}

How to Boost MongoDB Aggregation Pipeline Performance?

Here are a few simple things to consider to boost your MongoDB Aggregation Pipeline performance:

  • The aggregate command can either return a cursor or store the results in a collection. In both cases, each document in the result set is subject to the BSON Document Size Limit (currently 16 MB). If any single document exceeds this limit, the command throws an error.
  • If you have multiple pipeline stages, it is usually better to understand the overhead attached to every stage. For example, if you have both the $match and $sort stage in your pipeline, it is highly recommended that you utilize a $match before $sort to minimize the documents that you wish to sort. 
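The sketch below illustrates the $match-before-$sort advice on hypothetical sample documents. The plain-JavaScript imitation simply shows that fewer documents reach the sort step when the filter runs first. A $match at the start of a pipeline can also take advantage of indexes, although whether an index is actually used depends on your collection.

```javascript
// Hypothetical sample documents.
const posts = [
  { title: "my first blog", author: "John", likes: 4 },
  { title: "my second blog", author: "John", likes: 7 },
  { title: "hello city", author: "Ruth", likes: 3 }
];

// Filter first so that $sort has fewer documents to order.
const pipeline = [
  { $match: { author: "John" } },
  { $sort: { likes: -1 } }
];

// Plain-JavaScript imitation of the two stages.
const matched = posts.filter((doc) => doc.author === "John");
const sorted = [...matched].sort((a, b) => b.likes - a.likes);
console.log(matched.length + " of " + posts.length + " documents reach the sort");

// Inside the mongo shell, `db` exists and the pipeline can be executed.
if (typeof db !== "undefined") {
  printjson(db.posts.aggregate(pipeline).toArray());
}
```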

What are the Limitations of MongoDB Aggregation Pipelines?

Despite the various advantages of leveraging MongoDB Aggregation Pipelines for your business use case, they are far from perfect. Each document in the result set is subject to the 16 MB BSON document size limit. On top of this, every stage is limited to 100 MB of RAM.

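You can work around the 100 MB memory limit by setting the allowDiskUse option, which lets pipeline stages write temporary files to disk. Otherwise, MongoDB throws an error when a stage exceeds the limit. The option does not lift the 16 MB limit on individual documents.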
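As a sketch, the option is passed as the second argument to db.collection.aggregate(). The pipeline below reuses the "posts" example and is only illustrative, since a collection this small would never reach the memory limit.

```javascript
const pipeline = [
  { $group: { _id: "$author", total_likes: { $sum: "$likes" } } },
  { $sort: { total_likes: -1 } }
];

// Allow stages to spill to temporary files on disk when they need to.
const options = { allowDiskUse: true };
console.log(options);

// Inside the mongo shell, `db` exists and the pipeline can be executed.
if (typeof db !== "undefined") {
  printjson(db.posts.aggregate(pipeline, options).toArray());
}
```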

Conclusion

This article delves into the various salient features of MongoDB Aggregation Pipelines and the steps you can follow to set one up for your business use case seamlessly. It also gives a brief introduction to MongoDB’s features and benefits before discussing the various operators, best practices, examples, stages, and much more to give you a complete idea about MongoDB Aggregation Pipelines.

To get a complete picture of your business performance and financial health, you need to consolidate data from MongoDB and all the other applications used across your business. To achieve this you need to assign a portion of your Engineering Bandwidth to Integrate data from all sources, Clean & Transform it, and finally, Load it to a Cloud Data Warehouse or a destination of your choice for further Business Analytics. All of these challenges can be comfortably solved by a Cloud-Based ETL tool such as Hevo Data.  

Hevo Data, a No-code Data Pipeline can seamlessly transfer data from a vast sea of 100+ sources such as MongoDB & MongoDB Atlas to a Data Warehouse or a Destination of your choice. It is a reliable, completely automated, and secure service that doesn’t require you to write any code! 

If you are using MongoDB as your NoSQL Database Management System and searching for a no-fuss alternative to Manual Data Integration, then Hevo can effortlessly automate this for you. Hevo, with its strong integration with 100+ sources (including 40+ Free Sources), allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiffy.

Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify your Data Integration process. Do check out the pricing details to understand which plan fulfills all your business needs.
