MongoDB to Redshift: 2 Easy Methods

• September 14th, 2021


MongoDB is a NoSQL database, which means SQL queries cannot be run on it directly. Hence, it is necessary to bring MongoDB’s data into a relational data warehouse like AWS Redshift to perform analytical queries. A real-time data pipeline makes it simple and cost-effective to analyze all of your data efficiently.

MongoDB is document-oriented and uses JSON-like documents to store data. Since MongoDB doesn’t enforce a schema on the documents in a collection, application developers can quickly change the schema, add new fields, and forget about older ones that are no longer used, all without worrying about tedious schema migrations. This gives scope for rapid development and flexible deployments at scale. On the flip side, owing to the schema-less nature of a MongoDB collection, converting its data into a relational format is a non-trivial problem.

This blog talks about the two different methods you can use to set up a connection from MongoDB to Redshift in a seamless fashion: Using Custom ETL Scripts and with the help of a third-party tool, Hevo.


Introduction to MongoDB


MongoDB distinguishes itself as a NoSQL database program. It uses JSON-like documents along with optional schemas and is written in C++. MongoDB allows you to address a diverse set of data sets, accelerate development, and adapt quickly to change, with key functionalities like horizontal scaling and automatic failover.

Here are a few key features of MongoDB:

  • File Storage: MongoDB includes a file storage specification called GridFS (short for Grid File System), which is supported by the MongoDB drivers. It divides a file into parts, each of which is stored as a separate document.
  • Load Balancing: MongoDB uses sharding to scale horizontally. The user picks a shard key, which determines the data distribution in a collection. The data is split into ranges (based on the shard key) and distributed across various shards. The shard key can also be hashed to map documents to shards, which results in a more even data distribution.
  • Replication: MongoDB offers high availability through replica sets. A replica set consists of two or more copies of the data. Each replica-set member may act as the primary or a secondary replica at any time.

Introduction to Amazon Redshift


Amazon Redshift is a Data Warehouse that allows companies to store petabytes of data across easily accessible “Clusters” that can be queried in parallel. Every Amazon Redshift Data Warehouse is fully managed, which means that administrative tasks like maintenance, backups, configuration, and security are completely automated.

Amazon Redshift is primarily designed to work with Big Data and is easily scalable due to its modular node design. It also allows users to gain more granular insight into datasets, owing to the ability of Amazon Redshift Clusters to be further divided into slices. Amazon Redshift’s multi-layered architecture allows multiple queries to be processed simultaneously thus cutting down on waiting times. Apart from these, there are a few more benefits of Amazon Redshift that are covered in the following section.

Key Features of Amazon Redshift

  • Enhanced Scalability: Amazon Redshift is known for providing consistently fast performance, even in the face of thousands of concurrent queries. Amazon Redshift Concurrency Scaling supports nearly unlimited concurrent queries and users. By leveraging Redshift’s managed storage, capacity can be added to support workloads of up to 8 PB of compressed data. Scaling is just a simple API call, or a few clicks in the console, away.
  • Easy Management: Amazon Redshift automates oft-repeated maintenance tasks so that you can focus on gathering actionable insights from your data. It is fairly simple to set up and operate: a new Data Warehouse can be deployed with just a few clicks in the AWS console, and key administrative tasks like backup and replication are automated. Data in Amazon Redshift is automatically backed up to Amazon S3, and Amazon Redshift can asynchronously replicate your snapshots to S3 in a different region for disaster recovery. Automatic Table Optimization selects the best distribution and sort keys to enhance performance for the cluster’s workload. Amazon Redshift also gives you the flexibility to run queries from the console or through Business Intelligence tools, libraries, and SQL client tools.
  • Robust Security: Amazon Redshift provides robust data security features at no extra cost. It allows you to configure firewall rules to control network access to a specific Data Warehouse Cluster. It also offers granular column- and row-level security controls that ensure users can only view data they have the right type of access to. Apart from these, Amazon Redshift delivers on reliability and compliance through tokenization, end-to-end encryption, network isolation, and auditing.

Methods to Replicate Data from MongoDB to Redshift

There are two popular methods to move data from MongoDB to Amazon Redshift.

Method 1: Using Custom Scripts to Move Data from MongoDB to Redshift

This is a simple 4-step approach, from exporting the data using mongoexport to loading it from S3 into Redshift. This method has considerable advantages but suffers from a few setbacks as well.

Method 2: Using Hevo Data to Move Data from MongoDB to Redshift

Use Hevo Data to load, transform, and cleanse the data. Use this method to load large volumes of data from multiple data sources, or to load data incrementally and automatically.

Understanding the Methods to Move Data from MongoDB to Redshift


These are the methods you can use to move data from MongoDB to Redshift in a seamless fashion:


Method 1: Using Custom Scripts to Move Data from MongoDB to Redshift

Following are the steps to move data from MongoDB to Redshift using a Custom Script; a minimal end-to-end sketch follows the list:

  • Step 1: Use mongoexport to export data.
  • Step 2: Upload the .json file to the S3 bucket.
  • Step 3: Create a Table schema before loading the data into Redshift.
  • Step 4: Use the COPY command to load the data from S3 to Redshift.
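Here is a minimal end-to-end sketch of these four steps in Python. The MongoDB URI, S3 bucket, table name, and IAM role below are illustrative placeholders, and a production script would also need the schema-handling logic discussed later in this article:

import subprocess
import boto3
import psycopg2

# Step 1: export the collection as newline-delimited JSON
# (the URI, database, and collection names are placeholders).
subprocess.run(
    ["mongoexport", "--uri", "mongodb://localhost:27017/mydb",
     "--collection", "users", "--out", "users.json"],
    check=True,
)

# Step 2: upload the export file to an S3 bucket.
boto3.client("s3").upload_file("users.json", "my-etl-bucket", "exports/users.json")

# Step 3 is assumed done already: a Redshift table whose columns match
# the exported fields, e.g. CREATE TABLE users (name VARCHAR(64), age INT, ...).

# Step 4: COPY from S3 into Redshift. FORMAT AS JSON 'auto' matches JSON
# keys to column names; note that mongoexport writes extended JSON
# (e.g. {"$oid": ...}), which may need pre-processing before the load.
conn = psycopg2.connect(
    "host=my-cluster.example.redshift.amazonaws.com port=5439 "
    "dbname=dev user=admin password=secret"
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY users FROM 's3://my-etl-bucket/exports/users.json'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """)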

For the scope of this article, we have highlighted the challenges faced while migrating data from MongoDB to Amazon Redshift. Towards the end of the article, a detailed list of the advantages of using approach 2 is also given. For the detailed steps of Method 1, you can check out our other blog on migrating MongoDB to Amazon Redshift.

Limitations of using Custom Scripts to Move Data from MongoDB to Redshift

Here is a list of limitations of using the manual method of moving data from MongoDB to Redshift:

  • Schema Detection Cannot be Done Upfront: Unlike a relational database, a MongoDB collection doesn’t have a predefined schema. Hence, it is impossible to look at a collection and create a compatible table in Redshift upfront.
  • Different Documents in a Single Collection: Different documents in a single collection can have different sets of fields, as the two documents below show.
{
  "name": "John Doe",
  "age": 32,
  "gender": "Male"
}
{
  "first_name": "John",
  "last_name": "Doe",
  "age": 32,
  "gender": "Male"
}

Different documents in a single collection can have incompatible field data types. Hence, the schema of the collection cannot be determined by reading one or a few documents.
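A common workaround in custom scripts is to sample the collection and take the union of fields before creating the Redshift table. A minimal sketch using pymongo (the sample size and connection details are illustrative, and sampling can still miss rarely used fields):

from collections import OrderedDict
from pymongo import MongoClient

def discover_fields(collection, sample_size=1000):
    """Accumulate the union of field names (with an example type)
    across a sample of documents, since no single document
    defines the schema of the collection."""
    fields = OrderedDict()
    for doc in collection.find().limit(sample_size):
        for key, value in doc.items():
            fields.setdefault(key, type(value).__name__)
    return fields

client = MongoClient("mongodb://localhost:27017")
print(discover_fields(client["mydb"]["users"]))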

Two documents in a single MongoDB collection can also have fields with values of different types:

{
  "name": "John Doe",
  "age": 32,
  "gender": "Male",
  "mobile": "(424) 226-6998"
}
{
  "name": "John Doe",
  "age": 32,
  "gender": "Male",
  "mobile": 4242266998
}

The field mobile is a string in the first document and a number in the second, which is a completely valid state in MongoDB. In Redshift, however, both values will have to be converted to either a string or a number before being persisted.
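In a custom script, one way to handle such a conflict is to coerce the field to a single agreed-upon type before loading. A minimal sketch, assuming you have decided to persist mobile as a string:

def normalize_mobile(doc):
    """Coerce the 'mobile' field to a string so every document
    fits a single VARCHAR column in Redshift."""
    if "mobile" in doc and not isinstance(doc["mobile"], str):
        doc["mobile"] = str(doc["mobile"])
    return doc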

  • New Fields can be added to a Document at Any Point in Time: It is possible to add fields to a document in MongoDB by running a simple update on the document. In Redshift, however, the process is harder, as you have to construct and run ALTER statements each time a new field is detected.
  • Character Lengths of String Columns: MongoDB doesn’t put a limit on the length of string fields; it only enforces a 16MB limit on the size of the entire document. In Redshift, however, it is common practice to restrict string columns to a certain maximum length for better space utilization. Hence, each time you encounter a value longer than expected, you will have to resize the column.
  • Nested Objects and Arrays in a Document: A document can have nested objects and arrays with a dynamic structure, as in the document below. The most complex of MongoDB ETL problems is handling these nested objects and arrays.
{
  "name": "John Doe",
  "age": 32,
  "gender": "Male",
  "address": {
    "street": "1390 Market St",
    "city": "San Francisco",
    "state": "CA"
  },
  "groups": ["Sports", "Technology"]
}

MongoDB allows nesting objects and arrays several levels deep. In a complex real-life scenario, it may become a nightmare trying to flatten such documents into rows for a Redshift table.
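To give a sense of the work involved, here is one common flattening approach (a sketch, not the only option): nested objects are collapsed into underscore-separated column names, and arrays are serialized to JSON strings so each fits in a single VARCHAR column.

import json

def flatten(doc, prefix=""):
    """Collapse nested objects into underscore-separated keys and
    serialize arrays to JSON strings for single VARCHAR columns."""
    row = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "_"))
        elif isinstance(value, list):
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row

Applied to the document above, this yields columns like address_street and address_city, plus a groups column holding '["Sports", "Technology"]'. An alternative design is to split arrays into a separate child table keyed by the parent document’s _id.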

  • Data Type Incompatibility between MongoDB and Redshift: Not all MongoDB data types are compatible with Redshift; ObjectId, Regular Expression, and JavaScript, for example, are not supported. While building an ETL solution to migrate data from MongoDB to Redshift from scratch, you will have to write custom code to handle these data types, as sketched below.
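A minimal sketch of such a conversion layer, assuming the common choice of casting unsupported BSON types to strings (the mapping here is illustrative, not exhaustive):

from bson.code import Code
from bson.objectid import ObjectId
from bson.regex import Regex

def to_redshift_value(value):
    """Cast BSON types that Redshift cannot store into strings;
    pass through values that map cleanly (numbers, booleans, text)."""
    if isinstance(value, ObjectId):
        return str(value)      # 24-character hex string -> VARCHAR(24)
    if isinstance(value, Regex):
        return value.pattern   # keep only the pattern text
    if isinstance(value, Code):
        return str(value)      # JavaScript code stored as plain text
    return value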

Method 2: Using Hevo Data to Move Data from MongoDB to Redshift


Hevo Data, a No-code Data Pipeline, helps you load data from any data source, such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources, including MongoDB, and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo loads the data onto the desired Data Warehouse, enriches it, and transforms it into an analysis-ready form without your writing a single line of code.

Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.

Get Started with Hevo for free
  • Configure Source: Connect Hevo to MongoDB by entering details like Database Port, Database Host, Database User, Database Password, Pipeline Name, Connection URI, and the connection settings.
  • Integrate Data: Load data from MongoDB to Redshift by providing your Redshift database credentials, such as Database Port, Username, Password, Name, Schema, and Cluster Identifier, along with the Destination Name.

Here are a few benefits of using Hevo:

  • Real-time Data Migration in Minutes: With Hevo, you can connect to MongoDB through a point-and-click UI. Without writing any code, you can seamlessly move data from MongoDB to Redshift in real-time. This eliminates the need to run batch ETL jobs or manage servers. Hevo takes care of all the grunt work and makes your data available in real-time for analysis with your existing business intelligence tools.
  • Automatic Schema Detection and Evolution: Hevo dynamically detects schema as it pulls documents for ETL and upgrades the earlier encountered schema to incorporate new collections and fields. Hevo also creates the required tables and columns in Redshift as and when the schema changes.
  • Notifications for Schema Incompatibilities: Hevo sends notifications to the user whenever the fields with incompatible types are detected in a collection. Hevo also allows the user to take appropriate actions to handle the incompatibility right on time.
  • Dynamic expansion for Varchar Columns: Hevo expands the existing varchar columns in Redshift dynamically as and when it encounters longer string values. This ensures that your Redshift space is used wisely without you breaking a sweat.
  • Splitting Nested Documents with Transformations: Hevo lets you split the nested MongoDB documents into multiple rows in Redshift by writing simple Python transformations. This makes MongoDB file flattening a cakewalk for users.
  • Automatic Conversion to Redshift Data Types: Hevo converts all MongoDB data types to the closest compatible data type in Redshift. This eliminates the need to write custom scripts to maintain each data type, in turn, making the migration of data from MongoDB to Redshift seamless.

Conclusion

This blog talked about the two different methods you can use to set up a connection from MongoDB to Redshift in a seamless fashion: using Custom ETL Scripts and with the help of a third-party tool, Hevo.

Apart from the benefits outlined above, you can use Hevo to migrate data from an array of different sources: databases, cloud applications, SDKs, and more. This provides the flexibility to instantly replicate data from any source, like MongoDB, to Redshift.

You can additionally model your data and build complex aggregates and joins to create materialized views for faster query execution on Redshift. You can define the interdependencies between various models through a drag-and-drop interface with Hevo’s Workflows.

Visit our Website to Explore Hevo

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of setting up a connection from MongoDB to Redshift in the comments section below!
