Press "Enter" to skip to content

MongoDB to Redshift – ETL Challenges

MongoDB to Redshift ETL

MongoDB To Redshift ETL

MongoDB is a NoSQL database which means SQL queries cannot be directly run on MongoDB. Hence, it is necessary to bring MongoDB’s data to a relational format data warehouse like AWS Redshift to perform analytical queries.

MongoDB uses JSON like documents to store data. A collection of a group of documents. It exists within a single database and doesn’t enforce a schema. MongoDB doesn’t enforce schema restrictions while storing data, the application developers can quickly change the schema, add new fields and forget about older ones which are not used anymore without worrying about tedious schema migrations. Owing to the schema-less nature of a MongoDB collection, converting data into a relational format is a non-trivial problem. This gives scope for rapid development and flexible deployments at scale.

Methods to replicate data from MongoDB data to Redshift:

There are two popular methods to move data from MongoDB to Amazon Redshift.

Method 1: A ready to use Hevo Data Integration Platform (7 Days Free Trial).

Method 2: Write Custom ETL Scripts to move data from MongoDB to Amazon Redshift.

For the scope of this article, we have highlighted the challenges faced while migrating data from MongoDB to Redshift. Towards the end of the article, a detailed list of advantages of using approach 1 is also given.

Challenges with writing Custom ETL scripts to move data from MongoDB to Redshift:

1. Schema detection cannot be done up front

Unlike a relational database, a MongoDB collection doesn’t have a predefined schema. Hence, it is impossible to look at a collection and create a compatible table in Redshift upfront.

2. Different documents in a single collection can have a different set of fields

A document in a collection in MongoDB can have a different set of fields.

{

  "name": "John Doe",

  "age": 32,

  "gender": "Male"

}

{

  "first_name": "John",

  "last_name": "Doe",

  "age": 32,

  "gender": "Male"

}


3. Different documents in a single collection can have incompatible field data types

Hence, the schema of the collection cannot be determined by reading one or a few documents.

Two documents in a single MongoDB collection can have fields with values of different types.

{

  "name": "John Doe",

  "age": 32,

  "gender": "Male"

  "mobile": "(424) 226-6998"

}
{

"name": "John Doe",

"age": 32,

"gender": "Male",

"mobile": 4242266998

}

The field mobile is a string and a number in the above documents respectively. It is a completely valid state in MongoDB. In Redshift, however, both these values either will have to be converted to a string or a number before being persisted.

4. New fields can be added to a document at any point in time

It is possible to add columns to a document in MongoDB by running a simple update to the document. In Redshift, however, the process is harder as you have to construct and run ALTER statements each time a new field is detected.

5. Character lengths of String columns

MongoDB doesn’t put a limit on the length of the string columns. It has a 16MB limit on the size of the entire document. However, in Redshift, it is a common practice to restrict string columns to a certain maximum length for better space utilization. Hence, each time you encounter a longer value than expected, you will have to resize the column.

6. A document can have nested objects and arrays with a dynamic structure

The most complex of MongoDB ETL problems is handling nested objects and arrays.

{

"name": "John Doe",

"age": 32,

"gender": "Male",

"address": {

"street": "1390 Market St",

"city": "San Francisco",

"state": "CA"

},

"groups": ["Sports", "Technology"]

}

MongoDB allows nesting objects and arrays to a number of levels. In a complex real-life scenario is may become a nightmare trying to flatten such documents into rows for a Redshift table.

7. Data Type incompatibility between MongoDB and Redshift

Not all data types of MongoDB are compatible with Redshift. ObjectId, Regular Expression, Javascript are not supported by Redshift. While building an ETL solution to migrate data from MongoDB to Redshift from scratch, you will have to write custom code to handle these data types.

Hevo Data Integration Platform – An Easier and More Reliable Way

Hevo Data Integration Platform provides a seamless way of moving data from MongoDB to Redshift by taking care of all the blockers mentioned above. Let’s see how:

1. Real-time Data Migration in minutes

With Hevo, you can connect to MongoDB on a point and click UI. Without writing any code, you can seamlessly move data from MongoDB to Redshift in real-time. This will eliminate the need for you to run batch ETL jobs or manage servers. Hevo takes care of all the grunt work and makes your data available in real-time for analysis.

2. Automatic schema detection and evolution

Hevo dynamically detects schema as it pulls documents for ETL and upgrades the earlier encountered schema to incorporate new collections and fields. Hevo also creates the required tables and columns in Redshift as and when the schema changes.

3. Notifications for Schema Incompatibilities

Hevo sends notifications to the user whenever the fields with incompatible types are detected in a collection. Hevo also allows the user to take appropriate actions to handle the incompatibility right on time.

4. Dynamic expansion for varchar columns

Hevo expands the existing varchar columns in Redshift dynamically as and when it encounters longer string values. This ensures that your Redshift space is used wisely without you breaking a sweat.

5. Splitting nested documents with Transformations

Hevo lets you split the nested MongoDB documents into multiple rows in Redshift by writing simple Python transformations. This makes MongoDB file flattening a cakewalk for users.

6. Automatic conversion to Redshift data types

Hevo converts all MongoDB data types to the closest compatible data type in Redshift. This eliminates the need to write custom scripts to maintain each data type, in turn, making the migration of data from MongoDB to Redshift seamless.

Outside of these benefits, you can use Hevo to migrate data from an array of different sourcesdatabases, cloud applications, SDKs, and more. This will provide the flexibility to instantly replicate data from any source to Redshift.

You can additionally model your data, build complex aggregates and joins to create materialized views for faster query executions on Redshift. You can define the interdependencies between various models through a drag and drop interface with Hevo’s Workflows. Click here to explore all of Hevo’s features.

Sign-up for a 7 Day Free Trial to see how easy it is move Data from MongoDB (and 100s of other sources) to Redshift using Hevo Data Integration Platform.

ETL Data to Redshift, Bigquery, Snowflake

Move Data from any Source to Warehouse in Real-time

Sign up today to get $500 Free Credits to try Hevo!
Start Free Trial

Related Posts