Because MongoDB is a NoSQL database, SQL queries cannot be run directly against it. To run analytical queries, MongoDB data must therefore be moved to a relational data warehouse such as Amazon Redshift. Using a real-time data pipeline to migrate data from MongoDB to Redshift is a straightforward and cost-effective way to analyze all of your data.
This article, in particular, covers the different approaches you could use to perform data migration from MongoDB to Redshift.
Table of Contents
- Introduction to MongoDB and Redshift
- Method 1: Transfer Data from MongoDB to Redshift using Custom Code
- Limitations of the Custom ETL Scripts Approach
- Method 2: Using Hevo Data to Move Data from MongoDB to Redshift
Introduction to MongoDB and Redshift
MongoDB is an open-source NoSQL database that stores data in JSON-like documents. It uses a document-oriented data model, and data fields can vary from document to document. MongoDB is not tied to any specific data structure: there is no fixed format or schema for the data in a MongoDB database. This flexibility is what makes MongoDB ETL a challenging process.
Amazon Redshift is an enterprise-class relational data warehouse and query management system. It delivers efficient storage and excellent query performance through a combination of massively parallel processing, columnar data storage, and highly effective targeted data compression encoding schemes.
You can read more about Redshift Architecture here.
Methods for MongoDB to Redshift Replication
There are two popular methods to move data from MongoDB to Amazon Redshift.
Method 1: Write Custom ETL Scripts to move data from MongoDB to Amazon Redshift.
Method 2: A ready-to-use Hevo Data Integration Platform (14 Days Free Trial).
This article covers the steps involved in writing custom code to load data from MongoDB to Redshift, and concludes with the limitations of that approach.
Method 1: Transfer Data from MongoDB to Redshift using Custom Code
For this demonstration, let us assume that we need to move a ‘products’ collection into Redshift. This collection is populated with the product details of a manufacturing company.
There might be two cases you will consider while transferring data from MongoDB to Redshift.
- Move data for a one-time load in Redshift. (covered in this article)
- Incrementally load data into Redshift. (when the data volume is high – not in the scope of this article)
One-Time Load from MongoDB to Redshift
The broad steps to load data from MongoDB to Redshift are as follows:
- Step 1: Use mongoexport to Export Data
- Step 2: Upload the .json File to S3 Bucket
- Step 3: Create Table Schema before Loading the Data into Redshift
- Step 4: Using COPY Command Load the Data from S3 to Redshift
Step 1: Use mongoexport to Export Data
First, you will need to generate a .json file of the required MongoDB collection using the mongoexport command.
Open the command prompt, navigate to the MongoDB Server installation path, and run the following mongoexport command to generate the output file for the products collection.
mongoexport --host localhost -u 'username' -p 'password' --db mongodb --collection products --out products.json
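mongoexport writes one JSON document per line (newline-delimited JSON). For the products collection, a single exported record might look like the following; the field names mirror the table schema created in Step 3, and the values are purely illustrative:

```json
{"_id":{"$oid":"5a9427648b0beebeb69579e7"},"sku":"W-1000","title":"Widget","description":"A standard widget","manufacture_details":{"model_number":"W-1000","release_date":"2021-01-15"},"shipping_details":{"weight":350,"width":10,"height":10},"quantity":3,"pricing":{"list":24.99,"retail":19.99}}
```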
Step 2: Upload the .json File to S3 Bucket
Note: There might be several transformations required before you load this data into Redshift, and performing them in custom code is notably difficult. A tool like Hevo, which provides an easy environment to transform data before moving it into the warehouse, might be the right solution for you.
Next, you can upload the file from your local machine to AWS in several ways. One simple, intuitive option is the file upload utility in the S3 console. You can also use the AWS CLI, which provides easy commands to upload files to an S3 bucket from your local machine. Let us upload the file using the AWS CLI.
As a prerequisite, you will need to install and configure the AWS CLI if you have not already done so. You can refer to the AWS CLI user guide to learn more about installing it.
Run the following command to upload the file into S3 from the local machine.
aws s3 cp D:\Work\Articles\products.json s3://s3bucket011/products.json
Note: You need the right permissions to access and upload files to the S3 bucket; refer to the S3 access control documentation for more details.
Step 3: Create Table Schema before Loading the Data into Redshift
CREATE TABLE sales.products (
    sku VARCHAR(100),
    title VARCHAR(100),
    description VARCHAR(500),
    manufacture_details VARCHAR(1000),
    shipping_details VARCHAR(1000),
    quantity BIGINT,
    pricing VARCHAR(100)
);
After running the above query, a table structure will be created within Redshift with no records in it. To check this, run the following query:
SELECT * FROM sales.products;
Step 4: Using COPY Command Load the Data from S3 to Redshift
COPY dev.sales.products FROM 's3://s3bucket011/products.json' IAM_ROLE 'Role_ARN' FORMAT AS JSON 'auto';
Note: If you need to load only specific fields from the JSON, use a jsonpaths file instead of json 'auto'. See the Redshift COPY from JSON documentation for details.
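A jsonpaths file is itself a small JSON document that lists, in column order, a path expression for each target column. For the products table from Step 3 it might look like this sketch, which assumes top-level JSON fields with the same names as the columns:

```json
{
  "jsonpaths": [
    "$.sku",
    "$.title",
    "$.description",
    "$.manufacture_details",
    "$.shipping_details",
    "$.quantity",
    "$.pricing"
  ]
}
```

You would upload this file to S3 and reference its S3 path in the COPY command in place of 'auto', for example format as json 's3://s3bucket011/products_jsonpaths.json' (a hypothetical path).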
You will need to confirm if the data has loaded successfully. You can do it by running the following query.
SELECT * FROM sales.products limit 10;
This should return the records inserted from the products file.
Limitations of the Custom ETL Scripts Approach
- In cases where data needs to be moved once or in batches only, the custom ETL script method works well. This approach becomes extremely tedious if you have to copy data from MongoDB to Redshift in real-time.
- In case you are dealing with huge amounts of data, you will be required to perform an incremental load. The incremental load becomes troublesome as there are additional steps that you need to follow to achieve it.
- Transforming data before you load it into Redshift will be extremely troublesome to achieve.
- MongoDB’s dynamic structure makes it hard to deal with nested objects and arrays within a document.
- When you write code to extract a subset of data, those scripts often break as the source schema keeps changing or evolving. This can result in data loss.
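To make the nested-document problem concrete, here is a minimal Python sketch (the function and sample fields are hypothetical, not part of any official tooling) of the kind of transformation a custom script must perform: nested objects and arrays are serialized into JSON strings so that they fit the VARCHAR columns created in Step 3.

```python
import json

def flatten_product(doc):
    """Flatten one MongoDB product document for loading into Redshift.

    Nested objects and arrays are serialized to JSON strings so they fit
    VARCHAR columns; the _id field is dropped because the target table
    has no corresponding column. This mapping is a sketch, not a rule.
    """
    flat = {}
    for key, value in doc.items():
        if key == "_id":
            continue  # no _id column in the Redshift table
        if isinstance(value, (dict, list)):
            flat[key] = json.dumps(value)  # store nested data as a JSON string
        else:
            flat[key] = value
    return flat

# Illustrative document shaped like the products collection
doc = {
    "_id": "5a9427648b0beebeb69579e7",
    "sku": "W-1000",
    "title": "Widget",
    "manufacture_details": {"model_number": "W-1000", "release_date": "2021-01-15"},
    "shipping_details": {"weight": 350, "width": 10},
    "quantity": 3,
}

flat = flatten_product(doc)
print(flat["manufacture_details"])
```

A script like this would run between Step 1 (export) and Step 2 (upload), rewriting each line of the exported file; schema changes in the source collection are exactly what tends to break it.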
Method 2: Using Hevo Data to Move Data from MongoDB to Redshift
The method described above for migrating data from MongoDB to Redshift is fragile, error-prone, and often hard to implement and maintain. This can impact the consistency and availability of your data in Redshift.
You can read more on the ETL challenges while moving MongoDB to Redshift.
There is an easier way to replicate data from MongoDB to Redshift.
The ready-to-use Hevo Data Integration Platform helps you migrate this data without having to write any code. This is how the process looks when done through Hevo:
- Connect to your MongoDB database.
- Select the replication mode: (a) Full Dump and Load (b) Incremental load for append-only data (c) Incremental load for mutable data.
- For each collection in MongoDB choose a table name in Redshift where it should be copied.
That’s it! You are all set. Hevo will take care of fetching the data in an incremental fashion and will upload that seamlessly from MongoDB to Redshift in real-time.
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Simplify your Data Analysis with Hevo today!
SIGN UP HERE FOR A 14-DAY FREE TRIAL!
In this blog, you learned about the two methods you can use to set up a connection from MongoDB to Redshift in a seamless fashion: Custom ETL Scripts, or a third-party tool, Hevo.
Additionally, you can bring data from 100+ different sources – databases, cloud applications, SDKs, and more – with the Hevo Data Integration Platform. This will future-proof your data integration setup and provide the flexibility to immediately replicate data from any source into Redshift.
VISIT OUR WEBSITE TO EXPLORE HEVO
Want to take Hevo for a spin?
SIGN UP to see how easy it is to load data from MongoDB to Redshift (and 100s of other sources) using Hevo.
Share your experience of migrating data from MongoDB to Redshift in the comments section below!