This article, in particular, covers the different approaches you could use to perform data migration from MongoDB to Redshift.
An Overview of MongoDB and Redshift
MongoDB is an open-source NoSQL database that stores data in JSON format. It uses a document-oriented data model, and data fields can vary by the document. MongoDB isn’t tied to any specified data structure. That means, there is no particular format or schema for the data in a Mongo database. This aspect makes MongoDB ETL a challenging process.
Amazon Redshift data warehouse is an enterprise-class relational database query and management system. Amazon Redshift delivers efficient storage and excellent query performance through a combination of massively parallel processing, columnar data storage, and highly effective targeted data compression encoding schemes. You can read more about Redshift Architecture here.
Methods for MongoDB to Redshift replication
There are two popular methods to move data from MongoDB to Amazon Redshift.
Method 1: A ready to use Hevo Data Integration Platform (7 Days Free Trial).
Method 2: Write Custom ETL Scripts to move data from MongoDB to Amazon Redshift.
This article covers the steps involved in writing custom code to load data from MongoDB to Redshift. Towards the conclusion, the blog covers the limitations and deficiencies of the approach.
Transfer Data from MongoDB to Redshift using Custom Code
For this demonstration, let us assume that we need to move a ‘products’ collection into Redshift. This collection is populated with the product details of a manufacturing company.
There might be two cases you will consider while transferring data from MongoDB to Redshift.
- Move data for a one-time load in Redshift. (covered in this article)
- Incrementally load data into Redshift. (when the data volume is high – not in the scope of this article)
One-Time Load from MongoDB to Redshift
The broad steps to load data from MongoDB to Redshift are as follows:
- Use mongoexport to export data
- Upload the .json file to S3 bucket
- Create Table schema before loading the data into Redshift
- Using COPY command load the data from S3 to Redshift
#Step 1: Use mongoexport to export data
First thing, you will need to generate the .json file of the required MongoDB collection using the Export command as follows:
Open the command prompt and go the Mongo Server installation path to run mongoexport command. Run following mongoexport command to generate the output file of the products collection.
mongoexport --host localhost -u 'username' -p 'password' --db mongodb --collection products --out + products.json
#Step 2: Upload above generated .json file to S3 Bucket
Note: There might be several transformations required before you load this data into Redshift. However, performing this using custom code will be notably difficult. A tool like Hevo, that provides an easy environment to transform data before moving it into the warehouse might be the right solution for you.
Next, you can upload files from local machine to AWS through various means. One simple way is to upload it using file upload utility of S3. This is a more intuitive alternative. You can also achieve this AWS CLI, which provides easy commands to upload it to S3 bucket from local machine. Let us upload the file using AWS CLI.
As a prerequisite, you will need to install and configure AWS CLI if you have not already performed. You can refer to the user guide to know more about installing AWS CLI.
Run the following command to upload the file into S3 from the local machine.
aws s3 cp D:WorkArticlesproducts.json s3:///products.json
Note: You need to have the right permissions to access and upload files to S3 bucket, please refer to S3 access controls for more details.
#Step 3: Create Table schema before loading the data into Redshift
CREATE TABLE sales.products (sku VARCHAR(100), title VARCHAR(100), description VARCHAR(500), manufacture_details VARCHAR(1000), shipping_details VARCHAR(1000), quantity BIGINT, pricing VARCHAR(100))
After running the above query, a table structure will be created within Redshift with no records in it. To check this, run the following query:
SELECT * FROM sales.products
#Step4: Using COPY command load the data from S3 to Redshift
COPY dev.sales.products FROM 's3://s3bucket011/products.json' iam_role 'Role_ARN' format as json 'auto';
Note: In case, you need to fetch some particular data from JSON, you need to use jsonpaths instead of json ‘auto’. Please refer to this link for more details:
You will need to confirm if the data has loaded successfully. You can do it by running the following query.
SELECT * FROM sales.products limit 10;
This should return the record inserted from products file.
Limitations of the Custom ETL Scripts Approach:
- In cases where data needs to be moved once or in batches only, the custom ETL script method works well. This approach becomes extremely tedious if you have to copy data from MongoDB to Redshift in real-time.
- In case you are dealing with huge amounts of data, you will be required to perform an incremental load. The incremental load becomes troublesome as there are additional steps that you need to follow to achieve it.
- Transforming data before you load it into Redshift will be extremely troublesome to achieve.
- MongoDB’s dynamic structure makes it hard to deal with nested objects and arrays within a document.
- When you write code to extract a subset of data often those scripts break as the source schema keeps changing or evolving. This can result in data loss.
The process mentioned above is frail, erroneous and often hard to implement and maintain. This will impact the consistency and availability of your data in Redshift.
You can read more on the ETL challenges while moving MongoDB to Redshift.
There is an easier way to replicate data from MongoDB to Redshift.
A ready to use Hevo Data Integration Platform helps you migrate this data without having to write any code. This is how the process will look like when done through Hevo:
- Connect to your MongoDB database.
- Select the replication mode: (a) Full Dump and Load (b) Incremental load for append-only data (c) Incremental load for mutable data.
- For each collection in MongoDB choose a table name in Redshift where it should be copied.
That’s it! You are all set. Hevo will take care of fetching the data in an incremental fashion and will upload that seamlessly from MongoDB to Redshift in real-time.
Additionally, you can bring data from 100s of different sources – databases, cloud applications, SDKs, and more with Hevo Data Integration Platform. This will future proof your data integration set up and provide the flexibility to immediately replicate data from any source into Redshift.
Sign-up for a 7 Day Free Trial to see how easy it is to load data from MongoDB to Redshift (and 100s of other sources) using Hevo.