Organizations often need to integrate data from various sources to gain valuable insights. One common scenario is transferring data from a NoSQL database like MongoDB to a cloud data warehouse like Snowflake for advanced analytics and business intelligence.

However, this process can be challenging, especially for those new to data engineering. In this blog post, we’ll explore three easy methods to seamlessly migrate data from MongoDB to Snowflake, ensuring a smooth and efficient data integration process.

MongoDB real-time replication to Snowflake keeps data consistently synchronized between the two systems. Because of MongoDB’s schemaless nature, moving the data into a warehouse like Snowflake is essential for meaningful analysis.

In this article, we will discuss the different methods to migrate MongoDB to Snowflake.

Note: The MongoDB Snowflake connector offers a solution for the real-time data synchronization challenges many organizations face.

Methods to replicate MongoDB to Snowflake

There are three popular methods to perform MongoDB to Snowflake ETL:

Method 1: Using Hevo Data to Move Data from MongoDB to Snowflake

Hevo, an official Snowflake Partner for Data Integration, simplifies data transfer from MongoDB to Snowflake for free with its robust architecture and intuitive UI. You can achieve data integration without any coding experience, and no manual intervention is required once the setup is complete. Hevo’s pre-built integration with MongoDB, along with 150+ sources (including 50+ free data sources), takes full charge of the data transfer process, allowing you to set up MongoDB to Snowflake migration seamlessly and focus solely on key business activities.

GET STARTED WITH HEVO FOR FREE

Method 2: Writing Custom Scripts to Move Data from MongoDB to Snowflake

This is a simple 4-step process to move data from MongoDB to Snowflake. It starts with extracting data from MongoDB collections and ends with copying staged files to the Snowflake table. This method of moving data from MongoDB to Snowflake has significant advantages but suffers from a few setbacks as well.

Method 3: Using Native Cloud Tools and Snowpipe for MongoDB to Snowflake

In this method, we’ll leverage native cloud tools and Snowpipe, a continuous data ingestion service, to load data from MongoDB into Snowflake. This approach eliminates the need for a separate ETL tool, streamlining the data transfer process.

Introduction to MongoDB

MongoDB is a NoSQL database program that stores data as JSON-like documents with optional schemas. Written in C++, it lets you handle diverse data sets, accelerate development, and adapt quickly to change with key capabilities such as horizontal scaling and automatic failover.

Here are a few key features of MongoDB:

  • File Storage: MongoDB includes a file storage specification called GridFS (short for Grid File System), which is supported by the MongoDB drivers. It divides a file into parts, each of which is stored as a separate document.
  • Load Balancing: MongoDB uses sharding to scale horizontally. The user chooses a shard key, which determines how the data in a collection is distributed. The data is split into ranges (based on the shard key) and distributed across various shards. The shard key can also be hashed to map documents to shards, which results in a more even data distribution.
  • Replication: MongoDB offers high availability with replica sets. A replica set consists of two or more copies of the data. Each replica-set member may serve as the primary or the secondary replica at any time. All the reads and writes are done on the primary replica by default. Secondary replicas maintain a copy of the data of the primary with the help of built-in replication.
  • Ad-Hoc Queries: MongoDB supports range query, field, and regular expression searches. All queries can return particular fields of documents and include user-defined JavaScript functions. You can also configure queries to return a random sample of results of a given size.

Introduction to Snowflake

Snowflake is a fully managed service that provides customers with near-infinite scalability of concurrent workloads to easily integrate, load, analyze, and securely share their data. Its common applications include data lakes, data engineering, data application development, data science, and secure consumption of shared data. 

Snowflake’s unique architecture natively integrates compute and storage. This architecture enables virtually all of your users and data workloads to access a single copy of your data without any detrimental effect on performance.

With Snowflake, you can seamlessly run your data solution across multiple regions and Clouds for a consistent experience. Snowflake makes it possible by abstracting the complexity of underlying Cloud infrastructures.

Here are a few features of Snowflake as a Software as a Service (SaaS) offering:

  • Accelerate Quality of Analytics and Speed: Snowflake allows you to empower your analytics pipeline by shifting from nightly batch loads to real-time data streams. You can accelerate the quality of analytics at your workplace by granting secure, concurrent, and governed access to your Data Warehouse across the organization. This allows organizations to optimize the distribution of resources to maximize revenue by saving on costs and manual effort.
  • Customized Data Exchange: Snowflake allows you to build your data exchange which lets you securely share live, governed data. It also provides an incentive to build better data relationships across your business units and with your partners and customers. It does this by achieving a 360-degree view of your customer, which provides insight into key customer attributes like interests, employment, and many more.
  • Robust Security: You can adopt a secure data lake as a single place for all compliance and cybersecurity data. Snowflake Data Lakes enable fast incident response. You can understand the complete picture of an incident by consolidating high-volume log data in a single location and efficiently analyzing years of log data in seconds. You can also join semi-structured logs and structured enterprise data in one data lake. Snowflake lets you get started without any indexing and easily manipulate and transform data once it is in Snowflake.
  • Improved Data-Driven Decision Making: Snowflake allows you to break down data silos and provide access to actionable insights across the organization. This is an essential first step to improving partner relationships, optimizing pricing, reducing operational costs, driving sales effectiveness, and much more.
  • Improved User Experiences and Product Offerings: With Snowflake in place, you can better understand user behavior and product usage. You can also leverage the full breadth of data to deliver customer success, vastly improve product offerings, and encourage data science innovation.

Understanding the Methods to Connect MongoDB to Snowflake

These are the methods you can use to move data from MongoDB to Snowflake:

Method 1: Using Hevo Data to Move Data from MongoDB to Snowflake


There is a hassle-free, error-free alternative that achieves the same result and scales instantly. Hevo, a fully managed cloud data pipeline platform and an official Snowflake ETL partner, can help you move data from any data source, such as MongoDB, to Snowflake in just a few minutes.

Learn more about Hevo

All of this can be achieved on a simple point-and-click interface without writing any ETL scripts. Here are the steps to replicate MongoDB to Snowflake using Hevo:

Steps to Transfer Data from MongoDB to Snowflake via Hevo Data:

  • Identify Deployment Type: Determine if using MongoDB Atlas or Generic MongoDB.
  • Check Prerequisites: Ensure MongoDB version compatibility, create a user with read access, and set OpLog retention to at least 72 hours.
  • Create Pipeline:
    • Go to PIPELINES in Hevo.
    • Click + CREATE and choose MongoDB as the source type.
  • Configure Connection:
    • Assign a Pipeline Name.
    • Select connection method: paste the MongoDB connection string or enter details manually.
  • Select Data:
    • Choose databases and collections for replication.
  • Adjust Settings & Test:
    • Configure settings like merging collections and loading historical data.
    • Click TEST CONNECTION to verify setup.
  • Finalize Setup:
    • After a successful connection test, finalize settings and create the pipeline.
Following these steps will configure MongoDB as a source in Hevo Data, ready for data integration.

Configure Snowflake as a Destination in Hevo Data:

  1. Check Prerequisites: Ensure you have an active Snowflake account with ACCOUNTADMIN, SECURITYADMIN, or SYSADMIN privileges.
  2. Obtain Snowflake Account URL: Log into Snowflake and retrieve the account URL from the Admin panel.
  3. Create Destination in Hevo:
    • Navigate to DESTINATIONS in Hevo.
    • Click + CREATE and select Snowflake as the destination type.
  4. Enter Snowflake Configuration Details:
    • Provide a unique Destination Name.
    • Enter the Snowflake Account URL you obtained.
    • Specify Database User, Database Password, Database Name, and Database Schema.
  5. Test & Finalize:
    • Click TEST CONNECTION to verify the setup.
    • Save the configuration once the test is successful.

After these steps, Snowflake is configured as a destination in Hevo Data.


With this, you have successfully set up MongoDB to Snowflake Integration using Hevo Data.

Here are a few advantages of using Hevo:

  • Easy Setup and Implementation – Hevo is a self-serve, managed data integration platform. You can cut down your project timelines drastically, as Hevo can help you move data from MongoDB to Snowflake in minutes.
  • Transformations – Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the pipelines you set up. To carry out a transformation, you edit the properties of the event object received as a parameter by the transform method. Hevo also offers drag-and-drop transformations such as Date and Control Functions, JSON, and Event Manipulation, to name a few. These can be configured and tested before being put to use.
  • Connectors – Hevo supports 150+ integrations to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, MongoDB, TokuDB, DynamoDB, and PostgreSQL databases to name a few.  
  • 150+ Pre-built Integrations – In addition to MongoDB, Hevo can bring data from 150+ other data sources into Snowflake in real time, making it a strong companion for your business’s growing data integration needs.
  • Complete Monitoring and Management – If the MongoDB source or the Snowflake data warehouse is not reachable, Hevo re-attempts data loads at set intervals, ensuring that you always have accurate, up-to-date data in Snowflake.
  • 24×7 Support – To ensure that you get timely help, Hevo has a dedicated support team that is available 24×7 to ensure that you are successful with your project.

Simplify your Data Analysis with Hevo today! 

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Method 2: Writing Custom Scripts to Move Data from MongoDB to Snowflake

Below is a quick snapshot of the broad framework to move data from MongoDB to Snowflake using custom code.


Let’s take a detailed look at each of the steps required for MongoDB to Snowflake integration:

Step 1: Extracting data from MongoDB Collections

mongoexport is a utility that ships with MongoDB and can be used to export the data stored in any MongoDB collection as JSON or CSV.

The following points are to be noted while using mongoexport:

  • mongoexport runs directly from the system command line, not from within the mongo shell (the mongo shell is the command-line client used to interact with MongoDB).
  • The connecting user should have at least the read role on the target database; otherwise, a permission error will be thrown.
  • When connected to mongos or a replica set, mongoexport uses primary read (read operations directed to the primary member of the replica set) as the default read preference.
  • The default read preference can be overridden using the --readPreference option.

Below is an example showing how to export data from the collection named contact_coln to a CSV file at the location /opt/exports/csv/empl_contacts.csv:

mongoexport --db users --collection contact_coln --type=csv --fields empl_name,empl_address --out /opt/exports/csv/empl_contacts.csv
  • To export in CSV format, you should specify the column names in the collection to be exported. The above example specifies the empl_name and empl_address fields to export.

The output would look like this:

empl_name,empl_address
Prasad,"12 B street, Mumbai"
Rose,"34544 Mysore"
  • You can also list the fields to export in a file, with one field per line. For example, you can list the empl_name and empl_address fields in a file named empl_contact_fields.txt:

empl_name
empl_address

Then, using the --fieldFile option, point mongoexport at that file:

mongoexport --db users --collection contact_coln --type=csv --fieldFile empl_contact_fields.txt --out /opt/backups/employee_contacts.csv
  • Exported CSV files include the field names as a header by default. If you don’t want a header in the output file, the --noHeaderLine option can be used.
  • As in the example above, --fields can be used to specify the fields to export, including nested fields. For example, if a post_code field is nested inside the empl_address field, it can be referenced as empl_address.post_code.

Incremental Data Extract From MongoDB

So far we have discussed extracting an entire MongoDB collection. It is also possible to filter the data while extracting from a collection by passing a query, which can be used for incremental data extraction. The --query (or -q) option is used to pass the query.
For example, let’s consider the contacts collection discussed above. Suppose the updated_time field in each document stores the Unix timestamp of the document’s last update or insert.

mongoexport -d users -c contact_coln -q '{ updated_time: { $gte: 154856788 } }' --type=csv --fieldFile empl_contact_fields.txt --out exportdir/employee_contacts.csv

The above command extracts all records from the collection with updated_time greater than or equal to the specified value (154856788). You should keep track of the last pulled updated_time separately and use that value when fetching data from MongoDB each time.

Step 2: Optional Data Type conversions and Data Formatting

Along with any application-specific logic applied while transferring data, the following points need to be taken care of when migrating data to Snowflake.

  • Snowflake supports many character sets, including UTF-8 (the default). For the full list of supported encodings, please visit here.
  • If you have worked with cloud-based data warehousing solutions before, you might have noticed that most of them lack support for standard SQL constraints like UNIQUE, PRIMARY KEY, FOREIGN KEY, and NOT NULL. Snowflake, however, lets you define most of these SQL constraints (note that only NOT NULL is enforced).
  • Snowflake data types cover all basic and semi-structured types like arrays. It also has inbuilt functions to work with semi-structured data. The below list shows Snowflake data types compatible with the various MongoDB data types.
MongoDB Data Type | Snowflake Data Type
DOUBLE            | DOUBLE, DOUBLE PRECISION
STRING            | STRING, TEXT
BINARY DATA       | BINARY
OBJECTID          | STRING, TEXT
BOOLEAN           | BOOLEAN
DATE              | DATE
NULL              | NULL
32-BIT INTEGER    | INTEGER
TIMESTAMP         | TIMESTAMP
64-BIT INTEGER    | BIGINT
DECIMAL128        | DECIMAL
  • As the table above shows, Snowflake accepts almost all common date/time formats when inserting data. You can explicitly specify the format while loading data with the help of the File Format option; a brief sketch follows this list, and we will discuss this in more detail later. The full list of supported date and time formats can be found here.
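As a rough illustration of how that option can be used, the sketch below creates a named file format that tells Snowflake how to parse dates and timestamps in the exported CSV files. The format name and the date/time patterns are assumptions for illustration, not values taken from this walkthrough:

-- Sketch: a named CSV file format with explicit date/time parsing
-- (the format name and patterns are illustrative assumptions)
create or replace file format mongodb_csv_format
  type = 'CSV'
  field_delimiter = ','
  skip_header = 1
  date_format = 'YYYY-MM-DD'
  timestamp_format = 'YYYY-MM-DD HH24:MI:SS';

-- The format can then be referenced while copying data, e.g.:
-- copy into mongodb_table from @mongodb_stage file_format = (format_name = 'mongodb_csv_format');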

Step 3: Staging Data Files

Before data can be inserted into a Snowflake table, it needs to be uploaded to a staging location. This process is called staging. Snowflake supports two types of stages – internal and external.

Internal Stage

For every user and table, Snowflake creates and allocates a default staging location that is used for staging activities. These stages are named using the conventions mentioned below. Note that it is also possible to create named internal stages.

  • The user stage is named ‘@~’.
  • The table stage has the same name as the table and is referenced as ‘@%<table_name>’ (see the sketch after this list).
  • The default user and table stages can’t be altered or dropped.
  • It is not possible to set file format options on the default user or table stages.
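For instance, a file could be staged to the default user or table stage directly, without creating any stage object. The file path and table name below are assumptions for illustration only:

-- Stage a local file to the current user's default stage
put file:///tmp/mongodb_data/data/employee_contacts.csv @~;

-- Stage the same file to the default stage of a table named mongodb_table
put file:///tmp/mongodb_data/data/employee_contacts.csv @%mongodb_table;

-- List the files currently in the user stage
list @~;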

Named internal stages can be created explicitly using SQL statements. While creating a named internal stage, the file format and other options can be set, which makes loading data into the table easy with minimal command options.

SnowSQL is Snowflake’s lightweight CLI client, which can be used to run commands like DDL statements and data loads. It is available for Linux, macOS, and Windows. Read more about the tool and its options here.

Below is an example command to create a named stage:

create or replace stage my_mongodb_stage
  copy_options = (on_error='skip_file')
  file_format = (type = 'CSV' field_delimiter = '|' skip_header = 2);

The PUT command is used to stage data files to an internal stage. The syntax is straightforward – you only need to specify the file path and the stage name:

PUT file://<path_to_file>/<filename> @<internal_stage_name>

For example, to upload a file named employee_contacts.csv from the /tmp/mongodb_data/data/ directory to an internal stage named mongodb_stage:

put file:///tmp/mongodb_data/data/employee_contacts.csv @mongodb_stage;

There are several options that can be set to optimize the upload, such as the degree of parallelism and automatic compression of the data files (see the sketch below). More information about those options is listed here.
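As a rough sketch, the same PUT command could be tuned with the PARALLEL and AUTO_COMPRESS options; the values below are illustrative assumptions, not recommendations:

-- Upload using 8 parallel threads and let Snowflake gzip-compress the file
put file:///tmp/mongodb_data/data/employee_contacts.csv @mongodb_stage
  parallel = 8
  auto_compress = true;

-- Verify what has been staged
list @mongodb_stage;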

External Stage

AWS and Azure are industry leaders in the public cloud market, so it is no surprise that Snowflake supports both Amazon S3 and Microsoft Azure as external staging locations. If the data is already in S3 or Azure, all you need to do is create an external stage pointing to it, and the data can be loaded into the table.

To create an external stage on S3, IAM credentials have to be specified. If the data in S3 is encrypted, encryption keys should also be given:

create or replace stage mongodb_ext_stage
  url='s3://snowflake/data/mongo/load/files/'
  credentials=(aws_key_id='181a233bmnm3c' aws_secret_key='a00bchjd4kkjx5y6z')
  encryption=(master_key = 'e00jhjh0jzYfIjka98koiojamtNDwOaO8=');

Data can be uploaded to the external stage using the respective cloud provider’s web interface, SDKs, or third-party tools.

Step 4: Copying Staged Files to Snowflake Table

COPY INTO is the command used to load data from the stage area into a Snowflake table. The compute resources needed to load the data are supplied by virtual warehouses, and the data loading time depends on the size of the virtual warehouse.

For example, to load from a named internal stage:

copy into mongodb_internal_table
from @mongodb_stage;

To load from an external stage (here only one file is specified):

copy into mongodb_external_stage_table
from @mongodb_ext_stage/tutorials/dataloading/employee_contacts_ext.csv;

To copy directly from an external location without creating a stage:

copy into mongodb_table
from 's3://mybucket/snow/mongodb/data/files'
credentials=(aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY')
encryption=(master_key = 'eSxX0jzYfIdsdsdsamtnBKOSgPH5r4BDDwOaO8=')
file_format = (format_name = csv_format);

A subset of files can be selected using the PATTERN option:

copy into mongodb_table
from @mongodb_stage
file_format = (type = 'CSV')
pattern='.*/.*/.*[.]csv[.]gz';

Some common format options used in the COPY command for the CSV format:

  • COMPRESSION – Compression algorithm used for the input data files.
  • RECORD_DELIMITER – Character used as the record or line separator.
  • FIELD_DELIMITER – Character used to separate fields in the input file.
  • SKIP_HEADER – Number of header lines to skip while loading data.
  • DATE_FORMAT – Used to specify the date format.
  • TIME_FORMAT – Used to specify the time format.

The full list of options is given here.
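Putting a few of these together, a COPY command for the CSV files exported earlier might look roughly like the following; the specific option values are assumptions for illustration:

-- Load gzip-compressed CSVs from the stage, skipping the header row
copy into mongodb_table
from @mongodb_stage
file_format = (type = 'CSV'
               compression = 'GZIP'
               field_delimiter = ','
               skip_header = 1
               date_format = 'YYYY-MM-DD');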

Download the Cheatsheet on How to Set Up ETL to Snowflake
Learn the best practices and considerations for setting up high-performance ETL to Snowflake

Step 5: Migrating to Snowflake

While discussing data extraction from MongoDB, we considered both full and incremental extraction. Here, we will look at how to load that data into Snowflake effectively.

Snowflake’s unique architecture helps to overcome many shortcomings of existing big data systems. Support for row-level updates is one such feature.

Out-of-the-box support for row-level updates makes delta data loads to a Snowflake table simple. We can extract the data incrementally, load it into a temporary (intermediate) table, and then modify the records in the final table based on the data in the temporary table, as sketched below.
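A minimal sketch of that staging step, assuming an intermediate table with the same columns as the final table (the table and stage names are illustrative):

-- Create an intermediate table with the same structure as the final table
create temporary table intermed_mongodb_table like final_mongodb_table;

-- Load the incrementally extracted, staged CSV files into it
copy into intermed_mongodb_table
from @mongodb_stage
file_format = (type = 'CSV' skip_header = 1);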

There are three popular ways to update the final table once the new data has been loaded into the intermediate table.

  1. Update the rows in the final table with the values from the temporary table, then insert the new rows from the temporary table into the final table.

UPDATE final_mongodb_table t
SET t.value = s.value
FROM intermed_mongodb_table s
WHERE t.id = s.id;

INSERT INTO final_mongodb_table (id, value)
SELECT id, value
FROM intermed_mongodb_table
WHERE id NOT IN (SELECT id FROM final_mongodb_table);

  2. Delete all rows from the final table that are also present in the temporary table, then insert all rows from the intermediate table into the final table.

DELETE FROM final_mongodb_table
WHERE id IN (SELECT id FROM intermed_mongodb_table);

INSERT INTO final_mongodb_table (id, value)
SELECT id, value
FROM intermed_mongodb_table;

  3. MERGE statement – Using a single MERGE statement, both inserts and updates can be carried out simultaneously. We can use this option to apply the changes from the temporary table to the final table.

MERGE INTO final_mongodb_table t1
USING intermed_mongodb_table t2 ON t1.key = t2.key
WHEN MATCHED THEN UPDATE SET value = t2.value
WHEN NOT MATCHED THEN INSERT (key, value) VALUES (t2.key, t2.value);

Limitations of using Custom Scripts to Connect MongoDB to Snowflake

  • If you want to migrate data from MongoDB to Snowflake in batches, then this approach works decently well. However, if you are looking for real-time data availability, this approach becomes extremely tedious.
  • If you are working with a small volume of data, this approach works fine. As the volume of data increases, you would have to perform incremental loads, which require significant extra effort and make the data migration harder.
  • With the current approach, you would only be able to move data from one place to another. If you would want to transform the data when in transit, this will become hard to achieve.
  • MongoDB’s dynamic structure makes it hard to deal with nested objects and arrays within a document.
  • When you write code to extract a subset of data, those scripts often break as the source schema keeps changing or evolving, which can result in data loss.

The method described above is error-prone, which might impact the availability and accuracy of the data in Snowflake.

Method 3: Using Native Cloud Tools and Snowpipe for MongoDB to Snowflake


Snowpipe, provided by Snowflake, enables a shift from traditional scheduled batch loading jobs to a more dynamic approach. It replaces manually scheduled COPY commands, facilitating near real-time data availability. Essentially, Snowpipe loads data from a staging area into Snowflake in small increments, working in tandem with your cloud provider’s native services, such as AWS or Azure.

For illustration, consider these scenarios for each cloud provider, detailing the integration of your platform’s infrastructure and the transfer of data from MongoDB to a Snowflake warehouse:

AWS: Use a Kinesis delivery stream to deposit MongoDB data into an S3 bucket. With S3 event notifications in place (for example, via SNS), Snowpipe is notified when new files arrive and can load the data into Snowflake.

Azure: Activate Snowpipe with an Event Grid message corresponding to Blob storage events. Your MongoDB data is initially placed into an external Azure stage. Upon creating a blob storage event message, Snowpipe is alerted via Event Grid when the data is primed for Snowflake insertion. Subsequently, Snowpipe transfers the queued files into a pre-established table in Snowflake. For comprehensive guidance, Snowflake offers a detailed manual on the setup.
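On the Snowflake side, the auto-ingest variant of Snowpipe is typically defined once over an external stage. The sketch below assumes JSON files and hypothetical object names; it is an illustration, not a full setup:

-- A pipe that automatically ingests new files landing in the external stage
create or replace pipe mongodb_pipe
  auto_ingest = true
as
  copy into mongodb_table
  from @mongodb_ext_stage
  file_format = (type = 'JSON');

-- SHOW PIPES lists the notification channel used to wire up cloud event notifications
show pipes;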

Limitations of Using Native Cloud Tools and Snowpipe

  • A deep understanding of NoSQL databases, Snowflake, and cloud services is crucial. Troubleshooting in a complex data pipeline environment necessitates significant domain knowledge, which may be challenging for smaller or less experienced data teams.
  • Long-term management and ownership of the approach can be problematic, as the resources used are often controlled by teams outside the Data department. This requires careful coordination with other engineering teams to establish clear ownership and ongoing responsibilities.
  • The absence of native tools for applying schema to NoSQL data presents difficulties in schematizing the data, potentially reducing its value in the data warehouse.

MongoDB to Snowflake: Use Cases

  • Snowflake’s system supports JSON natively, which is central to MongoDB’s document model. This allows direct loading of JSON data into Snowflake without needing to convert it into a fixed schema, eliminating the need for an ETL pipeline and concerns about evolving data structures.
  • Snowflake’s architecture is designed for scalability and elasticity online. It can handle large volumes of data at varying speeds without resource conflicts with analytics, supporting micro-batch loading for immediate data analysis. Scaling up a virtual warehouse can speed up data loading without causing downtime or requiring data redistribution.
  • Snowflake’s core is a powerful SQL engine that works seamlessly with BI and analytics tools. Its SQL capabilities extend beyond relational data, enabling access to MongoDB’s JSON data, with its variable schema and nested structures, through SQL. Snowflake’s extensions and relational views make this JSON data readily usable with SQL-based tools, as sketched below.
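To make that concrete, a minimal sketch of loading MongoDB JSON exports into a VARIANT column and exposing them through a relational view might look like this; the table, stage, and field names are hypothetical:

-- Land the raw JSON documents in a single VARIANT column
create or replace table mongodb_raw (doc variant);

copy into mongodb_raw
from @mongodb_stage
file_format = (type = 'JSON');

-- Expose selected (possibly nested) fields through a relational view
create or replace view employee_contacts_v as
select
    doc:empl_name::string              as empl_name,
    doc:empl_address.post_code::string as post_code
from mongodb_raw;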

Conclusion

This blog discussed the three methods you can use to move data from MongoDB to Snowflake in a seamless fashion: using Hevo, a third-party tool; writing custom ETL scripts; and using native cloud tools with Snowpipe.

Extracting complex data from a diverse set of data sources can be a challenging task and this is where Hevo saves the day! Hevo offers a faster way to move data from Databases or SaaS applications such as MongoDB into your Data Warehouse like Snowflake to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

VISIT OUR WEBSITE TO EXPLORE HEVO

Want to take Hevo for a spin?

SIGN UP to explore a hassle-free data migration from MongoDB to Snowflake. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of migrating data from MongoDB to Snowflake in the comments section below!

Freelance Technical Content Writer, Hevo Data

Faisal loves data science and combines his problem-solving ability and passion for writing to help data teams in solving complex business problems.

Get Started with Hevo