Press "Enter" to skip to content

MongoDB to Snowflake ETL: Steps to Migrate Data Easily


Before we dive deep into MongoDB to Snowflake data migration steps, it is important to understand the unique properties of MongoDB and Snowflake that make a data migration like this both challenging and exciting.

NoSQL databases like MongoDB address very specific use cases. Data storage and access patterns are highly optimized for fast writes and reads, along with factors like availability and durability of data. MongoDB has its own recommended data model and access patterns, which are not as flexible as standard SQL.

When it comes to analyzing the data, most of us prefer a system with a proper SQL interface. Seldom do people want to learn a new query language or framework, and almost all existing reporting or visualization tools require an SQL interface to extract data. This is one of the key reasons data warehousing solutions with proper SQL support have gained so much traction. Snowflake is a cloud-native data warehousing solution. Within a short span of time, Snowflake has gained the trust of many enterprises and currently plays a critical role in their decision support systems.

Due to MongoDB’s schemaless nature, it becomes important to move the data to a warehouse like Snowflake for meaningful analysis.

Methods to replicate MongoDB to Snowflake

There are two popular methods to perform MongoDB to Snowflake ETL.

Method 1: Use a fully-managed, hassle-free solution like Hevo Data – 14 Day Free Trial (Official Snowflake ETL Partner).

Method 2: Write a Custom Code to move data from MongoDB to Snowflake.

In this post, we will deep dive into Method 2. Below is a quick snapshot of the broad framework to move data from MongoDB to Snowflake using custom code.

The steps are:

  1. Extract data from MongoDB collections – mongoexport is the MongoDB utility that can be used for this

  2. Optional data type conversions and data formatting, if needed

  3. Stage Data Files (internally or externally)

  4. Copy staged files to Snowflake table

Let’s take a detailed look at all the required steps:

  1. Extract Data From MongoDB

As discussed earlier, mongoexport is a utility that comes with MongoDB and can be used to create a JSON or CSV export of the data stored in any MongoDB collection.

The following points are to be noted while using mongoexport:

  • mongoexport should be run directly from the system command line, not from the mongo shell (the mongo shell is the command-line tool used to interact with MongoDB).
  • The connecting user should have at least the read role on the target database; otherwise, a permission error will be thrown.
  • mongoexport by default uses primary read (read operations are directed to the primary member of a replica set) as the read preference when connected to mongos or a replica set.
  • The default read preference of primary read can be overridden using the --readPreference option, as sketched below.
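For instance, a minimal sketch that reads from a secondary member instead of the primary (the collection and field names reuse the example that follows):

mongoexport --db users --collection contact_coln --readPreference=secondary --type=csv --fields empl_name,empl_address --out /opt/exports/csv/empl_contacts.csv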

Below is an example showing how to export data from the collection named contact_coln to a CSV file at the location /opt/exports/csv/empl_contacts.csv:

mongoexport --db users --collection contact_coln --type=csv --fields empl_name,empl_address --out /opt/exports/csv/empl_contacts.csv
  • To export in CSV format, you should specify the field names in the collection to be exported. The above example specifies the empl_name and empl_address fields to export.

The output would look like:

empl_name,empl_address
Prasad,"12 B street, Mumbai"
Rose,34544 Mysore
  • You can also specify the fields to be exported in a file as a line-separated list – one field per line. For example, you can list the empl_name and empl_address fields in a file empl_contact_fields.txt:

empl_name
empl_address

Then, using the --fieldFile option, point mongoexport at that file to define the fields to export:

mongoexport --db users --collection contact_coln --type=csv --fieldFile empl_contact_fields.txt --out /opt/backups/employee_contacts.csv
  • The exported CSV file will have field names as a header by default. If you don’t want a header in the output file, the --noHeaderLine option can be used.
  • As in the above example, --fields can be used to specify the fields to be exported. It can also reference nested fields. Suppose you have a post_code field inside the empl_address field; it can be specified as empl_address.post_code (see the sketch after this list).
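A minimal sketch of exporting a nested field with dot notation (assuming the documents actually contain an empl_address.post_code sub-field):

mongoexport --db users --collection contact_coln --type=csv --fields empl_name,empl_address.post_code --out /opt/exports/csv/empl_postcodes.csv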

Incremental Data Extract From MongoDB

So far, we have discussed extracting an entire MongoDB collection. It is also possible to filter the data while extracting from the collection by passing a query. This can be used for incremental data extraction. --query or -q is used to pass the query.
For example, let’s consider the contacts collection discussed above. Suppose the ‘updated_time’ field in each document stores the last updated or inserted Unix timestamp for that document.

mongoexport -d users -c contact_coln -q '{ updated_time: { $gte: 154856788 } }' --type=csv --fieldFile empl_contact_fields.txt --out exportdir/employee_contacts.csv

The above command will extract all records from the collection with updated_time greater than or equal to the specified value, 154856788. You should keep track of the last pulled updated_time separately and use that value each time you fetch data from MongoDB, as sketched below.
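One simple way to do this is to keep a high-water mark in a small checkpoint file (the file path here is a hypothetical example) that is read before each export and updated afterwards:

# read the last extracted timestamp, export only newer documents, then record the new mark
LAST_TS=$(cat /opt/exports/last_updated_time.txt)
EXPORT_TS=$(date +%s)
mongoexport -d users -c contact_coln -q "{ \"updated_time\": { \"\$gte\": $LAST_TS } }" --type=csv --fieldFile empl_contact_fields.txt --out exportdir/employee_contacts.csv
echo $EXPORT_TS > /opt/exports/last_updated_time.txt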

2. Data type conversion and formatting

Along with the application-specific logic to be applied while transferring data, the following are to be taken care of when migrating data to Snowflake.

  • Snowflake supports many character sets, including UTF-8. For the full list of supported encodings, please visit here.
  • If you have worked with cloud-based data warehousing solutions before, you might have noticed that most of them lack support for standard SQL constraints like UNIQUE, PRIMARY KEY, FOREIGN KEY, and NOT NULL. Keep in mind that Snowflake supports most of these SQL constraints.
  • Snowflake data types cover all basic types as well as semi-structured types like arrays. It also has built-in functions to work with semi-structured data. The table below shows Snowflake data types compatible with the various MongoDB data types.

MongoDB Data Type     Snowflake Data Type
DOUBLE                DOUBLE, DOUBLE PRECISION
STRING                STRING, TEXT
BINARY DATA           BINARY
OBJECTID              STRING, TEXT
BOOLEAN               BOOLEAN
DATE                  DATE
NULL                  NULL
32-BIT INTEGER        INTEGER
TIMESTAMP             TIMESTAMP
64-BIT INTEGER        BIGINT
DECIMAL128            DECIMAL

  • While inserting data, Snowflake accepts almost all of the common date/time formats. You can also explicitly specify the format while loading data with the help of the File Format Option, as sketched below; we will discuss this in detail later. The full list of supported date and time formats can be found here.
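As an illustration, a target table for the employee contacts export and a file format spelling out the expected date and time formats could look like this (a sketch; the column names and format values are assumptions based on the sample data above):

create or replace table empl_contacts (
    empl_name     string,
    empl_address  string,
    updated_time  integer   -- Unix timestamp carried over from MongoDB
);

create or replace file format mongodb_csv_format
    type = 'CSV'
    field_delimiter = ','
    skip_header = 1
    date_format = 'YYYY-MM-DD'
    time_format = 'HH24:MI:SS';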

3. Stage Data Files

If you want to insert data into a Snowflake table, the data first has to be uploaded to a staging location, for example an internal Snowflake stage or an external location like S3. This process is called staging. Generally, Snowflake supports two types of stages – internal and external.

Internal Stage

For every user and table, Snowflake creates and allocates a staging location that is used by default for staging activities; these stages are named using the conventions mentioned below. Note that it is also possible to create named internal stages.

  • The user stage is referenced as ‘@~’.
  • The table stage has the same name as the table and is referenced as ‘@%’ followed by the table name (see the sketch after this list).
  • The user and table stages can’t be altered or dropped.
  • It is not possible to set file format options in the default user or table stages.
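For instance, the default stages can be listed directly in SnowSQL (the table name is the illustrative empl_contacts table from the sketch above):

list @~;               -- files in the current user’s stage
list @%empl_contacts;  -- files in the empl_contacts table stage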

Named internal stages can be created explicitly using SQL statements. While creating named internal stages, the file format and other options can be set, which makes loading data into the table very easy with minimal command options.

SnowSQL is a lightweight CLI client which can be used to run commands like DDLs or data loads. It is available for Linux, Mac, and Windows. Read more about the tool and its options here.
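A connection is typically opened with the account and user name (placeholders here, to be replaced with your own values):

snowsql -a <account_name> -u <user_name>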

Below are some example commands to create a stage:

Create a named stage:

create or replace stage my_mongodb_stage
copy_options = (on_error='skip_file')
file_format = (type = 'CSV' field_delimiter = '|' skip_header = 2);

The PUT command is used to stage data files to an internal stage. The syntax is straightforward: you only need to specify the file path and the stage name.

PUT file://path_to_file/filename @internal_stage_name

Eg:

Upload a file named employee_contacts.csv in the /tmp/mongodb_data/data/ directory to an internal stage named mongodb_stage:

put file:///tmp/mongodb_data/data/employee_contacts.csv @mongodb_stage;

There are several options that can be set to optimize the data load while uploading the file, such as the number of parallel threads used for the upload and automatic compression of the data files. More information about those options is listed here.
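For example, a sketch of the same PUT with explicit parallelism and automatic compression (the values are illustrative):

put file:///tmp/mongodb_data/data/employee_contacts.csv @mongodb_stage parallel = 8 auto_compress = true;

Here parallel sets the number of threads used for the upload and auto_compress gzips the file before it is uploaded.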

External Stage

AWS and Azure are industry leaders in the public cloud market, so it does not come as a surprise that Snowflake supports both Amazon S3 and Microsoft Azure as external staging locations. If the data is already in S3 or Azure, all you need to do is create an external stage pointing to it, and the data can be loaded into the table.

To create an external stage on S3, IAM credentials have to be specified. If the data in S3 is encrypted, encryption keys should also be given.

create or replace stage mongodb_ext_stage url='s3://snowflake/data/mongo/load/files/'
credentials=(aws_key_id='181a233bmnm3c' aws_secret_key='a00bchjd4kkjx5y6z')
encryption=(master_key = 'e00jhjh0jzYfIjka98koiojamtNDwOaO8=');

Data can be uploaded to the external stage location using the respective cloud web console, the provided SDKs, or third-party tools.
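For example, with the AWS CLI (the bucket path matches the illustrative stage URL above):

aws s3 cp /tmp/mongodb_data/data/employee_contacts.csv s3://snowflake/data/mongo/load/files/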

4. Copy staged files to Snowflake table

COPY INTO is the command used to load data from the stage area into a Snowflake table. The compute resources needed to load the data are supplied by virtual warehouses, and the data loading time will depend on the size of the virtual warehouse.

Eg:

To load from a named internal stage

copy into mongodb_internal_table
from @mongodb_stage;

To load from the external stage (here only one file is specified):

copy into mongodb_external_stage_table
from @mongodb_ext_stage/tutorials/dataloading/employee_contacts_ext.csv;

To copy directly from an external location without creating a stage:

copy into mongodb_table
from 's3://mybucket/snow/mongodb/data/files'
credentials=(aws_key_id='$AWS_ACCESS_KEY_ID' aws_secret_key='$AWS_SECRET_ACCESS_KEY')
encryption=(master_key = 'eSxX0jzYfIdsdsdsamtnBKOSgPH5r4BDDwOaO8=')
file_format = (format_name = csv_format);

A subset of files can be specified using patterns:

copy into mongodb_table
from @mongodb_stage
file_format = (type = 'CSV')
pattern='.*/.*/.*[.]csv[.]gz';

Some common format options used in the COPY command for CSV format:

  • COMPRESSION – Compression algorithm used for the input data files.
  • RECORD_DELIMITER – The character used as the record or line separator.
  • FIELD_DELIMITER – The character used for separating fields in the input file.
  • SKIP_HEADER – The number of header lines to skip while loading data.
  • DATE_FORMAT – Used to specify the date format.
  • TIME_FORMAT – Used to specify the time format.

The full list of options is given here.
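As an illustration, several of these options can be combined in a single COPY command (a sketch; the delimiter, header, and format values are assumptions about the exported file):

copy into mongodb_internal_table
from @mongodb_stage
file_format = (type = 'CSV'
               compression = 'GZIP'
               field_delimiter = ','
               skip_header = 1
               date_format = 'YYYY-MM-DD'
               time_format = 'HH24:MI:SS');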

5. Update Snowflake Table

While discussing data extraction from MongoDB, both full and incremental methods were considered. Here, we will look at how to migrate that data into Snowflake effectively.

Snowflake’s unique architecture helps to overcome many shortcomings of existing big data systems, and support for row-level updates is one such feature. Out-of-the-box support for row-level updates makes delta data loads to a Snowflake table simple: we can extract the data incrementally, load it into an intermediate (temporary) table, and modify records in the final table as per the data in the intermediate table.
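For example, the intermediate table can be created and populated from the stage as follows (a sketch; the table and stage names follow the earlier examples):

create or replace temporary table intermed_mongodb_table like final_mongodb_table;

copy into intermed_mongodb_table
from @mongodb_stage
file_format = (type = 'CSV' skip_header = 1);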

There are three popular methods to update the final table once the new data has been loaded into the intermediate table.

  1. Update the rows in the final table with the values from the intermediate table, and insert new rows from the intermediate table into the final table.

UPDATE final_mongodb_table t
SET t.value = s.value
FROM intermed_mongodb_table s
WHERE t.id = s.id;

INSERT INTO final_mongodb_table (id, value)
SELECT id, value
FROM intermed_mongodb_table
WHERE id NOT IN (SELECT id FROM final_mongodb_table);

  2. Delete all rows from the final table which are also present in the intermediate table, then insert all rows from the intermediate table into the final table.

DELETE FROM final_mongodb_table
WHERE id IN (SELECT id FROM intermed_mongodb_table);

INSERT INTO final_mongodb_table (id, value)
SELECT id, value
FROM intermed_mongodb_table;

  3. MERGE statement – Using a single MERGE statement, both inserts and updates can be carried out simultaneously. We can use this option to apply the changes from the intermediate table:

MERGE INTO final_mongodb_table t1 USING intermed_mongodb_table t2 ON t1.key = t2.key
WHEN MATCHED THEN UPDATE SET value = t2.value
WHEN NOT MATCHED THEN INSERT (key, value) VALUES (t2.key, t2.value);

Limitations of the Custom ETL Scripts Approach:

  1. If you want to migrate data from MongoDB to Snowflake in batches, this approach works reasonably well. However, if you are looking for real-time data availability, this approach becomes extremely tedious.
  2. If you are working with a small amount of data, this approach will work fine. As the volume of data increases, you would have to perform incremental loads, which requires considerable extra effort and makes the data migration harder.
  3. With the current approach, you would only be able to move data from one place to another. If you want to transform the data while in transit, this becomes hard to achieve.
  4. MongoDB’s dynamic structure makes it hard to deal with nested objects and arrays within a document.
  5. When you write code to extract a subset of data, those scripts often break as the source schema keeps changing or evolving. This can result in data loss.

The method mentioned above is error-prone, which might impact the availability and accuracy of the data in Snowflake.

A Simpler Way to Move Data from MongoDB to Snowflake

There is a hassle-free, error-free alternative that achieves the same result and scales instantly. Hevo, a fully managed cloud data pipeline platform and an official Snowflake ETL partner, can help you move data from any data source into Snowflake in just a few minutes. All of this can be achieved through a simple point-and-click interface without writing any ETL scripts. Here are the steps to replicate MongoDB to Snowflake using Hevo:

  • Authenticate and Connect to your MongoDB database.
  • Select the replication mode: (a) Full Dump and Load (b) Incremental load for append-only data (c) Incremental load for mutable data.
  • Configure the Snowflake destination where the data needs to be moved.

Hevo offers a 14 Day Free Trial for you to explore a hassle-free data migration from MongoDB to Snowflake.
