MongoDB to BigQuery ETL: 5 Comprehensive Steps

Tutorial, Data Warehouse, Database, Google BigQuery, MongoDB • February 25th, 2019

If you have a habit of observing technology trends in the data space, I’m sure you have heard the SQL vs NoSQL debate at least a couple of times. Around five years ago the debate was at its peak, while NoSQL solutions like MongoDB were getting a lot of traction. Ironically, the adoption of SQL has since increased more than ever before, especially in analytics use cases. Many NoSQL systems like Cassandra and Elasticsearch are now adding SQL interfaces to interact with the data.

MongoDB is a popular NoSQL database that stores data in a JSON-like document format. If your application’s data model is a natural fit for MongoDB’s recommended data model, it can provide good performance, flexibility, and scalability for transactional workloads. But when it comes to analyzing data, MongoDB is not a great solution: it has no proper joins, getting data from other systems into MongoDB is difficult, and there is no native support for SQL. Drafting complex analytics logic in MongoDB’s aggregation framework is also much harder than in SQL. Therefore, it is highly recommended to stream data from MongoDB to BigQuery and use BigQuery for further analytical purposes.

Moving data from MongoDB to BigQuery, a Data Warehouse solution with proper SQL support, makes things a lot easier. Google BigQuery is the Data Warehouse offering on Google Cloud Platform and one of the most performant and cost-effective solutions available. This article introduces you to MongoDB and BigQuery, provides the steps to migrate data from MongoDB to BigQuery manually, and also talks about Hevo Data, which makes it easier for you to connect MongoDB to BigQuery and stream your data from MongoDB to BigQuery.

Table of Contents

Introduction to MongoDB


MongoDB is based on a NoSQL document store model. Unlike a traditional relational database, MongoDB stores data objects as separate documents inside a collection. It is one of the most powerful ways to store and retrieve data, allowing you to operate on your data more easily and swiftly.

As MongoDB does not enforce a rigid schema, it is highly flexible and dynamic. You can add, delete, or update your data with little or no impact on your application. Moreover, you can extend its functionality by integrating it with other platforms like BigQuery, Hadoop, etc.

For more information on MongoDB, click here.

Introduction to BigQuery


BigQuery works on a Shared-Nothing Massively Parallel Processing (MPP) architecture. In this architecture, the nodes work independently and do not share memory or storage. Different slices of data are processed in parallel on different nodes, which makes data processing more effective and efficient.

BigQuery separates compute and storage, giving you the freedom to scale each resource as per your need. It allows both horizontal and vertical scaling. In horizontal scaling, you can increase the number of nodes to boost performance within a single logical unit, while in vertical scaling you add more resources to the existing nodes, such as increasing the disk space allocated to a specific node.

For more information on BigQuery, click here.

To connect MongoDB to BigQuery manually and to migrate data from MongoDB to BigQuery, keep reading!

Simplify Data Migration using Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources and takes just 3 steps: select the data source, provide valid credentials, and choose the destination. Hevo loads the data onto the desired Data Warehouse, enriches the data, and transforms it into an analysis-ready form without writing a single line of code. It is, therefore, an excellent solution for connecting MongoDB to BigQuery and streaming your data from MongoDB to BigQuery.

Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.

Get Started with Hevo for free

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely easy for new customers to work with and perform operations on.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-day Free Trial!

Steps to Stream Data from MongoDB to BigQuery


As shown in the diagram below, there are 5 vital steps to stream data from MongoDB to BigQuery.

Steps to connect MongoDB to BigQuery image

Let’s take a comprehensive look at each step required to connect MongoDB to BigQuery and stream your data manually:

Step 1: Extract Data from MongoDB

This is the first step to stream your data from MongoDB to BigQuery. To extract data from MongoDB, you can use “mongoexport”, a command-line utility that creates JSON or CSV exports of the data stored in a MongoDB collection.

General guidelines for using mongoexport to stream data from MongoDB to BigQuery:

  • mongoexport must be run directly from the system command line, not from the Mongo Shell.
  • Ensure that the connecting user possesses, at a minimum, the read role on the target database.
  • mongoexport defaults to primary read preference when connected to a mongos instance or a replica set.
  • The default read preference can be overridden using the --readPreference option.
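
For instance, here is a hedged connection sketch that applies these guidelines to a replica set. The host names, replica set name, user, and password below are placeholders (not values from this tutorial), and reads are directed to a secondary:

# Connect to a replica set, authenticate explicitly, and read from a secondary.
# Host names, replica set name, and credentials are placeholders.
mongoexport --host "rs0/mongo1.example.com:27017,mongo2.example.com:27017" \
  --username reporting_user --password 'secret' --authenticationDatabase admin \
  --readPreference secondary \
  --db users --collection cont_collect --out /opt/exports/json/cnts.json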

Below is an example that exports data from the cont_collect collection to a CSV file at /opt/exports/csv/cnts.csv:

mongoexport --db users --collection cont_collect --type=csv \
  --fields contact_name,contact_address --out /opt/exports/csv/cnts.csv
  • When you export in CSV format, you must specify the fields in the documents to export. In the above example, the contact_name and contact_address fields are specified for export. The output would look like this:
    contact_name,contact_address
    Elon Wick,27 Manhattan
    John Musk,911 Martian Town
  • Optionally, you can specify the fields in a file containing a line-separated list of fields to export, with one field per line. For example, you can specify the contact_name and contact_address fields in a file contact_fields.txt:
    contact_name
    contact_address
  • Then, applying the --fieldFile option, define the fields to export with the file:
    mongoexport --db users --collection cont_collect --type=csv --fieldFile contact_fields.txt --out /opt/backups/contacts.csv
  • By default, field names are included as a header row. The --noHeaderLine option can be used to eliminate the field names from a CSV export.
  • As shown in the above example, --fields can be used to specify fields. A specified field can also be a field inside a sub-document.

Incremental Data Extract from MongoDB

Instead of extracting the entire collection, you can pass a query to filter the data. This option can be used to extract data incrementally. --query or -q is used to pass the query.

As an example, let’s consider the contacts collection discussed above. Suppose the ‘last_updated_time’ field in each document stores the last updated/inserted Unix timestamp for that document.

mongoexport -d users -c contacts -q '{ last_updated_time: { $gte: 1548930176 } }' \
  --out exportdir/myRecords.json

The above command will extract all records from the collection having last_updated_time greater than or equal to the specified value, 1548930176. You should keep track of the last pulled last_updated_time separately.
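
One minimal way to do that is to persist the watermark in a small state file between runs. The sketch below assumes a local file (here last_updated_time.txt, an arbitrary name) is acceptable for this bookkeeping:

# Read the previously stored watermark; default to 0 on the first run.
LAST_TS=$(cat last_updated_time.txt 2>/dev/null || echo 0)
NOW_TS=$(date +%s)

# Export only documents updated since the last run.
mongoexport -d users -c contacts \
  -q "{ last_updated_time: { \$gte: $LAST_TS } }" \
  --out "exportdir/myRecords_${NOW_TS}.json"

# Persist the new watermark for the next incremental run.
echo "$NOW_TS" > last_updated_time.txt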

Step 2: Clean and Transform the Data (Optional)

This is the most complex step when migrating data from MongoDB to BigQuery. Apart from transformations to accommodate business logic, there are some basic things to keep in mind while preparing data from MongoDB for BigQuery.

  • BigQuery expects UTF-8 encoded CSV data. If the data is encoded in ISO-8859-1 (Latin-1), explicitly specify the encoding while loading to BigQuery, or convert the file to UTF-8 beforehand (see the sketch after this list).
  • BigQuery doesn’t enforce Primary Key and Unique Key constraints. ETL (Extract, Transform, and Load) process has to take care of that.
  • Column types are somewhat different between MongoDB and BigQuery. Most of the types have either equivalent or convertible types. Refer to the following list to know more about the common data types:
MongoDB Data Type    BigQuery Data Type
DOUBLE               FLOAT64
STRING               STRING
BINARY DATA          BYTES
OBJECTID             STRING
BOOLEAN              BOOL
DATE                 DATE
NULL                 NULL
32-BIT INTEGER       INT64
TIMESTAMP            TIMESTAMP
64-BIT INTEGER       INT64
DECIMAL128           NUMERIC

Check out the BigQuery documentation to read more about BigQuery data types.

  • DATE values must be dash (-) separated and in the form YYYY-MM-DD (Year-Month-Day). mongoexport writes dates in ISO 8601, which begins with exactly this pattern, so little or no modification is needed here when connecting MongoDB to BigQuery; if a column is loaded as DATE rather than TIMESTAMP, you may only need to trim the time portion.
  • Make sure text columns are quoted if they can potentially contain delimiter characters.
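
If you prefer to normalize the files before upload rather than setting an encoding flag at load time, a standard iconv conversion is usually enough. The file paths below are the ones assumed earlier in this tutorial:

# Convert a Latin-1 encoded export to UTF-8 before uploading it to GCS.
iconv -f ISO-8859-1 -t UTF-8 /opt/exports/csv/cnts.csv > /opt/exports/csv/cnts_utf8.csv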

Step 3: Upload to Google Cloud Storage (GCS)

Once data is extracted from MongoDB, the next step is to upload data to GCS. There are multiple ways to do this.

Gsutil is the standard tool from Google Cloud Platform (GCP) to handle Objects and Buckets in GCS. It has options to upload a file from your local machine to GCS.

To copy a file to GCS:

gsutil cp table_data.csv  gs://my-bucket/path/to/folder/

To copy an entire folder:

gsutil cp -r dir gs://my-bucket/path/to/parent/

If the files to be copied are already in S3, the following command can be used:

gsutil cp -R s3://bucketname/source/path gs://bucketname/destination/path

Storage Transfer Service from Google Cloud is a great alternative for uploading files to GCS from S3 or other online data sources, provided the destination is a Cloud Storage bucket. This service also supports a GCS bucket as a source.

This service is handy when it comes to data movement to GCS with the following options:

  • Both first-time and recurring loads can be scheduled.
  • After successfully transferring data to GCS, the source data can be deleted.
  • Periodic synchronization between source and target can be set up, and the objects transferred can further be narrowed down using file filters.
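
If you would rather stay on the command line than configure the Storage Transfer Service, gsutil rsync can keep a local export directory and a bucket folder in sync. The bucket and path names below are placeholders:

# Mirror the local export directory into the bucket; -m parallelizes, -r recurses.
gsutil -m rsync -r /opt/exports/csv gs://my-bucket/mongodb-exports/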

Uploading From The Web Console

To upload data from the local machine, the GCP Web Console has built-in options. Follow the steps given below to make use of them:

  • Log in to GCP, click on “Storage” and go to “Browser” in the left bar as shown in the image below:
Google Cloud Platform Dashboard image
  • Select the GCS bucket with proper access where you want to upload the file that is to be loaded into BigQuery. In the image below, test-data-hevo is the bucket name. Click on the bucket name.
Browse Bucket image
  • Now you are on the Bucket details page. Click on “Upload Files”, then select the file from your computer.
Upload Files Settings image
  • The data upload now starts and a progress bar is shown; wait till it completes. After completion, the file will be listed in the Bucket as shown in the image below.
After Uploading the File image

Step 4: Upload Data Extracted from MongoDB to BigQuery Table from GCS

You can use the Web Console UI or the command-line tool called bq to load the data extracted from MongoDB into a BigQuery table. First, take a look at how to do it using the Web Console, step by step:

  • Go to “BigQuery Console” from the left side panel as shown in the image below.
BigQuery Dashboard image
  • Create a dataset if not present already as shown in the image below.
Create Dataset image
  • Now, click on the created dataset on the left side. The “Create Table” option will then appear below the Query Editor as shown in the image below.
Create Table image
  • The “Create Table” button will open the wizard with options to specify the input source and other specifications for the table as shown in the image below.
Create Table Settings image

As shown in the above image, there are options to specify the schema manually or auto-detect it. Many other options are also available.

As mentioned above, the command-line tool bq is a handy way to interact with BigQuery and helps upload data from MongoDB to BigQuery via GCS. The bq load command can be used for this, with source_format set to CSV.

The syntax of bq load command is as follows:

bq --location=[LOCATION] load --source_format=[FORMAT_NAME] \
  [DATASET_NAME].[TABLE_NAME] [SOURCE_PATH] [INLINE_SCHEMA]

[LOCATION] is the data location.
[FORMAT_NAME] is CSV here.
[DATASET_NAME] is an existing dataset in BigQuery.
[TABLE_NAME] is the name of the table into which data is to be loaded.
[SOURCE_PATH] is the Cloud Storage URI of the data files.
[INLINE_SCHEMA] is a valid schema describing the data. The schema can be a local JSON file or can be specified directly.
Note: The --autodetect flag can also be used instead of supplying a schema definition.
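
As an illustration, a schema file for the contacts export used earlier could be created like this. The file name schema_file.json matches the examples below, and the field list is assumed from the mongoexport example:

# Write a BigQuery schema definition for the two exported contact fields.
cat > schema_file.json <<'EOF'
[
  {"name": "contact_name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "contact_address", "type": "STRING", "mode": "NULLABLE"}
]
EOF

An inline schema such as contact_name:STRING,contact_address:STRING is also accepted in place of the file path.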

There are a bunch of options specific to CSV data loads:

CSV option                       bq command flag             Description
Field Delimiter                  -F or --field_delimiter     Field delimiter: Comma, Tab, Pipe, Other
Header rows                      --skip_leading_rows         Header rows to skip
Number of bad records allowed    --max_bad_records           Number of errors allowed
Newline characters               --allow_quoted_newlines     Allow quoted newlines
Custom null values               --null_marker               String that represents a null value in a CSV file
Trailing optional columns        --allow_jagged_rows         Allow jagged rows
Unknown values                   --ignore_unknown_values     Ignore unknown values
Quote                            --quote                     The value used to quote data sections in a CSV file
Encoding                         -E or --encoding            The character encoding of the data

To see the full list of options, visit the BigQuery documentation on loading CSV data from Cloud Storage.
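
For example, a load that skips the header line written by mongoexport, allows quoted newlines inside address fields, and tolerates a few bad records might look like the command below (dataset, table, and bucket names are placeholders):

# Skip the CSV header, permit quoted newlines, and allow up to 10 bad records.
bq --location=US load --source_format=CSV \
  --skip_leading_rows=1 --allow_quoted_newlines --max_bad_records=10 \
  your_dataset_name.your_table_name \
  gs://bucket_name/data_file.csv ./schema_file.json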

The example commands below for loading data using the bq command-line tool will help you familiarise yourself with different options:

Specify data schema using a schema JSON file:

bq --location=US load --source_format=CSV \
  your_dataset_name.your_table_name \
  gs://bucket_name/data_file.csv ./schema_file.json

Auto-detected schema from the file:

bq --location=US load --autodetect --source_format=CSV \
  your_dataset_name.your_table_name \
  gs://bucket_name/path/to/file/data_file.csv

If data is written to an existing table, BigQuery provides 3 options:

  • Write if the target is empty.
  • Append to the target table.
  • Overwrite the target table.

It is also possible to add new fields to the table while uploading data.

Let us see each with an example.

To overwrite target table:

bq --location=US load --autodetect --replace --source_format=CSV \
  your_dataset_name.your_table_name gs://bucket_name/path/to/file/file_name.csv

To append to the target table:

bq --location=US load --noreplace --source_format=CSV \
  your_dataset_name.your_table_name gs://bucket_name/path/to/file/file_name.csv ./schema_file.json

If you would like to add a new field to the target table, a new schema JSON file with the extra field can be given as an option:

bq --location=asia-northeast1 load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=CSV \
  your_dataset_name.your_table_name gs://bucket_name/path/to/file/file_name.csv ./schema_file.json
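
After a load completes, a quick sanity check helps confirm the schema and row count landed as expected (names are again placeholders):

# Inspect the table schema and metadata reported by BigQuery.
bq show --format=prettyjson your_dataset_name.your_table_name

# Count the loaded rows.
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM your_dataset_name.your_table_name'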

Step 5: Update Target Table in BigQuery

This is the final step to stream data from MongoDB to BigQuery. To upsert new data into the BigQuery table, first load the extracted data into a staging table as a full load. Let’s call it interm_delta_table. Either of the 2 approaches below can be followed to load the data into the final table (a MERGE-based alternative is sketched after them):

  • Update the rows in the Final Table and insert new rows from the Delta Table which are missing from the Final Table using the code given below:
UPDATE data_set_name.final_target_table_name t
SET value = s.value
FROM data_set_name.interm_delta_table s
WHERE t.id = s.id;

INSERT data_set_name.final_target_table_name (id, value)
SELECT id, value
FROM data_set_name.interm_delta_table
WHERE id NOT IN (SELECT id FROM data_set_name.final_target_table_name);
  • Delete the rows from the Final Table which are present in the Delta Table, and then insert all records from the Delta Table into the Final Table using the code given below:
DELETE data_set_name.final_target_table_name f
WHERE f.id IN (SELECT id FROM data_set_name.interm_delta_table);

INSERT data_set_name.final_target_table_name (id, value)
SELECT id, value
FROM data_set_name.interm_delta_table;
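
Alternatively, BigQuery’s MERGE statement can express the same upsert as a single atomic operation. Here is a sketch using the same staging and final table names, run through the bq command-line tool:

# Upsert the staging table into the final table in one MERGE statement.
bq query --use_legacy_sql=false '
MERGE data_set_name.final_target_table_name t
USING data_set_name.interm_delta_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value)'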

Conclusion

The article introduced you to MongoDB and BigQuery. It also provided you with a step-by-step guide that you can use to connect MongoDB to BigQuery and stream your data from MongoDB to BigQuery. Most of the time, moving data from a NoSQL system like MongoDB to BigQuery is essential for proper data analysis, but converting the schema to a normalized form and changing the data format without any loss of information is tricky.

For keeping track of recurring incremental data loads, separate systems might be required for storing ETL job-related metadata. There is a quicker option to deliver the same output and scale instantly. A Cloud Data Integration Platform – Hevo, can help you consolidate data from any data source like MongoDB to BigQuery in a matter of a few minutes.

Visit our Website to Explore Hevo

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of streaming data from MongoDB to BigQuery in the comments section below!
