MongoDB to BigQuery ETL: 2 Comprehensive Methods

on Data Integration • February 25th, 2019 • Write for Hevo

MongoDB to BigQuery - Featured Image

MongoDB is a popular NoSQL database that requires data to be modeled in the JSON format. If your application’s data model has a natural fit to MongoDB’s recommended data model, it can provide good performance, flexibility, and scalability for transaction types of workloads.

But while coming to analyzing data, MongoDB is not a great solution – it doesn’t have proper join, getting data from other systems to MongoDB will be difficult, and no native support for SQL. MongoDB’s aggregation framework is not as easy to draft complex analytics logic as in SQL.

Therefore, it is highly recommended to stream data from MongoDB to BigQuery and use BigQuery for further analytical purposes.

The article provides steps to migrate data from MongoDB to BigQuery. It also talks about Hevo Data, which makes it easier for you to connect MongoDB to BigQuery and, at the same time, stream your data from MongoDB to BigQuery.

Table of Contents

Method 1: Using Hevo Data to Set up MongoDB to BigQuery

Image Source

Hevo Data, a No-code Data Pipeline, helps you directly transfer data from MongoDB to BigQuery and other Data Warehouses, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from MongoDB but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

The steps to load data from MongoDB to BigQuery using Hevo Data are as follows:

Step 1: Select the Source Type

To select MongoDB as the Source:

  1. Click PIPELINES in the Asset Palette.
  2. Click + CREATE in the Pipelines List View.
  3. In the Select Source Type page, select the MongoDB variant.

Step 2: Select the MongoDB Variant

Select the MongoDB service provider that you use to manage your MongoDB databases:

  • Generic Mongo Database: Database management is done at your end, or by a service provider other than MongoDB Atlas.
  • MongoDB Atlas: The managed database service from MongoDB.

Step 3: Specify MongoDB Connection Settings

Refer to the following sections based on your MongoDB deployment:

In the Configure your MongoDB Source page, specify the following:

Configure Source
Image Source

Step 4: Configure BigQuery Connection Settings

Now Select Google BigQuery as your destination and start moving your data.

You can modify only some of the settings that you provide here once the Destination is created. Refer to the section Modifying BigQuery Destination Configuration below for more information.

  1. Click DESTINATIONS in the Asset Palette.
  2. Click + CREATE in the Destinations List View.
  3. In Add Destination page select Google BigQuery as the Destination type.
  4. In the Configure your Google BigQuery Account page, select the authentication method for connecting to BigQuery.
  5. In the Configure your Google BigQuery Warehouse page, specify the following details
BigQuery Settings
Image Source

Here are more reasons to try Hevo:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo is Built to Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support call
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

With continuous Real-Time data movement, Hevo allows you to combine MongoDB data along with your other data sources and seamlessly load it to BigQuery with a no-code, easy-to-setup interface. Try our 14-day full-feature access free trial!

Get Started with Hevo for Free

Method 2: Manual Steps to Stream Data from MongoDB to BigQuery

Follow the steps below for connecting MongoDB to BigQuery and stream your data from MongoDB to BigQuery manuallyShortcode

Let’s take a comprehensive look at all the required steps:

Step 1: Extract Data from MongoDB

This is the first step to streaming your data from MongoDB to BigQuery. To extract data from MongoDB, you can use “mongoexport”. The utility can be used to create JSON or CSV export of data stored in a MongoDB collection.

A general guideline for using mongoexport to stream data from MongoDB to BigQuery:

  • mongoexport must be run directly from the system command line, not from the Mongo Shell.
  • Ensure that the connecting user possesses, at a minimum, the read role on the target database.
  • mongoexport defaults to primary read preference when connected to Mongo or a replica set.
  • The default read preference can be overridden using the —readPreference option.

Below is an example that exports data from cont_collect collection to a CSV file in the location /opt/exports/csv/cnts.csv

  • When you export in CSV format, you must specify the fields in the documents to export. In the above example, specify the contact_name and contact_address fields to export. The output would look like this: contact_name, contact_address Elon Wick, 27 Manhattan John Musk, 911 Martian Town
  • Optionally you can specify the fields in a file containing the line-separated list of fields to export – with one field per line. For example, you can specify the contact_name and contact_address fields in a file contact_fields.txt:
    contact_name contact_address
  • Then, applying the —fieldFile option, define the fields to export with the file: mongoexport –db users –collection cont_collects –type=csv –fieldFile fcontact_ields.txt –out/opt/backups/contacts.cs
  • By default field, names are included as a header. –noHeaderLineoption can be used to eliminate the field names in a CSV export.
  • As shown in the above example, –fields can be used to specify fields. Specified fields can be a field inside a sub-document.

Incremental Data Extract from MongoDB

Instead of extracting the entire collection, you can pass query and filter data. This option can be used to extract data incrementally. –query or -q is used to pass this option.

As an example, let’s consider the above-discussed contacts collection. Suppose ‘last_updated_time’ field in each document stores the last updated/inserted Unix Timestamp for that document.

The above command will extract all records from the collection having last_updated_time greater than the specified value, 1548930176. You should keep track of the last pulled last_updated_time separately.

Step 2: Optional Clean and Transform Data

This is the most complex step to migrate data from MongoDB to BigQuery. Apart from transformations to accommodate business logic, there are some basic things to keep in mind while preparing data from MongoDB to BigQuery.

  • BigQuery expects UTF-8 encoded CSV data. If the data is encoded in ISO-8859-1(Latin-1), explicitly specify it while loading to BigQuery.
  • BigQuery doesn’t enforce Primary Key and Unique Key constraints. ETL (Extract, Transform, and Load) process has to take care of that.
  • Column types are somewhat different between MongoDB and BigQuery. Most of the types have either equivalent or convertible types. Refer to the following list to know more about the common data types:
MongoDB Data TypeBigQuery Data Type
DOUBLEFLOAT64
STRINGSTRING
BINARY DATABYTES
OBJECTIDSTRING
BOOLEANBOOL
DATEDATE
NULLNULL
32-BIT INTEGERINT64
TIMESTAMPTIMESTAMP
64-BIT INTEGERINT64
DECIMAL128NUMERIC
  • DATE value must be a dash (-) separated and in the form YYYY-MM-DD (Year-Month-Day). Fortunately, mongoexport date format is exactly like this, and there is no need to modify this to connect MongoDB to BigQuery.
  • Make sure text columns are quoted if they can potentially have Delimiter Characters.

Step 3: Upload to Google Cloud Storage (GCS)

Once data is extracted from MongoDB, the next step is to upload data to GCS. There are multiple ways to do this.

Gsutilis the standard Google Cloud Platform (GCP) tool to handle Objects and Buckets in GCS. It has options to upload a file from your local machine to GCS.

To copy a file to GCS:

To copy an entire folder:

If the files to be copied are already in S3, the following command can be used:

Storage Transfer Service from Google Cloud is an incredible alternative to uploading files to GCS from S3 or other online data sources, provided the destination is Cloud Storage Bucket. This service supports the GCS Bucket as a source.

This service is handy when it comes to data movement to GCS with the following options:

  • Both first-time and recurring loads can be scheduled.
  • After successfully transferring data to GCS, the source data can be deleted.
  • Periodic synchronization between source and target can further be selected using filters.

Uploading From The Web Console

To upload data from the local machine, the GCP Web Console have inbuilt options. Follow the steps given below to make use of that:

  • Login to the GCP, click on “Storage” and go to “Browser” in the left bar, as shown in the image below:
  • Select the “GCS Bucket” with proper access to where you want to upload the file which is to be loaded to BigQuery. In the below image, test-data-hevo is the Bucket Name. Click on the “Bucket Name”.
  • Now you are on the Bucket details page. Click on the “Upload Files” andthen select the file from your computer.
Upload Files Settings image
  • Now data upload is started, and a progress bar is shown, wait till it completes. After completion, the file will be listed in the Bucket, as shown in the image below.
After Uploading the File image

Step 4: Upload Data Extracted from MongoDB to BigQuery Table from GCS

You can use the Web Console UI or Command Line tool bq to stream data from MongoDB to BigQuery Table. First, take a look into how to do it using the Web Console, step by step:

  • Go to “BigQuery Console” from the left side panel, as shown in the image below.
BigQuery Dashboard image
  • Create a dataset if not present already, as shown in the image below.
Create Dataset image
  • Now, click on the created datasets on the left side. Then “Create Table” option will appear below Query Editor as shown in the image below.
Create Table image
  • The “Create Table” button will open the wizard with options to specify the input source and other specifications for the table as shown in the image below.
Create Table Settings image

As shown in the above image, there are options to specify schema or auto-detect. Many other options are also available.

As mentioned above, the Command Line tool bq is a handy tool to interact with BigQuery, which helps to upload data from MongoDB to BigQuery using GCS. The bq load command can be used for this and the source_format CSV.

The syntax of bq load command is as follows:

[LOCATION] is the data location.
[FORMAT_NAME] here it is CSV.
[DATASET_NAME] is an existing dataset in BigQuery.
[TABLE_NAME] is the name of the table into which data to be loaded.
[SOURCE_PATH] Cloud Storage URI of data files.
[INLINE_SCHEMA] is a valid schema to describe data. The schema can be a local JSON file or can be specified directly.
Note: –autodetect flag also can be used instead of supplying a schema definition.

There are a bunch of options specific to CSV data load :

CSV optionbq command flagDescription
Field Delimiter-F or –field_delimiterField delimiter: Comma, Tab, Pipe, Other
Header rows–skip_leading_rowsHeader rows to skip
Number of bad records allowed–max_bad_recordsNumber of errors allowed
Newline characters–allow_quoted_newlinesAllow quoted newlines
Custom null values–null_markerSpecifies a string that represents a null value in a CSV file
Trailing optional columns–allow_jagged_rowsAllow jagged rows
Unknown values–ignore_unknown_valuesIgnore unknown values
Quote–quoteThe value that is used to quote data sections in a CSV file
Encoding-E or –encodingThe character encoding of the data

Below example, commands to load data using the bq Command Line tool will help you familiarise yourself with different options:

Specify data schema using a schema JSON file:

Auto-detected schema from the file:

If data is written to an existing table, BigQuery provides 3 options:

  • Write if the target is empty.
  • Append to the target table.
  • Overwrite the target table. Also, adding new fields to the table while uploading data is possible.

Let us see each with an example.

To overwrite the target table:

To append to the target table:

If you would like to add a new field to the target table, a new schema JSON file with the extra field can be given as an option:

Step 5: Update Target Table in BigQuery

This is the final step in streaming data from MongoDB to BigQuery. To Upsert new data to the BigQuery table, initially load the data into a Staging Table as a full load. Let’s call it interm_delta_table. Below 2 approaches can be followed to load data to the Final Table:

  • Update the rows in the Final Table and insert new rows from the Delta Table which are missing from the Final Table using the code given below:
  • Delete rows from the Final Table, which are in the Delta Table. And then, insert all records from the Delta Table to the Final Table using the code given below:

Conclusion

The article introduced you to MongoDB and BigQuery. It also provides you with a comprehensive step-by-step guide that you can use to connect MongoDB to BigQuery and stream your data from MongoDB to BigQuery.

Most of the time, moving data from a NoSQL system like MongoDB to BigQuery is essential for proper data analysis. Converting schema to a normalized form and changing data format without any information loss is tricky.

To keep track of recurring incremental data loads, separate systems might be required to store ETL job-related metadata. There is a quicker option to deliver the same output and scale instantly.

A Cloud Data Integration Platform – Hevo can help you consolidate data from any data source like MongoDB to BigQuery in a few minutes.

Visit our Website to Explore Hevo

Want to take Hevo for a spin?

Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

You can also look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience streaming data from MongoDB to BigQuery in the comments below!

No-code Data Pipeline for BigQuery & MongoDB