MongoDB is a popular NoSQL database that stores data as flexible, JSON-like documents. If your application’s data model is a natural fit for MongoDB’s recommended document model, it can provide good performance, flexibility, and scalability for transactional workloads. However, because of the restrictions you will face when analyzing data directly in MongoDB, it is often recommended to stream data from MongoDB to BigQuery or another data warehouse.

MongoDB has only limited support for joins, getting data from other systems into MongoDB is difficult, and it has no native support for SQL. MongoDB’s aggregation framework also makes it harder to express complex analytics logic than SQL does.

This article provides the steps to migrate data from MongoDB to BigQuery. It also covers Hevo Data, which makes it easier to replicate data. So, without any further ado, let’s start learning about MongoDB to BigQuery ETL.

Method 1: Using Hevo Data to Set up MongoDB to BigQuery


Hevo Data, a No-code Data Pipeline, helps you directly transfer data from MongoDB to BigQuery and other Data Warehouses, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from MongoDB but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.


The steps to load data from MongoDB to BigQuery using Hevo Data are as follows:

Step 1: Select the Source Type

To select MongoDB as the Source:

  1. Click PIPELINES in the Asset Palette.
  2. Click + CREATE in the Pipelines List View.
  3. In the Select Source Type page, select the MongoDB variant.

Step 2: Select the MongoDB Variant

Select the MongoDB service provider that you use to manage your MongoDB databases:

  • Generic Mongo Database: Database management is done at your end, or by a service provider other than MongoDB Atlas.
  • MongoDB Atlas: The managed database service from MongoDB.

Step 3: Specify MongoDB Connection Settings

In the Configure your MongoDB Source page, specify the connection settings for your deployment (the exact fields depend on whether you use a Generic MongoDB database or MongoDB Atlas):

[Image: MongoDB to BigQuery: Configure Source]

Step 4: Configure BigQuery Connection Settings

Now select Google BigQuery as your Destination and start moving your data.

Note that you can modify only some of the settings that you provide here once the Destination is created.

  1. Click DESTINATIONS in the Asset Palette.
  2. Click + CREATE in the Destinations List View.
  3. In the Add Destination page, select Google BigQuery as the Destination type.
  4. In the Configure your Google BigQuery Account page, select the authentication method for connecting to BigQuery.
  5. In the Configure your Google BigQuery Warehouse page, specify the following details:
[Image: MongoDB to BigQuery: BigQuery Settings]

By following the above-mentioned steps, you will have successfully completed your MongoDB to BigQuery replication.

With continuous Real-Time data movement, Hevo allows you to combine MongoDB data along with your other data sources and seamlessly load it to BigQuery with a no-code, easy-to-setup interface. Try our 14-day full-feature access free trial!


Method 2: Manual Steps to Stream Data from MongoDB to BigQuery

Follow the steps below to connect MongoDB to BigQuery and stream your data. You can also read about migrating data from MongoDB to Snowflake here.

Let’s take a comprehensive look at all the required steps:

Step 1: Extract Data from MongoDB

This is the first step in streaming your data from MongoDB to BigQuery. To extract data from MongoDB, you can use mongoexport. The utility can be used to create a JSON or CSV export of the data stored in a MongoDB collection.

A general guideline for using mongoexport to stream data from MongoDB to BigQuery:

  • mongoexport must be run directly from the system command line, not from the Mongo Shell.
  • Ensure that the connecting user has, at a minimum, the read role on the target database.
  • mongoexport defaults to primary read preference when connected to a mongod instance or a replica set.
  • The default read preference can be overridden using the --readPreference option.

Below is an example that exports data from the cont_collect collection to a CSV file at the location /opt/exports/csv/cnts.csv.
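A minimal sketch of such a command is shown below; the host and port assume a local MongoDB instance, and the users database name follows the --fieldFile example later in this section. Add authentication flags such as --username and --authenticationDatabase if your deployment requires them.

# Export two fields of the cont_collect collection to CSV
mongoexport --host localhost --port 27017 --db users --collection cont_collect --type=csv --fields contact_name,contact_address --out /opt/exports/csv/cnts.csv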

  • When you export in CSV format, you must specify the fields in the documents to export. In the above example, the contact_name and contact_address fields are specified for export. The output would look like this:
    contact_name,contact_address
    Elon Wick,27 Manhattan
    John Musk,911 Martian Town
  • Optionally, you can specify the fields in a file containing a line-separated list of fields to export, with one field per line. For example, you can list the contact_name and contact_address fields in a file contact_fields.txt:
    contact_name
    contact_address
  • Then, use the --fieldFile option to specify the fields to export from that file:
    mongoexport --db users --collection cont_collect --type=csv --fieldFile contact_fields.txt --out /opt/backups/contacts.csv
  • By default, field names are included as a header row. The --noHeaderLine option can be used to omit the field names from a CSV export.
  • As shown in the above example, --fields can be used to specify the fields to export. A specified field can also be a field inside a sub-document, referenced with dot notation.

Incremental Data Extract from MongoDB

Instead of extracting the entire collection, you can pass a query to filter the data. This option can be used to extract data incrementally. The --query or -q flag is used to pass the query.

As an example, let’s consider the contacts collection discussed above. Suppose the ‘last_updated_time’ field in each document stores the Unix timestamp of the last update or insert for that document.
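The command below is one way to write such an incremental export; the field list and the delta output path cnts_delta.csv are illustrative choices, not part of the original example.

# Export only documents updated after the given Unix timestamp
mongoexport --db users --collection cont_collect --type=csv --fields contact_name,contact_address,last_updated_time --query '{"last_updated_time": {"$gt": 1548930176}}' --out /opt/exports/csv/cnts_delta.csv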

The above command will extract all records from the collection having last_updated_time greater than the specified value, 1548930176. You should keep track of the last pulled last_updated_time separately.

Step 2: Optional Clean and Transform Data

This is the most complex step to migrate data from MongoDB to BigQuery. Apart from transformations to accommodate business logic, there are some basic things to keep in mind while preparing data from MongoDB to BigQuery.

  • BigQuery expects UTF-8 encoded CSV data. If the data is encoded in ISO-8859-1 (Latin-1), explicitly specify the encoding while loading to BigQuery.
  • BigQuery doesn’t enforce Primary Key and Unique Key constraints. The ETL (Extract, Transform, and Load) process has to take care of that.
  • Column types are somewhat different between MongoDB and BigQuery. Most of the types have either equivalent or convertible types. Refer to the following list to know more about the common data types:
MongoDB Data Type    BigQuery Data Type
DOUBLE               FLOAT64
STRING               STRING
BINARY DATA          BYTES
OBJECTID             STRING
BOOLEAN              BOOL
DATE                 DATE
NULL                 NULL
32-BIT INTEGER       INT64
TIMESTAMP            TIMESTAMP
64-BIT INTEGER       INT64
DECIMAL128           NUMERIC
  • DATE values must be dash (-) separated and in the form YYYY-MM-DD (Year-Month-Day). mongoexport writes dates in ISO 8601 format (for example, 2019-01-25T10:23:46.000Z), which BigQuery generally accepts for TIMESTAMP columns; for a DATE column, trim the value to the YYYY-MM-DD portion before loading.
  • Make sure text columns are quoted if they can potentially have Delimiter Characters.

Step 3: Upload to Google Cloud Storage (GCS)

Once data is extracted from MongoDB, the next step is to upload data to GCS. There are multiple ways to do this.

gsutil is the standard Google Cloud Platform (GCP) command-line tool to handle Objects and Buckets in GCS. It has options to upload a file from your local machine to GCS.

To copy a file to GCS:
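For example, assuming the CSV produced in Step 1 and the test-data-hevo bucket used later in this article:

# Copy a single local file into the GCS bucket
gsutil cp /opt/exports/csv/cnts.csv gs://test-data-hevo/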

To copy an entire folder:
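Here, /opt/exports/csv is assumed to be the local export directory from Step 1; the -r flag copies it recursively:

# Copy a local directory and its contents into the GCS bucket
gsutil cp -r /opt/exports/csv gs://test-data-hevo/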

If the files to be copied are already in S3, the following command can be used:
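The S3 bucket and path below are placeholders, and the command assumes AWS credentials have been configured for gsutil (for example, in the .boto configuration file):

# Copy objects from an S3 bucket into the GCS bucket
gsutil cp -r s3://your-s3-bucket/exports gs://test-data-hevo/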

Storage Transfer Service from Google Cloud is a good alternative for uploading files to GCS from S3 or other online data sources, provided the destination is a Cloud Storage Bucket. The service also supports a GCS Bucket as a source.

This service is handy when it comes to data movement to GCS with the following options:

  • Both first-time and recurring loads can be scheduled.
  • After successfully transferring data to GCS, the source data can be deleted.
  • Periodic synchronization between source and target can also be set up, using filters to select which files to transfer.

Uploading From The Web Console

To upload data from the local machine, the GCP Web Console has built-in options. Follow the steps given below to make use of them:

  • Log in to the GCP Console, click on “Storage”, and go to “Browser” in the left bar, as shown in the image below:
  • Select the GCS Bucket (with proper access) to which you want to upload the file that is to be loaded to BigQuery. In the below image, test-data-hevo is the Bucket Name. Click on the Bucket Name.
  • Now you are on the Bucket details page. Click on “Upload Files” and then select the file from your computer.
[Image: MongoDB to BigQuery: Upload Files Settings]
  • Now the data upload starts and a progress bar is shown; wait till it completes. After completion, the file will be listed in the Bucket, as shown in the image below.
[Image: MongoDB to BigQuery: After Uploading the File]

Step 4: Upload Data Extracted from MongoDB to BigQuery Table from GCS

You can use the Web Console UI or the Command Line tool bq to load the data extracted from MongoDB into a BigQuery table. First, take a look at how to do it using the Web Console, step by step:

  • Go to “BigQuery Console” from the left side panel, as shown in the image below.
[Image: MongoDB to BigQuery: BigQuery Dashboard]
  • Create a dataset if not present already, as shown in the image below.
[Image: MongoDB to BigQuery: Create Dataset]
  • Now, click on the created dataset on the left side. The “Create Table” option will then appear below the Query Editor, as shown in the image below.
[Image: MongoDB to BigQuery: Create Table]
  • The “Create Table” button will open a wizard with options to specify the input source and other specifications for the table, as shown in the image below.
[Image: MongoDB to BigQuery: Create Table Settings]

As shown in the above image, there are options to specify schema or auto-detect. Many other options are also available.

As mentioned above, the Command Line tool bq is a handy way to interact with BigQuery, and it can be used to upload data from MongoDB to BigQuery via GCS. The bq load command is used for this, with source_format set to CSV.

The syntax of the bq load command is as follows:
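Roughly, the command takes the form shown below; the bracketed placeholders are described next.

bq --location=[LOCATION] load --source_format=[FORMAT_NAME] [DATASET_NAME].[TABLE_NAME] [SOURCE_PATH] [INLINE_SCHEMA]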

[LOCATION] is the data location.
[FORMAT_NAME] is the source format; here it is CSV.
[DATASET_NAME] is an existing dataset in BigQuery.
[TABLE_NAME] is the name of the table into which the data is to be loaded.
[SOURCE_PATH] is the Cloud Storage URI of the data files.
[INLINE_SCHEMA] is a valid schema to describe the data. The schema can be a local JSON file or can be specified directly.
Note: The --autodetect flag can also be used instead of supplying a schema definition.

There are a number of options specific to CSV data loads:

CSV option                       bq command flag            Description
Field Delimiter                  -F or --field_delimiter    Field delimiter: Comma, Tab, Pipe, Other
Header rows                      --skip_leading_rows        Header rows to skip
Number of bad records allowed    --max_bad_records          Number of errors allowed
Newline characters               --allow_quoted_newlines    Allow quoted newlines
Custom null values               --null_marker              Specifies a string that represents a null value in a CSV file
Trailing optional columns        --allow_jagged_rows        Allow jagged rows
Unknown values                   --ignore_unknown_values    Ignore unknown values
Quote                            --quote                    The value that is used to quote data sections in a CSV file
Encoding                         -E or --encoding           The character encoding of the data

The example commands below, which load data using the bq Command Line tool, will help you familiarise yourself with the different options:

Specify data schema using a schema JSON file:
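The command below is an illustrative sketch: the dataset mongo_data, the table contacts, and the schema file contact_schema.json are hypothetical names, and the GCS path assumes the file uploaded in Step 3. --skip_leading_rows=1 skips the header row that mongoexport writes by default.

# Load the CSV from GCS using an explicit schema file
bq --location=US load --source_format=CSV --skip_leading_rows=1 mongo_data.contacts gs://test-data-hevo/cnts.csv ./contact_schema.json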

Auto-detected schema from the file:
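The same load with schema auto-detection, using the same hypothetical dataset and table names:

# Let BigQuery infer the schema from the CSV itself
bq --location=US load --autodetect --source_format=CSV mongo_data.contacts gs://test-data-hevo/cnts.csv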

If data is written to an existing table, BigQuery provides 3 options:

  • Write if the target is empty.
  • Append to the target table.
  • Overwrite the target table. It is also possible to add new fields to the table while uploading data.

Step 5: Update Target Table in BigQuery

This is the final step in streaming data from MongoDB to BigQuery. To upsert new data into the BigQuery table, first load the data into a Staging Table as a full load. Let’s call it interm_delta_table. Either of the two approaches below can be followed to load data into the Final Table:

  • Update the rows in the Final Table and insert new rows from the Delta Table which are missing from the Final Table, using the first set of queries given below.
  • Delete the rows from the Final Table which are also present in the Delta Table, and then insert all records from the Delta Table into the Final Table, using the second set of queries given below.
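The queries below are a sketch in BigQuery Standard SQL, assuming a hypothetical dataset mongo_data, a final table final_contacts, the staging table interm_delta_table, a key column id, and value columns contact_name and contact_address; adjust the names and column lists to your own schema.

-- Approach 1: update matching rows, then insert rows missing from the final table
UPDATE mongo_data.final_contacts f
SET contact_name = d.contact_name,
    contact_address = d.contact_address
FROM mongo_data.interm_delta_table d
WHERE f.id = d.id;

INSERT INTO mongo_data.final_contacts (id, contact_name, contact_address)
SELECT d.id, d.contact_name, d.contact_address
FROM mongo_data.interm_delta_table d
WHERE NOT EXISTS (
  SELECT 1 FROM mongo_data.final_contacts f WHERE f.id = d.id
);

-- Approach 2: delete rows that exist in the delta table, then insert everything from it
DELETE FROM mongo_data.final_contacts f
WHERE EXISTS (
  SELECT 1 FROM mongo_data.interm_delta_table d WHERE d.id = f.id
);

INSERT INTO mongo_data.final_contacts (id, contact_name, contact_address)
SELECT id, contact_name, contact_address
FROM mongo_data.interm_delta_table;

BigQuery’s MERGE statement can also combine the update and insert into a single atomic statement if you prefer.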

You can follow either of the above-mentioned methods to seamlessly establish a MongoDB to BigQuery data flow.

Learn about MongoDB data replication in this article.

Limitations of Manually Moving Data from MongoDB to BigQuery

The following are some possible drawbacks when data is streamed from MongoDB to BigQuery manually:

  • Time-Consuming: Compared to automated methods, manually exporting MongoDB data, transferring it to Cloud Storage, and then importing it into BigQuery is inefficient. Every time fresh data enters MongoDB, this laborious procedure must be repeated.
  • Potential for human error: There is a chance that data will be wrongly exported, uploaded to the wrong place, badly converted, or loaded to the wrong table or partition if error-prone manual procedures are followed at every stage.
  • Data lags behind MongoDB: The data in BigQuery might not be current with the most recent inserts and changes in the MongoDB database due to the manual process’s latency. Recent modifications may be overlooked in important analyses.
  • Difficult to incrementally add new data: Compared to automated streaming, which handles this efficiently, manually adding only the new or modified MongoDB records is difficult.
  • Hard to reprocess historical data: It would be necessary to manually export historical data from MongoDB and reload it into BigQuery if any problems were discovered in the datasets that were previously imported.
  • No error handling: Without automated procedures to detect, manage, and retry on errors and bad records, problems like network outages, data inaccuracies, or constraint violations can break the pipeline.
  • Scaling limitations: The manual export, upload, and load steps don’t scale well and become more and more difficult as data sizes increase.

These constraints drive the need for automated MongoDB to BigQuery replication in order to build more dependable, scalable, and resilient data pipelines.

MongoDB to BigQuery: Use Cases

Streaming data from MongoDB to BigQuery may be very helpful in the following frequent use cases:

  • Business analytics: Analysts may use BigQuery’s quick SQL queries, sophisticated analytics features, and smooth interaction with data visualization tools like Data Studio by streaming MongoDB data into BigQuery. This can lead to greater business insights.
  • Data warehousing: By streaming data from MongoDB and merging it with data from other sources, businesses may create a cloud data warehouse on top of BigQuery, enabling corporate reporting and dashboards.
  • Log analysis: BigQuery’s columnar storage and massively parallel processing capabilities enable the streaming of server, application, and clickstream logs from MongoDB databases for large-scale analytics.
  • Data integration: By streaming to BigQuery as a centralised analytics data centre, businesses using MongoDB for transactional applications may integrate and analyse data from their relational databases, customer relationship management (CRM) systems, and third-party sources.
  • Machine Learning: Streaming data from production MongoDB databases may be utilized to train ML models using BigQuery ML’s comprehensive machine learning features.
  • Cloud migration: By gradually streaming data, move analytics from on-premises MongoDB to Google Cloud’s analytics and storage services.

Conclusion

The article provides you with a comprehensive step-by-step guide that you can use to connect MongoDB to BigQuery and stream your data from MongoDB to BigQuery.

Most of the time, moving data from a NoSQL system like MongoDB to BigQuery is essential for proper data analysis. Converting schema to a normalized form and changing data format without any information loss is tricky.

To keep track of recurring incremental data loads, separate systems might be required to store ETL job-related metadata. There is a quicker option to deliver the same output and scale instantly.

A Cloud Data Integration Platform, Hevo can help you consolidate data from 150+ data sources like MongoDB to BigQuery in a few minutes.

Visit our Website to Explore Hevo

Want to take Hevo for a spin?

Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

You can also look at the unbeatable Hevo Pricing that will help you choose the right plan for your business needs.

Share your experience streaming data from MongoDB to BigQuery in the comments below!

Freelance Technical Content Writer, Hevo Data

Faisal loves data science and combines his problem-solving ability and passion for writing to help data teams in solving complex business problems.
