MongoDB is a popular NoSQL database that stores data in flexible, JSON-like documents. If your application’s data model is a natural fit for MongoDB’s recommended document model, it can provide good performance, flexibility, and scalability for transactional workloads. However, because of a few restrictions you can face while analyzing data, it is highly recommended to stream data from MongoDB to BigQuery or another data warehouse.
MongoDB has no proper join support, getting data from other systems into MongoDB can be difficult, and it has no native support for SQL. Drafting complex analytics logic in MongoDB’s aggregation framework is also not as easy as in SQL.
This article provides the steps to migrate data from MongoDB to BigQuery. It also covers Hevo Data, which makes it easier to replicate data. So, without any further ado, let’s start learning about MongoDB to BigQuery ETL.
Method 1: Using Hevo Data to Set up MongoDB to BigQuery
Hevo Data, a No-code Data Pipeline, helps you directly transfer data from MongoDB to BigQuery and other Data Warehouses, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from MongoDB but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.
Get Started with Hevo for Free
The steps to load data from MongoDB to BigQuery using Hevo Data are as follows:
Step 1: Select the Source Type
To select MongoDB as the Source:
- Click PIPELINES in the Asset Palette.
- Click + CREATE in the Pipelines List View.
- On the Select Source Type page, select MongoDB.
Step 2: Select the MongoDB Variant
Select the MongoDB service provider that you use to manage your MongoDB databases:
- Generic Mongo Database: Database management is done at your end, or by a service provider other than MongoDB Atlas.
- MongoDB Atlas: The managed database service from MongoDB.
Step 3: Specify MongoDB Connection Settings
In the Configure your MongoDB Source page, specify the connection settings for your deployment, such as the pipeline name, database host, port, user, password, and database name.
Step 4: Configure BigQuery Connection Settings
Now select Google BigQuery as your Destination and start moving your data.
You can modify only some of the settings that you provide here once the Destination is created. Refer to the section Modifying BigQuery Destination Configuration below for more information.
- Click DESTINATIONS in the Asset Palette.
- Click + CREATE in the Destinations List View.
- On the Add Destination page, select Google BigQuery as the Destination type.
- In the Configure your Google BigQuery Account page, select the authentication method for connecting to BigQuery.
- In the Configure your Google BigQuery Warehouse page, specify the warehouse details, such as the destination name, project ID, dataset, and GCS bucket.
By following the above steps, you will have successfully completed MongoDB to BigQuery replication.
With continuous Real-Time data movement, Hevo allows you to combine MongoDB data along with your other data sources and seamlessly load it to BigQuery with a no-code, easy-to-setup interface. Try our 14-day full-feature access free trial!
SIGN UP HERE FOR A 14-DAY FREE TRIAL
Method 2: Manual Steps to Stream Data from MongoDB to BigQuery
Follow the steps below to connect MongoDB to BigQuery and stream your data. You can also read about migrating data from MongoDB to Snowflake here.
Let’s take a comprehensive look at all the required steps:
Step 1: Extract Data from MongoDB
This is the first step in streaming your data from MongoDB to BigQuery. To extract data from MongoDB, you can use mongoexport, a utility that creates JSON or CSV exports of the data stored in a MongoDB collection.
General guidelines for using mongoexport to stream data from MongoDB to BigQuery:
- mongoexport must be run directly from the system command line, not from the Mongo Shell.
- Ensure that the connecting user has, at a minimum, the read role on the target database.
- mongoexport defaults to the primary read preference when connected to a mongod instance or a replica set.
- The default read preference can be overridden using the --readPreference option.
Below is an example that exports data from cont_collect collection to a CSV file in the location /opt/exports/csv/cnts.csv
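Such a command might look like the following sketch; the database name (users) is an assumption taken from the later --fieldFile example, while the collection, fields, and output path come from the surrounding text:

```shell
# Export the contact_name and contact_address fields of the
# cont_collect collection to a CSV file (the database name
# "users" is assumed for illustration).
mongoexport --db users --collection cont_collect \
  --type=csv \
  --fields contact_name,contact_address \
  --out /opt/exports/csv/cnts.csv
```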
- When you export in CSV format, you must specify the fields of the documents to export. In the above example, the contact_name and contact_address fields are specified. The output would look like this:

      contact_name,contact_address
      Elon Wick,27 Manhattan
      John Musk,911 Martian Town
- Optionally, you can specify the fields in a file containing a line-separated list of fields to export, with one field per line. For example, you can list the contact_name and contact_address fields, one per line, in a file contact_fields.txt.
- Then, using the --fieldFile option, define the fields to export with the file: mongoexport --db users --collection cont_collect --type=csv --fieldFile contact_fields.txt --out /opt/backups/contacts.csv
- By default, field names are included as a header row. The --noHeaderLine option can be used to omit the field names in a CSV export.
- As shown in the above example, --fields can be used to specify fields. A specified field can also be a field inside a sub-document.
Incremental Data Extract from MongoDB
Instead of extracting the entire collection, you can pass a query to filter the data. This option can be used to extract data incrementally. The --query (or -q) option is used to pass the filter.
As an example, consider the contacts collection discussed above. Suppose the ‘last_updated_time’ field in each document stores the Unix timestamp of that document’s last update or insert.
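The extract command might look like the following sketch; the database name and output path are assumptions for illustration, while the collection, field, and timestamp value come from the text:

```shell
# Extract only the documents updated after the given Unix timestamp.
mongoexport --db users --collection contacts \
  --query '{"last_updated_time": {"$gt": 1548930176}}' \
  --out /opt/exports/contacts_delta.json
```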
The above command will extract all records from the collection having last_updated_time greater than the specified value, 1548930176. You should keep track of the last pulled last_updated_time separately.
Step 2: Optional Clean and Transform Data
This is the most complex step to migrate data from MongoDB to BigQuery. Apart from transformations to accommodate business logic, there are some basic things to keep in mind while preparing data from MongoDB to BigQuery.
- BigQuery expects CSV data to be UTF-8 encoded. If the data is encoded in ISO-8859-1 (Latin-1), explicitly specify the encoding while loading to BigQuery.
- BigQuery doesn’t enforce Primary Key and Unique Key constraints. The ETL (Extract, Transform, and Load) process has to take care of that.
- Column types are somewhat different between MongoDB and BigQuery. Most types have either an equivalent or a convertible type. The table below lists common mappings (a representative list, not exhaustive):

|MongoDB Data Type|BigQuery Data Type|
|---|---|
|String|STRING|
|ObjectId|STRING|
|Boolean|BOOL|
|32-bit / 64-bit Integer|INT64|
|Double|FLOAT64|
|Date|TIMESTAMP|
|Object (embedded document)|RECORD|
|Array|Repeated field|
- DATE values must be dash (-) separated and in the form YYYY-MM-DD (Year-Month-Day). Fortunately, mongoexport emits dates in an ISO 8601 format that BigQuery accepts, so there is usually no need to modify them when connecting MongoDB to BigQuery.
- Make sure text columns are quoted if they can potentially contain the delimiter character.
Step 3: Upload to Google Cloud Storage (GCS)
Once data is extracted from MongoDB, the next step is to upload data to GCS. There are multiple ways to do this.
gsutil is the standard Google Cloud Platform (GCP) tool for handling objects and buckets in GCS. It has options to upload a file from your local machine to GCS.
To copy a file to GCS:
To copy an entire folder:
If the files to be copied are already in S3, the following command can be used:
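The three copy operations above can be sketched as follows; the bucket names and local paths are placeholders for illustration:

```shell
# Copy a single file from the local machine to a GCS bucket.
gsutil cp /opt/exports/csv/cnts.csv gs://my-gcs-bucket/

# Copy an entire local folder recursively.
gsutil cp -r /opt/exports/csv gs://my-gcs-bucket/

# Copy objects that already live in an S3 bucket straight into GCS
# (gsutil reads the S3 credentials from its Boto configuration).
gsutil cp -r s3://my-s3-bucket/exports gs://my-gcs-bucket/
```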
Storage Transfer Service from Google Cloud is a great alternative for moving files into GCS from S3 or other online data sources, provided the destination is a Cloud Storage bucket. It also supports a GCS bucket as a source.
This service is handy when it comes to data movement to GCS with the following options:
- Both first-time and recurring loads can be scheduled.
- After successfully transferring data to GCS, the source data can be deleted.
- Periodic synchronization between source and target can further be selected using filters.
Uploading From The Web Console
To upload data from the local machine, the GCP Web Console has built-in options. Follow the steps given below to make use of them:
- Log in to GCP, click on “Storage”, and go to “Browser” in the left bar.
- Select the GCS bucket, with proper access, to which you want to upload the file to be loaded to BigQuery; for example, test-data-hevo. Click on the bucket name.
- You are now on the bucket details page. Click on “Upload Files” and select the file from your computer.
- The upload starts and a progress bar is shown; wait until it completes. After completion, the file will be listed in the bucket.
Step 4: Upload Data Extracted from MongoDB to BigQuery Table from GCS
You can use the Web Console UI or the command-line tool bq to load the data from MongoDB into a BigQuery table. First, take a look at how to do it using the Web Console, step by step:
- Go to the “BigQuery Console” from the left side panel.
- Create a dataset if one is not present already.
- Now, click on the created dataset on the left side. The “Create Table” option will appear below the Query Editor.
- The “Create Table” button opens a wizard with options to specify the input source and other specifications for the table.
The wizard provides options to specify a schema or auto-detect it, along with many other options.
As mentioned above, the command-line tool bq is a handy way to interact with BigQuery, and it can be used to load the data staged in GCS into BigQuery. The bq load command is used for this, with the source_format set to CSV.
The syntax of bq load command is as follows:
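Reconstructed from the placeholder descriptions that follow, the syntax is roughly:

```shell
bq --location=[LOCATION] load \
  --source_format=[FORMAT_NAME] \
  [DATASET_NAME].[TABLE_NAME] \
  [SOURCE_PATH] \
  [INLINE_SCHEMA]
```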
- [LOCATION] is the data location.
- [FORMAT_NAME] here is CSV.
- [DATASET_NAME] is an existing dataset in BigQuery.
- [TABLE_NAME] is the name of the table into which the data is to be loaded.
- [SOURCE_PATH] is the Cloud Storage URI of the data files.
- [INLINE_SCHEMA] is a valid schema describing the data. The schema can be a local JSON file or specified directly.
Note: The --autodetect flag can also be used instead of supplying a schema definition.
There are a bunch of options specific to CSV data loads:

|bq command flag|Description|
|---|---|
|-F or --field_delimiter|Field delimiter: comma, tab, pipe, or other|
|--skip_leading_rows|Header rows to skip|
|--max_bad_records|Number of bad records (errors) allowed|
|--allow_quoted_newlines|Allow quoted newlines|
|--null_marker|Specifies a string that represents a null value in a CSV file|
|--allow_jagged_rows|Allow jagged rows (trailing optional columns may be missing)|
|--ignore_unknown_values|Ignore unknown values|
|--quote|The value that is used to quote data sections in a CSV file|
|-E or --encoding|The character encoding of the data|
The example commands below for loading data using the bq command-line tool will help you familiarize yourself with the different options:
Specify data schema using a schema JSON file:
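A sketch of such a command; the dataset, table, bucket, and schema file names are hypothetical:

```shell
# Load the CSV from GCS into an existing dataset, describing the
# columns in a local JSON schema file (contacts_schema.json).
bq --location=US load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.contacts \
  gs://my-gcs-bucket/cnts.csv \
  ./contacts_schema.json
```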
Auto-detected schema from the file:
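A sketch with the schema auto-detected from the file itself; the dataset, table, and bucket names are hypothetical:

```shell
# Let BigQuery infer the column names and types from the CSV itself.
bq --location=US load \
  --source_format=CSV \
  --autodetect \
  mydataset.contacts \
  gs://my-gcs-bucket/cnts.csv
```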
If data is written to an existing table, BigQuery provides 3 options:
- Write if the target is empty.
- Append to the target table.
- Overwrite the target table.
Also, adding new fields to the table while uploading data is possible.
Step 5: Update Target Table in BigQuery
This is the final step in streaming data from MongoDB to BigQuery. To upsert new data into the BigQuery table, first load the data into a staging table as a full load. Let’s call it interm_delta_table. Either of the following two approaches can be used to load the data into the final table:
- Update the rows in the Final Table and insert new rows from the Delta Table which are missing from the Final Table using the code given below:
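This update-then-insert step might look like the following sketch, assuming a hypothetical dataset mydataset, a final table final_table, an id key column, and the contact fields used earlier:

```shell
# 1) Update rows in the final table that also exist in the delta table.
bq query --use_legacy_sql=false '
UPDATE mydataset.final_table f
SET f.contact_name = d.contact_name,
    f.contact_address = d.contact_address,
    f.last_updated_time = d.last_updated_time
FROM mydataset.interm_delta_table d
WHERE f.id = d.id'

# 2) Insert rows from the delta table that are missing from the final table.
bq query --use_legacy_sql=false '
INSERT INTO mydataset.final_table (id, contact_name, contact_address, last_updated_time)
SELECT id, contact_name, contact_address, last_updated_time
FROM mydataset.interm_delta_table d
WHERE NOT EXISTS (
  SELECT 1 FROM mydataset.final_table f WHERE f.id = d.id)'
```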
- Delete rows from the Final Table, which are in the Delta Table. And then, insert all records from the Delta Table to the Final Table using the code given below:
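A sketch of the delete-and-insert variant, under the same assumed names (mydataset, final_table, an id key column):

```shell
# 1) Delete final-table rows that have a newer version in the delta table.
bq query --use_legacy_sql=false '
DELETE FROM mydataset.final_table
WHERE id IN (SELECT id FROM mydataset.interm_delta_table)'

# 2) Insert every record from the delta table.
bq query --use_legacy_sql=false '
INSERT INTO mydataset.final_table
SELECT * FROM mydataset.interm_delta_table'
```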
You can follow either of the above-mentioned approaches to seamlessly establish a MongoDB to BigQuery data flow.
Learn about MongoDB data replication in this article.
Limitations of Manually Moving Data from MongoDB to BigQuery
The following are some possible drawbacks when data is streamed from MongoDB to BigQuery manually:
- Time-Consuming: Compared to automated methods, manually exporting MongoDB data, transferring it to Cloud Storage, and then importing it into BigQuery is inefficient. Every time fresh data enters MongoDB, this laborious procedure must be repeated.
- Potential for human error: There is a chance that data will be wrongly exported, uploaded to the wrong place, badly converted, or loaded to the wrong table or partition if error-prone manual procedures are followed at every stage.
- Data lags behind MongoDB: The data in BigQuery might not be current with the most recent inserts and changes in the MongoDB database due to the manual process’s latency. Recent modifications may be overlooked in important analyses.
- Difficult to incrementally add new data: Compared to automated streaming, which manages this efficiently, manually adding just the new or modified MongoDB entries is difficult.
- Hard to reprocess historical data: It would be necessary to manually export historical data from MongoDB and reload it into BigQuery if any problems were discovered in the datasets that were previously imported.
- No error handling: Without automated procedures to detect, manage, and retry failures and bad data, problems like network outages, data inaccuracies, or constraint violations may go unhandled.
- Scaling limitations: The manual exporting, uploading, and loading processes don’t scale well and become more and more difficult as data sizes increase.
The constraints drive the requirement for automated MongoDB to BigQuery replication in order to create more dependable, scalable, and resilient data pipelines.
MongoDB to BigQuery: Use Cases
Streaming data from MongoDB to BigQuery may be very helpful in the following frequent use cases:
- Business analytics: Analysts may use BigQuery’s quick SQL queries, sophisticated analytics features, and smooth interaction with data visualization tools like Data Studio by streaming MongoDB data into BigQuery. This can lead to greater business insights.
- Data warehousing: By streaming data from MongoDB and merging it with data from other sources, businesses may create a cloud data warehouse on top of BigQuery, enabling corporate reporting and dashboards.
- Log analysis: BigQuery’s columnar storage and massively parallel processing capabilities enable the streaming of server, application, and clickstream logs from MongoDB databases for large-scale analytics.
- Data integration: By streaming to BigQuery as a centralized analytics hub, businesses using MongoDB for transactional applications may integrate and analyze data from their relational databases, customer relationship management (CRM) systems, and third-party sources.
- Machine Learning: Streaming data from production MongoDB databases may be utilized to train ML models using BigQuery ML’s comprehensive machine learning features.
- Cloud migration: By gradually streaming data, move analytics from on-premises MongoDB to Google Cloud’s analytics and storage services.
The article provides you with a comprehensive step-by-step guide that you can use to connect MongoDB to BigQuery and stream your data from MongoDB to BigQuery.
Most of the time, moving data from a NoSQL system like MongoDB to BigQuery is essential for proper data analysis. Converting schema to a normalized form and changing data format without any information loss is tricky.
To keep track of recurring incremental data loads, separate systems might be required to store ETL job-related metadata. There is a quicker option to deliver the same output and scale instantly.
A Cloud Data Integration Platform, Hevo can help you consolidate data from 150+ data sources like MongoDB to BigQuery in a few minutes.
Visit our Website to Explore Hevo
Want to take Hevo for a spin?
Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
You can also look at the unbeatable Hevo Pricing that will help you choose the right plan for your business needs.
Share your experience streaming data from MongoDB to BigQuery in the comments below!