Do you want to stream your data to Google BigQuery? Are you finding it challenging to load your data into your Google BigQuery tables? If yes, then you’ve landed at the right place!
- Follow our easy step-by-step guide to help you master the skill of seamlessly Streaming Data to BigQuery from a source of your choice in real time!
- It will help you take charge in a hassle-free way without compromising efficiency. This article aims at making the data streaming process as smooth as possible.
With Hevo, you can seamlessly integrate your data from 150+ sources into Google BigQuery without writing a single line of code. Hevo offers automated pipelines that simplify data migration and real-time analytics and enrich the quality of your data as well.
Methods to DataStream to BigQuery
Method 1: DataStream to BigQuery using Hevo’s No-code Data Pipelines
Hevo Data, a No-code Data Pipeline, helps you stream data from 150+ sources to Google BigQuery & makes it analysis ready for you to visualize it in a BI tool. Here, I will show you how you can stream your data from any source to Google BigQuery in real time in just two easy steps:
Step 1: Configure the Source
To create a new pipeline, you can choose any source that contains the data you want to migrate.
Step 1.1: Select any Source from the available sources. For example, I am selecting Postgres here. You can find the full list of sources available.
Step 1.2: Provide access to your Source – You can enter the required details, such as Host name, port number, username, and password for Hevo to access the database.
Step 1.3: Click on Test and Continue.
Step 2: Configure BigQuery as your Destination
You can select BigQuery as your destination by providing credentials to your Google account.
Step 2.1: Select BigQuery as your Destination.
Step 2.2: Provide credentials to your BigQuery account.
Step 2.3: Save and Continue to run the pipeline.
Method 2: DataStream to BigQuery using Python
Required Permissions
To stream data into BigQuery, you will need the following IAM permissions:
- bigquery.tables.updateData (lets you insert data into the table)
- bigquery.tables.get (lets you obtain table metadata)
- bigquery.datasets.get (lets you obtain dataset metadata)
- bigquery.tables.create (required if you use a template table to create the table automatically)
Prerequisites
Here are certain things that you should have before streaming the data:
- Make sure you have write access to the dataset that contains your destination table.
- You need a Google Cloud project with billing enabled.
- Grant Identity and Access Management (IAM) roles that give users the necessary permissions.
Steps to Stream Data to Google BigQuery
Step 1: Installing the Python Dependency for Google BigQuery
- To start Streaming Data to BigQuery using Python, you first need to install the Python dependency for Google BigQuery on your system. To do this, you can make use of the pip install command as follows:
pip install --upgrade google-cloud-bigquery
- This is how you can install the Google BigQuery dependency for Python on your system.
Step 2: Creating the Service Account Key
- Once you’ve installed the Python dependency, you need to create a service account key for your Google Cloud project, which gives the Python client access to your Google BigQuery data.
- To do this, go to the Google Cloud Console and log in with the Google account associated with your Google BigQuery project.
- Once you’ve logged in, open the Service Accounts page under IAM & Admin, select or create a service account, and choose to create a new key for it.
- Here, you will need to download the key in JSON format and save it on your system. To do this, select the JSON option found in the key-type section and then click on create.
- The service key file will now start downloading on your system. Save the file in a secure location and copy its path.
- Once you’ve successfully downloaded the file, you need to configure the environment variable, thereby allowing Python’s BigQuery client to access your data. To do this, add the path of your JSON file to the environment variable as follows:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/keys/google-service-key.json"
This is how you can create and configure the service account key for your Google BigQuery instance.
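Before running the streaming script, it can help to confirm that the environment variable really points at a usable key file. The following is a minimal sketch using only the Python standard library. The function name `check_service_account_key` is hypothetical, and the fields it checks (`type`, `project_id`, `client_email`, `private_key`) are the ones normally present in a downloaded service account key; it does not contact Google or prove that the key has the required permissions.

```python
import json
import os


def check_service_account_key(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return a list of problems found with the key file (empty if none)."""
    path = os.environ.get(env_var)
    if not path:
        return [f"{env_var} is not set"]
    if not os.path.isfile(path):
        return [f"{env_var} points to a missing file: {path}"]
    try:
        with open(path, encoding="utf-8") as handle:
            key = json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"could not read {path} as JSON: {exc}"]
    if not isinstance(key, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # A downloaded service account key normally declares this type.
    if key.get("type") != "service_account":
        problems.append('"type" is not "service_account"')
    for field in ("project_id", "client_email", "private_key"):
        if not key.get(field):
            problems.append(f'missing field "{field}"')
    return problems


if __name__ == "__main__":
    issues = check_service_account_key()
    print("Key file looks usable" if not issues else "\n".join(issues))
```

An empty list only means the file is present and well formed; authentication or permission errors can still occur when the client makes its first request.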
Step 3: Coding the Python Script to Stream Data to Google BigQuery
- With your service key now available, you can start building the Python script for streaming data to BigQuery tables. To do this, import the necessary libraries as follows:
from google.cloud import bigquery
- Once you’ve imported the libraries, you need to initialise the client for your Google BigQuery instance to set up the connection. You can use the following line of code to do this:
client = bigquery.Client()
- With your connection now up and running, you now need to configure the Google BigQuery table in which you want to insert the data. To do this, you can use the following lines of code:
table_name = "<your fully qualified table name >"  # format: project.dataset.table
insert_rows = [
    {"firstname": "Arsha", "lastname": "richard", "age": 32},
    {"firstname": "Shneller", "lastname": "james", "age": 39},
]
- You can now start loading your data into your Google BigQuery table using the Python script as follows:
# insert_rows_json returns a list of per-row errors; an empty list means success
errors = client.insert_rows_json(table_name, insert_rows)
if errors == []:
    print("Added data")
else:
    print("Something went wrong: {}".format(errors))
This is how you can use Python’s Google BigQuery dependency to start Streaming Data to BigQuery.
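The insert step above can be wrapped in a small helper that turns the returned error list into something easier to log. This is a sketch, not part of the original script: `stream_rows` is a hypothetical name, and it assumes `client` is a `bigquery.Client` (only its `insert_rows_json` method is used) and that each error entry carries an `index` key and an `errors` list, which is how the client library reports row-level failures.

```python
def stream_rows(client, table_name, rows):
    """Stream rows and return a list of (row_index, messages) for failed rows.

    `client` is expected to behave like google.cloud.bigquery.Client;
    only its insert_rows_json method is called.
    """
    errors = client.insert_rows_json(table_name, rows)
    failures = []
    for entry in errors:
        # Each entry describes one rejected row and the reasons for it.
        messages = [
            err.get("message", str(err)) for err in entry.get("errors", [])
        ]
        failures.append((entry.get("index"), messages))
    return failures
```

With a real client, the call would look like `stream_rows(bigquery.Client(), table_name, insert_rows)`; an empty return value means every row was accepted.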
Troubleshooting stream inserts
Here are some errors that you may encounter while streaming data into BigQuery:
#1 Error:
google.auth.exceptions.AuthenticationError: Unauthorized
Solution: Ensure that your service account credentials or OAuth tokens are correctly configured and have the necessary permissions (roles/bigquery.dataEditor, roles/bigquery.admin) to write data to the specified dataset and table in BigQuery.
#2 Error:
google.api_core.exceptions.NotFound: 404 Not found.
Solution: Verify that the dataset and table IDs specified in your Python code match the actual IDs in your BigQuery project.
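One common cause of the 404 error is a table name that is not fully qualified or still contains a placeholder. The sketch below checks the shape of the name before any request is made. `split_table_id` is a hypothetical helper, and it assumes the plain `project.dataset.table` form; it does not handle legacy domain-scoped project IDs and cannot confirm that the table actually exists.

```python
def split_table_id(table_id):
    """Split 'project.dataset.table' into its three parts or raise ValueError."""
    if "<" in table_id or ">" in table_id:
        raise ValueError(f"placeholder was not replaced in {table_id!r}")
    parts = table_id.strip().split(".")
    if len(parts) != 3 or not all(part.strip() for part in parts):
        raise ValueError(f"expected 'project.dataset.table', got {table_id!r}")
    return tuple(parts)
```

For example, `split_table_id("my-project.my_dataset.people")` returns the three parts as a tuple, while a name such as `"my_dataset.people"` raises `ValueError` before the client is ever called.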
Limitations of Streaming Data to Google BigQuery Manually
- Streaming data to BigQuery requires you to write multiple custom integration-based code snippets to establish a connection, thereby making it challenging, especially when the data source is not a Google service.
- This method of streaming data to BigQuery offers only best-effort de-duplication. To remove duplicate records reliably, you need to write custom code that identifies the duplicate rows and removes them from the table after the insertion process.
- Google BigQuery follows a complex quota-based policy that considers numerous factors, such as whether you’re using the de-duplication feature. You therefore need to write your code carefully with the quota policy in mind.
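The duplicate and quota concerns above can be partly addressed in the script itself. The sketch below sends rows in fixed-size batches and passes a content-derived ID for each row through the `row_ids` parameter of `insert_rows_json`, which BigQuery uses for best-effort de-duplication. The helper names (`row_id`, `chunked`, `stream_in_batches`) are hypothetical, the default batch size of 500 is an assumption that should be checked against the current quota documentation, and hashing the row content means two genuinely identical rows would be treated as duplicates.

```python
import hashlib
import json


def row_id(row):
    """Derive a stable ID from a row's content (same content, same ID)."""
    payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def stream_in_batches(client, table_name, rows, batch_size=500):
    """Send rows in batches, passing content-derived IDs as insert IDs.

    Returns the error entries reported by the client, with each "index"
    shifted so it refers to the position in the original `rows` list.
    """
    all_errors = []
    offset = 0
    for batch in chunked(rows, batch_size):
        ids = [row_id(row) for row in batch]
        errors = client.insert_rows_json(table_name, batch, row_ids=ids)
        for entry in errors:
            all_errors.append({**entry, "index": entry["index"] + offset})
        offset += len(batch)
    return all_errors
```

Because the IDs depend only on row content, retrying a failed batch reuses the same IDs, which gives BigQuery a chance to discard rows it has already received. This remains best effort, so a periodic de-duplication query may still be needed.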
Conclusion
In this blog, we have discussed two methods by which you can stream your data to BigQuery. While using Python scripts provides flexibility and control over data pipelines, it can pose challenges such as managing scalability, handling schema changes dynamically, and ensuring robust error handling and monitoring. Hevo, on the other hand, simplifies these complexities with its managed service approach. It automates schema detection and evolution, handles scalability seamlessly with auto-scaling capabilities, and provides built-in error handling and monitoring.
Discover how connecting BigQuery to Python can enhance your data operations. Our guide walks you through the process with clear steps for seamless integration.
FAQs to stream data to BigQuery
1. Can you stream data to BigQuery?
Yes, you can stream data into BigQuery using its streaming inserts feature.
2. Does BigQuery support streaming inserts?
Yes, BigQuery supports streaming inserts. It allows you to stream individual rows of data into BigQuery tables in real-time using the BigQuery Streaming API.
3. What is streaming data in big data?
Streaming data in the context of big data refers to data that is generated continuously and needs to be processed and analyzed in near real-time.
4. What is a datastream in GCP?
In Google Cloud Platform (GCP), Datastream is a managed service that enables real-time data integration and replication from various sources to Google Cloud destinations.
Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent 4 years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves to explain complex information related to data engineering to his peers through writing. He has written many blogs related to data integration, data management aspects, and key challenges data practitioners face.