Are you trying to load data into Google BigQuery? Are you struggling with Google BigQuery Streaming Insert? If yes, then this blog will answer all your queries. You can load data into Google BigQuery using various methods, but loading data using Google BigQuery Streaming Insert allows you to insert data in near real-time. In this blog, you will learn about Google BigQuery Streaming Insert, its mechanism, limitations, and pricing in detail.

Introduction to Google BigQuery


Google BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility.

Here are a few features of Google BigQuery:

  • Google BigQuery allows you to analyze petabytes of data at high speed with zero operational overhead. 
  • You don’t need to deploy clusters or virtual machines, set up keys or indexes, or install software.
  • You can stream millions of rows per second for real-time analysis.
  • Thousands of cores can be used per query.
  • Storage and compute are separated in Google BigQuery.

To understand more about Google BigQuery, please refer to the following Hevo Data article.

Streaming Data into Google BigQuery

There are a few ways of loading data into Google BigQuery, as listed below:

  • Loading data from Cloud Storage 
  • Batch loading data from local files
  • Streaming data into Google BigQuery  

Google BigQuery processes and loads data efficiently. This article discusses one of the cost-effective ways to load data into BigQuery: streaming data into BigQuery. Everyone appreciates the importance of pumping data into data stores in ‘near real-time’, and Google BigQuery Streaming Insert ensures that data lands in near real-time.

Hevo Data: Load Real-Time Data in Google BigQuery

Hevo is a No-code Data Pipeline. It supports pre-built data integrations from 150+ data sources. Hevo offers a fully managed solution for your data migration process and automates your data flow in minutes without writing a single line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data in Google BigQuery.

Let’s look at some salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer. 
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within pipelines.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Explore more about Hevo by signing up for a 14-day free trial today.

Prerequisites for Google BigQuery Streaming Insert

Before proceeding with the implementation of Google BigQuery Streaming Insert, it is important to consider the following prerequisites:

  1. Write access to the dataset is a must before streaming data into the destination table, and the table must exist before you write data into it. If the table is not available yet, you will have to use Template Tables (see the sketch after this list). 
    Please Note – Template tables are new tables that BigQuery creates for you automatically: you do not need to specify the table structure or import the table metadata, because each new table inherits the schema of the template table.  
  2. There is a quota policy that you have to follow to avoid errors. You can find it here; it is also discussed later in this article. 
  3. Google BigQuery Streaming Insert is not available on the Free Tier account, so make sure that billing is enabled before you proceed. If you attempt to use streaming without enabling billing, you will receive the following error: 
    ‘BigQuery: Streaming insert is not allowed in the free tier.’
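If the destination table may not exist yet, below is a minimal sketch of using template tables through the google-cloud-bigquery Python client; the project, dataset, table name, and suffix are placeholders, not values from this article.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder ID of an existing template (base) table; replace with your own names.
base_table_id = "your-project.your_dataset.events_template"

rows = [{"name": "Aisha", "age": 25}]

# With template_suffix, BigQuery automatically creates "events_template_20240101"
# (if it does not exist yet) using the schema of the base table, and streams into it.
errors = client.insert_rows_json(base_table_id, rows, template_suffix="_20240101")
if errors:
    print("Encountered errors while inserting rows: {}".format(errors))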
Download the Cheatsheet on How to Set Up High-performance ETL to BigQuery
Learn the best practices and considerations for setting up high-performance ETL to BigQuery

Mechanism of Google BigQuery Streaming Insert

The image below shows how Google BigQuery Streaming Insert occurs:

  • Streaming data producers: These are the applications that send data to BigQuery for streaming. They can be your own user code or other Google Cloud services such as Cloud Dataflow or Cloud Logging.
  • Streaming ingestion workers: This is the portion of the API surface that deals with BigQuery streaming insert traffic. It takes the JSON representation of the streamed records, inserts them into the streaming buffer, and reports whether the operation succeeded or failed.
  • Streaming buffer: It keeps all the recently inserted rows and is optimized for high-throughput writes rather than columnar access.
  • Instant availability reader: This enables the query engine to read records directly from the streaming buffer. 
  • Columnar storage: It holds the data in columnar format, as many individual columnar files of reasonable size. 
  • Extraction worker: This commits buffered data to BigQuery’s managed storage by collecting groups of buffered records and converting them to columnar format. 
  • Query engine: The engine that executes queries, reading from both managed columnar storage and the streaming buffer. 

Instead of using a job to load data into BigQuery, you can choose to stream your data into Google BigQuery one record at a time by using the tabledata.insertAll() method. This approach enables querying data without the delay of running a load job.

A tabledata.insertAll request is sent to the API in JSON format, along with other details such as authorization headers and information about the intended destination. A single insertAll call can contain one or more individual records. 

The streaming worker receives and processes the insert, making the necessary calls to push each record in the insertAll payload into the streaming buffer. The worker uses the optionally provided insertId property to build a well-formed key that refers to the record in the buffer. The worker then reports the success or failure of the insertAll request as a whole; for successful insertAll requests, the response payload also contains a detailed status for the individual records.

The data is committed to the streaming buffer once the insert worker confirms it. The streaming buffer storage is optimized for high-throughput writes rather than columnar access.
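For example, here is a minimal snippet that streams a few rows using the google-cloud-bigquery Python client; table_id is a placeholder for the fully qualified ID of your destination table.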

from google.cloud import bigquery

client = bigquery.Client()

# Fully qualified ID of the destination table; replace with your own values.
table_id = "your-project.your_dataset.your_table"

rows_to_insert = [
    {"name": "Richard", "age": 18},
    {"name": "Dennise", "age": 20},
    {"name": "Hakim", "age": 23},
]

errors = client.insert_rows_json(table_id, rows_to_insert)
if errors == []:
    print("New rows have been added.")
else:
    print("Encountered errors while inserting rows: {}".format(errors))

You have flexibility here: you do not need to populate the insertId field when you insert rows. The following example shows how to avoid sending an insertId for each row when streaming.

errors = client.insert_rows_json(
    table_id, rows_to_insert, row_ids=[None] * len(rows_to_insert)
)  

Additional Considerations

  • Deduplication: You must account for failure modes: transient issues such as network errors can occur while streaming records to BigQuery. BigQuery exposes the [insertId](https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll#rows.insertId) field to help you build write-once semantics. Note that once a record bearing a specific insertId has been extracted from the streaming buffer, new records arriving with the same insertId are treated as new rows instead of being deduplicated against the original insert (see the sketch after this list).
  • Consistency and caching: To support the high throughput of individual streaming inserts, the streaming system caches metadata information, specifically around table existence and schema information. 
  • Table copies, extracts, and data availability: When you run a copy or extract job, records in the streaming buffer are not considered, as those jobs exclusively work with managed storage. 
  • Data Manipulation Language (DML): Bulk modification of tables via data manipulation statements such as UPDATE and DELETE is supported in BigQuery. To prevent inconsistencies, DML statements are not allowed on a table while it is being actively streamed to. After a table has received no inserts for an extended interval of time, the buffer is detached and DML is allowed again.
  • Partitioned tables: Streaming works with partitioned tables. For tables defined with ingestion-time (natural) partitioning, the ingestion time is used as the partitioning element, and streamed data is delivered to the partition representing the current day by default. You can use the $YYYYMMDD suffix on the table name to deliver BigQuery streaming inserts to recent partitions.
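As mentioned in the deduplication point above, below is a minimal sketch of supplying explicit insertId values through the Python client's row_ids argument; the UUID-based IDs, sample rows, and table name are illustrative assumptions.

import uuid
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # placeholder

rows_to_insert = [
    {"name": "Richard", "age": 18},
    {"name": "Dennise", "age": 20},
]

# One insertId per row; retrying the same request with the same IDs lets BigQuery
# deduplicate on a best-effort basis while the rows are still in the streaming buffer.
row_ids = [str(uuid.uuid4()) for _ in rows_to_insert]

errors = client.insert_rows_json(table_id, rows_to_insert, row_ids=row_ids)
if errors:
    print("Encountered errors while inserting rows: {}".format(errors))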

Streaming into Partitioned Tables

You can stream data into a table partitioned on a DATE or TIMESTAMP column if the column value falls between 5 years in the past and 1 year in the future. Data outside this range is rejected. 

When data streaming occurs, the streamed rows are initially placed in the UNPARTITIONED partition. Once there is enough unpartitioned data, it is moved into the corresponding partitions.
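Below is a hedged sketch of streaming into a column-partitioned table with the Python client; the table name, the partitioning column event_date, and the sample rows are assumptions, and each row is routed to the partition matching its event_date value.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.partitioned_events"  # placeholder

# Create a table partitioned on the DATE column "event_date" (skipped if it already exists).
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("event_date", "DATE"),
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="event_date")
client.create_table(table, exists_ok=True)

# Rows land in the partitions matching their event_date values, provided the dates
# fall within the allowed window (5 years in the past to 1 year in the future).
rows = [
    {"name": "Richard", "event_date": "2023-05-01"},
    {"name": "Hakim", "event_date": "2023-05-02"},
]
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Encountered errors while inserting rows: {}".format(errors))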

Limitations of Google BigQuery Streaming Insert

While using BigQuery Streaming Insert, there are a few limitations that you have to bear in mind. 

  1. Access Control:
    To insert streaming data, one must have Write access to the dataset that contains the destination table. Further detail is available in the official documentation.
  2. Quota Policy:
    If you do not populate the insertId field when you insert rows, the following quota applies:
    • The maximum throughput allowed is 1 GB per second. This limit applies at the project level and not to individual tables. 

A different set of limits applies when you populate the insertId field while inserting rows:

  • If you populate the insertId field for each inserted row, you are limited to 500,000 rows per second per project in the US and EU multi-regions. In other words, the sum of rows per second streamed to all tables for a given project within a multi-region is limited to 500,000. Each table is additionally limited to 100,000 rows per second.
  • Exceeding either of these limits will cause quotaExceeded errors. 
  • In all other regions, the maximum rows per second per project is 100,000; again, exceeding the limit will cause quotaExceeded errors.
  • The maximum rows per second per table is 100,000. 
  • The maximum throughput per table is 100 MB per second.

The following quotas apply whether you populate the insertId field or not:

  • The maximum row size allowed is 5 MB. Exceeding it will give an invalid error. 
  • The HTTP request size limit is 10 MB. Exceeding it will give an invalid error. 
  • The maximum number of rows per request is 10,000, but a maximum of 500 rows per request is recommended; experimentation with representative data (schema and data sizes) will help you determine the ideal batch size (see the batching sketch after this list). 
  • The insertId field is limited to 128 characters. Exceeding this will give an invalid error. 
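Following the 500-rows-per-request recommendation above, here is a minimal batching sketch; the batch size, table name, and generated sample rows are assumptions you would tune against your own schema and data sizes.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # placeholder
batch_size = 500  # recommended starting point; tune with representative data

def stream_in_batches(rows):
    # Split the full row list into chunks so that no single insertAll request
    # exceeds the recommended batch size.
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            print("Errors in batch starting at row {}: {}".format(start, errors))

stream_in_batches([{"name": "row{}".format(i), "age": i % 90} for i in range(2000)])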

These are the quota limitations set by Google. If you need your quota increased beyond these limits, you can submit a request in the Google Cloud Console. You will get a response within 2-3 working days. 

  3. Data Availability:
    The first time a BigQuery streaming insert occurs on a table, the streamed data is inaccessible for up to 2-3 minutes. After this, the streamed data is immediately queryable. However, the data can take up to 90 minutes to become available for export and copy operations (a sketch for checking the streaming buffer follows below). 
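Since copy and extract jobs only see data that has already left the streaming buffer, it can be useful to check whether a table still has buffered rows. Below is a minimal sketch using the Python client's streaming_buffer table property; the table ID is a placeholder.

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project.your_dataset.your_table")  # placeholder ID

buffer_info = table.streaming_buffer  # None when no rows are currently buffered
if buffer_info is None:
    print("No rows in the streaming buffer; copy and extract jobs will see all data.")
else:
    print("Roughly {} rows are still in the streaming buffer.".format(buffer_info.estimated_rows))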

For new projects, it is recommended to use the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. When migrating an existing project from the tabledata.insertAll method to the Storage Write API, select the default stream. However, the tabledata.insertAll method is still fully supported.

Pricing for Google BigQuery Streaming Insert

In Google BigQuery, billing takes place based on the rows that are successfully inserted. Individual rows are calculated using a minimum size of 1 KB. Insert operations cost $0.010 per 200 MB of data. For more details, please have a look at this link.
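As a rough illustration of the math (the row count and row size are assumptions, not figures from this article), the short sketch below estimates the cost of streaming one million rows of about 0.5 KB each, which are billed at the 1 KB minimum.

# Illustrative cost estimate for streaming inserts via tabledata.insertAll.
rows = 1_000_000
billed_bytes_per_row = 1024      # rows smaller than 1 KB are billed at the 1 KB minimum
price_per_200_mb = 0.010         # USD per 200 MB of inserted data

billed_mb = rows * billed_bytes_per_row / (1024 * 1024)   # about 976.6 MB
cost_usd = billed_mb / 200 * price_per_200_mb              # about 0.05 USD
print("Estimated streaming cost: ${:.2f}".format(cost_usd))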

Conclusion

There are certain benefits of using Google BigQuery Streaming Insert, but it does not come without limitations and a small cost. If you require real-time ingestion of non-transactional data, this feature is worth using. But if you are looking for a fully automated solution, then try Hevo Data. This article has provided you with comprehensive information on BigQuery streaming inserts, along with their constraints and alternatives.

Hevo Data is a No-code Data Pipeline. It supports pre-built integrations with 150+ data sources, so you can easily migrate your data to Google BigQuery in real-time.

Explore more about Hevo by signing up for a 14-day free trial today.

Share your experience of using Google BigQuery Streaming Insert in the comment section below.

Muhammad Faraz
Technical Content Writer, Hevo Data

Muhammad Faraz is an AI/ML and MLOps expert with extensive experience in cloud platforms and new technologies. With a Master's degree in Data Science, he excels in data science, machine learning, DevOps, and tech management. As an AI/ML and tech project manager, he leads projects in machine learning and IoT, contributing extensively researched technical content to solve complex problems.

No-code Data Pipeline for Google BigQuery