Connect PubSub to BigQuery: 4 Easy Steps

Tags: BigQuery Datasets, Data Warehouse, Google BigQuery, Google Cloud Platform, PubSub • September 22nd, 2021

Data is an integral part of any business in a data-driven world where most companies manage their workflows online. Because the volume of user-generated data is growing rapidly, applications are built on modern Cloud architectures that handle large data volumes without lag. These applications are decoupled into smaller, independent building blocks for greater flexibility. PubSub messaging is an architecture pattern that delivers instant event updates to such distributed applications.

PubSub allows services to communicate asynchronously, with latencies on the order of 100 milliseconds. The data sent from PubSub to BigQuery is important for companies as it can contain customer details, operation details, Marketing data, E-Mail data, etc. Data Warehouses such as Google BigQuery help companies store and analyze this data. PubSub handles high volumes of data from many streams simultaneously, which helps developers build applications faster and keep them independent of one another.

PubSub allows companies to scale and manage data at a fast rate without affecting performance. Connecting PubSub to BigQuery helps companies get access to raw or processed data in real-time. In this article, you will read about PubSub and its use cases. You will also learn the steps to connect PubSub to BigQuery for seamless data flow. 

What is Google BigQuery?

Google BigQuery is a Cloud Data Warehouse that is part of the Google Cloud Platform, which means it integrates easily with other Google products and services. It lets users manage terabytes of data, analyze it quickly with SQL queries, and generate insights. Google BigQuery’s architecture uses a Columnar Storage structure that enables faster query processing and file compression.

Google BigQuery can be accessed through the Google Cloud Platform Console, its Web UI, or calls to its REST API. It offers a flexible pricing model based on pay-per-usage. Users can use Hevo to connect Google Cloud Platform, REST APIs, and a multitude of other sources with Google BigQuery using its No Code interface and automate the process seamlessly.

Key Features of Google BigQuery

Google BigQuery allows enterprises to store and analyze data with fast processing. A few of the features that make it a popular choice are listed below:

  • BI Engine: It is an in-memory analysis service that allows users to analyze large datasets interactively in Google BigQuery’s Data Warehouse itself. It offers sub-second query response time and high concurrency.
  • Integrations: Google BigQuery offers easy integrations with other Google products and its partnered apps. Moreover, developers can easily create integration with its API.
  • Fault-tolerant Structure: Google BigQuery delivers a fault-tolerant structure to prevent data loss and provide real-time logs for any error in an ETL process.

To learn more about Google BigQuery, click here.

What is PubSub?

PubSub (Pub/Sub) stands for Publisher-Subscriber messaging, a method for asynchronous communication in which the sender and the receiver of a message do not need to know about each other. The Publisher is the application that publishes messages to a topic, for example from an application’s data stream, while the Subscriber is the application that subscribes to a topic to receive those messages.

PubSub is used for high-volume, asynchronous Data Streaming and for data integration pipelines that ingest and distribute data. Real-time Data Streaming to Google BigQuery, including data queueing and streaming tasks, can also be achieved using Hevo, which supports built-in connectors for Google Cloud Storage, Jira, Kafka, etc. to streamline data transfer to Google BigQuery.

PubSub acts as messaging middleware for service integration or as a queue to parallelize tasks. Publishers send messages or events to PubSub without knowing whether any receiver exists. Similarly, Subscribers receive messages from PubSub without knowing anything about the Publisher.
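
The decoupling described above can be illustrated with a small in-memory model. This is only a sketch for building intuition: the `InMemoryBroker` class and the subscription names are hypothetical, and it is not the Google Cloud client library or how the managed service is implemented.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a Pub/Sub service; not the Google Cloud client API."""

    def __init__(self):
        # topic name -> {subscription name -> queue of pending messages}
        self._subscriptions = defaultdict(dict)

    def subscribe(self, topic, subscription):
        """Attach a named subscription with its own backlog to a topic."""
        self._subscriptions[topic].setdefault(subscription, deque())

    def publish(self, topic, message):
        """Copy the message into every subscription's backlog; return the fan-out count."""
        queues = self._subscriptions[topic].values()
        for queue in queues:
            queue.append(message)
        return len(queues)

    def pull(self, topic, subscription):
        """Return the oldest pending message for a subscription, or None."""
        queue = self._subscriptions[topic][subscription]
        return queue.popleft() if queue else None


broker = InMemoryBroker()
# Publishing with no subscribers still succeeds; the publisher never learns who listens.
delivered_before = broker.publish("Topic01", {"name": "Aditya", "language": "ENG"})

broker.subscribe("Topic01", "analytics")
broker.subscribe("Topic01", "notifications")
delivered_after = broker.publish("Topic01", {"name": "Aditya", "language": "ENG"})

print(delivered_before, delivered_after)
print(broker.pull("Topic01", "analytics"))
```

Each subscription keeps its own backlog, so one slow Subscriber does not hold back the others, and the Publisher code never references a Subscriber.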

Some common Use Cases of PubSub are listed below:

  • Parallel Processing: PubSub can trigger Google Cloud Functions, which lets developers distribute many processing requests in parallel across different workers, such as sending notifications, compressing an image, updating a profile picture, etc.
  • Real-time Event Distribution: PubSub allows developers to send data streams from multiple events to applications for real-time processing.
  • Replicating Data: PubSub is used to distribute events when changes occur in Databases. The events are used to create a state history in Google BigQuery and other Data Storage systems.

Key Features of PubSub

PubSub is useful for gathering information or events from many streams simultaneously. A few features of PubSub are listed below:

  • Filtering: PubSub can filter the incoming messages or events from Publishers based on various attributes to reduce the delivery volumes to Subscribers.
  • 3rd Party Integrations: It offers OSS Integrations for Apache Kafka and Knative Eventing. It also provides integrations with Splunk and Datadog for logs.
  • Replay Events: PubSub allows users to rewind the backlog to a point in time or to a snapshot to reprocess messages.
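
To make the Filtering feature more concrete, here is a minimal sketch of attribute-based filtering. It only models exact-match filtering in plain Python: the `matches` and `filter_messages` helpers and the sample attributes are made up for illustration, and the real service expresses filters in its own filter syntax on the subscription.

```python
def matches(attributes, required):
    """Return True when every required attribute is present with the given value."""
    return all(attributes.get(key) == value for key, value in required.items())


def filter_messages(messages, required):
    """Keep only the messages whose attributes satisfy the filter."""
    return [m for m in messages if matches(m["attributes"], required)]


# Hypothetical messages: a payload plus key/value attributes set by the Publisher.
messages = [
    {"data": "signup", "attributes": {"language": "ENG", "source": "web"}},
    {"data": "signup", "attributes": {"language": "DEU", "source": "web"}},
    {"data": "purchase", "attributes": {"language": "ENG", "source": "app"}},
]

english_only = filter_messages(messages, {"language": "ENG"})
print(len(english_only))
```

Because the filter looks only at attributes and not at the payload, messages that do not match never need to be delivered to the Subscriber, which is what reduces delivery volume.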

To learn more about PubSub, click here.

Load Data Seamlessly to BigQuery Using Hevo’s No Code Data Pipeline

Hevo Data, an Automated No Code Data Pipeline, helps you transfer data from a plethora of data sources to Google BigQuery in a completely hassle-free manner. Hevo is fully managed and completely automates both Batch and Streaming data loads into Google BigQuery. It also enriches the data and transforms it into an analysis-ready form without you having to write a single line of code. You can also leverage Hevo’s Data Mapping feature to ensure that your Google BigQuery schema is replicated in an error-free manner.

Load Data to BigQuery with Hevo for Free

“Hevo’s fault-tolerant Data Pipeline offers you a secure option to unify data from 100+ data sources (including 40+ free sources) and store it in a BigQuery or any other Data Warehouse of your choice. This way you can focus more on your key business activities and let Hevo take full charge of the Data Transfer process.”

Experience an entirely automated hassle-free Data Loading to Google BigQuery. Try our 14-day full access free trial today!

Key Features of PubSub to BigQuery Data Transfer

  • High Availability: PubSub to BigQuery offers synchronous, cross-zone message replication, and per-message receipt tracking to ensure fast and reliable delivery of the message at any scale.
  • At-least-once Delivery: PubSub to BigQuery Data Transfer supports in-order and any-order at-least-once message delivery, in both pull and push modes.
  • Security: To follow data privacy norms and keep data secure, PubSub to BigQuery Data Transfer ensures data protection with fine-grained access controls and always-on encryption.
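
At-least-once delivery means a message can occasionally arrive more than once, so consumers are usually written to tolerate duplicates. The sketch below shows one common approach, deduplicating by message ID. The `process_once` helper, the IDs, and the second sample record are hypothetical, and a real consumer would keep the seen IDs in durable storage rather than in memory.

```python
def process_once(deliveries, handler):
    """Run handler once per message ID, even if a message is delivered again."""
    seen_ids = set()
    processed = []
    for message_id, payload in deliveries:
        if message_id in seen_ids:
            continue  # redelivery of a message that was already handled
        handler(payload)
        seen_ids.add(message_id)
        processed.append(message_id)
    return processed


rows = []  # stands in for the destination table
deliveries = [
    ("m1", {"name": "Aditya", "language": "ENG"}),
    ("m2", {"name": "Priya", "language": "HIN"}),   # made-up record
    ("m1", {"name": "Aditya", "language": "ENG"}),  # duplicate delivery
]
processed_ids = process_once(deliveries, rows.append)
print(processed_ids)
```

The ID is recorded only after the handler returns, so a message whose handler fails is not marked as done and can be retried on redelivery.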

Steps to Connect PubSub to BigQuery

Now that you have read about Google BigQuery and PubSub, this section walks you through the steps to connect PubSub to BigQuery. As an example, you will manually send a message from PubSub to BigQuery; the same approach can be implemented in applications. The steps to connect PubSub to BigQuery are listed below:

Step 1: Creating Google Storage Bucket

  • Log in to Google Cloud Platform here.
  • There are 2 ways to use Google Cloud Platform, i.e., through the Web UI and through Cloud Shell. In this tutorial to connect PubSub to BigQuery, we will use Cloud Shell.
  • Open the Cloud Shell Editor by clicking on the shell icon located on top of the screen, as shown in the image below.
  • Cloud Shell will take some time to start. Then click on the “Authorize” button to grant permissions.
  • Now choose an existing project or create a new project for the PubSub to BigQuery dataflow.
  • Now, in Cloud Shell, create a new Bucket in Google Cloud Storage by typing the command given below.
gsutil mb gs://uniqueName
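  • Here, replace “uniqueName” in the above command with a globally unique name for the Bucket. Note that Cloud Storage bucket names may contain only lowercase letters, digits, dashes, underscores, and dots, so the name you choose must not contain uppercase letters.
  • It will create a new Bucket in Google Cloud Storage.
  • In a new browser tab, open Google Cloud Platform, go to Google Cloud Storage, and check that the new Bucket is present.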
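
Before running the bucket command, it can help to sanity-check the name you plan to use. The sketch below covers only a simplified subset of the Cloud Storage naming rules (3 to 63 characters, lowercase letters, digits, dashes and underscores, starting and ending with a letter or digit, no "goog" prefix). The `is_plausible_bucket_name` helper and the sample name are hypothetical, names containing dots are deliberately not handled, and global uniqueness can only be confirmed by the service itself.

```python
import re

# Simplified subset of the Cloud Storage naming rules (assumption: no dots in the name).
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name):
    """Check length (3-63), allowed characters, and alphanumeric first/last character."""
    return bool(_BUCKET_NAME.fullmatch(name)) and not name.startswith("goog")


print(is_plausible_bucket_name("uniqueName"))           # contains an uppercase letter
print(is_plausible_bucket_name("pubsub-bq-demo-2021"))  # hypothetical lowercase name
```

A name that passes this check can still be rejected if another project already owns it, so treat the result as a quick local filter rather than a guarantee.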

Step 2: Creating Topic in PubSub

  • Go to the Cloud Shell again and create a new topic in PubSub by typing the command given below.
gcloud pubsub topics create Topic01
  • Here, provide a topic name of your choice in place of “Topic01” in the above command.
  • It will create a new topic in PubSub.
  • In a new browser tab, open Google Cloud Platform, go to PubSub, and check that a topic with the name you provided is present.

What makes Hevo’s Data Transfer to Google BigQuery Unique

Manually performing the data transfer to Google BigQuery can be a cumbersome task that involves maintaining Data Pipelines. Hevo automates the data transfer process into Google BigQuery by offering ready-to-go integrations with sources such as Jira, Kafka, and many more.

Check out how Hevo can make your life easier:

  • Secure: Hevo has a fault-tolerant architecture and ensures that your Google BigQuery data is handled in a secure & consistent manner with zero data loss.
  • Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the format of incoming data and replicates it to the destination schema. 
  • High-Speed Data Loading: Loading compressed data into Google BigQuery is slower than loading uncompressed data. Hevo can decompress your data before feeding it to BigQuery. Hence your process would be simpler on the source side and will be completed efficiently on the destination side. 
  • Transformations: Hevo provides preload transformations to make your incoming data fit for the chosen destination. You can also use drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few.
  • Live Support: The Hevo team is available round the clock to extend exceptional support for your convenience through chat, email, and support calls.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo.

Step 3: Creating Dataset in Google BigQuery

  • Now, create a new dataset in Google BigQuery by typing the command in Cloud Shell given below.
bq mk dataset01
  • Here, replace “dataset01” with a dataset name of your choice.
  • It will create a new dataset in Google BigQuery, which you can check in the Google BigQuery Web UI.
  • Now, let’s create a new table in the current dataset to get messages from PubSub to BigQuery in that table.
  • Here, let’s take a simple example for a message in JSON (JavaScript Object Notation) format, given below.
{ 
"name" : "Aditya",
"language" : "ENG" 
}
  • You can create any message you want but it should be in JSON format.
  • This message is only an example. Generally, the PubSub topic is connected to an application that publishes messages through code or scripts.
  • In the above example, the message has 2 fields, “name” and “language”, both of “STRING” type.
  • To create a table according to the above message structure, type the command given below.
bq mk dataset01.table01 name:STRING,language:STRING 
  • Here, “dataset01” is the name of your dataset and “table01” is the name of the table you want to create. “name” and “language” are the fields, each followed by its datatype.
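
Since the table columns must line up with the fields of the JSON message, a quick local check can catch mismatches before anything is published. The following is a minimal sketch using only the Python standard library. The `validate_message` helper and the `TABLE_SCHEMA` mapping are hypothetical names, and the check assumes that missing fields are acceptable because the columns are nullable.

```python
import json

# Mirrors the table definition "name:STRING,language:STRING" used in this step.
TABLE_SCHEMA = {"name": str, "language": str}


def validate_message(raw, schema=TABLE_SCHEMA):
    """Parse a JSON message and return a list of problems (empty if it fits the schema)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["message must be a JSON object"]
    problems = []
    # Fields that are present must have the expected type.
    for field, expected_type in schema.items():
        if field in record and not isinstance(record[field], expected_type):
            problems.append(f"field '{field}' should be {expected_type.__name__}")
    # Fields that the table does not define are reported as well.
    for field in record:
        if field not in schema:
            problems.append(f"unknown field '{field}'")
    return problems


print(validate_message('{"name": "Aditya", "language": "ENG"}'))
print(validate_message('{"name": "Aditya", "lang": "ENG"}'))
```

An empty list means the message fits the table; any other result names the field to fix before you publish.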

Step 4: Connecting PubSub to BigQuery Using Dataflow

  • In a new browser tab, open Google Cloud Platform, search for “Dataflow”, and open it. It will open the Dataflow service by Google.
  • Here, click on the “+CREATE JOB FROM TEMPLATE” option to create a new PubSub BigQuery Job, as shown in the image below.
  • Now, provide the Job name and choose the “PubSub Topic to BigQuery” option in the Dataflow template drop-down menu, as shown in the image below.
  • Copy the “Topic name” from the PubSub topic you just created and the Table ID from the Google BigQuery Table Info.
  • Paste both the values in their respective fields in Dataflow.
  • Go to Google Cloud Storage, and open the Bucket you created for the current project. There, navigate to the “CONFIGURATION” tab and copy the value against “gsutil URI”.
  • Paste the value in the “Temporary location” field in Dataflow.
  • Now, click on the “RUN JOB” button to connect PubSub to BigQuery.
  • Now, go to PubSub and click on the “+ PUBLISH MESSAGE” button.
  • Copy and paste the above message in JSON or provide your message based on the fields you added in the table.
  • Click the “PUBLISH” button. It will publish the message from PubSub to BigQuery using Dataflow and Google Cloud Storage.
  • You can check data in the table in Google BigQuery.
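
The console steps above have a rough command-line counterpart, which the sketch below assembles without running anything. Several things here are assumptions rather than part of this tutorial: the project ID, job name, bucket name, and region are placeholders, the template path and the `inputTopic` / `outputTableSpec` parameter names reflect the Google-provided Pub/Sub Topic to BigQuery template as best understood here, and the console’s “Temporary location” field is mapped to `--staging-location`. Please verify all of them against the current gcloud and Dataflow template documentation before use.

```python
import shlex


def dataflow_job_command(job_name, project, topic, dataset, table, bucket, region):
    """Build (but do not run) a gcloud command for the Pub/Sub Topic to BigQuery template."""
    parameters = ",".join([
        f"inputTopic=projects/{project}/topics/{topic}",   # full topic name
        f"outputTableSpec={project}:{dataset}.{table}",    # BigQuery table ID
    ])
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        "--gcs-location", "gs://dataflow-templates/latest/PubSub_to_BigQuery",
        "--region", region,
        "--staging-location", f"gs://{bucket}/temp",
        "--parameters", parameters,
    ]


command = dataflow_job_command(
    job_name="pubsub-to-bq-demo",   # hypothetical job name
    project="my-project",           # hypothetical project ID
    topic="Topic01",
    dataset="dataset01",
    table="table01",
    bucket="my-unique-bucket",      # hypothetical bucket name
    region="us-central1",           # hypothetical region
)
# Print the command for review instead of executing it.
print(shlex.join(command))
```

Printing the command first lets you compare the topic name, table ID, and bucket path with the values you copied in the console before deciding to run it in Cloud Shell.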

That’s it! You successfully connected PubSub to BigQuery.

Use Cases of PubSub to BigQuery Connection

Some of the common use cases of BigQuery PubSub Connection are listed below.

Stream Analytics

Google Cloud comes with a streaming analytics solution that allows its users to ingest, process, and analyze event streams in real-time to make data more organized, useful, and accessible from the instant it is generated. The availability and accessibility of fresh data are possible via the PubSub service with Dataflow and Google BigQuery.

It manages all the resources needed to handle fluctuating volumes of real-time data for live business insights. Moreover, PubSub to BigQuery reduces the complexity and makes data streams available to Data Engineers and Analysts.

Asynchronous Microservices Integration

PubSub serves as a messaging middleware for traditional service integration or a simple communication medium for modern microservices used in applications.

The PubSub push subscription feature delivers the events to serverless webhooks on Cloud Run, App Engine, Cloud Functions, or even on the custom environments such as Google Kubernetes Engine.

Moreover, it also provides a low-latency pull delivery option for cases where exposing webhooks is not an option or where high-throughput streams must be handled efficiently.

Benefits of Connecting PubSub to BigQuery

PubSub powers modern applications with asynchronous, streamlined data flow. Connecting PubSub to BigQuery allows companies to get all the data, messages, or events in a Data Warehouse. A few benefits of Connecting PubSub to BigQuery are listed below:

  • Connecting PubSub to BigQuery allows companies to transfer data to a Data Warehouse to run analytics on data and generate insights.
  • PubSub ensures reliable delivery of messages at any scale and makes message replication simple.
  • It allows users to integrate PubSub with many other apps and services through Google BigQuery and publish data.

Conclusion

In this article, you learned about the steps to connect PubSub to BigQuery. You also read how PubSub to BigQuery data flow allows companies to store user data and manage applications’ data streams at any scale. PubSub allows developers to manage data streams in real-time without lag. PubSub ensures data delivery by maintaining a queue and following the one-to-many, many-to-one, and many-to-many Publisher-Subscriber model.

Companies store valuable data from multiple data sources in Google BigQuery. However, when it comes to loading data into BigQuery, you need to be an expert to set up ETL pipelines from scratch and manually configure several details. Moreover, most of the time, the data is not available in the right format and you will need data engineering and BigQuery administration skills to transform the data.

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 100+ data sources (including 40+ free sources) and can seamlessly load data to BigQuery in real-time. Furthermore, Hevo’s fault-tolerant architecture ensures a consistent and secure transfer of your data to BigQuery. Using Hevo will make your life easier and make Data Transfer hassle-free.

Visit our Website to Explore Hevo

Share your experience of learning about PubSub to BigQuery Seamless Dataflow in the comments section below!