Connect PubSub to BigQuery: 4 Easy Steps

on bigquery datasets, Data Warehouse, Google BigQuery, Google Cloud Platform, PubSub • September 22nd, 2021 • Write for Hevo

Data is an integral part of any business in a data-driven world where most companies manage their workflows online. Because the data generated by users is increasing rapidly, applications are developed with a modern Cloud architecture to handle large data volumes without lag. Applications are now decoupled into smaller parts that work as independent building blocks for greater flexibility. PubSub messaging is a data-driven architecture that delivers instant event updates to distributed applications. 

PubSub allows services to communicate asynchronously with latencies on the order of 100 milliseconds. The data sent from PubSub to BigQuery is important for companies as it can contain customer details, operational details, Marketing data, E-Mail data, etc. Data Warehouses such as Google BigQuery help companies store and analyze this data. PubSub handles high volumes of data simultaneously in an application, which helps developers create apps faster and independently of other data streams.

PubSub allows companies to scale and manage data at a fast rate without affecting performance. Connecting PubSub to BigQuery helps companies get access to raw or processed data in real-time. In this article, you will read about PubSub and its use cases. You will also learn the steps to connect PubSub to BigQuery for seamless data flow. 

Introduction to Google BigQuery


Google BigQuery is a Cloud Data Warehouse that is part of the Google Cloud Platform, which means it can easily integrate with other Google products and services. It allows users to manage terabytes of data with SQL and helps companies analyze their data faster and generate insights. Google BigQuery uses a Columnar Storage structure that enables faster query processing and file compression. 

Google BigQuery can be accessed through the Google Cloud Platform Console, its Web UI, or calls to its REST API. It offers a flexible pay-per-usage pricing model. 

To learn more about Google BigQuery, click here.

Key Features of Google BigQuery

Google BigQuery allows enterprises to store and analyze data with faster processing. There are many reasons for choosing Google BigQuery. A few of them are listed below:

  • BI Engine: It is an in-memory analysis service that allows users to analyze large datasets interactively in Google BigQuery’s Data Warehouse itself. It offers sub-second query response time and high concurrency.
  • Integrations: Google BigQuery offers easy integrations with other Google products and its partnered apps. Moreover, developers can easily create integrations with its API.
  • Fault-tolerant Structure: Google BigQuery delivers a fault-tolerant structure to prevent data loss and provide real-time logs for any error in an ETL process.

Introduction to PubSub


PubSub (Pub/Sub) stands for Publisher-Subscriber messaging, a method for asynchronous communication in which the receiver doesn’t know about the sender and vice-versa. The Publisher in PubSub publishes messages to a topic from an application’s data stream, while the Subscriber subscribes to a topic to receive those messages. PubSub is used for asynchronous, high-volume data streaming and for data integration pipelines that ingest and distribute data. 

PubSub acts as an intermediary or middleware for service integration, or as a queue to parallelize tasks. Publishers send messages or events to PubSub without knowing whether there is a receiver. Similarly, the receiver subscribes and pulls messages from PubSub without knowing anything about the Publisher. 
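The decoupling described above can be sketched with a minimal in-memory Publisher-Subscriber model. This is a simplified illustration only, not the Google Cloud Pub/Sub API; the `Broker` class and its method names are hypothetical:

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory Pub/Sub broker: topics map to subscriber callbacks."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # A subscriber registers interest in a topic; it knows nothing about publishers.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # A publisher sends to a topic; it knows nothing about the subscribers.
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)   # one subscriber collects messages
broker.subscribe("orders", lambda m: None)    # another, fully independent subscriber
broker.publish("orders", {"id": 1})
print(received)  # [{'id': 1}]
```

Both subscribers receive the same event without the publisher being aware of either, which is the one-to-many pattern PubSub builds on.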

Some common Use Cases of PubSub are listed below:

  • Parallel Processing: PubSub can connect with Google Cloud Functions, which allows developers to send multiple processing requests in parallel to different PubSub workers, such as sending notifications, compressing an image, updating a profile picture, etc. 
  • Real-time Event Distribution: PubSub allows developers to send data streams from multiple events to applications for real-time processing. 
  • Replicating Data: PubSub is used to distribute events when changes occur in Databases. The events are used to create a state history in Google BigQuery and other Data Storage systems.

Key Features of PubSub

PubSub is useful for gathering information or events from many streams simultaneously. A few features of PubSub are listed below:

  • Filtering: PubSub can filter the incoming messages or events from Publishers based on various attributes to reduce the delivery volumes to Subscribers.
  • 3rd Party Integrations: It offers OSS Integrations for Apache Kafka and Knative Eventing. Also, it provides integrations with Splunk and Datadog for logs.
  • Replay Events: PubSub allows users to rewind the backlog or snapshot to reprocess messages.
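Attribute-based filtering can be sketched in a few lines. This is a simplified illustration of the idea, not Pub/Sub's actual filter syntax; the function and field names are hypothetical:

```python
def filter_messages(messages, required_attributes):
    """Deliver only messages whose attributes contain all required key/value pairs."""
    return [m for m in messages
            if all(m.get("attributes", {}).get(k) == v
                   for k, v in required_attributes.items())]

inbox = [
    {"data": "a", "attributes": {"region": "eu"}},
    {"data": "b", "attributes": {"region": "us"}},
]
print(filter_messages(inbox, {"region": "eu"}))  # only the first message
```

Filtering at the topic/subscription boundary like this is what reduces the delivery volume each Subscriber has to handle.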

To learn more about PubSub, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and is a 3-step process: just select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline delivers data in real-time from source to destination without any loss. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Steps to Connect PubSub to BigQuery

Now that you have read about Google BigQuery and PubSub, in this section you will learn the steps to connect PubSub to BigQuery. Here, you will manually send messages from PubSub to BigQuery as an example. A similar method can be implemented in applications programmatically. The steps to connect PubSub to BigQuery are listed below:

Step 1: Creating Google Storage Bucket

  • Log in to the Google Cloud Platform here.
  • There are 2 ways to use the Google Cloud Platform, i.e., through the Web UI and the Cloud Shell. This tutorial to connect PubSub to BigQuery will use the Cloud Shell.
  • Open the Cloud Shell Editor by clicking on the shell icon located on top of the screen, as shown in the image below.
Cloud Shell Icon in Google Cloud Platform
Image Source: Self
  • It will take some time to start; then click on the “Authorize” button to grant permissions.
  • Now, choose an existing project or create a new project for the PubSub to BigQuery dataflow.
  • Now, in the Cloud Shell, create a new Bucket in Google Cloud Storage by typing the command given below.
gsutil mb gs://uniqueName
  • Here, provide a unique name to the Bucket in place of “uniqueName” in the above command.
  • It will create a new Bucket in Google Cloud Storage.
  • In a new browser tab, open the Google Cloud Platform, navigate to Google Cloud Storage, and check whether the new Bucket is present.
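Bucket names must be globally unique and follow Cloud Storage naming rules: 3–63 characters, lowercase letters, numbers, dashes, underscores and dots, starting and ending with a letter or number. A small sketch to pre-check a candidate name before running `gsutil mb` (the helper name is hypothetical, and the pattern covers only the common rules):

```python
import re

def is_valid_bucket_name(name):
    """Rough pre-check of common Cloud Storage bucket naming rules:
    3-63 chars, lowercase letters/digits/dashes/underscores/dots,
    starting and ending with a letter or digit."""
    return bool(re.fullmatch(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]", name))

print(is_valid_bucket_name("my-pubsub-bq-bucket"))  # True
print(is_valid_bucket_name("MyBucket"))             # False (uppercase not allowed)
```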

Step 2: Creating Topic in PubSub

  • Go to the Cloud Shell again and create a new topic in PubSub by typing the command given below.
gcloud pubsub topics create MyTopic
  • Here, provide a topic name of your choice in place of “MyTopic” in the above command.
  • It will create a new topic in PubSub.
  • In a new browser tab, open the Google Cloud Platform, navigate to PubSub, and check whether the new topic with the provided name is present.
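Step 4 will ask for the full topic name, which follows the format `projects/<project-id>/topics/<topic>`. A tiny helper to assemble it (the function name is illustrative):

```python
def topic_path(project_id, topic):
    """Full Pub/Sub topic name in the format the Dataflow template expects."""
    return f"projects/{project_id}/topics/{topic}"

print(topic_path("my-project", "MyTopic"))
# projects/my-project/topics/MyTopic
```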

Step 3: Creating Dataset in Google BigQuery

  • Now, create a new dataset in Google BigQuery by typing the command in Cloud Shell given below.
bq mk mydataset
  • Here, provide a dataset name of your choice in place of “mydataset“.
  • It will create a new dataset in Google BigQuery, which you can check in the Google BigQuery Web UI.
  • Now, let’s create a new table in the current dataset to get messages from PubSub to BigQuery in that table.
  • Here, let’s take a simple example for a message in JSON (JavaScript Object Notation) format, given below.
{
  "name": "John",
  "language": "ENG"
}
  • You can create any message you want but it should be in JSON format.
  • Also, this is just an example. Generally, the PubSub topic is connected to an application using programming languages and scripts.
  • According to the above example, the message has 2 fields, “name” and “language”, both of “STRING” type.
  • To create a table according to the above message structure, type the command given below.
bq mk mydataset.mytable name:STRING,language:STRING 
  • Here, “mydataset” is the name of your dataset and “mytable” is the name you want to give the table, followed by “name” and “language” as fields with their datatypes.
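Since the “PubSub Topic to BigQuery” template expects each message’s JSON fields to line up with the table columns, it helps to check a message against the schema before publishing. A hedged sketch (the helper and constant names are hypothetical, and only STRING fields are handled, matching the example schema):

```python
import json

SCHEMA = {"name": "STRING", "language": "STRING"}  # mirrors mydataset.mytable

def matches_schema(raw_message, schema=SCHEMA):
    """Return True if the JSON message has exactly the schema's fields,
    with string values for STRING columns."""
    record = json.loads(raw_message)
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], str)
               for field, kind in schema.items() if kind == "STRING")

print(matches_schema('{"name": "John", "language": "ENG"}'))  # True
print(matches_schema('{"name": "John"}'))                     # False (missing field)
```

Messages that fail this check would land in the template’s error records rather than the table, so a pre-check like this saves debugging later.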

Step 4: Connecting PubSub to BigQuery Using Dataflow

  • In a new browser tab, open the Google Cloud Platform, search for “Dataflow”, and open it. It will open the Dataflow service by Google.
  • Here, click on the “+CREATE JOB FROM TEMPLATE” option to create a new PubSub to BigQuery Job, as shown in the image below.
Creating Dataflow Job from Template for PubSub to BigQuery Data Transfer
Image Source: Self
  • Now, provide a Job name and choose the “PubSub Topic to BigQuery” option in the Dataflow template drop-down menu, as shown in the image below.
Choosing Topic PubSub to BigQuery Option  for Dataflow Template
Image Source: Self
  • Copy the “Topic name” from the PubSub topic you just created and the Table ID from the Google BigQuery Table Info.
  • Paste both the values in their respective fields in Dataflow.
  • Go to Google Cloud Storage and open the current project’s Bucket. There, navigate to the “CONFIGURATION” tab and copy the value against “gsutil URI“.
  • Paste the value in the “Temporary location” field in Dataflow.
  • Now, click on the “RUN JOB” button to connect PubSub to BigQuery.
  • Now, go to PubSub and click on the “+ PUBLISH MESSAGE” button.
  • Copy and paste the above message in JSON or provide your message based on the fields you added in the table.
  • Click the “PUBLISH” button. It will publish the message from PubSub to BigQuery using Dataflow and Google Cloud Storage.
  • You can check data in the table in Google BigQuery.
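The values entered into the Dataflow form above boil down to three naming formats: the input topic path, the output table spec (`project:dataset.table`), and a temporary location under the Bucket. A sketch that assembles them (an illustration of the formats only; the function and parameter names are hypothetical):

```python
def dataflow_parameters(project_id, topic, dataset, table, bucket):
    """Assemble the values used by the 'PubSub Topic to BigQuery' Dataflow job."""
    return {
        "inputTopic": f"projects/{project_id}/topics/{topic}",       # PubSub topic name
        "outputTableSpec": f"{project_id}:{dataset}.{table}",        # BigQuery Table ID
        "tempLocation": f"gs://{bucket}/temp",                       # Temporary location
    }

params = dataflow_parameters("my-project", "MyTopic", "mydataset", "mytable", "my-bucket")
print(params["outputTableSpec"])  # my-project:mydataset.mytable
```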

That’s it! You successfully connected PubSub to BigQuery.

Benefits of Connecting PubSub to BigQuery

PubSub powers modern applications with asynchronous streaming data flow. Connecting PubSub to BigQuery allows companies to get all their data, messages, or events in a Data Warehouse. A few benefits of connecting PubSub to BigQuery are listed below:

  • Connecting PubSub to BigQuery allows companies to transfer data to a Data Warehouse to run analytics on it and generate insights.
  • PubSub ensures reliable delivery of messages at any scale and makes message replication simple.
  • It allows users to integrate PubSub with many other apps and services and publish data to Google BigQuery.

Conclusion

In this article, you learnt about the steps to connect PubSub to BigQuery. You also read how the PubSub to BigQuery data flow allows companies to store user data and manage an application’s data streams at any scale. PubSub allows developers to manage data streams in real-time without lag. PubSub ensures data delivery by maintaining a queue and follows one-to-many, many-to-one, and many-to-many Publisher-Subscriber models.

Visit our Website to Explore Hevo

Companies store valuable data from multiple data sources in Google BigQuery. Manually transferring data from source to destination is a tedious task. Hevo Data is a No-code Data Pipeline that can help you transfer data from any data source to Google BigQuery. It fully automates the process to load and transform data from 100+ sources to a destination of your choice without writing a single line of code. 

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about PubSub to BigQuery Seamless Dataflow in the comments section below!

No-code Data Pipeline For your Google BigQuery