Streaming data has become a crucial data source for any business that wants to perform real-time analytics. The first step in analyzing data in real time is to load the streaming data, often delivered via Webhooks, into the warehouse as it arrives. In this article, you will learn how to load real-time streaming data from a Webhook to BigQuery. But first, let us briefly understand these systems.
What is a Webhook?
A Webhook is a lightweight, resource-efficient way to capture and react to events. It is a mechanism that notifies a client application whenever a new event happens on the server side.
Webhooks are also known as a Reverse API. With a normal API, the client application makes a call to the server application. With webhooks, the reverse happens: the server side calls the client side.
Because the Webhook calls the client application, the client no longer has to continuously poll the server for new updates. You can read more about webhooks here.
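To make the flow concrete, here is a minimal sketch of what a webhook provider effectively does on its side: it POSTs a JSON event to the endpoint URL you registered. The URL and payload fields below are made-up placeholders, not part of any specific provider's API.

import requests

# Simulating a webhook delivery: the provider POSTs a JSON event to your endpoint.
# The URL and the payload fields are illustrative placeholders.
requests.post(
    "https://your-app.example.com/webhook",
    json={"event": "order.created", "id": "evt_123", "amount": 49.99},
    timeout=10,
)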
What is Google BigQuery?
BigQuery is a Google-managed, cloud-based data warehouse service. It is a dedicated store used to process and analyze huge volumes of data in seconds. Its architecture allows it to automatically scale both up and down based on the volume of data and query complexity.
In addition to its high-performance features, BigQuery also takes care of all resource management. It pretty much works out of the box. You can read more about BigQuery features here.
One of the below-mentioned approaches can be used to load streaming data to BigQuery:
Method 1: Use a fully-managed Data Integration Platform like Hevo Data that lets you move data without writing a single line of code (comes with a 14-day free trial).
Method 2: Build custom scripts to configure ETL jobs to perform the data load.
In this post, we will cover the second method (Custom Code) in detail. Towards the end of the post, you can also find a quick comparison of both data streaming methods so that you can assess your requirements and choose judiciously.
Webhook to BigQuery ETL Using Custom Code
The steps involved in moving data from a Webhook to BigQuery are as follows:
- Getting data out of your application using Webhook
- Preparing Data received from Webhook
- Loading data into Google BigQuery
Step 1: Getting data out of your application using Webhook
Set up a webhook for your application and define the endpoint URL to which the data will be delivered. This is the URL the target application will read the data from.
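As an illustration, a minimal receiving endpoint might look like the sketch below. It assumes Flask; the '/webhook' route and port are arbitrary placeholders, and the handler simply parses the JSON body so that the load logic in Step 3 can pick it up.

from flask import Flask, request, jsonify

app = Flask(__name__)

# The route path '/webhook' and the port are assumptions; use whichever URL you register.
@app.route("/webhook", methods=["POST"])
def receive_webhook():
    payload = request.get_json(force=True)  # webhook providers deliver JSON in the POST body
    # Hand the parsed payload to your load logic (see Step 3 below).
    print(payload)
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(port=8080)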
Step 2: Preparing Data received from Webhook
Webhooks post data to your specified endpoints in JSON format. It is up to you to parse the JSON objects and determine how to load them into your BigQuery data warehouse.
You need to make sure the target BigQuery table is aligned with the source data layout, specifically the column order and the data type of each column.
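For example, assuming a target table with columns id (STRING), email (STRING), and created (TIMESTAMP), a small mapping function could look like the sketch below. The event field names are hypothetical and only meant to illustrate the idea.

from datetime import datetime, timezone

# Hypothetical webhook event; the field names are illustrative assumptions.
event = {"id": "evt_123", "email": "user@example.com", "created": "2023-05-01T10:15:00Z"}

def to_bigquery_row(event):
    # Return a dict whose keys and types match the target table's columns,
    # assumed here to be: id STRING, email STRING, created TIMESTAMP.
    created = datetime.fromisoformat(event["created"].replace("Z", "+00:00"))
    return {
        "id": str(event["id"]),
        "email": event.get("email"),
        "created": created.astimezone(timezone.utc).isoformat(),
    }

row = to_bigquery_row(event)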
Step 3: Loading data into Google BigQuery
You can load data into BigQuery directly using the streaming API, or you can write the data to a CSV file and then load that file into a BigQuery table.
The Python sketch below takes the JSON payload received at your Webhook endpoint (for example, from the handler in Step 1) and streams it into a BigQuery table:
from google.cloud import bigquery

client = bigquery.Client()

dataset_id = 'dataset_name'  # replace with your dataset ID
table_id = 'table_name'      # replace with your table ID

table_ref = client.dataset(dataset_id).table(table_id)
table = client.get_table(table_ref)  # API request to fetch the table and its schema

def load_webhook_payload(payload):
    # payload: the JSON body received from the webhook (a dict or a list of dicts)
    # Convert the received data into rows keyed by column name
    rows_to_insert = payload if isinstance(payload, list) else [payload]
    errors = client.insert_rows(table, rows_to_insert)  # streaming insert API request
    assert errors == [], errors
Alternatively, you can buffer the streaming data into files at a specific interval and use the bq command-line tool to load the files into your dataset, supplying schema and data type information. You can find the syntax of the bq command line in the Google Cloud documentation. Repeat this process as many times as it takes to load all of your tables into BigQuery.
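As a rough sketch of this file-based approach, you could buffer incoming rows and periodically flush them to a CSV file that you then upload and load with bq. The column names below are assumptions; remember to pass --skip_leading_rows=1 when loading a CSV that has a header row.

import csv

# Flush a buffered batch of webhook rows to a local CSV file.
# The column names are assumptions matching the earlier mapping example.
def write_batch_to_csv(rows, path):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "email", "created"])
        writer.writeheader()
        writer.writerows(rows)

# e.g. call this every 60 seconds with whatever rows have accumulated
write_batch_to_csv(
    [{"id": "evt_123", "email": "user@example.com", "created": "2023-05-01T10:15:00+00:00"}],
    "batch_0001.csv",
)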
Once the data has been extracted from your application using the Webhook, the next step is to upload it to Google Cloud Storage (GCS). There are multiple ways to upload data to GCS.
Upload file to GCS bucket
Using gsutil: You can upload a local file to a GCS (Google Cloud Storage) bucket with the gsutil utility. To copy a file to GCS:
gsutil cp local_folder/file_name.csv gs://gcs_bucket_name/path/to/folder/
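If you prefer to stay in Python, roughly the same copy can be done with the google-cloud-storage client library. The bucket name and paths below simply mirror the placeholders from the gsutil command above.

from google.cloud import storage

# Upload a local file to a GCS bucket; names and paths are placeholders.
def upload_to_gcs(bucket_name, source_path, destination_blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_path)

upload_to_gcs("gcs_bucket_name", "local_folder/file_name.csv", "path/to/folder/file_name.csv")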
Using the Web console: An alternative way to upload data from your local machine to GCS is the web console. To use the web console, follow the steps below.
- Log in to your GCP account. You must have a working Google account with access to GCP. From the menu, click on Storage and navigate to Browser in the left tab.
- If needed, create a bucket to upload your data to. Make sure that the name you choose for the bucket is globally unique.
- Click on the name of the bucket you created in step #2; this lets you browse for the file on your local machine.
- Choose the file and click on the Upload button. A progress bar will appear; wait for the upload to complete. You will then see the file in the bucket.
Create Table in BigQuery
- Go to BigQuery from the menu.
- In the Google Cloud console, click the Create dataset option, then provide a dataset name and location.
- Next, click on the name of the created dataset. In the Google Cloud console, click the Create table option and provide the dataset name, table name, project name, and table type. (A client-library sketch that does the same thing follows this list.)
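If you would rather create the dataset and table programmatically, here is a sketch using the BigQuery client library. The dataset name, table name, location, and column schema are placeholders you would replace with your own.

from google.cloud import bigquery

client = bigquery.Client()

# Create the dataset (names and location are placeholders).
dataset = bigquery.Dataset(f"{client.project}.dataset_name")
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)

# Create the table with an explicit schema; the columns are assumptions for illustration.
schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("email", "STRING"),
    bigquery.SchemaField("created", "TIMESTAMP"),
]
table = bigquery.Table(f"{client.project}.dataset_name.table_name", schema=schema)
client.create_table(table, exists_ok=True)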
Load the data into BigQuery Table
Start the command-line tool by clicking on the Cloud Shell icon in the console.
The syntax of the bq command to load a file into a BigQuery table is given below.
Note: The --autodetect flag tells BigQuery to infer the table schema.
bq --location=[LOCATION] load --source_format=[FORMAT]
[DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]
- [LOCATION] is an optional parameter that represents the location name, such as "US"
- [FORMAT] is the source format; set it to CSV to load a CSV file
- [DATASET] is the dataset name
- [TABLE] is the table name to load the data into
- [PATH_TO_SOURCE] is the path to the source file in the GCS bucket
- [SCHEMA] specifies the schema
You can specify your schema using the bq command line:
bq --location=US load --source_format=CSV your_dataset.your_table gs://your_bucket/your_data.csv ./your_schema.json
Your target table schema can also be autodetected:
bq --location=US load --autodetect --source_format=CSV your_dataset.your_table gs://mybucket/data.csv
The BigQuery command-line tool gives you three options for writing to an existing table.
Overwrite the table
bq --location=US load --autodetect --replace --source_format=CSV your_target_dataset_name.your_target_table_name gs://source_bucket_name/path/to/file/source_file_name.csv
Append data to the table
bq --location=US load --autodetect --noreplace --source_format=CSV your_target_dataset_name.your_target_table_name gs://source_bucket_name/path/to/file/source_file_name.csv
Adding new fields in the target table
bq --location=US load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=CSV your_target_dataset.your_target_table gs://bucket_name/source_data.csv ./target_schema.json
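The same behaviours can also be expressed from Python with a load job configuration, roughly as sketched below (this assumes a reasonably recent google-cloud-bigquery client where LoadJobConfig accepts these keyword arguments). WRITE_TRUNCATE corresponds to --replace, WRITE_APPEND to --noreplace, and ALLOW_FIELD_ADDITION to the schema update option; the bucket and table names are the same placeholders as above.

from google.cloud import bigquery

client = bigquery.Client()

# Configure how the load job writes to the existing table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,      # append (like --noreplace)
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],  # allow new columns
)

load_job = client.load_table_from_uri(
    "gs://source_bucket_name/path/to/file/source_file_name.csv",
    "your_target_dataset_name.your_target_table_name",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete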
Update data in the BigQuery Table
The loads described above do not, by themselves, complete the data update on the target table. The data first lands in an intermediate table, because GCS acts as a staging area for the BigQuery upload. There are two ways of updating the target table from this intermediate table, as described here.
Update the rows in the target table, then insert the new rows from the intermediate table:
UPDATE target_table t
SET value = s.value
FROM intermediate_table s
WHERE t.id = s.id;
INSERT target_table (id, value)
SELECT id, value
FROM intermediate_table
WHERE id NOT IN (SELECT id FROM target_table);
Delete from the target table all the rows that are present in the intermediate table, then insert all the rows newly loaded into the intermediate table. Here the intermediate table is used in truncate-and-load mode:
DELETE FROM data_set_name.target_table t
WHERE t.id IN (SELECT id FROM data_set_name.intermediate_table);
INSERT data_set_name.target_table (id, value)
SELECT id, value
FROM data_set_name.intermediate_table;
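If you want to run this truncate-and-load update from your pipeline code rather than the console, a sketch using the Python client is shown below; the dataset and table names are the same placeholders used above, and BigQuery scripting lets both statements run in a single query job.

from google.cloud import bigquery

client = bigquery.Client()

# Dataset and table names are placeholders; adjust them to your project.
sql = """
DELETE FROM data_set_name.target_table
WHERE id IN (SELECT id FROM data_set_name.intermediate_table);
INSERT data_set_name.target_table (id, value)
SELECT id, value FROM data_set_name.intermediate_table;
"""
client.query(sql).result()  # wait for both statements to finish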
Limitations of writing custom Scripts to stream data from Webhook to BigQuery
- The above code is built for a specific schema coming from the Webhook source. The scripts are likely to break if the source schema is modified.
- If, in the future, you need to apply transformations to your incoming webhook events, you will have to invest additional time and resources in building them.
- If the volume of incoming data spikes, you may have to throttle the data being moved to BigQuery.
- Since you are dealing with real-time streaming data, you need strong alerting and notification systems to avoid data loss due to an anomaly at the source or destination end. Because webhooks are triggered by specific events, such data loss can be very grave for your business.
Webhook to BigQuery ETL Using Hevo Data
A much easier way to get rid of all the complexities of the custom-code method is to implement a fully managed Data Pipeline solution like Hevo Data. Hevo can be set up in minutes and helps you move data from Webhooks to BigQuery in 2 simple steps:
- Connect and configure your Webhook endpoint URL
- Configure the BigQuery warehouse to which the data has to be streamed
Sign up here for a 14-day free trial!
In addition to webhooks, Hevo can move data from a variety of data sources (Databases, Cloud Applications, SDKs and more). Hevo ensures that your data is reliably and securely moved from any source to BigQuery in real-time.
Before you take a call on the right approach to move data from Webhook to BigQuery, sign up and experience Hevo's hassle-free Data Pipeline platform.