One of the most valuable capabilities in any enterprise is processing stream or batch data with a platform like Dataflow and storing it in a cloud-based Data Warehouse like Google BigQuery. This is where the Dataflow to BigQuery combination comes into play.

Are you struggling to find a simple step-by-step guide to help you stream data from Dataflow to BigQuery? Do you want a simple fix? If yes, you’ve landed at the right place. Follow this easy guide to transfer data from Dataflow to BigQuery in no time. Upon a complete walkthrough of the content, you will be able to successfully stream your data to BigQuery using Dataflow for fruitful analysis. It will also help you build a customized ETL pipeline for your organization.

Prerequisites

  • Working knowledge of Google BigQuery.
  • Working knowledge of Dataflow.
  • A Google BigQuery Account.
Stream your Data to Google BigQuery using Hevo’s No-Code Data Pipeline

Effortlessly transfer your data from Dataflow to BigQuery using Hevo’s no-code platform! Automate your data migration process and enjoy real-time syncing with minimal effort.

Why Does Hevo Stand Out?

Explore Hevo and see why BeepKart chose us to enhance their data pipeline efficiency.

Get Started with Hevo for Free

What is Google BigQuery?


Google BigQuery is a fully-managed data warehouse offered by Google. It provides users with features such as BigQuery ML, BigQuery QnA, and Connected Sheets, apart from a comprehensive querying layer that delivers exceptional performance and fast query response times. It also allows users to bring in data from a wide variety of external sources such as Cloud SQL, Google Drive, Sheets, Zendesk, Salesforce, etc.

Some Key Features of Google BigQuery are given below:

  • Scalability: BigQuery offers true scalability and consistent performance through its massively parallel compute and secure storage engine.
  • Data Ingestion Formats: BigQuery allows users to load data in various formats such as Avro, CSV, and JSON.
  • Built-in AI & ML: It supports predictive analysis through its AutoML Tables feature, a codeless interface that helps develop models with best-in-class accuracy. BigQuery ML is another feature that supports algorithms such as K-means and Logistic Regression.
  • Parallel Processing: It uses a cloud-based parallel query processing engine that reads data from thousands of disks at the same time. This is one of the main factors that enables the efficient transfer of data from Dataflow to BigQuery.
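If you prefer working from code rather than the BigQuery console, the snippet below is a minimal sketch of running a query with Google’s official @google-cloud/bigquery Node.js client; the project, dataset, and table names are placeholders you would replace with your own:

const {BigQuery} = require('@google-cloud/bigquery');

async function runQuery() {
  // Create a client; credentials are picked up from the environment.
  const bigquery = new BigQuery();
  // Placeholder project, dataset, and table names.
  const [rows] = await bigquery.query({
    query: 'SELECT name, city FROM `my_project.my_dataset.customers` LIMIT 10',
  });
  rows.forEach((row) => console.log(row.name, row.city));
}

runQuery().catch(console.error);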

What is Dataflow?


Dataflow is a fully-managed data processing service by Google that follows a pay-as-you-go pricing model. Being serverless, it helps organizations leverage the robust functionality of ETL without the effort of building and maintaining ETL tooling themselves. It provisions and manages all the resources required to carry out data processing operations.


Need for Streaming Data from Dataflow to BigQuery

Google BigQuery, being a fully-managed Data Warehouse service, relies on users bringing in data from a variety of sources. A typical enterprise architecture continuously updates data from a transactional database to ensure that analysts always have up-to-date data in the BigQuery Data Warehouse.

Writing custom code to manage data loading operations across diverse data types is challenging, time-consuming, and requires a lot of technical expertise. It is therefore usually better to use a fully-managed service like Dataflow for such operations.

Steps to Stream Data from Dataflow to BigQuery

Google provides users with a diverse set of open-source templates to set up a streaming workflow from Dataflow to BigQuery. With Google Dataflow in place, you can create a job using one of these predefined templates to transfer data to BigQuery.

This can be implemented using the following steps:

Step 1: Using a JSON File to Define your BigQuery Table Structure

Step 1.1: To start streaming data from Dataflow to BigQuery, you first need a JSON file that defines the structure of your BigQuery table. Create the file outlining the table structure as follows:

{
    "BigQuery Schema": [{
        "name": "name",
        "type": "STRING"
    }, {
        "name": "gender",
        "type": "STRING"
    }, {
        "name": "age",
        "type": "STRING"
    }, {
        "name": "city",
        "type": "STRING"
    }]
}
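Because a malformed schema file is a common reason for template jobs to fail, it can be worth sanity-checking the file locally before uploading it. The following is an optional Node.js sketch; the file name schema.json is a placeholder:

const fs = require('fs');

// Parse the schema file; JSON.parse will throw on invalid JSON.
const schema = JSON.parse(fs.readFileSync('schema.json', 'utf8'));

// Confirm that every field declares both a name and a type.
schema['BigQuery Schema'].forEach(function (field) {
  if (!field.name || !field.type) {
    throw new Error('Malformed field: ' + JSON.stringify(field));
  }
});
console.log('Schema looks OK.');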

Step 1.2: Once you’ve created the JSON file, you need some data to load. Create a text file that will act as the data input, containing the following records:

Jeremiah,f,31,cairo
Sam,m,35,seattle
Rock,m,40,new york

Step 1.3: To convert the input file’s records into the schema of your BigQuery table, create a JavaScript function that transforms each line of your input file into your desired BigQuery table schema as follows:

function transform(line) {
    // Split the CSV line into its individual values.
    var values = line.split(',');
    // Map each value onto the corresponding BigQuery column.
    var custObj = new Object();
    custObj.name = values[0];
    custObj.gender = values[1];
    custObj.age = values[2];
    custObj.city = values[3];
    // Return the record as a JSON string, as the template expects.
    var outJson = JSON.stringify(custObj);
    return outJson;
}
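Before uploading the function, you can quickly test it locally with Node.js by appending a line such as the following to the file and running it; the expected output is shown as a comment:

// Feed one sample line through the function and inspect the result.
console.log(transform('Jeremiah,f,31,cairo'));
// Expected output: {"name":"Jeremiah","gender":"f","age":"31","city":"cairo"}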

Step 1.4: Save this JavaScript function as “transformfunction.js” in your Google Cloud Storage bucket. This is how you can configure the schema for your BigQuery table and transform your data using JavaScript.
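You can upload the files through the Cloud Console UI or gsutil; as an alternative, here is a small sketch using Google’s official @google-cloud/storage Node.js client, with the bucket name as a placeholder:

const {Storage} = require('@google-cloud/storage');

async function uploadArtifacts() {
  const storage = new Storage();
  // Placeholder bucket name; upload both the UDF and the schema file.
  await storage.bucket('my-dataflow-bucket').upload('transformfunction.js');
  await storage.bucket('my-dataflow-bucket').upload('schema.json');
}

uploadArtifacts().catch(console.error);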

Step 2: Creating Jobs in Dataflow to Stream Data from Dataflow to BigQuery

Step 2.1: Once you’ve configured your BigQuery table schema and input data, the next step in transferring data from Dataflow to BigQuery is to go to the official Google Dataflow Console and log in with your credentials:

Dataflow to BigQuery: Sign In to Google Cloud Platform

Step 2.2: The Dataflow Console will now open up on your screen. Click on the “Create job from template” option at the top of your screen. The job template window will now open up as follows:

Dataflow to BigQuery: Configuring Job Template

Step 2.3: Configure the job template by providing the following information carefully:

  • Dataflow Template: Here, you need to select the “Text Files on Cloud Storage to BigQuery” option from the drop-down list.
  • Required Parameters: Here, you need to provide the locations of the JavaScript and JSON schema files that you created in the previous steps.

Step 2.4: To finish configuring your Dataflow job template, you now need to provide the name of the target BigQuery table, the location of your input text file, and your temporary directory location; an example set of values appears below.
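As a rough guide, a filled-in set of parameters for this template typically looks like the following; every bucket, project, and dataset name here is a placeholder, and you should confirm the exact field labels shown in your console, since they can vary between template versions:

JSON schema file path: gs://my-dataflow-bucket/schema.json
JavaScript UDF path in Cloud Storage: gs://my-dataflow-bucket/transformfunction.js
JavaScript UDF function name: transform
Input file pattern: gs://my-dataflow-bucket/input/customers.txt
BigQuery output table: my-project:my_dataset.customers
Temporary directory for BigQuery loading: gs://my-dataflow-bucket/tmp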

Step 2.5: Once you’ve provided all the necessary information, select the Google-managed encryption key and click on the “Run Job” option. You can track the progress of the job from the Dataflow Console, and once the job finishes, you can log in to your BigQuery account and check the target table.
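If you would rather verify the load from code than from the BigQuery UI, a quick row count works; this sketch again uses the Node.js client with placeholder names:

const {BigQuery} = require('@google-cloud/bigquery');

async function verifyLoad() {
  const bigquery = new BigQuery();
  // Placeholder project, dataset, and table names.
  const [rows] = await bigquery.query({
    query: 'SELECT COUNT(*) AS total FROM `my_project.my_dataset.customers`',
  });
  console.log('Rows loaded:', rows[0].total);
}

verifyLoad().catch(console.error);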

Limitations

Although the above method of transferring data from Dataflow to BigQuery works, it has multiple limitations:

  • The above approach involved only a few clicks because of the built-in templates. These templates exist only for the most common scenarios and mainly cover Google-based source and target systems.
  • Dataflow’s integration with third-party cloud services is limited. If you are using a service like Marketo or Salesforce, you will have a tough time loading the data into BigQuery, since it involves accessing the service’s APIs to capture data.

Conclusion

This article taught you how to stream data from Dataflow to BigQuery. It provides in-depth knowledge about the concepts behind every step to help you understand and implement them efficiently. These steps, however, are manual and can be very taxing; implementing them consumes time and resources, and writing custom scripts is error-prone.

Hevo caters to 150+ data sources (including 60+ free sources) and can seamlessly ETL your data to BigQuery within minutes. Try a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also, check out our transparent pricing page for more information.

Tell us about your experience of streaming data from Dataflow to BigQuery! Let us know in the comments section below.

Frequently Asked Questions

1. What is the difference between BigQuery and Dataflow?

BigQuery is a fully managed, SQL-based analytics service designed for reporting and data analysis. Dataflow processes real-time stream and batch data, with a deep emphasis on data ingestion and transformation.

2. Is Dataflow an ETL?

Dataflow is somewhat similar to an ETL tool in many ways: it extracts data, transforms it, and loads it. It can pull data from various sources, process and transform it, and feed it into storage solutions such as Google BigQuery or Cloud Storage.

3. Why do we use dataflow in GCP?

Dataflow in GCP is a powerful tool for stream and batch processing, capable of processing versatile data workflows. It automatically scales based on demand, freeing developers from infrastructure management. It is seamlessly integrated into GCP services such as BigQuery and Pub/Sub, supporting complex event processing and transformation. 

Talha
Software Developer, Hevo Data

Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent four years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves explaining complex data engineering topics to his peers through writing. He has written many blogs on data integration, data management, and the key challenges data practitioners face.