Understanding BigQuery Transaction: An Easy Guide 101

on Data Warehouse, Google BigQuery • March 29th, 2022


Google BigQuery is a fully managed data warehouse service offered by Google on a subscription basis. Fully managed warehouse services like BigQuery separate storage and compute costs, allowing customers to pay only for what they use. Google BigQuery has a very capable SQL layer and can handle petabytes of data.

Google BigQuery can process data indexed as internal tables or stored in external sources like Google Cloud Storage using a schema-on-read model. Google BigQuery is hence considered to offer the best of both a data warehouse and a data lake. Google BigQuery supports multi-cloud deployment and hence allows organizations to avoid vendor lock-in. It also provides many other nice-to-have features such as native machine learning support in SQL queries, geospatial analysis, and natural language-based querying, to name a few.

This article explains Google BigQuery Transaction in detail. It also gives an introduction to multi-statement transaction support in Google BigQuery and how to use it. 


Prerequisites

  • Google Cloud Account with Google BigQuery Storage API permission
  • Google BigQuery Editor IAM permission and Google BigQuery Writer permission to the specific dataset. 
  • Basic understanding of database concepts

What is Google BigQuery?

Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.

Launched in 2010, Google BigQuery is a Cloud-Based Data Warehouse service offered by Google. It is built to handle petabytes of data and can automatically scale as your business flourishes. Developers at Google have designed its architecture keeping the storage and computing resources separate. This makes querying more fluid as you can scale them independently without sacrificing performance.

Since there is no physical infrastructure present similar to the conventional server rooms for you to manage and maintain, you can focus all your workforce and effort on important business goals. Using standard SQL, you can accurately analyze your data and execute complex queries from multiple users simultaneously.

Google BigQuery is fully managed by Cloud service providers. You don’t need to deploy any resources, such as disks or virtual machines. It is designed to process read-only data. Dremel and Google BigQuery use Columnar Storage for quick data scanning, as well as a tree architecture for executing queries using ANSI SQL and aggregating results across massive computer clusters. Furthermore, owing to its short deployment cycle and on-demand pricing, Google BigQuery is serverless and designed to be extremely scalable.

For further information about Google BigQuery, follow the Official Documentation.

Key Features of Google BigQuery

Google BigQuery has continuously evolved over the years and offers some of the most intuitive features:

  • User Friendly: With just a few clicks, you can start storing and analyzing your data in BigQuery. An easy-to-understand interface with simple instructions at every step lets you set up your cloud data warehouse quickly, as you don’t need to deploy clusters, set your storage size, or configure compression and encryption settings.
  • On-Demand Storage Scaling: With ever-growing data needs, you can rest assured that it will scale automatically when required. Based on Colossus (Google’s global storage system), it stores data in a columnar format and can work directly on compressed data without decompressing the files on the fly.
  • Real-Time Analytics: Stay updated with real-time data transfer and accelerated analytics, as Google BigQuery optimally allocates resources to deliver the best performance and return results whenever business reports are requested.
  • Google BigQuery ML: Armed with machine learning capabilities, you can effectively design and build data models using existing SQL Commands. This eliminates the need for technical know-how of machine learning and empowers your data analysts to directly evaluate ML models.
  • Optimization Tools: To boost your query performance, Google provides BigQuery partitioning and clustering features for faster results. You can also change the default dataset and table expiration settings for optimal storage costs and usage.
  • Secure: Google BigQuery allows administrators to set access permissions to the data by groups and individuals. You can also enable row-level security for access to certain rows of a dataset. Data is encrypted before being written on the disk as well as during the transit phase. It also allows you to manage the encryption keys for your data.
  • Google Environment: Maintained and managed by Google, Google BigQuery enjoys the easy and fluid integrations with various applications present in the Google Ecosystem. With little to no friction at all, you can connect to platforms such as Google Sheets and Google Data Studio for further analysis. 
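To illustrate the partitioning, clustering, and expiration settings mentioned above, a minimal DDL sketch might look like the following (the dataset, table, and column names are hypothetical):

```sql
-- Hypothetical table: partitioned by day, clustered by customer,
-- with a partition expiration setting to control storage costs.
CREATE TABLE IF NOT EXISTS mydataset.sales
(
  order_ts    TIMESTAMP,
  customer_id STRING,
  amount      NUMERIC
)
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id
OPTIONS (partition_expiration_days = 365);
```

Queries that filter on `order_ts` then scan only the relevant partitions, and filters on `customer_id` benefit from clustering.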

Benefits of Google BigQuery

In terms of analytics, Google BigQuery is a fantastic choice. It offers a managed data analytics service that makes it easier for customers to manage and run large analyses in the cloud. However, there are many options from other vendors, so in this blog, I’d like to highlight four key advantages of Google BigQuery.

  • Distributed Architecture: Google’s distributed architecture dynamically distributes Google BigQuery’s computing across compute resources, removing the need to manage compute clusters. Competing offerings frequently require custom sizing (and pricing) of specific compute clusters, which can be difficult to manage over time.
  • Flexible Pricing Options: Because Google allocates resources dynamically, prices are also dynamic. Google provides a pay-as-you-go option in which you pay for the data imported into Google BigQuery as well as per-query charges. They provide a reporting tool as part of this approach to provide more visibility into usage and cost trends. For larger users, fixed pricing is also an option.
  • Fully Managed: Because Google BigQuery is a fully managed service, Google handles the backend configuration and tuning. This is far more straightforward than competing solutions, which require you to select the number and type of clusters to create and manage over time.
  • High Availability: To ensure high availability, Google BigQuery replicates data between zones automatically. It also load balances automatically for optimal performance and to reduce the impact of any hardware failures. This differs from competing solutions, which usually concentrate on a single zone.

Simplify Google BigQuery ETL with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ Data Sources (including 40+ Free Sources) to a Data Warehouse such as Google BigQuery or Destination of your choice in real-time in an effortless manner. Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. In addition to the 100+ data sources, Hevo also allows integrating data from non-native sources using Hevo’s in-built Webhooks Connector.

Get Started with Hevo for Free

Its strong integration with umpteen sources allows users to bring in data of different kinds in a smooth fashion without having to code a single line.

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Connectors: Hevo supports 100+ Integrations to SaaS platforms, Files, Databases, BI tools such as Tableau, and Native REST API & Webhooks Connectors. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; Databricks (Connector Live Soon!); and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.  
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (Including 40+ Free Sources) that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Understanding Google BigQuery Transactions

Understanding Google BigQuery Transactions: What are Google BigQuery Transactions?

A database transaction is a unit of work performed against a database that represents a change to the state of the data and is treated independently of other transactions happening at the same time. ACID properties for databases dictate that all transactions should be atomic, consistent, isolated, and durable. While this is straightforward in the case of single-node relational databases, enforcing ACID properties comes with challenges in the case of massively parallel processing databases like Google BigQuery. 

Despite these challenges, Google BigQuery is ACID-compliant and offers multi-statement transactions as long as the data is stored in Google BigQuery storage and not in external sources. A multi-statement transaction allows users to perform multiple operations such as inserts, deletes, and updates, and to commit or roll back the group of operations as a single block. Multi-statement Google BigQuery transactions can group operations over multiple tables or a single table.

Google BigQuery transactions support snapshot isolation, which concerns two aspects of transaction behavior. While a BigQuery transaction is in progress, all reads return a consistent state of the tables involved. If one of the statements in the transaction modifies a table, the changes are visible to all subsequent statements in that transaction.

Google BigQuery transactions are bundled in scripts and cannot be nested. That is, if you begin a transaction, another one cannot start within the same script until the first one is committed or rolled back.

Multi-statement transactions are supported by Google BigQuery within scripts. A multi-statement Google BigQuery transaction allows you to perform mutating operations on one or more tables, such as inserting or deleting rows, and atomically commit or roll back the changes.

Multi-statement transactions can be used for a variety of purposes, including the following:

  • Using a single transaction to perform DML mutations across multiple tables. The tables can reference multiple datasets or projects.
  • Using intermediate computations, performing mutations on a single table in stages.
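As a minimal sketch of the first use case (the dataset, table, and column names are hypothetical), a single transaction mutating two tables might look like this:

```sql
-- Hypothetical tables: both mutations commit or roll back together.
BEGIN TRANSACTION;

INSERT INTO sales_dataset.orders (order_id, product, amount)
VALUES (1001, 'widget', 49.99);

UPDATE inventory_dataset.stock
SET quantity = quantity - 1
WHERE product = 'widget';

COMMIT TRANSACTION;
```

If either statement fails, neither table is left half-updated.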

Transactions provide snapshot isolation and ensure ACID properties. All reads during a transaction return a consistent snapshot of the transaction’s tables. If a statement in a transaction modifies a table, subsequent statements in the transaction will see the changes.

Understanding Google BigQuery Transactions: Executing Google BigQuery Transactions

Google BigQuery transactions start with BEGIN TRANSACTION and end when the script reaches COMMIT TRANSACTION or ROLLBACK TRANSACTION. If the script ends before reaching either of these, Google BigQuery automatically rolls back the transaction. The developer also has the ability to define an exception handler. If one is defined, Google BigQuery transfers control to it when an error occurs, and inside the handler the developer can choose whether to commit or roll back the transaction.
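The control flow described above can be sketched as a minimal skeleton (the table name is hypothetical):

```sql
BEGIN
  BEGIN TRANSACTION;

  -- Mutating statements go here, for example:
  DELETE FROM my_dataset.my_table WHERE stale = TRUE;

  COMMIT TRANSACTION;

EXCEPTION WHEN ERROR THEN
  -- Surface the error message, then undo all changes.
  SELECT @@error.message;
  ROLLBACK TRANSACTION;
END;
```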

Combining everything from the sections above, let us define a transaction based on a standard slowly changing dimension requirement. The requirement is to update a table that contains product prices with new prices for a specific category, as well as to add new arrivals. You also need to delete the processed entries from the table that contains the new prices.

Assume there are two tables:

CREATE OR REPLACE TABLE product_dataset.product_prices
(
  product STRING,
  price NUMERIC
);

CREATE OR REPLACE TABLE product_dataset.new_prices
(
  product STRING,
  price NUMERIC,
  category STRING
);

The transaction that merges the new prices into the product_prices table looks as follows.

BEGIN

BEGIN TRANSACTION;

-- Create a temporary table that holds new prices for 'category #1'.
CREATE TEMP TABLE tmp
  AS SELECT * FROM product_dataset.new_prices WHERE category = 'category #1';

-- Delete the matching records from the new_prices table.
DELETE FROM product_dataset.new_prices WHERE category = 'category #1';

-- Merge the records from the temporary table into the product_prices table.
MERGE product_dataset.product_prices AS I
USING tmp AS T
ON I.product = T.product
WHEN NOT MATCHED THEN
  INSERT (product, price)
  VALUES (T.product, T.price)
WHEN MATCHED THEN
  UPDATE SET price = T.price;

-- Drop the temporary table and commit the transaction.
DROP TABLE tmp;

COMMIT TRANSACTION;

EXCEPTION WHEN ERROR THEN
  -- Roll back the transaction inside the exception handler.
  SELECT @@error.message;
  ROLLBACK TRANSACTION;

END;

The above transaction snippet first loads the entries for the specific category into a temporary table. It then updates the prices for the products that already exist in the product_prices table and adds the products that do not exist. The temporary table is then dropped and the transaction is committed. 

Wrapping the sequence in a transaction ensures that the data is not corrupted even if one statement fails. The transaction is safely rolled back in those cases. 

The types of statements supported inside Google BigQuery transactions are:

  • SELECT statements.
  • DML statements like INSERT, UPDATE, DELETE, MERGE.
  • Creating and dropping temporary tables. Transactions do not support creating or deleting permanent entities such as datasets and tables.

Concurrency is another point developers need to be aware of when executing transactions. When a transaction mutates rows in a table, other transactions that mutate rows in the same table cannot run at the same time. Operations that read from or append rows to the table can still be executed.

If you use time functions in a transaction, the time returned reflects the moment the transaction started. Use of the FOR SYSTEM_TIME AS OF clause is prohibited inside transactions.
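As a small illustrative sketch, every call to CURRENT_TIMESTAMP() within a transaction returns the transaction's start time, no matter how long the intervening statements take:

```sql
BEGIN TRANSACTION;

SELECT CURRENT_TIMESTAMP() AS t1;  -- the transaction's start time

-- ... long-running DML statements here ...

SELECT CURRENT_TIMESTAMP() AS t2;  -- same value as t1

COMMIT TRANSACTION;
```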

Understanding Google BigQuery Transactions: Monitoring and Managing Google BigQuery Transactions

Unlike single statements, multi-statement transactions often involve time-consuming tasks and are executed as long-running jobs. Google BigQuery offers ways to view the running jobs and terminate the jobs if required. 

You can view the list of running jobs by querying the INFORMATION_SCHEMA.JOBS_BY_* views. These views contain information about all the jobs that have run in the last 180 days.

To view the list of all the transactions running on a particular table, the following query can be used.

WITH
  running_transactions AS (
    SELECT DISTINCT transaction_id
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE state = "RUNNING")
SELECT
  jobs.transaction_id, jobs.parent_job_id, jobs.query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS jobs
JOIN running_transactions
  ON jobs.transaction_id = running_transactions.transaction_id
WHERE jobs.destination_table.table_id = "name_of_your_target_table";

To terminate a transaction, pass its parent_job_id to the BQ.JOBS.CANCEL procedure, as shown in the query below.

CALL BQ.JOBS.CANCEL('JOB_ID');

Limitations of Google BigQuery Transactions

  • A Google BigQuery transaction cannot span multiple scripts.
  • DDL statements affecting permanent entities are not allowed in Google BigQuery transactions.
  • Materialized views are transformed into logical views within a transaction. A materialized view can still be queried within a Google BigQuery transaction, but it has no performance or cost advantages over a logical view.

Conclusion

You have learned all about Google BigQuery transactions and the nuances of using them. Transactions let you treat a sequence of statements as a single block of work and thereby provide the ability to commit or roll back the whole block safely. This helps prevent data corruption. On another note, if you use Google BigQuery as the data warehouse service in your organization, you might want to check out a service that makes loading and extracting data from Google BigQuery a frictionless experience.

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks.

Visit our Website to Explore Hevo

Hevo Data with its strong integration with 100+ Data Sources (including 40+ Free Sources) allows you to not only export data from your desired data sources & load it to the destination of your choice such as Google BigQuery but also transform & enrich your data to make it analysis-ready. Hevo also allows integrating data from non-native sources using Hevo’s in-built REST API & Webhooks Connector. You can then focus on your key business needs and perform insightful analysis using BI tools. 

Want to give Hevo a try? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at our pricing, which will assist you in selecting the best plan for your requirements.
