Google BigQuery is a fully managed data warehouse service offered by Google on a pay-as-you-go basis. Fully managed warehouse services like BigQuery separate storage and compute costs, allowing customers to pay only for what they use. Google BigQuery has a very capable SQL layer and can handle petabytes of data.
Google BigQuery can process data indexed as internal tables or stored in external sources like Google Cloud Storage using a schema-on-read model. BigQuery is hence considered to offer the best of both a data warehouse and a data lake. It supports multi-cloud deployment, which helps organizations avoid vendor lock-in. It also provides many other useful features, such as native machine learning support in SQL queries, geospatial analysis, and natural language-based querying.
This article explains Google BigQuery transactions in detail. It introduces multi-statement transaction support in Google BigQuery and shows how to use it.
Prerequisites
- Google Cloud Account with Google BigQuery Storage API permission
- Google BigQuery Data Editor IAM role (or equivalent write permission) on the specific dataset.
- Basic understanding of database concepts
What is Google BigQuery?
Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.
Launched in 2010, Google BigQuery is a Cloud-Based Data Warehouse service offered by Google. It is built to handle petabytes of data and can automatically scale as your business grows. Google's engineers designed its architecture to keep storage and compute resources separate. This makes querying more fluid, as you can scale each independently without sacrificing performance.
Since there is no physical infrastructure present similar to the conventional server rooms for you to manage and maintain, you can focus all your workforce and effort on important business goals. Using standard SQL, you can accurately analyze your data and execute complex queries from multiple users simultaneously.
Google BigQuery is fully managed by Google. You don't need to deploy any resources, such as disks or virtual machines. It is designed primarily for read-heavy analytical workloads. Dremel and Google BigQuery use columnar storage for quick data scanning, along with a tree architecture for executing ANSI SQL queries and aggregating results across massive computer clusters. Furthermore, owing to its short deployment cycle and on-demand pricing, Google BigQuery is serverless and designed to be extremely scalable.
For further information about Google BigQuery, follow the Official Documentation.
Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.
Key Features of Google BigQuery
Google BigQuery has evolved continuously over the years and offers some of the most intuitive features:
- User Friendly: With just a few clicks, you can start storing and analyzing your data in BigQuery. An easy-to-understand interface with simple instructions at every step lets you set up your cloud data warehouse quickly, since you don't need to deploy clusters or configure storage size, compression, or encryption settings.
- On-Demand Storage Scaling: With ever-growing data needs, you can rest assured that storage will scale automatically when required. Built on Colossus (Google's global storage system), BigQuery stores data in a columnar format and can operate directly on compressed data without decompressing it first.
- Real-Time Analytics: Stay up to date with real-time data transfer and accelerated analytics. Google BigQuery allocates resources as needed to deliver the best performance and return results quickly, so you can generate business reports on request.
- Google BigQuery ML: Armed with machine learning capabilities, you can effectively design and build data models using existing SQL Commands. This eliminates the need for technical know-how of machine learning and empowers your data analysts to directly evaluate ML models.
- Optimization Tools: To boost your query performance, Google provides BigQuery partitioning and clustering features for faster results. You can also change the default dataset and table expiration settings to optimize storage costs and usage.
- Secure: Google BigQuery allows administrators to set access permissions to the data by groups and individuals. You can also enable row-level security for access to certain rows of a dataset. Data is encrypted before being written on the disk as well as during the transit phase. It also allows you to manage the encryption keys for your data.
- Google Environment: Maintained and managed by Google, Google BigQuery enjoys the easy and fluid integrations with various applications present in the Google Ecosystem. With little to no friction at all, you can connect to platforms such as Google Sheets and Google Data Studio for further analysis.
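To make the optimization tools above concrete, here is a minimal sketch (the dataset and column names are hypothetical, for illustration only) that creates a partitioned, clustered table and sets a partition expiration:

```sql
-- Hypothetical dataset and column names, for illustration only.
-- Partitioning by day on the event timestamp and clustering by
-- customer_id means queries filtering on these columns scan less data.
CREATE OR REPLACE TABLE my_dataset.events
(
  event_time  TIMESTAMP,
  customer_id STRING,
  payload     STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY customer_id
OPTIONS (
  -- Partitions older than 90 days are dropped automatically,
  -- which keeps storage costs bounded.
  partition_expiration_days = 90
);
```

Queries that filter on `DATE(event_time)` or `customer_id` will then prune partitions and clustered blocks instead of scanning the whole table.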
Benefits of Google BigQuery
In terms of analytics, Google BigQuery is a fantastic choice. It offers a managed data analytics service that makes it easier for customers to manage and run large analyses in the cloud. However, there are many other vendors’ options, so in this blog, I’d like to highlight four key advantages of Google BigQuery.
- Distributed Architecture: Google’s distributed architecture dynamically distributes Google BigQuery’s computing across compute resources, removing the need to manage compute clusters. Competing offerings frequently require custom sizing (and pricing) of specific compute clusters, which can be difficult to manage over time.
- Flexible Pricing Options: Because Google allocates resources dynamically, prices are also dynamic. Google provides a pay-as-you-go option in which you pay for the data you store in Google BigQuery as well as per-query charges. They provide a reporting tool as part of this approach to give more visibility into usage and cost trends. For larger users, fixed pricing is also an option.
- Fully Managed: Because Google BigQuery is a fully managed service, Google handles the backend configuration and tuning. This is far more straightforward than competing solutions, which require you to select the number and type of clusters to create and manage over time.
- High Availability: To ensure high availability, Google BigQuery replicates data between zones automatically. It also load balances automatically for optimal performance and to reduce the impact of any hardware failures. This differs from competing solutions, which usually concentrate on a single zone.
Understanding Google BigQuery Transactions
What are Google BigQuery Transactions?
A database transaction is a unit of work performed against a database that represents a change to the state of the data and is treated independently of other transactions happening at the same time. ACID properties for databases dictate that all transactions should be atomic, consistent, isolated, and durable. While this is straightforward in the case of single-node relational databases, enforcing ACID properties comes with challenges in the case of massively parallel processing databases like Google BigQuery.
Despite these challenges, Google BigQuery is ACID-compliant and offers multi-statement transactions, as long as the data is stored in Google BigQuery storage and not in external sources. A multi-statement transaction allows users to perform multiple operations such as INSERT, DELETE, and UPDATE, and commit or roll back the group of operations as a single block. Multi-statement BigQuery transactions can span a single table or multiple tables.
Google BigQuery transactions support snapshot isolation, which covers two aspects of transaction behavior: while a BigQuery transaction is running, all reads return a consistent snapshot of the tables involved, and if one of the statements in the transaction modifies a table, the changes are visible to all subsequent statements in that transaction.
Google BigQuery transactions are bundled in scripts and cannot be nested. That is, if you begin a transaction, another one cannot start within the same script until the first one is committed or rolled back.
Multi-statement transactions are supported by Google BigQuery within scripts. A multi-statement Google BigQuery transaction allows you to perform mutating operations on one or more tables, such as inserting or deleting rows, and atomically commit or rollback the changes.
Multi-statement transactions can be used for a variety of purposes, including the following:
- Using a single transaction to perform DML mutations across multiple tables. The tables can reside in multiple datasets or projects.
- Using intermediate computations, performing mutations on a single table in stages.
Transactions provide snapshot isolation and ensure ACID properties. All reads during a transaction return a consistent snapshot of the transaction’s tables. If a statement in a transaction modifies a table, subsequent statements in the transaction will see the changes.
Executing Google BigQuery Transactions
Google BigQuery transactions start with BEGIN TRANSACTION and end when the script reaches COMMIT TRANSACTION or ROLLBACK TRANSACTION. If the script ends before reaching either of these, Google BigQuery automatically rolls back the transaction. The developer can also define an exception handler. If one is defined, Google BigQuery transfers control to it when an error occurs, and the developer can choose whether to commit or roll back the transaction inside the handler.
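The control flow just described can be sketched as a minimal script (the dataset and table names are hypothetical):

```sql
BEGIN
  BEGIN TRANSACTION;

  -- One or more DML statements; they all commit or roll back together.
  -- my_dataset.audit_log is a hypothetical table used for illustration.
  INSERT INTO my_dataset.audit_log (note) VALUES ('step 1');
  INSERT INTO my_dataset.audit_log (note) VALUES ('step 2');

  COMMIT TRANSACTION;

EXCEPTION WHEN ERROR THEN
  -- On any error, surface the message and undo the whole block.
  SELECT @@error.message;
  ROLLBACK TRANSACTION;
END;
```

If either INSERT fails, the exception handler rolls the transaction back, so the table never ends up with only one of the two rows.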
Combining everything from the above section, let us define a transaction based on a standard slowly changing dimension requirement. The requirement is to update a table that contains product prices with new prices for a specific category, as well as to add new arrivals. You also need to delete the processed entries from the table that holds the new prices.
Assume there are two tables:
CREATE OR REPLACE TABLE product_dataset.product_prices
(
product string,
price numeric
);
CREATE OR REPLACE TABLE product_dataset.new_prices
(
product string,
price numeric,
category string
);
The transaction that applies the new prices to the product_prices table looks as follows.
BEGIN
BEGIN TRANSACTION;
-- Create a temporary table that holds new prices for 'category #1'.
CREATE TEMP TABLE tmp
AS SELECT * FROM product_dataset.new_prices WHERE category = 'category #1';
-- Delete the matching records from the new_prices table.
DELETE FROM product_dataset.new_prices WHERE category = 'category #1';
-- Merge the records from the temporary table into the product_prices table.
MERGE product_dataset.product_prices AS I
USING tmp AS T
ON I.product = T.product
WHEN NOT MATCHED THEN
INSERT(product, price)
VALUES(T.product, T.price)
WHEN MATCHED THEN
UPDATE SET price = T.price;
-- Drop the temporary table and commit the transaction.
DROP TABLE tmp;
COMMIT TRANSACTION;
EXCEPTION WHEN ERROR THEN
-- Roll back the transaction inside the exception handler.
SELECT @@error.message;
ROLLBACK TRANSACTION;
END;
The above transaction snippet first loads the entries for the specific category into a temporary table. It then updates the prices for the products that already exist in the product_prices table and adds the products that do not exist. The temporary table is then dropped and the transaction is committed.
Wrapping the sequence in a transaction ensures that the data is not corrupted even if one statement fails. The transaction is safely rolled back in those cases.
The type of statements supported by Google BigQuery transactions are:
- SELECT statements.
- DML statements like INSERT, UPDATE, DELETE, MERGE.
- Creating and deleting temporary tables. DDL on permanent entities such as datasets and tables is not supported inside transactions.
Concurrency while executing transactions is another point developers need to be aware of. When a transaction mutates rows in a table, other transactions that mutate rows in the same table cannot run at the same time. Operations that read from or append rows to the table can still be executed.
If you use time functions in a transaction, the time returned reflects the moment the transaction started. Use of the FOR SYSTEM_TIME AS OF clause is prohibited inside transactions.
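A quick sketch of the time-function behavior described above: both calls below should return the same value, the transaction start time, even though the statements run one after the other.

```sql
BEGIN TRANSACTION;

-- Both of these return the timestamp at which the transaction
-- started, not the wall-clock time of each individual statement.
SELECT CURRENT_TIMESTAMP() AS t1;
SELECT CURRENT_TIMESTAMP() AS t2;  -- same value as t1

COMMIT TRANSACTION;
```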
Monitoring and Managing Google BigQuery Transactions
Unlike single statements, multi-statement transactions often involve time-consuming tasks and are executed as long-running jobs. Google BigQuery offers ways to view the running jobs and terminate the jobs if required.
You can view the list of running jobs by querying the INFORMATION_SCHEMA.JOBS_BY_* views. These views contain information about jobs run over the last 180 days.
To view the list of all the transactions running on a particular table, the following query can be used.
WITH
running_transactions AS (
SELECT DISTINCT transaction_id
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE state = "RUNNING")
SELECT
transaction_id, parent_job_id, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, running_transactions
WHERE destination_table.table_id = "name_of_your_target_table"
AND transaction_id = running_transactions.transaction_id;
To terminate a transaction, you should pass the parent_job_id to the BQ.JOBS.CANCEL routine. This can be done via the below query.
CALL BQ.JOBS.CANCEL('JOB_ID');
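As a usage sketch, you would substitute the placeholder with a parent_job_id value returned by the monitoring query; the job id below is hypothetical:

```sql
-- Hypothetical job id, for illustration only. Use the parent_job_id
-- of the transaction you want to terminate.
CALL BQ.JOBS.CANCEL('bquxjob_1234abcd_000001');
```

Cancelling the parent job ends the whole script, and any open transaction in it is rolled back rather than left half-applied.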
Limitations of Google BigQuery Transactions
- Multiple Scripts can’t be used in the same Google BigQuery transaction.
- DDL statements affecting permanent entities are not allowed in Google BigQuery transactions.
- Materialized views are transformed into logical views within a transaction. A materialized view can still be queried within a Google BigQuery transaction, but it has no performance or cost advantages over a logical view.
Conclusion
You have now learned all about Google BigQuery transactions and the nuances of using them. Transactions let you treat a sequence of statements as a single unit of work, providing the ability to commit or roll back the whole block safely, which helps prevent data corruption. On another note, if you use Google BigQuery as a data warehouse service in your organization, you might want to check out a service that makes loading and extracting data from Google BigQuery a frictionless experience.
Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks.
FAQ
What is a transaction in BigQuery?
In BigQuery, a transaction is a sequence of SQL statements that are executed as a single unit of work, ensuring that all statements succeed or none at all, typically used for managing data modifications.
What is BigQuery Data Transfer API?
The BigQuery Data Transfer API automates the transfer of data from various sources, such as Google Cloud services and external databases, into BigQuery, allowing for scheduled and managed data loads.
Does BigQuery support ACID transactions?
Yes. BigQuery multi-statement transactions are ACID-compliant, providing atomicity, consistency, snapshot isolation, and durability for data stored in BigQuery storage. Transactions are not supported for data in external sources.
Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent 4 years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves to explain complex information related to data engineering to his peers through writing. He has written many blogs related to data integration, data management aspects, and key challenges data practitioners face.