Google BigQuery Architecture: The Comprehensive Guide

By: Puneet Jindal | Published: January 7, 2022


Google BigQuery is a fully managed data warehouse tool. It allows scalable analysis over a petabyte of data, querying using ANSI SQL, integration with various applications, etc. To access all these features conveniently, you need to understand BigQuery architecture, maintenance, pricing, and security.

This guide decodes the most important components of Google BigQuery: BigQuery Architecture, Maintenance, Performance, Pricing, and Security.


What Is Google BigQuery?

Google BigQuery is a Cloud Data Warehouse run by Google. It is capable of analyzing terabytes of data in seconds. If you know how to write SQL queries, you already know how to query it. In fact, there are plenty of interesting public datasets shared in BigQuery, ready for you to query.

You can access BigQuery by using the GCP console or the classic web UI, by using a command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

There are also a variety of third-party tools that you can use to interact with BigQuery, for tasks such as visualizing or loading data.

What are the Key Features of Google BigQuery?

Why did Google release BigQuery and why would you use it instead of a more established data warehouse solution?

  • Ease of Implementation: Building your own data warehouse is expensive, time-consuming, and difficult to scale. With BigQuery, you just load your data and pay only for what you use.
  • Speed: Process billions of rows in seconds and handle the real-time analysis of Streaming data.

What is the Google BigQuery Architecture?

BigQuery Architecture is based on Dremel, a query technology that has been used internally at Google for about a decade.

  • Dremel: BigQuery Architecture dynamically apportions slots to queries on an as-needed basis, maintaining fairness amongst multiple users who are all querying at once. A single user can get thousands of slots to run their queries. It takes more than just a lot of hardware to make your queries run fast. BigQuery requests are powered by the Dremel query engine. 
  • Colossus: BigQuery Architecture relies on Colossus, Google’s latest generation distributed file system. Each Google data center has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash), and distributed management.
  • Jupiter Network: It is the internal data center network that allows BigQuery to separate storage and compute.
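To make the slot-allocation idea above concrete, here is a toy sketch of how a scheduler might apportion a fixed pool of slots fairly across concurrent queries. This is an illustration only, not Google's actual scheduling algorithm; the function name and the fair-share logic are assumptions made for the example.

```python
# Toy sketch of fair slot allocation across concurrent queries.
# NOT Google's actual scheduler -- just an illustration of "fairness
# amongst multiple users who are all querying at once".

def allocate_slots(total_slots, demands):
    """Split total_slots fairly across queries; a query never gets more
    slots than it asked for, and leftover capacity is redistributed to
    the still-hungry queries."""
    allocation = {q: 0 for q in demands}
    remaining = dict(demands)
    free = total_slots
    while free > 0 and remaining:
        share = max(1, free // len(remaining))
        for q in list(remaining):
            give = min(share, remaining[q], free)
            allocation[q] += give
            remaining[q] -= give
            free -= give
            if remaining[q] == 0:
                del remaining[q]
            if free == 0:
                break
    return allocation

# Three users querying at once: the small query gets everything it
# asked for, the two big ones split the remaining capacity evenly.
print(allocate_slots(2000, {"q1": 5000, "q2": 5000, "q3": 200}))
# {'q1': 900, 'q2': 900, 'q3': 200}
```

The point of the sketch is that no user is starved: small queries are satisfied in full, and the heavy queries share whatever is left.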

Data Model/Storage

  • Columnar storage.
  • Nested/Repeated fields.
  • No indexes: queries run as full scans.

Query Execution

  • Queries are executed using a tree architecture.
  • The query is executed using tens of thousands of machines over a fast Google Network.

What is BigQuery’s Columnar Database?

Google BigQuery Architecture uses a column-based (columnar) storage structure that helps it achieve faster query processing with fewer resources. It is the main reason why Google BigQuery handles large quantities of data and delivers excellent speed.

Relational databases typically use a row-based storage structure, where data is stored row by row, because this is an efficient layout for transactional workloads. Storing data in columns is more efficient for analytical workloads, which need fast read access to a subset of columns.

Suppose a table has 1,000 columns. If we store the data in a row-based structure, a query that needs only 10 of those columns still has to read every row in full, including the 990 columns it does not need, which takes more time.

This is not the case in Google BigQuery’s columnar database, where all the data is stored in columns instead of rows.

The columnar database reads only the 10 columns relevant to the query, which in turn makes the overall query processing faster.
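The scan-size difference can be sketched with a quick back-of-the-envelope calculation; the table shape and cell counts below are made-up numbers for illustration.

```python
# Toy illustration of why columnar storage scans less data.
# A table with 1,000 columns; a query touches only 10 of them.

NUM_ROWS, NUM_COLS, COLS_NEEDED = 10_000, 1_000, 10

# Row-based storage: every row is read in full, so all columns
# of every row are touched.
cells_read_row_store = NUM_ROWS * NUM_COLS

# Columnar storage: only the referenced columns are read, each as a
# contiguous block of NUM_ROWS values.
cells_read_column_store = NUM_ROWS * COLS_NEEDED

print(cells_read_row_store)     # 10000000 cells
print(cells_read_column_store)  # 100000 cells
print(cells_read_row_store // cells_read_column_store)  # 100x less data scanned
```

The ratio scales with how narrow the query is relative to the table: referencing fewer columns means proportionally less data scanned, which (as discussed later) also lowers query cost.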

The Google Ecosystem

Google BigQuery is a Cloud Data Warehouse that is a part of Google Cloud Platform (GCP) which means it can easily integrate with other Google products and services.

Google Cloud Platform is a package of many Google services for storing and processing data, such as Google Cloud Storage, Google Bigtable, Google Drive, databases, and other data processing tools.

Google BigQuery can process all the data stored in these other Google products. Google BigQuery uses standard SQL queries to create and execute Machine Learning models and integrate with other Business Intelligence tools like Looker and Tableau.

Google BigQuery Comparison with Other Database and Data Warehouses

Here, you will be looking at how Google BigQuery is different from other Databases and Data Warehouses:

1) Comparison with MapReduce and NoSQL

MapReduce vs. Google BigQuery

  | MapReduce                            | BigQuery         |
  | ------------------------------------ | ---------------- |
  | High latency                         | Low latency      |
  | Flexible (complex) batch processing  | SQL-like queries |
  | Unstructured data                    | Structured data  |

MapReduce vs Google BigQuery

NoSQL Datastore vs. Google BigQuery

  | NoSQL Datastore | BigQuery        |
  | --------------- | --------------- |
  | Index-based     | Non-index based |
  | Read-write      | Read-only       |

NoSQL vs Google BigQuery

2) Comparison with Redshift and Snowflake

| Name | Redshift | BigQuery | Snowflake |
| --- | --- | --- | --- |
| Description | Large-scale data warehouse service for use with business intelligence tools | Large-scale data warehouse service with append-only tables | Cloud-based data warehousing service for structured and semi-structured data |
| Primary database model | Relational DBMS | Relational DBMS | Relational DBMS |
| Developer | Amazon | Google | Snowflake Computing |
| XML support | No | No | Yes |
| APIs and other access methods | JDBC, ODBC | RESTful HTTP/JSON API | CLI Client, JDBC, ODBC |
| Supported programming languages | All languages supporting JDBC/ODBC | .Net, Java, JavaScript, Objective-C, PHP, Python | JavaScript (Node.js), Python |
| Partitioning methods | Sharding | None | Yes |
| MapReduce | No | No | No |
| Concurrency | Yes | Yes | Yes |
| Transaction concepts | ACID | No | ACID |
| Durability | Yes | Yes | Yes |
| In-memory capabilities | Yes | No | No |
| User concepts | Fine-grained access rights according to SQL-standard | Access privileges (owner, writer, reader) for whole datasets, not for individual tables | Users with fine-grained authorization concepts, user roles and pluggable authentication |

Redshift vs BigQuery vs Snowflake

Some Important Considerations about these Comparisons:

  • If you have a reasonable volume of data, say dozens of terabytes, that you rarely query, and query response times of up to a few minutes are acceptable when you do, then Google BigQuery is an excellent candidate for your scenario.
  • If you need to analyze a large amount of data (e.g., up to a few terabytes) by running many queries that should each be answered very quickly, and you don’t need to keep the data available once the analysis is done, then an on-demand cloud solution like Amazon Redshift is a great fit.
    But keep in mind that, unlike Google BigQuery, Redshift does need to be configured and tuned in order to perform well.
  • BigQuery Architecture is good enough if the speed of data updating is not a priority. Compared to Redshift, Google BigQuery only supports hourly syncs as its fastest update frequency, so teams that need close to real-time data integration often choose Redshift instead.

Key Concepts of Google BigQuery

Now, you will get to know about the key concepts associated with Google BigQuery:

1) Google BigQuery Working

BigQuery is a data warehouse, implying a degree of centralization. A simple query is applied to a single dataset.

However, the benefits of BigQuery become even more apparent when we do joins of datasets from completely different sources or when we query against data that is stored outside BigQuery.

If you’re a power user of Sheets, you’ll probably appreciate the ability to do more fine-grained research with data in your spreadsheets. It’s a sensible enhancement for Google to make, as it unites BigQuery with more of Google’s own existing services. Previously, Google made it possible to analyze Google Analytics data in BigQuery.

These sorts of integrations could make BigQuery Architecture a better choice in the market for cloud-based data warehouses, which is increasingly how Google has positioned BigQuery. Public cloud market leader Amazon Web Services (AWS) has Redshift, but no widely used tool for spreadsheets.

Microsoft Azure’s SQL Data Warehouse, which has been in preview for several months, does not currently have an official integration with Microsoft Excel, surprising though it may be.

2) Google BigQuery Querying

Google BigQuery Architecture supports SQL queries and is compatible with ANSI SQL 2011. BigQuery’s SQL support has been extended to handle nested and repeated field types as part of the data model.

For example, you can use the GitHub public dataset and the UNNEST operator, which lets you iterate over a repeated field.

SELECT
  name, COUNT(1) AS num_repos
FROM
  `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
ORDER BY num_repos DESC
LIMIT 10
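To see what UNNEST is doing, the same flatten-then-count logic can be mimicked in plain Python on a tiny hand-made sample of the repos table (the real dataset is of course far larger; the sample rows are invented for illustration).

```python
# The UNNEST + GROUP BY query above, mimicked in plain Python on a
# tiny invented sample of the `languages` table.
from collections import Counter

# Each repo row carries a repeated `language` field: a list of records.
repos = [
    {"repo_name": "a", "language": [{"name": "Python"}, {"name": "C"}]},
    {"repo_name": "b", "language": [{"name": "Python"}]},
    {"repo_name": "c", "language": [{"name": "Go"}, {"name": "Python"}]},
]

# UNNEST(language) flattens the repeated field into one row per element;
# GROUP BY name + COUNT(1) then reduces to a simple Counter.
counts = Counter(lang["name"] for repo in repos for lang in repo["language"])

# ORDER BY num_repos DESC LIMIT 10
print(counts.most_common(10))  # [('Python', 3), ('C', 1), ('Go', 1)]
```

Each element of the repeated field becomes its own logical row before grouping, which is exactly the cross join with UNNEST in the SQL version.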

A) Interactive Queries

Google BigQuery Architecture supports interactive querying of datasets and provides you with a consolidated view of the datasets across projects that you can access. The console provides features such as saving and sharing ad-hoc queries, and exploring tables and schemas.

B) Automated Queries

You can automate the execution of your queries based on an event and cache the result for later use. You can use the Airflow API to orchestrate automated activities.

For simple orchestrations, you can use cron jobs. To encapsulate a query as an App Engine app and run it as a scheduled cron job, you can refer to this blog.

C) Query Optimization

Each time Google BigQuery executes a query, it performs a full scan of the referenced columns; it doesn’t support indexes. Since the performance and cost of a query in Google BigQuery Architecture depend on the amount of data scanned, you should design your queries to reference only the columns that are strictly relevant.

When you are using partitioned tables, make sure that only the relevant partitions are scanned.

You can also refer to the detailed blog here that can help you to understand the performance characteristics after a query executes.

D) External sources

With federated data sources, you can run queries on data that exists outside of Google BigQuery, though this method has performance implications. You can also use query federation to perform the ETL process from an external source into Google BigQuery.

E) User-defined functions

Google BigQuery supports user-defined functions for queries whose complexity exceeds what plain SQL can express. User-defined functions let you extend the built-in SQL functions easily; they are written in JavaScript and can take a list of values and return a single value.

F) Query sharing

Collaborators can save and share queries between team members. This makes data exploration exercises, and getting up to the desired speed on a new dataset or query pattern, much easier.

3) Google BigQuery ETL/Data Load

There are various approaches to loading data into BigQuery. In case you are moving data from Google applications – like Google Analytics, Google AdWords, etc. – Google provides a robust BigQuery Data Transfer Service. This is Google’s own intra-product data migration tool.

Data load from other data sources – databases, cloud applications, and more can be accomplished by deploying engineering resources to write custom scripts.

The broad steps would be to extract data from the data source, transform it into a format that BigQuery accepts, upload this data to Google Cloud Storage (GCS) and finally load this to Google BigQuery from GCS.

A few examples of how to perform this can be found here –> PostgreSQL to BigQuery and SQL Server to BigQuery

A word of caution though – custom coding scripts to move data to Google BigQuery is both a complex and cumbersome process. A third-party data pipeline platform such as Hevo can make this a hassle-free process for you.

Simplify ETL Using Hevo’s No-code Data Pipeline

Hevo Data helps you directly transfer data from 100+ other data sources (including 30+ free sources) to Business Intelligence tools, Data Warehouses, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Hevo takes care of all your data preprocessing needs required to set up the integration and lets you focus on key business activities and draw a much more powerful insight on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.

Get Started with Hevo for Free

Check out what makes Hevo amazing:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

4) Google BigQuery Pricing Model

| Category | Price | Note |
| --- | --- | --- |
| Storage Cost | $0.020 per GB per month | |
| Query Cost | $5 per TB | 1st TB per month is free |

Pricing Model of Google BigQuery

A) Google BigQuery Storage Cost

  • Active – Monthly charge for stored data modified within the last 90 days.
  • Long-term – Monthly charge for stored data that has not been modified within the last 90 days. This rate is usually lower than the active rate.

B) Google BigQuery Query Cost

  • On-demand – Based on data usage.
  • Flat rate – Fixed monthly cost, ideal for enterprise users.
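A rough monthly bill can be estimated from the rates in the pricing table above ($0.020 per GB per month for storage, $5 per TB for on-demand queries with the first TB free). The calculation below is a simplified sketch: real bills also reflect long-term storage discounts, per-query minimums, and flat-rate plans, none of which are modeled here.

```python
# Back-of-the-envelope cost sketch using the rates listed above.
# Simplified: ignores long-term storage discounts, per-query minimums,
# and flat-rate plans.

STORAGE_PER_GB_MONTH = 0.020  # $ per GB stored per month
QUERY_PER_TB = 5.0            # $ per TB scanned (on-demand)
FREE_QUERY_TB = 1.0           # first TB scanned each month is free

def monthly_cost(stored_gb, queried_tb):
    storage = stored_gb * STORAGE_PER_GB_MONTH
    billable_tb = max(0.0, queried_tb - FREE_QUERY_TB)
    queries = billable_tb * QUERY_PER_TB
    return round(storage + queries, 2)

# 2 TB stored, 5 TB scanned in a month:
# storage = 2048 GB * $0.020 = $40.96; queries = (5 - 1) TB * $5 = $20
print(monthly_cost(2048, 5))  # 60.96
```

Because the query charge is driven entirely by bytes scanned, the column-pruning and partitioning advice elsewhere in this guide translates directly into lower bills.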

Free usage is available for the below operations:

  • Loading data (network pricing policy applicable in case of inter-region).
  • Copying data.
  • Exporting data.
  • Deleting datasets.
  • Metadata operations.
  • Deleting tables, views, and partitions.

5) Google BigQuery Maintenance

Google has managed to solve a lot of common data warehouse concerns by throwing orders of magnitude more hardware at the existing problems, thus eliminating them altogether. Unlike Amazon Redshift, running VACUUM in Google BigQuery is not an option.

Google BigQuery is specifically architected without the need for the resource-intensive VACUUM operation that is recommended for Redshift. BigQuery pricing is also very different from Redshift’s.

Keep in mind that, by design, Google BigQuery is append-only. When planning to update or delete data, you’ll need to truncate the entire table and recreate it with new data.

However, Google has implemented ways in which users can reduce the amount of data processed.

Partition tables by specifying the partition date in queries, and use wildcard tables to shard data by an attribute.
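The effect of partition pruning can be sketched with a toy model of a date-partitioned table: only partitions named by the query's date filter are read. The table shape and sizes below are invented for illustration.

```python
# Toy sketch of partition pruning on a date-partitioned table:
# only partitions matching the query's date filter are scanned.
import datetime

# A week of daily partitions, 100 "MB" each (invented sizes).
partitions = {
    datetime.date(2022, 1, d): 100 for d in range(1, 8)
}

def mb_scanned(date_from, date_to):
    """Only partitions inside the filter range are read."""
    return sum(size for day, size in partitions.items()
               if date_from <= day <= date_to)

full_scan = sum(partitions.values())                  # no filter: 700 MB
pruned = mb_scanned(datetime.date(2022, 1, 3),
                    datetime.date(2022, 1, 4))        # 2 partitions: 200 MB
print(full_scan, pruned)
```

Since on-demand pricing is per byte scanned, pruning five of the seven partitions here would cut both runtime and cost by the same proportion.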

6) Google BigQuery Security

The fastest hardware and most advanced software are of little use if you can’t trust them with your data. BigQuery’s security model is tightly integrated with the rest of Google’s Cloud Platform, so it is possible to take a holistic view of your data security.

BigQuery uses Google’s Identity and Access Management (IAM) access control system to assign specific permissions to individual users or groups of users.

BigQuery also ties in tightly with Google’s Virtual Private Cloud (VPC) policy controls, which can protect against users who try to access data from outside your organization, or who try to export it to third parties.

Both IAM and VPC controls are designed to work across Google cloud products, so you don’t have to worry that certain products create a security hole.

BigQuery is available in every region where Google Cloud has a presence, enabling you to process the data in the location of your choosing. At the time of writing, Google Cloud has more than two dozen data centers around the world, and new ones are being opened at a fast rate.

If you have business reasons for keeping data in the US, it is possible to do so. Just create your dataset with the US region code, and all of your queries against the data will be done within that region. 

Know more about Google BigQuery security from here.

7) Google BigQuery Features

Some features of Google BigQuery Data Warehouse are listed below:

  • Just upload your data and run SQL.
  • No cluster deployment, no virtual machines, no setting keys or indexes, and no software.
  • Separate storage and computing.
  • No need to deploy multiple clusters and duplicate data into each one. Manage permissions on projects and datasets with access control lists. Seamlessly scales with usage.
  • Compute scales with usage, without cluster resizing.
  • Thousands of cores are used per query.
  • Deployed across multiple data centers by default, with multiple factors of replication, to maximize data durability and service uptime.
  • Stream millions of rows per second for real-time analysis.
  • Analyze terabytes of data in seconds.
  • Storage scales to Petabytes.

8) Google BigQuery Interaction

A) Web User Interface

  • Run queries and examine results.
  • Manage databases and tables.
  • Save queries and share them across the organization for re-use.
  • Detailed Query history.

B) Visualizing with Data Studio

  • View BigQuery results with charts, pivots, and dashboards.

C) API

  • A programmatic way to access Google BigQuery.

D) Service Limits for Google BigQuery

  • The concurrent rate limit for on-demand, interactive queries: 50.
  • Daily query size limit: Unlimited by default.
  • Daily destination table update limit: 1,000 updates per table per day.
  • Query execution time limit: 6 hours.
  • A maximum number of tables referenced per query: 1,000.
  • Maximum unresolved query length: 256 KB.
  • Maximum resolved query length: 12 MB.
  • The concurrent rate limit for on-demand, interactive queries against Cloud Bigtable external data sources: 4.

E) Integrating with Tensorflow

BigQuery has a new feature, BigQuery ML, that lets you create and use simple Machine Learning (ML) models as well as deep learning prediction with TensorFlow models. This is the key technology for integrating the scalable data warehouse with the power of ML.

The solution enables a variety of smart data analytics, such as logistic regression on a large dataset, similarity search, and recommendation on images, documents, products, or users, by processing feature vectors of the contents. Or you can even run TensorFlow model prediction inside BigQuery.

Now, imagine what would happen if you could use BigQuery for deep learning as well. After having data scientists train the cutting-edge intelligent neural network model with TensorFlow or Google Cloud Machine Learning, you can move the model to BigQuery and execute predictions with the model inside BigQuery.

This means you can let any employee in your company use the power of BigQuery for their daily data analytics tasks, including image analytics and business data analytics on terabytes of data, processed in tens of seconds, solely on BigQuery without any engineering knowledge.

Google BigQuery Performance

Google BigQuery rose from Dremel, Google’s distributed query engine. Dremel held the capability to handle terabytes of data in seconds flat by leveraging distributed computing within a serverless BigQuery Architecture.

This BigQuery architecture allows it to process complex queries with the help of multiple servers in parallel to significantly improve processing speed. In the following sections, you will take a look at the 4 critical components of Google BigQuery performance:

Tree Architecture

BigQuery Architecture and Dremel can scale to thousands of machines by structuring computations as an execution tree. A root server receives an incoming query and relays it to branches, also known as mixers, which modify incoming queries and deliver them to leaf nodes, also known as slots.

Working in parallel, the leaf nodes handle the nitty-gritty of filtering and reading the data. The results are then moved back down the tree where the mixers accumulate the results and send them to the root as the answer to the query.
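The root/mixer/leaf flow described above can be sketched as a small simulation. This is a toy model, not BigQuery's implementation: the real system spans thousands of machines, while here the "shards" are in-memory lists and the query is a filtered COUNT.

```python
# Toy sketch of Dremel-style tree execution: leaves scan shards in
# parallel, mixers combine partial results, the root returns the answer.
# Here the "query" is a COUNT of rows matching a predicate.

def leaf(shard, predicate):
    # Leaf nodes do the nitty-gritty of filtering and reading the data.
    return sum(1 for row in shard if predicate(row))

def mixer(partials):
    # Mixers accumulate partial results from their children.
    return sum(partials)

def root(shards, predicate, fanout=2):
    # The root relays the query down the tree, then combines the
    # mixers' answers into the final result.
    groups = [shards[i:i + fanout] for i in range(0, len(shards), fanout)]
    mixed = [mixer([leaf(s, predicate) for s in group]) for group in groups]
    return mixer(mixed)

shards = [[1, 5, 9], [2, 6], [7, 8, 3], [4]]
print(root(shards, lambda x: x > 4))  # rows greater than 4 -> 5
```

Aggregations like COUNT and SUM decompose this way naturally, which is why the tree can fan out to thousands of leaves and still combine results cheaply on the way back up.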

Serverless Service

In most Data Warehouse environments, organizations have to specify and commit to the server hardware on which computations are run. Administrators have to provision for performance, elasticity, security, and reliability.

A serverless model can come in handy in solving this constraint. In a serverless model, processing can automatically be distributed over a large number of machines working simultaneously.

By leveraging Google BigQuery’s serverless model, database administrators and data engineers can focus less on provisioning infrastructure and more on extracting actionable insights from data.

SQL and Programming Language Support

Users can access BigQuery through standard SQL, which many users are quite familiar with. Google BigQuery also has client libraries for writing applications that access the data in Python, Java, Go, C#, PHP, Ruby, and Node.js.

Real-time Analytics

Google BigQuery can also run and process reports on real-time data by using other GCP resources and services. Data warehouses can provide support for analytics after data from multiple sources is accumulated and stored, which often happens in batches throughout the day.

Apart from Batch Processing, Google BigQuery Architecture also supports streaming at a rate of millions of rows of data every second.

9) Use Cases of Google BigQuery

You can use Google BigQuery Data Warehouse in the following cases:

  • Use it when you have queries that take more than five seconds to run in a relational database. The idea of BigQuery is running complex analytical queries, which means there is no point in running queries that do simple aggregation or filtering. BigQuery is suitable for “heavy” queries, those that operate on a big set of data.
    The bigger the dataset, the more you’re likely to gain performance by using BigQuery. The dataset that I used was only 330 MB (megabytes, not even gigabytes).
  • BigQuery is good for scenarios where data does not change often and you want to use the cache, as it has a built-in cache. What does this mean? If you run the same query and the data in the tables has not changed (been updated), BigQuery will simply use the cached results and will not try to execute the query again. Also, BigQuery does not charge for cached queries.
  • You can also use BigQuery when you want to reduce the load on your relational database. Analytical queries are “heavy” and overusing them under a relational database can lead to performance issues. So, you could eventually be forced to think about scaling your server.
    However, with BigQuery you can move these running queries to a third-party service, so they would not affect your main relational database.
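The caching behavior described in the list above can be sketched as a tiny model: results are keyed by the query text plus a table version, so a repeated query on unchanged data is served from cache for free, while any table update invalidates it. This is an illustrative model, not BigQuery's actual cache implementation; the class and its counters are invented for the example.

```python
# Toy sketch of BigQuery-style result caching: re-running the same
# query on unchanged data returns the cached result with no new
# (billable) execution; changing the table invalidates the cache.

class CachedWarehouse:
    def __init__(self, table):
        self.table = table
        self.version = 0      # bumped whenever the table changes
        self.cache = {}
        self.executions = 0   # how many real (billable) runs happened

    def update(self, rows):
        self.table.extend(rows)
        self.version += 1     # old cache entries no longer match

    def query(self, sql_text):
        key = (sql_text, self.version)
        if key not in self.cache:
            self.executions += 1               # real scan, billed
            self.cache[key] = len(self.table)  # stand-in for running sql_text
        return self.cache[key]

wh = CachedWarehouse([1, 2, 3])
wh.query("SELECT COUNT(*) ...")  # executes
wh.query("SELECT COUNT(*) ...")  # cache hit, free
wh.update([4])                   # table changed -> cache invalidated
wh.query("SELECT COUNT(*) ...")  # executes again
print(wh.executions)             # 2
```

The key design point is that the cache key includes the data version, so stale results are never served; only genuinely unchanged data benefits from the free cached path.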

Conclusion

BigQuery is a sophisticated, mature service that has been around for many years. It is feature-rich, economical, and fast. BigQuery’s integration with Google Drive and the free Data Studio visualization toolset is very useful for comprehending and analyzing Big Data, and it can process several terabytes of data within a few seconds. The service is being deployed across existing and future Google Cloud Platform (GCP) regions. Serverless is certainly the next best option to obtain maximized query performance with minimal infrastructure cost.

If you want to integrate your data from various sources and load it in Google BigQuery, then try Hevo.

Visit our Website to Explore Hevo

Businesses can use automated platforms like Hevo Data to set the integration and handle the ETL process. It helps you directly transfer data from various Data Sources to a Data Warehouse, Business Intelligence tools, or any other desired destination in a fully automated and secure manner without having to write any code and will provide you with a hassle-free experience.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

So, what are your thoughts on Google BigQuery? Let us know in the comments

Puneet Jindal
Freelance Technical Content Writer, Hevo Data

Puneet possesses a passion towards the data realm and applies analytical thinking and a problem-solving approach to help businesses with content on data integration and analysis.
