Google BigQuery 101: A Guide to BigQuery Data Warehouse Architecture, Features, Cost, & more

on Data Warehouse • August 11th, 2019

Google BigQuery is a fully managed data warehouse tool. It enables scalable analysis over petabytes of data, querying using ANSI SQL, integration with various applications, and more. To use all of these features well, you need to understand BigQuery's architecture, maintenance, pricing, and security. This guide decodes the most important components of Google BigQuery: its data warehouse architecture, maintenance, performance, pricing, and security.

Let’s see how this blog is structured for you:

  1. What is Google BigQuery?
  2. Why Google BigQuery?
  3. Google BigQuery Data Warehouse Architecture
  4. Google BigQuery Comparison with MapReduce and NoSQL
  5. Google BigQuery Comparison with Redshift and Snowflake
  6. Google BigQuery Data Warehouse Working
  7. Google BigQuery Data Warehouse Querying
  8. Google BigQuery Data Warehouse ETL/Data Load
  9. Google BigQuery Data Warehouse Pricing Model
  10. Google BigQuery Data Warehouse Maintenance
  11. Google BigQuery Data Warehouse Security
  12. Google BigQuery Data Warehouse Features
  13. Google BigQuery Data Warehouse Interaction
  14. When to use Google BigQuery Data Warehouse?
  15. Conclusion

What is Google BigQuery?


Google BigQuery is a cloud data warehouse run by Google. It is capable of analysing terabytes of data in seconds. If you know how to write SQL queries, you already know how to query it. In fact, there are plenty of interesting public datasets shared in BigQuery, ready to be queried by you.

You can access BigQuery by using the GCP console or the classic web UI, by using a command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python. There are also a variety of third-party tools that you can use to interact with BigQuery, for example to visualise or to load the data.

Why Google BigQuery?

Why did Google release BigQuery, and why would you use it instead of a more established data warehouse solution?

  • Ease of Implementation: Building your own data warehouse is expensive, time-consuming, and difficult to scale. With BigQuery, you just load your data and pay only for what you use.
  • Speed: BigQuery processes billions of rows in seconds and handles real-time analysis of streaming data.

Google BigQuery Data Warehouse Architecture

BigQuery is built on Dremel, a query technology that has been used inside Google for about 10 years.

Dremel: BigQuery requests are powered by the Dremel query engine. It dynamically apportions slots to queries on an as-needed basis, maintaining fairness among multiple users who are all querying at once. A single user can get thousands of slots to run their queries. Still, it takes more than just a lot of hardware to make queries run fast.

Colossus: BigQuery relies on Colossus, Google’s latest generation distributed file system. Each Google data centre has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash) and distributed management.

Jupiter Network: It is the internal data centre network that allows BigQuery to separate storage and compute.

Data Model/Storage

  • Columnar storage.
  • Nested/repeated fields (see the schema sketch after this list).
  • No indexes: queries perform full table scans.
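
As a small illustration of the nested/repeated data model, a table schema can declare RECORD and REPEATED fields with the Python client library. This is a minimal sketch; the project, dataset, table, and field names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# A hypothetical orders table: each row holds a repeated, nested `items` record.
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField(
        "items",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
        ],
    ),
]

client.create_table(bigquery.Table("my-project.sales.orders", schema=schema))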

Query Execution

  • Queries are executed using a tree architecture.
  • A query can be executed across tens of thousands of machines over Google's fast network.

Google BigQuery Comparison with MapReduce and NoSQL

MapReduce vs. Google BigQuery

  MapReduce:
  1. High latency.
  2. Flexible (complex) batch processing.
  3. Unstructured data.

  BigQuery:
  1. Low latency.
  2. SQL-like queries.
  3. Structured data.

NoSQL Data store vs. Google BigQuery

  NoSQL data store:
  1. Index based.
  2. Read-write.

  BigQuery:
  1. Non-index based.
  2. Read-only.

Google BigQuery Comparison with Redshift and Snowflake

Name | Redshift | BigQuery | Snowflake
Description | Large scale data warehouse service for use with business intelligence tools | Large scale data warehouse service with append-only tables | Cloud-based data warehousing service for structured and semi-structured data
Primary database model | Relational DBMS | Relational DBMS | Relational DBMS
Developer | Amazon | Google | Snowflake Computing
XML support | No | No | Yes
APIs and other access methods | JDBC, ODBC | RESTful HTTP/JSON API | CLI client, JDBC, ODBC
Supported programming languages | All languages supporting JDBC/ODBC | .Net, Java, JavaScript, Objective-C, PHP, Python | JavaScript (Node.js), Python
Partitioning methods | Sharding | None | Yes
MapReduce | No | No | No
Concurrency | Yes | Yes | Yes
Transaction concepts | ACID | No | ACID
Durability | Yes | Yes | Yes
In-memory capabilities | Yes | No | No
User concepts | Fine-grained access rights according to SQL standard | Access privileges (owner, writer, reader) for whole datasets, not for individual tables | Users with fine-grained authorization concept, user roles and pluggable authentication

Some Important Considerations about these Comparisons:

  • If you have a reasonably large volume of data, say dozens of terabytes, that you query only rarely, and query response times of up to a few minutes are acceptable when you do, then Google BigQuery is an excellent candidate for your scenario.
  • If you need to analyse a large amount of data (e.g. up to a few terabytes) by running many queries, each of which should be answered very quickly, and you don't need to keep the data available once the analysis is done, then an on-demand cloud solution like Amazon Redshift is a great fit. Keep in mind that, unlike Google BigQuery, Redshift needs to be configured and tuned in order to perform well.
  • BigQuery is good enough if the speed of data updates is not a concern. Compared to Redshift, Google BigQuery only supports hourly syncs as its fastest update frequency, so teams that need close to real-time data integration may prefer Redshift.

Google BigQuery Data Warehouse Working

BigQuery is a data warehouse, implying a degree of centralization. A typical query, like the one shown in the Querying section below, is applied to a single dataset. However, the benefits of BigQuery become even more apparent when we join datasets from completely different sources or when we query against data that is stored outside BigQuery.

If you’re a power user of Sheets, you’ll probably appreciate the ability to do more fine-grained research with data in your spreadsheets. It’s a sensible enhancement for Google to make, as it unites BigQuery with more of Google’s own existing services. Previously, Google made it possible to analyse Google Analytics data in BigQuery.

These sorts of integrations could make BigQuery a better choice in the market for cloud-based data warehouses, which is increasingly how Google has positioned BigQuery. Public cloud market leader Amazon Web Services (AWS) has Redshift, but no widely used tool for spreadsheets. Microsoft Azure’s SQL Data Warehouse, which has been in preview for several months, does not currently have an official integration with Microsoft Excel, surprising though it may be.

Google BigQuery Data Warehouse Querying

Google BigQuery supports SQL queries and is compatible with ANSI SQL 2011. BigQuery's SQL support has been extended to cover nested and repeated field types as part of the data model. For example, using the GitHub public dataset, you can use the UNNEST operator, which lets you iterate over a repeated field:

SELECT
  name, COUNT(1) AS num_repos
FROM
  `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
ORDER BY num_repos DESC
LIMIT 10

Interactive Queries

Google BigQuery supports interactive querying of datasets and provides a consolidated view of the datasets across the projects you can access. The console provides features such as saving and sharing ad-hoc queries, and exploring tables and schemas.


Automated Queries

You can automate the execution of your queries based on an event and cache the results for later use. You can use the Airflow API to orchestrate automated activities; for simple orchestrations, you can use cron jobs. To encapsulate a query as an App Engine app and run it as a scheduled cron job, you can refer to this blog.
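
For a simple cron-style refresh, a query's results can be written to a destination table with the Python client library. This is a minimal sketch; the project, dataset, and table names are hypothetical and would need to match your own environment.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")
destination = "my-project.analytics.daily_language_counts"  # hypothetical table

job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite on each run
)

sql = """
SELECT name, COUNT(1) AS num_repos
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
"""

# Running this script from a cron job (or an Airflow task) refreshes the table.
client.query(sql, job_config=job_config).result()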

Query Optimization

Each time Google BigQuery executes a query, it performs a full scan of the columns referenced; it does not support indexes. Because the performance and cost of a query depend on the amount of data scanned, design your queries to reference only the columns that are strictly relevant. When you use partitioned tables, make sure that only the relevant partitions are scanned.
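
As an illustration, the following query touches only the columns it needs and restricts the scan to a single partition. This is a sketch; the table name and its `event_date` partitioning column are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical date-partitioned table; `event_date` is its partitioning column.
sql = """
SELECT user_id, event_name            -- reference only the columns you need
FROM `my-project.analytics.events`
WHERE event_date = '2019-08-01'       -- prune the scan to a single partition
"""

job = client.query(sql)
job.result()
print(f"Bytes processed: {job.total_bytes_processed}")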

You can also refer to the detailed blog here, which can help you understand the performance characteristics of a query after it executes.


External sources

With federated data sources, you can run queries on data that exists outside of Google BigQuery, although this method has performance implications. You can also use query federation to perform ETL from an external source into Google BigQuery.
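
One way to do this with the Python client library is to define a temporary external table over files in Cloud Storage and query it directly. This is a sketch; the bucket path and external table name are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical CSV files sitting in a Cloud Storage bucket.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/exports/orders-*.csv"]
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"orders_external": external_config}
)

# The query reads the files directly; no data is loaded into BigQuery first.
sql = "SELECT COUNT(*) AS row_count FROM orders_external"
for row in client.query(sql, job_config=job_config).result():
    print(row.row_count)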

User-defined functions

Google BigQuery supports user-defined functions for queries whose logic is too complex to express comfortably in SQL. User-defined functions let you extend the built-in SQL functions; they are written in JavaScript and can take a list of values and return a single value.
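
A sketch of a temporary JavaScript UDF applied to the GitHub public dataset used earlier; the function name and its logic are illustrative only.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE TEMP FUNCTION normalise(name STRING)
RETURNS STRING
LANGUAGE js AS '''
  return name ? name.trim().toLowerCase() : null;
''';

SELECT normalise(name) AS lang, COUNT(1) AS num_repos
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY lang
ORDER BY num_repos DESC
LIMIT 10;
"""

for row in client.query(sql).result():
    print(row.lang, row.num_repos)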

Query sharing

Collaborators can save queries and share them between team members. This makes data exploration exercises, and getting up to speed on a new dataset or query pattern, much easier.

Google BigQuery Data Warehouse ETL/Data Load

There are various approaches to loading data into BigQuery. In case you are moving data from Google applications, like Google Analytics, Google AdWords, etc., Google provides a robust BigQuery Data Transfer Service. This is Google's own intra-product data migration tool.

Data load from other data sources (databases, cloud applications, and more) can be accomplished by deploying engineering resources to write custom scripts. The broad steps would be to extract data from the data source, transform it into a format that BigQuery accepts, upload this data to Google Cloud Storage (GCS), and finally load it into Google BigQuery from GCS.
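
The final GCS-to-BigQuery load step can be done with the Python client library. This is a minimal sketch in which the bucket, file, and destination table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical GCS object and destination table used for illustration.
uri = "gs://my-bucket/exports/customers.csv"
table_id = "my-project.staging.customers"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the file
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")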

A few examples of how to perform this can be found here: PostgreSQL to BigQuery and SQL Server to BigQuery.

A word of caution, though: custom coding scripts to move data to Google BigQuery is both a complex and cumbersome process. A third-party data pipeline platform such as Hevo can make this a hassle-free process for you.

Hevo for Google BigQuery Data Warehouse ETL

Hevo Data is a No-code Data Pipeline. It brings data from a wide array of data sources into Google BigQuery in real-time, without having to write any code. Hevo provides a seamless point-and-click interface to move data without having to do any heavy lifting. With its AI-powered algorithms, Hevo automatically converts source data types into BigQuery data types, thereby ridding you of any overhead in the process.

Let’s discuss some unbeatable features of Hevo:

  1. Fully Automated: Hevo can be set up in a few minutes and requires zero maintenance and management.
  2. Scalability: Hevo is built to handle millions of records per minute without any latency.
  3. Secure: Hevo offers two-factor authentication and end-to-end encryption so that your data is safe and secure.
  4. Fault-Tolerant: Hevo is capable of detecting anomalies in the incoming data and informs you instantly. All the affected rows are kept aside for correction so that it doesn’t hamper your workflow.
  5. Real-Time: Hevo provides real-time data migration. So, your data is always ready for analysis.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Sign up for a 14-day free trial here to explore Hevo.

Google BigQuery Data Warehouse Pricing Model

Category | Price | Note
Storage Cost | $0.020 per GB per month |
Query Cost | $5 per TB | 1st TB per month is free

Google BigQuery Storage Cost

  • Active – Monthly charge for stored data modified within the last 90 days.
  • Long-term – Monthly charge for stored data that has not been modified within 90 days. This rate is usually lower than the active storage rate.

Google BigQuery Query Cost

  • On-demand – Based on data usage (see the cost-estimation sketch after this list).
  • Flat rate – Fixed monthly cost, ideal for enterprise users.
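
Since on-demand pricing is driven by bytes scanned, a dry run can estimate a query's cost before you run it. This is a rough sketch using the Python client library, assuming the on-demand rate of $5 per TB quoted above.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
SELECT name, COUNT(1) AS num_repos
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
"""

# A dry run validates the query and reports bytes scanned without executing it.
job = client.query(sql, job_config=job_config)
tb_scanned = job.total_bytes_processed / 1024 ** 4
print(f"Estimated cost: ${tb_scanned * 5:.4f} (at $5 per TB scanned)")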

Free usage is available for the operations below:

  • Loading data (network pricing policy applicable in case of inter-region).
  • Copying data.
  • Exporting data.
  • Deleting datasets.
  • Metadata operations.
  • Deleting tables, views, and partitions.

You can read more about BigQuery Pricing here.

Google BigQuery Data Warehouse Maintenance

Google has managed to solve a lot of common data warehouse concerns by throwing orders of magnitude more hardware at the existing problems and thus eliminating them altogether. Unlike with Amazon Redshift, running VACUUM in Google BigQuery is not something you need to think about: Google BigQuery is specifically architected without the need for the resource-intensive VACUUM operation that is recommended for Redshift.

Keep in mind that, by design, Google BigQuery is append-only. This means that when planning to update or delete data, you'll need to truncate the entire table and recreate it with the new data.

However, Google has implemented ways in which users can reduce the amount of data processed:

  • Partition tables, and specify the partition date in queries so that only the relevant partitions are scanned (see the sketch after this list).
  • Use wildcard tables to shard data by an attribute.
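
As a sketch of the first option, a date-partitioned table can be created with the Python client library; the dataset, table, and column names here are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table partitioned by the `event_date` DATE column.
table = bigquery.Table(
    "my-project.analytics.events",
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)

client.create_table(table)
# Queries that filter on event_date now scan only the matching partitions.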

Google BigQuery Data Warehouse Security

The fastest hardware and most advanced software are of little use if you can’t trust them with your data. BigQuery’s security model is tightly integrated with the rest of Google’s Cloud Platform, so it is possible to take a holistic view of your data security. BigQuery uses Google’s Identity and Access Management (IAM) access control system to assign specific permissions to individual users or groups of users. BigQuery also ties in tightly with Google’s Virtual Private Cloud (VPC) policy controls, which can protect against users who try to access data from outside your organization, or who try to export it to third parties.

Both IAM and VPC controls are designed to work across Google cloud products, so you don’t have to worry that certain products create a security hole.

BigQuery is available in every region where Google Cloud has a presence, enabling you to process data in the location of your choosing. At the time of writing, Google Cloud has more than two dozen data centres around the world, and new ones are being opened at a fast rate. If you have business reasons for keeping data in the US, it is possible to do so: just create your dataset with the US region code, and all of your queries against the data will be done within that region.

Know more about Google BigQuery security from here.

Google BigQuery Data Warehouse Features

Some features of Google BigQuery Data Warehouse are listed below:

  • Just upload your data and run SQL.
  • No cluster deployment, no virtual machines, no setting keys or indexes, and no software.
  • Separate storage and computing.
  • No need to deploy multiple clusters and duplicate data into each one. Manage permissions on projects and datasets with access control lists. Seamlessly scales with usage.
  • Compute scales with usage, without cluster resizing.
  • Thousands of cores are used per query.
  • Deployed across multiple data centres by default, with multiple factors of replication to maximize data durability and service uptime.
  • Stream millions of rows per second for real-time analysis.
  • Analyse terabytes of data in seconds.
  • Storage scales to petabytes.

Know more about Google BigQuery features from here.

Google BigQuery Data Warehouse Interaction

-> Web User Interface

  • Run queries and examine results.
  • Manage datasets and tables.
  • Save queries and share across the organization for re-use.
  • Detailed Query history.

-> Visualize with Data Studio

  • View BigQuery results with charts, pivots and dashboards.

-> API

  • A programmatic way to access Google BigQuery (a minimal sketch follows below).
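
As a minimal illustration of programmatic access, the Python client library can list datasets and run queries; the project and credentials are assumed to be configured in the environment.

from google.cloud import bigquery

client = bigquery.Client()  # uses the project and credentials from the environment

# List the datasets in the current project.
for dataset in client.list_datasets():
    print(dataset.dataset_id)

# Run a query and print the results.
for row in client.query("SELECT 1 AS answer").result():
    print(row.answer)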

Service Limits for Google BigQuery

  • Concurrent rate limit for on-demand, interactive queries — 50.
  • Daily query size limit — Unlimited by default.
  • Daily destination table update limit — 1,000 updates per table per day.
  • Query execution time limit — 6 hours.
  • Maximum number of tables referenced per query — 1,000.
  • Maximum unresolved query length — 256 KB.
  • Maximum resolved query length — 12 MB.
  • Concurrent rate limit for on-demand, interactive queries against Cloud Bigtable external data sources — 4.

Integrating with Tensorflow

BigQuery has a new feature, BigQuery ML, that lets you create and use simple machine learning (ML) models as well as run deep learning predictions with TensorFlow models. This is the key technology for integrating the scalable data warehouse with the power of ML. It enables a variety of smart data analytics, such as logistic regression on a large dataset, or similarity search and recommendation on images, documents, products, or users by processing feature vectors of the contents. You can even run TensorFlow model predictions inside BigQuery.

Now, imagine what would happen if you could use BigQuery for deep learning as well. After your data scientists train a cutting-edge neural network model with TensorFlow or Google Cloud Machine Learning, you can move the model to BigQuery and execute predictions with it inside BigQuery. This means any employee in your company can use the power of BigQuery for their daily data analytics tasks, including image analytics and business data analytics on terabytes of data, processed in tens of seconds, solely in BigQuery and without any engineering knowledge.
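
As a sketch of BigQuery ML, a logistic regression model can be created and used for prediction entirely in SQL; the dataset, table, and column names below are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model on a hypothetical table of labelled visits.
client.query("""
CREATE OR REPLACE MODEL `my-project.analytics.purchase_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
SELECT country, pageviews, purchased
FROM `my-project.analytics.visits`
""").result()

# Score new rows with the trained model.
rows = client.query("""
SELECT country, predicted_purchased
FROM ML.PREDICT(
  MODEL `my-project.analytics.purchase_model`,
  (SELECT country, pageviews FROM `my-project.analytics.new_visits`))
""").result()

for row in rows:
    print(row.country, row.predicted_purchased)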

When to use Google BigQuery Data Warehouse?

You can use Google BigQuery Data Warehouse in the following cases:

  • Use it when you have queries that take more than five seconds to run in a relational database. The idea of BigQuery is running complex analytical queries, which means there is no point in running queries that do simple aggregation or filtering. BigQuery is suitable for “heavy” queries, those that operate over a big set of data. The bigger the dataset, the more you're likely to gain in performance by using BigQuery. The dataset that I used was only 330 MB (megabytes, not even gigabytes).
  • BigQuery is good for scenarios where data does not change often and you want to use the cache, as it has a built-in cache. What does this mean? If you run the same query and the data in the tables has not changed (been updated), BigQuery will just use the cached results and will not try to execute the query again. Also, BigQuery does not charge for cached queries (see the sketch after this list).
  • You can also use BigQuery when you want to reduce the load on your relational database. Analytical queries are “heavy”, and overusing them on a relational database can lead to performance issues, so you could eventually be forced to think about scaling your server. With BigQuery, you can move these heavy queries to a third-party service so they do not affect your main relational database.
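
A small sketch of the caching behaviour, using the Python client library; whether a result came from the cache can be read off the finished job.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT name, COUNT(1) AS num_repos
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
"""

job_config = bigquery.QueryJobConfig(use_query_cache=True)  # the default behaviour

job = client.query(sql, job_config=job_config)
job.result()

# True when BigQuery served the results from cache (no bytes billed).
print("Served from cache:", job.cache_hit)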

Conclusion

BigQuery is a sophisticated, mature service that has been around for many years. It is feature-rich, economical, and fast. BigQuery's integration with Google Drive and the free Data Studio visualization toolset is very useful for the comprehension and analysis of big data, and the service can process several terabytes of data within a few seconds. It is deployed across existing and future Google Cloud Platform (GCP) regions. Serverless is certainly the next best option to obtain maximum query performance with minimal infrastructure cost.

If you want to integrate your data from various sources and load it into Google BigQuery, then try Hevo.

Hevo is a No-code Data Pipeline. It supports pre-built integrations from 100+ data sources. You can easily connect any source to your Google BigQuery in real-time. It provides a reliable, secure and consistent solution to you at a reasonable price.

Are you ready to try Hevo? If yes, then sign up for a 14-day free trial today!

So, what are your thoughts on Google BigQuery? Let us know in the comments.
