What is Google BigQuery?
BigQuery is a cloud data warehouse run by Google. It is capable of analysing terabytes of data in seconds. If you know how to write SQL queries, you already know how to query it. In fact, there are plenty of interesting public datasets shared in BigQuery, ready for you to query.
You can access BigQuery by using the GCP console or the classic web UI, by using a command-line tool, or by making calls to the BigQuery REST API using client libraries for languages such as Java, .NET, or Python. There are also a variety of third-party tools that you can use to interact with BigQuery, for example to visualise or load data.
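As a rough sketch of the client-library route, the Python snippet below builds a query against one of the public datasets and shows how it could be run. It assumes the `google-cloud-bigquery` package is installed and that default credentials and a project are configured. The table `bigquery-public-data.usa_names.usa_1910_2013` and the helper names `build_top_names_query` and `run_query` are choices made for illustration only.

```python
def build_top_names_query(limit: int = 5) -> str:
    """Return a standard SQL query over a public dataset (illustrative choice)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT name, SUM(number) AS total "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
        "GROUP BY name "
        "ORDER BY total DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str) -> list:
    """Run the query with the Python client library and return rows as dicts.

    The import is lazy so this module loads even without the library installed.
    """
    from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

    client = bigquery.Client()          # picks up default credentials and project
    rows = client.query(sql).result()   # blocks until the query job finishes
    return [dict(row.items()) for row in rows]
```

Calling `run_query(build_top_names_query(3))` would return up to three rows with `name` and `total` keys, provided credentials are available.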
Why Google BigQuery?
Why did Google release BigQuery, and why would you use it instead of a more established data warehouse solution?
- Ease of Implementation: Building your own data warehouse is expensive, time-consuming, and difficult to scale. With BigQuery, you only need to load your data, and you pay only for what you use.
- Speed: BigQuery processes billions of rows in seconds and handles real-time analysis of streaming data.
Google BigQuery Architecture:
BigQuery is based on Dremel technology, which has been used inside Google for about 10 years.
Dremel: BigQuery requests are powered by the Dremel query engine. It takes more than just a lot of hardware to make queries run fast: Dremel dynamically apportions slots to queries on an as-needed basis, maintaining fairness amongst multiple users who are all querying at once. A single user can get thousands of slots to run their queries.
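To make the idea of apportioning slots fairly more concrete, here is a minimal sketch of max-min fair sharing in Python. This is a generic textbook technique used purely for illustration; it is not Dremel's actual scheduler, and the function name `fair_share` and the numbers are hypothetical.

```python
def fair_share(total_slots: int, demands: dict) -> dict:
    """Max-min fair allocation: no query gets more than it asked for, and
    leftover capacity is split evenly among queries that still want more."""
    allocation = {query: 0 for query in demands}
    remaining = total_slots
    unsatisfied = {q for q, wanted in demands.items() if wanted > 0}
    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            break  # fewer slots left than waiting queries; stop for simplicity
        for query in sorted(unsatisfied):
            grant = min(share, demands[query] - allocation[query])
            allocation[query] += grant
            remaining -= grant
        unsatisfied = {q for q in unsatisfied if allocation[q] < demands[q]}
    return allocation


# Three concurrent queries compete for 2,000 slots.
print(fair_share(2000, {"a": 300, "b": 1500, "c": 1500}))
# → {'a': 300, 'b': 850, 'c': 850}
```

The small query is fully served, and the two large queries split what is left evenly.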
Colossus: BigQuery relies on Colossus, Google’s latest generation distributed file system. Each Google data centre has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash) and distributed management.
Jupiter Network: It is the internal data centre network that allows BigQuery to separate storage and compute.
- Columnar storage.
- Nested/Repeated fields.
- No indexes: each query performs a single full table scan.
- Queries are executed using a tree architecture.
- Queries are executed on tens of thousands of machines over Google’s fast network.
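The columnar-storage and full-scan points in the list above can be illustrated with a toy Python example. It is only a sketch of the general idea (one list per column, and a scan that reads just the columns a query names); the sample rows and helper names are made up, and this is not how BigQuery stores data internally.

```python
rows = [  # hypothetical row-oriented records
    {"user": "ann", "country": "DE", "amount": 12.5},
    {"user": "bob", "country": "US", "amount": 7.0},
    {"user": "cho", "country": "US", "amount": 3.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_where(columns, value_col, filter_col, wanted):
    """Scan every row (no index), but read only the two columns the query names."""
    total = 0.0
    for value, key in zip(columns[value_col], columns[filter_col]):
        if key == wanted:
            total += value
    return total


columns = to_columns(rows)
us_total = sum_where(columns, "amount", "country", "US")
print(us_total)  # → 10.5
```

The `user` column is never touched by the query, which is the main saving a columnar layout offers for analytical workloads.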
Google BigQuery Comparison with MapReduce and NoSQL:
MapReduce vs. Google BigQuery:
NoSQL Data store vs. BigQuery:
Google BigQuery Comparison with Redshift and Snowflake:
|Name||Amazon Redshift||Google BigQuery||Snowflake|
|Description||Large scale data warehouse service for use with business intelligence tools||Large scale data warehouse service with append-only tables||Cloud-based data warehousing service for structured and semi-structured data|
|Primary database model||Relational DBMS||Relational DBMS||Relational DBMS|
|APIs and other access methods||JDBC||RESTful HTTP/JSON API||CLI Client|
|User concepts||Fine-grained access rights according to SQL-standard||Access privileges (owner, writer, reader) for whole datasets, not for individual tables||Users with fine-grained authorization concept, user roles and pluggable authentication|
Some Important Considerations about this Comparison:
- If you have a reasonable volume of data, say dozens of terabytes, that you rarely query, and response times of up to a few minutes are acceptable when you do, then Google BigQuery is an excellent candidate for your scenario.
- If you need to analyse a large amount of data (e.g. up to a few terabytes) by running many queries that should each be answered very quickly, and you don’t need to keep the data available once the analysis is done, then an on-demand cloud solution like Amazon Redshift is a great fit. But keep in mind that, unlike Google BigQuery, Redshift needs to be configured and tuned in order to perform well.
- BigQuery is good enough if you do not need to take the speed of data updates into account. Compared to Redshift, Google BigQuery only supported hourly syncs as its fastest update frequency in our setup. This made us choose Redshift, as we needed a solution that supports close to real-time data integration.
Google BigQuery Working:
BigQuery is a data warehouse, implying a degree of centralization. The query we demonstrated in the previous section was applied to a single dataset. However, the benefits of BigQuery become even more apparent when we do joins of datasets from completely different sources or when we query against data that is stored outside BigQuery.
If you’re a power user of Sheets, you’ll probably appreciate the ability to do more fine-grained research with data in your spreadsheets. It’s a sensible enhancement for Google to make, as it unites BigQuery with more of Google’s own existing services. Previously, Google made it possible to analyse Google Analytics data in BigQuery.
These sorts of integrations could make BigQuery a better choice in the market for cloud-based data warehouses, which is increasingly how Google has positioned BigQuery. Public cloud market leader Amazon Web Services (AWS) has Redshift, but no widely used tool for spreadsheets. Microsoft Azure’s SQL Data Warehouse, which has been in preview for several months, does not currently have an official integration with Microsoft Excel, surprising though it may be.
Google BigQuery ETL / Data Load:
There are various approaches to loading data into BigQuery. If you are moving data from Google applications such as Google Analytics or Google AdWords, Google provides a robust BigQuery Data Transfer Service. This is Google’s own intra-product data migration tool.
Loading data from other sources (databases, cloud applications and more) can be accomplished by deploying engineering resources to write custom scripts. The broad steps are to extract data from the source, transform it into a format that BigQuery accepts, upload it to Google Cloud Storage (GCS) and finally load it into Google BigQuery from GCS.
A word of caution, though: writing custom scripts to move data to Google BigQuery is both complex and cumbersome. A third-party data pipeline platform such as Hevo can make this a hassle-free process for you.
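The transform step of such a custom script might look roughly like the Python sketch below. It converts source records to newline-delimited JSON, which is one of the formats BigQuery load jobs accept. The sample records, the key-normalisation rule and the function names are assumptions made for illustration; the upload to GCS and the load job itself are only indicated in comments.

```python
import json
from datetime import date, datetime


def transform(record: dict) -> dict:
    """Normalise one source record: snake_case keys, ISO-formatted dates."""
    out = {}
    for key, value in record.items():
        if isinstance(value, (date, datetime)):
            value = value.isoformat()
        out[key.strip().lower().replace(" ", "_")] = value
    return out


def to_ndjson(records) -> str:
    """Serialise records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(transform(r), sort_keys=True) for r in records)


source_rows = [  # hypothetical extract from a source database
    {"Order ID": 1, "Created": date(2019, 1, 31), "Total": 19.99},
    {"Order ID": 2, "Created": date(2019, 2, 1), "Total": 5.00},
]
payload = to_ndjson(source_rows)
print(payload)
# Next steps (not shown): upload `payload` to a GCS bucket, then start a
# BigQuery load job that points at the uploaded file.
```

Even this small example hints at why hand-written pipelines grow quickly: type handling, naming rules and error cases all have to be maintained per source.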
Hevo for Google BigQuery ETL:
Hevo Data brings data from a wide array of data sources into Google BigQuery in real-time, without you having to write any code. Hevo provides a seamless point-and-click interface to move data without any heavy lifting. With its AI-powered algorithms, Hevo automatically converts source data types into BigQuery data types, thereby ridding you of any overhead in the process.
Google BigQuery Pricing Model:
|Storage Cost||$0.020 per GB per month|
|Query Cost||$5 per TB||1st TB per month is free|
Google BigQuery Storage Cost:
- Active – Monthly charge for stored data modified within the last 90 days.
- Long-term – Monthly charge for stored data that has not been modified within the last 90 days. This rate is usually lower than the active rate.
Google BigQuery Query Cost:
- On-demand – Based on data usage.
- Flat rate – Fixed monthly cost, ideal for enterprise users.
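For a quick back-of-the-envelope estimate, the on-demand figures quoted in the pricing table can be combined as in the Python sketch below. It uses only the rates stated in this article ($0.020 per GB per month for storage, $5 per TB queried, first TB per month free) and deliberately ignores the long-term storage discount and flat-rate pricing, since no figures are given for them here. The function name is hypothetical.

```python
STORAGE_USD_PER_GB_MONTH = 0.020  # storage rate from the pricing table
QUERY_USD_PER_TB = 5.0            # on-demand query rate from the pricing table
FREE_QUERY_TB_PER_MONTH = 1.0     # first TB queried each month is free


def monthly_cost(stored_gb: float, queried_tb: float) -> float:
    """Estimate a monthly bill in USD from stored GB and TB scanned by queries."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    billable_tb = max(0.0, queried_tb - FREE_QUERY_TB_PER_MONTH)
    return round(storage + billable_tb * QUERY_USD_PER_TB, 2)


print(monthly_cost(stored_gb=500, queried_tb=3))  # → 20.0
```

Here 500 GB of storage costs $10 and the two billable terabytes of queries cost another $10.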
The following operations are free of charge:
- Loading data (network pricing policy applicable in case of inter-region).
- Copying data.
- Exporting data.
- Deleting datasets.
- Metadata operations.
- Deleting tables, views, and partitions.
Google BigQuery Maintenance:
Google has managed to solve a lot of common data warehouse concerns by throwing orders of magnitude more hardware at the existing problems, thus eliminating them altogether. Unlike Amazon Redshift, Google BigQuery has no VACUUM to run. It is specifically architected so that the resource-intensive VACUUM operation recommended for Redshift is not needed.
Keep in mind that Google BigQuery was designed around append-only storage. Originally, updating or deleting data meant truncating the entire table and recreating it with the new data. BigQuery’s standard SQL does support DML statements such as UPDATE and DELETE, but the service is optimized for bulk loads and appends rather than frequent row-level changes.
However, Google has implemented ways in which users can reduce the amount of data processed:
- Partition tables and specify the partition date in queries.
- Use wildcard tables to shard data by an attribute.
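The effect of filtering on a partition date can be pictured with a small Python model. The partition sizes below are invented, and the helper `bytes_scanned` is a simplification for illustration rather than BigQuery's real accounting.

```python
from datetime import date

# Hypothetical table: bytes stored per daily partition.
partitions = {
    date(2019, 3, 1): 40_000_000_000,
    date(2019, 3, 2): 42_000_000_000,
    date(2019, 3, 3): 38_000_000_000,
}


def bytes_scanned(table, start=None, end=None):
    """Sum the size of the partitions a query must read for a date range."""
    return sum(
        size
        for day, size in table.items()
        if (start is None or day >= start) and (end is None or day <= end)
    )


full_scan = bytes_scanned(partitions)
one_day = bytes_scanned(partitions, date(2019, 3, 2), date(2019, 3, 2))
print(full_scan, one_day)  # → 120000000000 42000000000
```

Because on-demand queries are billed by data processed, restricting a query to one day's partition in this model cuts the scanned volume to roughly a third.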
Google BigQuery Security:
The fastest hardware and most advanced software are of little use if you can’t trust them with your data. BigQuery’s security model is tightly integrated with the rest of Google’s Cloud Platform, so it is possible to take a holistic view of your data security. BigQuery uses Google’s Identity and Access Management (IAM) access control system to assign specific permissions to individual users or groups of users. BigQuery also ties in tightly with Google’s Virtual Private Cloud (VPC) policy controls, which can protect against users who try to access data from outside your organization, or who try to export it to third parties. Both IAM and VPC controls are designed to work across Google cloud products, so you don’t have to worry that certain products create a security hole.
BigQuery is available in every region where Google Cloud has a presence, enabling you to process the data in the location of your choosing. At the time of writing, Google Cloud has more than two dozen data centres around the world, and new ones are being opened at a fast rate. If you have business reasons for keeping data in the US, it is possible to do so. Just create your dataset with the US region code, and all of your queries against the data will be run within that region.
- Just upload your data and run SQL.
- No cluster deployment, no virtual machines, no setting keys or indexes, and no software.
- Separate storage and computing.
- No need to deploy multiple clusters and duplicate data into each one. Manage permissions on projects and datasets with access control lists. Seamlessly scales with usage.
- Compute scales with usage, without cluster resizing.
- Thousands of cores are used per query.
- Deployed across multiple data centres by default, with multiple factors of replication to maximize data durability and service uptime.
- Stream millions of rows per second for real-time analysis.
- Analyse terabytes of data in seconds.
- Storage scales to Petabytes.
Google BigQuery Interaction:
Web User Interface:
- Run queries and examine results.
- Manage datasets and tables.
- Save queries and share across the organization for re-use.
- Detailed Query history.
- View BigQuery results with charts, pivots and dashboards.
- A programmatic way to access Google BigQuery.
Service Limits for Google BigQuery:
- Concurrent rate limit for on-demand, interactive queries: 50.
- Daily query size limit: unlimited by default.
- Daily destination table update limit: 1,000 updates per table per day.
- Query execution time limit: 6 hours.
- Maximum number of tables referenced per query: 1,000.
- Maximum unresolved query length: 256 KB.
- Maximum resolved query length: 12 MB.
- Concurrent rate limit for on-demand, interactive queries against Cloud Bigtable external data sources: 4.
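Some of these limits can be checked on the client before a query is submitted. The Python sketch below does this for the unresolved query length and the number of referenced tables, using the figures from the list above. It assumes that 1 KB means 1,024 bytes and that length is measured on the UTF-8 encoded text; both are assumptions, and `check_query` is a hypothetical helper.

```python
MAX_UNRESOLVED_QUERY_BYTES = 256 * 1024  # 256 KB limit (assuming 1 KB = 1,024 bytes)
MAX_TABLES_PER_QUERY = 1000              # tables referenced per query


def check_query(sql: str, referenced_tables: int) -> list:
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    size = len(sql.encode("utf-8"))  # assumption: limit applies to UTF-8 bytes
    if size > MAX_UNRESOLVED_QUERY_BYTES:
        problems.append(
            f"query text is {size} bytes, limit is {MAX_UNRESOLVED_QUERY_BYTES}"
        )
    if referenced_tables > MAX_TABLES_PER_QUERY:
        problems.append(
            f"query references {referenced_tables} tables, limit is {MAX_TABLES_PER_QUERY}"
        )
    return problems


print(check_query("SELECT 1", referenced_tables=1))  # → []
```

A pre-flight check like this is mostly useful for generated SQL, where very long `IN (...)` lists or wide wildcard-table expansions can exceed the limits unnoticed.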
Integrating with TensorFlow:
BigQuery has a new feature, BigQuery ML, that lets you create and use simple machine learning (ML) models as well as run deep learning predictions with TensorFlow models. This is the key technology for integrating the scalable data warehouse with the power of ML. The solution enables a variety of smart data analytics, such as logistic regression on a large dataset, or similarity search and recommendation on images, documents, products or users by processing feature vectors of the contents. You can even run TensorFlow model prediction inside BigQuery.
Now, imagine what would happen if you could use BigQuery for deep learning as well. After data scientists have trained a cutting-edge neural network model with TensorFlow or Google Cloud Machine Learning, you can move the model to BigQuery and execute predictions with it inside BigQuery. This means you can let any employee in your company use the power of BigQuery for their daily data analytics tasks, including image analytics and business data analytics on terabytes of data, processed in tens of seconds, solely on BigQuery and without any engineering knowledge.
When to use BigQuery?
- Use it when you have queries that take more than five seconds to run in a relational database. BigQuery is designed for complex analytical queries, so there is little point in using it for queries that do simple aggregation or filtering. BigQuery is suitable for “heavy” queries, those that operate on a big set of data. The bigger the dataset, the more performance you are likely to gain by using BigQuery. The dataset that I used was only 330 MB (megabytes, not even gigabytes).
- BigQuery is good for scenarios where data does not change often and you want to use caching, as it has a built-in cache. What does this mean? If you run the same query and the data in the tables has not changed (been updated), BigQuery will simply return the cached results and will not execute the query again. Also, BigQuery does not charge for queries served from the cache.
- You can also use BigQuery when you want to reduce the load on your relational database. Analytical queries are “heavy” and overusing them under a relational database can lead to performance issues. So, you could eventually be forced to think about scaling your server. However, with BigQuery you can move these running queries to a third-party service, so they would not affect your main relational database.
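The caching behaviour described in the list above can be modelled in a few lines of Python. This is a toy model that keys results on the query text and a table version counter; it is meant to illustrate the rule "same query plus unchanged data means a cache hit", not to reproduce BigQuery's actual cache. All class and method names are hypothetical.

```python
import hashlib


class CachedWarehouse:
    """Toy model: identical query text + unchanged table version => cache hit."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that really executes the SQL
        self._cache = {}
        self.table_version = 0
        self.executions = 0          # how many times the backend was hit

    def mark_table_updated(self):
        """Simulate a data change; earlier cached results no longer match."""
        self.table_version += 1

    def query(self, sql: str):
        digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        key = (digest, self.table_version)
        if key not in self._cache:
            self.executions += 1
            self._cache[key] = self._run_query(sql)
        return self._cache[key]


warehouse = CachedWarehouse(run_query=lambda sql: f"result of {sql}")
warehouse.query("SELECT 1")
warehouse.query("SELECT 1")        # served from the cache
warehouse.mark_table_updated()
warehouse.query("SELECT 1")        # data changed, so the query runs again
print(warehouse.executions)        # → 2
```

Three identical requests result in only two real executions, which is the saving a result cache gives for dashboards and other repeated queries over slowly changing data.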
BigQuery is a sophisticated, mature service that has been around for many years. It is feature-rich, economical and fast, and can process several terabytes of data within a few seconds. Its integration with Google Drive and the free Data Studio visualization toolset is very useful for understanding and analysing big data. The service is deployed across existing Google Cloud Platform (GCP) regions and is expected to reach future ones. Serverless is certainly a strong option for obtaining high query performance with minimal infrastructure cost.