A BigQuery Dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one BigQuery dataset before loading data into BigQuery.

What is BigQuery?

BigQuery dataset : BigQuery Logo

BigQuery is an enterprise data warehouse, provided as a SAAS offering by Google, that allows storing and querying huge volumes of data. 

Google BigQuery is a Google Cloud Datawarehouse. It can analyze gigabytes of data in a matter of seconds. You already know how to query SQL if you know how to write SQL Queries. In fact, BigQuery contains a wealth of intriguing public data sets that are waiting to be queried.

You can use the GCP console or the standard web UI to access BigQuery, or you can use a command-line tool or make calls to the BigQuery Rest API using a choice of Client Libraries such as Java,.Net, or Python.

You can also interact with BigQuery using a range of third-party applications, such as displaying the data or loading the data. It is primarily used to analyze petabytes of data using ANSI SQL at blazing-fast speeds.  

Key Features

Why did Google introduce BigQuery, and why would you choose it over a more well-known data warehouse solution?

  • Ease of Use: Creating your own is costly, time-consuming, and difficult to scale. BigQuery requires you to load data first and only pay for what you need.
  • Speed: Process billions of rows in seconds and analyze streaming data in real-time.

What is a BigQuery Dataset?

DataQuery Dataset: BigQuery Dataset
DataQuery Dataset: BigQuery Dataset

Just like a database contains your data tables/views/procedures, on the Google Cloud platform there is the concept of a BigQuery Dataset to hold your tables and views. 

Any data that you wish to use, any table or view, must belong to a BigQuery dataset. 

Data is first loaded into Bigquery, from locations like Cloud Storage bucket or other datasets, and then queries are run on it. 

Location and billing nuances for BigQuery Datasets

At the core of BigQuery’s locational concepts are these 2 visualizations:- 

  * Region – a region is a specific geographic location, like New York, where the bare metal of the BigQuery infrastructure is placed. 

  * Multi-Region – is a collection of two or more regions, A multi-region may sometimes allow sharing of resources. 

The thumb rule to follow is that your Cloud Storage bucket containing the data you wish to load into BigQuery must be in a regional or multi-regional bucket in the same location as your Bigquery dataset. A dataset is created in a geographical location and always exists there only, you can only create a copy in another location. 

All queries running in a geographical location will use the datasets stored on its location only. In other words, if your dataset resides in location A, all queries that reference this dataset will run on location A only. 

There can be a scenario where a query does not refer to any dataset anywhere but has a destination specified at location A, in this case also, the query will run on location A only. Also, If a query does not reference any tables or other resources contained within any datasets, and no destination table is provided, the query job will run in the US multi-region. 

Even if you have a flat rate reservation in a region other than the US unless you specifically choose the flat rate location, queries like the one above, will be run in US multi-region only. 

Hence, the location of your BigQuery datasets being read or written to must be the same, as the location where the query is being run. 

Exception: If and only if your dataset is in the US multi-regional location, you can load as well as export data from a Cloud Storage bucket in any regional or multi-regional location. 

You can choose the location where you want your query to run by doing one of the following:- 

  • Using the Cloud Console, click More > Query settings, and for Processing Location, click Auto-select and choose your data’s location. 
  • Using the QB command-line tool, supply the –location global flag and set the value to your desired location. 
  • Using the API, go to job-resources, in the job reference section specify your region in the location property. 

The above information is useful in minimizing your costs while keeping your queries optimal.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.

Start for free now!

Get Started with Hevo for Free

Usage Quotas and Limits in BigQuery Datasets

BigQuery dataset: quotas and limits | Hevo Data
BigQuery dataset: quotas and limits


BigQuery runs on Google Cloud infrastructure and uses bare metal hardware like CPU cycles/RAM, software like OS/Schedulers, etc., and network resources. 

There are certain limits and Quotas on how much/many resources your cloud project can use. This is done to ensure fair sharing of resources and reduce spikes in usage, which in turn, can cause your and others’ queries to fail or get stuck. 

The good thing is that Google cloud has a fair Quota replenishment policy to ensure smoother operations and avoid long disruptions. Daily quotas are replenished regularly throughout a 24-hour period, intermittent refresh is also done when your quotas get exhausted. 

Datasets, Exporting and Copying

1Maximum number of tables in the source dataset20,000 tables
2Maximum number of tables that can be copied per run to a destination dataset in the *same region20,000 tables
3Maximum number of tables that can be copied per run to a destination dataset in a *different region1000 tables


Though there can be an Unlimited number of Datasets and an Unlimited number of tables inside a dataset, the following restrictions do apply.
1Number of authorized views in a dataset’s access control list2,500 authorized views
2Number of dataset update operations per dataset per 10 seconds5 operations
4Maximum row size(based on the internal representation of row data)100 MB
5Maximum columns in a table, query result, or view definition10,000 columns

To avoid disruptions and surprises due to quota violations, one should keep the following in mind.  

Though there can be Unlimited DML(Data Manipulation Language) statements per day, some limits are imposed by statement type/actions. 
1Concurrent mutating DML (UPDATE, DELETE, and MERGE)  statements per table2 statements
2Queued mutating DML statements per table   20 statements
3Query usage per user/per dayUnlimited
4Cloud SQL federated query cross-region bytes per day ( applies only to cross-region queries where BigQuery query processing location and the Cloud SQL instance location are different)1 TB
5The rate limit for concurrent interactive queries       100 queries
6The concurrent rate limit for interactive queries against Cloud Bigtable external data sources4 queries 
7Query/script execution-time limit6 hours
8Maximum number of resources referenced per query   1,000 resources
9Maximum number of Standard SQL query parameters10,000 parameters
10Maximum number of row access policies per table/per query 100 policies 

The above information will be useful when you lay your hands to actually code your programs/functions. 

If the program uses Partitioned tables and BigQuery APIs, below mentioned are some important quotas 

1Maximum number of modifications per partition per day ( appends or truncates)1,500 operations
2Maximum number of partitions per partitioned table4,000 partitions
3Maximum number of API requests per second per user100 requests
4Maximum number of concurrent API requests per user300 requests
5Maximum jobs.get requests per second   1,000 requests
6Maximum jobs.query response size10BM
7Maximum number of tabledata.list requests per second1,000 requests
8Read data plane/control plane requests per minute per user5,000 requests
9Maximum tables.insert requests per second10 requests
For UDFs (user-defined functions)
1The maximum amount of data that your JavaScript UDF can output 5MB 
2Maximum concurrent legacy SQL queries with Javascript UDFs6 queries
3Maximum number of arguments in a UDF256 arguments 
4Maximum number of fields in STRUCT type arguments or output per UDF1,024 fields 
5Maximum update rate per UDF per 10 seconds5 updates

Using the above information and a version control system like GIT, you can keep your UDFs optimal and consistent. 

Access control permissions in BigQuery Datasets

A dataset would contain tables/views/data.
Bigquery allows you to define fine-grained policies and permissions for access to these resources. Dataset-level permissions determine the users, groups, and service accounts allowed to access the tables, views, and table data in a specific dataset. 

If a user/administrator wants to control the access to a dataset, he must have the bigquery.datasets.update, roles/bigquery.dataOwner and bigquery.datasets.get permissions. Once you have these permissions you can cherry-pick other users and grant them selective access to this dataset’s resources. 

e.g. To allow a user to create/update/delete tables and views in a dataset, he must be granted the bigquery.dataOwner IAM role. 

You can either use the coarse-grained Basic roles ( owner/editor/viewer) OR create custom roles where each role is a set of permissions over a set of objects. Roles can be applied at the dataset level, jobs and reservations level, table/view level and finally trickling down to row-level access control.  

The above permission can be specified by using the Cloud Console/GRANT and REVOKE DCL statements/bq update command/datasets.patch API method. 

Best practices to follow while using BigQuery datasets

BigQuery dataset: Best practices | Hevo Data
BigQuery dataset: Best practices
  • If your source data evolves all the time, it will mostly, enable the schema_update_option parameter in your API calls, to facilitate the auto schema evolution capability. 

Using this feature you can augment/merge/coalesce old tables to generate newer ones that better consume your incoming data.  

  • If you’re using partitioned tables and your data storage is large, store the result of your related subquery in a variable and use this variable to filter the partition column. 
  • Use the data pre-processing features of Cloud Data Prep and Cloud Dataflow, to fine-tune your data before loading it into Bigquery. 
  • All your data in BigQuery is encrypted, you can use digital signatures and timestamping to add non-repudiation and trust to your data. 

The Google HMS ( Hardware Security Module) lets you store your unique digital identity and signing certificates in a secure fashion. 

Also, you can use the Google KMS( Key Management) service to load and use your own keys for encryption. 

Conclusion

This article gave a comprehensive guide on Google BigQuery Datasets, explaining various parameters that need to be checked while utilizing the Datasets.

While using BigQuery Dataset is insightful, it is a hectic task to Set Up the proper environment. To make things easier, Hevo comes into the picture. Hevo Data is a No-code Data Pipeline and has awesome 150+ pre-built Integrations that you can choose from.

Hevo can help you Integrate your data from numerous sources and load them into a destination like BigQuery to Analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about the BigQuery Dataset in the comments section below

Pratik Dwivedi
Technical Content Writer, Hevo Data

Pratik Dwivedi is a seasoned author specializing in data industry topics, including data analytics, machine learning, AI, big data, and business intelligence. With over 18 years of experience in system analysis, design, and implementation, including 8 years in a Techno-Managerial role, Pratik has successfully managed international clients and led small to medium-sized teams and projects. He excels in creating engaging content that informs and inspires.

No-code Data Pipeline For BigQuery