Advanced Google BigQuery Random Sampling Simplified 101

on bigquery datasets, Data Aggregation, Data Analytics, Data Engineering, Data Integration, Data Warehouse, Google BigQuery, SQL • September 23rd, 2021 • Write for Hevo

In the Data Management and Analysis Industry, many new platforms and applications have appeared in recent years. At present, most businesses turn to Cloud-based solutions to collect, store and interact with an extensive amount of data. One famous Cloud-based Data Warehouse solution is Google BigQuery.

You can query random chunks of data from huge Google BigQuery tables using BigQuery Random Sampling. Sampling returns a diverse set of records without incurring the costs of scanning and processing a complete table.

Google BigQuery is a popular Cloud-based Data Warehouse solution that offers scalability, simplicity, and great abstraction for its users. Unlike the competitors, BigQuery is a serverless platform, and it provides services dynamically, so it does not require users to provide hardware or hardware management. SQL, a specialized query language designed specifically for the Database Management System, is the base for querying on BigQuery Random Sampling

This article is a comprehensive guide on BigQuery Random Sampling. You will learn about Random Sampling, its benefits, and various ways to perform BigQuery Random Sampling.

Table of Contents

Introduction to Google BigQuery

 BigQuery Random Sampling - Google BigQuery logo
Image Source

Google BigQuery is a Cloud-based Data Warehouse, offering fast SQL queries and interactive analysis of massive datasets. BigQuery is designed on Google Dremel technology and can process read-only data effectively. The Google BigQuery platform uses a column storage paradigm that allows a much faster data scan and the tree architecture model, making the query and aggregate results significantly more accessible and efficient. In addition, BigQuery is a serverless platform, and it is built to be very scalable thanks to the fast deployment cycle and on-demand prices.

Here are a few pricing models of Google BigQuery:

1) On-Demand Analysis

This pricing model offers free 1TB analysis per month, after which users are billed based on the use (byte processed per query). The rate after free analysis is $5.00 per TB. Users can query on data stored in Google BigQuery or external sources such as Cloud Storage, Google Drive, or Bigtable. 

2) Flat-Rate Pricing

In this model, users can pay a fixed amount to buy slots instead of paying per byte used. The pricing option varies based on how long the service is intended to be used:

  • Flex: Initial 60 seconds
  • Monthly: initial 30 days
  • Annual: Initial 365 days

As the commitment period increases, the cost per unit time reduces. For example, a month’s cost for 100 slots with a monthly commitment will be $2000, while the same thing as the annual commitment will be $1700.

Understanding Random Sampling

Advanced BigQuery Random Sampling - Random Sampling
Image Source

Random sampling is part of the sampling technique where each sample has the same probability of being selected. Randomly selected samples are intended as representations that do not side with the total population. If for, several reasons, the sample does not represent the population, variations are called Sampling Errors.

BigQuery Random Sampling is one of the most simple forms of data collection. Each subset has the same opportunity to be chosen as part of the sampling process in BigQuery Random Sampling. For example, in an organization of 300 employees, the population is the total number of employees in the company, and a group of 30 employees required to complete a survey will be the sample. Every individual has the same probability of entering the sample group. However, there is always the possibility that the group or sample does not represent the population. Here, every random variation is called a Sampling Error.

1) Simple Random Sampling

Simple Random Sampling is a random selection of a small segment of individuals or members of all populations. It gives every individual or population member the same and fair probability to be chosen. It’s the most convenient and simple sample selection technique.

2) Systematic Sampling

Systematic Sampling is the selection of individuals or certain members of the entire population. Selection often follows the Specified Interval (K). The Systematic Sampling method is proportional to the Simple Random Sampling method, just less complicated. 

3) Stratified Sampling

Stratified Sampling partitions a population into subclasses with notable distinctions and variances. This Sampling method is helpful because it allows researchers to make more reliable and informed conclusions by confirming that each of their respective subclasses has been adequately represented in the selected sample.

4) Cluster Sampling

Cluster Sampling, which is like the Stratified Sampling method, includes dividing the population into subclasses. Each subclass must describe the characteristics that are comparable to all selected samples. This method requires Random Selection of all sub-class, compared to the sampling of members of each subclass. This method is ideal for studies involving a widespread population.

Simplify BigQuery ETL Using Hevo’s No-code Data Pipeline

Hevo Data helps you directly transfer data from 100+ data sources (including 30+ free sources) to Business Intelligence tools, Data Warehouses such as Google BigQuery, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Understanding Google BigQuery SQL

SQL (Structured Query Language) is a specialized query language domain to manage data in RDBMS (Relational Database Management System) or Data-Warehouses such as Google BigQuery

The foundations of SQL are Relational Algebra and Tuple Relational Calculus. SQL was also one of the first commercial languages ​​to employ a Relational Model proposed by Edgar F. Codd. At present, SQL is not only used by database administrators but also by developers who make data integration scripts and Data Analysts preparing and running analytical queries. 

There are four sub-languages in SQL:

  • Data Definition Language (DDL): The Database Structure or Scheme is defined using the Data Definition language. This statement determines the details of implementing a Database Schema that is usually hidden from the user.
  • Data Query Language (DQL): DQL statement is used to ask for Data Stored in the Schema Object. The aim of the DQL command is to get the Scheme Relationship based on the query provided for it.
  • Data Manipulation (DML): DML statement is used to manage the data contained in the schema object.
  • Data Control Language (DCL): DCL statement is used to control access to the data contained in the database (authorization).

Here are a few advantages of Table Sampling:

1) Data Analysis with Less Error Risk

Random Sampling allows researchers to carry out Data Analysis on collected data with lower error margins. The entire process is randomized, and the random samples reflect the entire population. This allows data analysis to provide accurate insight into the concerned subject matters in BigQuery Random Sampling.

2) Less Bias

As individuals who form a subset of larger groups are randomly selected, each individual in the large dataset has the same probability of being chosen. The randomness of selection creates a balanced subset that brings the most significant individual to represent a larger group as a whole.

3) Simplicity

As the name suggests, producing Simple Random Samples is far more complicated than other methods, such as Stratified Random Sampling. As mentioned, individuals on the subset are selected randomly, and there are no additional steps in BigQuery Random Sampling.

Selecting a Randomly Distributed Sample from BigQuery Tables

Here are the two methods to select tables in BigQuery Random Sampling:

1) Use the RAND Function

Before TABLESAMPLE is added, the RAND function is used as an alternative to retrieve a Random Sample subset. The querying cost is enormous as Google BigQuery will scan the entire table to generate one random number for each record.

2) Use the TABLESAMPLE Clause

You can use the TABLESAMPLE clause to select a Pseudo-Random Number of Rows from a table. It is based upon a percentage or several rows and an optional seed number—if you need repeatable results.

BigQuery Random Sampling - TableSample with 10 %
Image Source

Unlike the ‘LIMIT’ clause, TABLESAMPLE returns a random subset of data from a table. Since Google BigQuery does not cache the results of queries that include a TABLESAMPLE clause, the query might return different results each time. You can combine selection conditions with the TABLESAMPLE clause.

The following example samples about 50% of the table and then applies a WHERE clause:

BigQuery Random Sampling - TableSample with 50%
Image Source

The next example combines a TABLESAMPLE clause with a JOIN clause:

BigQuery Random Sampling - TableSample with Join
Image Source

You’ll receive an empty result when none of the sampled rows of two joined samples meets the join condition. You can specify the percentage as a query parameter. 

You can see how to pass the percentage to a query by using the ‘bq’ command-line tool:

BigQuery Random Sampling - Bq Command.
Image Source

The TABLESAMPLE clause randomly selects a percentage of data blocks (Google BigQuery organizes tables into data blocks) from the table and reads all the rows in the selected blocks. The number of data blocks limits the Sampling Granularity.

Whenever tables or table partitions are larger than about 1 GB, Google BigQuery splits them into Blocks. Smaller tables might comprise a Single Data Block. In such a scenario, the TABLESAMPLE clause reads the entire table. If the table is empty and the sampling percentage is greater than zero, table sampling always returns some results.

The Data Blocks can be of different sizes, so the exact fraction of sampled rows might vary. You can sample individual rows by using the WHERE rand() < K clause instead of the TABLESAMPLE clause.

However, Google BigQuery will have to scan the entire table with the WHERE rand() < K clause, increasing your cost. You can work in your budget and still benefit from Row-Level Sampling by combining both techniques.

The following example combines both the techniques where the query reads approximately 20% of the data blocks and then randomly selects 10% of the rows from those blocks:

BigQuery Random Sampling - TableSample with Rand() < K
Image Source

This is how Google BigQuery Random Sampling is done!

Conclusion

Google BigQuery helps companies process and analyze their data faster and more effectively. Google BigQuery is a scalable platform that’ll meet your organization’s needs. Google BigQuery Random Sampling can quickly run queries and churn through terabytes of data with no need to worry about clusters, servers, instances, or virtual machines.

Visit our Website to Explore Hevo

Businesses can use automated platforms like Hevo Data to setup the integration and handle the ETL process. It helps you directly transfer data from a source of your choice to a Data Warehouse such as Google BigQuery, Business Intelligence tools, or any other desired destination in a fully automated and secure manner without having to write any code and will provide you a hassle-free experience.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Advanced BigQuery Random Sampling in the comments section below!

No-code Data Pipeline for your Data Warehouse