BigQuery Tutorial for Beginners: A Comprehensive Guide 101

Tutorials • October 14th, 2020

This article provides a comprehensive BigQuery Tutorial. You will gain a holistic understanding of Google BigQuery, its key features, its architecture, the ways to access your BigQuery data, and how BigQuery stores data. Read along to find in-depth information in this BigQuery Tutorial.

What is Google BigQuery?

BigQuery Tutorial: BigQuery Logo.

Google BigQuery is Google Cloud Platform’s enterprise data warehouse for analytics. It performs well even when analyzing huge amounts of data and meets your Big Data processing requirements with offerings such as exabyte-scale storage and petabyte-scale SQL queries. It is a serverless Software as a Service (SaaS) offering that supports querying using ANSI SQL and has built-in machine learning capabilities.

Key Features of Google BigQuery

BigQuery Tutorial: Key Features of Google BigQuery
Image Source: miro.medium.com

Some of the key features of Google BigQuery are as follows:

1) Scalable Architecture

BigQuery offers a petabyte-scale architecture that scales up and down with the load.

2) Faster Processing

Because of its scalable architecture, BigQuery can process petabytes of data quickly and is faster than many conventional systems. Users can run analyses over millions of rows without worrying about scalability.

3) Fully-Managed

BigQuery is a Google Cloud Platform product that is fully managed and serverless, so you do not have to provision or maintain infrastructure yourself.

4) Security

BigQuery offers a high level of security that protects data both at rest and in transit.

5) Real-time Data Ingestion

BigQuery can ingest and analyze data in real time, which makes it popular across IoT and transaction platforms.

6) Fault Tolerance

BigQuery replicates data across multiple zones or regions, which keeps the data available even when a zone or region goes down.

7) Pricing Models

The Google BigQuery platform is available in both on-demand and flat-rate subscription models. Data storage and querying are charged, while exporting, loading, and copying data are free. BigQuery separates computational resources from storage resources, so the two are billed independently: storage is charged for the data you keep, and under the on-demand model queries are charged by the quantity of data they process.
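As a rough illustration of billing by data processed, the sketch below estimates the cost of a single on-demand query. The rate `ASSUMED_USD_PER_TIB` is a placeholder chosen for this example, not a quoted price, and the function name is hypothetical; check the current pricing page before relying on any number.

```python
# Hypothetical on-demand rate in USD per TiB scanned. This value is an
# assumption for illustration only, not an official BigQuery price.
ASSUMED_USD_PER_TIB = 5.0

BYTES_PER_TIB = 1024 ** 4


def estimate_on_demand_cost(bytes_processed: int,
                            usd_per_tib: float = ASSUMED_USD_PER_TIB) -> float:
    """Return the estimated cost of a query that scans bytes_processed bytes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / BYTES_PER_TIB * usd_per_tib


# Example: a query that scans 250 GiB of data.
scanned_bytes = 250 * 1024 ** 3
print(f"Estimated cost: {estimate_on_demand_cost(scanned_bytes):.2f} USD")
```

Because the cost depends only on bytes scanned, selecting fewer columns or filtering on a partition is usually the simplest way to reduce it.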

For further information on Google BigQuery, you can check the official website here.

Simplify your data analysis with Hevo’s No-code Data Pipelines

Hevo Data, a No-code Data Pipeline, helps you transfer data from 100+ sources (including 40+ free sources) to BigQuery and visualize it in a BI Tool. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using various BI tools.

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is easy for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo transfers only the data that has been modified, in real time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Why Use BigQuery?

The primary reason to use BigQuery is analytical querying. BigQuery enables you to run complex analytical queries on large data sets. Queries are data requests that can include calculation, modification, merging, and other data manipulations. BigQuery is designed for analytical queries that go beyond simple CRUD operations, and it offers high throughput.

This completes your BigQuery Tutorial on the reason for using Google BigQuery.

How to Use BigQuery?

Working with BigQuery involves tasks such as the following:

  • Interacting with BigQuery
  • Running and Managing Jobs
  • Working with datasets
  • Working with table schemas
  • Working with tables
  • Working with partitioned tables
  • Working with clustered tables
  • Working with table snapshots
  • Working with views
  • Working with materialized views and many more.

For detailed information on each of these, visit Google BigQuery’s website.

Understanding the BigQuery Architecture

Google’s BigQuery service follows a four-layer structure, and this BigQuery Tutorial helps you understand the architecture in detail. The first layer is projects, which act as the top-level container for the data you want to store in the Google Cloud Platform. Datasets make up the second layer of Google BigQuery. You can have one or more datasets in a project.

The third layer is tables, which store your data in the form of rows and columns. Just like datasets, you can have one or more tables in a dataset. The final layer of BigQuery is jobs, which cover operations such as executing SQL queries to fetch, insert, and modify data.
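The containment relationship between projects, datasets, and tables can be sketched with a few plain Python classes. This is only a mental model, not how BigQuery or its client libraries represent these objects; the class and method names are hypothetical, and the identifiers reuse the example names from this tutorial.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str


@dataclass
class Dataset:
    name: str
    tables: dict = field(default_factory=dict)

    def add_table(self, name: str) -> Table:
        table = Table(name)
        self.tables[name] = table
        return table


@dataclass
class Project:
    project_id: str
    datasets: dict = field(default_factory=dict)

    def add_dataset(self, name: str) -> Dataset:
        dataset = Dataset(name)
        self.datasets[name] = dataset
        return dataset

    def qualified_table_ids(self) -> list:
        """List every table as project.dataset.table."""
        return [
            f"{self.project_id}.{ds.name}.{tbl.name}"
            for ds in self.datasets.values()
            for tbl in ds.tables.values()
        ]


project = Project("bigquery_project_id")
project.add_dataset("test_dataset").add_table("test_table")
print(project.qualified_table_ids())
```

The dotted project.dataset.table form is the same shape used to reference tables in SQL queries.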

BigQuery Tutorial: Four Layer Structure of BigQuery
Image Source: Self

You can learn more about the four layers of BigQuery in the following sections:

1) BigQuery Projects

BigQuery projects function as the top-level container for your data. Each project has a name and a unique ID, which makes storing, accessing, and removing data from BigQuery a smooth process.

BigQuery project IDs follow a particular naming convention: an ID must start with a lowercase letter and can contain only lowercase ASCII letters, digits, and hyphens.
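A naming rule like this is easy to check before calling the CLI. The sketch below assumes the commonly documented Google Cloud constraints (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen); the length limits and the trailing-hyphen rule go beyond what this tutorial states, and the function name is hypothetical.

```python
import re

# Assumed rules: 6 to 30 characters, lowercase ASCII letters, digits and
# hyphens only, starting with a letter and not ending with a hyphen.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if project_id satisfies the assumed naming rules."""
    return PROJECT_ID_PATTERN.fullmatch(project_id) is not None


for candidate in ["bigquery-public-data", "My-Project", "1project", "proj"]:
    print(candidate, is_valid_project_id(candidate))
```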

To create a project in BigQuery, you can use the create command as follows:

gcloud projects create PROJECT_ID

This completes your BigQuery Tutorial on BigQuery projects.

2) BigQuery Datasets

BigQuery datasets act as the container for your tables and views, with each dataset having multiple tables that store your data. With datasets, you can manage, control and access your data from tables and views. You can also set permissions at the organisation, project and dataset level.

You can create a dataset in BigQuery using the bq command as follows:

bq mk test_dataset
BigQuery Tutorial: Creating a dataset in BigQuery.
Image Source: Self

This completes your BigQuery Tutorial on BigQuery datasets.

3) BigQuery Tables

BigQuery stores your data in the form of rows and columns in tables. Each BigQuery table follows a particular schema that describes the columns, their names, and their data types.

BigQuery Tutorial: Tables Schema
Image Source: Self

BigQuery allows users to create three different types of tables: 

  • Native Tables: These tables make use of the BigQuery storage to store your data.
  • External Tables: These tables read data that lives in external storage such as Google Drive, Google Cloud Storage, etc.
  • Views: These are virtual tables that a user defines using SQL queries, often to control column-level access.

To create a native table in BigQuery, you can use the following command:

bq mk \
--table \
--expiration 36000 \
--description "test table" \
bigquery_project_id:test_dataset.test_table \
sr_no:INT64,name:STRING,DOB:DATE
BigQuery Tutorial: Creating native tables in BigQuery.
Image Source: Self
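The last argument of the command is an inline schema written as comma-separated name:TYPE pairs. The sketch below shows how such a string breaks down into columns. The helper `parse_inline_schema` is hypothetical and is not part of the bq tool; it is also stricter than the tool may be, since it rejects any entry without an explicit type.

```python
def parse_inline_schema(schema: str) -> list:
    """Split 'name:TYPE,name:TYPE' into a list of (name, type) tuples."""
    fields = []
    for item in schema.split(","):
        name, sep, type_name = item.strip().partition(":")
        # This sketch requires an explicit type for every column.
        if not sep or not name or not type_name:
            raise ValueError(f"expected name:TYPE, got {item!r}")
        fields.append((name, type_name.upper()))
    return fields


for column, column_type in parse_inline_schema("sr_no:INT64,name:STRING,DOB:DATE"):
    print(f"{column:<6} {column_type}")
```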

This completes your BigQuery Tutorial on BigQuery tables.

4) BigQuery Jobs

BigQuery jobs refer to the operations you perform on your data. With BigQuery, you can perform four different operations/tasks, namely load, query, export, and copy, on the data you’ve stored in BigQuery. Every time you execute one of these tasks, BigQuery automatically creates a job.

BigQuery allows users to fetch information about the jobs they’ve created using the bq ls command with the -j flag. You can use it as follows:

bq ls -j project_id
BigQuery Tutorial: Jobs created in BigQuery.
Image Source: Self

This completes your BigQuery Tutorial on BigQuery jobs.

BigQuery Tutorial: Accessing BigQuery Data

BigQuery allows users to access their data using SQL commands, much as they would with traditional SQL-based databases such as Oracle, Netezza, etc. Users can also access their BigQuery data in other ways, such as the bq command, the BigQuery service APIs, or a visualization tool such as Google Data Studio, Looker, etc.

To access the data using the select statement, you can make use of the following syntax:

select columns_names from table_name where condition group by column_name order by column_name 

For example, if you want to fetch data from the pageviews_2019 table in the wikipedia dataset of the “bigquery-public-data” project, you can use the select statement as follows:

SELECT title, COUNT(1) AS count
FROM `bigquery-public-data.wikipedia.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-12-31'
  AND LOWER(title) LIKE '%bigquery%'
GROUP BY title
ORDER BY count DESC;

For every Wikipedia page whose title contains the term “bigquery”, this query returns the number of matching rows in the 2019 pageviews table, sorted from highest to lowest, and it will generate the following output:

BigQuery Tutorial: Query Execution.
Image Source: Self
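To experiment with the same filter, group, and sort pattern without a Google Cloud account, the sketch below runs an equivalent query against a tiny in-memory SQLite table. The sample rows are invented, the `views` column is an assumption of this toy table, and SQLite's dialect differs from BigQuery's, so treat it only as a way to see the shape of the result. It also shows that COUNT(1) counts rows per title rather than adding up page views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (datehour TEXT, title TEXT, views INTEGER)")

# Invented sample data: one row per title per hour.
sample_rows = [
    ("2019-03-01 10:00:00", "BigQuery", 40),
    ("2019-03-01 11:00:00", "BigQuery", 60),
    ("2019-07-15 09:00:00", "Google_BigQuery", 5),
    ("2018-12-31 23:00:00", "BigQuery", 99),  # outside 2019, filtered out
    ("2019-05-05 05:00:00", "Python", 500),   # title does not match
]
conn.executemany("INSERT INTO pageviews VALUES (?, ?, ?)", sample_rows)

query = """
SELECT title, COUNT(1) AS row_count, SUM(views) AS total_views
FROM pageviews
WHERE date(datehour) BETWEEN '2019-01-01' AND '2019-12-31'
  AND lower(title) LIKE '%bigquery%'
GROUP BY title
ORDER BY row_count DESC
"""
result = conn.execute(query).fetchall()
for title, row_count, total_views in result:
    print(title, row_count, total_views)
```

If the goal is total page views rather than the number of hourly records, summing a views column, as in the third output column here, is the aggregation to reach for.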

Apart from allowing users to query their data using a select statement, BigQuery also integrates with reporting tools such as Google Data Studio, Tableau, etc. It also allows streaming data directly from a source of your choice to leverage the power of real-time analytics.

The following sections of this BigQuery Tutorial cover common tasks you can perform when working with queries:

1) Save Query

In BigQuery, you can save queries that you want to use later. The steps are as follows:

  • Step 1: Click on the “Save” button. Then, click on “Save Query”.
BigQuery Tutorial: Save Query using Google BigQuery
Image Source: Self
  • Step 2: Give a name to your query and choose its visibility as per your need.
  • Step 3: Click on the “Save” button.
BigQuery Tutorial: Save Query using Google BigQuery Cloud Platform
Image Source: Self

Your saved queries are visible in the respective popup tab.

2) Schedule Query

The steps to schedule a query in BigQuery are as follows:

  • Step 1: Click on the “Schedule” button. You will be notified that you must first enable the BigQuery Data Transfer API.
BigQuery Tutorial: Schedule Query using Google BigQuery
Image Source: Self
  • Step 2: Click on the “Enable API” button and wait for a while.
Schedule Query using Google BigQuery Cloud Platform
Image Source: Self
  • Step 3: Now, you’ll be able to create scheduled queries when you click on the “Schedule” button.
Schedule Query using Google BigQuery Platform
Image Source: Self
  • Step 4: Click the “Create new scheduled query” option and define the parameters accordingly. Optionally, you can set up advanced and notification options.
  • Step 5: Click the “Schedule” button when the setup is complete.
BigQuery Tutorial: Scheduled query setup
Image Source: Self

This completes your BigQuery Tutorial on scheduling queries in BigQuery.

3) Export Query

To export your query results, the steps are as follows:

  • Step 1: Click on the “Save Results” button and select one of the available options:
    • CSV file
      • Download to your device (up to 16K rows)
      • Download to Google Drive (up to 1GB)
    • JSON file
      • Download to your device (up to 16K rows)
      • Download to Google Drive (up to 1GB)
    • BigQuery table
    • Google Sheets (up to 16K rows)
    • Copy to clipboard (up to 16K rows)
BigQuery Tutorial: Save query results
Image Source: Self
  • Step 2: Suppose you select the “BigQuery table” option.
  • Step 3: Now, you need to set the project name, dataset name, and table name.
BigQuery Tutorial: Save query as a BigQuery table
  • Step 4: Now, click on the “Save” button.
BigQuery Tutorial: Query saved as a table

This completes your BigQuery Tutorial on exporting queries in BigQuery.

How Does Google BigQuery Store Data?

BigQuery leverages columnar storage, in which each column is stored in different file blocks. As a result, BigQuery is an excellent choice for OLAP (Online Analytical Processing) applications. You can easily stream (append) data to BigQuery tables, as well as update or delete existing values. BigQuery allows for unlimited mutations (INSERT, UPDATE, MERGE, DELETE).

BigQuery Tutorial: BigQuery Storage
Image Source: miro.medium.com

BigQuery uses a columnar format known as Capacitor to store data. Each field of a BigQuery table, i.e. column, is stored in its own Capacitor file, allowing BigQuery to achieve extremely high compression ratios and scan throughput.
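The sketch below illustrates why storing each column separately tends to compress well. It pivots a few invented rows into columns and applies run-length encoding to one of them. Run-length encoding is used here only as a simple stand-in; this is not a description of Capacitor's actual on-disk format, and the function names are hypothetical.

```python
from itertools import groupby

# Invented sample rows in a row-oriented layout.
rows = [
    {"sr_no": 1, "name": "Ann", "country": "IN"},
    {"sr_no": 2, "name": "Bob", "country": "IN"},
    {"sr_no": 3, "name": "Cho", "country": "IN"},
    {"sr_no": 4, "name": "Dee", "country": "US"},
]


def to_columns(records: list) -> dict:
    """Pivot row-oriented records into one list of values per column."""
    return {key: [rec[key] for rec in records] for key in records[0]}


def run_length_encode(values: list) -> list:
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
print(columns["country"])
print(run_length_encode(columns["country"]))
```

A query that only needs the country column can read that one list and skip the others, which is the main reason columnar layouts suit analytical workloads.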

BigQuery uses Capacitor to store data in Colossus. Colossus is Google’s next-generation distributed file system and the successor to GFS (Google File System). Colossus is in charge of cluster-wide replication, recovery, and distributed management. It ensures durability by storing redundant chunks of data on multiple physical disks using erasure coding. It supports client-driven replication and encoding.
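The idea behind storing redundant chunks can be sketched with the simplest possible erasure code: a single XOR parity chunk that lets you rebuild any one lost data chunk. This is a deliberately simplified illustration with hypothetical function names; it is not the scheme Colossus uses, and production systems rely on stronger codes that tolerate multiple losses.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks: list) -> bytes:
    """Compute one parity chunk over equally sized data chunks."""
    return reduce(xor_bytes, chunks)


def recover_missing(surviving: list, parity: bytes) -> bytes:
    """Rebuild the single missing chunk from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


chunks = [b"goog", b"le_b", b"igqu"]
parity = make_parity(chunks)

# Simulate losing the middle chunk and rebuilding it from the rest.
rebuilt = recover_missing([chunks[0], chunks[2]], parity)
print(rebuilt == chunks[1])
```

Compared with keeping full copies, a parity chunk adds far less storage overhead, which is the general trade-off erasure coding offers.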

When writing data to Colossus, BigQuery makes a decision about the sharding strategy which further evolves based on the query and access patterns. Once data is written, BigQuery initiates geo-replication of data across different data centres to ensure maximum availability. By separating storage and compute, you can scale to petabytes of storage without requiring any additional compute resources thus saving cost.

Colossus allows data to be split into multiple partitions for fast parallel reads, whereas Capacitor reduces scan throughput requirements. Together, they allow BigQuery to process data at rates of up to a terabyte per second.
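The benefit of splitting data into partitions can be sketched as a scatter and gather computation: each partition is scanned independently and the partial results are combined at the end. The example below uses Python threads purely to show the structure. The function names are hypothetical, and threads in CPython will not deliver the kind of speedup a distributed system gets from many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values: list, partition_count: int) -> list:
    """Split values into contiguous partitions of roughly equal size."""
    size = max(1, -(-len(values) // partition_count))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values: list, partition_count: int = 4) -> int:
    """Sum each partition in its own worker, then combine the partial sums."""
    partitions = split_into_partitions(values, partition_count)
    with ThreadPoolExecutor(max_workers=partition_count) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)


data = list(range(1, 101))
print(parallel_sum(data))
```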

This completes your BigQuery Tutorial on BigQuery data storage.

Conclusion

This article provided a comprehensive BigQuery Tutorial with in-depth knowledge about the concepts behind every step to help you understand and implement them efficiently. Using BigQuery to draw crucial insights about your business requires you to bring in data from a diverse set of sources by setting up various ETL pipelines.

Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of desired destinations with a few clicks.

Hevo Data, with its strong integration with 100+ Data Sources (including 40+ Free Sources), allows you to not only export data from your desired data sources and load it to the destination of your choice, such as BigQuery, but also transform and enrich your data to make it analysis-ready. Hevo also allows integrating data from non-native sources using Hevo’s in-built REST API & Webhooks Connector. You can then focus on your key business needs and perform insightful analysis using BI tools.

Want to give Hevo a try? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at the pricing, which will assist you in selecting the best plan for your requirements.

Tell us about your experience of learning about BigQuery and its features using our BigQuery Tutorial! Share your thoughts in the comments section below!
