Understanding Google BigQuery ML: Simplified 101

on Data Analytics, Google BigQuery, Google BigQuery ML, Google Cloud Platform, Machine Learning, Python, R Programming • September 16th, 2021 • Write for Hevo

Nowadays, Machine Learning plays a vital role in making calculated decisions and predictions on data. Terabytes of data are generated daily and stored in Data Warehouses and Data Lakes to feed it to Business Intelligence tools, run Data Analytics, and create Machine Learning models. Companies tend to hire developers and Data Analysts with expertise in Python or R to handle Machine Learning models tasks this restricts the SQL experts to leverage Machine Learning in their analysis.

Google BigQuery ML is a toolset of the Google BigQuery Data Warehouse that allows SQL experts to train and execute Machine Learning models. So, if you don’t know any Machine Learning supported language or you are a Data Analyst with only expertise in SQL. You can also build Machine Learning models for your data with the help of Google BigQuery ML. 

Google BigQuery ML allows SQL users to utilize their skills without learning any other language to use Machine Learning in their daily tasks and eliminates any dependency on other team members. In this article, you will learn about Google BigQuery ML and its features. You will also read about different Machine Learning models supported in Google BigQuery ML. 

Table of Contents

Introduction to Google BigQuery ML

Google BigQuery ML Logo
Image Source

Google BigQuery ML is a new feature of Google BigQuery that is currently in the beta phase. Google BigQuery ML is a set of tools and extensions that enables users to create, train and execute Machine Learning models in Google BigQuery using standard SQL queries. It eliminates the need for data movement by allowing users to directly create Machine Learning models into the Data Warehouse. Google BigQuery Machine Learning (BQML) makes Data Analytics easy as it doesn’t require Data Analysts to learn other programming languages like Python or R to train Machine Learning models.

Google BigQuery ML supports many Machine Learning models, and users can train, test, evaluate and deploy models in Google BigQuery itself. users can use Google it in the following ways listed below:

  • The Google Cloud Console.
  • By using the “bq” command-line tool.
  • One can access it through BigQuery REST API.
  • Users can use external Bussiness Intelligence (BI) tools and Jupyter Notebook. 

Key Features of Google BigQuery ML

Google BigQuery ML allows SQL practitioners to train Machine Learning models without learning any Machine Learning supported languages like Python or R. It comes with many other features. A few of them are listed below:

  • Automatically Generate ML Models: Users who don’t have expertise in choosing the right Machine learning model for their data set. It uses AutoML to provide users a user-friendly graphical interface and allows you to choose the best Machine Learning model for your requirements.
  • Eliminates Data Transfer: Before Google BigQuery ML, users needed to move data from Google BigQuery to other platforms or isolated environments to train Machine Learning models. It consumes unnecessary time for large datasets. Now, users can train and execute models directly from Google BigQuery.
  • Encrypted Models: Google BigQuery ML allows users to encrypt Machine Learning models with the Customer-Managed Encryption Keys (CMEK).

To learn more about Google BigQuery ML, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Salesforce, Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Different Models Supported in Google BigQuery ML

Google BigQuery ML Supported Models
Image Source

Various Machine Learning models are used to train data and predict them based on the datasets fed to them. Google BigQuery ML ensures the availability of Machine Learning models to train any kind of dataset. All the Machine Learning models supported by Google BigQuery ML are listed below:

  • Linear Regression: Linear Regression is the most basic Machine Learning model that is used for forecasting. It uses statistical calculations to make predictions. Google BigQuery ML provides this Machine Learning model to forecast trends based on data. Linear Regression is used for real-valued label prediction.
  • Logistic Regression: Logistic Regression is used for the classification. When the required output is in YES or NO form, Logistic Regression delivers better results than Linera Regression because it converts every result in 1 or 0 i.e., YES or NO. Google BigQuery ML supports both Binary Logistic Regression and Multiclass Logistic Regression. Both are the types of Logistic Regression model, listed below:
    • Binary Logistic Regression: It is used when the required output only has 2 outcomes means YES or NO. The labels of Binary-class classification can have only two possible values, 1 and 0.
    • Multiclass Logistic Regression: Classification can have multiple outcomes. Google BigQuery ML supports Multiclass Logistic Regression to allow users to solve real-world problems based on classification. It uses a multinomial classifier with a cross-entropy loss function to train Multiclass Logistic Regression models.
  • K-Means Clustering: K-means clustering model is used for data segmentation. It uses non labeled data and groups similar data points in one group. Users can use the K-means clustering model with the CREATE MODEL statement with the KMEANS model type.
  • Matrix Factorization: Google BigQuery ML allows users to quickly create and train models. The matrix Factorization model is used to create product recommendation systems. Users can use past customer behavior data with their product purchase data to build a robust recommendation system using SQL. 
  • Time Series: It is used to perform forecasts on time-series data. Google BigQuery ML eliminates the tedious task to handle anomalies in data like holidays, seasonality, etc. With the help of the Time Series model, users can create many time series models, test them and use them for forecasting.

Steps to Create Models using Google BigQuery ML

Now that you have learned about Google BigQuery ML. In this section, you will learn the steps to create a simple Logistic Regression model in Google BigQuery ML and make learn to make predictions from a dataset. The steps to create a simple model are listed below:

Step 1: Setting Up the Environment

  • Sign in to your Google Cloud Platform account, here
  • Now, create a New Project and fill out all the required details for billing.
  • Now, go to Google Developer Console and click on the sidebar menu.
  • Now click on the “BigQuery” option, and it will open up the Google BigQuery Console in the next tab, shown in the image below.
Sidebar Menu of Google Cloud Platform to Open Google BigQuery
Image Source
  • The project that you provided earlier will be reflected in the menu on the left side of the screen.
  • Under the project name, click on the next arrow button and then select the “Create New Dataset” option, shown in the image below.
Creating New Dataset for Google BigQuery ML
Image Source
  • Provide a name to your dataset as bqml_codelab. If you want to give a different name, then make sure to replace the name with code in the process.

Step 2: Creating Machine Learning Model

  • Now, open Google Cloud Console.
  • In this example, you will learn about creating a model to predict whether a visitor will make a transaction. The code is given below.
#standardSQL
CREATE OR REPLACE MODEL `bqml_codelab.sample_model` 
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170631'
LIMIT 100000;
  • In the following code, the name of the dataset is “bqml_codelab” and “sample_model” is the name of the model. The model used here is Binary Logistic Regression.

Step 3: Evaluating the Model

  • Now the following code will return the evaluation metrics for the model. The metrics are “precision, “recall”, “accuracy”, “f1_score”, “log_loss”, “roc_auc”. The code is given below.
#standardSQL
SELECT
  *
FROM
  ml.EVALUATE(MODEL `bqml_codelab.sample_model`, (
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'));

Step 4: Predicting Using the Model

  • Now, the model will predict the number of transactions made by visitors for each country and the code will sort the result to the top 10 countries. The code is given below.
#standardSQL
SELECT
  country,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ml.PREDICT(MODEL `bqml_codelab.sample_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10;
  • The output is shown in the image below.
Output Table of Prediction Using BigQuery ML
Image Source
  • The difference between the previous code and this block is that previously it used “m1.EVALUATE” and now it’s “m1.PREDICT”. The other difference between both the codes is the use of the GROUP BY command to group country columns, sorting the results and limiting it to 10 only.

That’s it! You have created a Logistic Regression model, evaluated it, and used the model to make predictions. You can try the same example for the Linear Regression model and compare the accuracies of both models.

Benefits of Google BigQuery ML

Google BigQuery ML is fast to create, evaluate and execute various Machine Learning models easily using standard SQL queries. It offers many benefits over other Cloud Data Warehouses. Some of the benefits are listed below.

  • It eliminates the need to know Python or any other language for managing Machine Learning models. Data Analysts with expertise in SQL only can now train models and make predictions.
  • The data export involves many steps, and it’s a time-consuming process. Google BigQuery ML saves time and resources by letting users use Machine Learning models in Google BigQuery.
  • It allows users to run Machine Learning models on large datasets within minutes as it uses computation resources of Google BigQuery Data Warehouse.
  • It features some automated Machine Learning models that reduce the workload to manipulate data manually. It saves time and allows users to quickly train and test models on the dataset.

Conclusion

In this article, you learnt about Google BigQuery ML, and how easy it is to make predictions using SQL. You also read about different Machine Learning models that Google BigQuery ML supports. It eliminates the restriction Data Analysts face for using Machine Learning models in their analysis. Companies can deploy models directly on their datasets to make better business decisions. 

Visit our Website to Explore Hevo

Google BigQuery Data Warehouse loads data from multiple data sources, and it becomes a tedious task for manually managing all the ETL processes. Hevo Data is a No-code Data pipeline solution that can help you transfer data from multiple data sources to Google BigQuery. It fully automates the process of loading and transforming data from 100+ data sources to your desired Data Warehouse without writing a single line of code.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Google BigQuery ML in the comments section below!

No-code Data Pipeline For your Google BigQuery