Building BigQuery External Tables: Syntax, Usage, & Examples Simplified 101

on BigQuery External Tables, Data Warehouse, Google BigQuery • February 15th, 2022 • Write for Hevo

BigQuery External Tables_FI

Creating and querying data from external sources has become a holistic approach by organizations to harness possible insights from data stored in different locations. This is done by collecting data from disparate data sources to host in a unified data control point to enable the implementation of a centralized analytic service.

An External data source on BigQuery can be used to directly query data from diverse data locations even though the data is not stored in BigQuery. BigQuery gives you the ability to build BigQuery External tables that can connect to your data that is not stored in it so you can extract the data and make use of them without having to copy or transfer them from the data source systems. This makes things flexible as you can use this connection to analyze your data to get valid deductions that will help improve your organization.

In this article, you will gain information about BigQuery External Tables. You will also gain a holistic understanding of Google BigQuery, its key features, External Data Sources, different usage of BigQuery External Tables, and the limitations of BigQuery External Tables. Read along to find out in-depth information about undergoing Google BigQuery External Tables.

Table of Contents

What is Google BigQuery?

BigQuery External Tables - Google BigQuery
Image Source

Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.

Google BigQuery is fully managed by Cloud service providers. We don’t need to deploy any resources, such as discs or virtual machines. It is designed to process read-only data. Dremel and Google BigQuery use Columnar Storage for quick data scanning, as well as a tree architecture for executing queries using ANSI SQL and aggregating results across massive computer clusters. Furthermore, owing to its short deployment cycle and on-demand pricing, Google BigQuery is serverless and designed to be extremely scalable.

For further information about Google Bigquery, follow the Official Documentation.

Key Features of Google BigQuery

BigQuery External Tables - Key Features
Image Source

Some of the key features of Google BigQuery are as follows:

1) Scalable Architecture

BigQuery has a scalable architecture and offers a petabyte scalable system that users can scale up and down as per load.

2) Faster Processing

Being a scalable architecture, BigQuery executes petabytes of data within the stipulated time and is more rapid than many conventional systems. BigQuery allows users to run analysis over millions of rows without worrying about scalability.

3) Fully-Managed

BigQuery is a product of Google Cloud Platform, and thus it offers fully managed and serverless systems.

4) Security

BigQuery has the utmost security level that protects the data at rest and in flight. 

5) Real-time Data Ingestion

BigQuery can perform real-time data analysis, thereby making it famous across all the IoT and Transaction platforms.

6) Fault Tolerance

BigQuery offers replication that replicates data across multiple zones or regions. It ensures consistent data availability when the region/zones go down.

7) Pricing Models

The Google BigQuery platform is available in both on-demand and flat-rate subscription models. Although data storage and querying will be charged, exporting, loading, and copying data is free. It has separated computational resources from storage resources. You are only charged when you run queries. The quantity of data processed during searches is billed.

What are External Data Sources?

An External Data Source is a data source whose data is not stored in Google BigQuery storage but can be queried directly from Google BigQuery. This can be helpful when performing ELT of workloads, loading, and cleaning of your data in the external storage. It also comes in handy when joining Google BigQuery tables with frequently changing data from External Data sources as you can query the external data directly before loading it into BigQuery storage to avoid reloading the data every time changes occur.

Google BigQuery supports the following External Data Sources:

  • Bigtable
  • Cloud Panner
  • Cloud SQL
  • Cloud Storage
  • Drive

Google BigQuery has two basic mechanisms for querying External data sources which are:

  • External tables
  • Federated queries

BigQuery External tables will be discussed in the next section as that is the focus of the article, but, Federated queries can simply be defined as a way to send a query statement to an external database and get the result back as a temporary table. Federated queries are carried out using Cloud Spanner or Cloud SQL.

Simplify Google BigQuery ETL and Analysis with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ different sources (including 40+ free sources) to a Data Warehouse such as Google BigQuery or Destination of your choice in real-time in an effortless manner. Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources allows users to bring in data of different kinds in a smooth fashion without having to code a single line. 

Get Started with Hevo for Free

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors: Hevo supports 100+ integrations to SaaS platforms, files, Databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.  
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (including 40+ free sources) that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

What are BigQuery External Tables?

An External Table is a table that operates like a standard Google BigQuery table since its table metadata and the table schema are both stored in Google BigQuery storage, but, the data itself resides in the external source. A BigQuery External table can either be Temporary or Permanent.

The Permanent External table is contained inside a dataset and can be managed similarly to that of a standard Google BigQuery table. With the Permanent External table, you can view the table properties, set control measures, query the table, join it with other tables, and so on. 

Google BigQuery External tables can be used with the following data sources:

  • Bigtable
  • Cloud Storage
  • Drive

In subsequent sections, we will look at how to query each of them from BigQuery but first, we have to define and explain some important terms that will be used or needed when building BigQuery external tables. 

1) What are Permanent & Temporary External Tables?

In Google BigQuery, you can query External data sources using either a Permanent or Temporary table and this decision will be based on what you want. 

A Permanent table is a table that is created in a dataset and is linked to your External data source in which you can enjoy some privileges by sharing the table with others who have access to the External data source as well as querying the table any time you need to without restrictions.

In the case of a Temporary table, whenever you query the External data source, a command is submitted in which a non-permanent table link is created. The Temporary table does not create a table in one of your Bigquery datasets therefore it can not be shared with others but it can be useful for a one-time, ad-hoc query, or when performing Extract, Transform, and Load (ETL) operations.

2) Creating a URL for Your Data Source

Before you can build a BigQuery External table, you will need to retrieve the URL of the data source as this will be used in creating the BigQuery External table.

In this section, we will look at how to retrieve the URI of Cloud Bigtable, Cloud Storage, and Drive.

A) Retrieve Cloud Bigtable URL

To retrieve the Cloud Bigtable URL, you can carry out the following steps:

  • Step 1: Open the Cloud Bigtable Console.
  • Step 2: Now, you can retrieve the following details about your Cloud Bigtable data source:
    • Project ID
    • Cloud Bigtable Instance ID
    • The name of your Cloud Bigtable table
  • Step 3: Compose the cloud Bigtable URL using the following items in a given format.
    • project_id: This is the project that contains your Cloud Bigtable instance.
    • instance_id: This is the Cloud Bigtable instance ID
    • table_name: This is the name of the table you are querying.

The Bigtable URL will look like this:

https://googleapis.com/bigtable/projects/project_id/instances/instance_id/tables/table_name

For more information on how to retrieve your Bigtable URL, you can visit here.

B) Retrieve Cloud Storage URL

The Cloud Storage URL comprises your bucket name and the object (filename).

To retrieve the Cloud Storage URL, you can carry out the following steps:

  • Step 1: Open the Cloud Storage Console.
  • Step 2: Now, you can browse to the location of the project (file) that contains the source data.
  • Step 3: At the top of the Cloud Storage console, compose your URL by replacing the  gs://bucket/file with the appropriate path, for example, gs://mybucket/myfile.json.bucket is the Cloud Storage bucket name and myfile is the name of the object (file) containing the data.

For more information on how to retrieve your Cloud Storage URL, you can visit here.

C) Retrieve Drive Data URL

To provide the Drive URL, you must share a link to the file, and it is usually in this format:

https://docs.google.com/spreadsheets/d/FILE_ID or https://drive.google.com/open?id=FILE_ID.FILE_ID is the alphanumeric ID for your Drive file.

To share the file, you can carry out the following steps:

  • Step 1: Choose the file you want to share. This can be done on a computer, an Android device, or an iPhone and iPad.
  • Step 2: On your computer, open any of the following apps:
    • Google Drive
    • Docs, Sheets
    • Slides
  • Step 3: Click on the “share” button.

For more information on how to retrieve your Drive data URI, you can visit here.

3) Querying Cloud Bigtable Data

Google Cloud Bigtable can be defined as its sparsely populated NoSQL database which can be used to scale data to billions of rows, thousands of columns, and petabytes of data. 

To query Cloud Bigtable, you can use either a Permanent or Temporary table. You can also do this in the Console, using API, bq command-line tool, or Java.

For further information on how to query Cloud Bigtable data and using the other options listed above, you can visit here.

To query Cloud Bigtable data using a Permanent table through the bq command-line tool, you can carry out the following steps:

  • Step 1: Create a table in the bq query command-line tool using the bq mk command. 
  •  Step 2: Identify the table schema using a table definition file as shown below.
bq mk 
--external_table_definition=definition_file 
dataset.table
  • definition_file refers to the path that points to the table definition file on your local machine.
  • dataset is the nomenclature of the dataset in which the table is contained.
  • table denotes the identifier of the table you are creating.

To create and query Bigtable data Temporary External tables with bq query command-line tool using a table definition file, you can carry out the following steps:

  • Step 1: Enter the bq query command with the –external_table_definition flag.
  • Step 2: Supply the –location flag and set the value to your location. This is optional and can be overlooked.

An example of the query using the bq command line is shown below:

bq --location=location query 
--use_legacy_sql=false 
--external_table_definition=table::definition_file 
'query'
  • location is the name of your location and the –location flag is optional.
  • table is used to represent the name of the temporary table you are creating as part of the process.
  • definition_file is the path to the table definition file on your local machine.
  • query denotes the query that is being submitted to the temporary table.

4) Querying Cloud Storage Data

Google BigQuery supports the querying of the following Cloud Storage data in these formats Comma-Separated Values (CSV), JSON (newline-delimited), Avro, ORC, Parquet, Datasource exports, and Filestore exports.

It also supports the querying of Cloud Storage data from these storage classes:

  • Standard
  • Nearline
  • Coldline
  • Archive

Just like Bigtable, you can decide to query Cloud Storage data using either a Permanent or Temporary table using the bq command-line, DDL, API, Java, etc.

For other options and more information about creating and querying Cloud Storage data, visit here.

To create and query a Permanent External table using the bq query command-line tool, the table is created using the bq mk command line with the –external_table_defintion flag and the table schema is identified using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file also stored in your local machine. 

To create a Permanent table linked to your Cloud Storage data source using a table definition file, you can carry out the following steps:

  • Step 1: Query the table linked to an External data source using the bq query command line with the –external_table_definition flag.
  • Step 2: You can identify the table schema using a table definition file. Then enter the following command as shown below:
bq mk 
--external_table_definition=definition_file 
dataset.table
  • definition_file specifies the location of the table definition file on your local machine.
  • dataset is the name of the dataset that the table is contained in.
  • table refers to the name of the table you intend to create.

To create a Permanent table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below:

bq mk 
--external_table_definition=schema@source_format=Cloud Storage URI 
dataset.table
  • schema appears in the format field:data_type and represents the schema definition.
  • source_format is used to show the external data source format, for example, CSV.
  • Cloud Storage URL refers to your Cloud Storage URL.
  • dataset is the title of the dataset that the table is contained in.
  • table refers to the name of the table you intend to create.

To create and query the Cloud Storage data Temporary table, you can use a table definition file with the query, create an inline schema with a query, or use a JSON schema definition file with the query. This can be done using the bq command-line tool, API, or client libraries.

To use the bq command-line tool, you can carry out the following steps:

  • Step 1: Query the table linked to an External data source using the bq query command line with the –external_table_definition flag
  • Step 2: Identify the table schema using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file.

To query a Temporary table linked to your Cloud Storage data source using a table definition, enter the following command as shown below:

bq --location=location query 
--external_table_definition=table::definition_file 
'query'
  • location is used to point to the location path and the –location flag is optional
  • table is the label of the temporary table you are creating.
  • definition_file constitutes the path to the table definition file on your local machine.
  • query is the query you are submitting to the Temporary table.

To query a Temporary table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below.

bq --location=location query 
--external_table_definition=table::schema@source_format=Cloud Storage URI 
'query'
  • location is the name of your location and using the –location flag is optional. 
  • table is the name of the temporary table you are creating.
  • schema is the inline schema definition in the format field:data_type.
  • source_format is the format of the external data source, for example, CSV.
  • Cloud Storage URL is your Cloud Storage URL.
  • query is the query you are submitting to the temporary table.

5) Querying Drive Data

You can query Drive External data source by providing the Drive URL path to your data and then creating a BigQuery External table that references the data source. The table that references the Drive can be Permanent or Temporary. 

To create and query a Permanent External table, you can use the Cloud Console, the bq command-line tool’s mk command, create an ExternalDataConfiguration when using the tables.insert API and the client libraries. 

To get more about querying Drive data using both permanent and temporary tables, and other useful information as regards querying the drive data, you can visit here.

To create a permanent external table using the bq command-line tool, you can carry out the following steps:

  • Step 1: Use the bq mk command-line tool to create a table linked to the External data source.
  • Step 2: Identify the table schema using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file.

To create a Permanent table linked to your Cloud Storage data source using a table definition, enter the following command as shown below:

bq mk 
--external_table_definition=DEFINITION_FILE 
DATASET.TABLE
  • DEFINITION_FILE is the path to the table definition file on your local machine.
  • DATASET is the name of the dataset that contains the table.
  • TABLE is the name of the table you are creating.

To create a Permanent table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below.

bq mk 
--external_table_definition=SCHEMA@SOURCE_FORMAT=DRIVE_URI 
DATASET.TABLE
  • SCHEMA is the schema definition in the format FIELD:DATA_TYPE.
  • SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
  • DRIVE_URL is your Drive URL.
  • DATASET is the label of the dataset that the table is found in.
  • TABLE is the name of the table you want to create.

To query Drive data using Temporary tables, you can run a command to combine a table definition file with a query, an inline schema definition with a query, or a JSON schema definition file with a query using bq command-line, API, Python, and Java.

To create and query a Temporary table using the bq command-line, you can carry out the following steps:

  • Step 1: Use the bq query command with the –external_table_definition flag to query a Temporary table linked to the external source.
  • Step 2: Identify the table schema using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file

To query a Temporary table linked to your Cloud Storage data source using a table definition, enter the following command as shown below:

bq --location=LOCATION query 
--external_table_definition=TABLE::DEFINITION_FILE 
'QUERY'
  • LOCATION is your location and using the –location flag is optional.
  • TABLE is the name of the temporary table you are creating.
  • DEFINITION_FILE is the location/route to the table definition file on your local machine.
  • QUERY refers to the query that is to be submitted to the temporary table.

To query a Temporary table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below.

bq --location=LOCATION query 
--external_table_definition=TABLE::SCHEMA@SOURCE_FORMAT=DRIVE_URI 
'QUERY'
  • LOCATION denotes the location. The –location flag is optional and not compulsory.
  • TABLE is used to point to the temporary table you are trying to create.
  • SCHEMA is usually in the format FIELD:DATA_TYPE and it refers to the inline schema definition.
  • SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
  • DRIVE_URL is your Drive URL.
  • QUERY is the query you are submitting to the Temporary table.

Limitations of BigQuery External Tables

The are several limitations of using BigQuery External tables. They include the following:

  • There is no data consistency for BigQuery External data tables as Bigquery does not guarantee it. Any change to the data while the query is running can result in an unexpected outcome.
  • BigQuery External tables are read-only tables and cannot be modified using DML or other methods.
  • The performance of querying BigQuery External tables may not be as high as querying data in a native Bigquery table.
  • Bigquery jobs that export data from a BigQuery External table cannot be run.
  • You cannot use the TableDataList JSON API method to retrieve data from BigQuery External tables.
  • BigQuery External tables do not support table clustering.
  • You have a limit of four (4) concurrent queries.
  • You cannot reference a BigQuery External table using a wildcard table query.
  • Whenever you query a BigQuery External data source other than Cloud Storage, the results are not cached.
  • Querying data in Cloud Bigtable is only available in limited locations.

Conclusion

This article has covered a lot on how to build BigQuery external tables to enable you to extract your data that are not stored on BigQuery from disparate data sources for further operations. It has exposed you to a wide range of topics as regards this by explaining what external tables are and detailing how to create them amongst other things.

In this article, you have learned about BigQuery External Tables. This article also provided information on Google BigQuery, its key features, External Data Sources, different usage of BigQuery External Tables, and the limitations of BigQuery External Tables in detail. For further information on BigQuery JSON Extract, BigQuery Create View Command, BigQuery Partition Tables, you can visit the former links.

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks.

Visit our Website to Explore Hevo

Hevo Data with its strong integration with 100+ data sources (including 40+ Free Sources) allows you to not only export data from your desired data sources & load it to the destination of your choice but also transform & enrich your data to make it analysis-ready. Hevo also allows integrating data from non-native sources using Hevo’s in-built Webhooks Connector. You can then focus on your key business needs and perform insightful analysis using BI tools. 

Want to give Hevo a try?

Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at the amazing price, which will assist you in selecting the best plan for your requirements.

Share your experience of understanding Google BigQuery External Tables in the comment section below! We would love to hear your thoughts.

No-code Data Pipeline for Google BigQuery