An External data source on BigQuery can be used to directly query data from diverse data locations even though the data is not stored in BigQuery. BigQuery gives you the ability to build BigQuery External tables that can connect to your data that is not stored in it so you can extract the data and make use of them without having to copy or transfer them from the data source systems.
This makes things flexible as you can use this connection to analyze your data to get valid deductions that will help improve your organization.
In this article, you will gain information about BigQuery External Tables. You will also gain a holistic understanding of Google BigQuery, its key features, External Data Sources, different usage of BigQuery External Tables, and the limitations of BigQuery External Tables. Read along to find out in-depth information about undergoing Google BigQuery External Tables.
What is Google BigQuery?
Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.
What are External Data Sources?
An External Data Source is a data source whose data is not stored in Google BigQuery storage but can be queried directly from Google BigQuery. This can be helpful when performing ELT of workloads, loading, and cleaning of your data in the external storage. It also comes in handy when joining Google BigQuery tables with frequently changing data from External Data sources as you can query the external data directly before loading it into BigQuery storage to avoid reloading the data every time changes occur.
Google BigQuery supports the following External Data Sources:
- Bigtable
- Cloud Panner
- Cloud SQL
- Cloud Storage
- Drive
Google BigQuery has two basic mechanisms for querying External data sources which are:
- External tables
- Federated queries
BigQuery External tables will be discussed in the next section as that is the focus of the article, but, Federated queries can simply be defined as a way to send a query statement to an external database and get the result back as a temporary table. Federated queries are carried out using Cloud Spanner or Cloud SQL.
What are BigQuery External Tables?
An External Table is a table that operates like a standard Google BigQuery table since its table metadata and the table schema are both stored in Google BigQuery storage, but, the data itself resides in the external source. A BigQuery External table can either be Temporary or Permanent.
The Permanent External table is contained inside a dataset and can be managed similarly to that of a standard Google BigQuery table. With the Permanent External table, you can view the table properties, set control measures, query the table, join it with other tables, and so on.
Google BigQuery External tables can be used with the following data sources:
- Bigtable
- Cloud Storage
- Drive
In subsequent sections, we will look at how to query each of them from BigQuery but first, we have to define and explain some important terms that will be used or needed when building BigQuery external tables.
1) What are Permanent & Temporary External Tables?
In Google BigQuery, you can query External data sources using either a Permanent or Temporary table and this decision will be based on what you want.
A Permanent table is a table that is created in a dataset and is linked to your External data source in which you can enjoy some privileges by sharing the table with others who have access to the External data source as well as querying the table any time you need to without restrictions.
In the case of a Temporary table, whenever you query the External data source, a command is submitted in which a non-permanent table link is created. The Temporary table does not create a table in one of your Bigquery datasets therefore it can not be shared with others but it can be useful for a one-time, ad-hoc query, or when performing Extract, Transform, and Load (ETL) operations.
2) Creating a URL for Your Data Source
Before you can build a BigQuery External table, you will need to retrieve the URL of the data source as this will be used in creating the BigQuery External table.
In this section, we will look at how to retrieve the URI of Cloud Bigtable, Cloud Storage, and Drive.
A) Retrieve Cloud Bigtable URL
To retrieve the Cloud Bigtable URL, you can carry out the following steps:
- Step 1: Open the Cloud Bigtable Console.
- Step 2: Now, you can retrieve the following details about your Cloud Bigtable data source:
- Project ID
- Cloud Bigtable Instance ID
- The name of your Cloud Bigtable table
- Step 3: Compose the cloud Bigtable URL using the following items in a given format.
- project_id: This is the project that contains your Cloud Bigtable instance.
- instance_id: This is the Cloud Bigtable instance ID
- table_name: This is the name of the table you are querying.
The Bigtable URL will look like this:
https://googleapis.com/bigtable/projects/project_id/instances/instance_id/tables/table_name
For more information on how to retrieve your Bigtable URL, you can visit here.
B) Retrieve Cloud Storage URL
The Cloud Storage URL comprises your bucket name and the object (filename).
To retrieve the Cloud Storage URL, you can carry out the following steps:
- Step 1: Open the Cloud Storage Console.
- Step 2: Now, you can browse to the location of the project (file) that contains the source data.
- Step 3: At the top of the Cloud Storage console, compose your URL by replacing the gs://bucket/file with the appropriate path, for example, gs://mybucket/myfile.json.bucket is the Cloud Storage bucket name and myfile is the name of the object (file) containing the data.
For more information on how to retrieve your Cloud Storage URL, you can visit here.
C) Retrieve Drive Data URL
To provide the Drive URL, you must share a link to the file, and it is usually in this format:
https://docs.google.com/spreadsheets/d/<em>FILE_ID</em>
or https://drive.google.com/open?id=<em>FILE_ID.FILE_ID</em>
is the alphanumeric ID for your Drive file.
To share the file, you can carry out the following steps:
- Step 1: Choose the file you want to share. This can be done on a computer, an Android device, or an iPhone and iPad.
- Step 2: On your computer, open any of the following apps:
- Google Drive
- Docs, Sheets
- Slides
- Step 3: Click on the “share” button.
For more information on how to retrieve your Drive data URI, you can visit here.
3) Querying Cloud Bigtable Data
Google Cloud Bigtable can be defined as its sparsely populated NoSQL database which can be used to scale data to billions of rows, thousands of columns, and petabytes of data.
To query Cloud Bigtable, you can use either a Permanent or Temporary table. You can also do this in the Console, using API, bq command-line tool, or Java.
For further information on how to query Cloud Bigtable data and using the other options listed above, you can visit here.
To query Cloud Bigtable data using a Permanent table through the bq command-line tool, you can carry out the following steps:
- Step 1: Create a table in the bq query command-line tool using the bq mk command.
- Step 2: Identify the table schema using a table definition file as shown below.
bq mk
--external_table_definition=definition_file
dataset.table
- definition_file refers to the path that points to the table definition file on your local machine.
- dataset is the nomenclature of the dataset in which the table is contained.
- table denotes the identifier of the table you are creating.
To create and query Bigtable data Temporary External tables with bq query command-line tool using a table definition file, you can carry out the following steps:
- Step 1: Enter the bq query command with the –external_table_definition flag.
- Step 2: Supply the –location flag and set the value to your location. This is optional and can be overlooked.
An example of the query using the bq command line is shown below:
bq --location=location query
--use_legacy_sql=false
--external_table_definition=table::definition_file
'query'
- location is the name of your location and the –location flag is optional.
- table is used to represent the name of the temporary table you are creating as part of the process.
- definition_file is the path to the table definition file on your local machine.
- query denotes the query that is being submitted to the temporary table.
4) Querying Cloud Storage Data
Google BigQuery supports the querying of the following Cloud Storage data in these formats Comma-Separated Values (CSV), JSON (newline-delimited), Avro, ORC, Parquet, Datasource exports, and Filestore exports.
It also supports the querying of Cloud Storage data from these storage classes:
- Standard
- Nearline
- Coldline
- Archive
Just like Bigtable, you can decide to query Cloud Storage data using either a Permanent or Temporary table using the bq command-line, DDL, API, Java, etc.
To create and query a Permanent External table using the bq query command-line tool, the table is created using the bq mk command line with the –external_table_defintion flag and the table schema is identified using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file also stored in your local machine.
To create a Permanent table linked to your Cloud Storage data source using a table definition file, you can carry out the following steps:
- Step 1: Query the table linked to an External data source using the bq query command line with the –external_table_definition flag.
- Step 2: You can identify the table schema using a table definition file. Then enter the following command as shown below:
bq mk
--external_table_definition=definition_file
dataset.table
- definition_file specifies the location of the table definition file on your local machine.
- dataset is the name of the dataset that the table is contained in.
- table refers to the name of the table you intend to create.
To create a Permanent table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below:
bq mk
--external_table_definition=schema@source_format=Cloud Storage URI
dataset.table
- schema appears in the format field:data_type and represents the schema definition.
- source_format is used to show the external data source format, for example, CSV.
- Cloud Storage URL refers to your Cloud Storage URL.
- dataset is the title of the dataset that the table is contained in.
- table refers to the name of the table you intend to create.
To create and query the Cloud Storage data Temporary table, you can use a table definition file with the query, create an inline schema with a query, or use a JSON schema definition file with the query. This can be done using the bq command-line tool, API, or client libraries.
To use the bq command-line tool, you can carry out the following steps:
- Step 1: Query the table linked to an External data source using the bq query command line with the –external_table_definition flag
- Step 2: Identify the table schema using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file.
To query a Temporary table linked to your Cloud Storage data source using a table definition, enter the following command as shown below:
bq --location=location query
--external_table_definition=table::definition_file
'query'
- location is used to point to the location path and the –location flag is optional.
- table is the label of the temporary table you are creating.
- definition_file constitutes the path to the table definition file on your local machine.
- query is the query you are submitting to the Temporary table.
To query a Temporary table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below.
bq --location=location query
--external_table_definition=table::schema@source_format=Cloud Storage URI
'query'
- location is the name of your location and using the –location flag is optional.
- table is the name of the temporary table you are creating.
- schema is the inline schema definition in the format field:data_type.
- source_format is the format of the external data source, for example, CSV.
- Cloud Storage URL is your Cloud Storage URL.
- query is the query you are submitting to the temporary table.
5) Querying Drive Data
You can query Drive External data source by providing the Drive URL path to your data and then creating a BigQuery External table that references the data source. The table that references the Drive can be Permanent or Temporary.
To create and query a Permanent External table, you can use the Cloud Console, the bq command-line tool’s mk command, create an ExternalDataConfiguration when using the tables.insert API and the client libraries.
To create a permanent external table using the bq command-line tool, you can carry out the following steps:
- Step 1: Use the bq mk command-line tool to create a table linked to the External data source.
- Step 2: Identify the table schema using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file.
To create a Permanent table linked to your Cloud Storage data source using a table definition, enter the following command as shown below:
bq mk
--external_table_definition=DEFINITION_FILE
DATASET.TABLE
- DEFINITION_FILE is the path to the table definition file on your local machine.
- DATASET is the name of the dataset that contains the table.
- TABLE is the name of the table you are creating.
To create a Permanent table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below.
bq mk
--external_table_definition=SCHEMA@SOURCE_FORMAT=DRIVE_URI
DATASET.TABLE
- SCHEMA is the schema definition in the format FIELD:DATA_TYPE.
- SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
- DRIVE_URL is your Drive URL.
- DATASET is the label of the dataset that the table is found in.
- TABLE is the name of the table you want to create.
To query Drive data using Temporary tables, you can run a command to combine a table definition file with a query, an inline schema definition with a query, or a JSON schema definition file with a query using bq command-line, API, Python, and Java.
To create and query a Temporary table using the bq command-line, you can carry out the following steps:
- Step 1: Use the bq query command with the –external_table_definition flag to query a Temporary table linked to the external source.
- Step 2: Identify the table schema using a table definition file stored in your local machine, an inline schema definition, or a JSON schema file
To query a Temporary table linked to your Cloud Storage data source using a table definition, enter the following command as shown below:
bq --location=LOCATION query
--external_table_definition=TABLE::DEFINITION_FILE
'QUERY'
- LOCATION is your location and using the –location flag is optional.
- TABLE is the name of the temporary table you are creating.
- DEFINITION_FILE is the location/route to the table definition file on your local machine.
- QUERY refers to the query that is to be submitted to the temporary table.
To query a Temporary table linked to your Cloud Storage data source using an inline schema definition, enter the following command as shown below.
bq --location=LOCATION query
--external_table_definition=TABLE::SCHEMA@SOURCE_FORMAT=DRIVE_URI
'QUERY'
- LOCATION denotes the location. The –location flag is optional and not compulsory.
- TABLE is used to point to the temporary table you are trying to create.
- SCHEMA is usually in the format FIELD:DATA_TYPE and it refers to the inline schema definition.
- SOURCE_FORMAT is CSV, NEWLINE_DELIMITED_JSON, AVRO, or GOOGLE_SHEETS.
- DRIVE_URL is your Drive URL.
- QUERY is the query you are submitting to the Temporary table.
Limitations of BigQuery External Tables
The are several limitations of using BigQuery External tables. They include the following:
- There is no data consistency for BigQuery External data tables as Bigquery does not guarantee it. Any change to the data while the query is running can result in an unexpected outcome.
- BigQuery External tables are read-only tables and cannot be modified using DML or other methods.
- The performance of querying BigQuery External tables may not be as high as querying data in a native Bigquery table.
- Bigquery jobs that export data from a BigQuery External table cannot be run.
- You cannot use the TableDataList JSON API method to retrieve data from BigQuery External tables.
- BigQuery External tables do not support table clustering.
- You have a limit of four (4) concurrent queries.
- You cannot reference a BigQuery External table using a wildcard table query.
- Whenever you query a BigQuery External data source other than Cloud Storage, the results are not cached.
- Querying data in Cloud Bigtable is only available in limited locations.
Conclusion
- This article has covered a lot on how to build BigQuery external tables to enable you to extract your data that are not stored on BigQuery from disparate data sources for further operations.
- It has exposed you to a wide range of topics as regards this by explaining what external tables are and detailing how to create them amongst other things.
- In this article, you have learned about BigQuery External Tables. This article also provided information on Google BigQuery, its key features, External Data Sources, different usage of BigQuery External Tables, and the limitations of BigQuery External Tables in detail.
- For further information on BigQuery JSON Extract, BigQuery Create View Command, BigQuery Partition Tables, you can visit the former links.
Share your experience of understanding Google BigQuery External Tables in the comment section below! We would love to hear your thoughts.
Ofem Eteng is a seasoned technical content writer with over 12 years of experience. He has held pivotal roles such as System Analyst (DevOps) at Dagbs Nigeria Limited and Full-Stack Developer at Pedoquasphere International Limited. He specializes in data science, data analytics and cutting-edge technologies, making him an expert in the data industry.