Data-driven enterprises must have properly organized data in order to gain important business insights. Previously, businesses relied largely on Excel sheets or CSV files for data storage. However, in recent years, the amount of data collected by corporations has increased at an exponential rate. This might include information on how people interact with their product or service, what they think of their offers, how effectively their marketing efforts are performing, and so on.
Relying just on Excel sheets will not be sufficient. As a result, businesses are turning to Cloud Storage Services like Google BigQuery, which allows you to consolidate all of your data into a single system and analyze it using SQL queries and business intelligence tools. In the Standard SQL environment, you can efficiently query multiple tables using the BigQuery Wildcard expression.
In this article, you will learn how to efficiently use the BigQuery Wildcard Tables.
Tables of Content
What is Google BigQuery?
Image Source
Hosted on the Google Cloud Platform, BigQuery is the enterprise data warehouse for supporting analytics initiatives of organizations. It’s a serverless Software as a Service (SaaS) application with excellent Data Management, Access Control, and Machine Learning capabilities (Google BigQuery ML).
With features like Exabyte-Scale Storage and petabyte-scale SQL queries, Google BigQuery excels at analyzing large volumes of data and rapidly fulfills your Big Data processing needs.
The platform makes use of Columnar-storage, which enables data querying and result aggregation easier and more efficiently, resulting in a more dynamic business model.
For huge and complicated datasets, BigQuery features an analytical engine that provides rapid memory analysis at high concurrency. Moreover, analyzing data with Google Data Studio and Looker is a breeze with this platform. Also, BigQuery integrates with data in other Cloud products and applications via American National Standards Institute (ANSI)-compliant SQL, Open Database Connectivity (ODBC), and Java Database Connectivity (JDBC) drivers.
BigQuery enables customers to focus on finding valuable insights using familiar SQL without the requirement for a database administrator because there is no infrastructure to manage.
Key Features of Google BigQuery
- Easy to use: Google BigQuery is a user-friendly platform where users only need a basic grasp of SQL commands, ETL tools, and other similar concepts to get started. It supports a wide range of media formats, including JSON, AVRO, CSV, and XML data types.
- Fully Managed: As a completely managed service, there is no need for an in-house setup. To get started, simply login to the Google Cloud project using a browser. Google BigQuery takes care of complex maintenance and setup procedures like Server/VM Administration, Server/VM Sizing, Memory Management, and many more by providing serverless execution.
- Best-in-Class Performance: Because Google BigQuery is built on a column-based architecture, it has several advantages over standard row-based storage, including better storage efficiency and the ability to scan data more quickly. Supports nested tables for efficiently storing and retrieving data, thereby minimizing slot consumption, querying time, and data usage.
- Security: As all data is encrypted and in transit by default, Google BigQuery enables Column-level security that checks identity and access status and creates security policies. It is compatible with security requirements such as HIPAA, FedRAMP, PCI DSS, ISO/IEC, SOC 1, 2, 3, and others since it is part of the Google Cloud environment.
- Scalability: To help customers to scale processing and memory resources as per their requirements, BigQuery separates compute and storage. This implies, it offers a lot of vertical and horizontal scalability and can do real-time searches on petabytes of data in a short amount of time.
- Reliability: Data is automatically replicated using Google BigQuery, so even if your data is lost, you’ll always have a backup. Furthermore, the availability of query point-in-time snapshots from 7 days feature allows you to quickly monitor and analyze changes in data using Google BigQuery. As a result, your data will remain safe and secure with Google BigQuery. The Geoexpansion function protects your geographical data from being shared. This function, however, is only available in the United States, Asia, and Europe.
- Flexible Pricing: The Google BigQuery platform is offered in two subscription models: on-demand and flat-rate. Although users will be charged for the amount of data returned from each query and for the amount of data storage used, data exporting, loading, and copying are free. This separates BigQuery from conventional computational and storage resources.
For more information on the benefits of Google BigQuery, you can visit the Google Cloud BigQuery page.
Google BigQuery Architecture
Image Source
As mentioned earlier, the serverless architecture of BigQuery allows storage and computing to expand independently on demand. Customers benefit from this arrangement since they don’t need expensive computing resources up and running all of the time, giving them a lot of flexibility and cost management. This is in stark contrast to typical node-based cloud data warehouses or on-premise Massively Parallel Processing (MPP) systems.
BigQuery makes use of a large number of multi-tenant services that are powered by low-level Google infrastructure technologies, including Dremel, Colossus, Jupiter, and Borg. While Dremel, a huge multi-tenant cluster that runs SQL queries, is Compute in this case, Colossus, Google’s worldwide storage system, is used for storing. Between storage and computation, exists a ‘shuffle,’ which uses Google’s Jupiter network to transport data incredibly quickly from one location to another. Borg (Google’s predecessor to Kubernetes), allocates hardware resources and manages the mixers and slots.
Google BigQuery Hierarchy
BigQuery is organized in a four-level hierarchy, i.e.,
- Projects: Data storage containers at the top level.
- Datasets: They let you arrange your data and contain one or more tables of data within projects.
- Tables: Tables carry real data inside datasets.
- Jobs: These are data-related tasks that include performing queries, loading data, and exporting data.
Image Source
Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up Data Integration for 100+ Data Sources (Including 30+ Free Sources) and will let you directly load data to a data warehouse like Google BigQuery or a destination of your choice. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.
Get Started with Hevo for Free
Let’s look at some of the salient features of Hevo:
- Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
- Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
- Schema Management: Hevo can automatically detect the schema of the incoming data and maps it to the destination schema.
- Connectors: Hevo supports 100+ integrations to SaaS platforms, files, Databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within pipelines.
- Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-Day Free Trial!
Prerequisites
- An active BigQuery account.
- Knowledge of SQL.
- Knowledge of databases.
What is BigQuery Table?
A BigQuery table is made up of rows of individual records. Each record is divided into columns (also called fields). Each table has a schema that specifies the column names, data types, and other details. Users can provide a table’s schema when it is created, or they can create a table without one and later declare schema in the query or load operation that populates the table with data.
The following table types are supported by BigQuery:
- Tables backed by native BigQuery storage are referred to as Native tables.
- External tables are tables that are supported by storage that is not part of BigQuery.
- Views are virtual tables that are defined by SQL queries.
Note: When you build and use tables in BigQuery, your costs are determined by the amount of data contained in the tables and BigQuery Table Partitions, as well as the queries you execute against the table data.
Many table actions, such as loading, copying, and exporting data, are free — but they are subject to BigQuery’s quotas and restrictions. Let’s understand BigQuery Tables through the following 2 critical aspects:
A) Creating Tables
Depending on the data source, there are many ways to construct a table in BigQuery:
- Create an empty table and build up a data structure for it manually.
- Create a table with the results of a SQL query that was previously computed.
- Copy or Upload a file (in CSV, AVRO, JSON, Parquet, ORC, or Google Sheets format), from your PC and paste it.
- Create a table that connects an external source instead of downloading or streaming data: Cloud Bigtable, Cloud Storage, or Google Drive are all viable options.
In this article, we will focus on the first option –
- Go to the BigQuery Page and click on the Create Table option after selecting the dataset to which you wish to add the table.
- Select Empty Table in the Source field, and Table in the native format of the destination item in the Table Type field. Give the table a name.
- Set the schema for the table. The schema is made up of four parts: two that are required (column name and data type) and two that are optional (column mode and description). The data will be easier to deal with if the types and field modes are properly chosen.
Image Source
One can set schema by:
- Step 1: Fill out the name, type, and mode for each column by clicking the Add Field button.
- Step 2: Using the Edit as text switch, enter the table structure as a JSON array.
Other ways of doing the same are:
- Using the Cloud Console, enter the schema.
- Using the bq command-line tool, provide the schema inline.
- Using the bq command-line tool, provide a JSON schema file.
- When calling the API’s tables.insert method, include the schema in a table resource.
B) Query Big Data in BigQuery
You may query the data in your tables after you’ve loaded it into BigQuery. There are two sorts of queries that BigQuery can handle:
- Interactive Queries: Here the query is run as soon as it is possible.
- Batch Queries: Here, BigQuery queues each batch query on your behalf and starts it as soon as idle resources become available, which is generally within a few minutes.
Generally, the following are the components of basic queries:
- SELECT (required): this specifies which columns should be included in the query.
- FROM (required): the table containing the SELECT statement’s columns.
- WHERE: It is a filtering criterion for records
- ORDER BY: This option allows you to arrange the results in ascending or descending order.
- GROUP BY: It is a method for accumulating data in a result set.
Example:
SELECT * FROM table WHERE x = y
GROUP BY project_short_name,
ORDER BY num ASC
Query Multiple Tables using BigQuery WildCard Table
You may query more than one table with a single SQL expression using a BigQuery Wildcard table. A BigQuery Wildcard table is a collection of tables that match the BigQuery Wildcard phrase. Only standard SQL supports BigQuery Wildcard tables.
The * expression for a BigQuery Wildcard can be used to refer to many tables or files. This functionality allows you to load, extract, and query data from a variety of sources, destinations, and tables.
For example, the following FROM clause uses the BigQuery Wildcard expression books* to match all tables in the stories_books dataset that begin with the string books.
FROM `bigquery-public-data.stories_books.books*`
The syntax for a BigQuery Wildcard table is:
SELECT
FROM `<project-id>.<dataset-id>.<table-prefix>*`
WHERE Bool_expression
Here,
- <project-id> : Project ID for the Cloud Platform. If you use your default project ID, this is not required.
- <dataset-id> : BigQuery dataset ID.
- <table-prefix>: A string that is common to all tables that the BigQuery Wildcard character matches. It is not necessary to use a table prefix. When the table prefix is not specified, the dataset matches all tables.
- * (wildcard character): One or more characters of a table name are represented with the BigQuery Wildcard character “*”. A BigQuery Wildcard table name can only have the wildcard character as the last character.
Let’s understand BigQuery Wildcard in detail with the following 2 critical aspects:
A) _TABLE_SUFFIX Pseudo Column
The TABLE SUFFIX pseudo column in the WHERE clause is supported in queries with BigQuery Wildcard tables. Use the _TABLE_SUFFIX pseudo column in a WHERE clause with a constant expression to limit a query to a certain group of tables.
The use of _TABLE_SUFFIX with BigQuery Wildcard Tables reduces the number of bytes scanned, lowering the cost of running your queries.
Syntax:
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE(‘%Y%m%d’,DATE_SUB(CURRENT_DATE(), INTERVAL 36 MONTH))
AND
FORMAT_DATE(‘%Y%m%d’,DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Example: In a dataset ‘library-bq.scholasticbooks’, there is information about books published from 1950-to 2020. If we want to query just the table range of 2010-2019, the query will look like this:
SELECT BookMeta_Identifier, BookMeta_Date
FROM `library-bq.scholasticbooks.*`
WHERE _TABLE_SUFFIX BETWEEN '2010' AND '2019'
ORDER BY BookMeta_Identifier DESC
LIMIT 1000
There are a number of things users can do using BigQuery Wildcard tables. For instance,
- Load data from different files into a single table.
bq load project_id:dataset_name.table_name
gs://my_data/input/prefix/* ./schema.json
- Load data from different files into multiple tables:
bq load project_id:dataset_name.table_name
gs://my_data/input/prefix_1/*
gs://my_data/input/prefix_5/* gs://my_data/input/prefix_25/* ./schema.json
- Export data from BigQuery to GCS:
bq extract project_id:dataset_name.table_name
gs://my_data/extract/prefix/file_prefix_*.json
Things to Remember:
- Views are not supported by the BigQuery Wildcard table feature. The query fails if the BigQuery Wildcard table matches any view in the dataset. Whether or not your query has a WHERE clause on the _TABLE_SUFFIX pseudo column to filter out the view, this is true.
- Even if the Use Cached Results option is selected, cached results are not currently supported for searches against multiple tables using a BigQuery Wildcard. You will be charged for each query if you perform the same BigQuery Wildcard query several times.
- Only native BigQuery storage is supported by wildcard tables. When querying an external table or view, BigQuery Wildcards are not allowed.
- Because the BigQuery Wildcard table name contains the special character (*), you must enclose it in backtick (` )characters. You cannot use quotes(‘) either.
- The number of tables scanned in a BigQuery Wildcard table is not limited by filters on the _TABLE_SUFFIX or _PARTITIONTIME pseudo columns that incorporate JavaScript user-defined functions.
B) Table Wildcard Functions
In legacy SQL, a table wildcard function is identical to a comma-separated union of all the tables that the BigQuery Wildcard function has matched. BigQuery only accesses and charges you for tables that match the BigQuery Wildcard when you use a table wildcard function. When running a query, the FROM clause specifies the table’s BigQuery Wildcard functionalities.
To enjoy the equivalent BigQuery Wildcard properties, users can employ the following table wildcard functions to query multiple tables:
- TABLE_DATE_RANGE(): Queries several daily tables over a date period.
Syntax:
TABLE_DATE_RANGE(prefix, timestamp1, timestamp2)
Example:
If our scholasticbooks dataset has a series of daily tables with names in the format: scholasticbooks.booksYYYYMMDD, we can query specific daily tables from January 1st, 2010 to December 31st, 2019 with this query:
SELECT title, author
FROM TABLE_DATE_RANGE(scholasticbooks.books, TIMESTAMP('2010-01-01'),TIMESTAMP('2019-12-31'))
LIMIT 1000
- TABLE_DATE_RANGE_STRICT(): Queries several daily tables that cover a date range, with no missing dates.
Syntax:
TABLE_DATE_RANGE_STRICT(prefix, timestamp1, timestamp2)
This function is equivalent to TABLE_DATE_RANGE. The main difference is that TABLE_DATE_RANGE_STRICT fails and produces a Not Found: Table <table name> error if any daily table is absent from the sequence.
- TABLE_QUERY(): Searches for tables whose names satisfy a given predicate.
Syntax:
TABLE_QUERY(dataset, expr)
Example:
For the TABLE_QUERY function, we will use the same example we used for TABLE_DATE_RANGE. We will also add an expression in the REGEXP_MATCH function to ensure we only query tables from the 2010s:
SELECT BookMeta_Identifier, BookMeta_Date
FROM TABLE_QUERY([scholasticbooks.books],
'REGEXP_MATCH(table_id , r"^201[d]")')
ORDER BY BookMeta_Identifier DESC
LIMIT 1000
For more information on BigQuery Wildcard Tables, you can visit the Google BigQuery Documentation.
When to Use the Wildcard table?
Wildcard tables are advantageous when your dataset contains multiple tables with the alike named tables that have compatible schemas. Such datasets typically contain tables, and each table represents one-day, one-month, or one-year data.
For example, the NOAA Global Surface Summary of the Day Weather Data, a public dataset hosted by BigQuery, contains tables for each year from 1929 to the present. If you need to name all 12 tables in the FROM clause, the query that scans for all table IDs from 1929 to 1940 will be very long.
#standardSQL
SELECT max, ROUND((max-32)*5/9,1)
celsius, mo, da, year
FROM ( SELECT * FROM
`bigquery-public-data.noaa_gsod.gsod1929` UNION ALL
SELECT * FROM
`bigquery-public-data.noaa_gsod.gsod1930` UNION ALL
SELECT * FROM
`bigquery-public-data.noaa_gsod.gsod1931` UNION ALL
# ... Tables omitted for brevity
SELECT * FROM
`bigquery-public-data.noaa_gsod.gsod1940` )
WHERE
max != 9999.9 # code for missing data
ORDER BY
max DESC
The same query with a wildcard table is much simpler.
#standardSQL
SELECT
max, ROUND((max-32)*5/9,1) celsius, mo, da, year
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '29'
AND '40'
ORDER BY
max DESC
Conclusion
In this article, we learned the basics of BigQuery and its features. The article also provides in-depth knowledge about the concepts behind BigQuery Wildcard Table to help you understand and implement them efficiently. BigQuery is the one-stop solution for all the growing data storage and analytics challenges faced by businesses. It provides on-demand scaling and allows the required number of computing resources to handle the varying workloads of a firm. The user-friendly interface and the Standard SQL ecosystem allow Data Analysts to effortlessly execute complex queries to generate business reports in real-time.
Once the market demand for your products and services increases, the customer base will gradually begin to grow. Enormous amounts of data are generated associated with your customers, orders, etc. Continuously monitoring, handling, and integrating the data across all the applications used in your enterprise can be a time-consuming and tedious task. You would require to devote a section of your engineering bandwidth to Integrate, Clean, Transform and Load into your BigQuery Data Warehouse for further business analysis. All of this can be efficiently automated by a Cloud-Based ETL Tool like Hevo Data.
Visit our Website to Explore Hevo
Hevo Data, a No-code Data Pipeline can seamlessly transfer data, allowing you to focus on your important business objectives. This platform transfers data from a vast sea of sources into a Data Warehouse such as Google BigQuery or a destination of your choice. It a reliable, secure, and fully automated that doesn’t require you to write any code!
If you are using Google BigQuery as Date Warehouse and Analytics platform in your enterprise and searching for a stress-free alternative to Manual Data Integration, then Hevo can effectively automate this for you. Hevo, with its strong integration with 100+ sources & BI tools, allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiffy.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and simplify your Data Integration process. Do check out the pricing details to understand which plan meets all your business needs.
Share with us your experience of using the Google BigQuery Wildcard Tables. Let us know in the comments section below!