Addressing real-world data requires us to merge datasets from multiple data sources so that we could do data analysis or create new datasets. Even if the datasets are similar, there’s a chance they relate to distinct tables.
Integrating data into one single table is a must and SQL Operators like BigQuery UNION can help combine the result sets of 2 or more SELECT statements. With Google Cloud Platform BigQuery, Companies can run super-fast SQL queries on terabytes of data, in seconds, without requiring any hardware or infrastructure for operations.
BigQuery remains to be one of the best enterprise Data Warehouse solutions with interactive analysis of massive datasets and can combine data from multiple datasets using BigQuery UNION Operators.
Let’s get started by looking at how some of the current BigQuery UNION Operators may assist us in merging data, describing its many uses along with examples and exploring some common questions like the differences between BigQuery Union ALL and BigQuery Union DISTINCT.
Google BigQuery: A Quick Tour
BigQuery is Google’s Data Warehousing Solution. If you are unaware, a Data Warehouse is a Storage Solution optimized for Online Analytics Processing (OLAP). This is different from a Database, which is optimized for Online Transactional Processing (OLTP). Thus, in simple words, you can use BigQuery when you want to run lengthy and perhaps complex queries on massive data.
Like several other Data Warehouses, BigQuery organizes data in columns instead of rows. This allows for Parallel Query Execution. BigQuery is Serverless, i.e., it allocates Computing Resources according to the requirements, so that you need not worry about allocating the required resources for Computation (unlike Amazon Redshift). It is well suited for querying large chunks of data in a short duration (Spiky Workloads).
BigQuery supports two SQL dialects for querying data: Standard SQL and Legacy SQL. We will see how to perform UNION queries in BigQuery, in both dialects. But what are UNION queries? Let’s find out.
Ensure Data Integrity with Hevo’s No-code Data Pipeline
Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.
Start for free now!
Get Started with Hevo for Free
Overview of BigQuery UNION Queries
BigQuery UNION queries are frequently used in Data Analysis.
Consider the following example: You have an employee database containing tables for each function: Engineering, Sales, Marketing, HR, etc. All tables have similar columns, and you wish to get the list of all employees in the company, irrespective of the function.
One way is to query each table separately and then join the individual results, which is a long and time-consuming process. Alternatively, you can get the results using a single query, using the BigQuery UNION operator.
The BigQuery UNION operator essentially concatenates the results of two or more SELECT queries, along with the columns. Therefore, it is necessary that the subqueries have the same column names, types, and the same ordering of columns.
Having understood the idea behind the BigQuery UNION queries, we will now look at the implementation of these queries in Google BigQuery.
Load your Data from Source to Destination within minutes
No credit card required
Prerequisites for BigQuery UNION Queries
To understand BigQuery UNION queries better, you only need an understanding or rather, a familiarity with SQL queries. To implement these in practice, having a BigQuery account, with some data tables will help.
If you wish to try out BigQuery without providing your credit card information, you can try using their Sandbox. If you don’t have your own data tables to work with, you can use BigQuery’s Public Datasets.
BigQuery UNION Queries Using Standard SQL
Standard SQL is the default SQL dialect in BigQuery now. So let’s look at this dialect first. We’ll also look at some of the frequently used concepts related to BigQuery UNION queries in the discussion of this dialect. Later, we will cover the Legacy SQL dialect in brief.
Types of BigQuery UNION SQL Queries
If you have worked with UNION queries in SQL before, then BigQuery UNION queries won’t be very different. There is just one extra condition: the UNION keyword has to be followed by a second keyword (either ALL or DISTINCT). Thus, there are two types of BigQuery Union SQL queries:
- If you wish to preserve duplicate rows in the final result, you can use UNION ALL, and
- If you wish to keep just one instance of a duplicated row in the final result, you can use UNION DISTINCT.
Syntax for BigQuery UNION SQL Queries
The syntax for BigQuery Union SQL queries is as follows:
query1 UNION (ALL/DISTINCT) query2 UNION (ALL/DISTINCT) query3 …
Note that if you wish to use UNION ALL and UNION DISTINCT in the same query, then you must have parentheses to separate the two operations. For example,
query1 UNION ALL (query2 UNION DISTINCT query3)
This is a valid syntax.
query1 UNION ALL query2 UNION DISTINCT query3
This is an invalid syntax.
Parentheses are not required when the same operation is being repeated. For example:
query1 UNION ALL query2 UNION ALL query3
This is a valid syntax.
Explained with an Example
We’ll continue with the example discussed in the introduction. Say you have two tables: EngineeringEmployees and SalesEmployees. For simplicity, let’s assume they contain only 3 columns. Here’s how the two tables look like:
Table 1: EngineeringEmployees
name | age | yearsincompany |
Alice | 24 | 2 |
Bob | 30 | 5 |
Charlie | 25 | 2 |
Dave | 27 | 3 |
Table 2: SalesEmployees
name | age | yearsincompany |
Dave | 27 | 3 |
Emma | 35 | 10 |
Frank | 28 | 4 |
Grace | 24 | 1 |
Now, if you wish to get all the employees in the company, then your query will be:
SELECT * from EngineeringEmployees
UNION ALL
SELECT * from SalesEmployees
The result will be:
name | age | yearsincompany |
Alice | 24 | 2 |
Bob | 30 | 5 |
Charlie | 25 | 2 |
Dave | 27 | 3 |
Dave | 27 | 3 |
Emma | 35 | 10 |
Frank | 28 | 4 |
Grace | 24 | 1 |
As you can see, Dave is present twice. This is because the entry for Dave is present in both the tables (maybe it is a different Dave, or maybe Dave switched roles in his career; depends on how the data is organized in the company’s tables). Let’s assume it is the same Dave (the age and yearsincompany being similar strongly indicate that). In order to get the list of unique employees, you can use UNION DISTINCT.
SELECT * from EngineeringEmployees
UNION DISTINCT
SELECT * from SalesEmployees
name | age | yearsincompany |
Alice | 24 | 2 |
Bob | 30 | 5 |
Charlie | 25 | 2 |
Dave | 27 | 3 |
Emma | 35 | 10 |
Frank | 28 | 4 |
Grace | 24 | 1 |
Notice that Dave appears only once now. Please note the following:
- UNION DISTINCT removes duplicate rows. If any column in two rows is different in value, then the two rows aren’t duplicates of each other.
- UNION DISTINCT will remove duplicates even if they are within the same table. If in the above example, EngineeringEmployees had two entries for Dave (with same age and yearsincompany), even then the final result would have only one entry for Dave.
Example Using a Public Dataset
We will look at the geo_census_tracts dataset available under the bigquery-public-datasets.
If you wish to get the results for Alabama and Alaska, your query will be (in the editor):
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
from bigquery-public-data.geo_census_tracts.census_tracts_alabama
UNION ALL
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
from bigquery-public-data.geo_census_tracts.census_tracts_alaska
You can explore this data further. Suppose you wish to get only those counties whose area_land_meters value is > 10,000,000, then your query will be:
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
from bigquery-public-data.geo_census_tracts.census_tracts_alabama WHERE area_land_meters > 10000000
UNION ALL
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
from bigquery-public-data.geo_census_tracts.census_tracts_alaska WHERE area_land_meters > 10000000
I hope this gives you a fair idea of BigQuery UNION queries in Google BigQuery, using Standard SQL.
Integrate BigQuery to BigQuery
Integrate MySQL to BigQuery
BigQuery UNION Queries Using Legacy SQL
With Legacy SQL, the comma delimiter is the equivalent of UNION ALL queries. In other words, you simply add the tables one after the other, separated by a comma, after the FROM keyword.
Syntax for BigQuery UNION Legacy SQL Queries
The syntax is, thus,
SELECT columns FROM [table1],[table2],...
To give an example, the query on the geo-census-tracts discussed in the above section, with area_land_meters > 10,000,000, can be written as follows in Legacy SQL:
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
from [bigquery-public-data.geo_census_tracts.census_tracts_alabama], [bigquery-public-data.geo_census_tracts.census_tracts_alaska]
WHERE area_land_meters > 10000000
BigQuery UNION Legacy SQL Queries: Explained with an Example
What about UNION DISTINCT queries? Well, that’s not very straightforward with Legacy SQL. One would have thought that SELECT DISTINCT in the above query would have given the same effect. But SELECT DISTINCT is not supported in Legacy SQL. If you try, you will get the following error:
SELECT DISTINCT is currently not supported. Please use GROUP BY instead to get the same effect.
As the error message suggests, you can use GROUP BY. Thus, the above query can be written as follows, for the UNION DISTINCT effect:
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
from [bigquery-public-data.geo_census_tracts.census_tracts_alabama], [bigquery-public-data.geo_census_tracts.census_tracts_alaska]
WHERE area_land_meters > 10000000
GROUP BY state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
As you can see, we are specifying all the columns in the GROUP BY clause, meaning only the unique combinations of ALL the columns will appear in the result. If we did not specify some columns, we’d have to use an aggregate function (like COUNT() or MAX()) with those columns, for the query to work.
If you wish to verify that this query indeed works like UNION DISTINCT, you can replace ‘Alaska’ with ‘Alabama’ in the above query (meaning both the tables become the same), and run the query twice: once with GROUP BY and once without it.
The number of rows in the result will be half in the query with GROUP BY as compared to the query without GROUP BY.
Learn More About:
Standard SQL vs Legacy SQL BigQuery
Conclusion
We have seen the types, syntax, and example queries for BigQuery UNION operations for both Standard SQL and Legacy SQL dialects. We have also seen scenarios with lucid examples both from normal use cases and public datasets, where such queries are required. Being a Data Warehouse user, your Organisation might be dealing with megabytes or petabytes of data, and when it comes to analysis, these processes can get tough.
Hevo Data provides a seamless, no-code solution for transferring data between various sources and destinations like Google BigQuery. With its strong integration with 150+ sources, it ensures data is transformed, enriched, and ready for analysis. Try Hevo for free with a 14-day trial and explore its full range of features.
Sign up today to streamline your data workflows effortlessly!
FAQ
Does BigQuery have UNION?
Yes, BigQuery supports the UNION operator to combine the results of two or more queries. It helps in consolidating data from multiple tables into a single result set, ensuring that the columns in all queries are compatible in terms of number and types.
How to UNION two tables in BigQuery?
You can use the UNION or UNION ALL operator in SQL. For example:
SELECT * FROM table1
UNION ALL
SELECT * FROM table2;
What is the difference between UNION and UNION DISTINCT?
UNION implicitly applies DISTINCT, meaning it removes duplicate rows. UNION ALL (distinct’s opposite) includes all rows, including duplicates. Hence, UNION DISTINCT is redundant as UNION already filters duplicates.
Yash is a trusted expert in the data industry, recognized for his role in driving online success for companies through data integration and analysis. With a Dual Degree (B. Tech + M. Tech) in Mechanical Engineering from IIT Bombay, Yash brings strategic vision and analytical rigor to every project. He is proficient in Python, R, TensorFlow, and Hadoop, using these tools to develop predictive models and optimize data workflows.