Addressing real-world data requires us to merge datasets from multiple data sources so that we could do data analysis or create new datasets. Even if the datasets are similar, there’s a chance they relate to distinct tables.

Integrating data into one single table is a must and SQL Operators like BigQuery UNION can help combine the result sets of 2 or more SELECT statements. With Google Cloud Platform BigQuery, Companies can run super-fast SQL queries on terabytes of data, in seconds, without requiring any hardware or infrastructure for operations.

BigQuery remains to be one of the best enterprise Data Warehouse solutions with interactive analysis of massive datasets and can combine data from multiple datasets using BigQuery UNION Operators. 

Let’s get started by looking at how some of the current BigQuery UNION Operators may assist us in merging data, describing its many uses along with examples and exploring some common questions like the differences between BigQuery Union ALL and BigQuery Union DISTINCT.

Google BigQuery: A Quick Tour

bigquery transaction: thumbnail

BigQuery is Google’s Data Warehousing Solution. If you are unaware, a Data Warehouse is a Storage Solution optimized for Online Analytics Processing (OLAP). This is different from a Database, which is optimized for Online Transactional Processing (OLTP). Thus, in simple words, you can use BigQuery when you want to run lengthy and perhaps complex queries on massive data.

Like several other Data Warehouses, BigQuery organizes data in columns instead of rows. This allows for Parallel Query Execution. BigQuery is Serverless, i.e., it allocates Computing Resources according to the requirements, so that you need not worry about allocating the required resources for Computation (unlike Amazon Redshift). It is well suited for querying large chunks of data in a short duration (Spiky Workloads).

BigQuery supports two SQL dialects for querying data: Standard SQL and Legacy SQL. We will see how to perform UNION queries in BigQuery, in both dialects. But what are UNION queries? Let’s find out.

Overview of BigQuery UNION Queries

BigQuery UNION queries are frequently used in Data Analysis. 

Consider the following example: You have an employee database containing tables for each function: Engineering, Sales, Marketing, HR, etc. All tables have similar columns, and you wish to get the list of all employees in the company, irrespective of the function. 

One way is to query each table separately and then join the individual results, which is a long and time-consuming process. Alternatively, you can get the results using a single query, using the BigQuery UNION operator.

The BigQuery UNION operator essentially concatenates the results of two or more SELECT queries, along with the columns. Therefore, it is necessary that the subqueries have the same column names, types, and the same ordering of columns.

Having understood the idea behind the BigQuery UNION queries, we will now look at the implementation of these queries in Google BigQuery.

Prerequisites for BigQuery UNION Queries

To understand BigQuery UNION queries better, you only need an understanding or rather, a familiarity with SQL queries. To implement these in practice, having a BigQuery account, with some data tables will help.

If you wish to try out BigQuery without providing your credit card information, you can try using their Sandbox. If you don’t have your own data tables to work with, you can use BigQuery’s Public Datasets.

BigQuery UNION Queries Using Standard SQL

Standard SQL is the default SQL dialect in BigQuery now. So let’s look at this dialect first. We’ll also look at some of the frequently used concepts related to BigQuery UNION queries in the discussion of this dialect. Later, we will cover the Legacy SQL dialect in brief.

Types of BigQuery UNION SQL Queries

If you have worked with UNION queries in SQL before, then BigQuery UNION queries won’t be very different. There is just one extra condition: the UNION keyword has to be followed by a second keyword (either ALL or DISTINCT). Thus, there are two types of BigQuery Union SQL queries:

  • If you wish to preserve duplicate rows in the final result, you can use UNION ALL, and 
  • If you wish to keep just one instance of a duplicated row in the final result, you can use UNION DISTINCT.

Syntax for BigQuery UNION SQL Queries

The syntax for BigQuery Union SQL queries is as follows:

query1 UNION (ALL/DISTINCT) query2 UNION (ALL/DISTINCT) query3 …

Note that if you wish to use UNION ALL and UNION DISTINCT in the same query, then you must have parentheses to separate the two operations. For example,

query1 UNION ALL (query2 UNION DISTINCT query3)

This is a valid syntax.

query1 UNION ALL query2 UNION DISTINCT query3

This is an invalid syntax.

Parentheses are not required when the same operation is being repeated. For example:

query1 UNION ALL query2 UNION ALL query3

This is a valid syntax.

Explained with an Example

We’ll continue with the example discussed in the introduction. Say you have two tables: EngineeringEmployees and SalesEmployees. For simplicity, let’s assume they contain only 3 columns. Here’s how the two tables look like:

Table 1: EngineeringEmployees

nameageyearsincompany
Alice242
Bob305
Charlie252
Dave273

Table 2: SalesEmployees

nameageyearsincompany
Dave273
Emma3510
Frank284
Grace241

Now, if you wish to get all the employees in the company, then your query will be:

SELECT * from EngineeringEmployees
UNION ALL
SELECT * from SalesEmployees

The result will be:

nameageyearsincompany
Alice242
Bob305
Charlie252
Dave273
Dave273
Emma3510
Frank284
Grace241

As you can see, Dave is present twice. This is because the entry for Dave is present in both the tables (maybe it is a different Dave, or maybe Dave switched roles in his career; depends on how the data is organized in the company’s tables). Let’s assume it is the same Dave (the age and yearsincompany being similar strongly indicate that). In order to get the list of unique employees, you can use UNION DISTINCT. 

SELECT * from EngineeringEmployees
UNION DISTINCT
SELECT * from SalesEmployees
nameageyearsincompany
Alice242
Bob305
Charlie252
Dave273
Emma3510
Frank284
Grace241

Notice that Dave appears only once now. Please note the following:

  • UNION DISTINCT removes duplicate rows. If any column in two rows is different in value, then the two rows aren’t duplicates of each other.
  • UNION DISTINCT will remove duplicates even if they are within the same table. If in the above example, EngineeringEmployees had two entries for Dave (with same age and yearsincompany), even then the final result would have only one entry for Dave.

Example Using a Public Dataset

We will look at the geo_census_tracts dataset available under the bigquery-public-datasets

If you wish to get the results for Alabama and Alaska, your query will be (in the editor):

SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters 
from bigquery-public-data.geo_census_tracts.census_tracts_alabama
 
UNION ALL 
 
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters 
from bigquery-public-data.geo_census_tracts.census_tracts_alaska

You can explore this data further. Suppose you wish to get only those counties whose area_land_meters value is > 10,000,000, then your query will be:

SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters 
from bigquery-public-data.geo_census_tracts.census_tracts_alabama WHERE area_land_meters > 10000000
 
UNION ALL 
 
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters 
from bigquery-public-data.geo_census_tracts.census_tracts_alaska WHERE area_land_meters > 10000000

I hope this gives you a fair idea of BigQuery UNION queries in Google BigQuery, using Standard SQL.

Simplify BigQuery ETL and Analysis with Hevo’s No-code Data Pipeline

Ensure Data Integrity with Hevo’s No-code Data Pipeline

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.

Start for free now!

Get Started with Hevo for Free

BigQuery UNION Queries Using Legacy SQL

With Legacy SQL, the comma delimiter is the equivalent of UNION ALL queries. In other words, you simply add the tables one after the other, separated by a comma, after the FROM keyword.

Syntax for BigQuery UNION Legacy SQL Queries

The syntax is, thus,

SELECT columns FROM [table1],[table2],...

To give an example, the query on the geo-census-tracts discussed in the above section, with area_land_meters > 10,000,000, can be written as follows in Legacy SQL:

SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters 
from [bigquery-public-data.geo_census_tracts.census_tracts_alabama], [bigquery-public-data.geo_census_tracts.census_tracts_alaska]
WHERE area_land_meters > 10000000

BigQuery UNION Legacy SQL Queries: Explained with an Example

What about UNION DISTINCT queries? Well, that’s not very straightforward with Legacy SQL. One would have thought that SELECT DISTINCT in the above query would have given the same effect. But SELECT DISTINCT is not supported in Legacy SQL. If you try, you will get the following error:

SELECT DISTINCT is currently not supported. Please use GROUP BY instead to get the same effect.

As the error message suggests, you can use GROUP BY. Thus, the above query can be written as follows, for the UNION DISTINCT effect:

SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters 
from [bigquery-public-data.geo_census_tracts.census_tracts_alabama], [bigquery-public-data.geo_census_tracts.census_tracts_alaska]
WHERE area_land_meters > 10000000
GROUP BY state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters 

As you can see, we are specifying all the columns in the GROUP BY clause, meaning only the unique combinations of ALL the columns will appear in the result. If we did not specify some columns, we’d have to use an aggregate function (like COUNT() or MAX()) with those columns, for the query to work. 

If you wish to verify that this query indeed works like UNION DISTINCT, you can replace ‘Alaska’ with ‘Alabama’ in the above query (meaning both the tables become the same), and run the query twice: once with GROUP BY and once without it.

The number of rows in the result will be half in the query with GROUP BY as compared to the query without GROUP BY.

Learn More About:

Standard SQL vs Legacy SQL BigQuery

Conclusion

We have seen the types, syntax, and example queries for BigQuery UNION operations for both Standard SQL and Legacy SQL dialects. We have also seen scenarios with lucid examples both from normal use cases and public datasets, where such queries are required. 

Being a Data Warehouse user, your Organisation might be dealing with megabytes or petabytes of data, and when it comes to analysis, these processes can get tough.

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations such as Google BigQuery, with a few clicks.

Hevo Data with its strong integration with 150+ sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin? Sign Up or a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also checkout our unbeatable pricing to choose the best plan for your organization.

We hope this article was of help to you. Share your experience of understanding the BigQuery UNION Queries in the comment section below! We would love to hear your thoughts.

Yash Sanghvi
Technical Content Writer, Hevo Data

Yash is a trusted expert in the data industry, recognized for his role in driving online success for companies through data integration and analysis. With a Dual Degree (B. Tech + M. Tech) in Mechanical Engineering from IIT Bombay, Yash brings strategic vision and analytical rigor to every project. He is proficient in Python, R, TensorFlow, and Hadoop, using these tools to develop predictive models and optimize data workflows.