Addressing real-world data requires us to merge datasets from multiple data sources so that we could do data analysis or create new datasets. Even if the datasets are similar, there’s a chance they relate to distinct tables. Integrating data into one single table is a must and SQL Operators like BigQuery UNION can help combine the result sets of 2 or more SELECT statements.
With Google Cloud Platform BigQuery, Companies can run super-fast SQL queries on terabytes of data, in seconds, without requiring any hardware or infrastructure for operations. BigQuery remains to be one of the best enterprise Data Warehouse solutions with interactive analysis of massive datasets and can combine data from multiple datasets using BigQuery UNION Operators.
Let’s get started by looking at how some of the current BigQuery UNION Operators may assist us in merging data, describing its many uses along with examples and exploring some common questions like the differences between BigQuery Union ALL and BigQuery Union DISTINCT.
- Google BigQuery: A Quick Tour
- Prerequisites for BigQuery UNION Queries
- BigQuery UNION Queries Using Standard SQL
- BigQuery UNION Queries Using Legacy SQL
Google BigQuery: A Quick Tour
BigQuery is Google’s Data Warehousing Solution. If you are unaware, a Data Warehouse is a Storage Solution optimized for Online Analytics Processing (OLAP). This is different from a Database, which is optimized for Online Transactional Processing (OLTP). Thus, in simple words, you can use BigQuery when you want to run lengthy and perhaps complex queries on massive data.
Like several other Data Warehouses, BigQuery organizes data in columns instead of rows. This allows for Parallel Query Execution. BigQuery is Serverless, i.e., it allocates Computing Resources according to the requirements, so that you need not worry about allocating the required resources for Computation (unlike Amazon Redshift). It is well suited for querying large chunks of data in a short duration (Spiky Workloads).
BigQuery supports two SQL dialects for querying data: Standard SQL and Legacy SQL. We will see how to perform UNION queries in BigQuery, in both dialects. But what are UNION queries? Let’s find out.
Overview of BigQuery UNION Queries
BigQuery UNION queries are frequently used in Data Analysis.
Consider the following example: You have an employee database containing tables for each function: Engineering, Sales, Marketing, HR, etc. All tables have similar columns, and you wish to get the list of all employees in the company, irrespective of the function.
One way is to query each table separately and then join the individual results, which is a long and time-consuming process. Alternatively, you can get the results using a single query, using the BigQuery UNION operator. The BigQuery UNION operator essentially concatenates the results of two or more SELECT queries, along with the columns. Therefore, it is necessary that the subqueries have the same column names, types, and the same ordering of columns.
Having understood the idea behind the BigQuery UNION queries, we will now look at the implementation of these queries in Google BigQuery.
Let’s get started.
Prerequisites for BigQuery UNION Queries
To understand BigQuery UNION queries better, you only need an understanding or rather, a familiarity with SQL queries. To implement these in practice, having a BigQuery account, with some data tables will help. If you wish to try out BigQuery without providing your credit card information, you can try using their Sandbox. If you don’t have your own data tables to work with, you can use BigQuery’s Public Datasets.
Simplify BigQuery ETL and Analysis with Hevo’s No-code Data Pipeline
A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ different sources (including 30+ free sources) to a Data Warehouse such as Google BigQuery or Destination of your choice in real-time in an effortless manner. Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources allows users to bring in data of different kinds in a smooth fashion without having to code a single line.Get Started with Hevo for Free
Check out some of the cool features of Hevo:
- Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
- Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
- Connectors: Hevo supports 100+ integrations to SaaS platforms, files, Databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.
- Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
- 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (including 30+ free sources) that can help you scale your data infrastructure as required.
- 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
BigQuery UNION Queries Using Standard SQL
Standard SQL is the default SQL dialect in BigQuery now. So let’s look at this dialect first. We’ll also look at some of the frequently used concepts related to BigQuery UNION queries in the discussion of this dialect. Later, we will cover the Legacy SQL dialect in brief.
Types of BigQuery UNION SQL Queries
If you have worked with UNION queries in SQL before, then BigQuery UNION queries won’t be very different. There is just one extra condition: the UNION keyword has to be followed by a second keyword (either ALL or DISTINCT). Thus, there are two types of BigQuery Union SQL queries:
- If you wish to preserve duplicate rows in the final result, you can use UNION ALL, and
- If you wish to keep just one instance of a duplicated row in the final result, you can use UNION DISTINCT.
Syntax for BigQuery UNION SQL Queries
The syntax for BigQuery Union SQL queries is as follows:
query1 UNION (ALL/DISTINCT) query2 UNION (ALL/DISTINCT) query3 …
Note that if you wish to use UNION ALL and UNION DISTINCT in the same query, then you must have parentheses to separate the two operations. For example,
query1 UNION ALL (query2 UNION DISTINCT query3)
This is a valid syntax.
query1 UNION ALL query2 UNION DISTINCT query3
This is an invalid syntax.
Parentheses are not required when the same operation is being repeated. For example:
query1 UNION ALL query2 UNION ALL query3
This is a valid syntax.
BigQuery UNION SQL Queries: Explained with an Example
We’ll continue with the example discussed in the introduction. Say you have two tables: EngineeringEmployees and SalesEmployees. For simplicity, let’s assume they contain only 3 columns. Here’s how the two tables look like:
Table 1: EngineeringEmployees
Table 2: SalesEmployees
Now, if you wish to get all the employees in the company, then your query will be:
SELECT * from EngineeringEmployees UNION ALL SELECT * from SalesEmployees
The result will be:
As you can see, Dave is present twice. This is because the entry for Dave is present in both the tables (maybe it is a different Dave, or maybe Dave switched roles in his career; depends on how the data is organized in the company’s tables). Let’s assume it is the same Dave (the age and yearsincompany being similar strongly indicate that). In order to get the list of unique employees, you can use UNION DISTINCT.
SELECT * from EngineeringEmployees UNION DISTINCT SELECT * from SalesEmployees
Notice that Dave appears only once now. Please note the following:
- UNION DISTINCT removes duplicate rows. If any column in two rows is different in value, then the two rows aren’t duplicates of each other.
- UNION DISTINCT will remove duplicates even if they are within the same table. If in the above example, EngineeringEmployees had two entries for Dave (with same age and yearsincompany), even then the final result would have only one entry for Dave.
BigQuery UNION SQL Queries Example Using a Public Dataset
We will look at the geo_census_tracts dataset available under the bigquery-public-datasets.
As you can see, it contains different tables for each state. If you wish to get the results for Alabama and Alaska, your query will be (in the editor):
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters from bigquery-public-data.geo_census_tracts.census_tracts_alabama UNION ALL SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters from bigquery-public-data.geo_census_tracts.census_tracts_alaska
You can explore this data further. Suppose you wish to get only those counties whose area_land_meters value is > 10,000,000, then your query will be:
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters from bigquery-public-data.geo_census_tracts.census_tracts_alabama WHERE area_land_meters > 10000000 UNION ALL SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters from bigquery-public-data.geo_census_tracts.census_tracts_alaska WHERE area_land_meters > 10000000
I hope this gives you a fair idea of BigQuery UNION queries in Google BigQuery, using Standard SQL.
BigQuery UNION Queries Using Legacy SQL
With Legacy SQL, the comma delimiter is the equivalent of UNION ALL queries. In other words, you simply add the tables one after the other, separated by a comma, after the FROM keyword.
Syntax for BigQuery UNION Legacy SQL Queries
The syntax is, thus,
SELECT columns FROM [table1],[table2],...
To give an example, the query on the geo-census-tracts discussed in the above section, with area_land_meters > 10,000,000, can be written as follows in Legacy SQL:
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters from [bigquery-public-data.geo_census_tracts.census_tracts_alabama], [bigquery-public-data.geo_census_tracts.census_tracts_alaska] WHERE area_land_meters > 10000000
BigQuery UNION Legacy SQL Queries: Explained with an Example
What about UNION DISTINCT queries? Well, that’s not very straightforward with Legacy SQL. One would have thought that SELECT DISTINCT in the above query would have given the same effect. But SELECT DISTINCT is not supported in Legacy SQL. If you try, you will get the following error:
SELECT DISTINCT is currently not supported. Please use GROUP BY instead to get the same effect.
As the error message suggests, you can use GROUP BY. Thus, the above query can be written as follows, for the UNION DISTINCT effect:
SELECT state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters from [bigquery-public-data.geo_census_tracts.census_tracts_alabama], [bigquery-public-data.geo_census_tracts.census_tracts_alaska] WHERE area_land_meters > 10000000 GROUP BY state_fips_code, county_fips_code, lsad_name, area_land_meters, area_water_meters
As you can see, we are specifying all the columns in the GROUP BY clause, meaning only the unique combinations of ALL the columns will appear in the result. If we did not specify some columns, we’d have to use an aggregate function (like COUNT() or MAX()) with those columns, for the query to work.
If you wish to verify that this query indeed works like UNION DISTINCT, you can replace ‘Alaska’ with ‘Alabama’ in the above query (meaning both the tables become the same), and run the query twice: once with GROUP BY and once without it. The number of rows in the result will be half in the query with GROUP BY as compared to the query without GROUP BY.
We have seen the types, syntax, and example queries for BigQuery UNION operations for both Standard SQL and Legacy SQL dialects. We have also seen scenarios with lucid examples both from normal use cases and public datasets, where such queries are required.
Being a Data Warehouse user, your Organisation might be dealing with megabytes or petabytes of data, and when it comes to analysis, these processes can get tough. Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations such as Google BigQuery, with a few clicks.
Hevo Data with its strong integration with 100+ sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.Visit our Website to Explore Hevo
Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at the amazing pricing, which will assist you in selecting the best plan for your requirements.
We hope this article was of help to you. Share your experience of understanding the BigQuery UNION Queries in the comment section below! We would love to hear your thoughts.