Venturing into Data Science and choosing a tool to solve a given problem can be challenging, especially when you have a wide array of choices. In this age of data transformation, organizations are constantly seeking ways to improve the day-to-day handling of the data they produce and to minimize the cost of these operations. Handling such data transformations in the Cloud has therefore become imperative, as it is far easier to manage and more cost-efficient.
Data Warehousing architectures have changed rapidly over the years, and most of the notable service providers are now Cloud-based. Companies are therefore increasingly moving to align with such Cloud offerings, which provide lower upfront costs and better scalability and performance than traditional On-premise Data Warehousing systems. Google BigQuery is one of the most well-known and widely adopted Cloud-based Data Warehouses.
In this article, you will learn about Standard SQL and Legacy SQL in BigQuery. You will also gain a holistic understanding of Google BigQuery, its key features, SQL, and the differences between the Standard SQL and Legacy SQL dialects. Read along for in-depth information about Legacy SQL BigQuery.
What is Google BigQuery?
Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.
Google BigQuery is fully managed by Google Cloud. You don't need to deploy any resources, such as disks or virtual machines. It is designed to process read-only data. Dremel and Google BigQuery use Columnar Storage for quick data scanning, as well as a tree architecture for executing queries using ANSI SQL and aggregating results across massive computer clusters. Furthermore, owing to its short deployment cycle and on-demand pricing, Google BigQuery is serverless and designed to be extremely scalable.
For further information about Google Bigquery, follow the Official Documentation.
Key Features of Google BigQuery
Some of the key features of Google BigQuery are as follows:
1) Scalable Architecture
BigQuery has a scalable architecture and offers a petabyte-scale system that users can scale up and down as per load.
2) Faster Processing
Thanks to its scalable architecture, BigQuery can process petabytes of data within the stipulated time and is far more rapid than many conventional systems. BigQuery allows users to run analyses over millions of rows without worrying about scalability.
3) Fully Managed and Serverless
BigQuery is a product of the Google Cloud Platform, and thus it offers a fully managed and serverless system.
4) Security
BigQuery offers a high level of security, protecting data both at rest and in transit.
5) Real-time Data Ingestion
BigQuery can perform real-time data analysis, making it popular across IoT and transactional platforms.
6) Fault Tolerance
BigQuery offers replication, copying data across multiple zones or regions. This ensures consistent data availability even when a zone or region goes down.
7) Pricing Models
The Google BigQuery platform is available in both on-demand and flat-rate subscription models. Although data storage and querying are charged, exporting, loading, and copying data are free. BigQuery has separated computational resources from storage resources: you are only charged for compute when you run queries, and billing is based on the quantity of data the query processes.
What is SQL?
SQL stands for Structured Query Language and is used to execute queries against a database for data analytics. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems.
With the help of SQL, you can:
- Create and Delete Databases
- Create, Delete, and Update Tables
- Load/Unload data into Tables
- Set and Manage permissions on tables, procedures, and views
- And many more.
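As an illustration, a few of these operations might look like the following (table and column names here are hypothetical, using BigQuery's Standard SQL DDL/DML syntax):

```sql
-- Create a table (hypothetical schema)
CREATE TABLE employees (
  id     INT64,
  name   STRING,
  salary NUMERIC
);

-- Load (insert) data into the table
INSERT INTO employees (id, name, salary)
VALUES (1, 'Alice', 90000);

-- Query it for analytics
SELECT name, salary FROM employees WHERE salary > 50000;

-- Remove the table when it is no longer needed
DROP TABLE employees;
```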
A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ different sources (including 40+ free sources) to a Data Warehouse such as Google BigQuery or a Destination of your choice in real-time in an effortless manner. Hevo, with its minimal learning curve, can be set up in just a few minutes, allowing users to load data without having to compromise performance. Its strong integration with a multitude of sources allows users to bring in data of different kinds in a smooth fashion without having to write a single line of code.
Get Started with Hevo for Free
Check out some of the cool features of Hevo:
Sign up here for a 14-Day Free Trial!
- Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
- Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
- Connectors: Hevo supports 100+ integrations to SaaS platforms, files, Databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.
- Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
- 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (including 40+ free sources) that can help you scale your data infrastructure as required.
- 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
What are the different SQL Dialects offered by BigQuery?
BigQuery offers two SQL dialects for its users to query their data and define functions and procedures.
1) Legacy SQL BigQuery
Legacy SQL BigQuery was introduced first, and has constructs that ease and optimize querying, based on how BigQuery stores data internally.
2) Standard SQL BigQuery
Standard SQL dialect was introduced later to gain wider acceptance, interoperability, and compliance with SQL standards.
Both dialects have a lot in common, and it’s not very difficult to migrate from legacy SQL BigQuery to standard SQL BigQuery.
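The dialect can be chosen per query: prefixing the first line of a query with #legacySQL or #standardSQL overrides the default dialect. For example:

```sql
#legacySQL
SELECT 1 + 1 AS two;
```

```sql
#standardSQL
SELECT 1 + 1 AS two;
```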
How are Standard SQL & Legacy SQL BigQuery different?
The multiple parameters of differences between Standard SQL and Legacy SQL BigQuery are as follows:
1) Different Rules
- In Legacy SQL BigQuery, Aliases defined in a SELECT clause can be referenced in the GROUP BY, HAVING, and ORDER BY clauses of the query, but not by the FROM, WHERE, or OMIT RECORD IF clauses nor by other expressions in the same SELECT clause.
SELECT num AS num_alias, SUM(totals) AS sum_totals
FROM [myDataset.myTable]
WHERE totals > 0
GROUP BY num_alias
HAVING num_alias = 20
ORDER BY sum_totals DESC
Here, you cannot use the alias in the WHERE clause, the field name totals must be used directly.
Aliases can be used in HAVING and ORDER BY though.
- Corollary: In Legacy SQL BigQuery, fields referenced in a function do not need to be listed in the SELECT clause themselves. Therefore, the following query is valid, even though the clicks field is not selected directly:
SELECT country, SUM(clicks) FROM table GROUP BY country;
- In Legacy SQL BigQuery, you can use square brackets to escape reserved words so that you can use them as field names and aliases. For example, if you have a column named “partition“, which is a reserved word in BigQuery syntax, the queries referencing that field fail with obscure error messages unless you escape it with square brackets:
SELECT [partition] FROM [myDataset.myTable]
- In BigQuery you can have records in JSON/NoSQL format, where nested sub-records can exist within a record. In some queries you need to apply an aggregate function to every record, producing an aggregate over all sub-records in that record; e.g., the sum of all purchases by a person. To ease this, Legacy SQL BigQuery has the WITHIN keyword, which makes the aggregate function aggregate across repeated values within each record.
- So, in our example, for every customer record, exactly one aggregated sum_of_all_purchases will be produced.
- This type of aggregation is referred to as scoped aggregation.
SELECT customer_name, SUM(purchase_total) WITHIN RECORD AS sum_of_all_purchases
FROM [myDataset.customers]
- Historically, BigQuery did not support CROSS JOIN, and BigQuery users regularly needed to write UNION ALL queries. Later, for the sake of readability and reduced verbosity, the comma (,) operator was added as an alternative to UNION ALL in Legacy SQL BigQuery.
SELECT ip_address
FROM access_logs_today, access_logs_yesterday, access_logs_day_before
WHERE ip_address LIKE "192%"
2) Precautions with Legacy SQL BigQuery
If you're using the BigQuery API to load an integer outside the range [-2^53+1, 2^53-1] into an INT64 column, you should pass it as a string to avoid data corruption, since such values cannot be represented exactly as 64-bit floating-point JSON numbers.
3) New Datatypes
Some new data types were introduced in standard SQL but they do not exist in Legacy SQL BigQuery. These data types are:
GEOGRAPHY
As the name suggests, this data type represents a geographical location. It consists of a collection of points, lines, and polygons, represented as a point set: a subset of the surface of the Earth.
INTERVAL
An INTERVAL object represents a time duration, an amount of time, or a window of time.
Interval is composed of three independent parts, namely,
- Y-M: Years and Months
- D: Days
- H:M:S.F: Hours, Minutes, Seconds, and Subseconds.
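As a sketch, an INTERVAL can be used in Standard SQL date arithmetic, and MAKE_INTERVAL constructs one from its parts:

```sql
SELECT
  DATE_ADD(DATE '2020-01-01', INTERVAL 15 DAY) AS shifted_date,
  MAKE_INTERVAL(year => 1, month => 2, day => 15) AS combined_interval;
```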
JSON
Standard SQL adds JSON as a separate type on its own, to ease functions performed on JSON values.
STRUCT
A STRUCT is a container of ordered fields; each field must have a type (required) and an optional field name.
For example, STRUCT<INT64> is a container holding a single unnamed INT64 value, while STRUCT<myStr STRING(20)> contains a string of length up to 20, named myStr.
So, effectively, struct is a data type that has attributes in key-value pairs.
SELECT STRUCT("name" AS emp_name, "salary" AS remuneration, "New York" AS address) AS ny_emp
Migrating from Legacy SQL BigQuery to Standard SQL
Standard SQL complies with the latest standards and has modern constructs that ease querying nested and repeated data. It supports correlated subqueries, automatic predicate push-down through JOINs, and modern data types. Hence, there is a need to migrate from Legacy SQL BigQuery to Standard SQL BigQuery.
While doing so, some things that should be kept in mind are as follows:
- In Legacy SQL BigQuery you query a table with the naming convention projectName:datasetName.tableName, where ":" acts as a separator.
- In Standard SQL, replace this ":" with a period ".", i.e. projectName.datasetName.tableName.
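For example, against the public Shakespeare sample table, the two naming conventions look like this (square brackets and a colon in Legacy SQL; backticks and periods in Standard SQL):

```sql
#legacySQL
SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 5;
```

```sql
#standardSQL
SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 5;
```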
- A big semantic difference lies in the fact that a comma “,” in Legacy SQL BigQuery means a UNION ALL, whereas in Standard SQL it means a JOIN.
- So, in Legacy SQL BigQuery "FROM tableA, tableB" means FROM tableA UNION ALL tableB, whereas in Standard SQL "FROM tableA, tableB" means a cartesian product tableA × tableB, unless a join condition such as tableA.xId = tableB.yId is specified.
- You should prefer queries that do not use CROSS JOIN as CROSS JOIN can return a large amount of data and might result in a slow and inefficient query.
- CROSS JOIN operations do not allow ON clauses. (ON: Fields from any preceding JOIN input can be used as keys in the ON clauses of subsequent JOIN operators)
- CROSS JOIN is often used in places where window functions would be more efficient.
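For instance, rather than CROSS JOINing a table against its own aggregate, a window function computes the aggregate alongside each row (table and column names below are hypothetical):

```sql
-- Each row keeps its own value plus the grand total; no CROSS JOIN needed
SELECT
  customer_name,
  purchase_total,
  SUM(purchase_total) OVER () AS grand_total
FROM `myProject.myDataset.purchases`;
```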
- There are subtle differences in how data types are implemented in Standard SQL BigQuery.
- e.g. Standard SQL has a stricter range of valid TIMESTAMP values as compared to Legacy SQL BigQuery.
- So, if you have timestamp values outside the standard range of 0001-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999, you can pass your legacy timestamp value as a string to the SAFE_CAST function: it returns NULL if the value is out of range, which lets you detect and fix such timestamps accordingly.
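A quick sketch of this check in Standard SQL; SAFE_CAST returns NULL instead of raising an error on an invalid value:

```sql
#standardSQL
SELECT
  SAFE_CAST('2021-06-01 12:00:00' AS TIMESTAMP) AS valid_ts,
  SAFE_CAST('not-a-timestamp' AS TIMESTAMP) AS invalid_ts;  -- NULL
```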
- In Legacy SQL BigQuery you can escape reserved words with square brackets, e.g. SELECT [partition] FROM. Whereas in Standard SQL BigQuery, you use backticks, as in SELECT `partition` FROM.
- You should map your legacy data types to the appropriate Standard SQL data types. For example, RECORD to STRUCT, REPEATED to ARRAY, etc.
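As a sketch, a legacy REPEATED RECORD field maps to an ARRAY of STRUCTs in Standard SQL and is queried with UNNEST (table and column names below are hypothetical):

```sql
-- purchases is an ARRAY<STRUCT<item STRING, amount NUMERIC>>
SELECT c.customer_name, p.item, p.amount
FROM `myProject.myDataset.customers` AS c,
     UNNEST(c.purchases) AS p;
```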
- Both dialects vary in the syntax and semantics of Views. So, you cannot query a view defined in one using the second. One of the workarounds around this is to create a new view using Standard SQL, under a new name, and replace the one earlier defined in Legacy SQL BigQuery.
- You will need to replace your Legacy SQL BigQuery functions with equivalent Standard SQL functions.
- Also, the same functions sometimes behave differently between the two dialects; e.g. COUNT(DISTINCT x) in Legacy SQL BigQuery returns an approximate count (use EXACT_COUNT_DISTINCT for an exact value), whereas in Standard SQL BigQuery COUNT(DISTINCT x) returns the exact count.
- Replace OMIT IF (Legacy SQL BigQuery) with EXISTS (Standard SQL).
- The OMIT RECORD IF clause is a construct that is unique to BigQuery. It is particularly useful for dealing with nested, repeated schemas. It is similar to a WHERE clause, but different in two important ways.
- Firstly, it uses an exclusionary condition, which means that records are omitted if the expression returns true, but kept if the expression returns false or null.
- Secondly, the OMIT RECORD IF clause can (and usually does) use scoped aggregate functions in its condition.
SELECT repository.url
FROM [bigquery-public-data:samples.github_nested]
OMIT RECORD IF
  COUNT(payload.pages.page_name) <= 80;
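In Standard SQL, roughly the same filter can be expressed with a correlated subquery over the unnested repeated field of the same public sample table:

```sql
#standardSQL
SELECT repository.url
FROM `bigquery-public-data.samples.github_nested` AS t
WHERE (SELECT COUNT(page.page_name)
       FROM UNNEST(t.payload.pages) AS page) > 80;
```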
- Replace FLATTEN with JOIN (typically a JOIN against UNNEST in Standard SQL); BigQuery will automatically retain the nested and REPEATED associations.
- Replace NOT IN (Legacy SQL BigQuery) with NOT EXISTS / IS NOT NULL (Standard SQL BigQuery).
In this article, you have learned about the comparative study of Standard SQL vs Legacy SQL BigQuery. This article also provided information on Google BigQuery, its key features, SQL, and the differences between Standard SQL BigQuery and Legacy SQL BigQuery in detail. For further information on BigQuery JSON Extract, the BigQuery Create View Command, and BigQuery Partition Tables, you can visit the links above.
Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations with a few clicks.
Visit our Website to Explore Hevo
Hevo Data with its strong integration with 100+ data sources (including 40+ Free Sources) allows you to not only export data from your desired data sources & load it to the destination of your choice but also transform & enrich your data to make it analysis-ready. Hevo also allows integrating data from non-native sources using Hevo’s in-built Webhooks Connector. You can then focus on your key business needs and perform insightful analysis using BI tools.
Want to give Hevo a try?
Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at our pricing, which will assist you in selecting the best plan for your requirements.
Share your experience of understanding Standard SQL and Legacy SQL BigQuery in the comments section below! We would love to hear your thoughts.