Exploratory Data Analysis (EDA), also known as Data Exploration, is an approach to the data analysis process that employs various techniques to better understand the data at hand. EDA typically demands a high level of expertise in programming languages such as Python and R. However, not everyone who works with data is proficient in these languages. This is where BigQuery Analysis comes in.

BigQuery is Google's Data Warehouse with built-in tools for data ingestion, storage, and analysis. It uses standard Structured Query Language (SQL) queries to handle complex data and examine massive datasets. SQL is easy to learn and use, which is why BigQuery Analysis is heavily sought after by Business and Data Analysts for Data Exploration.

What is BigQuery?


Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.

Google BigQuery is a fully managed cloud service; you don't need to deploy any resources, such as disks or virtual machines. It is designed to process read-only data. Dremel and Google BigQuery use Columnar Storage for quick data scanning, as well as a tree architecture for executing queries using ANSI SQL and aggregating results across massive computer clusters.

Furthermore, Google BigQuery is serverless and extremely scalable, which gives it a short deployment cycle and on-demand pricing.

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA), also known as Data Exploration, is an approach to the data analysis process that employs various techniques to better understand the data we have. This can refer to a number of things, such as:

  • Identifying outliers, missing values, human error, or biased sampling.
  • Understanding the importance of the variables and removing useless ones.
  • Analyzing the relationship between dataset features (variables).
  • Ultimately, maximizing the insights gained from the dataset.

The facts you will uncover during Exploratory Data Analysis will steer the direction of your Machine Learning or analytics projects. Exploratory data analysis outlines key features of the data needed to generate more educated hypotheses that will lead to more promising outcomes.

Ultimately, Exploratory Data Analysis seeks to achieve two goals:

  • Offer insights into the relationships between variables.
  • Describe the dataset using summary statistics.

Exploratory Data Analysis and Data Preparation

There is some debate about whether Exploratory Data Analysis should be done before or after the Data Preparation step. Most practitioners agree that Exploratory Data Analysis should be viewed as an intrinsically cyclical process: preparing your data will probably spur new questions that require more Data Exploration, and so on. As a result, it's essential to adopt techniques that allow you to quickly analyze, prepare, and repeat, in combination with technologies that improve agility.

Types of Exploratory Data Analysis

EDA methods are cross-classified along two dimensions: each method is either graphical or non-graphical, and either univariate or multivariate. Graphical Exploratory Analysis relies heavily on visuals to uncover patterns, outliers, trends, and unexpected results.

Graphical Univariate

Graphical Univariate Data Analysis utilizes visual tools to display data, such as:

  • Box Plots: Box Plots summarize a dataset using five data points: the minimum, first quartile, median, third quartile, and maximum values.
  • Stem-and-leaf Plots: These plots present all data values and the shape of the distribution.
  • Histograms: They are one of the best ways to learn a lot about your data, such as central tendency, spread, modality, shape, and outliers. Histograms provide insight into the probability distribution that a dataset follows. They are typically represented as a bar chart that organizes the dataset into a series of individual values or ranges of values (a SQL sketch for building one follows this list).
  • Line Graphs: This graphical representation is one of the most basic chart types. It can be used to plot data points on a graph and has applications in almost every field of study.
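
Even without a charting tool, you can approximate a histogram directly in SQL by grouping a numeric column into fixed-width bins and counting the rows in each bin. The query below is a minimal sketch: the table and the TOTAL_LINE_AMT column come from the sample dataset used later in this post, while the $25 bin width is an arbitrary assumption you would tune to your data.

-- Approximate histogram: count order lines per $25 bucket of TOTAL_LINE_AMT
SELECT FLOOR(TOTAL_LINE_AMT / 25) * 25 AS bin_start,
       COUNT(*) AS frequency
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  GROUP BY bin_start
  ORDER BY bin_start;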

Graphical Multivariate

Graphical Multivariate EDA uses graphics to display the connections between two or more variables. The most used representation is a Grouped Bar Plot or Bar Chart, in which each group illustrates one level of one variable and each bar within a group represents the levels of the other variable.
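
The underlying data for such a grouped chart is easy to prepare in SQL. The sketch below assumes the ship_to_state and merchandise_dept_desc columns from the sample dataset used later in this post and counts order lines for each state/department combination; the result can be fed directly into a grouped bar chart.

-- Data for a grouped bar chart: order line counts per state and department
SELECT ship_to_state,
       merchandise_dept_desc,
       COUNT(*) AS order_lines
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  GROUP BY ship_to_state, merchandise_dept_desc
  ORDER BY ship_to_state, merchandise_dept_desc;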

Other common types of Multivariate Graphics include:

  • Multivariate Chart: A graphical presentation of the connections between factors and response.
  • Run Chart: A line graph of data plotted over time.
  • Bubble Chart: This Data Visualization Graph shows multiple circles (bubbles) in a two-dimensional plot.
  • Heatmaps: Also known as Shading Matrices, Heatmaps are Data Visualization techniques that use colors to compare numbers in a set of data.
  • Pictograms: Substitute numbers with images to visually illustrate data. They’re common when designing infographics.
  • Scattergrams or Scatterplots: These graphical EDA tools depict two variables in a set of data and are used to look for a relationship between them.

Univariate Non-Graphical EDA

Univariate Non-Graphical Exploratory Data Analysis methods focus on understanding the underlying distribution of the sample and making observations about the population; this includes outlier detection.

Univariate EDA for quantitative data creates preliminary estimates of the population distribution of a variable based on the sample data. The key features of the assessed population distribution include:

  • Center
  • Spread 
  • Modality
  • Shape 
  • Outliers

The measures of central tendency comprise the mean, median, and mode, with the mean being the most widely used. The median is the usual choice for skewed distributions or when there is concern about outliers. Measures of spread include:

  • Variance
  • Standard Deviation
  • Interquartile Range

Spread indicates how far away from the center we are still likely to find data values. Univariate Exploratory Data Analysis also helps locate skewness (a measure of asymmetry) and kurtosis (a measure of peakedness relative to a Gaussian shape).
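
All of these univariate summaries can be computed directly in BigQuery with standard SQL. The following query is a minimal sketch that assumes the sample table and TOTAL_LINE_AMT column used in the retail examples later in this post; the quartiles are approximated with APPROX_QUANTILES.

-- Centre, spread, and approximate quartiles of the order line amount
SELECT ROUND(AVG(TOTAL_LINE_AMT), 2) AS mean_amt,
       ROUND(STDDEV(TOTAL_LINE_AMT), 2) AS stddev_amt,
       MIN(TOTAL_LINE_AMT) AS min_amt,
       APPROX_QUANTILES(TOTAL_LINE_AMT, 4)[OFFSET(1)] AS q1_amt,
       APPROX_QUANTILES(TOTAL_LINE_AMT, 4)[OFFSET(2)] AS median_amt,
       APPROX_QUANTILES(TOTAL_LINE_AMT, 4)[OFFSET(3)] AS q3_amt,
       MAX(TOTAL_LINE_AMT) AS max_amt
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`;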

Multivariate Non-Graphical EDA

Multivariate Non-Graphical EDA methods typically display the relationship between two or more variables in the form of either a cross-tabulation or summary statistics. For example, we can compute a statistic of a quantitative variable (usually the outcome) for every level of a categorical variable (typically explanatory).
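
In BigQuery, this kind of grouped summary is a single query. The sketch below assumes the merchandise_dept_desc and TOTAL_LINE_AMT columns from the sample dataset used in the examples that follow, and reports the average order line amount per department.

-- Average (and count of) order line amounts per department
SELECT merchandise_dept_desc,
       COUNT(*) AS order_lines,
       ROUND(AVG(TOTAL_LINE_AMT), 2) AS avg_line_amt
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  GROUP BY merchandise_dept_desc
  ORDER BY avg_line_amt DESC;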

Now that you’re familiar with various Data Exploration techniques, let’s dive straight into BigQuery Analysis.

How to Perform BigQuery Analysis for Exploratory Data Analysis

One of the most challenging aspects of performing EDA is finding skilled people who have both business knowledge and expertise in newer technologies and toolsets. EDA typically demands a high level of expertise in programming languages such as Python and R. However, not everyone who works with data is proficient in these languages.

The solution? Platforms like Google BigQuery ML, which rely only on Structured Query Language (SQL) dialects, can be used directly by Business and Data Analysts.

Analysts can use BigQuery Analysis to perform Data Exploration and get the results within minutes or seconds. The difficult pursuit of getting a better understanding of the data is now within reach.

Example: EDA in Retail

A retail business can use BI applications to look at data in order to calculate sales in terms of how many items were sold, how much customers spent, what customers bought, and the seasonality of sales. There are numerous other data facts that retailers can review using BigQuery Analysis. Let’s go over some of these BigQuery Analysis functions.

Exploring a Table

The first action to take when working with a new dataset is to examine the data and understand what each column or field contains. One way to achieve this is to select all of the columns but only a limited number of records. Because BigQuery Analysis charges for the amount of data a query scans, it is advisable to preview just a few example records with the LIMIT clause, as displayed in the example below (note that LIMIT reduces the number of rows returned, not necessarily the bytes scanned and billed).

SELECT *
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  LIMIT 5

Another method for assessing a random sample of the data is the RAND() function, which returns a pseudo-random FLOAT64 value in the range [0, 1). To obtain a limited selection of records, use RAND() in the WHERE clause. Here's how you can do that using BigQuery Analysis.

 SELECT *
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  WHERE RAND() < 0.001

It can also be useful to get a clear understanding of the extent of the data; for example, which years and months does it cover? Here's how you can do that using BigQuery Analysis.

 SELECT FORMAT_DATE("%Y-%m", DATE(order_dt)) AS Order_Month, 
         COUNT(*) AS OrderCount
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  GROUP BY 1
  ORDER BY 1 ;

A different approach to obtaining the year and month is to use the EXTRACT function, as demonstrated below. Note that EXTRACT returns integers, so the values must be cast to strings before they are concatenated (and the months are not zero-padded, unlike with FORMAT_DATE). Here's how you can do that using BigQuery Analysis.

 SELECT CAST(EXTRACT(YEAR FROM order_dt) AS STRING) || "-" ||
        CAST(EXTRACT(MONTH FROM order_dt) AS STRING) AS Order_Month,
        COUNT(*) AS OrderCount
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  GROUP BY 1
  ORDER BY 1 ;

Here is a way to discover the total sales for all transactions in this data set using BigQuery Analysis.

SELECT   SUM(TOTAL_LINE_AMT) AS total
  FROM    `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`

Keep in mind that, due to floating-point arithmetic, BigQuery Analysis often returns numeric results with many decimal places. One of the most convenient ways to tidy these results is the ROUND function.

 SELECT ROUND(SUM(TOTAL_LINE_AMT), 2) AS total
  FROM    `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`

Let’s assume that next, we want to discover the total sales amount for each department. Here’s how you can do that using BigQuery Analysis.

 SELECT   merchandise_dept_desc, 
           SUM(TOTAL_LINE_AMT) AS depttotal
  FROM     `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  GROUP BY merchandise_dept_desc
  ORDER BY merchandise_dept_desc;

We can also get the total number of packages that have been dispatched to each state/region. Here’s how you can do that using BigQuery Analysis.

 
 SELECT ship_to_state, SUM(Actual_Total_Package_Qty ) AS Total_Packages
  FROM `handy-bonbon-142723.qvc_sample_data.sample_qvc_data`
  GROUP BY ship_to_state
  ORDER BY ship_to_state;


Conclusion

EDA is an essential step to take before diving into Machine Learning or Statistical Modeling, as it offers the context required to build a suitable model for the problem and to accurately interpret its results.

  • EDA helps Data Scientists ensure that the results they produce are accurately interpreted and relevant to the business context.
  • This blog introduced you to BigQuery and Exploratory Data Analysis and later took you through BigQuery Analysis.
  • BigQuery makes Business Analysis more efficient through intuitive and easy-to-use services.
  • Moreover, analyzing and visualizing data from multiple sources in BigQuery can be cumbersome. This is where Hevo comes in.

Share your experience of understanding BigQuery Analysis in the comments.

Roxana Raducanu
Freelance Technical Content Writer, Hevo Data

Roxana loves untangling the intricacies of data integration and analysis for data teams, offering informative content that helps individuals understand these complex subjects.
