Organizations collect huge volumes of data but struggle to generate valuable insights. Although Google BigQuery helps businesses handle large datasets using standard SQL, SQL alone is not sufficient for complex Data Analysis and Visualization. The Google Cloud Console enables analysts to use BigQuery together with the Python Pandas package, giving precise control over the data needed for Data Analysis and Visualization tasks.
This article gives an overview of using Pandas BigQuery. It introduces you to BigQuery and the fundamentals of Python for Data Analysis, and it lists the steps to create a project in the Google Cloud Console, load and analyze data, and visualize data. So, read along to gain an understanding of using Pandas BigQuery.
Prerequisites
You need to meet the following requirements to understand Pandas BigQuery:
- Understanding of Pandas.
- Understanding of BigQuery.
- A Google Cloud Account.
Introduction to BigQuery
BigQuery is a fully managed, petabyte-scale Cloud Data Warehouse solution that allows real-time Analytics over large datasets. Since its inception in 2011, BigQuery has used standard SQL at the backend to run superfast queries on Google’s infrastructure. It also enables analysts to swiftly manage, query, integrate, and create Machine Learning models on their data. With BigQuery, administrators can also control data and authenticate users, providing fine-grained column-level security.
To learn more about Google BigQuery, visit the official site here. Later in this article, you will learn how to analyze and visualize your data using Pandas BigQuery.
Fundamentals of Python for Data Analysis
Python is a high-level, general-purpose programming language used by developers for task automation and for developing web and software applications. However, Python has also been adopted by many non-programmers for Analytics and complex calculations, making it a multi-purpose programming language. With its extensive collection of libraries, Data Analysis tasks can be performed swiftly in Python.
You can explore more about Python here.
Popular Python Libraries for Data Analysis
There are countless open-source libraries available in Python that benefit many data-driven enthusiasts. A library in Python is a collection of modules with predefined code that provides ready-made functionality. Some of the most popular libraries used for Data Analysis include:
1) NumPy
NumPy (Numerical Python) is a Python library with high-performance capabilities for performing complex mathematical tasks on multi-dimensional Arrays and Matrices. It provides inbuilt high-level functions, including Linear Algebra and Fourier transform routines, that allow developers to perform computationally demanding tasks.
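As a quick illustration of the kind of high-level functions NumPy exposes (a minimal sketch on made-up sample data):

import numpy as np

# Linear Algebra: eigenvalues of a 2x2 matrix.
matrix = np.array([[4.0, 2.0], [1.0, 3.0]])
print(np.linalg.eigvals(matrix))

# Fourier transform of a small sample signal.
signal = np.array([1.0, 0.0, -1.0, 0.0])
print(np.fft.fft(signal))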
2) Matplotlib
Matplotlib is a two-dimensional (2-D) plotting library that helps in Data Visualization. It allows Analysts to draw statistical inferences by visualizing charts like Box Plots, Scatter Plots, and Line Graphs.
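A minimal sketch of the kind of 2-D plots mentioned above, using made-up sample data:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

# A Line Graph and a Scatter Plot side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, y)
ax1.set_title("Line Graph")
ax2.scatter(x, y)
ax2.set_title("Scatter Plot")
plt.show()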
3) Pandas
Pandas, also known as the ‘Python Data Analysis Library’, is one of the most popular packages built on top of NumPy for Data Analysis. Every Data Science project life cycle involves exploratory Data Analysis, and the Pandas library plays a vital role there through its data structures, Series and DataFrame, which help uncover trends, patterns, and underlying relationships.
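The two core data structures look like this in practice (a minimal sketch with made-up numbers):

import pandas as pd

# A Series is a labeled one-dimensional array.
births = pd.Series([120, 135, 128], index=["Mon", "Tue", "Wed"])
print(births.mean())

# A DataFrame is a two-dimensional labeled table.
df = pd.DataFrame({"year": [2010, 2011, 2012],
                   "birth_count": [4000000, 3950000, 3900000]})
print(df.describe())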
Understanding the Need for Pandas BigQuery
While BigQuery analyzes large datasets quickly using standard SQL syntax, Pandas offers the flexibility of Python for fine-grained analysis. For better performance, BigQuery and Pandas can be integrated, for example by running Python on a Google Compute Engine (GCE) instance, to achieve more precise and faster Data Analysis, as the minimal sketch below illustrates.
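For example, the simplest integration point is Pandas’ own read_gbq function, backed by the pandas-gbq package; this is a minimal sketch, and “your-project-id” is a placeholder for your own project:

import pandas as pd

# Requires the pandas-gbq package; "your-project-id" is a placeholder.
df = pd.read_gbq(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "GROUP BY name ORDER BY total DESC LIMIT 10",
    project_id="your-project-id",
)
print(df)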
Key Features of Pandas
Pandas is well known for Data Cleaning and Manipulation tasks. Here are a few of its key features:
- Handling Data: The Pandas library represents large datasets effectively to facilitate complex analysis. It offers several methods for detecting, removing, and replacing null values in a DataFrame.
- Cleaning Data: Extracted data is crude by default, and analysis performed on such data carries higher variance. Pandas tidies up messy data to provide better accuracy.
- Merging Data: Pandas is extremely efficient at merge operations, which assists in insight generation, since Data Analysis often requires blending and joining multiple datasets.
- Reading Data: Since data is generated in various formats, it is essential for a library to read multiple file formats. Pandas supports a large number of file formats, including JSON, CSV, and HDF5, as the sketch below illustrates.
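Here is a small sketch of the cleaning, merging, and reading patterns described above; the data is made up and the file name in the read example is hypothetical:

import pandas as pd

# Cleaning: detect and replace null values.
sales = pd.DataFrame({"region": ["East", "West", None],
                      "revenue": [100.0, None, 80.0]})
sales["region"] = sales["region"].fillna("Unknown")
sales = sales.dropna(subset=["revenue"])  # drop rows missing revenue

# Merging: blend two datasets on a shared key.
targets = pd.DataFrame({"region": ["East", "Unknown"], "target": [90, 70]})
print(sales.merge(targets, on="region", how="left"))

# Reading: the same API pattern covers many formats, e.g.
# pd.read_csv("sales.csv"), pd.read_json(...), pd.read_hdf(...).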
In the next sections, you will learn how to load, analyze, and visualize your data using Pandas BigQuery.
Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ Data Sources including 30+ Free Sources. Getting started is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo loads the data onto the desired Data Warehouse/destination such as Google BigQuery, enriches it, and transforms it into an analysis-ready form without requiring a single line of code.
Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that data is handled in a secure, consistent manner with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
GET STARTED WITH HEVO FOR FREE
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely easy for new customers to work with and perform operations on.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Simplify your Data Analysis with Hevo today!
SIGN UP HERE FOR A 14-DAY FREE TRIAL!
Pandas BigQuery: Steps to Create a New Google Cloud Project
The Google Cloud Console provides a graphical interface for running BigQuery SQL queries. You must have an active Google Cloud account. Below are the steps to use BigQuery from the Cloud Console:
- If you are using Google Cloud for the first time, create a new account. (New customers get $300 free credits to run, test, and deploy workloads.)
- In the Google Cloud Console, create a new Google Cloud Project from the project selector page, and create and download a service account JSON key for authentication.
- BigQuery is automatically enabled for any new project. However, to activate BigQuery in a preexisting project, enable the BigQuery API. For further steps, follow this documentation.
- Set up the Python development environment and authenticate it, as the sketch below illustrates.
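One common way to authenticate is to point the client library at the downloaded JSON key through an environment variable. This is a minimal sketch; the key path is a placeholder for wherever you saved your key:

import os
from google.cloud import bigquery

# Placeholder path: point this at wherever you saved the JSON key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your-key.json"

# If authentication succeeds, creating a client raises no error.
client = bigquery.Client()
print(client.project)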
Let’s discuss the Pricing Plan for the Google Cloud Project. The pricing consists of two components:
- Analysis Pricing: It is the cost incurred to process queries that include SQL queries, user-defined Functions, and Scripts. While running queries, it offers a choice of two pricing models:
- On-Demand Pricing: In this pricing model, you pay for the number of bytes processed by each query. (A dry run, sketched after this list, can estimate that number in advance.)
- Flat-Rate Pricing: With this pricing model, you pay for dedicated processing capacity purchased as slot commitments, such as flex (60 seconds), monthly, and annual.
- Storage Pricing: It is the cost incurred to store data that you load into BigQuery. There are two types of data storage, and the first 10 GB of storage per month is free for both, as discussed below:
- Active Storage: It includes any table or table partition that has been modified in the last 90 days.
- Long-Term Storage: It includes any table or table partition that has not been modified in the last 90 days.
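Because on-demand pricing is billed per byte scanned, it can be worth estimating a query’s cost before running it. Below is a minimal sketch using the client library’s dry-run option; it assumes your environment is already authenticated:

from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports the bytes it would scan
# without executing it, so it incurs no analysis cost.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    job_config=job_config,
)
print("This query would process {} bytes.".format(job.total_bytes_processed))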
Pandas BigQuery: Steps to Load and Analyze Data
To leverage Pandas BigQuery, you have to install the google-cloud-bigquery Python client library (version 1.9.0 or later) and the BigQuery Storage API Python client library. They can be installed using ‘pip’ or ‘conda’ as shown below:
Syntax for pip:
pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'
Syntax for conda:
conda install -c conda-forge google-cloud-bigquery \
    google-cloud-bigquery-storage pandas pyarrow
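To confirm the installation, you can check that both libraries import correctly:

# Quick sanity check that the client libraries are installed.
from google.cloud import bigquery
from google.cloud import bigquery_storage

print(bigquery.__version__)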
To load and analyze data using Pandas BigQuery, you can use either of the following approaches:
1) Using IPython Magics
To use IPython magics with BigQuery, follow the steps below:
- Start a Jupyter notebook server and create a new Jupyter notebook.
- To load the IPython magics for BigQuery, use the command below:
%load_ext google.cloud.bigquery
- As of google-cloud-bigquery version 1.26.0, the BigQuery Storage API is used by default to download query results. To run a query and save the results to a DataFrame, use the %%bigquery magic as shown below:
%%bigquery tax_forms
SELECT * FROM `bigquery-public-data.irs_990.irs_990_2012`
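The results are saved into an ordinary Pandas DataFrame named tax_forms in the notebook session, which you can then inspect in a subsequent cell, for example:

# tax_forms is now a regular Pandas DataFrame in the notebook session.
tax_forms.head()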
2) Using Python Client Libraries
Download Query Results Using the BigQuery Client Library
You can download and analyze query results by using the BigQuery client library together with Pandas. For instance, suppose you want to store the most-viewed Stack Overflow questions tagged ‘google-bigquery’ in a Pandas DataFrame. This can be achieved using the steps below:
- Import the BigQuery client library from google.cloud and create a client instance.
- Store the query in a variable ‘query_string’ as standard SQL embedded within triple double-quotes.
- Define a variable ‘dataframe’ to store the query results. The ‘to_dataframe’ method stores the results generated from the query in ‘dataframe’, and you can inspect the top five records using the head method, as shown below:
from google.cloud import bigquery

bqclient = bigquery.Client()

# Download query results.
query_string = """
SELECT
  CONCAT(
    'https://stackoverflow.com/questions/',
    CAST(id as STRING)) as url,
  view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        # Optionally, explicitly request to use the BigQuery Storage API.
        # As of google-cloud-bigquery version 1.26.0 and above, the
        # BigQuery Storage API is used by default.
        create_bqstorage_client=True,
    )
)
print(dataframe.head())
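Once the results are local, standard Pandas operations apply. For example, to summarize the distribution of the view counts retrieved above:

# Standard Pandas analysis on the downloaded results, e.g., summary
# statistics for the view_count column.
print(dataframe["view_count"].describe())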
Download Table Data Using the BigQuery Client Library
You can also download all table records using the ‘list_rows’ method, which returns a RowIterator object. All the rows can be downloaded via the BigQuery Storage API by calling the ‘to_dataframe’ method, as shown in the following code:
from google.cloud import bigquery

bqclient = bigquery.Client()

# Download a table.
table = bigquery.TableReference.from_string(
    "bigquery-public-data.utility_us.country_code_iso"
)
rows = bqclient.list_rows(
    table,
    selected_fields=[
        bigquery.SchemaField("country_name", "STRING"),
        bigquery.SchemaField("fips_code", "STRING"),
    ],
)
dataframe = rows.to_dataframe(
    # Optionally, explicitly request to use the BigQuery Storage API.
    # As of google-cloud-bigquery version 1.26.0 and above, the
    # BigQuery Storage API is used by default.
    create_bqstorage_client=True,
)
print(dataframe.head())
Download Table Data Using the BigQuery Storage API Client Library
The BigQuery Storage API client library is used for precise control over filters. When you only need to filter rows or select columns, a read session can be used in place of a query. You can create a TableReadOptions object to select columns or filter rows, and then create a read session using the ‘create_read_session’ method, as shown below:
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types
import pandas

# Replace with your own project ID; the read session is billed to it.
your_project_id = "project-for-read-session"

bqstorageclient = bigquery_storage.BigQueryReadClient()

project_id = "bigquery-public-data"
dataset_id = "new_york_trees"
table_id = "tree_species"
table = f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"

# Select columns to read with read options. If no read options are
# specified, the whole table is read.
read_options = types.ReadSession.TableReadOptions(
    selected_fields=["species_common_name", "fall_color"]
)

parent = "projects/{}".format(your_project_id)
requested_session = types.ReadSession(
    table=table,
    # Avro is also supported, but the Arrow data format is optimized to
    # work well with column-oriented data structures such as pandas
    # DataFrames.
    data_format=types.DataFormat.ARROW,
    read_options=read_options,
)
read_session = bqstorageclient.create_read_session(
    parent=parent, read_session=requested_session, max_stream_count=1,
)
To read from the streams in a session, use the ‘read_rows’ method and loop across pages via the pages property. To convert each page into a Pandas DataFrame, use the ‘to_dataframe’ method and concatenate the frames, as shown below:
# This example reads from only a single stream. Read from multiple streams
# to fetch data faster. Note that the session may not contain any streams
# if there are no rows to read.
stream = read_session.streams[0]
reader = bqstorageclient.read_rows(stream.name)

# Parse all Arrow blocks and create a dataframe.
frames = []
for message in reader.rows().pages:
    frames.append(message.to_dataframe())
dataframe = pandas.concat(frames)
print(dataframe.head())
Since Google Cloud charges for resources, it is recommended to either delete the entire project or keep the project and delete individual resources once your work is complete. Below are the steps to clean up resources:
- Go to the Manage Resources page in the Google Cloud Console.
- From the available project list, select the project to be deleted, and click Delete.
- In the dialog box, type the project ID and click Shut Down to delete the desired project.
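If you prefer the command line, the same cleanup can be performed with the gcloud CLI; the project ID below is a placeholder for your own:

gcloud projects delete your-project-id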
Pandas BigQuery: Steps to Visualize Data
In this section, you will learn how to visualize your data using Pandas BigQuery.
1) Using BigQuery Client Library
To query and visualize data in BigQuery using a Jupyter notebook, follow the steps below:
Step 1: Start a Jupyter notebook using the following command in your terminal.
jupyter notebook
Step 2: When the Jupyter notebook opens in a browser window, click the New button and select Python 3 to create an IPython notebook.
Step 3: Click on Untitled, rename the notebook to ‘BigQuery tutorial’, and click Rename.
Step 4: Load the BigQuery client library’s magic commands using the following line:
%load_ext google.cloud.bigquery
Step 5: Use the ‘%%bigquery’ cell magic, which runs a SQL query and returns the results as a Pandas DataFrame. For instance, to return the ‘year’ and ‘birth_count’ columns from the natality table as a Pandas DataFrame, use the code below:
%%bigquery
SELECT
  source_year AS year,
  COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY year
ORDER BY year DESC
LIMIT 15
Step 6: To store the result in a variable, run the query below, providing the variable name (total_births) as an argument to %%bigquery:
%%bigquery total_births
SELECT
  source_year AS year,
  COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY year
ORDER BY year DESC
LIMIT 15
Step 7: Once you have saved the data to total_births, use the magic command below to activate ‘matplotlib’, the library Pandas uses to plot graphs.
%matplotlib inline
Step 8: To visualize a bar chart, use the Pandas DataFrame.plot() method.
total_births.plot(kind="bar", x="year", y="birth_count")
Step 9: If the task is to visualize the number of births by weekday, you can make a line chart as shown below:
%%bigquery births_by_weekday
SELECT
  wday,
  SUM(CASE WHEN is_male THEN 1 ELSE 0 END) AS male_births,
  SUM(CASE WHEN is_male THEN 0 ELSE 1 END) AS female_births
FROM `bigquery-public-data.samples.natality`
WHERE wday IS NOT NULL
GROUP BY wday
ORDER BY wday ASC

births_by_weekday.plot(x="wday")
Step 10: The above visualization shows that the number of births decreases considerably on Saturday (6) and Sunday (7). To create a checkpoint (preserving the notebook’s current state), click File and select ‘Save and Checkpoint’.
2) Using Pandas Dataframe
The previous approach (magic commands) offers minimal syntax for interacting with BigQuery; behind the scenes, %%bigquery uses the BigQuery client library for Python. In contrast, using the BigQuery client library for Python directly offers more control over queries and enables more complex configurations. To query and visualize data in BigQuery using Pandas DataFrames, follow the steps below:
Step 1: Import the BigQuery client library for Python and initialize a client. The client is used to send requests to and receive responses from the BigQuery API.
from google.cloud import bigquery
client = bigquery.Client()
Step 2: Run SQL queries using the Client.query() method. For instance, to retrieve an annual count of plural births by plurality, use the code below:
sql = """
SELECT
plurality,
COUNT(1) AS count,
year
FROM
`bigquery-public-data.samples.natality`
WHERE
NOT IS_NAN(plurality) AND plurality > 1
GROUP BY
plurality, year
ORDER BY
count DESC
"""
df = client.query(sql).to_dataframe()
df.head()
Step 3: To visualize the queried results in your DataFrame, insert the following code to create a stacked bar chart of the count of plural births over time:
pivot_table = df.pivot(index="year", columns="plurality", values="count")
pivot_table.plot(kind="bar", stacked=True, figsize=(15, 7))
So, now you have gained a comprehensive overview of loading, analyzing, and visualizing data using Pandas BigQuery.
Conclusion
In this article, you were introduced to Google BigQuery and various Python libraries for Data Analysis. You also understood the need for Pandas BigQuery. In addition, you learned the steps to Load, Analyze, and Visualize your Data using Pandas BigQuery.
Data Analysis and Visualization results play a vital role in making future business decisions. Although Google BigQuery supports straightforward analysis of large datasets using standard SQL at the backend, Analysts use Pandas BigQuery to investigate and visualize data more extensively. By using both tools together, organizations can optimize results, costs, and more.
Moreover, extracting complex data from a diverse set of data sources using simple Python scripts can be quite challenging and cumbersome. A simpler alternative like Hevo may be the right solution for you!
Hevo Data is a No-Code Data Pipeline that offers a faster way to move data from 100+ Data Sources including 30+ Free Sources, into your Data Warehouse such as BigQuery to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.
VISIT OUR WEBSITE TO EXPLORE HEVO
Want to take Hevo for a spin?
SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Share your experience of Loading, Analyzing, and Visualizing data with Pandas BigQuery in the comments section below!