BigQuery is one of the most prominent Big Data Warehouses, which allows you to store, manage, and analyze colossal amounts of data. It comes with a built-in query engine to run SQL queries on the BigQuery data and acquire meaningful insights. However, the BigQuery platform does not have a built-in Data Visualization feature. You can connect BigQuery with external tools such as BigQuery Jupyter Notebook to create effective visualizations for identifying the patterns and trends in the BigQuery data.
Since Python provides you with a vast set of data visualization libraries, you can connect BigQuery with Jupyter Notebook to create interactive dashboards and perform Data Analysis by executing very few lines of Python code. This article will help you connect BigQuery Jupyter Notebook.
Table of Contents
- What is BigQuery?
- What is Jupyter Notebook?
- Steps to Connect BigQuery Jupyter Notebook
Fundamental knowledge of Python.
What is BigQuery?
Introduced by Google in 2010, BigQuery is a Cloud-based and serverless Data Warehousing platform that enables you to manage, process, and analyze Big Data. BigQuery is a fully managed and serverless Data Warehousing service that allows you to process and analyze Terabytes of data in a matter of seconds and Petabytes of data in less than a minute. Since it is a serverless computing model, BigQuery lets you execute SQL queries to seamlessly analyze big data while requiring no infrastructure management.
Google provides a complete package to their users with bulk data loading feature on Google Cloud Storage. This big data can easily be accessed using BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP, or Python. On top of that, Google BIgQuery comes with built-in Artificial Intelligence and Machine Learning model development and implementation capabilities. Machine Learning models can easily be created and executed using SQL queries.
Simplify BigQuery ETL and Data Integration using Hevo’s No-code Data Pipeline
Hevo Data helps you directly transfer data from 100+ Data Sources (including 30+ free sources) to BigQuery, Business Intelligence tools, Data Warehouses, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.Get started with hevo for free
Let’s look at some of the salient features of Hevo:
- Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
- Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for 100’s of sources that can help you scale your data infrastructure as required.
- Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
- Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
What is Jupyter Notebook?
Steps to Connect BigQuery Jupyter Notebook
Follow the below-mentioned steps to establish a BigQuery Jupyter Notebook connection.
- BigQuery Jupyter Notebook Connection Prerequisites
- Enabling BigQuery API
- Getting an Authentication file
- Connecting BigQuery Jupyter Notebook
BigQuery Jupyter Notebook Connection Prerequisites
To establish a BigQuery Jupyter Notebook connection, you have to satisfy 3 prerequisites.
- You should have an active Google Cloud account and a pre-installed Jupyter Notebook.
- If you have not previously installed Jupyter Notebook, you can download and install the Anaconda distribution package.
- After installing the Anaconda Distribution package, you must launch the Anaconda Navigator to install Jupyter Notebook.
Enabling BigQuery API
- You can use BigQuery APIs to extract or fetch data from BigQuery datasets. In other words, the BigQuery service provides you with various REST APIs that are used to collect and transmit data from external data sources.
- You can access BigQuery API using exclusive client libraries like Python, Java, and C++. When your application doesn’t permit you to use other client libraries, you will have to use HTTP commands to manually access the BigQuery API.
- You can enable BigQuery API by using two methods: Using Cloud Console and Cloud Shell.
Method 1: Enabling BigQuery API from Cloud Console
- Visit the Google Cloud Platform.
- Now, go to the navigation menu and click on “APIs & Services“, as shown in the above image.
- In the search bar, type “BigQuery API” and press Enter.
- Now, you are redirected to the new page where you can see various APIs of BigQuery. Click on the “BigQuery API” displayed at the top.
- Then, you are again taken to the new page, where you are prompted to enable the BigQuery API. Click on the “ENABLE” button.
- On successfully enabling the BigQuery API, you can see the “API Enabled” tag, as shown in the above image.
- You can also see the Activation status as “Enabled” in the overview tab of Google Cloud Platform.
Method 2: Enabling BigQuery API from Cloud Shell
- Open your Cloud Shell and execute the following command to enable the BigQuery API.
gcloud services enable bigquery.googleapis.com
- You can check whether the BigQuery API is successfully enabled by executing the command given below.
gcloud services list
Getting an Authentication File
- To access data from BigQuery using the BigQuery API, you have to create a service account and download an authentication file for Jupyter Notebook.
- For creating a new service account, go to the “Create Service Account” page on the Google Cloud Platform.
- Now, you are prompted to enter the Service account name, ID, and description. After filling in the details, click on the “Create and Continue” button.
- You can see the newly created service account in the “Credentials” field of the Google Cloud Platform, thereby ensuring the service account is created successfully.
- Now, you have to create an authentication key for the service account. In the “Credentials” field, click the pencil icon beside the newly created service account. Then, click on the “Add Key” button and select the “Create New Key” option.
- In the next step, select the key type as JSON and click on the “Create” button.
- Now, the JSON file with the authentication information is downloaded to your local machine.
Connecting BigQuery Jupyter Notebook
- At this stage, you are all set to connect BigQuery Jupyter Notebook. Now, open a command prompt and enter the following code to install the necessary packages for connecting BigQuery with Jupyter Notebook.
pip install google-cloud pip install google-cloud-bigquery[pandas] pip install google-cloud-storage
- You can also install the packages from Jupyter Notebook by executing the one-line command. Open your Jupyter Notebook and run the following command in the code cell.
pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'
- Now, you have to load the BigQuery client library by executing the command given below.
- In the further steps, you will be querying the data from the BigQuery public dataset samples.
- Now, execute the following commands to run a SQL query and store results in the form of DataFrame.
%%bigquery SELECT source_year AS year, COUNT(is_male) AS birth_count FROM `bigquery-public-data.samples.natality` GROUP BY year ORDER BY year DESC LIMIT 15
- After executing the above command, you get an output that resembles the following image.
- Enter the below command in the next cell block to repeat the same query. However, this time, the query saves the results in the form of a DataFrame to a new variable called “total_births,” which is passed to “%% bigquery” as an argument. The total_births data can also be further analyzed and visualized using Python.
%%bigquery total_births SELECT source_year AS year, COUNT(is_male) AS birth_count FROM `bigquery-public-data.samples.natality` GROUP BY year ORDER BY year DESC LIMIT 15
- After executing the above command, a new DataFrame named “total_births” is created to perform data visualization in the next step.
- For performing data visualization, you have to install the Matplotlib library. Execute the following command to install the Matplotlib package.
- Now, execute the following code to visualize the “total_births” data in the form of a Bar Chart using the Pandas DataFrame.plot() method.
total_births.plot(kind="bar", x="year", y="birth_count")
- After executing the above code, you get the output that resembles the following image.
- You can further analyze the sample data to retrieve the number of births by weekdays.
%%bigquery births_by_weekday SELECT wday, SUM(CASE WHEN is_male THEN 1 ELSE 0 END) AS male_births, SUM(CASE WHEN is_male THEN 0 ELSE 1 END) AS female_births FROM `bigquery-public-data.samples.natality` WHERE wday IS NOT NULL GROUP BY wday ORDER BY wday ASC
- The above-given query saves the results in the form of DataFrame to a new variable called “births_by_weekday,” which is passed to “%% bigquery” as an argument.
- You can further plot the “births_by_weekday” data using the plot method by executing the code given below.
- After executing the above command, you will get the output, as shown below.
- You can also use the BigQuery client library to analyze BigQuery data. Run the following code to import the BigQuery client library for Python.
from google.cloud import bigquery
- Then, execute the command given below to initialize the BigQuery client.
client = bigquery.Client()
- In the next step, run the following code that includes the client.query() method to retrieve the annual count of plural births by plurality (2 for twins, 3 for triplets, and so on).
sql = """ SELECT plurality, COUNT(1) AS count, year FROM `bigquery-public-data.samples.natality` WHERE NOT IS_NAN(plurality) AND plurality > 1 GROUP BY plurality, year ORDER BY count DESC """ df = client.query(sql).to_dataframe() df.head()
- Then, convert the result set “df” to a pivot table by executing the following command.
pivot_table = df.pivot(index="year", columns="plurality", values="count")
- Now, run the code given below to create a stacked bar according to the count of plural births over time.
pivot_table.plot(kind="bar", stacked=True, figsize=(15, 7))
- The output of the above code resembles the following output.
- Now, run the following query to retrieve the count of births by the number of gestation weeks using the client.query()method.
sql = """ SELECT gestation_weeks, COUNT(1) AS count FROM `bigquery-public-data.samples.natality` WHERE NOT IS_NAN(gestation_weeks) AND gestation_weeks <> 99 GROUP BY gestation_weeks ORDER BY gestation_weeks """ df = client.query(sql).to_dataframe()
- Then, plot the output of the above code by running the following code snippets.
ax = df.plot(kind="bar", x="gestation_weeks", y="count", figsize=(15, 7)) ax.set_title("Count of Births by Gestation Weeks") ax.set_xlabel("Gestation Weeks") ax.set_ylabel("Count")
- The output of the above code resembles the following image.
On executing the above-mentioned steps, you have successfully connected BigQuery Jupyter Notebook and also performed Data Analysis and visualization using the BigQuery APIs and Client Libraries.
In this article, you learned about BigQuery, Jupyter Notebook, how to enable BigQuery API and connect it to Jupyter Notebook. This article only focused on connecting BigQuery Jupyter Notebook via client libraries. However, you can also connect any external database servers like Oracle, SQL Server, and MySQL with Jupyter Notebook by using the respective ODBC drivers or connectors of the Database.
BigQuery supports many more Data Sources and allows loading data from other traditional and Cloud-based applications, SaaSs, CRMs, BI tools, etc. You may require a Data Integration tool to do extract and load data from multiple sources into Snowflake, and this is where Hevo comes in.visit our website to explore hevo
Hevo Data with its strong integration with 100+ Sources & BI tools allows you to not only export data from sources & load data in the destinations such as BigQuery, but also transform & enrich your data, & make it analysis-ready so that you can focus only on your key business needs and perform insightful analysis using BI tools.
Share your experience of working with BigQuery Jupyter Notebook in the comments section below.