Connecting BigQuery to S3: 4 Easy Steps

• December 14th, 2021


Over time, data has become one of an organization's most valuable assets and a central input to decision-making. Today, most organizations' data is scattered across multiple public clouds, so they are constantly on the lookout for solutions that make it simple to ingest data from multiple sources while also allowing them to tailor the ingestion process to their specific needs.

This article will guide you through the process of setting up BigQuery to S3 Integration using 4 simple steps. It will provide you with a brief overview of Google BigQuery and Amazon S3 along with their key features. You will also explore how to query data after setting up BigQuery to S3 Integration in further sections. Let’s get started.

Note: Currently, Hevo Data doesn’t support S3 as a destination.


Prerequisites

You will have a much easier time connecting BigQuery to S3 if you have gone through the following prerequisites:

  • An active Google Cloud Platform account.
  • An active Amazon S3 account.
  • You’ll need an AWS IAM (Identity and Access Management) user with an access key and a secret key to configure the AWS CLI (Command Line Interface).

Introduction to Google BigQuery

Google BigQuery Logo

Today, most organizations are looking for ways to extract business insights from data, but it can be a challenge to ingest, store, and analyze that data as its volume and scope increase. Google's enterprise Data Warehouse, BigQuery, has been designed to make large-scale data analysis accessible to everyone. When the size of recorded data grows to gigabytes, terabytes, or even petabytes, an enterprise needs a more efficient system such as a Data Warehouse.

Google BigQuery can handle massive amounts of data, e.g., logs from the outlets of retail chains or IoT (Internet of Things) data from millions of utility meters, telecom connections, and vehicles across the globe, all queryable with SQL (Structured Query Language). BigQuery is a Platform as a Service (PaaS) offering: a fully managed Data Warehouse with a serverless architecture. It enables organizations to focus on analytics instead of managing infrastructure.

Key Features of Google BigQuery

Some of the key features of Google BigQuery include:

  • Real-Time Analytics: Google BigQuery offers real-time analytics based upon its high-speed streaming insertion API (Application Programming Interface). The user just needs to stream in real-time data, and Google BigQuery can analyze it almost instantaneously. 
  • Fully Managed Platform: Being fully managed, there is nothing to set up or install, and no Database Administrator is needed. One can simply log in to the Google Cloud project from a browser and get started. 
  • Easy Querying and Data Analysis: Data in Google BigQuery is stored in structured tables, which means one can use standard SQL for easy querying and data analysis. It is well suited for Big Data because Google BigQuery manages all storage and scaling operations automatically for the client. 
  • Data Transfer Service: Google BigQuery also supports the Data Transfer Service, via which users can load data into BigQuery on a scheduled basis from sources such as Google Marketing Platform, Google Ads, YouTube, partner SaaS applications, Teradata, and Amazon S3. 
  • Automatic Backup and Restore: Google BigQuery offers automatic backup and restore options. It keeps track of changes for 7 days, so that comparisons with previous versions can be made and changes recalled if necessary.
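As a small illustration of the standard SQL querying mentioned above, the following sketch uses the google-cloud-bigquery client library against the public COVID-19 dataset that also appears later in this article. The client setup and credentials are assumed, and `run_query` is an illustrative helper name, not part of any official API.

```python
# A minimal sketch of querying BigQuery with standard SQL. Assumes the
# google-cloud-bigquery package is installed and credentials are configured
# (e.g., via the GOOGLE_APPLICATION_CREDENTIALS environment variable).
QUERY = """
SELECT country_name, COUNT(*) AS row_count
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
GROUP BY country_name
ORDER BY row_count DESC
LIMIT 10
"""

def run_query(client):
    """Run QUERY with a google.cloud.bigquery.Client and return the rows.

    Example:
        from google.cloud import bigquery
        rows = run_query(bigquery.Client())
    """
    return list(client.query(QUERY).result())
```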

To know more about Google BigQuery, visit this link.

Introduction to Amazon S3

Amazon S3 Logo

Amazon Simple Storage Service, generally referred to as Amazon S3, is a storage solution characterized by top-of-the-line scalability, performance, security, and data availability. Enterprises may vary in the scale of their operations and in industry sector, yet all of them can use Amazon S3 to store and protect their data.

The use cases for data storage via Amazon S3 cover a wide array of dimensions, including but not limited to IoT frameworks, mobile applications, enterprise applications, Big Data analytics, Data Lakes, backup and restore archives, and website hosting. It gives management a great deal of control for organizing, optimizing, and configuring access to the stored data to meet compliance, business, and organizational requirements.

A holistic overview of how Amazon S3 works is as follows:

Working of Amazon S3

Key Features of Amazon S3

Some of the key features of Amazon S3 include:

  • Storage Classes: Based upon access frequency, Amazon S3 offers a variety of storage classes optimized for different use cases. For example, S3 Standard can be used for mission-critical enterprise information that requires frequent access, while S3 Glacier Instant Retrieval helps end-users archive data economically; similar archival options are available via S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive. 
  • Storage Management: Amazon S3 offers a host of storage management features that can be leveraged to optimize cost, meet regulatory requirements, reduce latency, and save multiple copies of data sets for compliance. These features include S3 Lifecycle, S3 Object Lock, S3 Replication, and S3 Batch Operations. 
  • Data Processing: Amazon S3 not only allows for storage but also assists in data processing, transforming data and triggering workflows so that processes can be automated at scale. Features that assist here include S3 Object Lambda and Event Notifications. 
  • Storage Logging and Monitoring: Many automated and manual logging and monitoring features accompany data storage in Amazon S3, enabling efficient monitoring and control of data resources. These include Amazon CloudWatch Metrics for Amazon S3, AWS CloudTrail, Server Access Logging, and AWS Trusted Advisor. 
  • Analytics and Insights: Visibility into stored data is available in Amazon S3 via a host of tools, including Amazon S3 Storage Lens, Storage Class Analysis, and S3 Inventory with Inventory Reports. These features not only give end-users a better understanding of their data sets but also help optimize storage space and requirements. 

To know more about Amazon S3, visit this link.

Simplify ETL using Hevo’s No-code Data Pipeline

Hevo Data helps you directly transfer data from 150+ data sources (including 40+ free sources) to Google BigQuery, Data Warehouses, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Hevo takes care of all the data preprocessing needed to set up the integration, letting you focus on key business activities and draw much more powerful insights on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.

Get Started with Hevo for Free

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Steps to Set Up BigQuery to S3 Integration

Now that you have a basic grasp of both tools, let's walk through the steps to set up BigQuery to S3 Integration. Extracting data from BigQuery to S3 is a very convenient process with AWS Glue. A pictorial representation is given below:

BigQuery to S3 Integration

Below are the steps you can follow to connect BigQuery to S3:

Step 1: Configuring Google Account

The first step in setting up BigQuery to S3 Integration is to configure the Google account for the transfer. You can follow the below-mentioned procedure:

  • Download the Service Account credentials as a JSON file from the Google Cloud console.
  • On the Secrets Manager console, choose Store a new secret.
  • For Secret type, select Other type of secrets as shown below.
Storing a New Secret
  • Next, enter credentials as the key and the base64-encoded string of the service account JSON file as the value.
  • Leave the rest of the options at their default.
  • Click on Next.
  • Name the secret bigquery_credentials.
  • Follow through the rest of the steps to store the secret.
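The console steps above can also be scripted. Below is a hedged sketch, assuming boto3 is installed and AWS credentials are configured; `encode_credentials` and `store_secret` are illustrative helper names, not an official API.

```python
import base64
import json

def encode_credentials(json_path):
    """Read the downloaded service-account JSON file and return the
    base64-encoded string to use as the value of the credentials key."""
    with open(json_path, "rb") as f:
        raw = f.read()
    json.loads(raw)  # sanity check: the file must be valid JSON
    return base64.b64encode(raw).decode("ascii")

def store_secret(encoded, name="bigquery_credentials"):
    """Store the encoded credentials in AWS Secrets Manager under the
    key `credentials`, mirroring the console steps above."""
    import boto3  # imported here so the encoding helper stays dependency-free
    client = boto3.client("secretsmanager")
    return client.create_secret(
        Name=name,
        SecretString=json.dumps({"credentials": encoded}),
    )
```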

Step 2: Creating an AWS Glue IAM Role

After you have successfully configured your Google account to set up BigQuery to S3 Integration, you can follow the below-mentioned procedures to create an AWS Glue IAM Role:

  • Create an IAM role for the AWS Glue job and attach the AWS managed policies AmazonEC2ContainerRegistryReadOnly and AWSGlueServiceRole to grant it the requisite permissions. 
  • Also attach a policy granting access to the secret and the S3 bucket; a sample policy is shown below:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetDescribeSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager::<<account_id>>:secret:<<your_secret_id>>"
        },
        {
            "Sid": "S3Policy",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:GetBucketAcl",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<<your_s3_bucket>>",
                "arn:aws:s3:::<<your_s3_bucket>>/*"
            ]
        }
    ]
}
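In addition to the permissions policy above, the role needs a trust policy that lets the AWS Glue service assume it. A hedged sketch using boto3 follows; the role name is an illustrative assumption, and AWS credentials with IAM permissions are assumed to be configured.

```python
import json

# Trust policy allowing the AWS Glue service to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

def create_glue_role(role_name="GlueBigQueryToS3Role"):
    """Create the IAM role and attach the two AWS managed policies
    mentioned above. The role name here is only an example."""
    import boto3
    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    for arn in (
        "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    ):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
```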

Step 3: Subscribing to Glue for BigQuery Connector

To subscribe to the connector, you can complete the following steps:

Subscribing AWS Glue Connector
  • On the AWS Marketplace listing for the connector, review the terms and conditions, pricing, and other details.
  • Click on Continue to Configuration.
  • For Delivery Method, choose your delivery method.
  • For Software Version, choose your software version.
  • Next, click on Continue to Launch as shown below.
Launch AWS Glue Connector
  • Under Usage instructions, select Activate the Glue connector in AWS Glue Studio as shown below.
Activating the Glue Connector
  • For Name, enter a name for your connection (for example, bigquery) as shown below.
Connection Properties
  • Optionally, choose a VPC (Virtual Private Cloud), subnet, and security group.
  • For AWS Secret, choose bigquery_credentials.
  • Next, click on Create connection.

Step 4: Building the ETL Jobs for BigQuery to S3

Now, you can build the ETL jobs for BigQuery to S3 integration using the below-mentioned procedures:

  • On the AWS Glue Studio console, choose Jobs.
  • For Source, choose BigQuery.
  • For Target, choose S3.
  • Next, click on Create as shown below.
Setting Up Source and Target in AWS Glue
  • Choose ApplyMapping and delete it.
  • Choose BigQuery.
  • For Connection, choose bigquery.
  • Expand Connection options.
  • Click on the Add new option button as shown below.
Adding New Option in AWS Glue Studio
  • Add the following key/value pairs:
    • Key: parentProject, Value: <<google_project_id>>
    • Key: table, Value: bigquery-public-data.covid19_open_data.covid19_open_data
Connection Options
  • Choose S3 bucket.
  • Choose format and Compression Type.
  • Specify S3 Target Location.
Specifying S3 Target Location
  • Choose Job details.
  • For Name, enter BigQuery_S3.
  • For IAM Role, choose the role you created.
  • For Type, choose Spark.
  • For the Glue version, choose Glue 2.0 – Supports Spark 2.4, Scala 2, Python3.
  • Leave the rest of the options as defaults.
  • Click on Save.
  • To run the job, click the Run button as shown below.
BigQuery S3 Run Details
  • Once the job run succeeds, check the S3 bucket for data.
Checking S3 Bucket

With this, you have successfully set up your BigQuery to S3 Integration. It’s as simple as that.
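Once the job is saved, it can also be triggered outside the console. A hedged boto3 sketch is shown below; AWS credentials are assumed to be configured, and the connection options simply mirror the key/value pairs entered in Glue Studio, where <<google_project_id>> is a placeholder for your own GCP project ID.

```python
# Connection options matching the key/value pairs entered in Glue Studio.
# "<<google_project_id>>" is a placeholder for your own GCP project ID.
CONNECTION_OPTIONS = {
    "parentProject": "<<google_project_id>>",
    "table": "bigquery-public-data.covid19_open_data.covid19_open_data",
}

def start_bigquery_to_s3_job(job_name="BigQuery_S3"):
    """Trigger a run of the Glue job created above and return its run ID.
    Requires boto3 and configured AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name)
    return response["JobRunId"]
```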

Querying Data after Setting Up BigQuery to S3 Integration

Once the integration of BigQuery to S3 is successful, AWS Glue Crawlers can be used to catalog the data in the S3 bucket; a covid table is created, and the data can then be queried accordingly. A sample demonstration is shown below:

Querying Data after BigQuery to S3 Integration
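Once a crawler has cataloged the bucket, the resulting table can be queried with Amazon Athena. A hedged boto3 sketch follows; the covid table name comes from the step above, while the database name and results location are illustrative assumptions.

```python
# SQL for Athena over the crawled covid table.
ATHENA_QUERY = (
    "SELECT country_name, COUNT(*) AS row_count "
    "FROM covid GROUP BY country_name LIMIT 10"
)

def query_covid_table(database="default",
                      output="s3://<<your_s3_bucket>>/athena-results/"):
    """Submit ATHENA_QUERY to Amazon Athena and return the execution ID.
    Query results land in the given S3 output location."""
    import boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=ATHENA_QUERY,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    return response["QueryExecutionId"]
```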

Conclusion

One can see that the integration of BigQuery to S3 complements the processing power of the former, with the latter serving as durable, scalable storage. Given enterprises' increasing dependence on ICT and the evolving cybersecurity threat landscape, this integration adds considerable value to business operations.

BigQuery to S3 Integration thus complements the process of information storage and extraction, enabling users to derive key insights that result in efficient and effective managerial decisions.

Visit our Website to Explore Hevo

Businesses can use automated platforms like Hevo Data to set the integration and handle the ETL process. It helps you directly transfer data from 150+ data sources (including 40+ free sources) to a destination of your choice in a fully automated and secure manner without having to write any code and will provide you with a hassle-free experience.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of setting up BigQuery to S3 Integration in the comments section below!
