Over time, data has become one of an organization's most valuable assets and a central input to decision-making. Today, most organizations' data is scattered across multiple public clouds, and they are constantly on the lookout for solutions that make it simple to ingest data from multiple sources while also allowing them to tailor the ingestion process to their specific needs.
This article will guide you through the process of setting up BigQuery to S3 Integration using 4 simple steps. It provides a brief overview of Google BigQuery and Amazon S3 along with their key features, and shows how to query the data once the integration is in place. Let's get started.
Note: Currently, Hevo Data doesn’t support S3 as a destination.
Prerequisites
Connecting BigQuery to S3 will be much easier if you have the following prerequisites in place:
- An active Google Cloud Platform account.
- An active Amazon S3 account.
- You'll need an AWS IAM (Identity and Access Management) user with an access key and a secret key to configure the AWS CLI (Command Line Interface); a quick way to verify this setup is sketched after this list.
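If you'd like to confirm that your AWS credentials are configured correctly, here is a minimal check using boto3 (assuming Python and the boto3 package are installed; this snippet is illustrative, not part of the official setup):

import boto3

# Uses the same credential chain as the AWS CLI (environment variables,
# ~/.aws/credentials, etc.) and prints the calling identity.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])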
Introduction to Google BigQuery
Today, most organizations are looking for ways to extract business insights from data, but ingesting, storing, and analyzing that data becomes challenging as its volume and scope grow. Google's enterprise Data Warehouse, BigQuery, has been designed to make large-scale data analysis accessible to everyone. Once recorded data grows to gigabytes, terabytes, or even petabytes, an enterprise needs a more efficient system such as a Data Warehouse.
Google BigQuery can handle massive amounts of data at the SQL (Structured Query Language) level, e.g., logs from retail chain outlets or IoT (Internet of Things) data from millions of utility meters, telecom connections, and vehicles across the globe. BigQuery is a Platform as a Service (PaaS) offering: a fully managed, serverless Data Warehouse that lets organizations focus on analytics instead of managing infrastructure.
Key Features of Google BigQuery
Some of the key features of Google BigQuery include:
- Real-Time Analytics: Google BigQuery offers real-time analytics built on its high-speed streaming insert API (Application Programming Interface). You simply stream in the data and Google BigQuery can analyze it almost instantly.
- Fully Managed Platform: As a fully managed service, BigQuery requires no setup, installation, or dedicated Database Administrator. You can simply log in to your Google Cloud project from a browser and get started.
- Easy Querying and Data Analysis: Data in Google BigQuery is stored in structured tables, so you can use standard SQL for easy querying and data analysis (see the sketch after this list). It is well suited for Big Data because Google BigQuery manages all storage and scaling operations automatically for the client.
- Data Transfer Service: Google BigQuery also supports the Data Transfer Service, which lets users load data into BigQuery on a scheduled basis from multiple sources, e.g., Google Marketing Platform, Google Ads, YouTube, partner SaaS applications, Teradata, and Amazon S3.
- Automatic Backup and Restore: Google BigQuery offers automatic backup and restore options. It keeps a 7-day history of changes, so you can compare a table against previous versions and restore earlier data if necessary.
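To illustrate the querying and streaming features above, here is a minimal sketch using the google-cloud-bigquery Python client; the project ID, dataset, and table names are placeholders you would replace with your own:

from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key.
client = bigquery.Client(project="your-project-id")  # placeholder project

# Standard SQL query against a public dataset.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)

# Streaming insert into an existing table (hypothetical dataset and table).
errors = client.insert_rows_json(
    "your-project-id.your_dataset.events",
    [{"event": "page_view", "count": 1}],
)
print(errors or "rows streamed successfully")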
To know more about Google BigQuery, visit this link.
Introduction to Amazon S3
Amazon Simple Storage Service, generally referred to as Amazon S3, is a storage solution characterized by top-of-the-line scalability, performance, security, and data availability in the evolving realm of Information and Communication Technology (ICT). Enterprises of any scale and industry sector can use Amazon S3 to store and protect their data.
Amazon S3 covers a wide array of data storage use cases, including IoT frameworks, mobile applications, enterprise applications, Big Data analytics, Data Lakes, backup and restore archives, and website hosting. It gives management a great deal of control for organizing, optimizing, and configuring access to stored data to meet compliance, business, and organizational requirements.
Key Features of Amazon S3
Some of the key features of Amazon S3 include:
- Storage Classes: Based on how frequently data is accessed, Amazon S3 offers a variety of storage classes optimized for different use cases (see the sketch after this list). For example, S3 Standard suits mission-critical enterprise information that requires frequent access, while S3 Glacier Instant Retrieval lets end-users archive data economically; similar capabilities are available via S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive.
- Storage Management: Amazon S3 offers a host of storage management features, including S3 Lifecycle, S3 Object Lock, S3 Replication, and S3 Batch Operations, that can be leveraged to optimize costs, meet regulatory requirements, reduce latency, and keep multiple copies of datasets for compliance.
- Data Processing: Amazon S3 not only stores data but also assists with data processing: features such as S3 Object Lambda and Event Notifications transform data and trigger workflows, enabling process automation at scale.
- Storage Logging and Monitoring: Data storage in Amazon S3 comes with many automated and manual logging and monitoring features, such as Amazon CloudWatch metrics for Amazon S3, AWS CloudTrail, server access logging, and AWS Trusted Advisor, that enable efficient monitoring and control of storage resources.
- Analytics and Insights: Amazon S3 provides visibility into stored data via a host of tools, including Amazon S3 Storage Lens, Storage Class Analysis, and S3 Inventory with inventory reports. These tools help end-users understand their datasets better and optimize storage usage.
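As a small illustration of storage classes and lifecycle management, the following boto3 sketch uploads an object directly into S3 Glacier Instant Retrieval and adds a lifecycle rule; the bucket name and prefix are placeholders:

import boto3

s3 = boto3.client("s3")
bucket = "your-s3-bucket"  # placeholder bucket name

# Upload an object directly into an archival storage class.
s3.put_object(
    Bucket=bucket,
    Key="archive/report.csv",
    Body=b"col1,col2\n1,2\n",
    StorageClass="GLACIER_IR",  # S3 Glacier Instant Retrieval
)

# Lifecycle rule: move objects under archive/ to Deep Archive after 180 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "archive/"},
            "Transitions": [{"Days": 180, "StorageClass": "DEEP_ARCHIVE"}],
        }]
    },
)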
To know more about Amazon S3, visit this link.
Hevo Data helps you directly transfer data from 150+ data sources (including 40+ free sources) to Google BigQuery, Data Warehouses, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
Hevo takes care of all your data preprocessing needs required to set up the integration, letting you focus on key business activities and draw much more powerful insights on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.
Get Started with Hevo for Free
Steps to Set Up BigQuery to S3 Integration
Now that you have a basic grasp of both tools, let's walk through the steps to set up BigQuery to S3 Integration. With the AWS Glue tool, extracting data from BigQuery to S3 is a very convenient process.
Below are the steps you can follow to connect BigQuery to S3:
Step 1: Configuring Google Account
The first step in setting up BigQuery to S3 Integration is to configure your Google account for the transfer. You can follow the below-mentioned procedure (a boto3 sketch of the same steps follows this list):
- Download the service account credentials as a JSON file from Google Cloud.
- On the Secrets Manager console, choose Store a new secret.
- For Secret type, select Other type of secrets.
- Next, enter credentials as the key and the base64-encoded contents of the JSON file as the value.
- Leave the rest of the options at their defaults.
- Click on Next.
- Name the secret bigquery_credentials.
- Follow through the rest of the steps to store the secret.
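The console steps above can also be scripted. Here is a minimal boto3 sketch that base64-encodes the downloaded JSON file and stores it as the bigquery_credentials secret; the local filename is a placeholder:

import base64
import json

import boto3

# Base64-encode the downloaded service account JSON file.
with open("service-account.json", "rb") as f:  # placeholder filename
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Store the encoded string under the key "credentials", mirroring the
# console steps above.
secrets = boto3.client("secretsmanager")
secrets.create_secret(
    Name="bigquery_credentials",
    SecretString=json.dumps({"credentials": encoded}),
)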
Step 2: Creating an AWS Glue IAM Role
After you have successfully configured your Google account to set up BigQuery to S3 Integration, you can follow the below-mentioned procedures to create an AWS Glue IAM Role:
- Create an IAM role for the AWS Glue job and attach the relevant AWS managed policies, i.e., AmazonEC2ContainerRegistryReadOnly and AWSGlueServiceRole, to grant it the requisite permissions.
- A sample policy granting access to the secret and the S3 bucket is shown below (a boto3 sketch for creating the role follows it):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GetDescribeSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": "arn:aws:secretsmanager:<<region>>:<<account_id>>:secret:<<your_secret_id>>"
    },
    {
      "Sid": "S3Policy",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketAcl",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<<your_s3_bucket>>",
        "arn:aws:s3:::<<your_s3_bucket>>/*"
      ]
    }
  ]
}
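For reference, here is a hedged boto3 sketch of creating the role and attaching the policies; the role and inline policy names are hypothetical, and the sample policy above is assumed to be saved locally as policy.json:

import json

import boto3

iam = boto3.client("iam")
role_name = "GlueBigQueryToS3Role"  # hypothetical role name

# Trust policy that lets AWS Glue assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed policies mentioned above.
for arn in [
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
]:
    iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)

# Add the sample inline policy shown above.
with open("policy.json") as f:
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="BigQuerySecretAndS3Access",  # hypothetical policy name
        PolicyDocument=f.read(),
    )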
Step 3: Subscribing to Glue for BigQuery Connector
To subscribe to the connector in AWS Marketplace, you can complete the following steps:
- Review the terms and conditions, pricing, and other details.
- Click on Continue to Configuration.
- For Delivery Method, choose your delivery method.
- For Software Version, choose your software version.
- Next, click on Continue to Launch.
- Under Usage instructions, select Activate the Glue connector in AWS Glue Studio.
- For Name, enter a name for your connection (for example, bigquery).
- Optionally, choose a VPC (Virtual Private Cloud), subnet, and security group.
- For AWS Secret, choose bigquery_credentials.
- Next, click on Create connection.
Step 4: Building the ETL Jobs for BigQuery to S3
Now, you can build the ETL job for BigQuery to S3 integration using the below-mentioned procedure (a sketch of the kind of script Glue Studio generates for this job follows the steps):
- On Glue Studio, choose Jobs.
- For Source, choose BigQuery.
- For Target, choose S3.
- Next, click on Create.
- Choose the ApplyMapping node and delete it.
- Choose the BigQuery node.
- For Connection, choose bigquery.
- Expand Connection options.
- Click on the Add new option button.
- Add the following key/value pairs:
- Key: parentProject, Value: <<google_project_id>>
- Key: table, Value: bigquery-public-data.covid19_open_data.covid19_open_data
- Choose the S3 bucket node.
- Choose the Format and Compression Type.
- Specify the S3 Target Location.
- Choose Job details.
- For Name, enter BigQuery_S3.
- For IAM Role, choose the role you created.
- For Type, choose Spark.
- For the Glue version, choose Glue 2.0 – Supports Spark 2.4, Scala 2, Python3.
- Leave the rest of the options as defaults.
- Click on Save.
- To run the job, click the Run button.
- Once the job run succeeds, check the S3 bucket for data.
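For reference, here is a trimmed sketch of the kind of PySpark script Glue Studio generates for this job. The connection name, project ID, and S3 path are the values chosen above; the exact connection options can vary with the connector version, so treat this as illustrative:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from BigQuery through the marketplace connector configured above.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "parentProject": "<<google_project_id>>",
        "table": "bigquery-public-data.covid19_open_data.covid19_open_data",
        "connectionName": "bigquery",
    },
    transformation_ctx="source",
)

# Write the result to the S3 target location as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://<<your_s3_bucket>>/bigquery/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()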
With this, you have successfully set up your BigQuery to S3 Integration. It’s as simple as that.
Querying Data after Setting Up BigQuery to S3 Integration
Once the BigQuery to S3 integration succeeds, you can run an AWS Glue Crawler over the data in the S3 bucket; the crawler creates a covid table, and the data can then be queried accordingly, for example with Amazon Athena (a boto3 sketch follows).
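Assuming the crawler registered the covid table in a Glue database, a query could be kicked off from Python via Amazon Athena as follows; the database name and output location are placeholders:

import boto3

athena = boto3.client("athena")

# Query the covid table created by the crawler (database name is assumed).
response = athena.start_query_execution(
    QueryString=(
        "SELECT country_name, COUNT(*) AS record_count "
        "FROM covid GROUP BY country_name LIMIT 10"
    ),
    QueryExecutionContext={"Database": "bigquery_s3_db"},  # hypothetical
    ResultConfiguration={
        "OutputLocation": "s3://<<your_s3_bucket>>/athena-results/"
    },
)
print(response["QueryExecutionId"])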
Conclusion
Integrating BigQuery with S3 combines the processing power of the former with the durable, low-cost storage of the latter. Given the increased dependence of enterprises on ICT, and with security threats constantly evolving, this integration adds real value to business operations.
In short, BigQuery to S3 Integration streamlines information storage and extraction, enabling users to derive key insights that lead to efficient and effective managerial decisions.
Visit our Website to Explore Hevo
Businesses can use automated platforms like Hevo Data to set the integration and handle the ETL process. It helps you directly transfer data from 150+ data sources (including 40+ free sources) to a destination of your choice in a fully automated and secure manner without having to write any code and will provide you with a hassle-free experience.
FAQ
How do I connect Google BigQuery to AWS?
To connect Google BigQuery to AWS, you can use transfer mechanisms like Google Cloud's Storage Transfer Service, AWS Glue (as shown in this article), or third-party tools such as Fivetran or Hevo to move data between BigQuery and AWS services like S3.
What is GCP equivalent to S3?
The GCP equivalent to AWS S3 is Google Cloud Storage, which offers scalable and secure object storage similar to S3.
What is the AWS equivalent of BigQuery?
The AWS equivalent of Google BigQuery is Amazon Redshift, which is a cloud-based data warehouse designed for large-scale data analytics.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Share your experience of setting up BigQuery to S3 Integration in the comments section below!
Syeda is a technical content writer with a profound passion for data. She specializes in crafting insightful content on a broad spectrum of subjects, including data analytics, machine learning, artificial intelligence, big data, and business intelligence. Through her work, Syeda aims to simplify complex concepts and trends for data practitioners, making them accessible and engaging for data professionals.