Are you currently running on Amazon S3, but your business requires you to stream your data into a Data Warehouse and gain insights in real time? Or are you having adoption issues that force you to hire a lot of Database Engineers before you can run efficient queries? Or do you feel you are paying too much for the queries you’re executing?
Whatever issue you’re experiencing, a data pipeline that migrates your data seamlessly from Amazon S3 to BigQuery will help put it to bed. This article introduces you to Amazon S3 and Google BigQuery, and then explains in detail 2 different methods that you can choose from for moving data from Amazon S3 to BigQuery.
Introduction to Amazon S3
Amazon S3 is a cloud-based storage service developed by Amazon Web Services (AWS). The name S3 stands for Simple Storage Service. It offers a web-based interface that allows you to store, manage, and retrieve your data in real time. It runs on the same network on which Amazon runs its global conglomerate of websites. One of the major appeals of Amazon S3 is that it gives developers access to a scalable, fast, and secure storage infrastructure built on AWS.
For more information on Amazon S3, click here.
Introduction to Google BigQuery
Google BigQuery is a serverless and highly scalable Data Warehouse offered on Google Cloud Platform (GCP), Google’s Cloud Computing platform. BigQuery allows you to process terabytes of data at breakneck speed. It runs a fully serverless system that frees you from managing any form of infrastructure, so you are at liberty to focus mainly on analytics. BigQuery also lets you scale up and down without managing capacity yourself, which is possible due to its parallel compute engine. Your data is queried using the familiar SQL (Structured Query Language), which enables you to gain novel and in-depth insights from your data.
For more information on Google BigQuery, click here.
Methods to Move Data from Amazon S3 to BigQuery
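There are 2 main methods that you can use to move data from Amazon S3 to BigQuery:
- Method 1: Move Data from Amazon S3 to BigQuery using Custom ETL Scripts
- Method 2: Move Data from Amazon S3 to BigQuery using Hevo Data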
Let’s walk through these 2 methods one by one.
Method 1: Move Data from Amazon S3 to BigQuery using Custom ETL Scripts
Moving data from Amazon S3 to BigQuery with custom scripts demands technical proficiency, as it involves manual integration. There are 5 steps to move data from Amazon S3 to BigQuery, described one by one below.
Step 1: Getting Authenticated to Interact with the AWS Bucket that Contains Your Data
A bucket is an AWS resource that lets you store and manage your data, and AWS provides APIs to manage this resource. The name of your AWS bucket must be globally unique, so whatever name you decide on must be exclusive to you: you cannot reuse a name that is already taken by another bucket. It is a best practice to create the bucket in a region that is geographically close to you.
The user who creates a bucket has full administrative access to the bucket. AWS offers a service called IAM (Identity and Access Management). It allows the user with administrative access to securely control access to the AWS bucket and resources. With IAM, you decide who is allowed to access and migrate your data. Whoever wants to migrate your data needs to be granted access by the administrative user.
The following permissions need to be granted to the user:
- s3:ListBucket: This allows the user to list objects in the bucket.
- s3:GetObject: This allows the user to read objects in the bucket.
- s3:DeleteObject: This allows the user to delete objects in the bucket.
- s3:GetBucketLocation: This allows the user to access the location of the bucket.
To give a user these permissions, you need to create a custom bucket policy. A bucket policy is a resource-based policy that grants other AWS accounts or IAM users permissions to a bucket and the objects contained in that bucket. The steps to create a custom bucket policy are listed below:
- Log in to your AWS console.
- From the home page, go to the list of bucket names and select the bucket for which you will be creating a policy.
- Select Permissions and choose Bucket Policy.
- In the Bucket Policy text editor, you can edit the permissions. An illustration is given below:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowTransferUser",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::Account-ID:user/Sam"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::mynewbucket",
        "arn:aws:s3:::mynewbucket/*"
      ]
    }
  ]
}
In the code snippet above, we granted the permissions listed earlier to a user named Sam.
- Click on the policy generator icon and select Save.
With these steps, we have created a bucket policy that allows the user to migrate data to BigQuery.
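The same policy can also be built and attached from a script. The following is a minimal sketch, assuming Python with the boto3 library installed and AWS credentials already configured. The function names and the sample account ID are hypothetical and used only for illustration.

```python
import json


def build_transfer_bucket_policy(account_id: str, user_name: str, bucket: str) -> dict:
    """Return a bucket policy granting the four permissions listed in Step 1."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTransferUser",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:user/{user_name}"},
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:DeleteObject",
                    "s3:GetBucketLocation",
                ],
                # Bucket-level actions match the bucket ARN itself;
                # object-level actions match the "/*" ARN.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


def apply_bucket_policy(bucket: str, policy: dict) -> None:
    """Attach the policy to the bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))


if __name__ == "__main__":
    # "123456789012" is a placeholder account ID, not a real account.
    example = build_transfer_bucket_policy("123456789012", "Sam", "mynewbucket")
    print(json.dumps(example, indent=2))
```

Keeping the policy builder separate from the AWS call makes it possible to review or version the generated JSON before anything is applied to the bucket.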
Step 2: Get Access Keys that will Enable you to Ingest into GCS
After getting authenticated, you will need to get the access keys that enable you to ingest your data into GCS. The steps to achieve this are given below:
- To generate access keys for your user account, the user needs an IAM policy that allows creating its own access keys. An illustration is given below:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CreateOwnAccessKeys",
      "Effect": "Allow",
      "Action": [
        "iam:CreateAccessKey",
        "iam:GetUser",
        "iam:ListAccessKeys"
      ],
      "Resource": "arn:aws:iam::*:user/Sam"
    }
  ]
}
- After saving the policy, go to your IAM home page.
- In the upper right corner, you will find a navigation bar. Here, choose your user name and select My Security Credentials as shown below.
- You will be taken to the AWS IAM credentials tab, where you will see the access keys section for API access. Select Create Access Key.
- Then choose the Download .csv option to download your access keys to your computer. These are the credentials that will enable you to ingest your data into GCS.
Make sure you save these access keys in a safe place. The secret access key cannot be retrieved again after it is created, so if you misplace it you will have to create a new access key and deactivate the old one.
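If you create the key from a script instead of the console, the point about safe storage still applies. Below is a minimal sketch, assuming Python with boto3 installed and a user that holds the permissions from the policy above. The helper names are hypothetical, and the file layout is only one possible choice.

```python
import csv
import os


def save_access_key(access_key: dict, path: str) -> None:
    """Write an access key pair to a CSV file that only the owner can read."""
    # O_EXCL refuses to overwrite an existing file; 0o600 means owner read/write only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Access key ID", "Secret access key"])
        writer.writerow([access_key["AccessKeyId"], access_key["SecretAccessKey"]])


def create_and_save_key(user_name: str, path: str) -> None:
    """Create a new access key for an IAM user and store it right away."""
    import boto3  # imported lazily so save_access_key works without boto3

    response = boto3.client("iam").create_access_key(UserName=user_name)
    # The secret is only returned by this call, so it is written out immediately.
    save_access_key(response["AccessKey"], path)
```

For anything beyond a one-off migration, a secrets manager is usually a better home for these credentials than a local file.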
Step 3: Ingest Data into GCS (Google Cloud Storage) Using the AWS S3 Option
After gaining all the needed access, the next step is to ingest your data into GCS. GCS provides a transfer option to migrate your data from AWS S3. The steps to do this are outlined below:
- Log in to your GCP account. From the hamburger menu, select Storage, and then select Browser on the left tab.
- Create a new bucket. The name given to your new bucket must be globally unique.
- After creating a new bucket, on the left side of the storage page select the Transfer icon and click on Create Transfer.
- The window will take you to a page where you have to specify your source. Select Amazon S3.
- Fill in the form with the access keys you obtained in Step 2. Your data will then be transferred into the GCS bucket. The form also provides an option to choose how frequently you want to ingest the data, for example daily or weekly. From this populated bucket, we will ingest our data into BigQuery.
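The console steps above can also be scripted. This is a minimal sketch, assuming Python with the google-cloud-storage-transfer client library installed and GCP credentials configured. The function names, the project ID, and the description text are hypothetical, and the scheduling behaviour noted in the comments should be checked against the Storage Transfer Service documentation.

```python
import datetime


def build_s3_to_gcs_transfer_job(project_id, s3_bucket, gcs_bucket,
                                 access_key_id, secret_access_key,
                                 start=None, daily=False) -> dict:
    """Describe a transfer job that copies an S3 bucket into a GCS bucket."""
    start = start or datetime.date.today()
    start_date = {"year": start.year, "month": start.month, "day": start.day}
    schedule = {"schedule_start_date": start_date}
    if not daily:
        # Same start and end date: the job is scheduled to run once.
        schedule["schedule_end_date"] = start_date
    # With no end date the job is expected to keep repeating (assumption:
    # the default repeat interval is one day).
    return {
        "project_id": project_id,
        "description": f"Copy s3://{s3_bucket} to gs://{gcs_bucket}",
        "schedule": schedule,
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": s3_bucket,
                "aws_access_key": {
                    "access_key_id": access_key_id,
                    "secret_access_key": secret_access_key,
                },
            },
            "gcs_data_sink": {"bucket_name": gcs_bucket},
        },
    }


def submit_transfer_job(job: dict):
    """Create the job with the Storage Transfer Service client."""
    from google.cloud import storage_transfer  # lazy import

    client = storage_transfer.StorageTransferServiceClient()
    job = {**job, "status": storage_transfer.TransferJob.Status.ENABLED}
    request = storage_transfer.CreateTransferJobRequest({"transfer_job": job})
    return client.create_transfer_job(request)
```

Passing `daily=True` corresponds to choosing a recurring schedule in the form, while the default mirrors a one-off transfer.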
Step 4: Load the Ingested Data in BigQuery from GCS
After transferring your data into GCS, the next step is to migrate the data into BigQuery for analysis. The steps to achieve this are given below:
- Go to the GCP (Google Cloud Platform) home page, and from the hamburger menu on the left-hand side of the page, select BigQuery.
- Create a new dataset and fill in all the requirements requested.
- Create a new table and, as its source, select the GCS bucket you just populated with data from S3.
- Choose your desired name for the destination table and fill it in.
- Select auto-detect schema in the select schema section.
- Ensure your file format is set to CSV.
- Click on Create table to finalize the ingestion process.
You can also make use of the bq command-line tool to ingest your data into BigQuery. In the example below, you specify your own schema file:
bq --location=EU load --source_format=CSV your_dataset.your_table \
    gs://my-amazon-bucket/amazon_data.csv ./your_schema.json
The auto-detect feature can also be used to infer your schema. An example is shown below:
bq --location=EU load --autodetect --source_format=CSV your_dataset.your_table \
    gs://my-amazon-bucket/amazon_data.csv
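The same load can be triggered from code. Here is a minimal sketch, assuming Python with the google-cloud-bigquery client library installed and a default project configured. The function names are hypothetical, and the sketch assumes the CSV file has a header row.

```python
def gcs_uri(bucket: str, object_name: str) -> str:
    """Build the gs:// URI of an object in a GCS bucket."""
    return f"gs://{bucket}/{object_name.lstrip('/')}"


def load_csv_from_gcs(uri: str, table_id: str, location: str = "EU") -> int:
    """Load a CSV file from GCS into a BigQuery table with schema auto-detection."""
    from google.cloud import bigquery  # lazy import

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,  # assumption: the first row holds column names
    )
    load_job = client.load_table_from_uri(
        uri, table_id, location=location, job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows


if __name__ == "__main__":
    source = gcs_uri("my-amazon-bucket", "amazon_data.csv")
    print(load_csv_from_gcs(source, "your_dataset.your_table"))
```

The location passed to the load job has to match the location of the destination dataset.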
Step 5: Update the Final Table in BigQuery
The data ingested from GCS is not always up to date, because GCS only acts as a staging area for BigQuery. For this reason, new data is first loaded into a temporary table, which makes it possible to handle updates and inserts separately. Every time new data arrives in GCS, it is loaded into the temporary table, and the final table in BigQuery is then updated from it. Two ways to update the final table from the temporary table are explained below:
Updating existing rows and inserting new rows from the temporary table
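-- Update rows that already exist in the final table.
UPDATE data_set_name.FinalTable ft
SET value = tt.value
FROM data_set_name.TemporaryTable tt
WHERE ft.id = tt.id;

-- Insert rows that are not yet in the final table.
INSERT data_set_name.FinalTable (id, value)
SELECT id, value
FROM data_set_name.TemporaryTable
WHERE id NOT IN (SELECT id FROM data_set_name.FinalTable);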
This code first updates the rows of the final table that also exist in the temporary table, and then adds the new rows that were missing from the final table.
Deleting the rows that are present in the temporary table from the final table, and then re-inserting them
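-- Remove final-table rows that also appear in the temporary table.
DELETE data_set_name.FinalTable ft
WHERE ft.id IN (SELECT id FROM data_set_name.TemporaryTable);

-- Insert every row of the temporary table into the final table.
INSERT data_set_name.FinalTable (id, value)
SELECT id, value
FROM data_set_name.TemporaryTable;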
This code deletes from the final table all the rows that also exist in the temporary table, and then inserts every row of the temporary table into the final table.
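The two-statement patterns above can also be expressed as a single BigQuery MERGE statement. This is a different technique from the one in the snippets, shown here only as a hedged sketch. It assumes Python with the google-cloud-bigquery client library installed; the function names are hypothetical, and the table and column names are the ones used in the examples.

```python
def build_merge_sql(final_table: str, temp_table: str, key: str, columns: list) -> str:
    """Build a MERGE statement that upserts temp_table rows into final_table."""
    # Table and column names are interpolated directly, so they must come
    # from trusted configuration, never from user input.
    set_clause = ", ".join(f"{c} = tt.{c}" for c in columns if c != key)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"tt.{c}" for c in columns)
    return (
        f"MERGE `{final_table}` AS ft\n"
        f"USING `{temp_table}` AS tt\n"
        f"ON ft.{key} = tt.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


def run_merge(sql: str) -> None:
    """Execute the statement in BigQuery and wait for it to complete."""
    from google.cloud import bigquery  # lazy import

    bigquery.Client().query(sql).result()


if __name__ == "__main__":
    print(build_merge_sql(
        "data_set_name.FinalTable", "data_set_name.TemporaryTable",
        key="id", columns=["id", "value"],
    ))
```

A single MERGE performs the update and the insert atomically, which avoids the window in which the final table has been only partially refreshed.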
Challenges with Moving Data from Amazon S3 to BigQuery using Custom ETL Script
Writing Custom ETL code to move your data from Amazon S3 to BigQuery has a number of drawbacks that affect your data pipeline and leave you in a less than ideal scenario. These limitations are:
- The above approach only works seamlessly if you plan on moving your data as a one-off.
- Custom code requires a lot of engineering resources to ensure that your data stays consistent during the entire ETL process. This can become a hindrance if you have limited engineering bandwidth for the project.
- Custom ETL code doesn’t scale well when your data size is increasing exponentially, as you always need to update your code to keep up.
- More often than not, businesses today need data in real time to power their analytics engines. Building a pipeline for such a use case can be very convoluted and cumbersome.
Writing Custom ETL code looks quite seamless and error-free on the surface, but it is a very troublesome and problematic process. It has the propensity to make you re-evaluate whether your data migration project was truly worth it.
Here is a simpler alternative.
Method 2: Move Data from Amazon S3 to BigQuery using Hevo Data
Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 150+ data sources, including Amazon S3, and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo loads the data onto the desired Data Warehouse such as Google BigQuery, enriches the data, and transforms it into an analysis-ready form without you writing a single line of code.
Its completely automated pipeline delivers data in real time from source to destination without any loss. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.
Get Started with Hevo for free
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of only the data that has been modified, in real time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
For a data pipeline tool, we expect to be able to seamlessly integrate our data sources with custom scheduling, logging, and alerts. Hevo provides all these features and allows us to get access to data when we critically need it.
– Chris Lockhart, Data Science Manager, Scratchpay
Sign up here for a 14-day Free Trial!
Conclusion
This article introduced you to Amazon S3 and Google BigQuery. It also provided 2 easy methods that you can use while moving data from Amazon S3 to BigQuery. The 1st method includes Manual Integration between the 2 platforms while the 2nd method is hassle-free, easy to use, and automated. Hevo Data, a No-code Data Pipeline can help you transfer your data more easily and effectively.
Visit our Website to Explore Hevo
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience of moving data from Amazon S3 to BigQuery in the comments section below!
Bukunmi is curious about learning complex concepts and the latest trends in data science, and combines this with his flair for writing to curate content that helps data teams solve business challenges.