Press "Enter" to skip to content

Amazon S3 to BigQuery ETL – Steps to Move Data

Are you currently running on Amazon S3, but your business needs require you to stream your data into a data warehouse and gain insights in real time? Are you having adoption issues that require you to hire a lot of Database Engineers before you can run efficient queries? Or do you feel you’re paying too much for the queries you’re executing? Whatever the issue you’re experiencing, having a data pipeline that migrates your data seamlessly from Amazon S3 to BigQuery will help put these issues to bed. In this blog post, I will give a detailed explanation of how you can migrate your data from Amazon S3 to Google BigQuery.

Overview of Amazon S3 and Google BigQuery

Amazon S3 (Simple Storage Service) is a cloud-based storage service developed by Amazon Web Services (AWS). It offers a web-based interface that allows you to store, manage and retrieve your data in real time, and it runs on the same infrastructure Amazon uses to run its global network of websites. One of the major appeals of Amazon S3 is that it gives developers access to a scalable, fast and secure storage infrastructure built on AWS.

BigQuery is a serverless, highly scalable data warehouse offered on Google Cloud Platform (GCP), Google’s cloud computing platform. BigQuery allows you to process terabytes of data at breakneck speed. It is fully serverless, abstracting away infrastructure management and giving you the liberty to focus mainly on analytics. BigQuery allows you to scale up and down at no extra cost, thanks to its massively parallel compute engine. Your data is queried using familiar SQL (Structured Query Language), which enables you to gain novel and in-depth insights from your data.

Moving Data from Amazon S3 to BigQuery

There are two ways to move your data from Amazon S3 to Google BigQuery. The two methods are:

  1. Method 1: Using a fully managed third-party Data Pipeline Platform like Hevo (14-day free trial)
  2. Method 2: Hand coding ETL scripts to move data from Amazon S3 to Google BigQuery

In this post, I will be giving a detailed explanation of the second method stated above.

Towards the end of this article, I will be making a comparison of both methods and giving my thoughts on which method you should employ.

Writing Custom ETL Scripts to move data from Amazon S3 to BigQuery

The steps to execute this method are listed below:

  1. Getting Authenticated to be able to interact with the AWS bucket that contains your data
  2. Get access keys that will enable you to ingest into GCS
  3. Ingest data into GCS (Google Cloud Storage) using the AWS S3 option
  4. Load the ingested data into BigQuery from GCS
  5. Update the final table in BigQuery

Step 1: Getting Authenticated to be able to interact with the AWS bucket that contains your data

A bucket is an AWS resource that lets you store and manage your data, and AWS provides APIs to manage this resource. Your bucket name must be globally unique, so whatever name you give your bucket must be exclusive to you; you cannot reuse the name of an existing bucket. It is always advised, and a best practice, to use a region that is geographically close to you.

The user who creates a bucket has full administrative access to it. AWS offers a service called IAM (Identity and Access Management), which allows the user with administrative access to securely control access to the AWS bucket and its resources. With IAM, you decide who can access and migrate your data; whoever wants to migrate it needs to be granted access by the administrative user.

The following permissions need to be granted to the user:

  1. s3:ListBucket: Allows the user to list the objects in the bucket
  2. s3:GetObject: Allows the user to read objects in the bucket
  3. s3:DeleteObject: Allows the user to delete objects in the bucket
  4. s3:GetBucketLocation: Allows the user to access the location of the bucket

To give a user these permissions, you need to create a custom bucket policy. A bucket policy is a resource-based policy that grants other AWS users permissions to a bucket and the objects contained in it. The steps to create a custom bucket policy are listed below:

  1. Login to your AWS console
  2. From the S3 home page, select the name of the bucket for which you will be creating a policy
  3. Select Permissions and choose Bucket Policy
  4. In the Bucket policy text editor, you can edit the permissions; an illustration is given below
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "GrantBucketAccessToSam",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::Account-ID:user/Sam"
          },
          "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:DeleteObject",
            "s3:GetBucketLocation"
          ],
          "Resource": [
            "arn:aws:s3:::mynewbucket",
            "arn:aws:s3:::mynewbucket/*"
          ]
        }
      ]
    }

    In the code snippet above, we granted the permissions listed earlier to a user named Sam

  5. Click Save to apply the bucket policy

With these steps, we have created a bucket policy that allows the user to access the data that will be migrated to BigQuery.
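
If you prefer to attach this policy from code rather than through the console, a minimal sketch using boto3's put_bucket_policy call is shown below. It assumes the caller already has administrative access to the bucket; the bucket name, account ID and user name are the same placeholders used in the policy above.

# A minimal sketch of attaching the same bucket policy with boto3, assuming
# the caller already has administrative access to "mynewbucket".
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "GrantBucketAccessToSam",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::Account-ID:user/Sam"},  # placeholder account ID
        "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:DeleteObject",
            "s3:GetBucketLocation",
        ],
        "Resource": [
            "arn:aws:s3:::mynewbucket",
            "arn:aws:s3:::mynewbucket/*",
        ],
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="mynewbucket", Policy=json.dumps(policy))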

Step 2: Get access keys that will enable you to Ingest into GCS

After getting authenticated, you will need to get the access keys that will enable you to ingest your data into GCS. The steps to achieve this are given below:

  1. To allow your user account to generate its own access keys, you have to attach an IAM policy to the user. An illustration is given below:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateOwnAccessKeys",
                "Effect": "Allow",
                "Action": [
                    "iam:CreateAccessKey",
                    "iam:GetUser",
                    "iam:ListAccessKeys"
                ],
                "Resource": "arn:aws:iam::*:user/sam"
            }
        ]
     }
    
  2. After saving the policy, go to your IAM home page
  3. In the navigation bar in the upper right corner, choose your user name and select My Security Credentials
  4. You will be taken to the AWS IAM credentials tab, where you will see an Access keys for API access section. Select Create access key
  5. Then choose Download .csv to save your access keys locally to your computer. These credentials will enable you to ingest your data into GCS

Make sure you save these access keys in a safe place; the secret access key cannot be retrieved again once the dialog is closed, so if it is misplaced you will have to generate a new key pair.
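
As a quick sanity check, the downloaded keys can be used with boto3 to confirm that the permissions from Step 1 are in place. The sketch below is illustrative; the key values are placeholders to be replaced with the values from the downloaded .csv file, and the bucket name is the placeholder used earlier.

# A minimal sketch to verify the new access keys and bucket permissions,
# assuming boto3 is installed and "mynewbucket" is the bucket from Step 1.
import boto3

# Placeholder credentials: substitute the values from the downloaded .csv file.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

# s3:GetBucketLocation and s3:ListBucket are exercised here, so a failure
# points to a missing permission in the bucket policy from Step 1.
print(s3.get_bucket_location(Bucket="mynewbucket"))
for obj in s3.list_objects_v2(Bucket="mynewbucket").get("Contents", []):
    print(obj["Key"], obj["Size"])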

Step 3: Ingest data into GCS (Google Cloud Storage) using the AWS S3 option

After gaining all the needed access, the next step is to ingest your data into GCS. GCS provides a transfer service to migrate your data from AWS S3. The steps to do this are outlined below:

  1. Log in to your GCP account. From the hamburger menu, select Storage and then Browser on the left tab
  2. Create a new bucket. It is pertinent that the name given to your new bucket is globally unique
  3. After creating the new bucket, on the left side of the Storage page select Transfer and click on Create Transfer
  4. You will be taken to a page where you have to specify your source. Select Amazon S3
  5. Fill the form with the credentials you obtained when setting up your AWS IAM access, and your data will be transferred into the GCS bucket. You can also choose how frequently to ingest the data, whether daily or weekly. From this populated bucket, we will ingest the data into BigQuery. A programmatic sketch of the same transfer is shown after this list
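
If you would rather create the transfer job programmatically than through the console, the Storage Transfer Service also exposes an API. The following is a minimal sketch using the google-api-python-client discovery client; the project ID, bucket names, dates and AWS keys are placeholders, and the exact job fields should be verified against the Storage Transfer Service documentation for your client version.

# A minimal sketch of creating the same S3 -> GCS transfer job through the
# Storage Transfer Service API. Assumes google-api-python-client is installed
# and application-default credentials are configured; all names below are
# placeholders.
from googleapiclient import discovery

client = discovery.build("storagetransfer", "v1")

transfer_job = {
    "description": "One-off transfer from Amazon S3 to GCS",
    "status": "ENABLED",
    "projectId": "your-gcp-project-id",
    "schedule": {
        # Same start and end date => the job runs once.
        "scheduleStartDate": {"year": 2020, "month": 1, "day": 1},
        "scheduleEndDate": {"year": 2020, "month": 1, "day": 1},
    },
    "transferSpec": {
        "awsS3DataSource": {
            "bucketName": "mynewbucket",
            "awsAccessKey": {
                "accessKeyId": "YOUR_ACCESS_KEY_ID",
                "secretAccessKey": "YOUR_SECRET_ACCESS_KEY",
            },
        },
        "gcsDataSink": {"bucketName": "my-gcs-staging-bucket"},
    },
}

result = client.transferJobs().create(body=transfer_job).execute()
print("Created transfer job:", result["name"])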

Step 4: Load the ingested data into BigQuery from GCS

After transferring your data into GCS, the next step is to migrate the data into BigQuery for analysis. The steps to achieve this are given below:

  1. Go to the GCP (Google Cloud Platform) home page, and from the hamburger menu on the left-hand side of the page, select BigQuery
  2. Create a new dataset and fill in all the required fields
  3. Create a new table and select the GCS bucket you just populated with data from S3 as the source
  4. Choose your desired name for the destination table and fill it in
  5. Select Auto detect in the Schema section
  6. Ensure your file format is set to CSV
  7. Click Create table to finalise the ingestion process

You can also make use of the bq command-line tool to ingest your data into BigQuery. An example is shown below:

In this code snippet, you get to specify your own schema.

bq --location=EU load --source_format=CSV your_dataset.your_table \
gs://my-amazon-bucket/amazon_data.csv ./your_schema.json

The auto-detect feature can also be used to infer your schema. An example is shown below.

bq --location=EU load --autodetect --source_format=CSV your_dataset.your_table \
gs://my-amazon-bucket/amazon_data.csv
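
The same load can also be performed from Python using the google-cloud-bigquery client library. The sketch below is illustrative and uses the same placeholder bucket, dataset and table names as the commands above; it assumes the client library is installed and your GCP credentials are configured.

# A minimal sketch of the same GCS -> BigQuery load using the
# google-cloud-bigquery client library (placeholder dataset/table/bucket names).
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # let BigQuery infer the schema
    skip_leading_rows=1,    # assumes the CSV has a header row
)

load_job = client.load_table_from_uri(
    "gs://my-amazon-bucket/amazon_data.csv",
    "your_dataset.your_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
print("Loaded", client.get_table("your_dataset.your_table").num_rows, "rows")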


Step 5: Update the target table in BigQuery

When your data is ingested into GCS, it is not always up to date, as GCS acts as a staging area for BigQuery. Your data is therefore first loaded into a temporary table, which is a way of handling updates and inserts separately. So any time you stream new data into GCS, the temporary table needs to be merged into the final table before the changes take effect in BigQuery. The ways to update the final table in BigQuery are explained below:

Inserting new rows from the temporary table to update the final table

UPDATE data_set_name.FinalTable ft
SET value = tt.value
FROM data_set_name.TemporaryTable tt
WHERE ft.id = tt.id;

INSERT INTO data_set_name.FinalTable (id, value)
SELECT id, value
FROM data_set_name.TemporaryTable
WHERE id NOT IN (SELECT id FROM data_set_name.FinalTable);

This code updates the existing rows in the final table and then inserts the new rows that were missing from it.

Alternatively, all rows that are also present in the temporary table can be deleted from the final table and then re-inserted:

DELETE FROM data_set_name.FinalTable
WHERE id IN (SELECT id FROM data_set_name.TemporaryTable);

INSERT INTO data_set_name.FinalTable (id, value)
SELECT id, value
FROM data_set_name.TemporaryTable;

This code deletes the matching rows from the final table and then updates the final table with the rows from the temporary table.
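
Both patterns above can also be combined into a single BigQuery MERGE statement, which applies the updates and inserts in one atomic operation. Below is a minimal sketch that runs such a MERGE through the Python client; the dataset, table and column names are the same placeholders used in the SQL above.

# A minimal sketch of performing the update + insert as one atomic MERGE,
# run through the google-cloud-bigquery client (placeholder table/column names).
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE data_set_name.FinalTable ft
USING data_set_name.TemporaryTable tt
ON ft.id = tt.id
WHEN MATCHED THEN
  UPDATE SET value = tt.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (tt.id, tt.value)
"""

client.query(merge_sql).result()  # wait for the MERGE to finish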

Limitations and Challenges with writing Custom ETL code to Move Data from Amazon S3 to BigQuery

Writing custom ETL code to move your data has a number of drawbacks that affect your data pipeline and leave you in a less-than-ideal scenario. These limitations are:

  1. The above approach only works seamlessly if you plan on moving your data from AWS S3 to BigQuery as a one-off exercise
  2. Custom code requires a lot of engineering resources to ensure that your data stays consistent during the entire ETL process. This can become a hindrance if engineering bandwidth for the project is limited
  3. Custom ETL code doesn’t scale well when your data size grows rapidly, as you always need to update the code to keep up
  4. More often than not, businesses today need data in real time to power their analytics engines. Building a pipeline for such a use case can be very convoluted and cumbersome

Writing custom ETL code looks quite seamless and error-free on the surface, but it is a troublesome and problematic process. It has the propensity to make you re-evaluate whether your data migration project was truly worth it.

AN EASIER WAY TO ETL DATA FROM AMAZON S3 TO BIGQUERY:

Using a simple-to-use Data Pipeline platform like Hevo can help you move your data from Amazon S3 to BigQuery in no time. Hevo’s real-time streaming architecture ensures that you have accurate, latest data in your warehouse. Forget ETL Scripts, Forget Coding. With Hevo, move your data from S3 to BigQuery in 3 simple steps:

  • Connect to Amazon S3 source by providing connection settings
  • Select the file format (JSON/CSV) and Create Schema Folders
  • Configure BigQuery Warehouse where the data needs to be moved

Hevo helps you move data from Amazon S3 to BigQuery in a simple, effective and consistent manner. Hevo can move data not just from S3, but from many other data sources as well.

Sign up for a 14-Day Free Trial with Hevo for an error-free, efficient data loading from Amazon S3 to BigQuery.
