Amazon Aurora to BigQuery ETL – Steps to Move Data

Tutorial • June 13th, 2019

Businesses today generate huge volumes of data, and that raw data is essential for making important decisions. However, there are a few major challenges in putting it to use. Analyzing data at this scale (petabytes) with a traditional database like MySQL, Oracle, or SQL Server is very difficult. To get tangible insights from this data, you need to move it to a data warehouse like Google BigQuery. This post provides a step-by-step walkthrough on how to migrate data from Amazon Aurora to the BigQuery data warehouse.

What is Amazon Aurora?

Amazon Aurora is a MySQL-compatible relational database service used by businesses. Aurora offers better performance at a more cost-effective price than traditional MySQL. It is primarily used as a transactional or operational database and is specifically not recommended for analytics.

What is Google BigQuery?

BigQuery is a Google-managed, cloud-based data warehouse service. It is designed to store, process, and analyze large volumes of data (petabytes) and make data analysis more accurate. BigQuery is known for delivering quick results at minimal cost with great performance. Since the infrastructure is managed by Google, you as a developer, data analyst, or data scientist can focus on uncovering meaningful insights using native SQL.

Performing ETL from Amazon Aurora to BigQuery

There are two main ways of migrating data from Aurora to BigQuery:

Method 1: Using a hassle-free, easy-to-use Data Pipeline Platform such as Hevo Data (comes with a 14-day free trial)

Method 2: Write custom ETL scripts to move data

In this post, we will cover the second method (Custom Code) in detail. Towards the end of the post, you can also find a quick comparison of both data replication methods so that you can evaluate your requirements and choose wisely.

Aurora to BigQuery ETL Using Custom Code:

The steps involved in migrating data from Aurora DB to BigQuery are as follows:

  1. Getting data out of Amazon Aurora
  2. Preparing Amazon Aurora data
  3. Upload data to Google Cloud Storage
  4. Upload to BigQuery from GCS
  5. Update the target table in BigQuery

Step 1: Getting data out of Amazon Aurora

You can export data from Aurora by writing SQL queries. SELECT queries let you pull exactly the data you want: you can filter rows, control the order of the data, and limit the number of results.

A command-line tool called mysqldump lets you export entire tables and databases in a format you specify (e.g., delimited text, CSV, or SQL statements). Alternatively, the mysql client can run a SELECT in batch mode and convert its tab-separated output to CSV, as shown below:

mysql -u user_name -p --database=db_name --host=rds_hostname --port=rds_port --batch -e "SELECT * FROM table_name" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > file_name
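If you only need a subset of rows (for example, for an incremental extract), the same approach works with a filtered query. The sketch below is hypothetical: the orders table, the updated_at column, and the cutoff date are placeholders you would replace with your own schema.

# Hypothetical filtered export: only rows updated since a cutoff date, ordered by id.
mysql -u user_name -p --database=db_name --host=rds_hostname --port=rds_port --batch -e "SELECT * FROM orders WHERE updated_at >= '2019-06-01' ORDER BY id" | sed 's/\t/","/g;s/^/"/;s/$/"/' > orders_increment.csv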

Step 2: Preparing Amazon Aurora data

You need to make sure the target BigQuery table is perfectly aligned with the source Aurora table, specifically the column order and the data type of each column.
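For example, here is a minimal sketch of a BigQuery schema file for a hypothetical Aurora table with columns id (INT), name (VARCHAR), and created_at (DATETIME); the file name and columns are placeholders, and the entries must appear in the same order as the columns in the exported CSV:

# Hypothetical schema file matching the exported column order and types.
cat > your_schema.json <<'EOF'
[
  {"name": "id",         "type": "INTEGER",  "mode": "REQUIRED"},
  {"name": "name",       "type": "STRING",   "mode": "NULLABLE"},
  {"name": "created_at", "type": "DATETIME", "mode": "NULLABLE"}
]
EOF

This file can later be passed as the [SCHEMA] argument of the bq load command described in Step 4.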

Step 3: Upload data to Google Cloud Storage

Once the data has been extracted from the Aurora database, the next step is to upload it to GCS. There are multiple ways to achieve this; the common methods are explained below. After the files are in GCS, you will use the bq command-line tool (covered in Step 4) to load them into BigQuery, adding schema and data type information; its syntax is described in the GCP quickstart guide. Iterate through this process as many times as it takes to load all of your tables into BigQuery.

(A) Using gsutil

The gsutil utility lets you upload a local file to a GCS (Google Cloud Storage) bucket.

To copy a file to GCS:

gsutil cp local_copy.csv  gs://gcs_bucket_name/path/to/folder/

To copy an entire folder to GCS:

gsutil cp -r local_dir_name gs://gcs_bucket_name/path/to/parent_folder/
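If you exported one CSV file per table, you can copy them all to the staging bucket in a single command. A minimal sketch, assuming the exports sit in a local ./exports directory (the directory, bucket, and folder names are placeholders):

# Upload every exported CSV to the staging area, in parallel (-m).
gsutil -m cp ./exports/*.csv gs://gcs_bucket_name/path/to/folder/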

(B) Using Web console

An alternative way to upload data from your local machine to GCS is the web console. To use it, follow the steps below:

1. Log in to your GCP account. You need a working Google account to use GCP. From the navigation menu, click on Storage and open the Browser on the left tab.

2. Create a new bucket to upload your data. Make sure the name you choose is globally unique.

3. Click on the bucket you created in step 2. This lets you browse for the file on your local machine.

4. Choose the file and click on the Upload button. A progress bar appears; wait for the action to complete. You will then see the file in the bucket.

Step 4: Upload data to BigQuery from GCS bucket

You can upload data to BigQuery from GCS using two methods: (A) the console UI, or (B) the command line.

(A) Uploading the data using the web console UI:

1. Go to BigQuery from the menu.

2. In the UI, click on Create dataset and provide the dataset name and location.

3. Click on the name of the dataset you created, then click on the Create table option and provide the project name, dataset name, table name, and table type.

(B) Uploading the data using the command line

To open the command-line tool, click on the Cloud Shell icon in the GCP console toolbar.

The syntax of the bq command to load a file into a BigQuery table is:

bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]

[LOCATION] is an optional parameter that specifies the location of your dataset, e.g. "US"

[FORMAT] – to load a CSV file, set it to CSV

[DATASET] – the dataset name

[TABLE] – the table to load the data into

[PATH_TO_SOURCE] – the path to the source file in the GCS bucket

[SCHEMA] – the schema, either inline or as a path to a JSON schema file

Note: Alternatively, the --autodetect flag lets BigQuery infer the table schema automatically.

You can specify your schema explicitly on the bq command line:

bq --location=US load --source_format=CSV your_dataset.your_table gs://your_bucket/your_data.csv ./your_schema.json

Your target table schema can also be autodetected:

bq --location=US load --autodetect --source_format=CSV your_dataset.your_table gs://mybucket/data.csv

The BigQuery command-line interface offers three options for writing to an existing table:

  1. Overwrite the table

    bq --location=US load \
    --autodetect \
    --replace \
    --source_format=CSV \
    your_target_dataset_name.your_target_table_name gs://source_bucket_name/path/to/file/source_file_name.csv

  2. Append data to the table

    bq --location=US load \
    --noreplace \
    --source_format=CSV \
    your_target_dataset_name.your_target_table_name gs://source_bucket_name/path/to/file/source_file_name.csv \
    ./schema_file.json

  3. Add new fields to the target table

    bq --location=US load \
    --noreplace \
    --schema_update_option=ALLOW_FIELD_ADDITION \
    --source_format=CSV \
    your_target_dataset.your_target_table gs://bucket_name/source_data.csv ./target_schema.json
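After any of the loads above, it is worth confirming that the table landed with the expected schema and row count. A quick, optional sanity check with the bq tool, using the same placeholder dataset and table names as above:

# Show the schema of the loaded table.
bq show --schema --format=prettyjson your_target_dataset_name.your_target_table_name

# Count the rows that were loaded.
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM your_target_dataset_name.your_target_table_name'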

Step 5: Update the target table in BigQuery

The loads described in the steps above do not, by themselves, fully update the target table. Because GCS acts as a staging area for BigQuery uploads, the data is first loaded into an intermediate table in BigQuery and then merged into the final target table.

There are two ways of updating the final table, as explained below:

  1. Update the existing rows in the final table, then insert the new rows from the intermediate table:

    UPDATE data_set_name.target_table t
    SET value = s.value
    FROM data_set_name.intermediate_table s
    WHERE t.id = s.id;

    INSERT INTO data_set_name.target_table (id, value)
    SELECT id, value
    FROM data_set_name.intermediate_table
    WHERE id NOT IN (SELECT id FROM data_set_name.target_table);

  2. Delete from the final table all rows that also exist in the intermediate table, then insert every row from the intermediate table. Here the intermediate table is used in truncate-and-load mode:

    DELETE FROM data_set_name.target_table
    WHERE id IN (SELECT id FROM data_set_name.intermediate_table);

    INSERT INTO data_set_name.target_table (id, value)
    SELECT id, value
    FROM data_set_name.intermediate_table;
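Alternatively, BigQuery's MERGE statement can perform the update and the insert as a single atomic statement. A minimal sketch, run through the bq tool and using the same placeholder dataset, table, and column names as above:

bq query --use_legacy_sql=false '
MERGE data_set_name.target_table t
USING data_set_name.intermediate_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value)'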

Limitations of Writing Custom Scripts and Developing ETL to Move Data from Aurora to BigQuery:

  1. Writing custom code is worthwhile only if you are looking for a one-time data migration from Amazon Aurora to BigQuery.
  2. When data needs to be migrated on an ongoing basis or in real time, you have to move it incrementally. The custom-code ETL above falls short here; you would need to write additional code to achieve real-time data migration.
  3. The custom code can break if the source schema changes.
  4. If you later find that transformations need to be applied to the data, you will need extra time and resources.
  5. Since you developed the custom migration code, you also have to maintain it to keep meeting business goals.
  6. With the custom-code approach, you have to focus on both business and technical details.
  7. ETL code is fragile; a single failure can break the entire process, causing inaccurate or delayed data in BigQuery.

An Easier Way to Move Data from Amazon Aurora to BigQuery:

Using a fully managed, easy-to-use Data Pipeline platform like Hevo, you can load your data from Aurora to BigQuery in a matter of minutes. This can be achieved in a code-free, point-and-click visual interface. Here are the simple steps to replicate Amazon Aurora to BigQuery using Hevo:

  • Connect to your Aurora DB
  • Select one of the following replication modes:
    1. Full dump (load all tables)
    2. Load data from Custom SQL Query
    3. Fetch data using BinLog
  • Configure your BigQuery Warehouse and select the frequency at which you want to load data

With Hevo, you can achieve simple and efficient Data Replication from Aurora to BigQuery. Hevo can help you move data from not just Aurora DB but 100s of additional data sources.

Sign up for a 14-Day Free Trial with Hevo and experience a seamless, hassle-free data loading experience from Aurora DB to Google BigQuery.
