In this day businesses are generating a huge amount of data regularly. To make important decisions this raw data is very essential. However, there are a few major challenges in the process. It is very difficult to analyze such a huge amount of data (Petabyte) using a traditional database like MySQL, Oracle, SQL Server, etc. In order to get any tangible insight from this data, you would need to move data to Data Warehouse like Google BigQuery. This post provides a step-by-step walkthrough on how to migrate data from Amazon Aurora to the BigQuery Data warehouse using 2 steps. Read along and decide which method suits you the best!
Table of Contents
Methods to Connect Aurora to BigQuery
Here are the methods you can use to connect Aurora to BigQuery in a seamless fashion:
In this post, we will cover the second method (Custom Code) in detail. Towards the end of the post, you can also find a quick comparison of both data replication methods so that you can evaluate your requirements and choose wisely.
Method 1: Using Custom Code to Move Data from Aurora to BigQuery
This method requires you to manually set up the data transfer process from Aurora to BigQuery. The steps involved in migrating data from Aurora DB to BigQuery are as follows:
Step 1: Getting Data out of Amazon Aurora
By writing SQL queries we can export data from Aurora. The SELECT queries enable us to pull the data we want. You can specify filters and order of the data. You can also limit results.
A command-line tool called mysqldump lets you export entire tables and databases in a format you specify (i.e. delimited text, CSV, or SQL queries).
mysql -u user_name -p --database=db_name --host=rds_hostname --port=rdsport --batch -e "select * from table_name" | sed 's/t/","/g;s/^/"/;s/$/"/;s/n//g' > file_name
Step 2: Preparing Amazon Aurora Data
You need to make sure the target BigQuery table is perfectly aligned with the source Aurora table, specifically column sequence and data type of columns.
Step 3: Upload Data to Google Cloud Storage
You can use the bq command-line tool to upload the files to your datasets, adding schema and data type information. In GCP quickstart guide you can find the syntax of bq command line. Iterate through this process as many times as it takes to load all of your tables into BigQuery.
Once the data has been extracted from the Aurora database the next step is to upload it to the GCS There are multiple ways this can be achieved. The various methods are explained below.
(A) Using Gsutil
The gsutil utility will help us upload a local file to GCS(Google Cloud Storage) bucket.
To copy a file to GCS:
gsutil cp local_copy.csv gs://gcs_bucket_name/path/to/folder/
To copy an entire folder to GCS:
gsutil local_dir_name -r dir gs://gcs_bucket_name/path/to/parent_folder/
(B) Using Web console
An alternative means to upload the data from your local machine to GCS is using the web console. To use the web console alternative, follow the steps laid out below:
1. First of all, you need to Login to your GCP account. You ought to have a working Google account to make use of GCP. In the menu option, click on storage and navigate to the browser on the left tab
2. Create a new bucket to upload your data. Make sure the name you choose is globally unique
3. Click on the bucket name that you have created in step 2, this will ask to you browse the file from your local machine
4. Choose the file and click on the upload button. Once you see a progress bar wait for the action to be completed. You can see the file is loaded in the bucket.
Step 4: Upload to BigQuery from GCS
You can upload data to BigQuery from GCS using two methods: (A) Using console UI (B) Using the command line
(A) Uploading the data using the web console UI:
1. Go to the BigQuery from the menu option
2. On UI click on create a dataset, provide dataset name and location
3. Then click on the name of created dataset name. Click on create table option and provide the dataset name, table name, project name, table type.
(B) Using data using the command line
To open the command-line tool, on the GCS home page click on the cloud shell icon shown below:
The Syntax of the bq command line to load the file in the BigQuery table:
bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE]
[LOCATION] is an optional parameter that represents Location name like “us-east”
[FORMAT] to load CSV file set it to CSV
[DATASET] dataset name.
[TABLE] table name to load the data.
[PATH_TO_SOURCE] path to source file present on the GCS bucket.
[SCHEMA] Specify the schema
Note: Autodetect flag recognizes the table schema
You can specify your schema using bq command line:
bq --location=US load --source_format=CSV your_dataset.your_table gs://your_bucket/your_data.csv ./your_schema.json
Your target table schema can also be autodetected:
bq --location=US load --autodetect --source_format=CSV your_dataset.your_table gs://mybucket/data.cs
BigQuery command line interface offers us to 3 options to write to an existing table.
- Overwrite the table
bq --location = US load --autodetect --replace --source_file_format = CSV your_target_dataset_name.your_target_table_name gs://source_bucket_name/path/to/file/source_file_name.csv
- Append data to the table
bq --location = US load --autodetect --noreplace --source_file_format = CSV your_target_dataset_name.your_table_table_name gs://source_bucket_name/path/to/file/source_file_name.csv ./schema_file.json
- Adding new fields in the target table
bq --location = US load --noreplace --schema_update_option = ALLOW_FIELD_ADDITION --source_file_format = CSV your_target_dataset.your_target_table gs://bucket_name/source_data.csv ./target_schema.json
Step 5: Update the Target Table in BigQuery
The data that was matched in the above-mentioned steps have not done complete data updates on the target table. The data is stored in an intermediate data table, this is because GCS is a staging area for BigQuery upload. Hence, the data is stored in an intermediate table before been uploaded to BigQuery
There two ways of updating the final table as explained below:
- Update the rows in the final table, Then insert new rows from the intermediate table
UPDATE target_table t SET t.value = s.value FROM intermediate_table s WHERE t.id = s.id; INSERT target_table (id, value) SELECT id, value FROM intermediate_table WHERE NOT id IN (SELECT id FROM target_table);
- Delete all the rows from the final table which are in the intermediate table, Then insert all the rows newly loaded in the intermediate table. Here the intermediate table will be in truncate and load mode
DELETE FROM final_table f WHERE f.id IN (SELECT id from intermediate_table); INSERT data_setname.target_table(id, value) SELECT id, value FROM data_set_name.intermediate_table;
That’s it! Your Amazon Aurora to Google BigQuery data transfer process is complete.
Limitations of using Custom Code to Move Data from Aurora to BigQuery
The manual approach will allow you to move your data from Amazon Aurora to BigQuery successfully, however it suffers from the following limitations:
- Writing custom code would benefit only if you are looking for one-time data migration from Amazon Aurora to BigQuery.
- When you have a use case where data needs to be migrated on an ongoing basis or in real-time, you would have to move it in an incremental manner. The above custom code ETL would fail here. You would need to write additional lines of code to achieve this real-time data migration.
- There are chances that the custom code breaks if the source schema gets changed.
- If in future you identify the data transformations needs to be applied on data, you would need extra time and resources.
- Since you have developed this custom code to migrate data you have to maintain the standard of the code to achieve the business goals.
- In the custom code approach, You have to focus on both business and technical details.
- ETL code is fragile with a high susceptibility to break the entire process that may cause inaccurate and delay in data availability in BigQuery.
Method 2: Using Hevo Data to Move Data from Aurora to BigQuery
Using a fully managed, easy-to-use Data Pipeline platform like Hevo, you can load your data from Aurora to BigQuery in a matter of minutes. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
Get Started with Hevo for free
This can be achieved in a code-free, point-and-click visual interface. Here are simple steps to replicate Amazon Aurora to BigQuery using Hevo:
- Step 1: Connect to your Aurora DB by providing the proper credentials.
- Step 2: Select one of the following the replication mode:
- Full dump (load all tables)
- Load data from Custom SQL Query
- Fetch data using BinLog
- Step 3: Complete Aurora to BigQuery Migration by providing information about your Google BigQuery destination such as the authorized Email Address, Project ID, etc.
About Amazon Aurora
Amazon Aurora is a popular relational database developed by Amazon. It is one of the most widely used Databases for low latency data storage and data processing. This Database operates on Cloud technology and is easily compatible with MySQL and PostgreSQL. This way it provides performance and accessibility similar to traditional databases at a relatively low price. Moreover, it is simple to use and it has Amazon security and reliability features.
Amazon Aurora is a MySQL-compatible relational database used by businesses. Aurora offers better performance and cost-effective price than traditional MySQL. It is primarily used for a transactional or operational database. It is specifically not recommended for analytics.
About Google BigQuery
BigQuery is a Google-managed cloud-based data warehouse service. This is intended to store, process and analyze large volume (Petabytes) of data to make data analysis more accurate. BigQuery is known to give quick results with very minimal cost and great performance. Since infrastructure is managed by Google, you as a developer, data analyst or data scientist can focus on uncovering meaningful insights using native SQL.
This blog talks about the two methods you can implement to move data from Aurora to BigQuery in a seamless fashion.
Visit our Website to Explore Hevo
With Hevo, you can achieve simple and efficient Data Replication from Aurora to BigQuery. Hevo can help you move data from not just Aurora DB but 100s of additional data sources.
Sign Up for a 14-Day Free Trial with Hevo and experience a seamless, hassle-free data loading experience from Aurora DB to Google BigQuery. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Share your understanding of the Amazon Aurora BigQuery Integration in the comments below!