Businesses today generate huge volumes of data on a regular basis, and this raw data is essential for making important decisions. However, there are a few major challenges in the process. Analyzing such large volumes of data (petabytes) with a traditional database like MySQL, Oracle, or SQL Server is very difficult. To get any tangible insight from this data, you need to move it to a Data Warehouse like Google BigQuery. This post provides a step-by-step walkthrough on how to migrate data from Amazon Aurora to the BigQuery Data Warehouse using 2 methods. Read along and decide which method suits you best!
Method 1: Using Custom Code to Move Data from Aurora to BigQuery
This method consists of a 5-step process to move data from Amazon Aurora to BigQuery using custom ETL scripts. The approach has several advantages, but also a few limitations. To perform ETL using this method, you must be technically sound and have good programming skills.
Method 2: Using Hevo Data to Move Data from Aurora to BigQuery
Hevo Data can load your data from Aurora to BigQuery in minutes without writing a single line of code, and for free. Data loading can be configured on a visual, point-and-click interface. Since Hevo is fully managed, you do not have to invest any additional time and resources in maintaining and monitoring the data. Hevo promises 100% data consistency and accuracy.
Try Hevo for Free!
Methods to Connect Aurora to BigQuery
Here are the methods you can use to connect Aurora to BigQuery in a seamless fashion:
In this post, we will cover the first method (Custom Code) in detail. Towards the end of the post, you will also find a quick comparison of both data replication methods so that you can evaluate your requirements and choose wisely.
Method 1: Using Custom Code to Move Data from Aurora to BigQuery
This method requires you to manually set up the data transfer process from Aurora to BigQuery. The steps involved in migrating data from Aurora DB to BigQuery are as follows:
Step 1: Getting Data out of Amazon Aurora
You can export data from Aurora by writing SQL queries. SELECT queries let you pull exactly the data you need; you can apply filters, order the results, and limit the number of rows returned.
A command-line tool called mysqldump lets you export entire tables and databases in a format you specify (i.e. delimited text, CSV, or SQL statements). Alternatively, the mysql client can run a query in batch mode and pipe the output through sed to produce a CSV file, as in the command below:
mysql -u user_name -p --database=db_name --host=rds_hostname --port=rdsport --batch -e "select * from table_name" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > file_name
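For example, the command above can be wrapped in a small script to export several tables in one go. This is only a sketch; the host, credentials, and table names are placeholders, and the password is read from the MYSQL_PWD environment variable that the mysql client picks up automatically:

# Hypothetical sketch: export a few Aurora tables to CSV files using the
# mysql client approach shown above.
export MYSQL_PWD='your_password'
for table in customers orders order_items; do
  mysql -u user_name --database=db_name --host=rds_hostname --port=3306 --batch \
    -e "SELECT * FROM ${table}" \
    | sed 's/\t/","/g;s/^/"/;s/$/"/' > "${table}.csv"   # first line is a header row
done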
Step 2: Preparing Amazon Aurora Data
You need to make sure the target BigQuery table is perfectly aligned with the source Aurora table, specifically the column sequence and the data type of each column.
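As an illustration, suppose the source Aurora table has the columns id (INT), name (VARCHAR), and created_at (DATETIME); a matching BigQuery schema file could then look like the sketch below (the file name and column names are hypothetical). Keep the column order identical to the CSV export:

# Illustrative sketch: write a BigQuery schema file whose column order and types
# mirror the hypothetical Aurora table (id INT, name VARCHAR, created_at DATETIME).
cat > your_schema.json <<'EOF'
[
  {"name": "id",         "type": "INT64",     "mode": "REQUIRED"},
  {"name": "name",       "type": "STRING",    "mode": "NULLABLE"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
EOF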
Step 3: Upload Data to Google Cloud Storage
Once the data has been extracted from the Aurora database, the next step is to upload it to Google Cloud Storage (GCS). There are multiple ways this can be achieved; the various methods are explained below. In Step 4 you will then use the bq command-line tool to load these files into your BigQuery datasets, adding schema and data type information (the syntax of the bq command line is covered in the GCP quickstart guide). Iterate through that process as many times as it takes to load all of your tables into BigQuery.
(A) Using Gsutil
The gsutil utility lets you upload a local file to a GCS (Google Cloud Storage) bucket.
To copy a file to GCS:
gsutil cp local_copy.csv gs://gcs_bucket_name/path/to/folder/
To copy an entire folder to GCS:
gsutil cp -r local_dir_name gs://gcs_bucket_name/path/to/parent_folder/
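If you are uploading many files, gsutil's -m option performs the copy in parallel, which typically speeds up large transfers:

gsutil -m cp -r local_dir_name gs://gcs_bucket_name/path/to/parent_folder/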
(B) Using Web console
An alternative way to upload the data from your local machine to GCS is the web console. To use the web console, follow the steps laid out below:
1. Log in to your GCP account. You need a working Google account to use GCP. From the menu, click on Storage and navigate to Browser in the left tab.
2. Create a new bucket to upload your data. Make sure the name you choose is globally unique.
3. Click on the name of the bucket you created in step 2; you will be prompted to browse for the file on your local machine.
4. Choose the file and click on the Upload button. A progress bar appears; wait for the action to complete. You will then see the file loaded in the bucket.
Step 4: Upload to BigQuery from GCS
You can upload data to BigQuery from GCS using two methods: (A) using the web console UI, or (B) using the command line.
(A) Uploading the data using the web console UI:
1. Go to BigQuery from the menu.
2. In the UI, click on Create dataset and provide the dataset name and location.
3. Click on the name of the dataset you just created, then click on Create table and provide the dataset name, table name, project name, and table type (a command-line alternative for creating the dataset and table is sketched after this list).
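If you prefer to script this step as well, the dataset and an empty table can typically be created with bq mk. A minimal sketch with hypothetical names, reusing the your_schema.json file prepared earlier:

# Create a dataset in the US location, then an empty table from the schema file.
# Dataset, table, and schema file names are placeholders.
bq --location=US mk --dataset your_dataset
bq mk --table your_dataset.your_table ./your_schema.json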
(B) Uploading the data using the command line
To open the command-line tool, click on the Cloud Shell icon on the GCP console home page:
The syntax of the bq command to load a file into a BigQuery table is:
bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]
- [LOCATION] is an optional parameter that represents the location of your dataset, e.g. “US” or “us-east1”.
- [FORMAT]: to load a CSV file, set it to CSV.
- [DATASET]: the dataset name.
- [TABLE]: the name of the table to load the data into.
- [PATH_TO_SOURCE]: the path to the source file in the GCS bucket.
- [SCHEMA]: the schema, specified inline or as a path to a JSON schema file.
Note: Alternatively, the --autodetect flag lets BigQuery infer the table schema automatically.
You can specify your schema explicitly on the bq command line:
bq --location=US load --source_format=CSV your_dataset.your_table gs://your_bucket/your_data.csv ./your_schema.json
Your target table schema can also be autodetected:
bq --location=US load --autodetect --source_format=CSV your_dataset.your_table gs://mybucket/data.csv
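Note that a CSV exported with column headers (as in the mysql --batch example from Step 1) will contain a header row; add --skip_leading_rows=1 so that row is not loaded as data, for example:

bq --location=US load --skip_leading_rows=1 --source_format=CSV your_dataset.your_table gs://your_bucket/your_data.csv ./your_schema.json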
The BigQuery command-line interface offers three options for writing to an existing table.
- Overwrite the table
bq --location=US load --autodetect --replace --source_format=CSV your_target_dataset_name.your_target_table_name gs://source_bucket_name/path/to/file/source_file_name.csv
- Append data to the table
bq --location=US load --noreplace --source_format=CSV your_target_dataset_name.your_target_table_name gs://source_bucket_name/path/to/file/source_file_name.csv ./schema_file.json
- Adding new fields in the target table
bq --location=US load --noreplace --schema_update_option=ALLOW_FIELD_ADDITION --source_format=CSV your_target_dataset.your_target_table gs://bucket_name/source_data.csv ./target_schema.json
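After a load job completes, it is worth confirming that the target table looks as expected. A quick sanity check using the bq tool (dataset and table names are placeholders):

# Inspect the resulting schema and row count.
bq show --schema --format=prettyjson your_target_dataset.your_target_table
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM your_target_dataset.your_target_table'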
Step 5: Update the Target Table in BigQuery
The previous steps do not, by themselves, fully update the target table. Because GCS acts as a staging area for BigQuery uploads, the data is first loaded into an intermediate (staging) table in BigQuery before being applied to the final target table.
There are two ways of updating the final table, as explained below; a combined MERGE-based sketch follows the list:
- Update the existing rows in the final table, then insert the new rows from the intermediate table:
UPDATE target_table t
SET value = s.value
FROM intermediate_table s
WHERE t.id = s.id;

INSERT INTO target_table (id, value)
SELECT id, value
FROM intermediate_table
WHERE id NOT IN (SELECT id FROM target_table);
- Delete from the final table all rows that are also present in the intermediate table, then insert all the newly loaded rows from the intermediate table. Here the intermediate table is used in truncate-and-load mode:
DELETE FROM data_set_name.target_table t
WHERE t.id IN (SELECT id FROM data_set_name.intermediate_table);

INSERT INTO data_set_name.target_table (id, value)
SELECT id, value
FROM data_set_name.intermediate_table;
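Alternatively, BigQuery's MERGE statement can perform the update and insert in a single atomic operation. Below is a minimal sketch using the same hypothetical dataset, table, and column names as above, executed through the bq command line:

# Upsert from the intermediate (staging) table into the target table in one statement.
bq query --use_legacy_sql=false '
MERGE data_set_name.target_table t
USING data_set_name.intermediate_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value)'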
That’s it! Your Amazon Aurora to Google BigQuery data transfer process is complete.
Limitations of using Custom Code to Move Data from Aurora to BigQuery
The manual approach will allow you to move your data from Amazon Aurora to BigQuery successfully; however, it suffers from the following limitations:
- Writing custom code is worthwhile only if you are looking for a one-time data migration from Amazon Aurora to BigQuery.
- When data needs to be migrated on an ongoing basis or in real time, it has to be moved incrementally. The custom code above would fail here; you would need to write additional code to achieve this incremental, real-time data migration.
- The custom code is likely to break if the source schema changes.
- If you later identify data transformations that need to be applied to the data, you will need extra time and resources to implement them.
- Since you have developed this custom code, you also have to maintain it to keep meeting the business goals.
- With the custom code approach, you have to focus on both business and technical details.
- ETL code is fragile, and a single failure can break the entire process, causing inaccurate or delayed data availability in BigQuery.
Method 2: Using Hevo Data to Move Data from Aurora to BigQuery
Using a fully managed, easy-to-use Data Pipeline platform like Hevo, you can load your data from Aurora to BigQuery in a matter of minutes. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
This can be achieved in a code-free, point-and-click visual interface. Here are simple steps to replicate Amazon Aurora to BigQuery using Hevo:
- Step 1: Connect to your Aurora DB by providing the proper credentials.
- Step 2: Select one of the following replication modes:
- Full dump (load all tables)
- Load data from Custom SQL Query
- Fetch data using BinLog
- Step 3: Complete Aurora to BigQuery Migration by providing information about your Google BigQuery destination such as the authorized Email Address, Project ID, etc.
About Amazon Aurora
Amazon Aurora is a popular relational database developed by Amazon. It is one of the most widely used databases for low-latency data storage and processing. Aurora runs in the cloud and is compatible with MySQL and PostgreSQL, providing performance and accessibility similar to traditional databases at a relatively low price. It is also simple to use and comes with Amazon's security and reliability features.
Aurora offers better performance and a more cost-effective price than traditional MySQL. It is primarily used as a transactional or operational database and is specifically not recommended for analytics.
About Google BigQuery
BigQuery is a Google-managed, cloud-based data warehouse service. It is designed to store, process, and analyze large volumes (petabytes) of data to make data analysis more accurate. BigQuery is known for delivering quick results at minimal cost with great performance. Since the infrastructure is managed by Google, you, as a developer, data analyst, or data scientist, can focus on uncovering meaningful insights using native SQL.
Conclusion
This blog talks about the two methods you can implement to move data from Aurora to BigQuery in a seamless fashion.
With Hevo, you can achieve simple and efficient Data Replication from Aurora to BigQuery. Hevo can help you move data from not just Aurora DB but 150+ additional data sources.
FAQs on Aurora to BigQuery
1. What is Amazon Aurora used for?
Amazon Aurora is a fully managed relational database service designed for high availability, performance, and scalability. It is used to run MySQL and PostgreSQL-compatible databases with automated backups, replication, and failover support.
2. How do you integrate Amazon Aurora with BigQuery?
To integrate Amazon Aurora with BigQuery, you can use a data pipeline service like Hevo Data or AWS Data Pipeline to extract data from Aurora, transform it as needed, and load it into BigQuery.
3. What is BigQuery?
BigQuery is a fully managed, serverless data warehouse offered by Google Cloud, designed for processing and analyzing large datasets quickly. It supports SQL queries and integrates with various data sources for real-time analytics and reporting.