DynamoDB to BigQuery ETL: 3 Easy Steps to Move Data

on Tutorial, Data Warehouse, Database, DynamoDB, Google BigQuery • September 15th, 2021 • Write for Hevo

If you wish to move your data from DynamoDB to BigQuery, then you are on the right page. This post aims to help you understand the methods to move data from DynamoDB to BigQuery. But, before we get there, it is important to briefly understand the features of DynamoDB and BigQuery.

Table of Contents

Introduction to DynamoDB and Google BigQuery

DynamoDB and BigQuery are popular, fully managed cloud databases provided by the two biggest names in Tech. Having launched for business in 2012 and 2010 respectively, these come as part of a host of services offered by their respective suite of services. This makes the typical user wanting to stick to just one, a decision that solidifies as one looks into the cumbersome process of setting up and maximizing the potential of having both these up and running parallelly. That being said, businesses still end up doing this for a variety of reasons, and therein lies the relevance of discussing this topic. 

Moving data from DynamoDB to BigQuery

As mentioned before, because these services are offered by two different companies that want everything to be done within their tool suite, it is a non-trivial task to move data seamlessly from one to the other. Here are the two ways to move data from DynamoDB to BigQuery:

1) Using Hevo Data: An easy-to-use integration platform that gets the job done with minimal effort.

2) Using Custom Scripts: You can custom build your ETL pipeline by hand-coding scripts.

This article aims to guide the ones that have opted to move data on their own from DynamoDB to BigQuery. The blog would be able to guide you with a step-by-step process, make you aware of the pitfalls and provide suggestions to overcome them. 

Steps to Move Data from DynamoDB to Bigquery using Custom Code Method

DynamoDB to BigQuery
Image Source: Self

Below are the broad steps that you would need to take to migrate your data from DynamoDB to BigQuery. Each of these steps is further detailed in the rest of the article.

Step 1: Export the DynamoDB Data onto Amazon S3

The very first step is to transfer the source DynamoDB data to Amazon S3. Both S3 and GCS(Google Cloud Storage) support CSV as well as JSON files but for demonstration purposes, let’s take the CSV example. The actual export from DynamoDB to S3 can be done using the Command Line or via the AWS Console.

Method 1
The command-line method is a two-step process. First, you export the table data into a CSV file:

$aws dynamodb scan --table-name hevo_dynamo --output > hevo.txt

The above would produce a tab-separated output file which can then be easily converted to a CSV file. This CSV file (hevo.csv, let’s say) could then be uploaded to an S3 bucket using the following command:

$aws s3 cp hevo.csv s3://hevobucket/hevo.csv 

Method 2
If you prefer to use the console, sign in to your Amazon Console here.
The steps to be followed on the console are mentioned in detail in the AWS documentation here.

Step 2: Setting Up Google Cloud Storage and Copy Data from Amazon S3

The next step is to move the S3 data file onto Google Cloud Storage. As before, there is a command-line path as well as the GUI method to get this done. Let’s go through the former first.

  • Using gsutil

    gsutil is a command-line service to access and do a number of things on Google Cloud; primarily it is used to work with the GCS buckets.

    To create a new bucket the following command could be used:
$gsutil mb gs://hevo_gc/hevo
  • You could mention a bunch of parameters in the above command to specify the cloud location, retention, etc. (full list here under ‘Options’) per your requirements. An interesting thing about BigQuery is that it generally loads uncompressed CSV files faster than compressed ones. Hence, unless you are sure of what you are doing, you probably shouldn’t run a compression utility like gzip on the CSV file for the next step. Another thing to keep in mind with GCS and your buckets is setting up access control. Here are all the details you will need on that.

  • The next step is to copy the S3 file onto this newly created GCS bucket. The following copy command gets that job done:

$gsutil cp s3://hevo_s3/hevo.csv/ gs://hevo_gc/hevo.csv
  • BigQuery Data Transfer Service

    This is a relatively new and faster way to get the same thing done. Both CSV and JSON files are supported by this service however there are limitations that could be found here and here. Further documentation and the detailed steps on how to go about this can be found here.

Step 3: Import the Google Cloud Storage File into the BigQuery Table

Every BigQuery table lies in a specific data set of a specific project. Hence, the following steps are to be executed in the same order:

  • Create a new project.
  • Create a data set.
  • Run the bq load command to load the data into a table.
  1. The first step is to create a project. Sign in on the BigQuery Web UI. Click on the hamburger button ( ) and select APIs & Services.  Click Create Project and provide a project name (Let’s say ‘hevo_project’). Now you need to enable BigQuery for which search for the same and click on Enable. Your project is now created with BigQuery enabled.
  1. The next step is to create a data set. This can be quickly done using the bq command-line tool and the command is called mk. Create a new data set using the following command:
$bq mk hevo_dataset
  1. At this point, you are ready to import the GCS file into a table in this data set. The load command of bq lets you do the same. It’s slightly more complicated than the mk command so let’s go through the basic syntax first.
Bq load command syntax - 
$bq load project:dataset.table --autodetect --source_format
  1. autodetect is a parameter used to automatically detect the schema from the source file and is generally recommended. Hence, the following command should do the job for you:
$bq load hevo_project:hevo_dataset.hevo_table --autodetect 
--source_format=CSV gs://hevo_gc/hevo.csv

The GCS file gets loaded into the table hevo_table.

  • If no table exists under the name ‘hevo_table’ the above load command creates a new table.
  • If hevo_table is an existing table there are two types of load available to bring the source data into this table – Overwrite or Table Append.

Here’s the command to overwrite or replace:

$bq load hevo_project:hevo_dataset.hevo_table --autodetect --replace 
--source_format=CSV gs://hevo_gc/hevo.csv

Here’s the command to append data:

$bq load hevo_project:hevo_dataset.hevo_table --autodetect --noreplace 
--source_format=CSV gs://hevo_gc/hevo.csv

You should be careful with the append in terms of unique key constraints as BigQuery doesn’t enforce it on its tables.

Incremental load – Type 1/ Upsert

In this type of incremental load, a new record from the source is either inserted as a new record in the target table or replaces an existing record in the target table.

Let’s say the source (hevo.csv) looks like this:

IDNameSalaryDate
123Tom500002017-06-01
111Joe600002017-06-02

And the target table (hevo_table) looks like this:

IDNameSalaryDate
123Tom500002017-06-01
111Joe650002017-07-02

Post incremental load, hevo_table will look like this:

IDNameSalaryDate
123Tom500002017-06-01
111Joe650002017-07-02

The way to do this would be to load the hevo.csv into a separate table (staging table) first, let’s call it, hevo_intermediate. This staging table is then compared with the target table to perform the upsert as follows:

INSERT hevo_dataset.hevo_table (id, name, salary, date)

SELECT id, name, salary, date

FROM  hevo_dataset.hevo_intermediate
 WHERE NOT id IN (SELECT id FROM hevo_dataset.hevo_intermediate);

UPDATE hevo_dataset.hevo_table h

SET h.name = i.name,

h.salary = i.salary,

h.date = i.date

FROM  hevo_dataset.hevo_intermediate i

WHERE h.id = i.id;

Incremental load – Type 2/ Append Only 

In this type of incremental load, a new record from the source is always inserted into the target table if at least one of the fields has a different value from the target. This is quite useful to understand the history of data changes for a particular field and helps drive business decisions. 

Let’s take the same example as before. The target table in this scenario would look like the following: 

IDNameSalaryDate
123Tom500002017-06-01
111Joe600002017-06-01
111Joe650002017-07-02

To write the code for this scenario, you first insert all the records from the source to the target table as below:

INSERT hevo_dataset.hevo_table (id, name, salary, date)

SELECT id, name, salary, date

FROM  hevo_dataset.hevo_intermediate;

Next, you delete the duplicate records (all fields have the same value) using the window function like this:

DELETE FROM (SELECT id, name, salary, date, ROW_NUMBER() OVER(PARTITION BY id, name, salary, date) rn

FROM  hevo_dataset.hevo_table)

WHERE rn <> 1;

Hurray! You have successfully migrated your data from DynamoDB to BigQuery.

Limitations of Moving Data from DynamoDB to BigQuery using Custom Code Method

Limitations of moving data from DynamoDB to BigQuery
Image Source

As you have seen now, Data Replication from DynamoDB to BigQuery is a lengthy and time-consuming process. Furthermore, you have to take care of the following situations:

  • The example discussed in this article is to demonstrate copying over a single file from DynamoDB to BigQuery. In reality, hundreds of tables would have to be synced periodically or close to real-time; to manage that and not be vulnerable to data loss and data inconsistencies is quite the task.  
  • There are sometimes subtle, characteristic variations between services, especially when the vendors are different. It could happen in file Size Limits, Encoding, Date Format, etc. These things may go unnoticed while setting up the process and if not taken care of before kicking off Data Migration, it could lead to loss of data. 

So, to overcome these limitations to migrate your data from DynamoDB to BigQuery, let’s discuss an easier alternative – Hevo.

An easier approach to move data from DynamoDB to BigQuery using Hevo

Hevo- DynamoDB to BigQuery

The tedious task of setting this up as well as the points of concern mentioned above does not make the ‘custom method’ endeavor a suggestible one. You can save a lot of time and effort by implementing an integration service like Hevo and focus more on looking at the data and generating insights from it. Here is how you can migrate your data from DynamoDB to BigQuery using Hevo:

  • Connect and configure your DynamoDB Data Source.
  • Select the Replication mode: (i) Full dump (ii) Incremental load for append-only data (iii) Incremental load for mutable data.
  • Configure your Google BigQuery Data Warehouse where you want to move data.
SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Conclusion

In this article, you got a detailed understanding of how to migrate your data from DynamoDB to BigQuery using Custom code. You also learned some of the limitations associated with this method. Hence, you were introduced to an easier alternative- Hevo to migrate your data from DynamoDB to BigQuery seamlessly.

With Hevo, you can move data in real-time from DynamoDb to BigQuery in a reliable, secure, and hassle-free fashion. In addition to this, Hevo has 100+ native data source integrations that work out of the box. You could explore the integrations here.

VISIT OUR WEBSITE TO EXPLORE HEVO

Before you go ahead and take a call on the right approach to move data from DynamoDB to BigQuery, you should try Hevo for once.

SIGN UP to experience Hevo’s hassle-free Data Pipeline platform.

Share your experience of moving data from DynamoDB to BigQuery in the comments section below!

No-code Data Pipeline for BigQuery