This post aims to help you understand the methods to move data from DynamoDB to BigQuery. But, before we get there, it is important to briefly understand the features of DynamoDB and BigQuery
Overview of DynamoDB and Google BigQuery
DynamoDB and BigQuery are popular, fully managed cloud databases provided by the two biggest names in Tech. Having launched for business in 2012 and 2010 respectively, these come as part of a host of services offered by their respective suite of services. This makes the typical user wanting to stick to just one, a decision that solidifies as one looks into the cumbersome process of setting up and maximizing the potential of having both these up and running parallelly. That being said, businesses still end up doing this for a variety of reasons and therein lies the relevance of discussing this topic.
Moving data from DynamoDB to BigQuery
As mentioned before, because these services are offered by two different companies that want everything to be done within their tool suite, it is a non-trivial task to move data seamlessly from one to the other. Here are the two ways to move data from DynamoDB to BigQuery:
1) Using Hevo Data: An easy-to-use integration platform that gets the job done with minimal effort.
2) Using Custom Scripts: You can custom build your ETL pipeline by hand coding scripts.
This article aims to guide the ones that have opted to move data on your own. The blog would be able to guide you with a step by step process, make you aware of the pitfalls and provide suggestions to overcome them.
Moving Data from DynamoDB to Bigquery using Custom Code:
Below are the broad steps that you would need to take. Each of these steps is further detailed in the rest of the article.
- Export the DynamoDB data onto Amazon S3.
- Setting up Google Cloud Storage and copy data from Amazon S3.
- Import the GCS data into a BigQuery table.
Step 1: Export the DynamoDB data onto Amazon S3
The very first step is to transfer the source DynamoDB data to Amazon S3. Both S3 and GCS support CSV as well as JSON files but for demonstration purposes, let’s take the CSV example. The actual export from DynamoDB to S3 can be done using the command line or via the AWS console.
- Method 1
The command line method is a two step process. First, you export the table data into a CSV file –
$aws dynamodb scan --table-name hevo_dynamo --output > hevo.txt
The above would produce a tab separated output file which can then be easily converted to a CSV file. This CSV file (hevo.csv, let’s say) could then be uploaded to an S3 bucket using the following command –
$aws s3 cp hevo.csv s3://hevobucket/hevo.csv
- Method 2
If you prefer to use the console, sign into your Amazon Console – https://console.aws.amazon.com/datapipeline/
The steps to be followed on the console are mentioned in detail in the AWS documentation here.
Step 2: Setting up Google Cloud Storage and copy data from Amazon S3
Next step is to move the S3 data file onto Google Cloud Storage. As before, there is a command line path as well as the GUI method to get this done. Let’s go through the former first –
gsutil is a command line service to access and do a number of things on Google Cloud; primarily it is used to work with the GCS buckets.
To create a new bucket the following command could be used –
$gsutil mb gs://hevo_gc/hevo
You could mention a bunch of parameters in the above command to specify the cloud location, retention, etc. (full list here under ‘Options’) per your requirements. An interesting thing about BigQuery is that it generally loads uncompressed CSV files faster than the compressed ones. Hence, unless you are sure of what you are doing, you probably shouldn’t run a compression utility like gzip on the CSV file for the next step. Another thing to keep in mind with GCS and your buckets is setting up access control. Here’s all the details you will need on that.
The next step is to copy the S3 file onto this newly created GCS bucket. The following copy command gets that job done:
$gsutil cp s3://hevo_s3/hevo.csv/ gs://hevo_gc/hevo.csv
BigQuery Data Transfer Service
Further documentation and the detailed steps on how to go about this can be found here – https://cloud.google.com/bigquery/docs/s3-transfer
Step 3: Import the Google Cloud Storage file into the BigQuery table
Every BigQuery table lies in a specific data set of a specific project. Hence, the following steps are to be executed in the same order:
- Create a new project.
- Create a data set.
- Run the bq load command to load the data into a table.
First step is to create a project. Sign in on the BigQuery Web UI. Click on the hamburger button ( ) and select APIs & Services. Click Create Project and provide a project name (Let’s say ‘hevo_project’). Now you need to enable BigQuery for which search for the same and click on Enable. Your project is now created with BigQuery enabled.
Next step is to create a data set. This can be quickly done using the bq command-line tool and the command is called mk . Create a new data set using the following command:
$bq mk hevo_dataset
At this point, you are ready to import the GCS file into a table in this data set. The load command of bq lets you do the same. It’s slightly more complicated than the mk command so let’s go through the basic syntax first.
Bq load command syntax - $bq load project:dataset.table --autodetect --source_format
autodetect is a parameter used to automatically detect the schema from the source file and is generally recommended.
Hence, the following command should do the job for you –
$bq load hevo_project:hevo_dataset.hevo_table --autodetect --source_format=CSV gs://hevo_gc/hevo.csv
The GCS file gets loaded into the table hevo_table.
If no table exists under the name ‘hevo_table’ the above load command creates a new table.
If hevo_table is an existing table there are two types of load available to bring the source data into this table – Overwrite or Table Append.
Here’s the command to overwrite or replace –
$bq load hevo_project:hevo_dataset.hevo_table --autodetect --replace --source_format=CSV gs://hevo_gc/hevo.csv
Here’s the command to append data –
$bq load hevo_project:hevo_dataset.hevo_table --autodetect --noreplace --source_format=CSV gs://hevo_gc/hevo.csv
You should be careful with the append in terms of unique key constraints as BigQuery doesn’t enforce it on its tables.
Incremental load – Type 1/ Upsert
In this type of incremental load, a new record from the source is either inserted as a new record in the target table or replaces an existing record in the target table.
Let’s say the source (hevo.csv) looks like this:
And the target table (hevo_table) looks like this:
Post incremental load, hevo_table will look like this:
The way to do this would be to load the hevo.csv into a separate table (staging table) first, let’s call it, hevo_intermediate. This staging table is then compared with the target table to perform the upsert as follows:
INSERT hevo_dataset.hevo_table (id, name, salary, date) SELECT id, name, salary, date FROM hevo_dataset.hevo_intermediate WHERE NOT id IN (SELECT id FROM hevo_dataset.hevo_intermediate);
UPDATE hevo_dataset.hevo_table h SET h.name = i.name, h.salary = i.salary, h.date = i.date FROM hevo_dataset.hevo_intermediate i WHERE h.id = i.id;
Incremental load – Type 2/ Append Only
In this type of incremental load, a new record from the source is always inserted to the target table if at least one of the fields have a different value from the target. This is quite useful to understand the history of data changes for a particular field and helps drive business decisions.
Let’s take the same example as before. The target table in this scenario would look like the following:
To write the code for this scenario, we first insert all the records from the source to the target table as below –
INSERT hevo_dataset.hevo_table (id, name, salary, date) SELECT id, name, salary, date FROM hevo_dataset.hevo_intermediate;
Next, we delete the duplicate records (all fields have the same value) using the window function like this –
DELETE FROM (SELECT id, name, salary, date, ROW_NUMBER() OVER(PARTITION BY id, name, salary, date) rn FROM hevo_dataset.hevo_table) WHERE rn <> 1;
Limitations of the Custom Code Method
As you have seen now, data replication from DynamoDB to BigQuery is a lengthy and time-consuming process. Furthermore, you have to take care of the following situations
- The example discussed in this article is to demonstrate copying over a single file from DynamoDB to BigQuery. In reality, hundreds of tables would have to be synced periodically or close to realtime; to manage that and not be vulnerable to data loss and data inconsistencies is quite the task.
- There are sometimes subtle, characteristic variations between services, especially when the vendors are different. It could happen in file size limits, encoding, date format, etc. These things may go unnoticed while setting up the process and if not taken care of before kicking off data migration, it could lead to loss of data.
An easier approach to move data from DynamoDb to BigQuery:
The tedious task of setting this up as well as the points of concerns mentioned above does not make the ‘custom method’ endeavour a suggestable one. You can save a lot of time and effort by implementing an integration service like Hevo and focus more on looking at the data and generating insights from it. Here is how you can move data using Hevo:
- Connect and configure your DynamoDB Data Source
- Select the replication mode: (i) Full dump (ii) Incremental load for append-only data (iii) Incremental load for mutable data
- Configure your BigQuery Warehouse where you want to move data
With Hevo, you can move data in real-time from DynamoDb to BigQuery in a reliable, secure, and hassle-free fashion. In addition to this, Hevo has 100+ native data source integrations that work out of the box. You could explore the integrations here: www.hevodata.com/integrations
Before you go ahead and take a call on the right approach to move data from DynamoDB to BigQuery – Do experience Hevo’s hassle-free Data Pipeline platform by signing up for a 14-day free trial here.