AWS Glue is a fully managed, serverless ETL service. Because it is serverless, you do not have to worry about configuring or managing resources. Before going through the steps to export DynamoDB to S3 using AWS Glue, here are the main use cases of DynamoDB and Amazon S3.
- DynamoDB is heavily used in e-commerce because it stores data as key-value pairs with low latency.
- Due to its low latency, DynamoDB is a common choice for serverless web applications.
- Since S3 is cost-effective, it can serve as a backup store for both transient/raw and permanent data.
- A data lake can be built on S3 to perform analytics and act as a data repository.
- S3 is also used for machine learning, data profiling, and similar workloads.
Now, let us export data from DynamoDB to S3 using AWS Glue. This is done in two major steps:
A. Creating a crawler
B. Exporting data from DynamoDB to S3.
A. Steps to Create a Crawler:
- Create a database (here named dynamodb) in the AWS Glue Data Catalog.
- Pick the table CompanyEmployeeList from the tables drop-down list.
- Let the table information be created through the crawler. Set up the crawler details in the setup window, providing the crawler name dynamodb_crawler.
- Add the database name and the DynamoDB table name.
- Provide the crawler with an IAM role that allows it to access the DynamoDB table. Here, the IAM role created is AWSGlueServiceRole-dynamodb.
- You can schedule the crawler; for this illustration it runs on demand, since the activity is one-time.
- Review the crawler information.
- Run the crawler.
- Check the catalog details once the crawler has executed successfully. (If you prefer to script this setup instead of clicking through the console, see the sketch below.)
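As a scripted alternative to the console steps above, the same crawler can be created and run with boto3. This is a minimal sketch, assuming the names used in this walkthrough; the region and the account ID in the role ARN are placeholders you would replace.

```python
import boto3

# Minimal sketch, assuming the names used in this walkthrough.
# The region and the account ID in the role ARN are placeholders.
glue = boto3.client("glue", region_name="us-east-1")

# Create the crawler against the DynamoDB table. No Schedule is set,
# so the crawler runs on demand, as in the walkthrough.
glue.create_crawler(
    Name="dynamodb_crawler",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole-dynamodb",
    DatabaseName="dynamodb",
    Targets={"DynamoDBTargets": [{"Path": "CompanyEmployeeList"}]},
)

glue.start_crawler(Name="dynamodb_crawler")
```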
B. Export Data from DynamoDB to S3
Since the crawler has run, let us create a job to copy data from the DynamoDB table to S3. Here, the job name given is dynamodb_s3_gluejob. In AWS Glue, you can use either Python or Scala as the ETL language. For the scope of this article, let us use Python.
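The console wizard in the steps below creates this job for you, but the equivalent can also be done in code. The following is a minimal boto3 sketch, assuming the job name above; the role ARN and the script location are placeholders.

```python
import boto3

# Minimal sketch of creating and starting the Glue job from code.
# The role ARN and the script location are placeholders; the console
# wizard in the steps below produces the equivalent job for you.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="dynamodb_s3_gluejob",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole-dynamodb",
    Command={
        "Name": "glueetl",  # a Spark ETL job
        "ScriptLocation": "s3://your-script-bucket/dynamodb_s3_gluejob.py",
    },
)

glue.start_job_run(JobName="dynamodb_s3_gluejob")
```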
- Pick your data source.
- Pick your data target.
- Once completed, Glue will create a ready-made mapping for you.
- Once you review and approve the mapping, Glue will automatically generate the Python job code for you (a sketch of such a script follows this list).
- Execute the Python job.
- Once the job completes successfully, it will generate logs for you to review.
- Go and check the files in the S3 bucket, then download them.
- Review the contents of the files.
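For reference, the script Glue generates for a job like this typically has the following shape. This is a sketch, not the exact generated code: the catalog database and table names follow the walkthrough above, while the column mappings and the target bucket are illustrative placeholders.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Sketch of the kind of script Glue generates for dynamodb_s3_gluejob.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the catalog table the crawler created from the DynamoDB table.
source = glueContext.create_dynamic_frame.from_catalog(
    database="dynamodb",
    table_name="companyemployeelist",
)

# Mapping: Glue proposes one automatically; these columns are illustrative.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("employee_id", "string", "employee_id", "string"),
        ("name", "string", "name", "string"),
    ],
)

# Target: write the records to S3 (the bucket name is a placeholder).
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://your-target-bucket/companyemployeelist/"},
    format="json",
)

job.commit()
```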
Advantages of exporting DynamoDB to S3 using AWS Glue:
- This approach is fully serverless, and you do not have to worry about provisioning and maintaining your resources
- You can run your customized Python or Scala code to perform the ETL
- You can push event notifications to CloudWatch
- You can trigger a Lambda function for success or failure notifications (see the sketch after this list)
- You can manage your job dependencies using AWS Glue
- AWS Glue is a strong choice if you want to create a data catalog and push your data to Redshift Spectrum
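As an illustration of the CloudWatch and Lambda points above: Glue emits "Glue Job State Change" events to CloudWatch Events, and a rule can route those events to a Lambda function. The sketch below assumes a pre-existing function named notify-glue-job; its ARN, the region, and the account ID are placeholders. You would additionally grant CloudWatch Events permission to invoke the function (for example, via lambda add_permission).

```python
import json
import boto3

# Sketch of wiring a success/failure notification for the Glue job.
# The Lambda function notify-glue-job and its ARN are placeholders.
events = boto3.client("events", region_name="us-east-1")

# Glue publishes "Glue Job State Change" events to CloudWatch Events.
events.put_rule(
    Name="dynamodb-s3-gluejob-state",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": ["dynamodb_s3_gluejob"],
            "state": ["SUCCEEDED", "FAILED"],
        },
    }),
)

# Point the rule at the (placeholder) Lambda function.
events.put_targets(
    Rule="dynamodb-s3-gluejob-state",
    Targets=[{
        "Id": "notify-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:notify-glue-job",
    }],
)
```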
Disadvantages of exporting DynamoDB to S3 using AWS Glue:
- AWS Glue is batch-oriented and does not support streaming data; if your DynamoDB table is populated at a high rate, AWS Glue may not be the right option
- The AWS Glue service is still at an early stage and not mature enough for complex logic
- AWS Glue still has many limits on the number of crawlers, number of jobs, and so on
Faster & More Efficient Way to Export DynamoDB to S3
Using the Hevo Data Integration Platform, you can seamlessly export data from DynamoDB to S3 in two simple steps.
- Connect and configure your DynamoDB database.
- For each table in DynamoDB, choose a table name in Amazon S3 to which it should be copied.
AWS Glue can be chosen over AWS Data Pipeline when you do not want to worry about or take control of the underlying resources, i.e., EC2 instances, EMR clusters, and so on. However, considering that AWS Glue is at an early stage with various limitations, it may still not be the perfect choice for copying data from DynamoDB to S3. You can sign up with Hevo (7-day free trial) and set up DynamoDB to S3 in minutes.