AWS Simple Queue Service (SQS) is a fully managed message queue service offered by Amazon. Queue services are typically used to decouple systems and services in a microservice architecture, and in that sense, SQS is a software-as-a-service alternative to self-managed queue systems like Kafka and RabbitMQ. AWS S3, or Simple Storage Service, is another fully managed offering from Amazon. S3 is an object storage service that can handle virtually any storage need, with individual objects of up to 5 terabytes.
SQS and S3 form an integral part of applications built on a cloud-based microservices architecture, and a very common requirement is to transfer messages from SQS to S3 to keep a historical record of everything that comes through the queue. This post covers the methods to accomplish this transfer.
What is SQS?
SQS frees developers from the complexity and effort of developing, maintaining, and operating a highly reliable queue layer. It helps send, receive, and store messages between software systems. The standard message size is capped at 256 KB, but with the Amazon SQS Extended Client Library, messages of up to 2 GB are supported; messages larger than 256 KB are then stored in S3 behind the scenes. One of the biggest advantages of SQS over self-managed queue systems like Kafka is that it scales virtually without limit, with no capacity planning or pre-provisioning on the customer's side.
AWS offers a flexible pay-as-you-go pricing plan for SQS, which can provide significant cost savings compared to an always-on, self-hosted model.
Behind the scenes, SQS messages are stored redundantly across distributed SQS servers. SQS offers two types of queues: standard queues and FIFO queues. A standard queue offers an at-least-once delivery guarantee, which means that duplicate messages occasionally reach the receiver. A FIFO queue is designed for applications where the order of events and the uniqueness of messages are critical; it provides exactly-once processing.
SQS offers dead-letter queues for routing problematic or erroneous messages that cannot be processed under normal conditions. Amazon prices standard queues at $0.40 per million requests and FIFO queues at $0.50 per million requests (beyond the monthly free tier); the total cost of ownership will also include data transfer charges.
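To make the send/receive flow concrete, here is a minimal sketch using boto3, the Python AWS SDK. The queue URL and message payload are hypothetical placeholders, not values from this post:

```python
# Minimal SQS send/receive sketch with boto3; the queue URL is a placeholder.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # hypothetical

# Send a message (the body must stay within the 256 KB limit of the standard client).
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='{"event": "order_created", "id": 42}')

# Receive up to 10 messages, then delete each one after processing.
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5)
for message in response.get("Messages", []):
    print(message["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```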
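Routing to a dead-letter queue is configured through a redrive policy on the source queue. The sketch below, again using boto3, assumes a hypothetical queue URL and DLQ ARN:

```python
# Attach an existing queue as a dead-letter queue via a redrive policy.
# The queue URL and DLQ ARN below are placeholders.
import json
import boto3

sqs = boto3.client("sqs")

MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/main-queue"   # hypothetical
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:main-queue-dlq"                    # hypothetical

# After 5 failed receives, SQS moves the message to the dead-letter queue.
sqs.set_queue_attributes(
    QueueUrl=MAIN_QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": DLQ_ARN,
            "maxReceiveCount": "5",
        })
    },
)
```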
What is S3?
AWS S3 is a fully managed object storage service that can be used for a variety of use cases such as data hosting, backup and archiving, and data warehousing. Amazon handles all operational and maintenance activities related to scaling, provisioning, and so on, and customers pay only for the storage they use. It offers fine-grained access controls to meet organizational and business compliance requirements through an easy-to-use management interface. S3 also supports analytics through Amazon Athena and Redshift Spectrum, which let users run SQL queries on the stored data. S3 data is encrypted at rest by default.
S3 achieves state-of-the-art availability and durability by storing data redundantly across distributed servers. Writes are atomic: at any point, a read returns either the old data or the new data, never a corrupted response. S3 also provides strong read-after-write consistency, so readers see the latest version of an object once the write completes. Conceptually, S3 is organized into buckets and objects.
A bucket is the highest-level S3 namespace and acts as a container for storing objects. Buckets play a critical role in access control, and usage reporting is always aggregated at the bucket level. An object is the fundamental storage entity and consists of the data itself plus its metadata. An object is identified by its key and, when versioning is enabled, a version identifier. Customers can choose the AWS region in which their buckets are located according to their cost and latency requirements.
A point to note here is that objects do not support locking: if two PUTs for the same key arrive at the same time, the request with the latest timestamp wins. If concurrent access matters, users have to implement a locking mechanism of their own.
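The bucket/key/object model maps directly onto the S3 API. Here is a minimal boto3 sketch with a hypothetical bucket and key:

```python
# Write and read back an S3 object; bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-sqs-archive-bucket"           # hypothetical bucket
KEY = "events/2024/01/01/batch-0001.json"       # hypothetical object key

# Write an object.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b'{"event": "order_created"}')

# Read it back; the response also carries object metadata such as VersionId
# when versioning is enabled on the bucket.
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
print(obj["Body"].read())
```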
Steps to Load data from SQS to S3
The most straightforward approach to transferring data from SQS to S3 is to use standard AWS services such as Lambda functions and Amazon Kinesis Data Firehose. AWS Lambda functions are serverless functions that let users execute arbitrary logic on Amazon's infrastructure; they can be triggered by specific events or scheduled to run at required intervals.
It is straightforward to write a Lambda function that executes on incoming SQS messages and writes them to S3. The caveat is that this creates an S3 object for every message received, which is not always the ideal outcome; a minimal sketch of this naive handler is shown below, followed by two approaches that buffer the SQS messages for a fixed interval before writing to S3.
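As a reference point, here is the naive per-message handler as a boto3 sketch; the destination bucket is a hypothetical placeholder:

```python
# Naive approach: a Lambda handler triggered by SQS that writes one S3 object per message.
# The bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-sqs-archive-bucket"  # hypothetical

def lambda_handler(event, context):
    # An SQS-triggered Lambda receives a batch of records in event["Records"].
    for record in event["Records"]:
        key = f"sqs-messages/{record['messageId']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=record["body"].encode("utf-8"))
    return {"written": len(event["Records"])}
```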
1) Through a Scheduled Lambda Function
A scheduled Lambda function for SQS to S3 transfer is executed in predefined intervals and can consume all the SQS messages that were produced during that specific interval. Once it processes all the messages, it can create a multi-part S3 upload using API calls. To schedule a Lambda function that transfers data from SQS to S3, execute the below steps.
- Sign in to the AWS console and go to the Lambda console.
- Choose to create a function.
- For the execution role, select create a new execution role with Lambda permissions.
- Choose to use a blueprint. Blueprints are prototype code snippets that are already implemented to provide examples for users. Search for hello-world blueprint in the search box and choose it.
- Click create function. On the next page, click to add a trigger.
- In the trigger search menu, search for and select CloudWatch Events (now part of Amazon EventBridge). CloudWatch Events is used to schedule Lambda functions.
- Click create a new rule and select the rule type as schedule expression. A schedule expression takes a cron expression; enter a valid cron expression corresponding to your execution strategy, for example cron(0/15 * * * ? *) to run every 15 minutes.
- The Lambda function will contain code to read the messages from SQS and upload them to S3, as shown in the sketch after this list. Note that a single PUT can upload objects of up to 5 GB; AWS recommends multipart upload for objects larger than about 100 MB.
- Choose create function to activate the Lambda function.
- Once this is configured, AWS CloudWatch will generate events according to the cron schedule and trigger the Lambda function.
A problem with this approach is that Lambda functions have an execution time ceiling of 15 minutes and a memory ceiling of 10,240 MB. If a large number of SQS messages accumulate between runs, you can hit these time and memory limits and end up dropping messages.
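Below is a minimal sketch of what such a scheduled handler could look like, using boto3. The queue URL, bucket name, and key prefix are hypothetical placeholders; real code would need error handling and would have to respect the Lambda limits discussed next.

```python
# Scheduled approach: drain the queue, concatenate the message bodies,
# and upload one S3 object per run. All names below are placeholders.
import time
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # hypothetical
BUCKET = "example-sqs-archive-bucket"                                         # hypothetical

def lambda_handler(event, context):
    bodies = []
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            bodies.append(m["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

    if bodies:
        key = f"sqs-batches/{int(time.time())}.jsonl"
        # put_object is enough for small batches; switch to a multipart upload
        # when a single run produces a very large object.
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(bodies).encode("utf-8"))
    return {"messages_archived": len(bodies)}
```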
2) Using a Triggered Lambda Function and AWS Firehose
A deterrent to using a triggered Lambda function to move data from SQS to S3 is that it creates one S3 object per message, leading to a large number of destination files. A workaround is to use a buffered delivery stream that writes to S3 at predefined intervals. This approach involves the following broad steps.
Step 1: Create a triggered Lambda function
To create a triggered Lambda function for SQS to S3 data transfer, follow the same steps as in the first approach, but instead of selecting a schedule expression, select a trigger. Amazon will present a list of possible triggers; select the SQS trigger and click create function. In the Lambda function, write custom code that redirects the SQS messages to a Kinesis Data Firehose delivery stream, as in the sketch below.
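Here is a minimal boto3 sketch of such a handler; the delivery stream name is a hypothetical placeholder:

```python
# SQS-triggered Lambda that forwards each message to a Kinesis Data Firehose delivery stream.
# The stream name is a placeholder.
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "sqs-to-s3-stream"  # hypothetical delivery stream

def lambda_handler(event, context):
    records = [
        {"Data": (record["body"] + "\n").encode("utf-8")}
        for record in event["Records"]
    ]
    # put_record_batch accepts up to 500 records per call; an SQS batch has at most 10.
    firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=records)
    return {"forwarded": len(records)}
```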
Step 2: Create a Firehose Delivery Stream
To create a delivery stream, go to the AWS console and select the Kinesis Data Firehose Console.
Choose the destination as S3. In the configuration options, you will be presented with options to select the buffer size and buffer interval.
Buffer size is the amount of data that Kinesis Data Firehose buffers before writing an object to S3; you can set any value from 1 MB to 128 MB.
Buffer interval is the amount of time Firehose waits before writing to S3; you can select any value from 60 seconds to 900 seconds. After selecting the buffer size and buffer interval, you can leave the other parameters at their defaults and click create. That completes the pipeline to transfer data from SQS to S3.
The main limitation of this approach is that the user has no close control over when data is written to S3 beyond the buffer interval and buffer size limits imposed by Amazon, and these limits are not always practical in real scenarios.
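For readers who prefer to script this step instead of using the console, the same delivery stream can be created with boto3. The sketch below assumes hypothetical role and bucket ARNs; the IAM role must allow Firehose to write to the bucket.

```python
# Create a DirectPut Firehose delivery stream that buffers into S3.
# Role ARN, bucket ARN, and stream name are placeholders.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="sqs-to-s3-stream",  # hypothetical
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-s3-role",  # hypothetical
        "BucketARN": "arn:aws:s3:::example-sqs-archive-bucket",           # hypothetical
        "Prefix": "sqs-archive/",
        # Firehose flushes to S3 when either threshold is reached:
        # buffer size 1-128 MB, buffer interval 60-900 seconds.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
    },
)
```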
These are some benefits of having Hevo Data as your Data Automation Partner:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the S3 schema.
- Integrate With Custom Sources: Hevo allows businesses to move data from 100+ Data Sources straight to their desired destination.
- Quick Setup: Hevo, with its automated features, can be set up in minimal time. Moreover, with its simple and interactive UI, it is extremely easy for new customers to work with and perform operations using just 3 simple steps.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
With continuous real-time data movement, ETL your data seamlessly from your data sources to a destination of your choice with Hevo's easy-to-set-up, no-code interface. Try our 14-day full-access free trial!
Explore Hevo Platform With A 14-Day Free Trial
SQS to S3: Limitations of the Custom-Code Approach
Both the approaches mentioned for SQS to S3 data transfer use AWS-provided functions. An obvious advantage here is that you can implement the whole pipeline staying inside the AWS ecosystem. But these approaches have a number of limitations as mentioned below.
- Both approaches require a lot of custom coding and knowledge of AWS-proprietary configurations. Some of these configurations are confusing and can consume a significant amount of time and effort.
- AWS imposes limits on execution time, runtime memory, and storage for the services used to accomplish this transfer, and these limits are not always practical in real scenarios.
Conclusion
In this blog, you learned how to move data from SQS to S3 using AWS Lambda and Amazon Kinesis Data Firehose. You also went through the limitations of using custom code for SQS to S3 data migration. The Lambda- and Firehose-based approaches for loading data from SQS to S3 consume a significant amount of time and resources. Moreover, they are error-prone, and you will need to debug and maintain the data transfer process regularly.
Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 100+ data sources (40+ free sources). Furthermore, Hevo’s fault-tolerant architecture ensures a consistent and secure transfer of your data to a Data Warehouse. Using Hevo will make your life easier and make Data Transfer hassle-free.
Learn more about Hevo
Share your experience of loading data from SQS to S3 in the comment section below.
Sourabh has more than a decade of experience building scalable real-time analytics and has worked for companies like Flipkart, tBits Global, and Unbxd. He is experienced in technologies like MySQL, Hibernate, Spring, CXF, PHP, ExtJS, and Shell.