AWS Simple Queue Service (SQS) is a fully managed message queue service offered by Amazon. Queue services are typically used to decouple systems and services in a microservice architecture. In that sense, SQS is a managed alternative to self-hosted messaging systems like Kafka, RabbitMQ, etc. AWS S3, or Simple Storage Service, is a managed object storage service, also offered by Amazon. S3 has no limit on total storage; an individual object can be up to 5 terabytes in size. SQS and S3 form an integral part of applications built on a cloud-based microservices architecture, and it is very common to need to transfer messages from SQS to S3 to keep a historical record of everything that comes through the queue. This post is about the methods to accomplish this transfer.
SQS frees developers from the complexity and effort of developing, maintaining and operating a highly reliable queue layer. It sends, receives and stores messages between software systems. The standard message size is capped at 256 KB, but the Amazon SQS Extended Client Library supports messages of up to 2 GB. With that library, the payload of any message larger than 256 KB is stored in S3 by default and the queue carries only a reference to it. One of the greatest advantages of using SQS instead of self-managed systems like Kafka is that it allows virtually unlimited scaling without the customer having to worry about capacity planning or pre-provisioning. AWS prices SQS on a pay-as-you-go model, which can provide significant cost savings compared to an always-on model.
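The large-message behaviour described above can be sketched as a "store the payload in S3, send a pointer through SQS" pattern. This is only an illustration of the idea, not the Extended Client Library's actual implementation or wire format. The key prefix `sqs-payloads/` and the pointer fields are hypothetical, and `sqs` and `s3` are assumed to be boto3 clients, for example `boto3.client("sqs")` and `boto3.client("s3")`.

```python
import json
import uuid

SQS_MAX_BYTES = 256 * 1024  # standard SQS message size ceiling cited above


def send_with_s3_fallback(sqs, s3, queue_url, bucket, body):
    """Send `body` directly if it fits in SQS, else store it in S3 and send a pointer.

    Returns the S3 key used for the payload, or None if the body was sent inline.
    """
    payload = body.encode("utf-8")
    if len(payload) <= SQS_MAX_BYTES:
        sqs.send_message(QueueUrl=queue_url, MessageBody=body)
        return None

    # Hypothetical key layout; any unique key would do.
    key = f"sqs-payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)

    # The queue only carries a small JSON pointer to the real payload.
    pointer = json.dumps({"s3_bucket": bucket, "s3_key": key})
    sqs.send_message(QueueUrl=queue_url, MessageBody=pointer)
    return key
```

A consumer following the same convention would parse the pointer and fetch the object from S3 before processing it.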
Behind the scenes, SQS messages are stored redundantly across distributed SQS servers. SQS offers two types of queues: a standard queue and a FIFO queue. A standard queue offers an at-least-once delivery guarantee, which means that duplicate messages might occasionally reach the receiver. The FIFO queue is designed for applications where the order of events and the uniqueness of messages are critical. It preserves ordering and provides exactly-once processing. SQS also supports dead-letter queues for routing problematic or erroneous messages that cannot be processed under normal conditions. At the time of writing, Amazon prices standard queues at $0.40 per 1 million requests and FIFO queues at $0.50 per 1 million requests. The total cost of ownership will also include data storage costs.
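Because a standard queue can deliver the same message more than once, a consumer that archives messages may want to filter duplicates. The following is a minimal sketch that assumes messages are dictionaries shaped like the entries SQS returns (with a `MessageId` field) and that an in-memory set is good enough for tracking. A real pipeline would likely need durable tracking, which is outside the scope of this sketch.

```python
def drop_duplicates(messages, seen_ids):
    """Return only messages whose MessageId is not in `seen_ids`.

    `seen_ids` is a set that is updated in place, so it can be reused
    across successive batches.
    """
    fresh = []
    for msg in messages:
        if msg["MessageId"] in seen_ids:
            continue  # already handled in this or an earlier batch
        seen_ids.add(msg["MessageId"])
        fresh.append(msg)
    return fresh
```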
AWS S3 is a fully managed object storage service that can be used for a variety of use cases like hosting data, backup and archiving, data warehousing, etc. Amazon handles all operation and maintenance activities related to scaling, provisioning, etc., and customers only pay for the storage that they use. It offers fine-grained access controls to help meet organizational and business compliance requirements, configurable through an easy-to-use management console. S3 also supports analytics through AWS Athena and AWS Redshift Spectrum, which let users run SQL queries on the stored data. S3 data is encrypted at rest by default.
S3 achieves high availability by storing data across distributed servers. Historically, a caveat to this approach was a propagation delay, with S3 guaranteeing only eventual consistency for some operations; since December 2020, S3 provides strong read-after-write consistency for object PUTs and DELETEs. Writes are also atomic, which means that at any point the API will return either the old data or the new data and never a corrupted response. Conceptually, S3 is organized as buckets and objects. A bucket is the highest-level S3 namespace and acts as a container for storing objects. Buckets have a critical role in access control, and usage reporting is always aggregated at the bucket level. An object is the fundamental storage entity and consists of the object data as well as its metadata. Within a bucket, an object is uniquely identified by a key and a version identifier. Customers can choose the AWS regions in which their buckets are located according to their cost and latency requirements. A point to note here is that objects do not support locking: if two PUTs to the same key arrive at the same time, the request with the latest timestamp wins. This means that if there is concurrent access, users will have to implement some kind of locking mechanism on their own.
Two Approaches to Move Data from SQS to S3
Data can be copied from SQS to S3 in either of two ways:
Approach 1: Build custom code using AWS Lambda to move the data. This requires you to invest engineering resources to set up and monitor the infrastructure.
Approach 2: Use a fully-managed Data Pipeline Platform like Hevo Data. This would be an easier, hassle-free approach.
This blog covers the first approach in detail. Towards the end, the post also addresses some of the challenges and limitations of this approach, so that you can make an informed decision on your next steps when moving data from SQS to S3.
SQS to S3: Building a Data Pipeline using AWS Lambda and AWS Firehose
The most straightforward approach to transfer data from SQS to S3 is to use standard AWS services like Lambda functions and Amazon Kinesis Data Firehose. AWS Lambda functions are serverless functions that allow users to execute arbitrary logic on Amazon's infrastructure. These functions can be triggered by specific events or scheduled at required intervals. It is fairly simple to write a Lambda function that is triggered by SQS messages and writes them to S3. The caveat is that this creates an S3 object for every message received, which is not always the ideal outcome. To create files in S3 after buffering the SQS messages for a fixed interval of time, there are two approaches.
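The simple one-object-per-message variant mentioned above might look like the following sketch. It assumes the function is attached to an SQS trigger, so the event carries a `Records` list whose entries have `messageId` and `body` fields. The bucket name `my-sqs-archive` and the `messages/` key prefix are hypothetical placeholders.

```python
BUCKET = "my-sqs-archive"  # hypothetical bucket name


def archive_records(event, s3, bucket=BUCKET):
    """Write each SQS record in a Lambda event to its own S3 object."""
    keys = []
    for record in event.get("Records", []):
        # One object per message, keyed by the SQS message ID.
        key = f"messages/{record['messageId']}.json"
        s3.put_object(Bucket=bucket, Key=key, Body=record["body"].encode("utf-8"))
        keys.append(key)
    return keys


def lambda_handler(event, context):
    # boto3 ships with the Lambda Python runtime; it is imported here so that
    # archive_records can be exercised locally with a stand-in client.
    import boto3

    return {"written": archive_records(event, boto3.client("s3"))}
```

When the handler returns without raising, Lambda deletes the batch from the queue; if it raises, the messages become visible again and are retried.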
Through a scheduled Lambda function
A scheduled Lambda function is executed at predefined intervals and can consume all the SQS messages that were produced during that interval. Once it has processed all the messages, it can write them to S3 as a single object, using a multipart upload through the S3 API if the result is large. To schedule a Lambda function, follow the steps below.
- Sign in to AWS console and go to the Lambda console
- Choose create function
- For the execution role, select create a new execution role with Lambda permissions
- Choose to use a blueprint. Blueprints are prototype code snippets that are already implemented to provide examples for users. Search for hello world blueprint in the search box and choose it.
- Click create function. On the next page, click to add a trigger.
- In the trigger search menu, search for and select CloudWatch Events (now part of Amazon EventBridge). CloudWatch Events are used to schedule Lambda functions
- Click create a new rule and select the rule type as schedule expression. A schedule expression takes a cron or rate expression. Enter a valid expression corresponding to your execution strategy.
- The Lambda function will contain code to read from SQS and upload the result to S3, optionally as a multipart upload. S3 accepts single PUT uploads of up to 5 GB, and AWS recommends multipart upload for objects larger than about 100 MB.
- Choose create a function to activate the Lambda function
Once this is configured, AWS CloudWatch will generate events on the schedule defined by the expression, and each event will trigger the Lambda function.
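The body of such a scheduled function could be sketched as below. This is a sketch under stated assumptions rather than a definitive implementation: for brevity it uses a single `put_object` call instead of a multipart upload, it joins message bodies with newlines, and `sqs` and `s3` are assumed to be boto3 clients. The queue's visibility timeout would need to be longer than one run, otherwise messages can reappear before they are deleted. A schedule expression such as `rate(15 minutes)` or `cron(0/15 * * * ? *)` would run it every 15 minutes.

```python
def drain_queue(sqs, queue_url, max_batches=100):
    """Collect messages with repeated receive calls until the queue looks empty."""
    messages = []
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
            WaitTimeSeconds=1,
        )
        batch = resp.get("Messages", [])
        if not batch:
            break
        messages.extend(batch)
    return messages


def archive_and_delete(sqs, s3, queue_url, bucket, key):
    """Write all currently available messages to one S3 object, then delete them."""
    messages = drain_queue(sqs, queue_url)
    if not messages:
        return 0

    body = "\n".join(m["Body"] for m in messages).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)

    # Delete only after the upload succeeded; a batch delete takes up to 10 entries.
    for start in range(0, len(messages), 10):
        chunk = messages[start:start + 10]
        entries = [
            {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
            for i, m in enumerate(chunk)
        ]
        sqs.delete_message_batch(QueueUrl=queue_url, Entries=entries)
    return len(messages)
```

Deleting after the upload means a crash between the two steps leads to duplicates in S3 on the next run rather than lost messages.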
A problem with this approach is that Lambda functions have an execution time ceiling of 15 minutes and a configurable memory ceiling (historically 3,008 MB, now up to 10,240 MB). If there are a large number of SQS messages, the function can run into the time or memory limit before it finishes, leaving messages unprocessed or, depending on when they are deleted from the queue, lost.
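One way to reduce the risk of hitting the time limit is to stop fetching new batches while there is still a safety margin left. The sketch below relies on the Lambda context method `get_remaining_time_in_millis()`; the `fetch_batch` and `handle_batch` callables and the 30-second margin are hypothetical and would be supplied by the surrounding function.

```python
def process_batches(fetch_batch, handle_batch, context, safety_margin_ms=30_000):
    """Fetch and handle batches until the source is empty or time runs low.

    The remaining time is checked before each fetch, so no batch is pulled
    from the queue that the function would not have time to handle.
    """
    handled = 0
    while context.get_remaining_time_in_millis() > safety_margin_ms:
        batch = fetch_batch()
        if not batch:
            break  # nothing left to process
        handle_batch(batch)
        handled += 1
    return handled
```

Messages that were not fetched simply stay in the queue for the next scheduled run.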
Using a triggered Lambda function and AWS firehose
The drawback of using a triggered Lambda function to move data from SQS to S3 is that it creates an S3 object per message, leading to a large number of destination files. A workaround is to use a buffered delivery stream that writes to S3 at predefined intervals. This approach involves the following broad set of steps.
Step 1: Create a triggered Lambda function
To create a triggered Lambda function, follow the same steps as in the first approach, but instead of a schedule expression, select an event source as the trigger. Amazon will provide you with a list of possible triggers. Select the SQS trigger and click create function. In the Lambda function, write custom code to forward the SQS messages to a Kinesis Data Firehose delivery stream.
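The forwarding code in the triggered function could be sketched as follows. It assumes the standard SQS trigger event shape (a `Records` list with a `body` per record) and uses the Firehose `put_record_batch` call, which accepts up to 500 records per request. The stream name `sqs-to-s3` is a hypothetical placeholder, and the trailing newline per record is a design choice so that the buffered S3 objects end up line-delimited.

```python
FIREHOSE_BATCH_LIMIT = 500  # maximum records per put_record_batch call
STREAM_NAME = "sqs-to-s3"   # hypothetical delivery stream name


def forward_to_firehose(event, firehose, stream_name=STREAM_NAME):
    """Forward every SQS record in the event to a Firehose delivery stream."""
    records = [
        {"Data": (r["body"] + "\n").encode("utf-8")}
        for r in event.get("Records", [])
    ]
    failed = 0
    for start in range(0, len(records), FIREHOSE_BATCH_LIMIT):
        resp = firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + FIREHOSE_BATCH_LIMIT],
        )
        failed += resp.get("FailedPutCount", 0)
    if failed:
        # Raising makes Lambda leave the SQS batch on the queue for a retry.
        raise RuntimeError(f"{failed} record(s) were not accepted by Firehose")
    return len(records)


def lambda_handler(event, context):
    # Imported here so forward_to_firehose can be tested with a stand-in client.
    import boto3

    return {"forwarded": forward_to_firehose(event, boto3.client("firehose"))}
```

Note that retrying the whole batch after a partial failure can send some records to Firehose twice, so downstream consumers should tolerate duplicates.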
Step 2: Create a Firehose Delivery Stream
- To create a delivery stream, go to AWS console and select the Kinesis Data Firehose Console.
- Choose the destination as S3. In the configuration options, you will be presented with options to select the buffer size and buffer interval.
Buffer size is the amount of data that Kinesis Firehose will buffer before writing it to S3 as an object. You can choose any value from 1 MB to 128 MB here.
Buffer interval is the amount of time that Firehose will wait before it writes to S3. You can select any value from 60 seconds to 900 seconds here. Firehose writes to S3 as soon as either threshold is reached. After selecting the buffer size and buffer interval, you can leave the other parameters at their defaults and click on create. That completes the pipeline to transfer data from SQS to S3.
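The same delivery stream can also be created programmatically. The sketch below only builds the request for the boto3 Firehose `create_delivery_stream` call and validates the buffering hints against the ranges quoted above. The function name, default values and ARNs are hypothetical, and the IAM role is assumed to already grant Firehose write access to the bucket.

```python
def build_delivery_stream_request(name, role_arn, bucket_arn, size_mb=5, interval_s=300):
    """Build keyword arguments for firehose.create_delivery_stream with an S3 destination."""
    if not 1 <= size_mb <= 128:
        raise ValueError("buffer size must be between 1 and 128 MB")
    if not 60 <= interval_s <= 900:
        raise ValueError("buffer interval must be between 60 and 900 seconds")
    return {
        "DeliveryStreamName": name,
        # DirectPut: producers (here, the Lambda function) write to the stream directly.
        "DeliveryStreamType": "DirectPut",
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            "BufferingHints": {
                "SizeInMBs": size_mb,
                "IntervalInSeconds": interval_s,
            },
        },
    }
```

The result would be passed on as `boto3.client("firehose").create_delivery_stream(**request)`.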
The main limitation with this approach is that the user does not have close control over when to write to S3 beyond the buffer interval and buffer size limits imposed by Amazon. These limits are not always practical in real scenarios.
SQS to S3: Limitations of the Custom-Code Approach
Both the approaches mentioned here use AWS provided functions. An obvious advantage here is that you can implement the whole pipeline staying inside the AWS ecosystem. But these approaches have a number of limitations as mentioned below.
- Both approaches require a lot of custom coding and knowledge of AWS-specific configuration. Some of these settings are confusing and can cost a significant amount of time and effort.
- AWS imposes limits on execution time, runtime memory and storage for the services used to accomplish this transfer. Working within these limits is not always practical in real scenarios.
An Easier Alternative to Move Data from SQS to S3
A better way is to use a solution like Hevo (14-day, risk-free trial) that enables setting up the complete SQS to S3 data transfer in a jiffy.
Hevo provides an intuitive user interface that will allow you to connect SQS (and hundreds of other sources) to S3 and move data in real-time.
Hevo’s fault-tolerant architecture ensures the data is moved securely and reliably without any loss. This, in turn, will enable your team to stop worrying about data and focus on the right aspects that can help the company grow.
Before you commit expensive engineering bandwidth to building a pipeline from SQS to S3, do explore Hevo by signing up for a no-commitment 14-day free trial.