With the rise of IoT and online services, log parsing and text indexing have become critical requirements for many IT organizations. Indexing is required to optimize search results and reduce latency. Amazon Elasticsearch is a service offered by Amazon that is built on top of the open-source Elasticsearch stack and provides fully-managed indexing for your data.
In this blog post, you will learn about S3 and AWS Elasticsearch, their features, and 3 easy steps to move data from AWS S3 to Elasticsearch. Read along to understand these steps and their benefits in detail!
Prerequisites
To transfer data from S3 to Elasticsearch, you must have:
- Access to Amazon S3.
- Access to Amazon Elasticsearch.
- Basic understanding of data and data flow.
- Python 3.6 or later installed.
Introduction to S3
Amazon Simple Storage Service, commonly known as Amazon S3, is an object storage service offered by Amazon. Amazon S3 provides scalable and secure data storage that customers and industries of all sizes can use to store data in any format, such as weblogs, application files, backups, code, and documents. Amazon S3 also provides high availability and is designed for 99.999999999% (11 nines) of data durability. Amazon S3 is a popular choice all around the world due to its exceptional features.
Are you looking for ways to connect your cloud storage tools like Amazon S3 or DynamoDB? Hevo has helped customers across 45+ countries connect their cloud storage and migrate data seamlessly. Hevo streamlines the process of migrating data by offering:
- Seamless data transfer between Amazon S3, DynamoDB, and 150+ other sources.
- Risk management and security framework for cloud-based systems with SOC2 Compliance.
- Always up-to-date data with real-time data sync.
Don’t just take our word for it. Try Hevo and experience why industry leaders like Whatfix say, “We’re extremely happy to have Hevo on our side.”
Features of Amazon S3
- Amazon S3 has a simple UI that allows you to upload, modify, view, and manage the data.
- It also has excellent support for leading programming languages such as Python, Scala, and Java, which can interact with S3 through its API.
- Amazon S3 allows users to manage the data securely, and it also offers periodic backup and versioning of the data.
- You can use Amazon S3 with almost all of the leading ETL tools and programming languages to read, write, and transform the data.
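As a quick illustration of that API support, here is a minimal boto3 sketch that uploads an object and reads it back. The bucket and key names are placeholders, and it assumes AWS credentials are already configured in your environment:
import boto3

# Create an S3 client using the credentials configured in your environment.
s3 = boto3.client('s3')

# Upload a local file to a (hypothetical) bucket and key.
s3.upload_file('weblog.txt', 'my-example-bucket', 'logs/weblog.txt')

# Read the same object back and print its contents.
obj = s3.get_object(Bucket='my-example-bucket', Key='logs/weblog.txt')
print(obj['Body'].read().decode('utf-8'))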
To learn more about Amazon S3, visit here.
Introduction to Amazon Elasticsearch
Elasticsearch is an open-source platform used for log analytics, application monitoring, indexing, text search, and much more. It is often deployed as part of the ELK stack, which stands for Elasticsearch, Logstash, and Kibana.
Amazon Elasticsearch is a fully-managed, scalable service provided by Amazon that is easy to deploy and operate in the cloud. Amazon Elasticsearch offers the native open-source Elasticsearch API, so existing code and applications that use vanilla Elasticsearch will work seamlessly. Amazon ES also provides built-in support for Logstash and Kibana to quickly parse logs and text and to visualize and analyze them.
Amazon ES has excellent integration with CloudWatch Logs, which can automatically load your logs into Amazon Elasticsearch for quick analysis. You simply select the logs and specify the Elasticsearch domain, and the integration moves the data continuously and automatically for analysis.
To learn more about AWS Elasticsearch, visit here.
Advantages of AWS Elasticsearch
To understand the importance of connecting S3 to Elasticsearch, you first need to learn the advantages of AWS Elasticsearch. AWS Elasticsearch is in high demand because of the following advantages:
- Easy to Use: AWS Elasticsearch is a fully-managed service provided by AWS. You can easily set up a production-ready cluster in a minimal time frame, while AWS handles all the hardware, installation, infrastructure, and maintenance.
- Open-Source Support: AWS ES supports the open-source API to ensure the smooth running of your existing applications. It has built-in support for Logstash for data loading and transformation, and for Kibana for visualization. So once you transfer data from S3 to Elasticsearch, you can apply all these features to your data.
- Secure Access: With the help of AWS VPC, you can isolate your Elasticsearch cluster and enable all security aspects for a secure transition of data.
- High Availability: AWS ensures the high availability of data across its services. S3, which you are using to store the data, is designed for 99.999999999% durability.
- Integration with Other AWS Services: AWS ES offers seamless built-in integration with other AWS services like Kinesis, Firehose, CloudWatch, etc.
- Scalable: AWS ES automatically scales the cluster up and down as the data volume changes. Once the S3 to Elasticsearch connection is in place, you can easily resize the cluster from the AWS Management Console.
Steps to Connect S3 to Elasticsearch
To load the data from S3 to Elasticsearch, you can use AWS Lambda to create a trigger that will load the data continuously from S3 to Elasticsearch. Lambda will watch the S3 location for files, and on each event it will trigger the code that indexes your file.
The process of loading data from Amazon S3 to Elasticsearch with AWS Lambda is very straightforward. The following steps are required to connect S3 to Elasticsearch using this method:
Step 1: Create a Lambda Deployment Package
The first step of transferring data from S3 to Elasticsearch requires you to set up the Lambda deployment package:
- Open your favorite Python editor and create a package (folder) called s3ToES.
- Create a Python file named “s3ToES.py” and add the following lines of code. Edit the region and host in the code below.
Import Libraries:
import boto3
import re
import requests
from requests_aws4auth import AWS4Auth
Define Constants:
region = 'us-west-1'
service = 'es'
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, service, session_token=creds.token)
host = 'https://aws.xxxxxxxxxxx.com/es'  # your Elasticsearch domain endpoint
index = 'lambda-s3-file-index'
doc_type = 'lambda-type'  # renamed from 'type' to avoid shadowing the Python built-in
url = host + '/' + index + '/' + doc_type
headers = { "Content-Type": "application/json" }
s3 = boto3.client('s3')
- Create Regular Expressions to parse the logs.
pattern_ip = re.compile(r'(\d+\.\d+\.\d+\.\d+)')
pattern_time = re.compile(r'\[(\d+/\w\w\w/\d\d\d\d:\d\d:\d\d:\d\d\s-\d\d\d\d)\]')
pattern_msg = re.compile(r'"(.+)"')
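As a quick sanity check (illustrative only, not part of the deployment package), the patterns can be run against one line in the log format used later in this post:
sample = '12.345.678.90 - [23/Aug/2020:13:55:36 -0700] "PUT /some-file.jpg"'
print(pattern_ip.search(sample).group(1))    # 12.345.678.90
print(pattern_time.search(sample).group(1))  # 23/Aug/2020:13:55:36 -0700
print(pattern_msg.search(sample).group(1))   # PUT /some-file.jpg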
- Define the Lambda handler function, which parses each log line and indexes it into Elasticsearch; this is the core of transferring data from S3 to Elasticsearch.
def parseLog(event, context):
    for record in event['Records']:
        # From the event record, get the bucket name and key.
        bucket_name = record['s3']['bucket']['name']
        file_key = record['s3']['object']['key']
        # From the S3 object, read the file and split the lines.
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        body = obj['Body'].read().decode('utf-8')
        lines = body.splitlines()
        # For each line, match the regexes and index the parsed document.
        for line in lines:
            ip = pattern_ip.search(line).group(1)
            timestamp = pattern_time.search(line).group(1)
            message = pattern_msg.search(line).group(1)
            parsed_doc = { "ip": ip, "timestamp": timestamp, "message": message }
            r = requests.post(url, auth=awsauth, json=parsed_doc, headers=headers)
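Before packaging the function, you can sanity-check the handler locally by calling it with a minimal event that mimics the shape of an S3 notification. The bucket and key below are placeholders for an object that already exists in your account:
sample_event = {
    'Records': [{
        's3': {
            'bucket': { 'name': 'my-example-bucket' },
            'object': { 'key': 'logs/sample.log' }
        }
    }]
}
parseLog(sample_event, None)  # the handler never uses the context argument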
- Install the Python packages to the folder where the code resides.
Windows :
cd s3ToES
pip install requests -t .
pip install requests_aws4auth -t .
Linux:
cd s3ToES
pip install requests -t .
pip install requests_aws4auth -t .
- Package the code and the dependencies –
Windows:
Right-click on the s3ToES folder and create a zip package
Linux:
zip -r lambda.zip *
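As a cross-platform alternative to the commands above, you can build the zip with Python’s standard library. This is a simplified sketch, run from the parent directory of s3ToES, that keeps file paths relative to the folder root as Lambda expects:
import os
import zipfile

# Walk the s3ToES folder and add every file to lambda.zip.
with zipfile.ZipFile('lambda.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, files in os.walk('s3ToES'):
        for name in files:
            full = os.path.join(root, name)
            zf.write(full, os.path.relpath(full, 's3ToES'))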
Step 2: Create the Lambda Function
Once you have successfully created the deployment package, you need to create a Lambda function to deploy the package built above. Only then can you start transferring data from S3 to Elasticsearch.
- Search for the AWS Lambda in the AWS Console, and then click on Create Function.
- Once you create the function, you need to add a trigger that will invoke it when the event happens. In this example, we want the code to run whenever a log file arrives in an S3 bucket. Follow the steps below to create the trigger:
- Choose S3.
- Choose your bucket.
- For Event Type, choose PUT.
- For Prefix, type logs/.
- For Suffix, type .log.
- Select Enable trigger.
- Choose Add.
- For Handler, type s3ToES.parseLog. This setting tells Lambda that the file is s3ToES.py and the method to invoke after the trigger is parseLog.
- Select the zip file as the code entry type and upload the zip file created above.
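If you prefer to script this instead of clicking through the console, the same function and trigger can also be created with boto3. This is a minimal sketch, not a definitive recipe: the account ID, IAM role ARN, runtime version, and bucket name are placeholders you must replace, and the role must already allow Lambda to read S3 and post to your Elasticsearch domain:
import boto3

lambda_client = boto3.client('lambda')
s3_client = boto3.client('s3')

# Create the function from the zip built in Step 1.
with open('lambda.zip', 'rb') as f:
    lambda_client.create_function(
        FunctionName='s3-to-es',
        Runtime='python3.9',  # placeholder runtime
        Role='arn:aws:iam::123456789012:role/my-lambda-role',  # placeholder
        Handler='s3ToES.parseLog',
        Code={'ZipFile': f.read()},
    )

# Allow S3 to invoke the function.
lambda_client.add_permission(
    FunctionName='s3-to-es',
    StatementId='s3-invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::my-example-bucket',  # placeholder
)

# Send PUT events under logs/ with suffix .log to the function.
s3_client.put_bucket_notification_configuration(
    Bucket='my-example-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-west-1:123456789012:function:s3-to-es',
            'Events': ['s3:ObjectCreated:Put'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'logs/'},
                {'Name': 'suffix', 'Value': '.log'},
            ]}},
        }]
    },
)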
Once you save, the Lambda function will be ready for execution. Once you have tested the function, you can proceed to transfer data from S3 to Elasticsearch.
Step 3: Test the Lambda Function
- To test the Lambda function, you need to upload a file to the S3 location; a small boto3 upload sketch follows the sample below.
- Create a file named sample.log and add the following log lines:
12.345.678.90 - [23/Aug/2020:13:55:36 -0700] "PUT /some-file.jpg"
12.345.678.91 - [23/Aug/2020:14:56:14 -0700] "GET /some-file.jpg"
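To push the file into the watched prefix, a minimal boto3 sketch (the bucket name is a placeholder) looks like this:
import boto3

s3 = boto3.client('s3')
# Upload sample.log under the logs/ prefix so it matches the trigger's prefix/suffix filter.
s3.upload_file('sample.log', 'my-example-bucket', 'logs/sample.log')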
- Once you upload the file, S3 will invoke the Lambda function, and Elasticsearch will index the parsed log lines.
- Go to the Elasticsearch console or Kibana and verify that the ‘lambda-s3-file-index’ index contains two documents.
GET https://es-domain/lambda-s3-file-index/_search?pretty
{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "lambda-s3-file-index",
        "_type" : "lambda-type",
        "_id" : "vTYXaOWEKWV_TTkweuSDg",
        "_score" : 1.0,
        "_source" : {
          "ip" : "12.345.678.91",
          "message" : "GET /some-file.jpg",
          "timestamp" : "23/Aug/2020:14:56:14 -0700"
        }
      },
      {
        "_index" : "lambda-s3-file-index",
        "_type" : "lambda-type",
        "_id" : "vjYmaWIBJfd_TTkEuCAB",
        "_score" : 1.0,
        "_source" : {
          "ip" : "12.345.678.90",
          "message" : "PUT /some-file.jpg",
          "timestamp" : "23/Aug/2020:13:55:36 -0700"
        }
      }
    ]
  }
}
That’s it! Your S3 to Elasticsearch connection is ready.
Benefits of Connecting S3 with Elasticsearch
- S3 stores large amounts of unstructured data (logs, files, backups, etc.), and Elasticsearch enables full-text search, making it easier to index and search through this data quickly.
- Elasticsearch can index and provide near real-time search functionality for data stored in S3, allowing faster insights from large datasets, such as logs or application data.
- S3 can store an unlimited amount of data, while Elasticsearch can be scaled independently to manage search and indexing performance, enabling highly scalable architecture.
- Tools like Amazon Elasticsearch Service, AWS Lambda, or Logstash can automatically ingest, transform, and index data from S3 to Elasticsearch, reducing manual effort.
Conclusion
In this blog post, you have learned about Amazon Elasticsearch, its features, and how you can load data from S3 to Elasticsearch. The one drawback of this approach is that you need to write a lot of code, which may not be practical for a non-programmer. Furthermore, you will have to spend significant resources and engineering bandwidth to build an in-house solution from scratch if you wish to transfer your data from Elasticsearch or S3 to a Data Warehouse for analysis.
Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 150+ data sources (including 40+ free sources) and can seamlessly transfer your Elasticsearch or S3 data to the Data Warehouse of your choice in real-time. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner without you having to write any code. It will make your life easier and make data migration hassle-free.
FAQ on S3 to Elasticsearch
How do I transfer data from S3 to Elasticsearch?
To transfer data from S3 to Elasticsearch, you can use Logstash or AWS Lambda. Here’s a brief overview of each method:
a) Using Logstash
– Install Logstash
– Create Logstash configuration file
– Run Logstash with the configuration file
b) Using AWS Lambda
– Create an AWS Lambda Function
– Set Up an S3 Trigger for the Lambda Function
– Write Lambda function code to read logs from S3
– Deploy and test the Lambda function
Can Elasticsearch use S3?
Elasticsearch can use Amazon S3 as a repository for snapshots, which are backups of your indices. This is useful for disaster recovery or migrating data between clusters.
– Install the S3 Repository Plugin
– Configure the S3 Repository
– Register the S3 Repository
– Create a Snapshot
– Restore from a Snapshot
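As an illustrative sketch of those steps on Amazon ES (the domain endpoint, bucket, and role ARN are placeholders, and the role must grant the domain access to the bucket), registering the repository and taking a snapshot come down to two signed REST calls:
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = 'us-west-1'
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, 'es', session_token=creds.token)
host = 'https://es-domain'  # placeholder endpoint

# Register an S3 bucket as a snapshot repository.
repo_body = {
    'type': 's3',
    'settings': {
        'bucket': 'my-snapshot-bucket',  # placeholder
        'region': region,
        'role_arn': 'arn:aws:iam::123456789012:role/my-snapshot-role',  # placeholder
    },
}
requests.put(host + '/_snapshot/my-s3-repo', auth=awsauth, json=repo_body)

# Take a snapshot of all indices into that repository.
requests.put(host + '/_snapshot/my-s3-repo/snapshot-1', auth=awsauth)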
When should you not use Elasticsearch?
Elasticsearch is a powerful search and analytics engine, but it’s not suitable for every use case. Here are scenarios where you might not want to use Elasticsearch:
– Transactional Databases
– Frequent Data Updates
– Complex Joins and Relations
– Real-Time Analytics on large datasets
Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
Share your understanding of Connecting S3 to Elasticsearch in the comments section below.
Vishal Agarwal is a Data Engineer with 10+ years of experience in the data field. He has designed scalable and efficient data solutions, and his expertise lies in AWS, Azure, Spark, GCP, SQL, Python, and other related technologies. By combining his passion for writing with the knowledge he has acquired over the years, he wishes to help data practitioners solve the day-to-day challenges they face in data engineering. In his articles, Vishal applies his analytical thinking and problem-solving approach to untangle the intricacies of data integration and analysis.