With the growth of IoT and online services, log parsing and text indexing have become critical requirements for many IT teams. Indexing optimizes search results and reduces latency. Amazon Elasticsearch is a service offered by Amazon that is built on top of the open-source Elasticsearch stack and provides a fully managed way to index your data.

In this blog post, you will learn about S3 and AWS Elasticsearch, their features, and 3 easy steps to move data from AWS S3 to Elasticsearch. Read along to understand these steps and their benefits in detail!

Prerequisites

To transfer data from S3 to Elasticsearch, you must have:

  • Access to Amazon S3.
  • Access to Amazon Elasticsearch.
  • Basic understanding of data and data flow.
  • Python 3.6 or later installed.

Introduction to S3


Amazon Simple Storage Service, commonly known as Amazon S3, is an object storage service offered by Amazon. It provides scalable and secure storage that customers and industries of all sizes can use to store data in any format, such as weblogs, application files, backups, code, and documents. Amazon S3 also provides high data availability and is designed for 99.999999999% data durability. These features make Amazon S3 a popular choice all around the world.

Amazon S3 has a simple UI that allows you to upload, modify, view, and manage data. It also has strong support for leading programming languages such as Python, Scala, and Java, which can interact with S3 through its API.

Amazon S3 allows users to manage the data securely, and it also offers periodic backup and versioning of the data. You can use Amazon S3 with almost all of the leading ETL tools and programming languages to read, write, and transform the data.
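
As a small illustration of that API access, the sketch below reads one object from S3 and returns its lines as text. It assumes the boto3 library; the function name read_s3_lines and the bucket and key shown in the comments are hypothetical placeholders, not values from this post.

```python
def read_s3_lines(s3_client, bucket, key, encoding="utf-8"):
    """Fetch one S3 object and return its contents as a list of text lines."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # The body is a byte stream, so decode it before splitting into lines.
    return response["Body"].read().decode(encoding).splitlines()


# Example call (requires AWS credentials; names are placeholders):
#   import boto3
#   lines = read_s3_lines(boto3.client("s3"), "my-log-bucket", "logs/sample.log")
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object before pointing it at a real bucket.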

To learn more about Amazon S3, visit here.


Introduction to Amazon Elasticsearch


Elasticsearch is an open-source platform used for log analytics, application monitoring, indexing, text search, and more. It is commonly used as part of the ELK stack, which stands for Elasticsearch, Logstash, and Kibana.

Amazon Elasticsearch is a fully managed, scalable service provided by Amazon that is easy to deploy and operate in the cloud. It offers the native open-source Elasticsearch API, so existing code and applications that use vanilla Elasticsearch work seamlessly. Amazon ES also provides built-in support for Logstash and Kibana, so you can quickly parse logs and text and then visualize and analyze them.

Amazon ES has excellent integration with CloudWatch Logs, which automatically loads your logs to Amazon Elasticsearch for quick analysis. You need to select the logs and specify the Elasticsearch domain. The Amazon ES integration moves the data continuously and automatically for analysis.

To learn more about AWS Elasticsearch, visit here.

Advantages of AWS Elasticsearch

To understand the importance of connecting S3 to Elasticsearch, you first need to learn the advantages of AWS Elasticsearch. AWS Elasticsearch is popular because of the following advantages:

  • Easy to Use: AWS Elasticsearch is a fully managed service provided by AWS. You can set up a production-ready cluster in very little time. AWS handles all the hardware, installation, infrastructure, and maintenance.
  • Open Source Support: AWS ES supports the open-source API to ensure the smooth running of your existing applications. It has built-in support for Logstash for data loading and transformation, and for Kibana for visualization. So once you transfer data from S3 to Elasticsearch, you can use all these features on your data.
  • Secure Access: With the help of AWS VPC, you can isolate your Elasticsearch cluster and secure the data in transit.
  • High Availability: AWS ensures the high availability of data across its services. The source data stays in S3, which is designed for 99.999999999% durability.
  • Integration with Other AWS Services: AWS ES offers seamless built-in integration with other AWS services like Kinesis, Firehose, CloudWatch, etc.
  • Scalable: AWS ES lets you scale the cluster up or down as your data volume changes. Once the S3 to Elasticsearch connection is in place, you can easily set up cluster resizing from the AWS Management Console.
Download the Ultimate Guide on Database Replication
Learn the 3 ways to replicate databases & which one you should prefer.
Integrate Your Elasticsearch and S3 Data Using Hevo’s No Code Data Pipeline

Hevo Data, an Automated No-code Data Pipeline, helps you directly transfer data from Elasticsearch and S3 to Business Intelligence tools, Data Warehouses, or a destination of your choice in a completely hassle-free & automated manner. Hevo’s end-to-end Data Management connects you to Elasticsearch’s cluster using the Elasticsearch Transport Client and synchronizes your cluster data using indices. Hevo’s Pipeline allows you to leverage the services of both Generic Elasticsearch & AWS Elasticsearch. Hevo also enables you to load data from S3 buckets into your Destination database or Data Warehouse seamlessly. Files in S3 buckets are often stored compressed in Gzip format. Hevo’s Data pipeline automatically unzips any Gzipped files on ingestion and also performs file re-ingestion in case there is any data update.

Hevo is fully managed and completely automates the process of not only loading data from 100+ data sources (including 40+ free sources) but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure and flexible manner with zero data loss. Hevo’s consistent & reliable solution to manage data in real-time allows you to focus more on Data Analysis, instead of Data Consolidation. 

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that your S3 and Elasticsearch data is handled in a secure, consistent manner with zero data loss.
  • Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data from S3 buckets and Elasticsearch files and maps it to the destination schema.
  • Quick Setup: With its automated features, Hevo can be set up in minimal time. Its simple and interactive UI also makes it easy for new customers to work on and perform operations.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use in your ETL process.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

With continuous Real-Time data movement, load your data from Elasticsearch & S3 to your destination warehouse with Hevo’s easy-to-setup and No-code interface. Try our 14-day full access free trial.

Get Started with Hevo for Free

Steps to Connect S3 to Elasticsearch

To load data from S3 to Elasticsearch, you can use AWS Lambda to create a trigger that loads the data continuously. Lambda watches the S3 location for new files, and when a file arrives, the event triggers the code that indexes it.

The process of loading data from Amazon S3 to Elasticsearch with AWS Lambda is very straightforward. The following steps are required to connect S3 to Elasticsearch using this method:
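
To make the trigger concrete, here is a minimal sketch of how code can pull the bucket name and object key out of the event that S3 hands to Lambda. The sample event is trimmed to the fields the handler in this post reads, and the bucket name and key are hypothetical. Object keys arrive URL-encoded in the event, so the sketch decodes them before use.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return (bucket, key) pairs for every record in an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys are URL-encoded in the event (for example, a space becomes "+").
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


# Trimmed, hypothetical event containing only the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-log-bucket"},
                "object": {"key": "logs/app+server.log"},
            }
        }
    ]
}

print(s3_objects_from_event(sample_event))
# prints [('my-log-bucket', 'logs/app server.log')]
```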

Step 1: Create a Lambda Deployment Package

The first step of transferring data from S3 to Elasticsearch is to set up a Lambda deployment package:

  • Open your favorite Python editor and create a package called s3ToES.
  • Create a Python file named “s3ToES.py” and add the following lines of code. Edit the region and host in the code to match your setup.

Import Libraries:

import boto3
import re
import requests
from requests_aws4auth import AWS4Auth

Define Constants:

region = 'us-west-1'  # Edit: the region of your Amazon ES domain
service = 'es'
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, service, session_token=creds.token)
host = 'http://aws.xxxxxxxxxxx.com/es'  # Edit: your Amazon ES domain endpoint
index = 'lambda-s3-file-index'
doc_type = 'lambda-type'  # Named doc_type to avoid shadowing the built-in type()
url = host + "/" + index + "/" + doc_type
headers = { "Content-Type": "application/json" }
s3 = boto3.client('s3')

  • Create regular expressions to parse the logs.

pattern_ip = re.compile(r'(\d+\.\d+\.\d+\.\d+)')
pattern_time = re.compile(r'\[(\d+/\w\w\w/\d\d\d\d:\d\d:\d\d:\d\d\s-\d\d\d\d)\]')
pattern_msg = re.compile(r'"(.+)"')

  • Define the Lambda handler function that transfers the data from S3 to Elasticsearch.

def parseLog(event, context):
    for record in event["Records"]:
        # From the event record, get the bucket name and key.
        bucket_name = record['s3']['bucket']['name']
        file_key = record['s3']['object']['key']
        # From the S3 object, read the file and split it into lines.
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        # read() returns bytes; decode so the str regex patterns can match.
        body = obj['Body'].read().decode('utf-8')
        lines = body.splitlines()
        # For each line, match the regexes and index the parsed document.
        for line in lines:
            ip = pattern_ip.search(line).group(1)
            timestamp = pattern_time.search(line).group(1)
            message = pattern_msg.search(line).group(1)
            parsed_doc = { "ip": ip, "timestamp": timestamp, "message": message }
            r = requests.post(url, auth=awsauth, json=parsed_doc, headers=headers)
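
The handler calls .group(1) on every search result, which raises an AttributeError as soon as a line does not match one of the patterns (a blank line, for instance). The sketch below shows one possible way to guard against that. The helper name parse_line is hypothetical, and the three patterns are this sketch's reading of the expressions above with their backslash escapes written out.

```python
import re

PATTERN_IP = re.compile(r"(\d+\.\d+\.\d+\.\d+)")
PATTERN_TIME = re.compile(r"\[(\d+/\w\w\w/\d\d\d\d:\d\d:\d\d:\d\d\s-\d\d\d\d)\]")
PATTERN_MSG = re.compile(r'"(.+)"')


def parse_line(line):
    """Return a document dict for one log line, or None if any field is missing."""
    ip = PATTERN_IP.search(line)
    timestamp = PATTERN_TIME.search(line)
    message = PATTERN_MSG.search(line)
    if not (ip and timestamp and message):
        return None
    return {
        "ip": ip.group(1),
        "timestamp": timestamp.group(1),
        "message": message.group(1),
    }


doc = parse_line('12.345.678.90 - [23/Aug/2020:13:55:36 -0700] "PUT /some-file.jpg"')
print(doc["ip"], "|", doc["timestamp"], "|", doc["message"])
# prints 12.345.678.90 | 23/Aug/2020:13:55:36 -0700 | PUT /some-file.jpg
print(parse_line(""))
# prints None
```

A handler built this way could skip lines for which parse_line returns None instead of failing the whole invocation.
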

  • Install the Python packages into the folder where the code resides. The commands are the same on Windows and Linux:

cd s3ToES
pip install requests -t .
pip install requests_aws4auth -t .

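  • Package the code and its dependencies. Lambda looks for s3ToES.py at the root of the archive, so zip the contents of the folder rather than the folder itself.

Windows:

Open the s3ToES folder, select all of its files, right-click, and create a zip package.

Linux:

zip -r lambda.zip *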
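
If you would rather build the archive the same way on both platforms, a minimal sketch using Python's standard library is shown below. The function name build_lambda_zip and the example paths are hypothetical. The sketch assumes the output location is outside the source folder, so the archive does not end up including itself.

```python
import shutil
from pathlib import Path


def build_lambda_zip(source_dir, output_base):
    """Zip the contents of source_dir so its files sit at the archive root."""
    source = Path(source_dir).resolve()
    if not (source / "s3ToES.py").is_file():
        raise FileNotFoundError("s3ToES.py not found in " + str(source))
    # root_dir makes archive paths relative to the folder itself, so the
    # handler module is stored as "s3ToES.py" and not "s3ToES/s3ToES.py".
    return shutil.make_archive(str(Path(output_base).resolve()), "zip", root_dir=str(source))


# Example call (paths are placeholders):
#   build_lambda_zip("s3ToES", "build/lambda")
```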

Step 2: Create the Lambda Function

Once you have successfully created the deployment package, create a Lambda function and deploy the package to it. Only then can you start transferring data from S3 to Elasticsearch.

  • Search for the AWS Lambda in the AWS Console, and then click on Create Function.
  • Once you create the function, add a trigger that invokes it when the event occurs. In this example, the code should run whenever a log file arrives in an S3 bucket. Follow the steps below to create the trigger:
      • Choose S3.
      • Choose your bucket.
      • For Event Type, choose PUT.
      • For Prefix, type logs/.
      • For Suffix, type .log.
      • Select Enable trigger.
      • Choose Add.
  • For Handler, type s3ToES.parseLog. This setting tells Lambda that the file is s3ToES.py and that the method to invoke after the trigger fires is parseLog.
  • Select the zip file as the code entry type and upload the zip file created above.
  • Choose Save.

Once you save, the Lambda function is ready to run. After you test the function, you can proceed to transfer data from S3 to Elasticsearch.
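
The trigger settings chosen in the console can also be expressed in code. The sketch below builds the equivalent bucket notification configuration and applies it with boto3's put_bucket_notification_configuration call. The helper names and the ARN in the comment are hypothetical placeholders. Note that this call replaces any notification configuration already on the bucket, and that S3 needs permission to invoke the function, which this sketch does not set up.

```python
def build_s3_trigger_config(function_arn, prefix="logs/", suffix=".log"):
    """Build a notification configuration for PUT events on matching keys."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
        ]
    }


def apply_s3_trigger(s3_client, bucket, config):
    """Attach the configuration to the bucket, replacing any existing one."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=config
    )


# Example call (requires AWS credentials; the ARN and bucket are placeholders):
#   import boto3
#   config = build_s3_trigger_config("arn:aws:lambda:us-west-1:123456789012:function:s3ToES")
#   apply_s3_trigger(boto3.client("s3"), "my-log-bucket", config)
```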

Step 3: Test the Lambda Function

  • To test the Lambda function, upload a file to the S3 location that the trigger watches (the logs/ prefix of your bucket).
  • Create a file named sample.log and add the following log lines:

12.345.678.90 - [23/Aug/2020:13:55:36 -0700] "PUT /some-file.jpg"
12.345.678.91 - [23/Aug/2020:14:56:14 -0700] "GET /some-file.jpg"

  • Once you upload the file, S3 invokes the Lambda function, which parses each line and sends it to Elasticsearch for indexing.
  • Go to the Elasticsearch console or Kibana and verify that the ‘lambda-s3-file-index’ index contains two documents.

GET https://es-domain/lambda-s3-file-index/_search?pretty
{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "lambda-s3-file-index",
        "_type" : "lambda-type",
        "_id" : "vTYXaOWEKWV_TTkweuSDg",
        "_score" : 1.0,
        "_source" : {
          "ip" : "12.345.678.91",
          "message" : "GET /some-file.jpg",
          "timestamp" : "23/Aug/2020:14:56:14 -0700"
        }
      },
      {
        "_index" : "lambda-s3-file-index",
        "_type" : "lambda-type",
        "_id" : "vjYmaWIBJfd_TTkEuCAB",
        "_score" : 1.0,
        "_source" : {
          "ip" : "12.345.678.90",
          "message" : "PUT /some-file.jpg",
          "timestamp" : "23/Aug/2020:13:55:36 -0700"
        }
      }
    ]
  }
}
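
If you want to check the result from a script rather than by eye, the sketch below reads the document count out of a search response. The helper name total_hits is hypothetical. It handles both shapes of hits.total: a bare integer, as in the response above, and the object form with a "value" field that Elasticsearch 7 and later return.

```python
def total_hits(search_response):
    """Return the number of matching documents from a _search response."""
    total = search_response["hits"]["total"]
    # Elasticsearch 7+ reports {"value": N, "relation": "eq"};
    # earlier versions report a bare integer.
    if isinstance(total, dict):
        return total["value"]
    return total


# Trimmed responses in both shapes.
older_style = {"hits": {"total": 2, "hits": []}}
newer_style = {"hits": {"total": {"value": 2, "relation": "eq"}, "hits": []}}

print(total_hits(older_style), total_hits(newer_style))
# prints 2 2
```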

That’s it! Your S3 to Elasticsearch connection is ready.

Conclusion

In this blog post, you have learned about Amazon Elasticsearch, its features, and how you can load data from S3 to Elasticsearch. The one drawback of this approach is that you need to write a lot of code, which may not suit a non-programmer. Furthermore, you will have to use a lot of resources and engineering bandwidth to build an in-house solution from scratch if you wish to transfer your data from Elasticsearch or S3 to a Data Warehouse for analysis.

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 150+ data sources (including 40+ free sources) and can seamlessly transfer your Elasticsearch and S3 data to the Data Warehouse of your choice in real-time. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner without having to write any code. It will make your life easier and make data migration hassle-free.

FAQ on S3 to Elasticsearch

How do I transfer data from S3 to Elasticsearch?

To transfer data from S3 to Elasticsearch, you can use Logstash or AWS Lambda. Here’s a brief overview of each method:
a) Using Logstash
– Install Logstash
– Create Logstash configuration file
– Run Logstash with the configuration file
b) Using AWS Lambda
– Create an AWS Lambda Function
– Set Up an S3 Trigger for the Lambda Function
– Write Lambda function code to read logs from S3
– Deploy and test the lambda function

Can Elasticsearch use S3?

Elasticsearch can use Amazon S3 as a repository for snapshots, which are backups of your indices. This is useful for disaster recovery or migrating data between clusters. The typical steps are:
– Install the S3 Repository Plugin
– Configure the S3 Repository
– Register the S3 Repository
– Create a Snapshot
– Restore from a Snapshot
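
The register, snapshot, and restore steps above map onto three calls to the Elasticsearch snapshot API. The sketch below only builds the method, path, and body for each request and leaves sending them to your HTTP client of choice. The helper names, repository name, bucket, and role ARN are hypothetical, and the region and role_arn settings are assumed to apply to the AWS-managed service, where requests must also be signed.

```python
def repository_request(repo_name, bucket, region=None, role_arn=None):
    """Request that registers an S3 bucket as a snapshot repository."""
    settings = {"bucket": bucket}
    # Assumption: region and role_arn are only needed on the AWS-managed service.
    if region:
        settings["region"] = region
    if role_arn:
        settings["role_arn"] = role_arn
    return "PUT", f"/_snapshot/{repo_name}", {"type": "s3", "settings": settings}


def snapshot_request(repo_name, snapshot_name):
    """Request that creates a snapshot in a registered repository."""
    return "PUT", f"/_snapshot/{repo_name}/{snapshot_name}"


def restore_request(repo_name, snapshot_name):
    """Request that restores indices from an existing snapshot."""
    return "POST", f"/_snapshot/{repo_name}/{snapshot_name}/_restore"


method, path, body = repository_request("my-s3-repo", "my-snapshot-bucket")
print(method, path, body)
# prints PUT /_snapshot/my-s3-repo {'type': 's3', 'settings': {'bucket': 'my-snapshot-bucket'}}
```
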

When should you not use Elasticsearch?

Elasticsearch is a powerful search and analytics engine, but it’s not suitable for every use case. Here are scenarios where you might not want to use Elasticsearch:
– Transactional Databases
– Frequent Data Updates
– Complex Joins and Relations
– Real-Time Analytics on large datasets

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.

Share your understanding of Connecting S3 to Elasticsearch in the comments section below.

Vishal Agrawal
Technical Content Writer, Hevo Data

Vishal Agrawal is a Data Engineer with 10+ years of experience in the data field. He has designed scalable and efficient data solutions, and his expertise lies in AWS, Azure, Spark, GCP, SQL, Python, and other related technologies. By combining his passion for writing and the knowledge he has acquired over the years, he wishes to help data practitioners solve the day-to-day challenges they face in data engineering. In his articles, Vishal applies his analytical thinking and problem-solving approaches to untangle the intricacies of data integration and analysis.