S3 to Elasticsearch: 3 Easy Steps

August 26th, 2020

With the rise of IoT and online services, log parsing and text indexing have become critical needs across the IT sector. Indexing is required to optimize search results and reduce latency. Amazon Elasticsearch is a fully-managed service from Amazon, built on top of the open-source Elasticsearch stack, for indexing your data.

In this blog post, you will learn about S3 and AWS Elasticsearch, their features, and 3 easy steps to move data from AWS S3 to Elasticsearch. Read along to understand these steps and their benefits in detail!


Prerequisites

To transfer data from S3 to Elasticsearch, you must have:

  • Access to Amazon S3.
  • Access to Amazon Elasticsearch.
  • Basic understanding of data and data flow.
  • Python 3.6 or later installed.

Introduction to S3


Amazon Simple Storage Service, commonly known as Amazon S3, is an object storage service offered by Amazon. Amazon S3 provides scalable and secure data storage that customers and industries of all sizes can use to store data in any format, such as weblogs, application files, backups, code, and documents. Amazon S3 also provides high data availability and is designed for 99.999999999% (11 nines) data durability. These exceptional features make Amazon S3 a popular choice all around the world.

Amazon S3 has a simple UI that allows you to upload, modify, view, and manage your data. It also has excellent support for leading programming languages such as Python, Scala, and Java, which can interact with S3 through its API.

Amazon S3 allows users to manage data securely, and it also offers periodic backup and versioning of the data. You can use Amazon S3 with almost all of the leading ETL tools and programming languages to read, write, and transform data, as the short sketch below illustrates.
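
For example, here is a minimal boto3 sketch of writing and reading S3 objects. The bucket name 'my-log-bucket' and the file names are hypothetical placeholders, and the snippet assumes your AWS credentials are already configured:

import boto3

s3 = boto3.client('s3')

# Upload a local file to the (placeholder) bucket.
s3.upload_file('app.log', 'my-log-bucket', 'logs/app.log')

# Read the same object back into memory.
obj = s3.get_object(Bucket='my-log-bucket', Key='logs/app.log')
print(obj['Body'].read().decode('utf-8'))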

To learn more about Amazon S3, visit here.

Introduction to Amazon Elasticsearch


Elasticsearch is an open-source platform used for log analytics, application monitoring, indexing, full-text search, and much more. It is often used in a combination known as the ELK stack, which stands for Elasticsearch, Logstash, and Kibana.

Amazon Elasticsearch is a fully-managed, scalable service provided by Amazon that is easy to deploy and operate in the cloud. Amazon Elasticsearch exposes the native open-source Elasticsearch API, so your existing code and applications that use vanilla Elasticsearch work seamlessly. Amazon ES also provides built-in support for Logstash and Kibana, so you can quickly parse logs and text, then visualize and analyze them.
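
Because the service speaks the standard Elasticsearch REST API, any HTTP client can index and search documents. Here is a minimal sketch using the Python requests library; the domain endpoint and index name are placeholders, and authentication is omitted for brevity (the tutorial below shows how to sign requests with AWS4Auth):

import requests

host = 'https://my-es-domain.us-west-1.es.amazonaws.com'  # placeholder endpoint
headers = {'Content-Type': 'application/json'}

# Index a document, then search for it.
requests.post(host + '/test-index/_doc', json={'message': 'hello'}, headers=headers)
r = requests.get(host + '/test-index/_search', params={'q': 'message:hello'})
print(r.json())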

Amazon ES has excellent integration with CloudWatch Logs, which automatically loads your logs to Amazon Elasticsearch for quick analysis. You need to select the logs and specify the Elasticsearch domain. The Amazon ES integration moves the data continuously and automatically for analysis.

To learn more about AWS Elasticsearch, visit here.

Advantages of AWS Elasticsearch

To understand the importance of connecting S3 to Elasticsearch, you first need to learn the advantages of AWS Elasticsearch. AWS Elasticsearch is in such high demand because of the following advantages:

  • Easy to Use: AWS Elasticsearch is a fully-managed service provided by AWS. You can set up a production-ready cluster in minimal time, while AWS handles all the hardware, installation, infrastructure, and maintenance.
  • Open Source Support: AWS ES supports the open-source Elasticsearch API to ensure the smooth running of your existing applications. It has built-in support for Logstash for data loading and transformation, and Kibana for visualization. So once you transfer data from S3 to Elasticsearch, you can apply all these features to your data.
  • Secure Access: With the help of Amazon VPC, you can isolate your Elasticsearch cluster and enable all security aspects for a secure transfer of data.
  • High Availability: AWS ensures high availability of data across its services. S3, which stores the data, is designed for 99.999999999% durability.
  • Integration with Other AWS Services: AWS ES offers seamless built-in integration with other AWS services like Kinesis, Firehose, and CloudWatch.
  • Scalable: AWS ES automatically scales the cluster up and down as data volume changes. Once the S3 to Elasticsearch connection is in place, you can easily resize the cluster from the AWS Management Console.

Simplify AWS S3 and Elasticsearch ETL with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as databases, SaaS applications, cloud storage, SDKs, and streaming services, and simplifies the ETL process. It supports 100+ data sources like AWS S3 and Elasticsearch and loads the data onto the desired Data Warehouse, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.

Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss, and supports different forms of data. The solutions provided are consistent and work with different Business Intelligence (BI) tools as well.

Get Started with Hevo for Free

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Steps to Connect S3 to Elasticsearch

To load the data from S3 to Elasticsearch, you can use AWS Lambda to create a trigger that loads data continuously from S3 to Elasticsearch. Lambda watches the S3 location for new files and, on each event, triggers the code that indexes your file.

The process of loading data from Amazon S3 to Elasticsearch with AWS Lambda is very straightforward. The following steps are required to connect S3 to Elasticsearch using this method:
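
Before starting, it helps to know the shape of the event that S3 delivers to Lambda, since the handler code below reads the bucket name and object key from it. An abbreviated sketch (field values are illustrative):

# Abbreviated S3 PUT event, as delivered to the Lambda handler.
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-log-bucket"},   # bucket that fired the event
                "object": {"key": "logs/sample.log"}   # key of the uploaded file
            }
        }
    ]
}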

Step 1: Create a Lambda Deployment Package

The first step of transferring data from S3 to Elasticsearch requires you to set up a Lambda deployment package:

  • Open your favorite Python editor and create a package called s3ToES.
  • Create a Python file named “s3ToES.py” and add the following lines of code, editing the region and host values to match your environment.

Import Libraries:

import boto3
import re
import requests
from requests_aws4auth import AWS4Auth

Define Constants:

region = 'us-west-1'                     # AWS region of your Elasticsearch domain
service = 'es'
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, service, session_token=creds.token)
host = 'http://aws.xxxxxxxxxxx.com/es'   # your Elasticsearch domain endpoint
index = 'lambda-s3-file-index'           # index the parsed documents will land in
type = 'lambda-type'                     # document type used for indexing
url = host + "/" + index + "/" + type
headers = { "Content-Type": "application/json" }
s3 = boto3.client('s3')
  • Create regular expressions to parse the logs. The backslashes are essential: \d matches a digit, \w a word character, and \s a whitespace character.
pattern_ip = re.compile(r'(\d+\.\d+\.\d+\.\d+)')
pattern_time = re.compile(r'\[(\d+/\w\w\w/\d\d\d\d:\d\d:\d\d:\d\d\s-\d\d\d\d)\]')
pattern_msg = re.compile('"(.+)"')
  • Define the Lambda handler function that will transfer the data from S3 to Elasticsearch.
def parseLog(event, context):
    for record in event["Records"]:
        # From the event record, get the bucket name and key.
        bucket_name = record['s3']['bucket']['name']
        file_key = record['s3']['object']['key']
        # From the S3 object, read the file and split the lines.
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        body = obj['Body'].read().decode('utf-8')
        lines = body.splitlines()
        # For each line, match the regexes and index the parsed document.
        for line in lines:
            ip = pattern_ip.search(line).group(1)
            timestamp = pattern_time.search(line).group(1)
            message = pattern_msg.search(line).group(1)
            parsed_doc = { "ip": ip, "timestamp": timestamp, "message": message }
            r = requests.post(url, auth=awsauth, json=parsed_doc, headers=headers)
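
Before packaging, you can sanity-check the three regular expressions locally against a sample log line, with no AWS calls involved. A quick sketch, assuming the pattern definitions above are in scope:

# Run in the same module (or a Python shell after pasting the regex definitions).
sample = '12.345.678.90 - [23/Aug/2020:13:55:36 -0700] "PUT /some-file.jpg"'
print(pattern_ip.search(sample).group(1))    # 12.345.678.90
print(pattern_time.search(sample).group(1))  # 23/Aug/2020:13:55:36 -0700
print(pattern_msg.search(sample).group(1))   # PUT /some-file.jpg
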
  • Install the Python packages into the folder where the code resides. The commands are the same on Windows and Linux:

cd s3ToES
pip install requests -t .
pip install requests_aws4auth -t .
  • Package the code and the dependencies:

Windows:

Right-click on the s3ToES folder and create a zip package

Linux:

zip -r lambda.zip *
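
If you prefer a cross-platform alternative to the steps above, Python's standard library can build the same archive. A small sketch, run from the directory that contains the s3ToES folder:

import shutil

# Zip the contents of the s3ToES folder into lambda.zip.
shutil.make_archive('lambda', 'zip', 's3ToES')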

Step 2: Create the Lambda Function

Once you have successfully created the deployment package, you need to create a Lambda function to deploy it. Only then can you start transferring data from S3 to Elasticsearch.

  • Search for the AWS Lambda in the AWS Console, and then click on Create Function.
  • Once you create the function, you need to add a trigger that will invoke it when the event happens. In this example, we want the code to run whenever a log file arrives in an S3 bucket. Follow the steps below to create the trigger:
  • Choose S3.
  • Choose your bucket.
  • For Event Type, choose PUT.
  • For Prefix, type logs/.
  • For Suffix, type .log.
  • Select Enable trigger.
  • Choose Add.
  • For Handler, type s3ToES.parseLog. This setting tells Lambda that the file is s3ToES.py and the method to invoke after the trigger is parseLog.
  • Select the zip file as the code entry type and upload the zip file created above.
  • Choose Save.

Once you save, the Lambda function is ready for execution. After testing the function, you can proceed to transfer data from S3 to Elasticsearch.
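
As an alternative to the console, you could also create the function programmatically with boto3. A sketch under stated assumptions: the function name is arbitrary, and the IAM role ARN is a placeholder you must replace with a role that has Lambda and S3 permissions:

import boto3

client = boto3.client('lambda')

with open('lambda.zip', 'rb') as f:
    client.create_function(
        FunctionName='s3-to-es',                        # placeholder name
        Runtime='python3.8',                            # assumed Python 3 runtime
        Role='arn:aws:iam::123456789012:role/my-role',  # placeholder role ARN
        Handler='s3ToES.parseLog',                      # file.method, as configured above
        Code={'ZipFile': f.read()},
    )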

Step 3: Test the Lambda Function

  • To test the Lambda function, upload a file to the S3 location (you can also script this step; see the sketch after the search output below).
  • Create a file named sample.log and add the following log lines:
12.345.678.90 - [23/Aug/2020:13:55:36 -0700] "PUT /some-file.jpg"
12.345.678.91 - [23/Aug/2020:14:56:14 -0700] "GET /some-file.jpg"
  • Once you upload the file, Lambda invokes the function, and Elasticsearch indexes the log.
  • Go to the Elasticsearch console or Kibana and verify that the ‘lambda-s3-file-index’ index contains two documents.
GET https://es-domain/lambda-s3-file-index/_search?pretty
{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "lambda-s3-file-index",
        "_type" : "lambda-type",
        "_id" : "vTYXaOWEKWV_TTkweuSDg",
        "_score" : 1.0,
        "_source" : {
          "ip" : "12.345.678.91",
          "message" : "GET /some-file.jpg",
          "timestamp" : "23/Aug/2020:14:56:14 -0700"
        }
      },
      {
        "_index" : "lambda-s3-index",
        "_type" : "lambda-type",
        "_id" : "vjYmaWIBJfd_TTkEuCAB",
        "_score" : 1.0,
        "_source" : {
          "ip" : "12.345.678.90",
          "message" : "PUT /some-file.jpg",
          "timestamp" : "23/Aug/2020:13:55:36 -0700"
        }
      }
    ]
  }
}
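
If you'd rather script the upload step than use the S3 console, here is a boto3 sketch; the bucket name is a placeholder, and the logs/ prefix matches the trigger configured in Step 2:

import boto3

# Upload the sample log under the logs/ prefix so the trigger fires.
s3 = boto3.client('s3')
s3.upload_file('sample.log', 'my-log-bucket', 'logs/sample.log')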

That’s it! Your S3 to Elasticsearch connection is ready.

Conclusion

In this blog post, you have learned about Amazon Elasticsearch, its features, and how to load data from S3 to Elasticsearch. The one drawback of this approach is that it requires writing a fair amount of code, which may not suit non-programmers.

Visit our Website to Explore Hevo

However, if you’re looking for an easier solution, we recommend you try Hevo Data, a No-code Data Pipeline that helps you transfer data from a source of your choice in a fully automated and secure manner, without having to write any code. Hevo, with its strong integration with 100+ sources & BI tools like AWS S3 and Elasticsearch, allows you to export & load data, transform & enrich your data, & make it analysis-ready in a jiffy.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your understanding of Connecting S3 to Elasticsearch in the comments section below.
