Elasticsearch to S3: Move Data Using Logstash

Tutorial • February 7th, 2020

If you are looking to move data from Elasticsearch to S3 for archival, analysis, or other use cases, you are in the right place. This post talks about how to move data from Elasticsearch to S3 using a few approaches. Before deep-diving into that, let us first understand these technologies briefly.

Understanding Elasticsearch

Elasticsearch accomplishes its super-fast search capabilities through the use of a Lucene-based, distributed inverted index. When a document is loaded into Elasticsearch, it creates an inverted index of all the fields in that document. An inverted index is an index where each entry is mapped to the list of documents that contain it. Data is stored in JSON form and can be queried using Elasticsearch's JSON-based Query DSL.

Elasticsearch has four main APIs – the Index API, the Get API, the Search API, and the Put Mapping API. The Index API adds documents to an index, the Get API retrieves documents by ID, the Search API queries the indexed data, and the Put Mapping API adds new fields to an already existing index.
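To make this concrete, here is a quick sketch of these four APIs using curl. It assumes a local cluster at localhost:9200 and a hypothetical index named customers; adjust the host, index, and document values to match your setup.

    # Index API: add a document with id 1 to the "customers" index
    curl -X PUT "http://localhost:9200/customers/_doc/1" \
      -H 'Content-Type: application/json' \
      -d '{"name": "Jane Doe", "city": "Berlin"}'

    # Get API: retrieve that document by id
    curl -X GET "http://localhost:9200/customers/_doc/1"

    # Search API: query the index using the JSON Query DSL
    curl -X GET "http://localhost:9200/customers/_search" \
      -H 'Content-Type: application/json' \
      -d '{"query": {"match": {"city": "Berlin"}}}'

    # Put Mapping API: add a new field to the existing index
    curl -X PUT "http://localhost:9200/customers/_mapping" \
      -H 'Content-Type: application/json' \
      -d '{"properties": {"signup_date": {"type": "date"}}}'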

A common practice is to use Elasticsearch as part of the standard ELK stack, which involves three components: Elasticsearch, Logstash, and Kibana. Logstash provides data loading and transformation capabilities, while Kibana provides visualization capabilities. Together, these three components form a powerful data stack.

Behind the scenes, Elasticsearch uses a cluster of servers to deliver high query performance. An index in Elasticsearch is a collection of documents. Each index is divided into shards that are distributed across different servers. By default, older Elasticsearch versions create 5 primary shards per index (newer versions default to 1), with each primary shard having a replica to boost search performance. Indexing requests are handled only by the primary shards, while search requests are served by both primary and replica shards.

The number of primary shards is a parameter that is fixed when an index is created. Users with deep knowledge of their data can override the default and allocate more shards per index. A point to note is that a small amount of data distributed across a large number of shards will degrade search performance.
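For example, a user who knows the expected data volume up front could override the defaults when the index is created. The sketch below uses a hypothetical index name and shard counts; tune them to your own data.

    # Create an index with 3 primary shards, each with 1 replica
    curl -X PUT "http://localhost:9200/orders" \
      -H 'Content-Type: application/json' \
      -d '{"settings": {"number_of_shards": 3, "number_of_replicas": 1}}'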

Scaling in Elasticsearch is accomplished by adding more servers; the cluster automatically rebalances the data and query load across the available nodes. Within a cluster, replica shards provide fault tolerance, and for disaster recovery, cross-cluster replication can keep a remote cluster in sync with the primary cluster so that it can serve as a hot standby.

Amazon offers a completely managed Elasticsearch service that is priced according to the number of instance hours of operational nodes. 

Understanding Amazon S3

AWS S3 is a fully managed object storage service used for a variety of use cases such as hosting data, backup and archiving, and data warehousing. Amazon handles all operational activities related to capacity scaling, pre-provisioning, etc., and customers only pay for the amount of storage that they actually use. It offers comprehensive access controls to meet any kind of organizational and business compliance requirement through an easy-to-use control panel interface.

S3 supports analytics through Amazon Athena and Amazon Redshift Spectrum, which let users execute SQL queries over data stored in S3. S3 buckets can also be protected with S3 default encryption; once enabled, all new objects written to that bucket are encrypted at rest.
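As a rough illustration, default encryption can be enabled for a bucket with the AWS CLI as shown below; the bucket name is a placeholder.

    # Enable S3 default encryption (SSE-S3 / AES-256) for a bucket
    aws s3api put-bucket-encryption \
      --bucket my-example-bucket \
      --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'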

S3 achieves high availability by storing data across a number of distributed servers. Naturally, there is a propagation delay associated with this approach, and S3 guarantees only eventual consistency for some operations. Writes are atomic, however: at any point in time, a read will return either the old data or the new data, but never a corrupted or partial response.

Conceptually, S3 is organized into buckets and objects. A bucket is the highest-level S3 namespace and acts as a container for storing objects. Buckets play a critical role in access control, and usage reporting is always aggregated at the bucket level. An object is the fundamental storage entity and consists of the object data as well as its metadata. An object is uniquely identified by its key and, when versioning is enabled, a version identifier.
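To make the bucket/key/object model concrete, here is a minimal AWS CLI sketch; the bucket name, region, and key below are placeholders.

    # Create a bucket (the top-level container)
    aws s3api create-bucket --bucket my-example-bucket --region us-east-1

    # Upload a local file as an object identified by the key "exports/es_dump.json"
    aws s3 cp ./es_dump.json s3://my-example-bucket/exports/es_dump.json

    # List object versions; each entry shows a key and, if versioning is enabled, a version id
    aws s3api list-object-versions --bucket my-example-bucket --prefix exports/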

Customers can choose the AWS regions in which their buckets are located according to their cost and latency requirements. A point to note here is that objects do not support locking: if two PUT requests for the same key arrive at the same time, the request with the latest timestamp wins. This means that if there is concurrent access, users will have to implement some kind of locking mechanism on their own.

Two Approaches to Move Data from Elasticsearch to S3

Data can be copied from Elasticsearch to S3 in either of two ways:

Approach 1: Write custom code using Logstash to move the data. You would have to invest both time and engineering bandwidth to build, set up, and monitor the ETL infrastructure.

Approach 2: Use a Data Pipeline Platform like Hevo Data that gets the same done in just a few clicks. Since Hevo is a fully-managed self-serve platform, it would be a hassle-free alternative to approach 1.

This blog covers approach 1 in great detail. The blog also highlights the challenges, shortcomings, and limitations of this approach so that you can evaluate all your alternatives and make the best choice.

Elasticsearch to S3: Building Custom Code

Moving data from Elasticsearch to S3 can be done in multiple ways. The most straightforward one is to write a script that queries all the data from an index and writes it into a CSV or JSON file. But the limits on the amount of data that can be queried at once make that approach a nonstarter for anything beyond small indexes; you will end up with errors ranging from timeouts to "result window is too large" responses (a single search is capped at 10,000 results by default). So we need to consider other approaches.
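For context, such a script typically pages through the index with from/size parameters, which is exactly where the "result window" errors come from. The sketch below assumes a local cluster and a hypothetical index named my_index.

    # Works for the first few pages...
    curl -X GET "http://localhost:9200/my_index/_search?from=0&size=1000" \
      -H 'Content-Type: application/json' \
      -d '{"query": {"match_all": {}}}'

    # ...but fails once from + size exceeds index.max_result_window (10,000 by default)
    curl -X GET "http://localhost:9200/my_index/_search?from=10000&size=1000" \
      -H 'Content-Type: application/json' \
      -d '{"query": {"match_all": {}}}'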

Logstash, a core part of the ELK stack, is a full-fledged data load and transformation utility. With some adjustment of its configuration parameters, it can export all the data in an Elasticsearch index to CSV or JSON. Logstash also ships with an S3 output plugin, which means the data can be exported to S3 directly without intermediate storage. Let us look at this approach and its limitations in detail.

Using Logstash

Logstash is a server-side data processing pipeline that can ingest data from a number of sources, process or transform it, and deliver it to a number of destinations. In this use case, the Logstash input will be Elasticsearch and the output will be an S3 bucket.

Logstash works on the basis of input and output plugins. For this exercise, we need to install the Elasticsearch input plugin and the S3 output plugin.

  1. Execute the command below to install the Logstash Elasticsearch input plugin.
    logstash-plugin install logstash-input-elasticsearch
  2. Execute the command below to install the Logstash S3 output plugin.
    logstash-plugin install logstash-output-s3
  3. Next, create a configuration file for the Logstash run. An example configuration is provided below.

     

    input {
      elasticsearch {
        hosts => "elastic_search_host"
        index => "source_index_name"
        query => '{ "query": { "match_all": {} } }'
      }
    }

    output {
      s3 {
        access_key_id => "aws_access_key"
        secret_access_key => "aws_secret_key"
        bucket => "bucket_name"
      }
    }

    In the above configuration, replace elastic_search_host with the URL of your source Elasticsearch instance and source_index_name with the name of the index to export. The query matches every document present in the index. Remember to also replace the AWS access credentials and the bucket name with your own details.

    Save this configuration in a file named es_to_s3.conf.

  4. Execute the configuration using the following command.
    logstash -f es_to_s3.conf

    The above command will write JSON output matching the query to the provided S3 location. Depending on your data volume, this can take anywhere from a few minutes to much longer. Multiple parameters can be adjusted in the S3 output configuration to control variables like output file size; a detailed description of all configuration parameters can be found in the Logstash S3 output plugin documentation.
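    For example, the S3 output block can be extended with options such as prefix, size_file, and time_file to control where the files land and how often they are rolled over. The values below are placeholders and should be checked against the plugin documentation for your Logstash version.

    output {
      s3 {
        access_key_id => "aws_access_key"
        secret_access_key => "aws_secret_key"
        region => "us-east-1"
        bucket => "bucket_name"
        prefix => "elasticsearch-export/"   # folder-like key prefix inside the bucket
        size_file => 10485760               # rotate the output file after roughly 10 MB
        time_file => 5                      # or after 5 minutes, whichever comes first
        codec => "json_lines"               # write one JSON document per line
      }
    }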

Elasticsearch to S3: Limitations of Building Custom Code

The above approach is the simplest way to transfer data from Elasticsearch to S3 without using any external tools. But it does have some limitations.

  1. This approach works fine for a one-time load, but in most situations, the transfer is a continuous process that needs to be executed on an interval or based on triggers. To accommodate such requirements, customized code will be required (see the scheduling sketch after this list).
  2. This approach is resource-intensive and can hog the cluster, depending upon the number of indexes and the volume of data that needs to be copied.
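As a sketch of one option for scheduling, the Elasticsearch input plugin accepts a schedule option in cron syntax, so the same pipeline can be re-run at an interval; alternatively, the logstash -f command can be wrapped in a cron job. The schedule value below is illustrative, and any incremental or deduplication logic would still need custom handling.

    input {
      elasticsearch {
        hosts => "elastic_search_host"
        index => "source_index_name"
        query => '{ "query": { "match_all": {} } }'
        schedule => "0 * * * *"   # re-run the export at the top of every hour
      }
    }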

An alternative to this approach would be to use a third-party platform like Hevo.

An Easier Alternative to Move Data from Elasticsearch to S3

A fully-managed Data Integration platform like Hevo (14-day, risk-free trial) can take the burden off you completely by automating the data load from Elasticsearch to S3.

Hevo’s fault-tolerant architecture ensures the data is moved securely and reliably without any loss.

In addition to Elasticsearch, Hevo can help you move data from a variety of different data sources into S3. This, in turn, will enable your team to stop worrying about data and only focus on gaining insights from it.

What are your thoughts around moving data from Elasticsearch to S3? Have you explored other approaches that have worked for you? Let us know in the comments.
