AWS Elasticsearch Snapshot Replication: Best Methods

Pratik Dwivedi • Last Modified: December 29th, 2022


Are you looking for the best ways to perform Elasticsearch snapshot replication? You have landed on the right page. Of the different types of replication available, this article focuses on snapshot replication.


Introduction to Elasticsearch

Elasticsearch is a popular open-source full-text search and analytics engine for all types of data, including textual, document, numerical, geospatial, structured, and unstructured data. Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Your users can be spread all over the globe, and to serve them quickly, you need your ES cluster instances running as geographically close to them as possible.

This article provides steps to perform Elasticsearch snapshot replication and discusses strategies to achieve multi-region data replication in Elasticsearch (abbreviated as ES). The activity of capturing a data-change event, along with the changed data itself, is called Change Data Capture (CDC for short). We will discuss the broad strategies that can be applied; low-level implementation details are beyond the scope of this article.

Limitations of Elasticsearch 

1. To cater to local/regional searches, you need to provision a separate cluster in every supported region.

2. There is no automated or easy method to ensure that different clusters are coherent with each other and share the same snapshot (data streams, indices, etc.).

To ensure that all your ES clusters return the same results after sifting through exactly the same data, you will have to devise a strategy and implement it on your own.

More often than not, this could involve writing code. 

Elasticsearch snapshots are incremental, meaning that they only store data that has changed since the last successful snapshot. 

Hevo Data: Migrate your Data Seamlessly

Hevo is a No-code Data Pipeline. It supports pre-built data integrations from 100+ data sources, including Elasticsearch. If you want to replicate your data, then Hevo is the right choice for you. Hevo is a fully managed solution for your data migration. It will automate your data flow in minutes. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully-automated solution to replicate and manage data in real-time and always have analysis-ready data in your desired destination.

Visit our Website to Explore Hevo

Let’s look at some salient features of Hevo:

  • Fully Managed: It requires no maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer. 
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within pipelines.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Give Hevo a try by signing up here for a 14-Day Free Trial!

Methods to Perform Replication

The following approaches can serve as probable solutions:

1. DynamoDB → DynamoDB Streams → Lambda → ES

In this approach, you store your data in DynamoDB, which is connected to DynamoDB Streams. Whenever you update your data, the delta (changed) data is sent to DynamoDB Streams. These streams trigger a Lambda function that pushes the data into all your ES instances. You will have to create your Lambda function carefully, as it is the core of this strategy.
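A minimal sketch of such a Lambda function is below; the domain endpoints, the "products" index, and the string partition key named "id" are all hypothetical placeholders. A production version against Amazon ES would also sign each request with SigV4 (for example, via requests-aws4auth):

```python
import requests  # bundle this dependency with the Lambda deployment package

# Hypothetical ES domain endpoints, one per region you replicate to.
ES_ENDPOINTS = [
    "https://search-my-domain.us-east-1.es.amazonaws.com",
    "https://search-my-domain.eu-west-1.es.amazonaws.com",
]
INDEX = "products"

def handler(event, context):
    """Fan each DynamoDB Streams record out to every ES endpoint."""
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["id"]["S"]  # string key named "id"
        for endpoint in ES_ENDPOINTS:
            url = f"{endpoint}/{INDEX}/_doc/{doc_id}"
            if record["eventName"] == "REMOVE":
                requests.delete(url, timeout=10)
            else:
                # NewImage holds the full item after an INSERT or MODIFY.
                image = record["dynamodb"]["NewImage"]
                # Naive unwrap of DynamoDB attribute values ({"S": "x"} -> "x");
                # enough for flat string/number items.
                doc = {k: list(v.values())[0] for k, v in image.items()}
                requests.put(url, json=doc, timeout=10)
    return {"status": "ok"}
```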

Elasticsearch Snapshot Replication using DynamoDB

Some of your events could fail, so you can configure CloudWatch to monitor failed events, as well as those that fail even on re-attempt.
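As a minimal sketch, you could alarm on the replication Lambda's built-in Errors metric; the function name and SNS topic ARN below are hypothetical placeholders:

```python
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

# Alert whenever the replication Lambda reports any error in a 5-minute window.
cw.put_metric_alarm(
    AlarmName="es-replication-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "es-replicator"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```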

This approach is simple, but if your ES clusters are in far-apart regions, network latency may lead to delays and timeouts, which in turn lead to inconsistency between your ES clusters.

Efficient Variant to this Approach

An efficient variant of the above approach is to use DynamoDB Global Tables instead of a plain DynamoDB table. You only need to add/update data in one region, and DynamoDB Global Tables will replicate the data across regions for you.

Global Tables take care of data ingestion, error handling, and network latency. This is a very efficient and secure option, worth considering if you have the resources (budget) for it. The flip side is that DynamoDB Global Tables support "eventual consistency," which means there could be a short time gap before your clusters are in sync.
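As a minimal sketch, adding a replica region is enough to convert an existing table into a global table (global tables version 2019.11.21). The table name below is a hypothetical placeholder, and the table must already have streams enabled with new and old images:

```python
import boto3

client = boto3.client("dynamodb", region_name="us-east-1")

# Adding a replica converts the table into a global table; DynamoDB then
# replicates writes to every listed region automatically.
client.update_table(
    TableName="products",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)
```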

2. Manual Approach for Simple and Static ES Instances

If all your servers and data sources are in AWS, then do the following:

Step 1: Let's call your two ES instances "Source" and "Destination"

Register the same manual snapshot repository on both the source and destination domains (usually, an S3 bucket).

(Optional) If you’re migrating to another AWS account, attach a policy to the source S3 bucket that grants cross-account permissions to the destination S3 bucket.
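As a minimal sketch, the repository registration could look like this from Python; the domain endpoint, bucket name, and IAM role ARN are hypothetical placeholders. Amazon ES requires the request to be signed with IAM credentials (here via requests-aws4auth) and a role_arn the domain can assume to write to S3. Run the same registration against the destination domain's endpoint as well:

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, "es", session_token=credentials.token)

host = "https://search-source-domain.us-east-1.es.amazonaws.com"

# Register the shared S3 bucket as a manual snapshot repository.
payload = {
    "type": "s3",
    "settings": {
        "bucket": "my-es-snapshot-bucket",
        "region": region,
        "role_arn": "arn:aws:iam::123456789012:role/EsSnapshotRole",
    },
}
r = requests.put(f"{host}/_snapshot/my-repo", auth=awsauth, json=payload)
r.raise_for_status()
```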

Step 2: Take a manual snapshot of the “Source” Elasticsearch domain

You can take a snapshot of an entire cluster, including all its data streams and indices, and then save it in your S3 bucket.
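Continuing the sketch above (reusing the host and awsauth setup from Step 1), taking the snapshot is a single signed PUT; the snapshot name is arbitrary:

```python
# Take a manual snapshot on the source domain; it runs in the background.
r = requests.put(
    f"{host}/_snapshot/my-repo/snapshot-2022-12-29",
    auth=awsauth,
)
r.raise_for_status()
print(r.json())  # {"accepted": true} once the snapshot has been queued
```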

Step 3: Elasticsearch has a "_restore" API; use this API to restore the snapshot to the "Destination" domain

Some parameters and configurable options you can use are listed below, followed by a restore sketch:

index_settings parameter – used to override index settings during the restore process. 

rename_pattern and rename_replacement options – can also be used to rename data streams and indices on restore, using a regular expression that supports referencing the original text.
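As a minimal sketch (the destination endpoint and index names are hypothetical placeholders, and the auth setup is the same as in Step 1):

```python
# Restore on the destination domain, renaming indices on the way in.
dest_host = "https://search-dest-domain.eu-west-1.es.amazonaws.com"

body = {
    "indices": "products*",
    "rename_pattern": "products_(.+)",
    "rename_replacement": "restored_products_$1",
    "index_settings": {"index.number_of_replicas": 1},
}
r = requests.post(
    f"{dest_host}/_snapshot/my-repo/snapshot-2022-12-29/_restore",
    auth=awsauth,
    json=body,
)
r.raise_for_status()
```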

Manual snapshots are not free: you will incur costs for taking each snapshot, as well as S3 storage costs.

Remember, if your data needs to be updated/augmented regularly, or if your queries change frequently, this process needs to be repeated every time and can become tedious.

Conclusion

For all practical purposes, your data and indices will change very often, if not daily. If the number of ES domains increases, this approach can become too tedious. Hence, it may be better to create a dynamic pipeline using Hevo. 

Hevo is a No-code Data Pipeline. It supports pre-built integrations from 100+ Data Sources at a reasonable price. You can automate your data flow from Elasticsearch in minutes.

Sign Up for a 14-day free trial today and let Hevo be a helping hand in your backup process.

Have any further queries? Get in touch with us in the comments section below.
