Elasticsearch to Databricks: 2 Easy Ways to Replicate Data

• August 30th, 2022

Elasticsearch to Databricks_FI

Elasticsearch is one of the top players when it comes to logging or log-based analytics use cases. Although Elasticsearch is known for its efficiency, knowledge of Elasticsearch Domain Specific Language (DSL) is a must. And, if you are struggling with DSL, you might want to replicate your data to Databricks instead.

Databricks supports standard SQL, and its functionalities are at par with Elasticsearch. Databricks can easily handle complex queries.

You can replicate data with the help of CSV files using Python to replicate your data from Elasticsearch to Databricks or use an automated data pipeline like Hevo to ease your replicating process.

Table of Contents

How to Replicate Data From Elasticsearch to Databricks?

You can replicate data from Elasticsearch to Databricks using any of the two methods:

First, let’s understand how to replicate data using CSV files.

Replicating Data From Elasticsearch to Databricks using CSV Files

Replicating data from Elasticsearch to Databricks is a 3-step process using CSV files. Firstly you need to export data from Elasticsearch as CSV files, then export the CSV files into Databricks and modify your data according to the needs.

  • Step 1: In various formats, documents can be exported from Elasticsearch using Python Pandas. The files can be exported in HTML, CSV, and JSON formats. Using the following command, you can install Python 3’s PIP: 
sudo apt install python3-pip
sudo yum install python36
sudo yum install python36-devel
sudo yum install python36-setuptools
sudo easy_install-3.6 pip

Here is a sample script for exporting documents in any of the data formats (HTML, CSV and JSON):

#!/usr/bin/env python3
#-*- coding: utf-8 -*-
 
import sys, time, io
start_time = time.time()
 
if sys.version[0] != "3":
    print ("nThis script requires Python 3")
    print ("Please run the script using the 'python3' command.n")
    quit()
try:
    # import the Elasticsearch low-level client library
    from elasticsearch import Elasticsearch
    # import Pandas, JSON, and the NumPy library
    import pandas, json
    import numpy as np
except ImportError as error:
    print ("nImportError:", error)
    print ("Please use 'pip3' to install the necessary packages.")
    quit()
# create a client instance of the library
print ("ncreating client instance of Elasticsearch")
elastic_client = Elasticsearch()
 
"""
MAKE API CALL TO CLUSTER AND CONVERT
THE RESPONSE OBJECT TO A LIST OF
ELASTICSEARCH DOCUMENTS
"""
# total num of Elasticsearch documents to get with API call
total_docs = 20
print ("nmaking API call to Elasticsearch for", total_docs, "documents.")
response = elastic_client.search(
    index='employees',
    body={},
    size=total_docs
)
# grab list of docs from nested dictionary response
print ("putting documents in a list")
elastic_docs = response["hits"]["hits"]
 
"""
GET ALL OF THE ELASTICSEARCH
INDEX'S FIELDS FROM _SOURCE
"""
#  create an empty Pandas DataFrame object for docs
docs = pandas.DataFrame()
# iterate each Elasticsearch doc in list
print ("ncreating objects from Elasticsearch data.")
for num, doc in enumerate(elastic_docs):
    # get _source data dict from document
    source_data = doc["_source"]
    # get _id from document
    _id = doc["_id"]
    # create a Series object from doc dict object
    doc_data = pandas.Series(source_data, name = _id)
    # append the Series object to the DataFrame object
    docs = docs.append(doc_data)
 
"""
EXPORT THE ELASTICSEARCH DOCUMENTS PUT INTO
PANDAS OBJECTS
"""
print ("nexporting Pandas objects to different file types.")
 
# export the Elasticsearch documents as a JSON file
docs.to_json("objectrocket.json")
# have Pandas return a JSON string of the documents
json_export = docs.to_json() # return JSON data
print ("nJSON data:", json_export)
# export Elasticsearch documents to a CSV file
docs.to_csv("objectrocket.csv", ",") # CSV delimited by commas
# export Elasticsearch documents to CSV
csv_export = docs.to_csv(sep=",") # CSV delimited by commas
print ("nCSV data:", csv_export)
# create IO HTML string
import io
html_str = io.StringIO()
# export as HTML
docs.to_html(
    buf=html_str,
    classes='table table-striped'
)
 
# print out the HTML table
print (html_str.getvalue())
# save the Elasticsearch documents as an HTML table
docs.to_html("objectrocket.html")
 
print ("nntime elapsed:", time.time()-start_time)
  • Step 2: Navigate through the Sidebar menu and click on Data, in the Databricks UI. Now, you simply need to drag your CSV files into the drop zone or browse your files from the local computer and upload them and click on Create Table. Your path will look something like this after uploading: /FileStore/tables/<fileName>-<integer>.<fileType>. Your data can be simply accessed by clicking the Create Table with UI button.
Elasticsearch to Databricks: Uploading files in Databricks
Image Source
  • Step 3: After uploading your Data as a table in Databricks, you can modify and read your data in Databricks as CSV.
    • After selecting a Cluster, you need to click on Preview Table and read your CSV data in Databricks.
    • The data types in Databricks are read as a string by default. You must change the data type from a drop-down list to the appropriate one.
    • You can modify your data by navigating through the left bar. It has the following options:
      • Table Name: The name of the table can be changed.
      • File Type: CSV, JSON, and AVRO file types can be chosen.
      • Column Delimiter: This represents the field separating character. For example, ‘,’ is the delimiter in the case of CSV.
      • First Row Header: The first row’s column can be selected as the header.
      • Multi-line: In the cells, lines can be broken with the help of this option.
    • Once all the above parameters are configured, click on Create Table.
    • The CSV files can be read easily from the cluster where you have uploaded that file.

Challenges Faced While Replicating Data

In the following scenario, using CSV files might not be a wise choice:

  • You will need to perform the entire process frequently to access updated data at your destination to achieve two-way sync.
  • The CSV method might not be a good fit for you if you need to replicate data regularly since it’s time-consuming to replicate data using CSV files. 

Companies can use automated pipelines such as Hevo to avoid such challenges. Hevo helps you replicate data from databases such as Postgresql, MongoDB, MariaDB, SQL Server, etc.

Using an automated data pipeline tool, you can transfer data from Elasticsearch to Databricks.

Replicating Data From Elasticsearch to Databricks using Hevo Data

Hevo is an automated data pipeline tool that can replicate data from Elasticsearch to Databricks. Using Hevo data from 150+ sources can be replicated by users for further analysis in a much simpler way into a data warehouse, database, or destination of your choice.

Data can be managed in real-time using Hevo. Hevo provides a consistent and reliable solution that ensures you always have analysis-ready data.

Steps to Connect Elasticsearch to Databricks

The simple steps to carry out Elasticsearch to Databricks using Hevo:

Step 1: Configure Elasticsearch as a Source

Authenticate and Configure your Elasticsearch Source.

Elasticsearch to Databricks: Configure Elasticsearch as a Source
Image Source

Step 2: Configure Databricks as a Destination

In the next step, we will configure Databricks as the destination.

Elasticsearch to Databricks: Configure Databricks as Destination
Image Source

Step 3: All Done to Setup Your ETL Pipeline

Once your Elasticsearch to Databricks ETL Pipeline is configured, Hevo will collect new and updated data from Elasticsearch every five minutes (the default pipeline frequency) and duplicate it into Databricks. Depending on your needs, you can adjust the pipeline frequency from 5 minutes to an hour.

Data Replication Frequency

Default Pipeline FrequencyMinimum Pipeline FrequencyMaximum Pipeline FrequencyCustom Frequency Range (Hrs)
1 Hr15 Mins24 Hrs1-24

You can set up Data Pipeline and start replicating the data within a few minutes!

Why Integrate Elasticsearch to Databricks?

1) Query Language

The querying language used in Elasticsearch is not SQL. Elasticsearch provides Domain Specific Language(DSL) to define queries based on JSON. So it’s not a straightforward conversion from an RDBMS to Elasticsearch for searching through data. To use all of Elasticsearch’s capabilities with DSL, you must have medium-high SQL and Database knowledge. 

Since Databricks uses standard SQL, running queries is not much of a hassle. You can even view graphical representations in the Databricks interface itself on querying.

2) Performance 

Elasticsearch is one of the most widely used database alternatives for text search and log-based or logging analysis. Although Elasticsearch supports the handling of a high volume of data, it has difficulty searching with complex queries. This means anyone running complex queries should look for an alternative.

Databricks uses the unified Spark engine to operate, so processing complex SQL queries and streaming of data happens smoothly. Databricks is feature loaded as well as very well equipped to handle complex queries at the same time efficiently.

Why Use Hevo Data for Elasticsearch to Databricks Integration?

Using manual scripts and custom code to move data into the warehouse is cumbersome. Changing API endpoints and limits, ad-hoc data preparation, and inconsistent schema makes maintaining such a system a nightmare. Hevo’s reliable no-code data pipeline platform enables you to set up zero-maintenance data pipelines that just work.

  • Wide Range of Connectors – Instantly connect and read data from 150+ sources, including SaaS apps and databases, and precisely control pipeline schedules down to the minute.
  • In-built Transformations – Format your data on the fly with Hevo’s preload transformations using either the drag-and-drop interface or our nifty python interface. Generate analysis-ready data in your warehouse using Hevo’s Postload Transformation 
  • Near Real-Time Replication – Get access to near real-time replication for all database sources with log-based replication. For SaaS applications, near real-time replication is subject to API limits.   
  • Auto-Schema Management – Correcting improper schema after the data is loaded into your warehouse is challenging. Hevo automatically maps the source schema with the destination warehouse so you don’t face the pain of schema errors.
  • Transparent Pricing – Say goodbye to complex and hidden pricing models. Hevo’s Transparent Pricing brings complete visibility to your ELT spend. Choose a plan based on your business needs. Stay in control with spend alerts and configurable credit limits for unforeseen spikes in the data flow.
  • 24×7 Customer Support – With Hevo, you get more than just a platform; you get a partner for your pipelines. Discover peace with round-the-clock “Live Chat” within the platform. What’s more, you get 24×7 support even during the 14-day free trial.
  • Security – Discover peace with end-to-end encryption and compliance with all major security certifications, including HIPAA, GDPR, and SOC-2.
Get started for Free with Hevo!

Let’s Put It All Together

In this blog, you got to know about the key factors which could be considered for replicating data from Elasticsearch to Databricks. You learned how data could be replicated using Python Pandas. You also learned about an automated data pipeline solution known as Hevo.

You can use Hevo today to enjoy fully automated, hassle-free data replication for 150+ sources. You can sign up for a 14-day free trial, which gives you access to many free sources. Hevo’s free trial supports 50+ connectors and up to 1 million events per month and spectacular 24/7 email support to help you get started.

Share your experience of replicating data from Elasticsearch to Databricks! Let us know in the comments section below!

Visit our Website to Explore Hevo

No-Code Data Pipeline for Databricks