Elasticsearch to BigQuery: 2 Easy Methods

• November 23rd, 2020

Do you want to transfer your Elasticsearch data to Google BigQuery? Are you finding it challenging to connect Elasticsearch to BigQuery? If yes, then you’ve landed at the right place! This article will answer all your queries & relieve you of the stress of finding a truly efficient solution. Follow this step-by-step guide to master the skill of efficiently transferring your data to Google BigQuery from Elasticsearch.

It will help you take charge in a hassle-free way without compromising efficiency. This article aims at making the data export process as smooth as possible.

Upon a complete walkthrough of the content, you will be able to successfully set up a connection between Elasticsearch & Google BigQuery to seamlessly transfer data to Google BigQuery for a fruitful analysis in real-time. It will further help you build a customized ETL pipeline for your organization. Through this article, you will get a deep understanding of the tools and techniques & thus, it will help you hone your skills further.

Table of Contents

Introduction to Elasticsearch

Elasticsearch to BigQuery: Elasticsearch Logo.
Image Source: www.antaresnet.com

Elasticsearch is an open-source search engine, developed in Java. It is essentially a No-SQL database and an analytic engine built on top of Lucene. It provides real-time performance and quick search results primarily because it performs an index-based search instead of a text-based one.

Key features of Elasticsearch

  • Scalable: It runs smoothly, even on systems, containing 1000’s of nodes and scales automatically.  
  • Document Oriented: It uses the documents in the JSON format by default. It is simple, concise, and supports most programming languages.
  • Schema Support: There is no need to provide an index or a data type beforehand, objects, are automatically mapped when they get indexed with new properties.
  • Instance Search: It supports the “Search As you Type” feature, to help users extract relevant results quickly.

For further information on Elasticsearch, you can check the official website here.

Introduction to Google BigQuery

Elasticsearch to BigQuery: BigQuery Logo.
Image Source: www.dataschool.com

Google BigQuery is a robust serverless data warehouse solution that lets you store and analyse large amounts of data in a matter of a few seconds. It allows you to leverage its phenomenal functionalities such as robust querying & processing abilities to analyse data from various sources such as Google Analytics and a lot more.

Key features of Google BigQuery

  • Scalability: Google BigQuery offers true scalability and consistent performance using its massively parallel computing and secure storage engine.
  • Data Ingestion Formats: Google BigQuery allows users to load data in various formats such as AVRO, CSV, JSON, etc.
  • Built-in AI & ML: It supports predictive analysis using its auto ML tables feature, a codeless interface that helps develop models having the best in class accuracy. Google BigQuery ML is another feature that supports algorithms such as K means, Logistic Regression, etc.
  • Parallel Processing: It uses a cloud-based parallel query processing engine that reads data from thousands of disks at the same time.

For further information on Google BigQuery, you can check the official website here.

Ways to Connect Elasticsearch With Google BigQuery

Method 1: Using Apache Airflow & Google Dataflow to Connect Elasticsearch to BigQuery 

Using Google Dataflow and Apache Airflow & Beam to establish a connection between Elasticsearch & Google BigQuery is one such way. This method requires integrating your Elasticsearch cluster and Google Cloud Project using a VPC network & NAT gateway. You will then need to write Kotlin & Python-based custom code snippets to carry out the ETL job using the ElasticsearchIO & BigQueryIO classes of the Apache Beam SDK.

Method 2: Using Hevo Data, a No-code Data Pipeline

A fully managed, No-code Data Pipeline platform like Hevo Data, helps you load data from Elasticsearch (among 100+ Sources) to Google BigQuery in real-time, in an effortless manner. Hevo, with its minimal learning curve, can be set up in a matter of minutes, making the users ready to load data without compromising performance. Its strong integration with various sources such as databases, files, analytics engine, etc. gives users the flexibility to bring in data of all different kinds in a way that’s as smooth as possible, without having to write a single line of code.

Get started with hevo for free

Prerequisites

  • Working knowledge of Elasticsearch.
  • Working knowledge of Google BigQuery.
  • A general idea about Kotlin.
  • A general idea about Apache Beam & its libraries.
  • A Google BigQuery account and project.
  • An Elasticsearch account.

Methods to Connect Elasticsearch to BigQuery

There are multiple ways in which you can transfer data from Elasticsearch to BigQuery:

Method 1: Using Apache Airflow & Google Dataflow to Connect Elasticsearch to BigQuery

To set up a custom ETL pipeline that will help you transfer data from Elasticsearch to BigQuery requires you to use a VPC network and NAT gateway to set up a connection between your Google Cloud Project & Elasticsearch clusters. You will then use ElasticsearchIO & BigQueryIO classes to fetch and import data.

Elasticsearch to BigQuery.
Image Source: www.medium.com/inside-bizzabo

This method can be implemented using the following steps:

Step 1: Setting up a VPC for Google Dataflow

To start loading data from Elasticsearch to BigQuery, you will first have to establish a connection between your Elasticsearch cluster and your Google Cloud Project. For Elasticsearch clusters that are running on Amazon Web Services EC2, you will further have to use the AWS DNS service, known as Route 53, that will help you associate a particular domain to your Elasticsearch cluster. 

To do this, go to the official AWS Console website and login with your credentials such as username and password. 

Elasticsearch to BigQuery: AWS Console Login.
Image Source: Self

Once you’ve logged in, go to the Route 53 console and select the hosted zones option from the panel on the left. Click on the create a recordset option and configure your recordset as follows:

Elasticsearch to BigQuery: Creating a Record Set.
Image Source: Self

With your recordset now ready, you need to figure out the Google Dataflow IP address that will access your Elasticsearch cluster. Since Google Dataflows use ad-hoc VMs, to keep track of IPs & worker VMs, you will have to create a VPC for your data pipeline handling connections with Google Cloud Projects.

To do this, go to the official website of Google Cloud Platform and log in using your Google credentials such as username and password. Once you’ve logged in, go to the page of VPC networks, and click on the create VPC network option.

A new dialogue box will now open up on your screen, where you need to provide the name, description, and custom subnet configurations for your VPC network. You can do this as follows:

Elasticsearch to BigQuery: Creating a VPC Network.
Image Source: Self

For further information on configuring the subnet for your Google Cloud Project VPC, you can check the official documentation here.

This is how you can set up a VPC for Google Dataflow.

Step 2: Linking the Google Dataflow VPC to a NAT Gateway

With your VPC set up for your Google Cloud Projects, you now need to link it with a cloud-based NAT gateway. It will ensure that any traffic from your VPC will only pass through your NAT gateways static IPs. To do this, go to the official website of Google Cloud Console and log in with your credentials such as username and password.

Elasticsearch to BigQuery: Google Cloud Login.
Image Source: Self

Once you’ve logged in, go to the Cloud NAT page and click on the create NAT gateway option. A new dialogue box will now open up on your screen, where you need to provide the following information:

Elasticsearch to BigQuery: Creating a NAT Gateway.
Image Source: Self
  • Gateway Name: Enter a name for your gateway.
  • VPC network: Select your VPC network, from the drop-down list.
  • Region: Select a world region for your gateway.
  • Cloud-based Router: You can either select/create a cloud router.

Once you’ve added all necessary details, click on the create option. With your NAT gateway now set up, go to the EC2 console and add a new security group for your Elasticsearch cluster, that will help accept incoming traffic only from the Google Dataflow IPs.

Elasticsearch to BigQuery: TCP NAT Gateway Set up.
Image Source: Self

This is how you can link your VPC network for Google Cloud Project with NAT gateways.

Step 3: Using ElasticsearchIO to Fetch Data from Elasticsearch

With your Elasticsearch cluster & GCP connection now up and running, you can start fetching data from your Elasticsearch database. To do this, you can make use of the ElasticsearchIO.read() function of Elasticsearch IO class, available in the SDK of Apache Beam as follows:

private fun readFromES(argMap: HashMap<String, String>, query: String): ElasticsearchIO.Read? {
return ElasticsearchIO.read()         .withConnectionConfiguration(ElasticsearchIO.ConnectionConfiguration
.create(arrayOf(argMap["source"]), argMap["index"], argMap["type"]).withConnectTimeout(argMap["connectTimeout"]!!.toInt()               .withSocketAndRetryTimeout(argMap["socketAndRetryTimeout"]!!.toInt()
)
.withBatchSize(argMap["batchSize"]!!.toLong())
.withQuery(query)
}

In case you don’t have the Apache Beam SDK installed on your system, you can download it from here.

The ElasticsearchIO.read() function helps you fetch data from Elasticsearch in the form of JSON documents. It has various parameters such as source, socket, index_type, etc. that you can configure as per your Elasticsearch database. You will also need to pass a data query as an argument that will help fetch your data.

This is how you can start fetching data from Elasticsearch as JSON documents to connect Elasticsearch to BigQuery.

Are you tired of following a tedious setup process? Click here to check out the method of using Hevo Data that will let you integrate Elasticsearch to BigQuery in an automated and hassle-free manner!

Step 4: Configuring & Transforming Elasticsearch Documents

Once you’ve fetched your Elasticsearch data, you now need to convert your JSON documents into the form that Google BigQuery tables support, that is as database records. To do this, you can use Gson, to parse your JSON document and map it into the desired schema, based on numerous relevant keys. It will then fetch the data values and return it as a “TableRecord” object.

You can use the following lines of code to map your Elasticsearch JSON document into a “TableRecord” object that Google BigQuery supports:

private fun mapToTableRow() = MapElements.into(TypeDescriptor.of(TableRow::class.java))
        .via(ContactStringToTableRow())
class ContactStringToTableRow : SimpleFunction<String, TableRow>() {
    override fun apply(input: String): TableRow {
        val gson: Gson = GsonBuilder().create()
        val parsedMap: Map<String, Any> = gson.fromJson(input, object : TypeToken<Map<String, Any>>() {}.type)
        return TableRow()
                .set("id", parsedMap["id"].toString().toDouble().toLong())
                .set("first_name", parsedMap["first_name"].toString())
                .set("last_name", parsedMap["last_name"].toString())
                .set("address", parsedMap["address"].toString())
                .set("birthday", ISODateTimeFormat.dateTime().print((parsedMap["birthday"].toString().toDouble().toLong())))
                .set("person_json", input)
                .set("created", ISODateTimeFormat.dateTime().print((parsedMap["created"].toString().toDouble().toLong())))
                .set("modified", ISODateTimeFormat.dateTime().print((parsedMap["modified"].toString().toDouble().toLong())))
    }
}

This is how you can transform your Elasticsearch JSON data to connect Elasticsearch to BigQuery.

Step 5: Loading data from Elasticsearch to BigQuery

Once you’ve transformed your Elasticsearch data, you can now make use of the BigQueryIO.writeTable() function of the BigQueryIO, available in the SDK of Apache Beam as follows:

private fun writeToBQ(tableReference: TableReference?, fields: ArrayList<TableFieldSchema>): BigQueryIO.Write<TableRow>? {
    return BigQueryIO.writeTableRows()
            .to(tableReference)
            .optimizedWrites()
            .withSchema(TableSchema().setFields(fields))  .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)          .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
}
Loading Job from into Google BigQuery.
Image Source: Self

With your data pipeline now ready, every time your data pipelines execute, they will add a new job to Google Dataflow. In case you want to know more about a job, you can click on its name to retrieve information such as auto-scaling, metrics, logs, etc.

Elasticsearch to BigQuery: Google BigQuery Job Summary.
Image Source: Self

This is how you can connect Elasticsearch to BigQuery using Google Dataflow and Apache Airflow & Beam.

Step 6: Scheduling Data Pipeline Execution with Apache Airflow (Optional Step)

In case you want your data pipeline to execute at scheduled intervals of time, you can make use of Apache Airflow & its DAGs to achieve this functionality. You can refer to the official documentation of Apache Airflow here, that will help you get started with the installation and set up process of Apache Airflow.

With Apache Airflow in place, a relevant DAG trigger at a pre-defined interval of time will execute the jar files based on the configuration parameters of your DAG, which will help start the ETL job. To do this, you can use the following lines of code:

dag = DAG('dataPipelineFromElasticsearchToBigquery-dag', default_args=dag_args, catchup=False)
start = DummyOperator(task_id='start', dag=dag)
task = DataFlowJavaOperator(
    task_id='daily-dataPipelineFromElasticsearchToBigquery-task',  jar='gs://dataPipelineFromElasticsearchToBigquery/lib/dataPipelineFromElasticsearchToBigquery.jar',
    options=all_options,
    dag=dag)
start >> task

This is how you can set up interval-based execution of your data pipeline using Apache Airflow.

Limitations of Integrating Elasticsearch & Google BigQuery using Google Dataflows and Apache Airflow & Beam

  • Integrating Elasticsearch with Google BigQuery using Apache Beam & Google Dataflow requires you to write custom Kotlin-based code to fetch, transform and then load data. Hence, you must have strong technical knowledge. 
  •  Setting up a dedicated VPC network, NAT gateways, etc., can be a challenging task, especially for beginners as it requires you to have a deep understanding of how IP addresses & subnetting works.
  • You must ensure that you provide all correct parameters such as table name, schema, etc. as even a small error can result in the ETL process to fail.
  • It requires you to have a general idea of how different services such as Apache Airflow, Beam, Google Dataflow, etc. work, resulting in a bottleneck, as many might not be aware of their operations.

Method 2: Using Hevo Data, a No-code Data Pipeline

Elasticsearch to BigQuery: Hevo Logo.

Hevo Data, a No-code Data Pipeline, helps you transfer data from Elasticsearch (among 100+ sources) to Google BigQuery & lets you visualize it in a BI tool. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using various BI tools such as Power BI, Tableau, etc. 

Steps to use Hevo Data:

Hevo Data focuses on two simple steps to get you started:

  • Configure Source: Connect Hevo Data with a data source of your choice such as Elasticsearch.
  • Integrate Data: Load data from Elasticsearch to BigQuery by providing your Google BigQuery database credentials such as your authorized Google BigQuery account, along with a name for your database, destination, and project Id.

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your data analysis with Hevo today and Sign up here for a 14-day free trial!.

Conclusion

This article teaches you how to connect Elasticsearch to BigQuery with ease. It provides in-depth knowledge about the concepts behind every step to help you understand and implement them efficiently. These methods, however, can be challenging especially for a beginner & this is where Hevo saves the day. Hevo Data, a No-code Data Pipeline, helps you transfer data from a source of your choice in a fully-automated and secure manner without having to write the code repeatedly. Hevo, with its strong integration with 100+ sources & BI tools, allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiff.

visit our website to explore hevo

Want to take Hevo for a spin? sign up and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Tell us about your experience of connecting Elasticsearch to BigQuery! Share your thoughts in the comments section below!

No-code Data Pipeline For Google BigQuery