Elasticsearch for Apache Hadoop

Elasticsearch for Apache Hadoop is an umbrella project consisting of two similar, yet independent sub-projects: elasticsearch-hadoop and repository-hdfs. This documentation pertains to elasticsearch-hadoop. For information about repository-hdfs and using HDFS as a back-end repository for Elasticsearch snapshot and restore operations, see Hadoop HDFS repository plugin.

Elasticsearch for Apache Hadoop is an open-source, stand-alone, self-contained, small library that allows big data processing frameworks (specifically Apache Hadoop Map/Reduce and Apache Spark) to interact with Elasticsearch. One can think of it as a connector that allows data to flow bi-directionally, so that applications can transparently leverage the capabilities of the Elasticsearch engine to enrich their own functionality and improve performance.

Elasticsearch for Apache Hadoop provides native integration for Map/Reduce, Spark, and Hive, making Elasticsearch accessible as if it were a native resource within your data processing cluster. As such, Elasticsearch for Apache Hadoop operates as a library that processing jobs import and use through its APIs to read from and write to Elasticsearch.
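As a rough illustration of what "import and use through its APIs" can look like from a Spark job, here is a minimal PySpark sketch. It assumes the elasticsearch-hadoop (or elasticsearch-spark) jar is already on the Spark classpath and that an Elasticsearch node is reachable. The helper names (es_options, write_to_es, read_from_es) are hypothetical and exist only for this example; the es.nodes and es.port settings and the org.elasticsearch.spark.sql data source name belong to the connector.

```python
# Sketch only: assumes the elasticsearch-hadoop connector jar is on the Spark
# classpath. The helper functions below are illustrative, not part of the library.

def es_options(nodes="localhost", port=9200):
    """Build connector settings (es.nodes and es.port are connector keys)."""
    return {"es.nodes": nodes, "es.port": str(port)}


def write_to_es(df, index, options):
    """Append the rows of a Spark DataFrame to the given Elasticsearch index."""
    (df.write
       .format("org.elasticsearch.spark.sql")  # data source provided by the connector
       .options(**options)
       .mode("append")
       .save(index))


def read_from_es(spark, index, options):
    """Load an Elasticsearch index as a Spark DataFrame."""
    return (spark.read
                 .format("org.elasticsearch.spark.sql")
                 .options(**options)
                 .load(index))
```

Inside a job, one might call write_to_es(df, "logs", es_options("es-host")) and later read_from_es(spark, "logs", es_options("es-host")), where "logs" and "es-host" are placeholder names. Keeping the settings in a plain dictionary makes it easy to reuse the same connection details for both reading and writing.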

While the official name of the project is Elasticsearch for Apache Hadoop, the term elasticsearch-hadoop is used throughout the documentation to improve readability.

This document assumes the reader already has basic familiarity with Elasticsearch and with Hadoop and/or Spark concepts. For more information, refer to Elasticsearch for Apache Hadoop resources.