Elasticsearch for Apache Hadoop
Elasticsearch for Apache Hadoop is an umbrella project consisting of two similar, yet independent sub-projects: elasticsearch-hadoop and repository-hdfs.
This documentation pertains to elasticsearch-hadoop. For information about repository-hdfs and using HDFS as a back-end repository for snapshot and restore operations with Elasticsearch, see Hadoop HDFS repository plugin.
Elasticsearch for Apache Hadoop is an open-source, stand-alone, self-contained, small library that allows Hadoop jobs (whether they use Map/Reduce directly, libraries built upon it such as Hive, or newer libraries such as Apache Spark) to interact with Elasticsearch. One can think of it as a connector that allows data to flow bi-directionally, so that applications can transparently leverage the capabilities of the Elasticsearch engine to enrich their own capabilities and improve performance.
Elasticsearch for Apache Hadoop offers first-class support for vanilla Map/Reduce and Hive, so that using Elasticsearch is like using any other resource within the Hadoop cluster. As such, Elasticsearch for Apache Hadoop is a passive component: Hadoop jobs use it as a library and interact with Elasticsearch through its APIs.
While the official name of the project is Elasticsearch for Apache Hadoop, the term elasticsearch-hadoop is used throughout the documentation for readability.
This document assumes the reader already has a basic familiarity with Elasticsearch and Hadoop concepts. For more information, refer to Elasticsearch for Apache Hadoop resources.