Runtime options
When using elasticsearch-hadoop, it is important to be aware of the following Hadoop configurations that can influence the way Map/Reduce tasks are executed and in return elasticsearch-hadoop.
Unfortunately, these settings need to be setup manually before the job / script configuration. Since elasticsearch-hadoop is called too late in the life-cycle, after the tasks have been already dispatched and as such, cannot influence the execution anymore.
[TBC: FANCY QUOTE]
In other words, speculative execution is an optimization, enabled by default, that allows Hadoop to create duplicates tasks of those which it considers hanged or slowed down. When doing data crunching or reading resources, having duplicate tasks is harmless and means at most a waste of computation resources; however when writing data to an external store, this can cause data corruption through duplicates or unnecessary updates. Since the speculative execution behavior can be triggered by external factors (such as network or CPU load which in turn cause false positive) even in stable environments (virtualized clusters are particularly prone to this) and has a direct impact on data, elasticsearch-hadoop disables this optimization for data safety.
Please check your library setting and disable this feature. If you encounter more data then expected, double and triple check this setting.
Speculative execution can be disabled for the map and reduce phase - we recommend disabling in both cases - by setting to false
the following two properties:
mapred.map.tasks.speculative.execution
mapred.reduce.tasks.speculative.execution
One can either set the properties by name manually on the Configuration
/JobConf
client:
jobConf.setSpeculativeExecution(false);
// or
configuration.setBoolean("mapred.map.tasks.speculative.execution", false);
configuration.setBoolean("mapred.reduce.tasks.speculative.execution", false);
or by passing them as arguments to the command line:
$ bin/hadoop jar -Dmapred.map.tasks.speculative.execution=false \
-Dmapred.reduce.tasks.speculative.execution=false <jar>
Apache Hive has its own setting for speculative execution through namely hive.mapred.reduce.tasks.speculative.execution
. It is enabled by default so do change it to false
in your scripts:
set hive.mapred.reduce.tasks.speculative.execution=false;
Note that while the setting has been deprecated in Hive 0.10 and one might get a warning, double check that the speculative execution is actually disabled.
Out of the box, Spark has speculative execution disabled. Double check this is the case through the spark.speculation
setting (false
to disable it, true
to enable it).