Limitations

The following limitations and known problems apply to the 9.0.0-beta1 release of the Elastic data frame analytics feature. The limitations are grouped into the following categories:

Platform limitations are related to the platform that hosts the machine learning feature of the Elastic Stack.
Configuration limitations apply to the configuration process of data frame analytics jobs.
Operational limitations affect the behavior of data frame analytics jobs that are running.

Platform limitations ¶

CPU scheduling improvements apply to Linux and macOS only ¶

When many machine learning jobs run at the same time and CPU resources are insufficient, JVM performance must be prioritized so that search and indexing latency remains acceptable. To that end, when CPU is constrained on Linux and macOS environments, the CPU scheduling priority of native analysis processes is reduced to favor the Elasticsearch JVM. This improvement does not apply to Windows environments.

Configuration limitations ¶

Cross-cluster search is not supported ¶

Cross-cluster search is not supported for data frame analytics.

Nested fields are not supported ¶

Nested fields are not supported for data frame analytics jobs. These fields are ignored during the analysis. If a nested field is selected as the dependent variable for classification or regression analysis, an error occurs.

Data frame analytics jobs cannot be updated ¶

You cannot update data frame analytics configurations. Instead, delete the data frame analytics job and create a new one.

Data frame analytics memory limitation ¶

Data frame analytics can only perform analyses that fit into the memory available for machine learning. Overspill to disk is not currently possible. For general machine learning settings, see Machine learning settings in Elasticsearch.

If you create a data frame analytics job and the inference step of the process fails because the model is too large to fit into the JVM, follow the steps in this GitHub issue for a workaround.

Data frame analytics jobs cannot use more than 2^32 documents for training ¶

A data frame analytics job that would use more than 2^32 (4,294,967,296) documents for training cannot be started. The limitation applies only to documents that participate in training the model. If your source index contains more than 2^32 documents, set the training_percent to a value that represents fewer than 2^32 documents.
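The arithmetic behind choosing a training_percent can be sketched as follows. This is a minimal illustration, not part of the product: the helper name `max_training_percent` is hypothetical, it assumes whole-number percentages, and it assumes `doc_count` is the number of documents eligible for training. Because the training subset is sampled, treat the result as an upper bound and leave some headroom.

```python
MAX_TRAINING_DOCS = 2**32  # limit described above: 4,294,967,296 documents


def max_training_percent(doc_count: int, limit: int = MAX_TRAINING_DOCS) -> int:
    """Largest whole training_percent that stays within the training limit.

    Hypothetical helper. If the index itself is within the limit, 100 is
    returned. Otherwise the result p satisfies doc_count * p / 100 < limit.
    """
    if doc_count <= 0:
        raise ValueError("doc_count must be positive")
    if doc_count <= limit:
        return 100
    # Integer arithmetic avoids floating-point rounding near the boundary:
    # the largest p with doc_count * p < limit * 100.
    percent = (limit * 100 - 1) // doc_count
    if percent < 1:
        raise ValueError("no whole percentage keeps the training set under the limit")
    return percent


if __name__ == "__main__":
    print(max_training_percent(10_000_000_000))
```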

Trained models created in 7.8 are not backwards compatible ¶

Trained models created in version 7.8.0 are not backwards compatible with older node versions. In a mixed cluster environment, all nodes must be at least 7.8.0 to use a model created on a 7.8.0 node.

Operational limitations ¶

Deleting a data frame analytics job does not delete the destination index ¶

The delete data frame analytics job API does not delete the destination index that contains the annotated data of the data frame analytics. That index must be deleted separately.
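A full cleanup therefore takes two separate requests. The sketch below only builds the method and path of each request so that it can be run without a cluster; the helper name `cleanup_requests` and the example job and index names are hypothetical, and the paths reflect the delete data frame analytics job and delete index APIs as I understand them. Check them against the API reference for your version before use.

```python
from typing import List, Tuple


def cleanup_requests(job_id: str, dest_index: str) -> List[Tuple[str, str]]:
    """Return the (method, path) pairs needed to remove a job and its results.

    Hypothetical helper: it does not send anything, it only describes the
    two calls. The job is deleted first, then its destination index.
    """
    return [
        # Removes the job configuration only; the destination index stays.
        ("DELETE", f"/_ml/data_frame/analytics/{job_id}"),
        # Removes the destination index that holds the annotated data.
        ("DELETE", f"/{dest_index}"),
    ]


if __name__ == "__main__":
    for method, path in cleanup_requests("my-job", "my-dest-index"):
        print(method, path)
```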

Data frame analytics job runtime may vary ¶

The runtime of data frame analytics jobs depends on numerous factors, such as the number of data points in the data set, the type of analytics, the number of fields that are included in the analysis, the supplied hyperparameters, and the type of analyzed fields. For this reason, no general runtime value applies to all or most situations. A data frame analytics job may take from a couple of minutes up to many hours in extreme cases.

The runtime increases nearly linearly with the number of analyzed fields. For data sets of more than 100,000 points, start with a low training percent. Run a few data frame analytics jobs to see how the runtime scales with the increased number of data points and how the quality of results scales with an increased training percentage.

Data frame analytics jobs may restart after an Elasticsearch upgrade ¶

A data frame analytics job may be restarted from the beginning in the following cases:

the job is in progress during an Elasticsearch update,
the job resumes on a node with a higher version,
the results format has changed requiring different mappings in the destination index.

If any of these conditions applies, the destination index of the data frame analytics job is deleted and the job starts again from the beginning, regardless of the phase the job was in.

Documents with values of multi-element arrays in analyzed fields are skipped ¶

If the value of an analyzed field (a field that is the subject of the data frame analytics) in a document is an array with more than one element, the document that contains this field is skipped during the analysis.

Outlier detection field types ¶

Outlier detection requires numeric or boolean data to analyze. The algorithms don’t support missing values, therefore fields that have data types other than numeric or boolean are ignored. Documents where included fields contain missing values, null values, or an array are also ignored. Therefore a destination index may contain documents that don’t have an outlier score. These documents are still reindexed from the source index to the destination index, but they are not included in the outlier detection analysis and therefore no outlier score is computed.
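The document-level rule above can be expressed as a small predicate. This is only an illustration of the rule as stated in this section, not the product's implementation; the function name `gets_outlier_score` and the sample field names are hypothetical.

```python
from typing import Any, Dict, Iterable


def gets_outlier_score(doc: Dict[str, Any], included_fields: Iterable[str]) -> bool:
    """Return True if the document would take part in outlier detection.

    Illustrative only: a document is excluded when any included field is
    missing, null, or holds an array.
    """
    for field in included_fields:
        value = doc.get(field)  # None covers both a missing field and an explicit null
        if value is None or isinstance(value, (list, tuple)):
            return False
    return True


if __name__ == "__main__":
    fields = ["cpu", "is_error"]
    print(gets_outlier_score({"cpu": 0.7, "is_error": False}, fields))
    print(gets_outlier_score({"cpu": [0.1, 0.2], "is_error": True}, fields))
```

Documents for which the predicate is false still appear in the destination index, only without an outlier score.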

Regression field types ¶

Regression supports fields that have numeric, boolean, text, keyword, or ip data types. It is also tolerant of missing values. Supported fields are included in the analysis; other fields are ignored. Documents where included fields contain an array are also ignored. Documents in the destination index that don’t contain a results field are not included in the regression analysis.

Classification field types ¶

Classification supports fields that have numeric, boolean, text, keyword, or ip data types. It is also tolerant of missing values. Supported fields are included in the analysis; other fields are ignored. Documents where included fields contain an array are also ignored. Documents in the destination index that don’t contain a results field are not included in the classification analysis.

Imbalanced class sizes affect classification performance ¶

If your training data is very imbalanced, classification analysis may not provide good predictions. Try to avoid highly imbalanced situations. We recommend having at least 50 examples of each class and a ratio of no more than 10 to 1 for the majority to minority class labels in the training data. If your training data set is very imbalanced, consider downsampling the majority class, upsampling the minority class, or gathering more data.
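The two recommendations above (at least 50 examples per class, and a majority-to-minority ratio of no more than 10 to 1) can be checked before you create a job. The sketch below is one way to do that on a list of labels; the function name `check_class_balance` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def check_class_balance(
    labels: Iterable[Hashable],
    min_examples: int = 50,
    max_ratio: float = 10.0,
) -> List[str]:
    """Return human-readable warnings; an empty list means both checks pass."""
    counts = Counter(labels)
    problems = []
    # Check 1: every class should have at least `min_examples` examples.
    for cls, n in sorted(counts.items(), key=lambda kv: str(kv[0])):
        if n < min_examples:
            problems.append(f"class {cls!r} has {n} examples, fewer than {min_examples}")
    # Check 2: the largest class should not exceed max_ratio times the smallest.
    if len(counts) >= 2:
        majority = max(counts.values())
        minority = min(counts.values())
        if majority > max_ratio * minority:
            problems.append(
                f"majority to minority ratio is {majority / minority:.1f} to 1, "
                f"above {max_ratio:g} to 1"
            )
    return problems


if __name__ == "__main__":
    for warning in check_class_balance(["ok"] * 1000 + ["fraud"] * 60):
        print(warning)
```

If the check reports problems, the options named above apply: downsample the majority class, upsample the minority class, or gather more data.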

Deeply nested objects affect inference performance ¶

If the data that you run inference against contains documents that have a series of combinations of dot-delimited and nested fields (for example: {"a.b": "c", "a": {"b": "c"},...}), the operation might be slightly slower. Consider using as simple a mapping as possible for the best performance profile.
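One possible way to avoid mixing the two styles is to normalize documents into a single form before they reach inference. The sketch below expands dot-delimited keys into nested objects; it is an assumption about how you might preprocess data on the client side, not something the product does for you, and the names `expand_dotted` and `_merge` are hypothetical.

```python
from typing import Any, Dict


def _merge(dst: Dict[str, Any], src: Dict[str, Any]) -> None:
    """Recursively merge src into dst; on a conflict the later value wins."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            _merge(dst[key], value)
        else:
            dst[key] = value


def expand_dotted(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite keys such as "a.b" as nested objects so one style is used."""
    out: Dict[str, Any] = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = expand_dotted(value)
        # Wrap the value in one object per dotted path segment, innermost first.
        for part in reversed(key.split(".")):
            value = {part: value}
        _merge(out, value)
    return out


if __name__ == "__main__":
    print(expand_dotted({"a.b": "c", "a": {"d": "e"}}))
```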

Analytics runtime performance may significantly slow down with feature importance computation ¶

For complex models (such as those with many deep trees), the calculation of feature importance takes significantly more time. If a reduction in runtime is important to you, try strategies such as disabling feature importance, reducing the amount of training data (for example by decreasing the training percentage), setting hyperparameter values, or only selecting fields that are relevant for analysis.
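As a sketch of how several of these strategies might look together in one job configuration, the example below builds a request body for a regression job. The index names, field names, and values are hypothetical. The parameter names `training_percent`, `num_top_feature_importance_values`, and `analyzed_fields.includes` are taken from the create data frame analytics job API as I understand it, so verify them against the API reference for your version.

```python
import json

# Hypothetical job configuration illustrating the runtime-reduction strategies.
job_config = {
    "source": {"index": "my-source-index"},  # hypothetical index name
    "dest": {"index": "my-dest-index"},      # hypothetical index name
    "analysis": {
        "regression": {
            "dependent_variable": "price",   # hypothetical field name
            # Train on a smaller share of the eligible documents.
            "training_percent": 20,
            # Assumption: 0 requests no feature importance values.
            "num_top_feature_importance_values": 0,
        }
    },
    # Include only the fields that are relevant for the analysis.
    "analyzed_fields": {"includes": ["price", "area", "rooms"]},
}

# Serialize to the JSON body that would accompany the create job request.
body = json.dumps(job_config, indent=2)

if __name__ == "__main__":
    print(body)
```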