Fix index lifecycle management errors
Index lifecycle management (ILM) runs actions asynchronously on your cluster's indices, according to the conditions you define in your policy. ILM phases and actions run sequentially on each index, using the permissions of the user who last edited the ILM policy.
ILM can surface two types of issues:
- Direct errors: The Elasticsearch API call itself fails.
- Indirect configuration issues: The API call succeeds, but the intended result doesn't take effect.
This guide explains how to check overall ILM health, investigate individual indices, and resolve common errors.
This section covers the symptoms of stuck and erring indices, then shows common investigative API commands.
ILM purposely holds an index on certain steps while it waits for logic-based and time-based conditions. The following ILM explain API phase/action/step combinations wait:

- `hot/rollover/check-rollover-ready` until ILM rollover requirements are met.
- `*/complete/complete` until the index's `age` qualifies for the next phase's `min_age`. Refer to how `min_age` is calculated for more information.
This page refers to steps other than these as transient steps, where ILM asynchronously applies an operation against the index instead of waiting for a logic-based or time-based condition.
ILM polls for work on an interval basis (default 10m). For more information, refer to ILM phase transitions.
An index moves through its ILM steps as fast as the underlying operation finishes, plus the wait for the next poll. Transient steps that depend on an asynchronous operation can therefore be affected by task backlogs. Common examples:
- `*/migrate/check-migration` monitors the index's shard allocation and recoveries.
- `*/*/forcemerge` waits for the index's force merge, noting the guide's performance considerations.
- `delete/wait_for_snapshot/wait-for-snapshot` delays until the ILM delete phase's snapshot lifecycle management (SLM) policy has completed successfully for the index.
It's fine if these transient steps appear in the ILM explain API output. But if an index doesn't progress past a step for an extended period, investigate. The cause is often specific to your setup or use case, rather than a cluster problem.
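As a rough way to spot such indices, you can compare each index's `step_time_millis` from the ILM explain API output against a threshold. The following is a minimal sketch, not an official tool; it operates on a hand-built sample of the explain response, and the threshold mirrors the one-day default of `health.ilm.max_time_on_step`:

```python
import time

# Sample shape of GET _all/_ilm/explain output
# (field names as in the ILM explain API response).
explain = {
    "indices": {
        "my-index-000001": {
            "managed": True,
            "phase": "warm",
            "action": "migrate",
            "step": "check-migration",
            "step_time_millis": 1541717272688,
        },
        "my-index-000002": {"managed": False},
    }
}

def stuck_indices(explain, max_age_ms, now_ms=None):
    """Return managed indices that have sat on their current step
    longer than max_age_ms."""
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    return [
        name
        for name, info in explain["indices"].items()
        if info.get("managed") and now_ms - info["step_time_millis"] > max_age_ms
    ]

# One day in milliseconds, mirroring health.ilm.max_time_on_step's default.
day_ms = 24 * 60 * 60 * 1000
print(stuck_indices(explain, max_age_ms=day_ms))  # → ['my-index-000001']
```

A hit from a check like this is only a starting point; the next step is the ILM explain API output for that specific index.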
When errors occur, the ILM explain API response includes the following:
- `failed_step`, set to the active step name
- `step`, set to `ERROR`
- `is_auto_retryable_error` flag, set
- `failed_step_retry_count`, incremented
ILM automatically retries all erring indices (the equivalent of running the retry policy API) on each polling interval. During an automatic or manual retry:
- The `step` resets to the active step description and `failed_step` is removed.
- The `is_auto_retryable_error` persists.
- The `failed_step_retry_count` persists and increments again if another error is encountered.
Non-erring indices do not report the fields `failed_step`, `is_auto_retryable_error`, or `failed_step_retry_count`. Indices that have recovered from previous errors also drop these temporary fields. This is why the ILM explain API supports the `only_errors` flag, which returns only indices that are currently failing or retrying a step:
GET _all/_ilm/explain?human=true&expand_wildcards=all&only_errors=true
For troubleshooting, the ILM explain API emits `step_info`. This field is returned only when further context is available, such as a `message` for information or a `reason` for errors.
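These error fields can be pulled out programmatically. Below is a sketch in Python against a hand-built sample response (the `reason` string mirrors the shrink example later in this guide); it is illustrative only, not an official client:

```python
# Sample shape of GET _all/_ilm/explain?only_errors=true output.
explain = {
    "indices": {
        "my-index-000001": {
            "managed": True,
            "step": "ERROR",
            "failed_step": "shrink",
            "is_auto_retryable_error": True,
            "failed_step_retry_count": 3,
            "step_info": {
                "type": "illegal_argument_exception",
                "reason": "the number of target shards [4] must be less "
                          "that the number of source shards [2]",
            },
        }
    }
}

def summarize_errors(explain):
    """Collect (index, failed_step, retry_count, reason) for erring indices."""
    return [
        (
            name,
            info.get("failed_step"),
            info.get("failed_step_retry_count", 0),
            info.get("step_info", {}).get("reason", ""),
        )
        for name, info in explain["indices"].items()
        if info.get("step") == "ERROR"
    ]

for row in summarize_errors(explain):
    print(row)
```

Each tuple gives you the index name, the step that needs attention, how often ILM has retried it, and the reported reason.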
If ILM cannot automatically resolve the error for this index, execution is halted until the underlying issue with the policy, index, or cluster is resolved. For example, shard migrations might block until Elastic Cloud Autoscaling scales or adds necessary data tiers.
Use the following APIs to check ILM health across all indices.
Elasticsearch's health report API reports `stagnating_indices` for indices that have been attempting a step longer than expected:
GET _health_report/ilm
This report's thresholds are controlled by the following cluster settings:
- `health.ilm.max_retries_per_step` (default `100`)
- `health.ilm.max_time_on_action` (default `1d`)
- `health.ilm.max_time_on_step` (default `1d`)
This report consolidates actionable interventions to consider for your ILM and cluster health.
For a high-level summary of all index statuses (not just those needing intervention), use the ILM explain API. For example, save the output to `ilm_explain.json`, then tally the phase/action/step combinations with `jq`:
$ cat ilm_explain.json | jq -c '.indices[]|select(.managed==true)|.phase+"/"+.action+"/"+.step' | sort | uniq -c | sort -r
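If `jq` isn't available, the same tally can be produced with a short Python script. This is a sketch equivalent to the pipeline above; it uses an inline sample instead of the saved file:

```python
import json
from collections import Counter

def summarize(explain):
    """Tally managed indices per phase/action/step combination,
    equivalent to the jq pipeline above."""
    return Counter(
        "{phase}/{action}/{step}".format(**info)
        for info in explain["indices"].values()
        if info.get("managed")
    ).most_common()

# Load the saved output in practice, e.g.:
#   with open("ilm_explain.json") as f:
#       explain = json.load(f)
# Inline sample for illustration:
explain = json.loads("""
{"indices": {
  "idx-a": {"managed": true, "phase": "hot", "action": "rollover", "step": "check-rollover-ready"},
  "idx-b": {"managed": true, "phase": "hot", "action": "rollover", "step": "check-rollover-ready"},
  "idx-c": {"managed": false}
}}
""")
for combo, count in summarize(explain):
    print(f"{count:7d} {combo}")
```

`most_common()` sorts the combinations by count, most frequent first, like `sort | uniq -c | sort -r`.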
For example ILM troubleshooting walkthroughs, refer to:
- Monitoring ILM Elasticsearch Health for resolving erring steps.
- ILM History Index for an explanation of step sequences and how to review historical index statuses.
The following example demonstrates troubleshooting ILM for a newly created index. Consider a shrink-index policy that shrinks an index to four shards once it is at least five days old:
PUT _ilm/policy/shrink-index
{
"policy": {
"phases": {
"warm": {
"min_age": "5d",
"actions": {
"shrink": {
"number_of_shards": 4
}
}
}
}
}
}
To create an index my-index-000001 that has only two primary shards and apply the ILM policy shrink-index:
PUT /my-index-000001
{
"settings": {
"index.number_of_shards": 2,
"index.lifecycle.name": "shrink-index"
}
}
After five days, ILM attempts to run the shrink index API against my-index-000001 to shrink it from two shards to four. Because the shrink action cannot increase the number of shards, the operation fails and ILM moves my-index-000001 to the ERROR step.
Use the ILM explain API to get information about what went wrong:
GET /my-index-000001/_ilm/explain
Which returns the following information:
{
"indices" : {
"my-index-000001" : {
"index" : "my-index-000001",
"managed" : true,
"index_creation_date_millis" : 1541717265865,
"time_since_index_creation": "5.1d",
"policy" : "shrink-index",
"lifecycle_date_millis" : 1541717265865,
"age": "5.1d",
"phase" : "warm",
"phase_time_millis" : 1541717272601,
"action" : "shrink",
"action_time_millis" : 1541717272601,
"step" : "ERROR",
"step_time_millis" : 1541717272688,
"failed_step" : "shrink",
"step_info" : {
"type" : "illegal_argument_exception",
"reason" : "the number of target shards [4] must be less that the number of source shards [2]"
},
"phase_execution" : {
"policy" : "shrink-index",
"phase_definition" : {
"min_age" : "5d",
"actions" : {
"shrink" : {
"number_of_shards" : 4
}
}
},
"version" : 1,
"modified_date_in_millis" : 1541717264230
}
}
}
}
- The policy being used to manage the index: `shrink-index`
- The index age: 5.1 days
- The phase the index is currently in: `warm`
- The current action: `shrink`
- The step the index is currently in: `ERROR`
- The step that failed to run: `shrink`
- The type of error and a description of that error
- The definition of the current phase from the `shrink-index` policy
To resolve this, update the policy to shrink the index to a single shard after 5 days:
PUT _ilm/policy/shrink-index
{
"policy": {
"phases": {
"warm": {
"min_age": "5d",
"actions": {
"shrink": {
"number_of_shards": 1
}
}
}
}
}
}
After resolving the underlying problem, wait for ILM's poll interval to automatically retry the index's ERROR step, or use the retry policy API to run it on demand:
POST /my-index-000001/_ilm/retry
ILM subsequently attempts to re-run the step that failed. You can use the ILM explain API to monitor the progress.
The following behaviors come up often when troubleshooting ILM. For more details, refer to the ILM guide or contact us.
When setting up an ILM policy or automating rollover with ILM, be aware that min_age can be relative to either the rollover time or the index creation time.
If you use ILM rollover, min_age is calculated relative to the time the index was rolled over. This is because the rollover API generates a new index and updates the age of the previous index to reflect the rollover time. If the index hasn’t been rolled over, then the age is the same as the creation_date for the index.
You can override how min_age is calculated using the index.lifecycle.origination_date and index.lifecycle.parse_origination_date ILM settings.
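To make the calculation concrete, here is a simplified sketch of how the effective age follows from the explain API fields. It assumes `lifecycle_date_millis` is the rollover time once the index has rolled over and otherwise equals the creation date, and it deliberately ignores the `origination_date` overrides mentioned above:

```python
def effective_age_ms(info, now_ms):
    """Age that ILM compares against min_age: time since
    lifecycle_date_millis, which is the rollover time for a
    rolled-over index and the creation date otherwise.
    (Simplified: ignores index.lifecycle.origination_date.)"""
    return now_ms - info["lifecycle_date_millis"]

day_ms = 24 * 60 * 60 * 1000
# An index created at t0 that rolled over two days later:
t0 = 1541717265865
info = {"index_creation_date_millis": t0,
        "lifecycle_date_millis": t0 + 2 * day_ms}

now = t0 + 7 * day_ms
print(effective_age_ms(info, now) // day_ms)  # → 5 (days since rollover, not creation)
```

Seven days after creation, the index's age for `min_age` purposes is only five days, because the clock restarted at rollover.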
ILM does not skip steps due to logic-based or time-based conditions. It proceeds through all steps in the enabled action's order. For example, this means it's possible for an index stagnated at phase/action/step of warm/migrate/check-migration to surpass its expected deletion time. Make sure to review and resolve ILM errors to maintain a healthy cluster. For more information, refer to ILM phase transitions.
When an index enters a phase, it caches the ILM policy's current definition. For more information, refer to phase execution. This enables ILM to protect the index from policy changes which might cause data corruption.
As described in how changes are applied, ILM applies safe updates to an index's phase_execution immediately. Updates that aren't safe to apply retroactively are forward-applied, taking effect only as indices enter the phase after the update.
You might need to apply a policy change to indices that are already stagnant. It's not possible to run a single ILM step on demand, because doing so might corrupt the index. Instead, apply the relevant changes to those indices manually.
In rare cases, a policy change can leave indices stagnant. The only fix is the move to lifecycle step API. This is an advanced API; contact us with an Elasticsearch diagnostic before using it.
Each entry below shows the message you'll see in the ERROR step, the cause, and the recommended fix. Errors are grouped by the ILM action where they typically occur.
Problems with rollover aliases are a common cause of errors. You should consider using data streams instead of managing rollover with aliases.
These errors can occur when the ILM rollover action runs:
Rollover alias [x] can point to multiple indices, found duplicated alias [x] in index template [z]
The target rollover alias is specified in an index template’s index.lifecycle.rollover_alias setting. You need to explicitly configure this alias one time when you bootstrap the initial index. The rollover action then manages setting and updating the alias to roll over to each subsequent index.
Do not explicitly configure this same alias in the aliases section of an index template.
For an example, refer to this resolving duplicate alias video.
index.lifecycle.rollover_alias [x] does not point to index [y]
Either the index is using the wrong alias or the alias does not exist.
Check the index.lifecycle.rollover_alias index setting. To see what aliases are configured, use _cat/aliases.
For an example, refer to this resolving alias not pointing to index video.
setting [index.lifecycle.rollover_alias] for index [y] is empty or not defined
The index.lifecycle.rollover_alias setting must be configured for the rollover action to work.
Update the index settings to set index.lifecycle.rollover_alias.
For an example, refer to this resolving empty or not defined video.
alias [x] has more than one write index [y,z]
Only one index can be designated as the write index for a particular alias.
Use the aliases API to set is_write_index:false for all but one index.
For an example, refer to this resolving more than one write index video.
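As an illustration of that fix, the request body for the aliases API can be assembled like this. This is a hypothetical sketch; the alias `logs` and the index names are placeholders, not values from your cluster:

```python
import json

def single_write_index_actions(alias, indices, keep):
    """Build an _aliases request body that leaves only `keep` as the
    write index for `alias` and sets is_write_index: false on the rest."""
    return {
        "actions": [
            {"add": {"index": idx, "alias": alias, "is_write_index": idx == keep}}
            for idx in indices
        ]
    }

# e.g. alias [logs] has more than one write index [logs-000001, logs-000002]
body = single_write_index_actions(
    "logs", ["logs-000001", "logs-000002"], keep="logs-000002"
)
print(json.dumps(body, indent=2))
```

POST the resulting body to the `_aliases` endpoint to apply the change.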
index name [x] does not match pattern ^.*-\d+
The index name must match the regex pattern ^.*-\d+ for the rollover action to work. The most common problem is that the index name does not contain trailing digits. For example, my-index does not match the pattern requirement.
Append a numeric value to the index name, for example my-index-000001.
For an example, refer to this resolving does not match pattern video.
The following errors usually surface during shard recovery, which can occur when you use ILM migrate operations or ILM searchable snapshots. Because these operations run asynchronously, the error reported by ILM often shows only a symptom of the real problem. To troubleshoot the underlying cause, refer to the cluster allocation explain API examples.
index has a preference for tiers [xxx] and node does not meet the required [xxx] tier
If the allocation explain API returns this error, it indicates that shards cannot be assigned according to the current attribute-based or data tier allocation rules. For detailed guidance on resolving this issue, refer to Unable to assign shards based on the allocation rule.
The following errors can surface on any ILM step.
CircuitBreakingException: [x] data too large, data for [y]
This indicates that the cluster is hitting resource limits.
Before continuing to set up ILM, you’ll need to take steps to alleviate the resource issues. For more information, see Circuit breaker errors.
high disk watermark [x] exceeded on [y]
This indicates that the cluster is running out of disk space. This can happen when you don’t have index lifecycle management set up to roll over from hot to warm nodes. For more information, see Watermark errors.
security_exception: action [<action-name>] is unauthorized for user [<user-name>] with roles [<role-name>], this action is granted by the index privileges [manage_follow_index,manage,all]
ILM runs each action as the user who last modified the policy, with the privileges they held at that time. This error means the action requires privileges that user doesn't have.
To fix it, make sure the account that creates or modifies the policy has the necessary permission for every operation it includes. If this error surfaces on system indices, refer to File-based access recovery.
policy [<policy-name>] does not exist
The error occurs because the index is assigned to an ILM policy that does not exist in the cluster. To fix this, you can either create the missing policy with the required settings or link the index to an existing ILM policy.