
Fix index lifecycle management errors

Index lifecycle management (ILM) runs actions asynchronously on your cluster's indices, according to the conditions you define in your policy. ILM phases and actions run sequentially on each index, using the permissions of the user who last edited the ILM policy.

ILM can surface two types of issues:

  • Direct errors: The Elasticsearch API call itself fails.
  • Indirect configuration issues: The API call succeeds, but the intended result doesn't take effect.

This guide explains how to check overall ILM health, investigate individual indices, and resolve common errors.

This section covers the symptoms of stuck tasks and erring tasks, then shows common investigative API commands.

ILM purposely holds an index on a couple of steps for its logic-based and time-based conditions. The following ILM explain API phase/action/step combinations wait:

This page refers to steps other than these as transient steps, where ILM asynchronously applies an operation against the index instead of waiting for a logic-based or time-based condition.

ILM polls for work on an interval basis (default 10m). For more information, refer to ILM phase transitions.

An index moves through its ILM steps as fast as the underlying operation finishes, plus the wait for the next poll. Transient steps that depend on an asynchronous operation can therefore be affected by task backlogs. Common examples:

It's fine if these transient steps appear in the ILM explain API output. But if an index doesn't progress past a step for an extended period, investigate. The cause is often specific to your setup or use case, rather than a cluster problem.

When errors occur, the ILM explain API response includes the following:

  • failed_step, set to the active step name
  • step, set to ERROR
  • is_auto_retryable_error flag, set
  • failed_step_retry_count, incremented

All erring indices automatically run the Retry policy API on each ILM polling interval. During automatic or manual retry:

  • The step resets to the active step description and failed_step is removed.
  • The is_auto_retryable_error flag persists.
  • The failed_step_retry_count persists and increments again if another error is encountered.

Non-erring indices do not report the fields failed_step, is_auto_retryable_error, or failed_step_retry_count. Indices that have recovered from previous errors also drop these temporary fields. This is why the ILM explain API supports the only_errors flag, which returns only indices that are currently failing or retrying a step:

				GET _all/_ilm/explain?human=true&expand_wildcards=all&only_errors=true
		

For troubleshooting, the ILM explain API emits step_info. This field is returned only when further context is available, such as a message for informational context or a reason for errors.
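As a minimal sketch of how these fields fit together, the following Python function scans an already-parsed ILM explain response and collects the indices that are on the ERROR step. The function name erroring_indices and the sample data are illustrative assumptions; only the field names come from the response fields described above.

```python
from typing import Any, Dict, List


def erroring_indices(explain: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize indices whose current ILM step is ERROR.

    `explain` is the parsed JSON body of an ILM explain API response.
    """
    summary = []
    for name, info in explain.get("indices", {}).items():
        if info.get("step") != "ERROR":
            continue
        # step_info is only present when further context is available.
        step_info = info.get("step_info", {})
        summary.append({
            "index": name,
            "failed_step": info.get("failed_step"),
            "retries": info.get("failed_step_retry_count", 0),
            "reason": step_info.get("reason"),
        })
    return summary


# Hypothetical sample shaped like an ILM explain response.
sample_explain = {
    "indices": {
        "index-a": {"managed": True, "phase": "hot", "step": "check-rollover-ready"},
        "index-b": {
            "managed": True,
            "phase": "warm",
            "step": "ERROR",
            "failed_step": "shrink",
            "failed_step_retry_count": 3,
            "step_info": {"type": "illegal_argument_exception", "reason": "example reason"},
        },
    }
}

for entry in erroring_indices(sample_explain):
    print(entry["index"], entry["failed_step"], entry["retries"], entry["reason"])
```

The sketch treats a missing step_info as an empty object, so indices that report no extra context still appear in the summary with a reason of None.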

If ILM cannot automatically resolve the error for this index, execution is halted until the underlying issue with the policy, index, or cluster is resolved. For example, shard migrations might block until Elastic Cloud Autoscaling scales or adds necessary data tiers.

Use the following APIs to check ILM health across all indices.

Elasticsearch's health report API reports stagnating_indices for indices that have spent longer than expected attempting a step:

				GET _health_report/ilm
		

This report's thresholds are controlled by the following cluster settings, which you can inspect with the read cluster settings API:

  • health.ilm.max_retries_per_step (default 100)
  • health.ilm.max_time_on_action (default 1d)
  • health.ilm.max_time_on_step (default 1d)

This report consolidates actionable interventions to consider for your ILM and cluster health.
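To make the thresholds concrete, here is a rough client-side approximation in Python that applies the three default values to a single entry from an ILM explain response. This is an assumption-laden sketch, not the health report's actual evaluation logic: the function name looks_stagnant is hypothetical, and the real indicator may apply additional rules per action or step.

```python
from datetime import timedelta
from typing import Any, Dict

# Defaults taken from the settings listed above.
DEFAULT_MAX_RETRIES_PER_STEP = 100
DEFAULT_MAX_TIME_ON_ACTION = timedelta(days=1)
DEFAULT_MAX_TIME_ON_STEP = timedelta(days=1)


def _millis(delta: timedelta) -> float:
    """Convert a timedelta to milliseconds."""
    return delta.total_seconds() * 1000


def looks_stagnant(
    info: Dict[str, Any],
    now_millis: int,
    max_retries: int = DEFAULT_MAX_RETRIES_PER_STEP,
    max_time_on_action: timedelta = DEFAULT_MAX_TIME_ON_ACTION,
    max_time_on_step: timedelta = DEFAULT_MAX_TIME_ON_STEP,
) -> bool:
    """Approximate check (assumption): flag an index that exceeds any threshold."""
    if info.get("failed_step_retry_count", 0) > max_retries:
        return True
    step_started = info.get("step_time_millis")
    if step_started is not None and now_millis - step_started > _millis(max_time_on_step):
        return True
    action_started = info.get("action_time_millis")
    if action_started is not None and now_millis - action_started > _millis(max_time_on_action):
        return True
    return False
```

Use the health report itself as the source of truth. A sketch like this is only useful for reasoning about why an index might be listed.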

For a high-level summary of all index statuses (not just those needing intervention), use the ILM explain API. To get an overview of every phase/action/step combination, save the API output to ilm_explain.json and process it with jq:

$ cat ilm_explain.json | jq -c '.indices[]|select(.managed==true)|.phase+"/"+.action+"/"+.step' | sort | uniq -c | sort -r
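
If jq isn't available, the same tally can be sketched in Python. The function name summarize_steps and the sample data below are illustrative assumptions; the logic mirrors the jq filter by counting phase/action/step combinations for managed indices only.

```python
from collections import Counter
from typing import Any, Dict


def summarize_steps(explain: Dict[str, Any]) -> Counter:
    """Count phase/action/step combinations across managed indices."""
    counts: Counter = Counter()
    for info in explain.get("indices", {}).values():
        if info.get("managed") is not True:
            continue
        # Missing fields become empty strings, so the key keeps three parts.
        key = "/".join(str(info.get(field, "")) for field in ("phase", "action", "step"))
        counts[key] += 1
    return counts


# Hypothetical sample shaped like an ILM explain response.
sample_explain = {
    "indices": {
        "logs-1": {"managed": True, "phase": "hot", "action": "rollover", "step": "check-rollover-ready"},
        "logs-2": {"managed": True, "phase": "hot", "action": "rollover", "step": "check-rollover-ready"},
        "logs-3": {"managed": True, "phase": "warm", "action": "shrink", "step": "ERROR"},
        "other": {"managed": False},
    }
}

# Print the most common combinations first, like `sort | uniq -c | sort -r`.
for combo, count in summarize_steps(sample_explain).most_common():
    print(count, combo)
```

To run it against a saved response, load the file with json.load and pass the resulting dictionary to summarize_steps.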
		
Tip

For example ILM troubleshooting walkthroughs, refer to

The following example demonstrates troubleshooting ILM for a newly created index. Consider a shrink-index policy that shrinks an index to four shards once it is at least five days old:

				PUT _ilm/policy/shrink-index
					{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 4
          }
        }
      }
    }
  }
}
		

To create an index my-index-000001 that has only two primary shards and apply the ILM policy shrink-index:

				PUT /my-index-000001
					{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index"
  }
}
		

After five days, ILM attempts to run the shrink index API to shrink my-index-000001 from two shards to four shards. Because the shrink action cannot increase the number of shards, this operation fails and ILM moves my-index-000001 to the ERROR step.
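
A mismatch like this can be caught before a policy is applied. The following Python sketch checks a shrink target against a source shard count. The function name shrink_target_problem is hypothetical, and the checks are a simplified assumption: the target must not exceed the source (as in this example), and, per the shrink index API's documented constraint, the target must be a factor of the source. Elasticsearch may enforce further conditions that this sketch does not model.

```python
from typing import Optional


def shrink_target_problem(source_shards: int, target_shards: int) -> Optional[str]:
    """Return a description of the problem, or None if the target looks valid."""
    if target_shards < 1:
        return "target shard count must be at least 1"
    if target_shards > source_shards:
        return "shrink cannot increase the number of shards"
    if source_shards % target_shards != 0:
        return "target shard count must be a factor of the source shard count"
    return None


# The policy in this example asks for 4 shards from a 2-shard index.
print(shrink_target_problem(2, 4))
# A single target shard is a factor of any source shard count.
print(shrink_target_problem(2, 1))
```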

Use the ILM explain API to get information about what went wrong:

				GET /my-index-000001/_ilm/explain
		

The API returns the following information:

{
  "indices" : {
    "my-index-000001" : {
      "index" : "my-index-000001",
      "managed" : true,
      "index_creation_date_millis" : 1541717265865,
      "time_since_index_creation": "5.1d",
      "policy" : "shrink-index",
      "lifecycle_date_millis" : 1541717265865,
      "age": "5.1d",
      "phase" : "warm",
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "the number of target shards [4] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "shrink-index",
        "phase_definition" : {
          "min_age" : "5d",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 4
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}
		
  1. The policy being used to manage the index: shrink-index
  2. The index age: 5.1 days
  3. The phase the index is currently in: warm
  4. The current action: shrink
  5. The step the index is currently in: ERROR
  6. The step that failed to run: shrink
  7. The type of error and a description of that error.
  8. The definition of the current phase from the shrink-index policy

To resolve this, update the policy to shrink the index to a single shard after 5 days:

				PUT _ilm/policy/shrink-index
					{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}
		

After resolving the underlying problem, wait for ILM's poll interval to automatically retry the index's ERROR step, or apply the retry policy API to run it on demand:

				POST /my-index-000001/_ilm/retry
		

ILM then attempts to re-run the step that failed. You can use the ILM explain API to monitor the progress.

The following behaviors come up often when troubleshooting ILM. For more details, refer to the ILM guide or contact us.

When setting up an ILM policy or automating rollover with ILM, be aware that min_age can be relative to either the rollover time or the index creation time.

If you use ILM rollover, min_age is calculated relative to the time the index was rolled over. This is because the rollover API generates a new index and updates the age of the previous index to reflect the rollover time. If the index hasn't been rolled over, its age is measured from the index's creation_date.

You can override how min_age is calculated using the index.lifecycle.origination_date and index.lifecycle.parse_origination_date ILM settings.
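
The effect of rollover on min_age can be illustrated with a short Python sketch. The function names and the day-based units are assumptions made for illustration, and the sketch deliberately ignores the origination date settings mentioned above.

```python
from typing import Optional

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def lifecycle_origin_millis(creation_millis: int, rollover_millis: Optional[int]) -> int:
    """Pick the timestamp that index age is measured from.

    Rollover time wins if the index has been rolled over; otherwise the
    creation date is used.
    """
    return rollover_millis if rollover_millis is not None else creation_millis


def min_age_reached(origin_millis: int, min_age_days: float, now_millis: int) -> bool:
    """Return True once the index is at least min_age_days old."""
    return now_millis - origin_millis >= min_age_days * MILLIS_PER_DAY


created = 0
rolled_over = 3 * MILLIS_PER_DAY
now = 6 * MILLIS_PER_DAY

# Without rollover, the index is 6 days old, so a 5d min_age is satisfied.
print(min_age_reached(lifecycle_origin_millis(created, None), 5, now))
# With rollover on day 3, the index is only 3 days old for min_age purposes.
print(min_age_reached(lifecycle_origin_millis(created, rolled_over), 5, now))
```

This is why an index created six days ago can still be waiting on a phase with a min_age of 5d if it rolled over only three days ago.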

ILM does not skip steps because of logic-based or time-based conditions. It proceeds through every step of each enabled action, in order. As a result, an index that stagnates at a phase/action/step such as warm/migrate/check-migration can outlive its expected deletion time. Review and resolve ILM errors to keep your cluster healthy. For more information, refer to ILM phase transitions.

When an index enters a phase, it caches the ILM policy's current definition. For more information, refer to phase execution. This enables ILM to protect the index from policy changes which might cause data corruption.

As described in how changes are applied, ILM applies safe updates to an index's phase_execution immediately. Updates that aren't safe to apply retroactively are forward-applied, taking effect only as indices enter the phase after the update.

You might need to apply a policy change to indices that are already stagnant. It's not possible to run a single ILM step on demand, because doing so might corrupt the index. Instead, apply the relevant changes to those indices manually.

In rare cases, a policy change can leave indices stagnant. The only fix in that situation is the move to step API. This is an advanced API -- contact us with an Elasticsearch diagnostic before using it.

Each entry below shows the message you'll see in the ERROR step, the cause, and the recommended fix. Errors are grouped by the ILM action where they typically occur.

Tip

Problems with rollover aliases are a common cause of errors. You should consider using data streams instead of managing rollover with aliases.

These errors can occur when the ILM rollover action runs:

The following errors usually surface during shard recovery, which can occur when you use ILM migrate operations or ILM searchable snapshots. Because these operations run asynchronously, the error reported by ILM often shows only a symptom of the real problem. To troubleshoot the underlying cause, refer to cluster allocation API examples.

The following errors can surface on any ILM step.