Task queue backlog

A backlogged task queue can lead to rejected requests or an unhealthy cluster state. Contributing factors include uneven or resource-constrained hardware, a large number of tasks triggered at the same time, expensive tasks that consume high CPU or induce high JVM memory pressure, and long-running tasks.

To identify the cause of the backlog, try these diagnostic actions.

Use the cat thread pool API to monitor active threads, queued tasks, rejections, and completed tasks:

GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,pool_size,active,queue_size,queue,rejected,completed

A few notes on interpreting these thread pool metrics:

  • The active and queue statistics are point-in-time.
  • The rejected and completed statistics are cumulative from node start-up.
  • The thread pool fills active threads until it reaches the pool_size, then queues tasks until it reaches the queue_size, after which it rejects requests.

There are a number of things that you can check as potential causes for the queue backlog:

  • Look for continually high queue metrics, which indicate long-running tasks or CPU-expensive tasks.
  • Look for bursts of elevated queue metrics, which indicate opportunities to spread traffic volume.
  • Determine whether thread pool issues are specific to a node role.
  • Check whether a specific node's thread pool is depleting faster than its data tier peers. This might indicate hot spotting; see the example request below.
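
To run these checks against a single thread pool, you can narrow the cat thread pool request to a named pool. The search pool below is only an example; substitute whichever pool shows the backlog:

GET /_cat/thread_pool/search?v=true&h=node_name,name,active,queue,rejected,completed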

If a particular thread pool queue is backed up, periodically poll the CPU-related APIs to gauge task progression versus resource constraints:
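
For example, the nodes hot threads API and the cat nodes API both report per-node CPU activity; the column selection below is just one reasonable choice:

GET /_nodes/hot_threads
GET /_cat/nodes?v=true&s=cpu:desc&h=name,cpu,load_1m,load_5m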

If CPU usage is consistently elevated or a hot thread's stack trace does not rotate over an extended period, investigate high CPU usage.

Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a task management API response to identify any overlap with specific tasks. For example, if hot threads suggest the node is spending time in search, filter the Task Management API for search tasks.

Long-running tasks can also cause a backlog. Use the task management API to check for excessive running_time_in_nanos values:

GET /_tasks?pretty=true&human=true&detailed=true

You can filter on a specific action, such as bulk indexing or search-related tasks, and you can also scope the request to specific nodes.

  • Filter on bulk index actions:

    GET /_tasks?human&detailed&actions=indices:*write*
    GET /_tasks?human&detailed&actions=indices:*write*&nodes=<YOUR_NODE_ID_OR_NAME_HERE>
  • Filter on search actions:

    GET /_tasks?human&detailed&actions=indices:*search*
    GET /_tasks?human&detailed&actions=indices:*search*&nodes=<YOUR_NODE_ID_OR_NAME_HERE>

Long-running tasks might need to be canceled.

Refer to this video for a walkthrough of how to troubleshoot the task management API output.

You can also check the Tune for search speed and Tune for indexing speed pages for more information.

Use the cat pending tasks API to identify delays in cluster state synchronization:

GET /_cat/pending_tasks?v=true

Cluster state synchronization is expected to fall behind while a cluster is unstable, but otherwise a persistent backlog usually indicates an unworkable cluster setting override or traffic pattern.
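
If you prefer JSON output, the cluster pending tasks API returns the same queue, including how long each task has been waiting:

GET /_cluster/pending_tasks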

There are a few common source issues to check for.

If you're not present during an incident to investigate backlogged tasks, you might consider enabling slow logs to review later.

For example, you can review slow search logs later using the search profiler, so that time-consuming requests can be optimized.
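
As a sketch, slow logs are enabled per index through the index settings API. The index name and thresholds below are placeholders; tune them to your workload:

PUT /my-index-000001/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}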

As noted above, task backlogs are frequently due to uneven or resource-constrained hardware, bursts of tasks triggered at the same time, CPU- or memory-expensive tasks, and long-running tasks.

Many of these can be investigated in isolation as unintended traffic-pattern or configuration changes. Refer to the following recommendations to address repeat or long-standing symptoms.

If an individual task is backing up a thread pool queue due to high CPU usage, try canceling the task and then optimizing it before retrying.

This problem can surface due to a number of possible causes:

  • Creating new tasks or modifying scheduled tasks which either run frequently or are broad in their effect, such as index lifecycle management policies or rules
  • Performing traffic load testing
  • Doing extended look-backs, especially across data tiers
  • Searching or performing bulk updates to a high number of indices in a single request

If an active task’s hot thread shows no progress, consider canceling the task if it's flagged as cancellable.

If you consistently encounter cancellable tasks running longer than expected, review the requests behind them.

For example, you can use the task management API to identify and cancel searches that consume excessive CPU time.

GET _tasks?actions=*search&detailed

The response description contains the search request and its queries. The running_time_in_nanos parameter shows how long the search has been running.

{
  "nodes" : {
    "oTUltX4IQMOUUVeiohTt8A" : {
      "name" : "my-node",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "tasks" : {
        "oTUltX4IQMOUUVeiohTt8A:464" : {
          "node" : "oTUltX4IQMOUUVeiohTt8A",
          "id" : 464,
          "type" : "transport",
          "action" : "indices:data/read/search",
          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
          "start_time_in_millis" : 4081771730000,
          "running_time_in_nanos" : 13991383,
          "cancellable" : true
        }
      }
    }
  }
}

To cancel this example search to free up resources, you would run:

POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel

For additional tips on how to track and avoid resource-intensive searches, see Avoid expensive searches.
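
One related guardrail is the search.allow_expensive_queries cluster setting, which blocks query types that Elasticsearch classifies as expensive. A minimal sketch of disabling them cluster-wide:

PUT _cluster/settings
{
  "persistent": {
    "search.allow_expensive_queries": false
  }
}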

If a specific node's thread pool is depleting faster than its data tier peers, try addressing uneven node resource utilization, also known as "hot spotting". For details about remedial actions you can take, such as rebalancing shards, refer to the Hot spotting troubleshooting documentation.

By default, Elasticsearch sets the number of processors to the number the operating system reports as available. You can override this behavior by adjusting the value of node.processors, but this advanced setting should be configured only after you've performed load testing.
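
For example, to pin a node to two processors you would set the following in elasticsearch.yml; the value is illustrative and should come from your own load tests:

node.processors: 2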

In some cases, you might need to increase the size of the problematic thread pool. For example, a stuck force_merge thread pool might benefit from a larger size. If the size is automatically calculated as 1 based on the available processors, you can increase it to 2 in elasticsearch.yml, for example:

thread_pool.force_merge.size: 2