Task queue backlog

A backlogged task queue can prevent tasks from completing and lead to an unhealthy cluster state. Contributing factors include resource constraints, a large number of tasks triggered at once, and long-running tasks.

To identify the cause of the backlog, try these diagnostic actions.

A depleted thread pool can result in rejected requests.

Use the cat thread pool API to monitor active threads, queued tasks, rejections, and completed tasks:

 GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed 
  • Look for high active and queue metrics, which indicate potential bottlenecks and opportunities to reduce CPU usage.
  • Determine whether thread pool issues are specific to a data tier.
  • Check whether a specific node’s thread pool is depleting faster than others. This might indicate hot spotting.
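
If you want a narrower view, the same API accepts a comma-separated list of thread pool names. For example, to check only the write and search pools:

 GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected 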

If a particular thread pool queue is backed up, periodically poll the nodes hot threads API to check whether the threads servicing it are making progress and have sufficient resources:

 GET /_nodes/hot_threads 

Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a task management API response to identify any overlap with specific tasks. For example, if the hot threads response indicates the thread is performing a search query, you can check for long-running search tasks using the task management API.
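
If you have already narrowed the problem to a specific node, you can target that node directly and adjust the sampling window. The node name here is a placeholder:

 GET /_nodes/my-node-1/hot_threads?threads=5&interval=500ms 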

Long-running tasks can also cause a backlog. Use the task management API to check for excessive running_time_in_nanos values:

 GET /_tasks?pretty=true&human=true&detailed=true 

You can filter on a specific action, such as bulk indexing or search-related tasks. These tend to be long-running.

  • Filter on bulk index actions:

     GET /_tasks?human&detailed&actions=indices:data/write/bulk 
  • Filter on search actions:

     GET /_tasks?human&detailed&actions=indices:data/read/search 

Long-running tasks might need to be canceled.
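
To cancel a task, pass its task ID (as reported in the _tasks response) to the task cancellation endpoint. The task ID below is a placeholder:

 POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel 

Only tasks that report cancellable as true in the task listing can be canceled.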

Use the cluster pending tasks API to identify delays in cluster state synchronization:

 GET /_cluster/pending_tasks 

Tasks with a high time_in_queue value are likely contributing to the backlog and might need to be canceled.
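
For reference, the response lists pending cluster state update tasks. The values below are illustrative:

 {
   "tasks": [
     {
       "insert_order": 101,
       "priority": "URGENT",
       "source": "create-index [my-index], cause [api]",
       "time_in_queue_millis": 86,
       "time_in_queue": "86ms"
     }
   ]
 }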

After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.

If tasks are progressing slowly, try reducing CPU usage.

In some cases, you might need to increase the thread pool size. For example, the force_merge thread pool defaults to a single thread. Increasing the size to 2 might help reduce a backlog of force merge requests.
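
Thread pool sizes are static node settings, so a change like this belongs in elasticsearch.yml and takes effect after a node restart:

 thread_pool.force_merge.size: 2 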

If an active task’s hot thread shows no progress, consider canceling the task.

If a specific node’s thread pool is depleting faster than others, try addressing uneven node resource utilization, also known as hot spotting. For details on actions you can take, such as rebalancing shards, see Hot spotting.
