Task queue backlog

A backlogged task queue can prevent tasks from completing and lead to an unhealthy cluster state. Contributing factors include resource constraints, a large number of tasks triggered at once, and long-running tasks.

To identify the cause of the backlog, try these diagnostic actions.

A depleted thread pool can result in rejected requests.

Use the cat thread pool API to monitor active threads, queued tasks, rejections, and completed tasks:

 GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed 
  • Look for high active and queue metrics, which indicate potential bottlenecks and opportunities to reduce CPU usage.
  • Determine whether thread pool issues are specific to a data tier.
  • Check whether a specific node’s thread pool is depleting faster than others. This might indicate hot spotting.
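
If you want a narrower view, the same API accepts a comma-separated list of thread pool names. For example, to check only the write and search pools:

 GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected 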

If a particular thread pool queue is backed up, periodically poll the nodes hot threads API to check whether the threads servicing it are making progress and have sufficient resources:

 GET /_nodes/hot_threads 

Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a task management API response to identify any overlap with specific tasks. For example, if the hot threads response indicates the thread is performing a search query, you can check for long-running search tasks using the task management API.
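
If you have already narrowed the problem to a specific node, you can target that node directly and adjust the sampling window. The node name here is a placeholder:

 GET /_nodes/my-node-1/hot_threads?threads=5&interval=500ms 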

Long-running tasks can also cause a backlog. Use the task management API to check for excessive running_time_in_nanos values:

 GET /_tasks?pretty=true&human=true&detailed=true 

You can filter on a specific action, such as bulk indexing or search-related tasks. These tend to be long-running.

  • Filter on bulk index actions:

     GET /_tasks?human&detailed&actions=indices:data/write/bulk 
  • Filter on search actions:

     GET /_tasks?human&detailed&actions=indices:data/read/search 

Long-running tasks might need to be canceled.
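
To cancel a task, pass its task ID (as reported in the _tasks response) to the task cancellation endpoint. The task ID below is a placeholder:

 POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel 

Only tasks that report cancellable as true in the task listing can be canceled.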

Use the cluster pending tasks API to identify delays in cluster state synchronization:

 GET /_cluster/pending_tasks 

Tasks with a high time_in_queue value are likely contributing to the backlog and might need to be canceled.
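
For reference, the response lists pending cluster state update tasks. The values below are illustrative:

 {
   "tasks": [
     {
       "insert_order": 101,
       "priority": "URGENT",
       "source": "create-index [my-index], cause [api]",
       "time_in_queue_millis": 86,
       "time_in_queue": "86ms"
     }
   ]
 }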

After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.

If tasks are progressing slowly, try reducing CPU usage.

In some cases, you might need to increase the thread pool size. For example, the force_merge thread pool defaults to a single thread. Increasing the size to 2 might help reduce a backlog of force merge requests.
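
Thread pool sizes are static node settings, so a change like this belongs in elasticsearch.yml and takes effect after a node restart:

 thread_pool.force_merge.size: 2 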

If an active task’s hot thread shows no progress, consider canceling the task.

If a specific node’s thread pool is depleting faster than others, try addressing uneven node resource utilization, also known as hot spotting. For details on actions you can take, such as rebalancing shards, see Hot spotting.
