Task queue backlog

Product: Elasticsearch
Deployment type: Elastic Cloud Enterprise, Elastic Cloud Hosted, Elastic Cloud on Kubernetes, Elastic Self-Managed
Versions: All

A backlogged task queue can prevent tasks from completing and lead to an unhealthy cluster state. Contributing factors include resource constraints, a large number of tasks triggered at once, and long-running tasks.

Diagnose a backlogged task queue

To identify the cause of the backlog, try these diagnostic actions.

Check the thread pool status

A depleted thread pool can result in rejected requests.

Use the cat thread pool API to monitor active threads, queued tasks, rejections, and completed tasks:

    GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed

  • Look for high active and queue metrics, which indicate potential bottlenecks and opportunities to reduce CPU usage.
  • Determine whether thread pool issues are specific to a data tier.
  • Check whether a specific node’s thread pool is depleting faster than others. This might indicate hot spotting.
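
For example, to focus on thread pools that commonly back up, you can scope the cat thread pool API to specific pool names; the write and search pools below are only examples:

    GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected,completed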

Inspect hot threads on each node

If a particular thread pool queue is backed up, periodically poll the nodes hot threads API to gauge the thread’s progression and ensure it has sufficient resources:

    GET /_nodes/hot_threads

Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a task management API response to identify any overlap with specific tasks. For example, if the hot threads response indicates the thread is performing a search query, you can check for long-running search tasks using the task management API.
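
If you already suspect a particular node, you can poll its hot threads directly and tune the sampling. The node name below is a placeholder; threads and interval are standard parameters of the nodes hot threads API:

    GET /_nodes/<node_name>/hot_threads?threads=5&interval=500ms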

Identify long-running node tasks

Long-running tasks can also cause a backlog. Use the task management API to check for excessive running_time_in_nanos values:

    GET /_tasks?pretty=true&human=true&detailed=true

You can filter the list to specific actions, such as bulk indexing or search tasks, which tend to be long-running.

  • Filter on bulk index actions:
        GET /_tasks?human&detailed&actions=indices:data/write/bulk
  • Filter on search actions:
        GET /_tasks?human&detailed&actions=indices:data/read/search

Long-running tasks might need to be canceled.
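
Before canceling anything, it can help to see which parent request spawned a long-running task. The task management API can group its output by parent task:

    GET /_tasks?human&detailed&group_by=parents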

Look for long-running cluster tasks

Use the cluster pending tasks API to identify delays in cluster state synchronization:

    GET /_cluster/pending_tasks

Tasks with a high time_in_queue value are likely contributing to the backlog and might need to be canceled.
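
Each entry in the response reports how long it has been waiting. The field names below come from the pending tasks API; the values are purely illustrative:

    {
      "tasks" : [
        {
          "insert_order" : 101,
          "priority" : "URGENT",
          "source" : "create-index [my-index], cause [api]",
          "time_in_queue_millis" : 86,
          "time_in_queue" : "86ms"
        }
      ]
    }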

Recommendations

After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.

Increase available resources

If tasks are progressing slowly, try reducing CPU usage.

In some cases, you might need to increase the thread pool size. For example, the force_merge thread pool defaults to a single thread. Increasing the size to 2 might help reduce a backlog of force merge requests.
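
Thread pool sizes are static settings, so a change like this goes in elasticsearch.yml (or the equivalent configuration override for your deployment type) and takes effect after a node restart. A minimal sketch for the force_merge example above:

    thread_pool.force_merge.size: 2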

Cancel stuck tasks

If an active task’s hot thread shows no progress, consider canceling the task.
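
Cancellation uses the task management API. The task ID below is a placeholder; use the task identifier (node ID and task number) reported for the task you identified earlier:

    POST /_tasks/<task_id>/_cancel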

Address hot spotting

If a specific node’s thread pool is depleting faster than others, try addressing uneven node resource utilization, also known as hot spotting. For details on actions you can take, such as rebalancing shards, see Hot spotting.
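
As a quick check for uneven utilization, you can compare shard counts and disk usage per node with the cat allocation API; a node carrying noticeably more shards or disk than its peers is a hot-spotting candidate:

    GET /_cat/allocation?v&s=shards:desc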
