Kibana task manager health monitoring

Warning

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

The Kibana Task Manager has an internal monitoring mechanism to keep track of a variety of metrics, which can be consumed with either the health monitoring API or the Kibana server log.

The health monitoring API provides a reliable endpoint that can be monitored. Consuming this endpoint doesn’t cause additional load; it returns the latest health checks already made by the system. This design lets external monitoring services poll the endpoint at a regular cadence without burdening the system.

Each Kibana instance exposes its own endpoint at:

$ curl -X GET api/task_manager/_health

Monitoring the _health endpoint of each Kibana instance in the cluster is the recommended way to ensure confidence in mission-critical services such as Alerting, Actions, and Reporting.
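As a rough sketch of per-instance monitoring, the following Python snippet polls the endpoint on several instances and records which ones answered. The base URLs you pass in are your own, authentication is omitted, and the helper names (health_url, fetch_health, check_instances) are illustrative and not part of any Kibana client.

```python
import json
from typing import Callable, Dict, Iterable
from urllib.request import Request, urlopen

HEALTH_PATH = "/api/task_manager/_health"  # endpoint path from this page


def health_url(base_url: str) -> str:
    """Build the health endpoint URL for one Kibana instance."""
    return base_url.rstrip("/") + HEALTH_PATH


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """GET the health document from one instance and parse it as JSON.

    Assumption: the endpoint is reachable without credentials. A real
    deployment will usually need an Authorization header.
    """
    request = Request(health_url(base_url), method="GET")
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def check_instances(
    base_urls: Iterable[str],
    fetch: Callable[[str], dict] = fetch_health,
) -> Dict[str, dict]:
    """Query every instance and record either its health body or the failure."""
    results: Dict[str, dict] = {}
    for base_url in base_urls:
        try:
            results[base_url] = {"reachable": True, "health": fetch(base_url)}
        except (OSError, ValueError) as exc:
            # Network errors and timeouts are OSError; malformed JSON is ValueError.
            results[base_url] = {"reachable": False, "error": str(exc)}
    return results
```

Passing a different fetch function makes it possible to add credentials or to test the loop without a running Kibana instance.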

Configuring the monitored health statistics

The health monitoring API monitors the performance of Task Manager out of the box. However, certain performance considerations are deployment-specific, and you can configure them.

A health threshold is the percentage of failed task executions that is tolerated before health is degraded. Once a task exceeds this threshold, a status of warn or error is set on the task type execution. To configure a health threshold, use the xpack.task_manager.monitored_task_execution_thresholds setting. You can apply this setting to all task types in the system, or to a custom task type.

By default, this setting marks the health of every task type as warn when it exceeds 80% failed executions, and as error at 90%. Set this value to a number between 0 and 100. The threshold is hit when the value exceeds this number. To avoid a status of error, set the threshold at 100. To hit error the moment any task fails, set the threshold to 0.
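The threshold rule can be sketched in a few lines of Python. This is an illustration of the "exceeds" comparison described above, not Kibana's implementation. The function name and the "ok" label for a healthy task type are assumptions made for the example.

```python
def execution_status(
    failed_pct: float,
    warn_threshold: float = 80,
    error_threshold: float = 90,
) -> str:
    """Classify a task type by its percentage of failed executions.

    A threshold is hit only when the failure rate strictly exceeds it,
    so 100 can never be exceeded and 0 is exceeded by any failure.
    The "ok" label is a placeholder for the healthy state.
    """
    for name, value in (("warn", warn_threshold), ("error", error_threshold)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name}_threshold must be between 0 and 100")
    if failed_pct > error_threshold:
        return "error"
    if failed_pct > warn_threshold:
        return "warn"
    return "ok"
```

With the defaults, a failure rate of exactly 80% is still healthy, 85% yields warn, and 95% yields error. Setting error_threshold to 100 means the error branch is never taken.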

Create a custom configuration to set lower thresholds for task types you consider critical, such as alerting tasks whose failures you want to detect sooner in an external monitoring service.

xpack.task_manager.monitored_task_execution_thresholds:
  default:
    error_threshold: 70
    warn_threshold: 50
  custom:
    "alerting:.index-threshold":
      error_threshold: 50
      warn_threshold: 0

The default entry sets the system-wide warn threshold at a 50% failure rate, and error at a 70% failure rate.
The custom entry for the alerting:.index-threshold task type sets that type's warn threshold at 0% (which sets a warn status the moment any task of that type fails), and error at a 50% failure rate.
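To make the relationship between the two entries concrete, here is a small Python sketch that picks the thresholds for a given task type. It mirrors the YAML example above as a plain dictionary and assumes that a custom entry replaces the default entry as a whole for that task type. The function name and the second task type are hypothetical.

```python
from typing import Dict

Thresholds = Dict[str, int]

# The YAML example above, expressed as a dictionary.
CONFIG = {
    "default": {"error_threshold": 70, "warn_threshold": 50},
    "custom": {
        "alerting:.index-threshold": {"error_threshold": 50, "warn_threshold": 0},
    },
}


def thresholds_for(task_type: str, config: dict = CONFIG) -> Thresholds:
    """Return the custom thresholds for a task type, else the defaults.

    Assumption: a custom entry is used as-is and is not merged with
    the default entry field by field.
    """
    custom = config.get("custom", {})
    return custom.get(task_type, config["default"])
```

Here thresholds_for("alerting:.index-threshold") returns the 50/0 pair, while any task type without a custom entry falls back to 70/50.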

Consuming health stats

The health API is best consumed using the /api/task_manager/_health endpoint.

Additionally, there are two other ways to consume these metrics:

Debug logging

In self-managed deployments, you can configure health stats to be logged in the Kibana DEBUG logger at a regular cadence. To enable Task Manager debug logging in your Kibana instance, add the following to your kibana.yml:

logging:
  loggers:
    - context: plugins.taskManager
      appenders: [console]
      level: debug

These stats are logged at the interval, in milliseconds, set in your xpack.task_manager.poll_interval setting, which could add substantial noise to your logs. Only enable this level of logging temporarily.
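A quick back-of-the-envelope calculation shows how much noise to expect. The Python sketch below assumes one health entry per poll interval per Kibana instance. The function name and the 3000 ms value are examples only and are not a statement about your configured interval.

```python
def debug_entries_per_hour(poll_interval_ms: int) -> float:
    """Estimate health log entries per hour for one Kibana instance.

    Assumption: exactly one entry is written per poll interval.
    """
    if poll_interval_ms <= 0:
        raise ValueError("poll_interval_ms must be positive")
    milliseconds_per_hour = 60 * 60 * 1000
    return milliseconds_per_hour / poll_interval_ms


# Example: an interval of 3000 ms gives 1200 entries per hour.
print(debug_entries_per_hour(3000))
```

Multiply the result by the number of Kibana instances to estimate the volume for the whole deployment.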

Automatic logging

By default, the health API runs at a regular cadence, and each time it runs, it attempts to self-evaluate its performance. If this self-evaluation detects a potential problem, a message is written to the Kibana server log. In addition, the health API checks how long tasks have waited to start, measured from the time they were scheduled to start. If this delay exceeds a configurable threshold (xpack.task_manager.monitored_stats_health_verbose_log.warn_delayed_task_start_in_seconds), the same message is written to the Kibana server log.

This message looks like:

Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging

If this message appears, set xpack.task_manager.monitored_stats_health_verbose_log.enabled to true in your kibana.yml. This will start logging the health metrics at either a warn or error log level, depending on the detected severity of the potential problem.
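If you collect the Kibana server log elsewhere, a simple text match is enough to notice the message. The Python sketch below filters log lines for its fixed opening sentence. The function name is illustrative, and the layout of the surrounding log line (timestamp, level, logger) is not assumed.

```python
from typing import Iterable, List

# Fixed opening sentence of the message shown above.
MARKER = "Detected potential performance issue with Task Manager"


def find_performance_warnings(log_lines: Iterable[str]) -> List[str]:
    """Return every log line that contains the Task Manager warning."""
    return [line.rstrip("\n") for line in log_lines if MARKER in line]
```

A file object can be passed directly, for example find_performance_warnings(open(path)) with a path to your own log file, because the function only iterates over lines.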