Watermark errors
When a data node runs critically low on disk space, its disk-based shard allocation watermarks trigger to protect the node's disk. The default percentage thresholds, a summary of Elasticsearch's response, and the corresponding Elasticsearch log messages are:

| Threshold | Watermark | Elasticsearch response | Log message |
|---|---|---|---|
| 85% | low | Elasticsearch stops allocating shards to the affected node(s), except primary shards of newly-created indices. | `low disk watermark [85%] exceeded on [NODE_ID][NODE_NAME] free: Xgb[X%], replicas will not be assigned to this node` |
| 90% | high | Elasticsearch relocates shards away from the affected node(s). | `high disk watermark [90%] exceeded on [NODE_ID][NODE_NAME] free: Xgb[X%], shards will be relocated away from this node` |
| 95% | flood-stage | Elasticsearch sets all indices with a shard on the affected node(s) to read-only. The write block is automatically removed once disk usage falls below the high watermark. | `flood-stage watermark [95%] exceeded on [NODE_ID][NODE_NAME], all indices on this node will be marked read-only` |
At 75% disk usage, the Elastic Cloud Console displays a red disk indicator for the node to signal elevated usage. This is a visual indicator only: it is not tied to any Elasticsearch watermark or disk-enforcement behavior, and no allocation or write restrictions are applied at this stage.
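To check how close each node is to these thresholds, you can query the cat allocation API. This is a quick sketch; the `h` and `s` parameters are optional column and sort selections and can be adjusted.

GET _cat/allocation?v=true&h=node,shards,disk.percent,disk.used,disk.avail,disk.total&s=disk.percent:desc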
To prevent a full disk, when a node reaches the flood-stage watermark, Elasticsearch blocks writes to any index with a shard on the affected node(s). If the block affects related system indices, Kibana and other Elastic Stack features can become unavailable. For example, the flood-stage block can cause errors such as:
- Kibana's `Kibana Server is not Ready yet` error message.
- Elasticsearch's ingest APIs rejecting requests with HTTP 429 error bodies like:

{
  "reason": "index [INDEX_NAME] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];",
  "type": "cluster_block_exception"
}
The following are some common setup issues leading to watermark errors:
- Sudden ingestion of large volumes of data that consumes disk above peak load testing expectations. Refer to Indexing performance considerations for guidance.
- Inefficient index settings, unnecessary stored fields, and suboptimal document structures can increase disk consumption. Refer to Tune for disk usage for guidance.
- A high number of replicas can quickly multiply storage requirements, as each replica consumes the same disk space as the primary shard. Refer to Index settings for details.
- Oversized shards can make disk usage spikes more likely and slow down recovery and rebalancing. Refer to Size your shards for guidance. A quick way to review index sizes and replica counts is shown after this list.
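For example, to spot indices with high replica counts or oversized shards, you can list indices sorted by on-disk size. This is only a sketch; the column selection is one of several possible views.

GET _cat/indices?v=true&h=index,pri,rep,docs.count,pri.store.size,store.size&s=store.size:desc

You can similarly sort the cat shards API output by its store column to find the largest individual shards.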
AutoOps is a monitoring tool that simplifies cluster management through performance recommendations, resource utilization visibility, and real-time issue detection with resolution paths. Learn more about AutoOps.
To track disk usage over time, enable monitoring using one of the following options, depending on your deployment type:
On Elastic Cloud Hosted and Elastic Cloud Enterprise:
- (Recommended) Enable AutoOps.
- Enable logs and metrics. When logs and metrics are enabled, monitoring information is visible on Kibana's Stack Monitoring page. You can also enable the Disk usage threshold alert to be notified about potential issues.
- From your deployment menu, view the Performance page's disk usage chart.

On other deployment types:
- (Recommended) Enable AutoOps.
- Enable Elasticsearch monitoring. When monitoring is enabled, the information is visible on Kibana's Stack Monitoring page. You can also enable the Disk usage threshold alert to be notified about potential issues.
To verify that shards are moving off the affected node until it falls below the high watermark, use the following Elasticsearch APIs:
- The cluster health API to check `relocating_shards`:

GET _cluster/health

- The cat recovery API to check the count of recovering shards and their migrated bytes percent (`bp`) of total bytes (`tb`):

GET _cat/recovery?v=true&expand_wildcards=all&active_only=true&h=time,tb,bp,top,ty,st,snode,tnode,idx,sh&s=time:desc
If shards remain on the node, keeping it above the high watermark, use the following Elasticsearch APIs:
- The cat shards API to determine which shards are hosted on the node:

GET _cat/shards?v=true

- The cluster allocation explanation API to get an explanation of the chosen shard's allocation status:

GET _cluster/allocation/explain
{
  "index": "my-index-000001",
  "shard": 0,
  "primary": false
}

Refer to Using the cluster allocation API for troubleshooting for guidance on interpreting this output.
You should normally wait for Elasticsearch to rebalance itself. Advanced users who determine that specific shards should move off the node sooner, whether because of forecasted ingestion rates or existing disk usage, can use the cluster reroute API to immediately move a chosen shard to a chosen target node, as in the sketch below.
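For example, a minimal reroute request might look like the following. The index name and node names are placeholders; substitute the shard and nodes identified with the cat shards and allocation explanation APIs above.

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-index-000001",
        "shard": 0,
        "from_node": "node-with-full-disk",
        "to_node": "node-with-free-space"
      }
    }
  ]
}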
To immediately restore write operations, you can temporarily increase disk watermarks and remove the write block.
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.low.max_headroom": "100GB",
"cluster.routing.allocation.disk.watermark.high": "95%",
"cluster.routing.allocation.disk.watermark.high.max_headroom": "20GB",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%",
"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5GB",
"cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5GB"
}
}
PUT */_settings?expand_wildcards=all
{
"index.blocks.read_only_allow_delete": null
}
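To confirm the block has been cleared, you can check that no index still reports the read-only-allow-delete setting. This is a quick check using response filtering; an empty response means no index is blocked.

GET */_settings?expand_wildcards=all&filter_path=**.read_only_allow_delete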
Once a long-term solution is in place, reset or reconfigure the disk watermarks:
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": null,
"cluster.routing.allocation.disk.watermark.low.max_headroom": null,
"cluster.routing.allocation.disk.watermark.high": null,
"cluster.routing.allocation.disk.watermark.high.max_headroom": null,
"cluster.routing.allocation.disk.watermark.flood_stage": null,
"cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
"cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
"cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
}
}
Using the default watermark settings is recommended. Advanced users can override the watermark thresholds and headroom values, but risk leaving too little disk space for background processes such as force merge, a cluster that is not right-sized for its data ingestion rate and index lifecycle management settings, and disk-full errors if disk usage reaches 100%.
To resolve watermark errors permanently, perform one of the following actions:
- Horizontally scale nodes of the affected data tiers.
- Vertically scale existing nodes to increase disk space. Ensure nodes within a data tier are scaled to matching hardware profiles to avoid hot spotting.
- Delete indices using the delete index API, either permanently if the index isn’t needed, or temporarily to later restore from snapshot.
On Elastic Cloud Hosted and Elastic Cloud Enterprise, you might need to temporarily delete indices using the Elasticsearch API Console. This can resolve a status: red cluster health status, which blocks deployment changes. After resolving the issue, you can restore the indices from a snapshot. If you experience issues with this resolution flow, reach out to Elastic Support for assistance.
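As a sketch, deleting an index and later restoring it from a snapshot could look like the following. The index, repository, and snapshot names are placeholders, and a recent snapshot of the index must exist before you delete it.

DELETE my-index-000001

POST _snapshot/my_repository/my_snapshot/_restore
{
  "indices": "my-index-000001"
}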
To reduce the likelihood of watermark errors:
- Enable Autoscaling to automatically adjust resources based on storage and performance needs.
- Implement more restrictive index lifecycle management policies that move data through data tiers sooner, to help keep higher tiers' disk usage under control (a minimal example follows this list).
- Avoid a mix of overly large and small indices which can cause an unbalanced cluster. Refer to Size your shards.
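For example, a minimal ILM policy that rolls indices over and deletes them after a retention window might look like this. The policy name, rollover thresholds, and retention period are placeholders to adapt to your own requirements.

PUT _ilm/policy/my-restrictive-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}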