Troubleshoot upgrades

Elasticsearch upgrades usually proceed smoothly when upgrade planning and preparation are done with due diligence.

To avoid the majority of the errors discussed below, resolve all Upgrade Assistant critical items before you begin upgrading. For more information, refer to Troubleshoot Upgrade Assistant.

If you suspect an issue while monitoring your upgrade, inspect its progress using the following outline. Then, based on your findings, review the most common errors and their resolutions compiled below.

Monitor upgrade

Elasticsearch supports running two versions during a rolling upgrade, from an earlier version to a later version. It never supports running more than two versions, and it does not support running two versions beyond the duration of the rolling upgrade.

Assuming that the Elasticsearch configuration is uniform across nodes apart from node role designations, and that the node upgrade order is respected, the majority of rolling upgrade errors surface when the first node upgrades.

You can monitor the rolling upgrade at a high level by checking the list of cluster nodes, and at a low level by tailing the restarting node's logs.

Poll cluster nodes

To monitor which nodes have been upgraded, use the CAT nodes API:

				GET _cat/nodes?v=true&h=name,ip,role,master,version,uptime&s=uptime

For an example three-node cluster, the first node's upgrade could appear as follows.

All nodes report in the cluster.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
instance-0000000001   10.42.1.10  himrst -      8.19.x  20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d

As the node shuts down, it stops syncing with the elected master.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
instance-0000000001
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d

The elected master removes the node from the cluster, and the node no longer appears in the list.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d

After the node starts back up and rejoins the cluster, it reports in the cluster again.

name                  ip          role   master version uptime
instance-0000000001   10.42.1.10  himrst -      9.x.x   5s
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d

If a node does not rejoin the cluster, inspect its restart logs.
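The polling described in this section can be sketched in Python. This is a minimal illustration, not an official client: it assumes the CAT nodes API is reachable without authentication at a hypothetical base URL, and it requests `format=json` so the rows can be parsed. The helper names `fetch_nodes` and `summarize_upgrade` are made up for this example.

```python
import json
import urllib.request
from collections import Counter


def fetch_nodes(base_url):
    """Fetch the CAT nodes listing as JSON.

    base_url is a hypothetical cluster address; authentication and TLS
    settings are omitted and would need to be added for a real cluster.
    """
    url = (
        base_url.rstrip("/")
        + "/_cat/nodes?format=json&h=name,ip,role,master,version,uptime&s=uptime"
    )
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def summarize_upgrade(rows, expected_names):
    """Count nodes per reported version and list expected nodes not in the cluster."""
    present = {row["name"]: row.get("version") for row in rows}
    missing = sorted(set(expected_names) - set(present))
    versions = Counter(version for version in present.values() if version)
    return {"versions": dict(versions), "missing": missing}
```

For example, calling `summarize_upgrade(fetch_nodes(base_url), expected_names)` in a loop shows how many nodes report each version and which expected nodes are currently out of the cluster.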

Check node logs

While the node is restarting, you can tail its logs for information related to its upgrade and restart.

You would commonly filter for logs specific to discovery and cluster formation events. For example:

From an attached monitoring cluster, you could apply a Lucene filter in Discover for .monitoring*:

		"node-join" OR "node-left" OR "master node changed" OR "elected-as-master" OR exitcode OR initializing OR fatal OR "publish_address"
		
	

From the host's terminal, you could tail the Elasticsearch log and pipe it through a grep filter:

		grep -Ei 'node-join|node-left|master node changed|elected-as-master|exitcode|initializing|fatal|publish_address'
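Where grep is not available, the same filter can be approximated in Python. This is a minimal sketch that reuses the terms from the grep expression above; the function name `filter_lines` is hypothetical, and the input can be any iterable of log lines.

```python
import re

# Same alternation as the grep -Ei expression, matched case-insensitively.
DISCOVERY_PATTERN = re.compile(
    r"node-join|node-left|master node changed|elected-as-master|"
    r"exitcode|initializing|fatal|publish_address",
    re.IGNORECASE,
)


def filter_lines(lines):
    """Return only the lines that mention discovery or cluster formation events."""
    return [line for line in lines if DISCOVERY_PATTERN.search(line)]
```

You could pass an open log file object to `filter_lines`, since iterating a file yields its lines.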
		
	

Based on your findings, review the most common errors and their resolutions under Common issues.

Rolling upgrade considerations

During a rolling upgrade, the cluster continues to operate normally.

New functionality is either inactive or operates in a backward-compatible mode until the last earlier-version node leaves the cluster. It becomes operational when all nodes in the cluster are running the later version.

Usually, earlier-version nodes only leave the cluster when you shut them down to upgrade them. In this case, the last earlier-version node leaves the cluster when there are no more nodes to upgrade.

The following sections outline edge cases and their impact, where:

One or more nodes unexpectedly leave the cluster.
Nodes leave the cluster outside the expected upgrade order.
The cluster was not architected to be highly available.

Unexpected node disconnect

Cluster fault detection might cause an earlier-version node to leave the cluster, temporarily or permanently (until you intervene), before you purposely shut it down. You should normally recover the node into the cluster before continuing the rolling upgrade.

Note (applies to Elastic Cloud Enterprise and Elastic Cloud Hosted)

A node that is unexpectedly out of the cluster during a rolling upgrade can cause the platform to stall the upgrade to avoid data loss. If this occurs, the Deployment Activity's Elasticsearch plan step "Performing a rolling change" reports the status "Waiting until cluster recovers" with only a subset of the expected node count.

Premature cluster version update

If all the remaining earlier-version nodes unexpectedly leave the cluster during an upgrade, the cluster will:

consider itself to be fully-upgraded
automatically activate new functionality
leave its backward-compatible mode

Once that has happened, there is no way to return the cluster to a state that is compatible with the earlier version nodes.

Nodes running the earlier version will not be able to join this fully-upgraded cluster. Their Elasticsearch logs will report failed-to-join issues with "Caused by" errors like:

node version [x.x.x] may not join a cluster comprising only nodes of version [y.y.y] or greater
node with version [x.x.x] may not join a cluster with minimum version [y.y.y]
node with system index mappings versions [y.y.y] may not join a cluster with minimum system index mappings versions [x.x.x]
handshake with [NODE_ID] failed: remote node version [x.x.x] is incompatible with local node version [y.y.y]
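To spot these messages in a node's log, you could match on their fixed wording. The following Python sketch builds its patterns from the four messages listed above; the function name `is_version_join_failure` is hypothetical, and the exact wording may differ between Elasticsearch versions.

```python
import re

# One pattern per message shape listed above; bracketed values are wildcards.
JOIN_FAILURE_PATTERNS = [
    re.compile(r"may not join a cluster comprising only nodes of version \[[^\]]+\] or greater"),
    re.compile(r"may not join a cluster with minimum version \[[^\]]+\]"),
    re.compile(r"may not join a cluster with minimum system index mappings versions \[[^\]]+\]"),
    re.compile(r"remote node version \[[^\]]+\] is incompatible with local node version \[[^\]]+\]"),
]


def is_version_join_failure(line):
    """Return True if the log line matches one of the version-related join failures."""
    return any(pattern.search(line) for pattern in JOIN_FAILURE_PATTERNS)
```

A match suggests that the node is still on the earlier version and needs to be upgraded before it can rejoin.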

Elasticsearch maintains the data in the data paths of the older nodes and will recover the cluster to health using this data after the nodes are fully upgraded. Therefore, to bring these nodes back into the cluster, upgrade them.

Note (applies to Elastic Cloud Enterprise and Elastic Cloud Hosted)

You can re-trigger your deployment upgrade to resume it from where it left off and complete it.

If the node that is out of the cluster causes the Cluster health API to report a status of red, then plans will be blocked for data safety.

If this is the case, contact us with either the:

Elastic Cloud Hosted deployment ID
Elastic Cloud Enterprise diagnostic, run with the --deployments flag for the problematic deployment ID, after attempting to pause and resume the instance

Stopping master-eligible nodes

If you stop half or more of the master-eligible nodes all at once during the upgrade, the cluster will become unavailable due to insufficient voting configurations.

Production environments should have at least three master-eligible nodes for high availability. In a testing or development environment with only one or two master-eligible nodes, you cannot avoid stopping half or more of the master-eligible nodes, so the cluster will always become unavailable at some point during the upgrade.

You must restart all the stopped master-eligible nodes to allow the cluster to re-form. This might cause a premature cluster version update.

Upgrade the master-eligible nodes last to make it less likely that this occurs.
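The "half or more" rule in this section reduces to simple arithmetic. The following Python sketch only encodes that rule as stated here; it does not model how Elasticsearch actually manages its voting configuration, and the function name `remains_available` is hypothetical.

```python
def remains_available(master_eligible_total, stopping_at_once):
    """Return True if fewer than half of the master-eligible nodes are stopped at once."""
    if master_eligible_total < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    if not 0 <= stopping_at_once <= master_eligible_total:
        raise ValueError("cannot stop more nodes than exist")
    # Stopping half or more makes the cluster unavailable.
    return 2 * stopping_at_once < master_eligible_total
```

With three master-eligible nodes, stopping one keeps the cluster available, while with one or two master-eligible nodes, stopping any of them does not.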

Common issues

Restarting nodes can encounter errors that are not specific to upgrades and might otherwise surface during any restart. Most commonly, these are errors:

during start-up due to misconfigured systemctl timeout settings
during start-up due to misconfigured settings tripping bootstrap check failures
during node discovery and cluster formation
from circuit breakers or watermarks due to temporary resource unavailability
related to a lack of high availability architecture

The following sections supplement this list with errors specific to the rolling upgrade period.

Bootstrap checks

The following are bootstrap checks which uniquely surface during rolling upgrades.

Index compatibility

Elasticsearch can read indices created in the previous major version, but not in older ones. When restarting, a node will error while attempting to load metadata for indices from outdated, incompatible versions, with errors like:

The index [index-000001] created in version [y-1.x.x] with current compatibility version [y-1.x.x] must be marked as read-only using the setting [index.blocks.write] set to [true] before upgrading to y+1.z.z.
Cannot start this node because it holds metadata for indices with version [y-1.x.x] with which this node of version [y+1.z.z] is incompatible. Revert this node to version [y.y.y] and delete any indices with versions earlier than [y.0.0] before upgrading to version [y+1.z.z]. If all such indices have already been deleted, revert this node to version [y.y.y] and wait for it to join the cluster to clean up any older indices from its metadata.
cannot upgrade node because incompatible indices created with version [y-1.x.x] exist, while the minimum compatible index version is [y.y.y]. Upgrade your older indices by reindexing them in version [y+1.z.z] first

This error indicates that the Upgrade Assistant was not fully completed during upgrade preparation.

You should revert this node's version upgrade, rejoin it to the cluster at the earlier version, and complete the Upgrade Assistant critical items before you begin upgrading again. Refer also to Troubleshoot Upgrade Assistant.
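The compatibility rule behind these errors can be sketched as a check on major versions. This Python example is a simplification: it assumes plain dotted version strings, ignores special handling such as indices marked read-only, and uses the hypothetical function names `major` and `index_is_compatible`. The Upgrade Assistant remains the authoritative check.

```python
def major(version):
    """Extract the major version number from a dotted version string."""
    return int(version.split(".")[0])


def index_is_compatible(index_created_version, node_version):
    """Return True if the index was created in the node's major version or the one before."""
    created = major(index_created_version)
    target = major(node_version)
    return target - 1 <= created <= target
```

An index for which this returns False against the target version is the kind that would need attention before the upgrade.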

Unknown settings

If the Elasticsearch configuration contains settings that the later version no longer supports, or omits settings that it requires, the node might error on startup like:

unknown setting [X] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
The configuration setting [X] is required

This error indicates that the Review breaking changes preparation step was not sufficiently completed. You will need to resolve all unknown setting startup errors. For the most common examples, refer to Troubleshooting node bootlooping.
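To collect every affected setting in one pass rather than fixing them one restart at a time, you could extract the setting names from the startup log. This Python sketch matches the two message shapes shown above; the function name `find_setting_errors` is hypothetical.

```python
import re

UNKNOWN_SETTING = re.compile(r"unknown setting \[([^\]]+)\]")
REQUIRED_SETTING = re.compile(r"The configuration setting \[([^\]]+)\] is required")


def find_setting_errors(lines):
    """Collect setting names reported as unknown or as required but missing."""
    found = {"unknown": [], "required": []}
    for line in lines:
        found["unknown"].extend(UNKNOWN_SETTING.findall(line))
        found["required"].extend(REQUIRED_SETTING.findall(line))
    return found
```

The resulting names can then be compared against the breaking changes documentation for the later version.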

Shard allocation issues

You might experience shard allocation issues if:

the node upgrade order was not respected
one of the earlier rolling upgrade considerations applies

The following supplements the common allocation issues with errors that surface uniquely during rolling upgrades:

incompatible index versions
- illegal_argument_exception: The index [my_index] was created with version [X.X.X] but the minimum compatible version is [Y.Y.Y]
- java.lang.IllegalStateException: index [my_index] version not supported: X.X.X maximum compatible index version is: Y.Y.Y
incompatible shard versions
- cannot allocate replica shard to a node with version [X.X.X] since this is older than the primary version [Y.Y.Y]

If you encounter any of these, continue the rolling upgrade of your nodes. As more nodes on the expected later version become available, the data will allocate.
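To confirm that an unassigned replica is only waiting for upgraded nodes, you could inspect a cluster allocation explain response. The following Python sketch assumes the response has already been parsed into a dictionary with `node_allocation_decisions`, `node_name`, `deciders`, and `explanation` fields; the function name `version_blocked_nodes` is hypothetical, and the marker text is taken from the replica message above.

```python
VERSION_MARKER = "older than the primary version"


def version_blocked_nodes(explain_response):
    """List nodes that rejected the shard because they run an older version than the primary."""
    blocked = []
    for node in explain_response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if VERSION_MARKER in decider.get("explanation", ""):
                blocked.append(node.get("node_name"))
                break  # one matching decider per node is enough
    return blocked
```

If every node that rejects the shard appears in this list, the shard should allocate once those nodes are upgraded.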

Transform upgrade mode

If upgrade mode is enabled for transforms, you might encounter errors like:

Cannot stop any Transform while the Transform feature is upgrading (408)
Transform task will not be assigned while upgrade mode is enabled.

To exit upgrade mode for transforms, set it to enabled=false.

Machine Learning upgrade mode

If you Set upgrade_mode for machine learning indices, you might encounter unexpected errors after the upgrade like:

You don't have permission to manage Machine Learning jobs. Access to the plugin requires the Machine Learning feature to be visible in this space.
Index migration in progress. Indices related to Machine Learning are currently being upgraded. Some actions will not be available during this time.

To exit upgrade mode for machine learning, set it to enabled=false.
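Exiting upgrade mode can be sketched as a POST request with enabled=false. This Python example only builds the request. It assumes the endpoint paths `_ml/set_upgrade_mode` and `_transform/set_upgrade_mode`, which you should verify against the API reference for your version, and it omits authentication. The function name `build_exit_upgrade_mode_request` and the base URL are hypothetical.

```python
import urllib.request

# Assumed endpoint paths; confirm them in the API reference for your version.
UPGRADE_MODE_PATHS = {
    "ml": "/_ml/set_upgrade_mode",
    "transform": "/_transform/set_upgrade_mode",
}


def build_exit_upgrade_mode_request(base_url, feature):
    """Build a POST request that sets upgrade mode to enabled=false for the feature."""
    if feature not in UPGRADE_MODE_PATHS:
        raise ValueError(f"unknown feature: {feature}")
    url = base_url.rstrip("/") + UPGRADE_MODE_PATHS[feature] + "?enabled=false"
    return urllib.request.Request(url, method="POST")
```

Sending the request, for example with `urllib.request.urlopen`, is left out so that the sketch has no side effects.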