Troubleshoot upgrades
Elasticsearch upgrades usually proceed smoothly when upgrade planning and preparation are done with due diligence.
To avoid most of the errors discussed below, resolve all Upgrade Assistant critical items before you begin upgrading. For more information, refer to Troubleshoot Upgrade Assistant.
If you suspect an issue while monitoring your upgrade, inspect its progress using the following outline. The most common errors and their resolutions are compiled below for you to review based on your findings.
Elasticsearch supports running two versions during a rolling upgrade, from an earlier version to a later version. It never supports running more than two versions, and it does not support running two versions beyond the duration of the rolling upgrade.
Assuming the Elasticsearch configuration is uniform across nodes apart from node role designations, and that the node upgrade order is respected, most rolling upgrade errors surface when the first node upgrades.
You can monitor the rolling upgrade at a high level by checking the list of cluster nodes, and at a low level by tailing the restarting node's logs.
To monitor which nodes have been upgraded, use the CAT nodes API:
GET _cat/nodes?v=true&h=name,ip,role,master,version,uptime&s=uptime
For an example three node cluster, this first node's upgrade could appear like:

All nodes report in cluster.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
instance-0000000001   10.42.1.10  himrst -      8.19.x  20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d

As the node shuts down, it stops syncing to the elected-master.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
instance-0000000001
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d

The elected-master removes the node from the cluster and it no longer shows.

name                  ip          role   master version uptime
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d

After the node starts back up and rejoins cluster, it again reports in cluster.

name                  ip          role   master version uptime
instance-0000000001   10.42.1.10  himrst -      9.x.x   5s
instance-0000000000   10.42.4.93  himrst *      8.19.x  20d
tiebreaker-0000000003 10.42.0.222 mv     -      8.19.x  20d
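As a minimal sketch of the high-level check above, the following Python parses the text output of the CAT nodes request and tallies the version each node reports. The helper names `parse_cat_nodes`, `version_counts`, and `is_mixed_version` are hypothetical and not part of any Elasticsearch client. The parsing assumes whitespace-separated columns, that the `version` column was requested as in the example, and that a node which has stopped reporting shows only its name.

```python
from collections import Counter


def parse_cat_nodes(text: str) -> list[dict]:
    """Parse `_cat/nodes?v=true` text output into one dict per node row."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        # A node that has stopped reporting can show blank trailing columns;
        # pad the row so the missing values become empty strings.
        fields += [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


def version_counts(rows: list[dict]) -> Counter:
    """Count nodes per reported version; blank versions count as 'unknown'."""
    return Counter(row.get("version") or "unknown" for row in rows)


def is_mixed_version(rows: list[dict]) -> bool:
    """Return True while the nodes report more than one known version."""
    known = {version for version in version_counts(rows) if version != "unknown"}
    return len(known) > 1
```

While `is_mixed_version` returns `True`, at least one node still reports a different version from the others, which is expected until the last node has been upgraded.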
If a node does not rejoin the cluster, inspect its restart logs.
While the node is restarting, you can tail its logs for information related to its upgrade and restart.
You would commonly filter to logs specific to discovery and cluster formation events. For example, from:
- An attached monitoring cluster, you could Lucene filter in Discover for
  .monitoring*:"node-join" OR "node-left" OR "master node changed" OR "elected-as-master" exitcode OR initializing OR fatal OR "publish_address"
- The host's terminal, doing a tail of the Elasticsearch logging with a grep filter:
  grep -Ei 'node-join|node-left|master node changed|elected-as-master|exitcode|initializing|fatal|publish_address'
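If you have already collected the log lines, for example from a downloaded log file, the same filter can be sketched in Python. This is an illustrative equivalent of the grep alternation above, not an Elastic-provided tool, and `filter_discovery_lines` is a hypothetical helper name.

```python
import re
from typing import Iterable, Iterator

# Same alternation as the grep -Ei filter; re.IGNORECASE mirrors the -i flag.
DISCOVERY_PATTERN = re.compile(
    r"node-join|node-left|master node changed|elected-as-master"
    r"|exitcode|initializing|fatal|publish_address",
    re.IGNORECASE,
)


def filter_discovery_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that match the discovery and cluster formation terms."""
    for line in lines:
        if DISCOVERY_PATTERN.search(line):
            yield line.rstrip("\n")
```

The function accepts any iterable of strings, so an open file handle for the node's log can be passed directly. The log file location depends on your installation.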
During a rolling upgrade, the cluster continues to operate normally.
New functionality is either inactive or operates in a backward-compatible mode until the last node running the earlier version leaves the cluster. New functionality becomes operational when all nodes in the cluster are running the later version.
Usually, the earlier version nodes only leave the cluster when you shut them down to upgrade them. In this case, the last earlier version node leaves the cluster when there are no more nodes to upgrade.
The following sections outline edge cases and their impacts, where:
- One or more nodes unexpectedly leave the cluster.
- Nodes leave the cluster out of the expected upgrade order.
- The cluster was not architected to be highly available.
Due to cluster fault detection, an earlier version node might leave the cluster, either temporarily or until you intervene, before you purposely shut it down. You should normally recover the node into the cluster before continuing the rolling upgrade.
A node that is unexpectedly out of the cluster during a rolling upgrade can cause the platform to stall the upgrade to avoid data loss. If this occurs, the Deployment Activity's Elasticsearch plan step "Performing a rolling change" shows the status "Waiting until cluster recovers" and reports only a subset of the expected node count.
If all the remaining earlier version nodes unexpectedly leave the cluster during an upgrade, the cluster will:
- consider itself to be fully upgraded
- automatically activate new functionality
- leave its backward-compatible mode
Once that has happened, there is no way to return the cluster to a state that is compatible with the earlier version nodes.
Nodes running the earlier version will not be able to join this fully upgraded cluster. Their Elasticsearch logs will report "failed to join" issues with "Caused by" errors like:
node version [x.x.x] may not join a cluster comprising only nodes of version [y.y.y] or greater
node with version [x.x.x] may not join a cluster with minimum version [y.y.y]
node with system index mappings versions [y.y.y] may not join a cluster with minimum system index mappings versions [x.x.x]
handshake with [NODE_ID] failed: remote node version [x.x.x] is incompatible with local node version [y.y.y]
Elasticsearch maintains the data in the data paths of the older nodes and will recover the cluster to health using this data after the nodes are fully upgraded. Therefore, to bring these nodes back into the cluster, upgrade them.
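To spot these version-related join failures in collected logs, a minimal Python sketch is shown below. The patterns are transcribed from the example messages above, so they only match messages worded that way, and `match_version_join_failure` is a hypothetical helper name.

```python
import re
from typing import Optional

_BRACKET = r"\[([^\]]+)\]"  # captures the text inside one pair of square brackets

# Patterns transcribed from the example join failure messages.
JOIN_FAILURE_PATTERNS = [
    re.compile(
        rf"node version {_BRACKET} may not join a cluster "
        rf"comprising only nodes of version {_BRACKET} or greater"
    ),
    re.compile(
        rf"node with version {_BRACKET} may not join a cluster "
        rf"with minimum version {_BRACKET}"
    ),
    re.compile(
        rf"node with system index mappings versions {_BRACKET} may not join "
        rf"a cluster with minimum system index mappings versions {_BRACKET}"
    ),
    re.compile(
        rf"remote node version {_BRACKET} is incompatible "
        rf"with local node version {_BRACKET}"
    ),
]


def match_version_join_failure(line: str) -> Optional[tuple[str, str]]:
    """Return the two bracketed versions, in message order, or None if no match."""
    for pattern in JOIN_FAILURE_PATTERNS:
        found = pattern.search(line)
        if found:
            return found.group(1), found.group(2)
    return None
```

A non-`None` result on a node's log line suggests that the node is still on the earlier version and needs to be upgraded before it can rejoin.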
You can re-trigger your deployment upgrade so that it picks up where it left off and completes.
If the node being out of the cluster causes the cluster health API to report a red status, plans are blocked for data safety.
If this is the case, first attempt to pause and resume the instance. If the issue persists, contact us with either the:
- Elastic Cloud Hosted deployment ID
- Elastic Cloud Enterprise diagnostic, flagged with --deployments for the problematic deployment ID
If you stop half or more of the master-eligible nodes all at once during the upgrade, the cluster will become unavailable due to insufficient voting configurations.
Production environments should have at least three master-eligible nodes for high availability. In a testing or development environment with only one or two master-eligible nodes, you cannot avoid stopping half or more of the master-eligible nodes, so the cluster will always become unavailable at some point during the upgrade.
You must restart all the stopped master-eligible nodes to allow the cluster to re-form. This might cause a premature cluster version update.
Upgrade the master-eligible nodes last to make it less likely that this occurs.
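The "half or more" rule above can be illustrated with a short piece of arithmetic. This is a simplified sketch that assumes every master-eligible node is in the voting configuration. The function name `max_master_nodes_to_stop` is hypothetical.

```python
def max_master_nodes_to_stop(master_eligible: int) -> int:
    """Largest number of master-eligible nodes that can be stopped at once
    while still stopping fewer than half of them."""
    if master_eligible < 1:
        raise ValueError("expected at least one master-eligible node")
    return (master_eligible - 1) // 2
```

For example, with three master-eligible nodes the result is 1, so a rolling upgrade that stops one node at a time stays available. With one or two master-eligible nodes the result is 0, which matches the statement that such clusters always become unavailable at some point during the upgrade.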
Restarting nodes can encounter errors that would surface on any restart, not only during an upgrade. Most commonly, these occur:
- during start-up due to misconfigured systemctl timeout settings
- during start-up due to misconfigured settings tripping bootstrap check failures
- during node discovery and cluster formation
- as circuit breaker or watermark errors due to temporary resource unavailability
- due to a lack of high availability architecture
The following supplements this list with errors specific to the rolling upgrade period.
The following are bootstrap checks which uniquely surface during rolling upgrades.
Elasticsearch indices are compatible across sequential major versions. Restarting nodes will fail when they attempt to load metadata for indices from outdated, incompatible versions, with errors like:
The index [index-000001] created in version [y-1.x.x] with current compatibility version [y-1.x.x] must be marked as read-only using the setting [index.blocks.write] set to [true] before upgrading to y+1.z.z.
Cannot start this node because it holds metadata for indices with version [y-1.x.x] with which this node of version [y+1.z.z] is incompatible. Revert this node to version [y.y.y] and delete any indices with versions earlier than [y.0.0] before upgrading to version [y+1.z.z]. If all such indices have already been deleted, revert this node to version [y.y.y] and wait for it to join the cluster to clean up any older indices from its metadata.
cannot upgrade node because incompatible indices created with version [y-1.x.x] exist, while the minimum compatible index version is [y.y.y]. Upgrade your older indices by reindexing them in version [y+1.z.z] first
This error indicates the Upgrade Assistant was not fully completed during upgrade preparation work.
You should revert this node's version upgrade, rejoin it to the cluster at the earlier version, and complete the Upgrade Assistant critical items before you begin upgrading again. Refer also to Troubleshoot Upgrade Assistant.
If the Elasticsearch configuration contains settings that are deprecated or removed in the later version, the node might fail with errors like:
unknown setting [X] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
The configuration setting [X] is required
This error indicates that the "Review breaking changes" preparation step was not fully completed. You will need to resolve all unknown setting startup errors. For the most common examples, refer to Troubleshooting node bootlooping.
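Because every unknown setting has to be resolved, it can help to list them all at once from the startup logs. The following Python is a minimal sketch that assumes the log wording matches the `unknown setting [X]` example above. The helper name `unknown_settings` is hypothetical.

```python
import re

# Captures the setting name inside the brackets of "unknown setting [X]".
UNKNOWN_SETTING = re.compile(r"unknown setting \[([^\]]+)\]")


def unknown_settings(log_lines) -> list[str]:
    """Return the distinct setting names reported as unknown, in first-seen order."""
    seen = {}
    for line in log_lines:
        for name in UNKNOWN_SETTING.findall(line):
            seen.setdefault(name, None)  # dict keys keep insertion order
    return list(seen)
```

Each returned name is a setting to remove or replace in the node's configuration before restarting it.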
You might experience shard allocation issues if:
- the node upgrade order was not respected
- one of the rolling upgrade considerations described earlier is triggered
The following supplements the common allocation issues with errors that uniquely surface during rolling upgrades:
- incompatible index versions
  illegal_argument_exception: The index [my_index] was created with version [X.X.X] but the minimum compatible version is [Y.Y.Y]
  java.lang.IllegalStateException: index [my_index] version not supported: X.X.X maximum compatible index version is: Y.Y.Y
- incompatible shard versions
  cannot allocate replica shard to a node with version [X.X.X] since this is older than the primary version [Y.Y.Y]
If you encounter any of these, continue the rolling upgrade of your nodes. As more nodes on the expected later version become available, the data will allocate.
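The replica allocation rule in the last error message can be sketched as a version comparison. This Python is illustrative only: it assumes plain dotted numeric versions such as 8.19.0 (no suffixes), and both function names are hypothetical.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '8.19.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def replica_can_allocate(node_version: str, primary_version: str) -> bool:
    """A replica cannot be allocated to a node older than the primary's version."""
    return parse_version(node_version) >= parse_version(primary_version)
```

Comparing tuples of integers rather than strings matters here, because a string comparison would wrongly order 8.2.0 after 8.19.0. The check also shows why the error clears as the upgrade continues: once the candidate node is on the later version, the comparison passes.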
The following are common issues that surface after an Elasticsearch upgrade because of unfinished upgrade tasks.
If Kibana does not start after its upgrade, and either remains unavailable or reports "Kibana server is not ready yet", ensure that you re-enabled shard allocation.
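As a hedged sketch of re-enabling shard allocation from a script, the following Python builds a cluster update settings request that resets `cluster.routing.allocation.enable` to its default. It assumes allocation was restricted through a persistent setting during upgrade preparation. The function name and the `base_url` parameter are hypothetical, and authentication and TLS options are omitted because they depend on your deployment.

```python
import json
import urllib.request


def build_reenable_allocation_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) a PUT _cluster/settings request that resets
    cluster.routing.allocation.enable to its default value."""
    # A null value removes the persistent override, restoring the default.
    body = {"persistent": {"cluster.routing.allocation.enable": None}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

The request is only constructed here. To send it, add the credentials your cluster requires and pass the request to `urllib.request.urlopen`.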
If you set upgrade_mode for transform indices, you might encounter unexpected errors after the upgrade, like:
Cannot stop any Transform while the Transform feature is upgrading (408)
Transform task will not be assigned while upgrade mode is enabled.
Update this to enabled=false to exit upgrade mode for transforms.
If you set upgrade_mode for machine learning indices, you might encounter unexpected errors after the upgrade, like:
You don't have permission to manage Machine Learning jobs. Access to the plugin requires the Machine Learning feature to be visible in this space.
Index migration in progress. Indices related to Machine Learning are currently being upgraded. Some actions will not be available during this time.
Update this to enabled=false to exit upgrade mode for machine learning.
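Exiting upgrade mode for either feature is a single POST to that feature's set_upgrade_mode endpoint with enabled=false. The following Python sketch builds that request. The function name, the `feature` keys, and the `base_url` parameter are hypothetical, and authentication is omitted because it depends on your deployment.

```python
import urllib.request

# API path prefix for each feature that has a set_upgrade_mode endpoint.
UPGRADE_MODE_PREFIX = {"transform": "_transform", "ml": "_ml"}


def build_exit_upgrade_mode_request(
    base_url: str, feature: str
) -> urllib.request.Request:
    """Build (but do not send) POST <prefix>/set_upgrade_mode?enabled=false."""
    prefix = UPGRADE_MODE_PREFIX[feature]  # raises KeyError for other features
    url = f"{base_url.rstrip('/')}/{prefix}/set_upgrade_mode?enabled=false"
    return urllib.request.Request(url=url, method="POST")
```

For example, `build_exit_upgrade_mode_request(base_url, "ml")` targets `_ml/set_upgrade_mode?enabled=false`, and passing `"transform"` targets the transform equivalent.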