Rebuilding a broken Zookeeper quorum
This article covers an advanced recovery method that involves directly modifying Zookeeper. This process can potentially corrupt your data. Elastic recommends following this outline only after receiving confirmation from Elastic Support.
This article describes how to recover a broken Zookeeper leader or follower within Elastic Cloud Enterprise.
When an ECE director host’s Zookeeper status cannot be confirmed as healthy using the Verify Zookeeper sync status command or from Elastic Cloud Enterprise > Platform > Settings, you might need to recover Zookeeper.
This situation might surface when recovering the Elastic Cloud Enterprise director host from a full disk issue.
A healthy Zookeeper quorum returns a sync status similar to the following. Any other response requires further investigation.
$ # Zookeeper leader with id:10
$ echo mntr | nc 2191
zk_server_state leader
# ...
zk_followers 2
zk_synced_followers 2
$ # Zookeeper follower with id:11
$ echo mntr | nc 2192
zk_server_state follower
$ # Zookeeper follower with id:12
$ echo mntr | nc 2193
zk_server_state follower
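The check described above can be sketched as a small shell helper. This is only an illustration, not part of the Elastic tooling: the function name `zk_mntr_healthy` is hypothetical, and it assumes the `mntr` output uses the whitespace-separated `key value` layout shown in the sample.

```shell
# Hypothetical helper: reads `mntr` output on stdin.
# Returns 0 if the node reports itself as a follower, or as a leader
# whose zk_followers equals zk_synced_followers; returns 1 otherwise.
zk_mntr_healthy() {
  awk '
    $1 == "zk_server_state"     { state = $2 }
    $1 == "zk_followers"        { followers = $2; have_f = 1 }
    $1 == "zk_synced_followers" { synced = $2; have_s = 1 }
    END {
      if (state == "follower") exit 0
      if (state == "leader" && have_f && have_s && followers == synced) exit 0
      exit 1
    }'
}
```

To use it, pipe the output of the `echo mntr | nc` command for a given Zookeeper host into `zk_mntr_healthy` and inspect the exit status. A non-zero status only means the output needs a closer look. It does not identify the cause.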
Before recovering the Zookeeper leader or follower, back up all Elastic Cloud Enterprise hosts' Zookeeper data directories. Normally this is only applicable to director hosts, but may apply to other hosts during migrations.
Perform the following steps on each host to back up the Zookeeper data directory:
Extract the Zookeeper data directory path:
docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' frc-zookeeper-servers-zookeeper | grep --color=auto "zookeeper/data"
Make a copy or backup of the emitted directory. For example, if the data directory is /mnt/data/elastic/MY_IP/services/zookeeper/data, then run the following command:
cp -R /mnt/data/elastic/MY_IP/services/zookeeper/data /mnt/data/elastic/ZK_data_backup
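As a rough sketch, the copy step can be wrapped so that it refuses to run against a missing source directory and never overwrites an earlier backup. The function name `backup_zk_data` and the timestamp suffix on the backup directory are assumptions made for illustration and are not part of the documented procedure.

```shell
# Hypothetical helper: backup_zk_data SRC_DIR DEST_PARENT
# Copies SRC_DIR to DEST_PARENT/ZK_data_backup_<timestamp> and prints the
# path of the new backup. The timestamp suffix is an assumption added here
# so that repeated backups do not collide.
backup_zk_data() {
  src=$1
  dest_parent=$2
  if [ ! -d "$src" ]; then
    echo "not a directory: $src" >&2
    return 1
  fi
  dest="$dest_parent/ZK_data_backup_$(date +%Y%m%d%H%M%S)"
  if [ -e "$dest" ]; then
    echo "backup target already exists: $dest" >&2
    return 1
  fi
  cp -R "$src" "$dest" && echo "$dest"
}
```

Pass the directory emitted by the `docker inspect` command as the first argument and a location with enough free space as the second.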
If a Zookeeper quorum is broken, you must establish the best Zookeeper leader to use for recovery before you start the recovery process.
The simplest way to check is using the Zookeeper sync status command.
If this command is not reporting any leaders, then perform the following actions on each director host:
SSH into the host.
Enter the Docker Zookeeper container and check its /app/logs/zookeeper.log logs for LEADING:
$ docker exec -it frc-zookeeper-servers-zookeeper bash
root@XXXXX:/# cat /app/logs/zookeeper.log | grep 'LEADING'
This command will return results similar to the following:
INFO [QuorumPeer[myid=10](plain=] - LEADING
INFO [QuorumPeer[myid=10](plain=] - LEADING - LEADER ELECTION TOOK - 225 MS
If multiple directors report this log, then determine the one with the latest timestamp, which will contain the latest Zookeeper state.
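To narrow the output to the most recent entry on a host, a minimal sketch is shown below. The function name `latest_leading` is hypothetical. It assumes the log is appended in chronological order, so the last matching line is the newest. The timestamps of the lines from each director still need to be compared by hand.

```shell
# Hypothetical helper: reads a Zookeeper log on stdin and prints only the
# last line containing LEADING. Assumes the log is written in
# chronological order, so the last match is the most recent one.
latest_leading() {
  grep 'LEADING' | tail -n 1
}
```

Inside the container, this could be used as `latest_leading < /app/logs/zookeeper.log`.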
In the following recovery steps, the steps for the determined leader are marked with [leader], and the steps for all other Zookeepers are marked with [followers]. The [leader] should be recovered as needed before its [followers]. Steps marked [followers] should be performed on each follower director host, and steps marked [director] should be performed only on problematic director hosts.
To recover the Zookeeper leader, you should first try to restart the Docker Zookeeper container. Restarting the container is often enough to trigger the leader to resync its connection to its followers.
Within an SSH session on the Zookeeper hosts, run the following command:
docker restart frc-zookeeper-servers-zookeeper
Wait a few minutes for the state to sync across the leader and followers, then verify the Zookeeper sync status to see if the quorum has recovered.
If the Zookeeper leader is still not recovered, proceed to the next section.
If restarting the Zookeeper container does not recover the leader, you can manually set the leader and rebuild the quorum.
[followers] Shut down the Docker Runner and Zookeeper containers:
docker stop frc-runners-runner
docker stop frc-zookeeper-servers-zookeeper
[leader] Stop the Zookeeper service within the Docker container. Note that this stops the service inside the container, not the Zookeeper Docker container itself:
docker exec -it frc-zookeeper-servers-zookeeper sv stop zookeeper
[leader] Enter the Docker Zookeeper container and determine its Zookeeper ID:
$ docker exec -it frc-zookeeper-servers-zookeeper bash
root@XXXXX:/# cat /app/data/myid
10
[leader] In the directory /app/managed/, modify the Zookeeper file replicated.cfg.dynamic:
- Remove the lines referencing other Zookeeper hosts.
- If multiple lines reference
, then remove all but the one containing the Zookeeper ID from the previous step.
[leader] Restart the Docker Zookeeper and Director containers:
docker restart frc-zookeeper-servers-zookeeper
docker restart frc-directors-director
[leader] Check the Zookeeper sync status. The response should now show this director host as the Zookeeper leader.
[leader] Confirm that Elastic Cloud Enterprise is now also able to check the Zookeeper status and make changes.
[followers] Restart the Docker Zookeeper, Director, and Runner containers:
docker restart frc-zookeeper-servers-zookeeper
docker restart frc-directors-director
docker restart frc-runners-runner
Verify that the Zookeeper sync status reports an odd number for
and that no Zookeeper hosts are marked as lost.
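The edit to replicated.cfg.dynamic described in the steps above can be previewed with a filter before the file is changed. This is a sketch only. The function name `keep_own_server` is hypothetical, and it assumes the file uses ZooKeeper's dynamic configuration layout, where each member is listed on a line starting with `server.<id>=`. It prints the filtered result to standard output and does not modify the file.

```shell
# Hypothetical helper: keep_own_server ID < replicated.cfg.dynamic
# Prints the input with every "server.<n>=" line removed except the one
# whose <n> matches ID. Lines that do not start with "server." pass through.
# Assumes ZooKeeper's dynamic config layout (server.<id>=...).
keep_own_server() {
  awk -v id="$1" '
    /^server\./ {
      # Keep the line only if it begins with exactly "server.<id>="
      if (index($0, "server." id "=") == 1) print
      next
    }
    { print }'
}
```

Run it against a copy of the file with the ID read from /app/data/myid, and review the output before applying the same change by hand.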
Zookeeper followers can sometimes refuse a [leader] election or end up with corrupted state. The following steps can be used to recover a broken or corrupted Zookeeper [follower]. These steps should only be considered after confirming a Zookeeper leader, as the [follower] will be reset to copy the state from the [leader].
On the [follower], do the following:
Get the director host’s Zookeeper data directory path:
docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' frc-zookeeper-servers-zookeeper | grep --color=auto "zookeeper/data"
Stop the Docker Runner and Zookeeper containers:
docker stop frc-runners-runner
docker stop frc-zookeeper-servers-zookeeper
Under the determined data directory, remove the sub-directory data/version-NUMBER, replacing the NUMBER placeholder:
/mnt/data/elastic/MY_IP/services/zookeeper/data$ rm -R ./version-NUMBER/
Make sure that
file exists and is retained.
Start the Runner container, which will auto-start the Docker Zookeeper container:
docker start frc-runners-runner
Wait a few minutes for Zookeeper states to sync. Then check the Zookeeper sync status to confirm the following:
zk_server_state follower
zk_outstanding_requests 0
Confirm that the [leader] recognizes the added [follower] by checking the Zookeeper sync status for an incremented zk_synced_followers.
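The follower-side confirmation (zk_server_state follower and zk_outstanding_requests 0) can be expressed as a small check on the sync status output. This is an illustrative sketch. The function name `zk_follower_synced` is hypothetical, and it assumes the same whitespace-separated `key value` output shown earlier in this article.

```shell
# Hypothetical helper: reads `mntr` output on stdin.
# Returns 0 only if the node reports zk_server_state follower and
# zk_outstanding_requests 0; returns 1 in every other case.
zk_follower_synced() {
  awk '
    $1 == "zk_server_state"         { state = $2 }
    $1 == "zk_outstanding_requests" { outstanding = $2; seen = 1 }
    END { exit !(state == "follower" && seen && outstanding == 0) }'
}
```

A non-zero exit status shortly after the restart may only mean that the sync has not finished, so wait and check again before drawing conclusions.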