In the ZooKeeper logs, you will see messages like the one below. You can view this log spew by running the following command:
tail -n 2000 /opt/mesosphere/active/exhibitor/usr/zookeeper/zookeeper.out
2015-12-02 22:49:24,284 [myid:2] - WARN [SendWorker:5:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
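To gauge how often the warning occurs, you can count matching lines in the log instead of reading them by eye. The following is a minimal sketch; the helper name count_sendworker_warnings is hypothetical and not part of DC/OS or ZooKeeper, and it assumes the message text matches the sample line above.

```shell
# Hypothetical helper: count the SendWorker warnings in a ZooKeeper log file.
# The search string is copied from the sample log line above.
count_sendworker_warnings() {
    log_file="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'Interrupted while waiting for message on queue' "$log_file"
}

# Example call, using the log path from this article:
#   count_sendworker_warnings /opt/mesosphere/active/exhibitor/usr/zookeeper/zookeeper.out
```

A count that keeps growing between runs suggests the stale entry is still present.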
You can trace the error to a stale ZooKeeper entry by reviewing the Exhibitor UI. In addition to the log spew above, another symptom is that the stale entry does not go away when you try to delete it; it persists in the Exhibitor UI.
Note also that the znode ID of the stale entry is significantly lower than the others.
Provided the leading ZooKeeper node is healthy, you can bring the cluster back to a healthy state by following the steps below. If the ZooKeeper leader is not healthy, contact Mesosphere support.
Clearing out /var/lib/zookeeper on the nodes allows them to restore their state from the healthy leader. Afterward, the Exhibitor interface will indicate that the nodes are online.
1. Stop Exhibitor - sudo systemctl stop dcos-exhibitor
2. Move zk directory - from /var/lib/zookeeper to /var/lib/zookeeper.old
3. Make new zk directory - sudo mkdir /var/lib/zookeeper
4. Start Exhibitor - sudo systemctl start dcos-exhibitor
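The four steps above can be sketched as a single shell function. This is an illustrative sketch, not an official Mesosphere script: the names run and reset_zk_data_dir and the DRY_RUN variable are assumptions introduced here, and using sudo for the move in step 2 is also an assumption, since the original step does not say so. Run it with DRY_RUN=1 first to review the commands before executing them on a node.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical function covering steps 1-4; stops at the first failing step.
reset_zk_data_dir() {
    zk_dir="${1:-/var/lib/zookeeper}"
    run sudo systemctl stop dcos-exhibitor &&      # 1. Stop Exhibitor
    run sudo mv "$zk_dir" "$zk_dir.old" &&         # 2. Move the old zk directory aside (sudo assumed)
    run sudo mkdir "$zk_dir" &&                    # 3. Make a new, empty zk directory
    run sudo systemctl start dcos-exhibitor        # 4. Start Exhibitor
}

# Example: preview the commands without changing anything.
#   DRY_RUN=1; reset_zk_data_dir
```

Chaining the steps with && means Exhibitor is not restarted if the move or the mkdir fails, which leaves the node in a state you can inspect.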
Once Exhibitor is running again, the Mesos UI should confirm that the cluster is healthy.