Marathon is unable to determine a leader due to immutable znodes in ZooKeeper.

Symptoms:

The ZooKeeper log contains messages like the snippet below. To view the most recent log entries, run the command:

tail -n 2000 /opt/mesosphere/active/exhibitor/usr/zookeeper/zookeeper.out

Snippet:

2015-12-02 22:49:24,284 [myid:2] - WARN [SendWorker:5:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
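To gauge how often the warning occurs rather than scanning the log by eye, a small helper along these lines may be useful. It is only a sketch: the function name count_sendworker_warnings is made up for illustration, and the log path is the one given in this article, so adjust it if your installation differs.

```shell
#!/bin/sh
# Count how many of the last 2000 lines of a ZooKeeper log carry the
# SendWorker warning shown in the snippet above.
# Usage: count_sendworker_warnings /path/to/zookeeper.out
count_sendworker_warnings() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # function's exit status at 0 when the count is zero.
    tail -n 2000 "$1" 2>/dev/null |
        grep -c 'Interrupted while waiting for message on queue' || true
}
```

For example: count_sendworker_warnings /opt/mesosphere/active/exhibitor/usr/zookeeper/zookeeper.out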

You can trace the error to a stale ZooKeeper entry by reviewing the Exhibitor UI. In addition to the log spew above, a further symptom is that the stale entry does not go away when you try to delete it; it persists in the Exhibitor UI.

The ID of the stale znode is also significantly lower than the IDs of the other entries.

Provided the leading ZooKeeper node is healthy, follow the resolution below. Otherwise, contact Mesosphere support.
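This article does not say how to check which node is the leader. One possible approach, offered as a sketch rather than the official procedure, is ZooKeeper's built-in "srvr" four-letter command, whose output includes a "Mode:" line. The helper name zk_mode is hypothetical, and the example assumes that nc is installed, that ZooKeeper listens on client port 2181, and that four-letter commands are enabled on your ZooKeeper version.

```shell
#!/bin/sh
# Extract the "Mode:" value (leader, follower, or standalone) from the
# output of ZooKeeper's "srvr" command, read on standard input.
zk_mode() {
    sed -n 's/^Mode: //p'
}

# Usage on a master node (assumes nc and client port 2181):
#   echo srvr | nc localhost 2181 | zk_mode
```

A node that answers "imok" to echo ruok | nc localhost 2181 is running without an error state, but that alone does not prove the ensemble is healthy. If in doubt, contact Mesosphere support.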

Resolution:

The steps below bring the cluster back into a healthy state, provided the ZooKeeper leader is healthy. If the leader is not in a healthy state, contact Mesosphere support instead of proceeding.

Clearing out /var/lib/zookeeper on the affected nodes allows them to restore their state from the healthy leader. Afterwards, the Exhibitor interface should indicate that the node is online.

1. Stop Exhibitor - sudo systemctl stop dcos-exhibitor

2. Move zk directory - sudo mv /var/lib/zookeeper /var/lib/zookeeper.old

3. Make new zk directory - sudo mkdir /var/lib/zookeeper

4. Start Exhibitor - sudo systemctl start dcos-exhibitor
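The four steps above can be collected into a small script, sketched below. The function names move_zk_data_dir and reset_zk_node and the ZK_DIR and SUDO variables are made up for illustration; only the paths and the dcos-exhibitor unit name come from this article. The sketch refuses to run if a /var/lib/zookeeper.old backup already exists, so an earlier backup is never overwritten.

```shell
#!/bin/sh
# Sketch of steps 1 to 4 above. Run it on one affected node at a time,
# and only while the ZooKeeper leader is healthy.
ZK_DIR="${ZK_DIR:-/var/lib/zookeeper}"   # data directory named in the article
SUDO="${SUDO-sudo}"                      # set SUDO= (empty) when already root

# Steps 2 and 3: keep the old data as a backup, then create an empty
# directory in its place. Fails if a previous backup already exists.
move_zk_data_dir() {
    dir="$1"
    if [ -e "$dir.old" ]; then
        echo "backup $dir.old already exists; move it away first" >&2
        return 1
    fi
    $SUDO mv "$dir" "$dir.old" && $SUDO mkdir "$dir"
}

# Steps 1 to 4 in order. The chain stops at the first failing command,
# so Exhibitor stays stopped if the directory move fails.
reset_zk_node() {
    $SUDO systemctl stop dcos-exhibitor &&
        move_zk_data_dir "$ZK_DIR" &&
        $SUDO systemctl start dcos-exhibitor
}
```

Calling reset_zk_node after sourcing the script performs the whole sequence. Once the node has rejoined and the cluster is confirmed healthy, /var/lib/zookeeper.old can be removed at your discretion.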

Result:

The Mesos UI confirms that the cluster is healthy.
