Should "rm -f /var/lib/mesos/slave/meta/slaves/latest" be automated?

Sometimes, when Mesos slave recovery fails, messages similar to the following are displayed:

To remedy this, do the following:
Step 1: Run rm -f /var/lib/mesos/slave/meta/slaves/latest
This ensures that the slave doesn't try to recover old live executors.
Step 2: Restart the slave.
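
For an operator running these steps by hand, Step 1 can be wrapped in a small helper. This is only a minimal sketch: the function name reset_slave_meta is hypothetical, the default directory assumes the work_dir implied by the path above, and the restart command in the trailing comment assumes a service named mesos-slave, which may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper for a manual reset, not part of Mesos itself.
# Removes the "latest" entry under the slave meta directory so that the
# slave does not try to recover old executors on its next start.
reset_slave_meta() {
    # Default matches the path used in Step 1; pass another directory
    # as the first argument if your slave uses a different work_dir.
    meta_dir="${1:-/var/lib/mesos/slave/meta}"
    latest="$meta_dir/slaves/latest"

    # -L also catches a dangling symlink, which -e would miss.
    if [ -e "$latest" ] || [ -L "$latest" ]; then
        rm -f "$latest"
        echo "removed $latest"
    else
        echo "nothing to remove at $latest"
    fi
}

# Step 2 is left to the operator. The service name is an assumption:
#   reset_slave_meta && sudo service mesos-slave restart
```

The helper only defines a function and removes nothing until it is called, so it stays a deliberate manual action.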

Automating this is not recommended, because the error being reported is exceptional. It occurs when slave recovery fails, which usually indicates that either

  • the slave flags were changed, and therefore the executors cannot be recovered, or
  • something unexpected happened, making it impossible to recover the tasks.

When a slave enters a bad state (such as the one above), it will keep exiting on startup until the issue is resolved manually. This also provides a good signal of the slave's health: if the slave is ‘flapping’, you know there’s a problem.

