How do I fix failing Mesos master and worker nodes?

Warning: The following procedures might not fix the problem and could cause fatal DCOS errors in your clusters or services. For more information, please contact Mesosphere support.

  1. Open the AWS EC2 console.
  2. Select the region where you created your cluster.
  3. In the navigation pane, under INSTANCES, click Instances.
  4. Select your failing Mesos instance. The server group type MasterServerGroup is for masters and SlaveServerGroup is for slaves.
    • Tip: To view the server group types, click Show/Hide Columns and select aws:autoscaling:groupName under Your Tag Keys.
  5. For a cluster with a single master node, you can try the following:
    • Reboot your master: select the MasterServerGroup node and click Instance State -> Reboot. With only a single master, if a reboot does not fix the issue, you may need to delete and recreate your DCOS cluster.
  6. For a high-availability (HA) cluster with 3 master nodes, you can try any of the following:
    • Delete failing slave nodes: select the SlaveServerGroup node and click Action -> Instance State -> Terminate.
    • Reboot your master: select the MasterServerGroup node and click Instance State -> Reboot.
    • Delete failing master nodes: select the MasterServerGroup node and click Action -> Instance State -> Terminate. Use caution when deleting masters. Before deleting additional masters, wait for replacements of the terminated instances to come online.
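
If you prefer to script the console steps above, the following is a minimal sketch using the boto3 EC2 client. It is not an official Mesosphere tool. The function names, the us-west-2 region, and the assumption that your instances carry an aws:autoscaling:groupName tag containing MasterServerGroup or SlaveServerGroup are illustrative only; adjust them to match your cluster. It also ignores result pagination, so treat it as a starting point rather than a complete solution. The same warning as above applies: rebooting or terminating nodes might not fix the problem and could cause fatal errors.

```python
"""Sketch: list DCOS Mesos instances by Auto Scaling group tag, then reboot or terminate one."""

GROUP_TAG = "aws:autoscaling:groupName"


def find_instances(ec2, group_fragment):
    """Return (instance_id, state, group_name) tuples for instances whose
    Auto Scaling group tag contains group_fragment, for example
    "MasterServerGroup" or "SlaveServerGroup"."""
    response = ec2.describe_instances(
        Filters=[{"Name": "tag:" + GROUP_TAG, "Values": ["*" + group_fragment + "*"]}]
    )
    found = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            found.append(
                (instance["InstanceId"], instance["State"]["Name"], tags.get(GROUP_TAG, ""))
            )
    return found


def reboot_instance(ec2, instance_id, dry_run=True):
    """Reboot one instance. With dry_run=True, EC2 only checks permissions and
    boto3 raises a ClientError (code DryRunOperation if the call would succeed)."""
    return ec2.reboot_instances(InstanceIds=[instance_id], DryRun=dry_run)


def terminate_instance(ec2, instance_id, dry_run=True):
    """Terminate one instance. Use caution with masters: terminate one at a
    time and wait for its replacement before touching another."""
    return ec2.terminate_instances(InstanceIds=[instance_id], DryRun=dry_run)


def main():
    # Requires boto3 and configured AWS credentials; the region is an assumption.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    for instance_id, state, group in find_instances(ec2, "MasterServerGroup"):
        print(instance_id, state, group)
```

Call main() to list your master instances, then pass an instance ID to reboot_instance or terminate_instance with dry_run=False only once you are sure it is the failing node.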

 
