Warning: The following procedures might not fix the problem and could cause fatal DCOS errors in your clusters or services. For more information, please contact Mesosphere support.
- Open the AWS EC2 console.
- Select the region where you created your cluster.
- In the navigation pane, under INSTANCES, click Instances.
- Select your failing Mesos instance. The server group type MasterServerGroup is for masters and SlaveServerGroup is for slaves.
- Tip: To view the server group types, click Show/Hide Columns and add the aws:autoscaling:groupName tag under Your Tag Keys.
- Reboot your master: select the MasterServerGroup node and click Actions -> Instance State -> Reboot. If the cluster has only a single master and a reboot does not fix the issue, you may need to delete and recreate your DCOS cluster.
- Delete failing slave nodes: select the SlaveServerGroup node and click Actions -> Instance State -> Terminate.
- Reboot your master: select the MasterServerGroup node and click Actions -> Instance State -> Reboot.
- Delete failing master nodes: select the MasterServerGroup node and click Actions -> Instance State -> Terminate. Use caution when deleting masters: terminate one at a time, and wait for the Auto Scaling group to bring a replacement instance online before deleting any additional masters.
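
The console steps above could also be scripted. The following is a minimal sketch, not an official Mesosphere procedure. It assumes the boto3 library is installed and AWS credentials are configured, and that the aws:autoscaling:groupName tag value contains the server group type (for example MasterServerGroup). The helper names are hypothetical, and the warning at the top of this page applies equally here.

```python
"""Sketch: find, reboot, or terminate DCOS nodes by Auto Scaling group tag."""


def group_filter(group_type):
    # Assumption: the Auto Scaling group name contains the server group
    # type (e.g. "...MasterServerGroup..."), so a wildcard match is used.
    return [
        {"Name": "tag:aws:autoscaling:groupName", "Values": [f"*{group_type}*"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]


def instance_ids(response):
    # Flatten a describe_instances response into a list of instance IDs.
    return [
        inst["InstanceId"]
        for res in response.get("Reservations", [])
        for inst in res.get("Instances", [])
    ]


def list_nodes(region, group_type):
    # boto3 is imported lazily so the helpers above work without it.
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    # Note: results are not paginated here, which is fine for small clusters.
    return instance_ids(ec2.describe_instances(Filters=group_filter(group_type)))


def reboot_node(region, instance_id):
    # Equivalent of Instance State -> Reboot in the console.
    import boto3
    boto3.client("ec2", region_name=region).reboot_instances(InstanceIds=[instance_id])


def terminate_node(region, instance_id):
    # Equivalent of Instance State -> Terminate in the console.
    import boto3
    boto3.client("ec2", region_name=region).terminate_instances(InstanceIds=[instance_id])
```

For example, `list_nodes("us-west-2", "SlaveServerGroup")` would return the IDs of running slave instances in that region (the region name is only an illustration). Pass one ID at a time to `reboot_node` or `terminate_node`, and when terminating masters, confirm that a replacement is online before moving on to the next one.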