When running a cluster in production, it's important to get early warnings about compromised cluster health. We find the following statistics can provide some insight into cluster health.
Of course, your own needs for your cluster may vary; this is just a general guideline.
- Framework disconnections - Query the master/state.json endpoint and watch for a large number of frameworks falling out of the ACTIVE state, that is, switching to framework.active==false.
- Slave deactivations - Available through master/stats.json; look for a rapid increase in the deactivated_slaves count.
- Long registration queue - Found in master/stats.json; a large number of queued tasks indicates a potential failure and reduced cluster throughput.
- Losing quorum - Review master/state.json for the "quorum" or "leader" flags. Rapid changes here can signal issues.
- Over-provisioning of cluster - Watch the mem_percent, disk_percent, and cpus_percent stats in master/stats.json for a high utilization percentage.
- Task failures - Look for large increases in the task_error, task_failed, and task_lost counters at

These endpoints can be reached at <marathon-ip>/mesos.
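
The checks above can be sketched as a few small Python helpers that work on the parsed JSON from the two endpoints. This is a minimal sketch, not a definitive monitor. It assumes that state.json exposes a top-level `frameworks` list whose entries carry `name` and `active` fields, and that stats.json is a flat object of numeric values keyed by the stat names used in the list above. The helper names, the thresholds, and the `fetch_json` function are hypothetical and only serve the illustration; verify the field names against what your cluster actually returns.

```python
import json
from urllib.request import urlopen

# Stat names are taken from the list above; confirm them against your cluster.
WATCHED_COUNTERS = ("deactivated_slaves", "task_error", "task_failed", "task_lost")
UTILIZATION_STATS = ("mem_percent", "disk_percent", "cpus_percent")


def fetch_json(base_url, path, timeout=5):
    """Fetch and parse one endpoint, e.g. path="master/stats.json".

    base_url is an assumption such as "http://<marathon-ip>/mesos".
    """
    with urlopen(f"{base_url.rstrip('/')}/{path}", timeout=timeout) as resp:
        return json.load(resp)


def inactive_frameworks(state):
    """Return the names of frameworks whose 'active' flag is false.

    Assumes a top-level 'frameworks' list of objects with 'name' and 'active'.
    """
    return [
        fw.get("name", "<unnamed>")
        for fw in state.get("frameworks", [])
        if not fw.get("active", True)
    ]


def counter_increases(previous, current, max_delta):
    """Return {counter: delta} for watched counters that grew by more than max_delta
    between two stats.json samples."""
    alerts = {}
    for name in WATCHED_COUNTERS:
        delta = current.get(name, 0) - previous.get(name, 0)
        if delta > max_delta:
            alerts[name] = delta
    return alerts


def high_utilization(stats, threshold):
    """Return {stat: value} for utilization stats above threshold.

    The threshold must use the same scale as the reported values
    (fraction or percent); that scale is an assumption to check.
    """
    return {
        name: stats[name]
        for name in UTILIZATION_STATS
        if stats.get(name, 0) > threshold
    }
```

A typical use would be to call `fetch_json` for master/stats.json on a fixed interval, keep the previous sample, and pass both samples to `counter_increases`. Comparing consecutive samples rather than absolute values matches the guidance above, which is about rapid increases, not about the totals themselves.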