When running a cluster in production, it's important to get early warning when cluster health starts to degrade. We find the following statistics can provide some insight.
Of course, your own needs for your cluster may vary; this is just a general guideline.
- Framework disconnections: Watch the master/state.json endpoint for a large number of frameworks falling out of the ACTIVE state, that is, switching to framework.active==false.
- Slave deactivations: Available through master/stats.json; look for a rapid increase in the deactivated_slaves count.
- Long registration queue: Found in master/stats.json; a large number of queued tasks indicates a potential failure and reduced cluster throughput.
- Losing quorum: Review master/state.json for the "quorum" or "leader" flags. A rapid change here can signal issues.
- Over-provisioning of cluster: Watch the mem_percent, disk_percent, and cpus_percent stats in master/stats.json for high utilization.
- Task failures: Look for large increases in the task_error, task_failed, and task_lost counters at metrics/snapshot.
These endpoints can be reached at <marathon-ip>/mesos.
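
As a rough illustration of how these checks might be automated, here is a minimal Python sketch. It rests on several assumptions that are not from this guide: the helper names and thresholds are hypothetical, the percent stats are assumed to be fractions between 0 and 1, master/state.json is assumed to contain a "frameworks" list whose entries carry "name" and "active" fields, and the stat names are taken as written above (the exact keys may differ between Mesos versions).

```python
import json
from urllib.request import urlopen

# Assumed thresholds, for illustration only; tune them for your cluster.
UTILIZATION_LIMIT = 0.9
MAX_COUNTER_INCREASE = 10

UTILIZATION_KEYS = ("mem_percent", "disk_percent", "cpus_percent")
COUNTER_KEYS = ("deactivated_slaves", "task_error", "task_failed", "task_lost")


def fetch_json(base_url, path):
    """Fetch and decode one endpoint, e.g. path='master/stats.json'."""
    url = f"{base_url.rstrip('/')}/{path}"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def inactive_frameworks(state):
    """Return the names of frameworks reported with active == false."""
    return [
        fw.get("name", "<unnamed>")
        for fw in state.get("frameworks", [])
        if not fw.get("active", True)
    ]


def high_utilization(stats, limit=UTILIZATION_LIMIT):
    """Return the utilization stats that exceed the limit."""
    return {k: stats[k] for k in UTILIZATION_KEYS if stats.get(k, 0) > limit}


def counter_jumps(previous, current, max_increase=MAX_COUNTER_INCREASE):
    """Return counters that grew by more than max_increase between two polls."""
    jumps = {}
    for key in COUNTER_KEYS:
        delta = current.get(key, 0) - previous.get(key, 0)
        if delta > max_increase:
            jumps[key] = delta
    return jumps
```

A monitoring job could call fetch_json on a schedule with the base address given above (adding the URL scheme your deployment uses), keep the previous stats snapshot, and raise an alert whenever any of the three check functions returns a non-empty result. Comparing successive polls rather than absolute values matches the advice above to look for rapid increases.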