Generic Cluster Health Checks

When running a cluster in production, it's important to get early warnings about compromised cluster health. We find the following statistics can provide some insight into cluster health.

Of course, your own needs for your cluster may vary; this is just a general guideline.

  1. Framework disconnections - Check the master/state.json endpoint for a large number of frameworks falling out of the ACTIVE state, that is, switching to framework.active==false.
  2. Slave deactivations - Available through master/stats.json; look for a rapid increase in the deactivated_slaves number.
  3. Long registration queue - Found in master/stats.json; a large number of queued tasks indicates a potential failure and reduced cluster throughput.
  4. Losing quorum - Review master/state.json for the "quorum" or "leader" flags. A rapid change here can signal issues.
  5. Over-provisioning of cluster - Watch the mem_percent, disk_percent, and cpus_percent stats in master/stats.json for a high utilization percentage.
  6. Task failures - Look for large increases in the task_error, task_failed, and task_lost counters at metrics/snapshot.
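
Some of the checks above can be sketched in Python once the JSON from each endpoint has been parsed. This is a minimal sketch, not a definitive implementation: the counter and field names come from the list above, but the exact JSON layout (a top-level "frameworks" list whose entries carry "name" and "active"), the assumption that the *_percent values are fractions between 0 and 1, and the 0.9 threshold are all illustrative and should be checked against your own cluster's output.

```python
from typing import Any, Dict, Iterable, List


def inactive_frameworks(state: Dict[str, Any]) -> List[str]:
    """Return the names of frameworks whose 'active' flag is false.

    Assumes state.json holds a top-level 'frameworks' list (an assumption).
    """
    return [
        fw.get("name", "<unnamed>")
        for fw in state.get("frameworks", [])
        if fw.get("active") is False
    ]


def counter_increases(
    previous: Dict[str, Any], current: Dict[str, Any], keys: Iterable[str]
) -> Dict[str, float]:
    """Return how much each named counter grew between two polls."""
    return {k: current.get(k, 0) - previous.get(k, 0) for k in keys}


def high_utilization(stats: Dict[str, Any], threshold: float = 0.9) -> Dict[str, float]:
    """Return the utilization stats that exceed the threshold.

    Assumes the values are fractions in [0, 1]; adjust the threshold
    if your cluster reports them on a 0-100 scale.
    """
    keys = ("cpus_percent", "mem_percent", "disk_percent")
    return {k: stats[k] for k in keys if stats.get(k, 0) > threshold}
```

Calling counter_increases on two successive polls with keys such as deactivated_slaves, task_error, task_failed, and task_lost gives the per-interval growth; what counts as a "rapid" increase depends on the cluster and is left to the caller.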

These endpoints can be reached at <marathon-ip>/mesos.
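
As a rough illustration, the endpoints can be polled with Python's standard library. The helper names endpoint_url and fetch_json are hypothetical, and the sketch assumes the base address includes a scheme (for example "http://" followed by your Marathon IP) and that each endpoint returns JSON without requiring authentication.

```python
import json
from typing import Any
from urllib.request import urlopen


def endpoint_url(base: str, path: str) -> str:
    """Build the full URL for an endpoint under <base>/mesos."""
    return f"{base.rstrip('/')}/mesos/{path.lstrip('/')}"


def fetch_json(base: str, path: str, timeout: float = 5.0) -> Any:
    """Fetch one endpoint and parse its body as JSON.

    Raises urllib.error.URLError if the endpoint cannot be reached.
    """
    with urlopen(endpoint_url(base, path), timeout=timeout) as response:
        return json.load(response)
```

For example, fetch_json(base, "master/stats.json") would return the parsed stats, where base is the address of your own Marathon host.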
