==============
Health Reports
==============


How to Get Reports
==================

In general, there are two channels for retrieving health reports:

ceph (CLI)
   which sends the ``health`` mon command to retrieve the health status of the cluster
mgr module
   which calls ``mgr.get('health')`` to get the same report as a JSON-encoded string

The following diagrams outline the parties involved and how they interact when clients
query for the reports:

.. seqdiag::

   seqdiag {
     default_note_color = lightblue;
     osd; mon; ceph-cli;
     osd  => mon [ label = "update osdmap service" ];
     osd  => mon [ label = "update osdmap service" ];
     ceph-cli  -> mon [ label = "send 'health' command" ];
     mon -> mon [ leftnote = "gather checks from services" ];
     ceph-cli <-- mon [ label = "checks and mutes" ];
   }

.. seqdiag::

   seqdiag {
     default_note_color = lightblue;
     osd; mon; mgr; mgr-module;
     mgr  -> mon [ label = "subscribe for 'mgrdigest'" ];
     osd  => mon [ label = "update osdmap service" ];
     osd  => mon [ label = "update osdmap service" ];
     mon  -> mgr [ label = "send MMgrDigest" ];
     mgr  -> mgr [ note = "update cluster state" ];
     mon <-- mgr;
     mgr-module  -> mgr [ label = "mgr.get('health')" ];
     mgr-module <-- mgr [ label = "health reports in JSON" ];
   }
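
Whichever channel is used, the client ends up with a JSON document that lists
the checks and mutes. The following is a minimal sketch of decoding such a report
in Python. The sample payload and its field names (``status``, ``checks``,
``severity``, ``summary``, ``muted``) are assumptions modelled on typical
``ceph health --format json`` output, and ``summarize_health()`` is a hypothetical
helper, not part of Ceph.

```python
import json

# Hypothetical sample report. The field names are an assumption based on
# typical ``ceph health --format json`` output; verify against your release.
SAMPLE_REPORT = json.dumps({
    "status": "HEALTH_WARN",
    "checks": {
        "OSD_DOWN": {
            "severity": "HEALTH_WARN",
            "summary": {"message": "1 osds down", "count": 1},
            "muted": False,
        },
        "OSDMAP_FLAGS": {
            "severity": "HEALTH_WARN",
            "summary": {"message": "noscrub flag(s) set", "count": 1},
            "muted": True,
        },
    },
    "mutes": [{"code": "OSDMAP_FLAGS", "sticky": False}],
})


def summarize_health(raw):
    """Decode a JSON health report.

    Returns (overall_status, rows), where rows is a list of
    (code, severity, message) tuples for every check that is not muted.
    """
    report = json.loads(raw)
    rows = []
    for code, check in sorted(report.get("checks", {}).items()):
        if check.get("muted", False):
            continue  # muted checks are reported but should not alarm anyone
        message = check.get("summary", {}).get("message", "")
        rows.append((code, check.get("severity", "UNKNOWN"), message))
    return report.get("status", "UNKNOWN"), rows


if __name__ == "__main__":
    status, rows = summarize_health(SAMPLE_REPORT)
    print(status)
    for code, severity, message in rows:
        print(f"{code} [{severity}]: {message}")
```

Inside an mgr module, the raw string would come from the ``mgr.get('health')``
call described above rather than from ``SAMPLE_REPORT``. How that string is
wrapped in the return value may differ between releases, so check the mgr
module API before relying on it.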

Where are the Reports Generated
===============================

Aggregator of Aggregators
-------------------------

Health reports are aggregated from multiple Paxos services:

- AuthMonitor
- HealthMonitor
- MDSMonitor
- MgrMonitor
- MgrStatMonitor
- MonmapMonitor
- OSDMonitor

When persisting the pending changes in its own domain, each of these services identifies
health-related issues and stores them in the monstore under the ``health`` prefix,
using the same transaction. For instance, ``OSDMonitor`` checks a pending new osdmap
for possible issues, such as down OSDs or a missing scrub flag in a pool, and then stores
the encoded form of the health reports along with the new osdmap. These reports are
later loaded and decoded, so they can be collected on demand. ``MDSMonitor``, in turn,
persists the health metrics carried in the beacons sent by the MDS daemons,
and prepares health reports when storing the pending changes.

.. seqdiag::

   seqdiag {
     default_note_color = lightblue;
     mds; mon-mds; mon-health; ceph-cli;
     mds  -> mon-mds [ label = "send beacon" ];
     mon-mds -> mon-mds [ note = "store health metrics in beacon" ];
     mds <-- mon-mds;
     mon-mds -> mon-mds [ note = "encode_health(checks)" ];
     ceph-cli -> mon-health [ label = "send 'health' command" ];
     mon-health => mon-mds [ label = "gather health checks" ];
     ceph-cli <-- mon-health [ label = "checks and mutes" ];
   }

So, if we want to add a new warning related to CephFS, the best place to
start is probably ``MDSMonitor::encode_pending()``, where health reports are collected from
the latest ``FSMap`` and the health metrics reported by MDS daemons.

Note, however, that ``MgrStatMonitor`` does *not* prepare the reports itself;
it simply stores whatever health reports it receives from mgr.
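
The pattern described in this section can be illustrated with a toy model: each
service writes its encoded health checks under a ``health`` key in the same
transaction as its pending state, and the aggregator later loads and merges them
on demand. This is only a sketch in Python, not the monitor's C++ implementation.
Every class and function name below (``MonStore``, ``PaxosServiceModel``,
``OsdServiceModel``, ``RelayServiceModel``, ``gather_checks``) is hypothetical.

```python
import json


class MonStore:
    """Toy key/value store. A transaction is a dict applied in one step."""

    def __init__(self):
        self.data = {}

    def apply_transaction(self, txn):
        self.data.update(txn)


class PaxosServiceModel:
    """Hypothetical stand-in for a Paxos service such as OSDMonitor."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.checks = {}  # decoded health checks from the last load

    def find_checks(self, pending):
        raise NotImplementedError

    def encode_pending(self, pending):
        # The pending state and its health checks share one transaction,
        # so they are persisted together or not at all.
        txn = {
            (self.name, "latest"): json.dumps(pending),
            ("health", self.name): json.dumps(self.find_checks(pending)),
        }
        self.store.apply_transaction(txn)

    def load_health(self):
        raw = self.store.data.get(("health", self.name), "{}")
        self.checks = json.loads(raw)


class OsdServiceModel(PaxosServiceModel):
    """Derives its own checks from the pending map."""

    def find_checks(self, pending):
        down = [osd for osd, up in sorted(pending["osds"].items()) if not up]
        if not down:
            return {}
        return {"OSD_DOWN": {"severity": "HEALTH_WARN",
                             "summary": f"{len(down)} osds down"}}


class RelayServiceModel(PaxosServiceModel):
    """Stores checks prepared elsewhere, as MgrStatMonitor does for mgr."""

    def find_checks(self, pending):
        return pending.get("health_checks", {})


def gather_checks(services):
    """Collect the persisted checks of all services on demand."""
    merged = {}
    for service in services:
        service.load_health()
        merged.update(service.checks)
    return merged


if __name__ == "__main__":
    store = MonStore()
    osd = OsdServiceModel("osdmap", store)
    relay = RelayServiceModel("mgrstat", store)
    osd.encode_pending({"osds": {"osd.0": True, "osd.1": False}})
    relay.encode_pending({"health_checks": {
        "PG_DEGRADED": {"severity": "HEALTH_WARN", "summary": "2 pgs degraded"}}})
    print(sorted(gather_checks([osd, relay])))
```

In this model, adding a new warning means extending ``find_checks()`` of the
service that owns the relevant state, which mirrors the advice above about
starting from ``MDSMonitor::encode_pending()`` for CephFS warnings.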

ceph-mgr -- A Delegate Aggregator
---------------------------------

In Ceph, mgr was created to share the burden of the monitor, whose job is to establish
consensus on the information that is critical to keeping the cluster functioning.
Clearly, the osdmap, mdsmap and monmap fall into this category. But what about the
aggregated statistics of the cluster? They are crucial for the administrator to
understand the status of the cluster, but they might not be as important for keeping
the cluster running. To address this scalability issue, we offloaded the work of
collecting and aggregating the metrics to mgr.

Now, mgr is responsible for receiving and processing the ``MPGStats`` messages from
OSDs. We also developed a protocol that allows a daemon to periodically report its
metrics and status to mgr using ``MMgrReport``. The mgr, in turn, periodically sends
an aggregated report to the ``MgrStatMonitor`` service on mon. As explained earlier,
this service just persists the health reports contained in the aggregated report to the monstore.

.. seqdiag::

   seqdiag {
     default_note_color = lightblue;
     service; mgr; mon-mgr-stat; mon-health;
     service -> mgr [ label = "send(open)" ];
     mgr -> mgr [ note = "register the new service" ];
     service <-- mgr;
     mgr => service [ label = "send(configure)" ];
     service -> mgr [ label = "send(report)" ];
     mgr -> mgr [ note = "update/aggregate service metrics" ];
     service <-- mgr;
     service => mgr [ label = "send(report)" ];
     mgr -> mon-mgr-stat [ label = "send(mgr-report)" ];
     mon-mgr-stat -> mon-mgr-stat [ note = "store health checks in the report" ];
     mgr <-- mon-mgr-stat;
     mon-health => mon-mgr-stat [ label = "gather health checks" ];
     service => mgr [ label = "send(report)" ];
     service => mgr [ label = "send(close)" ];
   }
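
The open/report/close exchange in the diagram can be sketched as a small Python
model. It is a simplification under stated assumptions: ``MgrModel`` and
``MgrStatModel`` are hypothetical names, the ``slow_ops`` metric and the rule that
turns it into a ``SLOW_OPS`` check are illustrative only, and the real message
types (``MMgrReport`` and the aggregated report sent to the monitor) carry far
more than a flat dictionary.

```python
class MgrModel:
    """Hypothetical model of mgr as a delegate aggregator."""

    def __init__(self):
        self.services = {}  # service name -> latest reported metrics

    def handle_open(self, name):
        # Register the new service before accepting reports from it.
        self.services[name] = {}

    def handle_report(self, name, metrics):
        if name not in self.services:
            raise KeyError(f"{name} has not opened a session")
        self.services[name] = dict(metrics)

    def handle_close(self, name):
        self.services.pop(name, None)

    def build_mon_report(self):
        """Aggregate per-service metrics and derive health checks."""
        totals = {}
        for metrics in self.services.values():
            for key, value in metrics.items():
                totals[key] = totals.get(key, 0) + value
        checks = {}
        slow = totals.get("slow_ops", 0)
        if slow > 0:
            # Illustrative rule only; real checks are more involved.
            checks["SLOW_OPS"] = {"severity": "HEALTH_WARN", "count": slow}
        return {"totals": totals, "health_checks": checks}


class MgrStatModel:
    """Hypothetical model of MgrStatMonitor: it stores, it does not compute."""

    def __init__(self):
        self.health_checks = {}

    def handle_mgr_report(self, report):
        self.health_checks = dict(report["health_checks"])


if __name__ == "__main__":
    mgr = MgrModel()
    mon_mgr_stat = MgrStatModel()
    for name in ("osd.0", "osd.1"):
        mgr.handle_open(name)
    mgr.handle_report("osd.0", {"slow_ops": 2})
    mgr.handle_report("osd.1", {"slow_ops": 1})
    mon_mgr_stat.handle_mgr_report(mgr.build_mon_report())
    print(mon_mgr_stat.health_checks)
```

The point of the split is visible in the model: all aggregation happens in
``MgrModel``, while ``MgrStatModel`` only keeps the checks it was handed, so the
monitor stays free for the consensus-critical maps.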