================
Host Maintenance
================

All hosts that run Ceph daemons need to support maintenance activity, whether the host
is physical or virtual. Management workflows should therefore provide
a simple and consistent way to meet this operational requirement. This document defines
the maintenance strategy that could be implemented in cephadm and mgr/cephadm.


High Level Design
=================
Placing a host into maintenance follows this workflow:

#. Confirm that removing the host does not impact data availability (the following
   steps assume it is safe to proceed).

   * ``orch host ok-to-stop <host>`` would be used here

#. If the host has OSD daemons, apply ``noout`` to the host subtree to prevent data migration
   from being triggered during the planned maintenance slot.
#. Stop the ceph target (all daemons stop).
#. Disable the ceph target on that host, to prevent a reboot from automatically starting
   ceph services again.


Exiting maintenance is essentially the reverse of the above sequence.
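
The enter and exit sequences can be sketched as ordered command plans. The
Python below is a minimal illustration, not the cephadm implementation: the
function names are hypothetical, the use of ``ceph osd set-group`` and
``ceph osd unset-group`` for the noout step and the ``ceph.target`` unit name
are assumptions, and nothing is executed. The functions only return the
commands in the order they would run.

.. code-block:: python

    from typing import List

    # Assumed unit name; a real deployment may use a per-cluster target.
    CEPH_TARGET = "ceph.target"


    def enter_maintenance_plan(host: str, has_osds: bool) -> List[List[str]]:
        """Return the commands for entering maintenance, in order.

        The ceph commands would run from an admin node, the systemctl
        commands on the host itself (hypothetical split).
        """
        plan = [["ceph", "orch", "host", "ok-to-stop", host]]
        if has_osds:
            # Prevent data migration while the host's OSDs are down.
            plan.append(["ceph", "osd", "set-group", "noout", host])
        plan.append(["systemctl", "stop", CEPH_TARGET])
        # Disabling the target keeps a reboot from restarting the daemons.
        plan.append(["systemctl", "disable", CEPH_TARGET])
        return plan


    def exit_maintenance_plan(host: str, has_osds: bool) -> List[List[str]]:
        """Return the commands for leaving maintenance: the reverse order."""
        plan = [
            ["systemctl", "enable", CEPH_TARGET],
            ["systemctl", "start", CEPH_TARGET],
        ]
        if has_osds:
            plan.append(["ceph", "osd", "unset-group", "noout", host])
        return plan

The availability check has no counterpart on exit, so the exit plan reverses
only the steps that changed state.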

Admin Interaction
=================
The ceph orch command will be extended to support maintenance.

.. code-block::

    ceph orch host enter-maintenance <host> [ --check ]
    ceph orch host exit-maintenance <host>

.. note:: In addition, the host's status should be updated to reflect whether it
   is in maintenance or not.

The 'check' Option
__________________
The ``orch host ok-to-stop`` command focuses on core Ceph daemons (mon, osd, mds) and
provides the first check. However, a Ceph cluster also uses other types of daemons
for monitoring, management and non-native protocol support, which means the
logic will need to consider service impact too. The 'check' option provides
this additional layer to alert the user of service impact to *secondary*
daemons.

The list below shows some of these additional daemons.

* mgr (not included in ok-to-stop checks)
* prometheus, grafana, alertmanager
* rgw
* haproxy
* iscsi gateways
* ganesha gateways

By using the ``--check`` option first, the admin can choose whether to proceed. This
workflow is optional for the CLI user, but could be integrated into the
UI workflow to help less experienced administrators manage the cluster.
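
One way to picture the secondary check is as a filter over the daemons running
on the host. The sketch below is illustrative only: the function name and the
``<type>.<id>`` naming format are assumptions, and the set of types treated as
already covered is taken from the text above.

.. code-block:: python

    from typing import Dict, List

    # Daemon types that the ok-to-stop check already covers.
    PRIMARY_TYPES = {"mon", "osd", "mds"}


    def secondary_impact(daemons: List[str]) -> Dict[str, int]:
        """Count daemons on a host that the ok-to-stop check does not cover.

        ``daemons`` holds names in the form "<type>.<id>" (assumed format).
        """
        impact: Dict[str, int] = {}
        for name in daemons:
            daemon_type = name.split(".", 1)[0]
            if daemon_type not in PRIMARY_TYPES:
                impact[daemon_type] = impact.get(daemon_type, 0) + 1
        return impact

An empty result would mean that only daemons covered by ok-to-stop run on the
host, so no additional service warning is needed.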

By adopting this two-phase approach, a UI-based workflow would look something
like this.

#. User selects a host to place into maintenance

   * orchestrator checks for data **and** service impact

#. If potential impact is shown, the next steps depend on the impact type

   * **data availability** : maintenance is denied, informing the user of the issue
   * **service availability** : user is provided a list of affected services and
     asked to confirm
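
The decision in the second step can be expressed as a small function. This is
a sketch under assumptions: ``CheckResult``, ``next_action`` and the returned
labels are hypothetical names chosen for illustration, not part of any
existing interface.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class CheckResult:
        """Outcome of the combined data and service check (hypothetical)."""
        data_at_risk: bool
        affected_services: List[str] = field(default_factory=list)


    def next_action(result: CheckResult) -> str:
        """Map a check result to what the UI should do next."""
        if result.data_at_risk:
            # Data availability impact always blocks maintenance.
            return "deny"
        if result.affected_services:
            # Service impact is left to the user to accept or reject.
            return "confirm"
        return "proceed"

Data impact is tested first so that a host with both kinds of impact is denied
rather than offered for confirmation.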


Components Impacted
===================
Implementing this capability will require changes to the following:

* cephadm

  * Add maintenance subcommand with the following 'verbs': enter, exit, check

* mgr/cephadm

  * add methods to CephadmOrchestrator for enter/exit and check
  * data gathering would be skipped for hosts in a maintenance state

* mgr/orchestrator

  * add CLI commands to OrchestratorCli which expose the enter/exit and check interaction
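
The data-gathering change in mgr/cephadm amounts to filtering on host status
before each refresh. The following is a minimal sketch, assuming host status is
held as a plain string and that the label ``maintenance`` marks a host in
maintenance. Both the label and the function name are hypothetical.

.. code-block:: python

    from typing import Dict, List

    # Assumed status label for a host that is in maintenance.
    MAINTENANCE = "maintenance"


    def hosts_to_refresh(host_status: Dict[str, str]) -> List[str]:
        """Return the hosts whose data should be gathered.

        Hosts in maintenance are skipped because their daemons are stopped.
        """
        return sorted(
            host for host, status in host_status.items()
            if status != MAINTENANCE
        )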


Ideas for Future Work
=====================
#. When a host is placed into maintenance, the time of the event could be persisted. This
   would allow the orchestrator layer to establish a maintenance window for the task and
   alert if the maintenance window has been exceeded.
#. The maintenance process could support plugins to allow other integration tasks to be
   initiated as part of the transition to and from maintenance. This plugin capability could
   support actions like:

   * alert suppression to 3rd party monitoring framework(s)
   * service level reporting, to record outage windows
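
The maintenance window idea reduces to comparing the persisted entry time with
the current time. The sketch below assumes timezone-aware UTC timestamps. The
function name and parameters are hypothetical, and how the entry time would be
persisted is left open.

.. code-block:: python

    from datetime import datetime, timedelta, timezone
    from typing import Optional


    def window_exceeded(entered_at: datetime, window: timedelta,
                        now: Optional[datetime] = None) -> bool:
        """Return True if the host has been in maintenance longer than ``window``.

        ``now`` can be passed explicitly, which keeps the check testable.
        """
        if now is None:
            now = datetime.now(timezone.utc)
        return now - entered_at > window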