.. _mds-standby:

Terminology
-----------

A Ceph cluster may have zero or more CephFS *file systems*. Each CephFS has
a human readable name (set at creation time with ``fs new``) and an integer
ID. The ID is called the file system cluster ID, or *FSCID*.
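
For example (a minimal sketch; output formats vary by release), file system
names can be listed with ``fs ls``, and each file system's FSCID appears in
the ``fs dump`` output:

::

    ceph fs ls     # human-readable file system names
    ceph fs dump   # full fsmap, including each file system's FSCID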

Each CephFS file system has a number of *ranks*, numbered beginning with zero.
By default there is one rank per file system. A rank may be thought of as a
metadata shard. Management of ranks is described in :doc:`/cephfs/multimds` .
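
For example, the number of ranks for a file system is controlled with its
``max_mds`` setting (described in detail in the document linked above; the
file system name below is illustrative):

::

    ceph fs set myfs max_mds 2   # operate two ranks (0 and 1) for this file system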

Each CephFS ``ceph-mds`` daemon starts without a rank. It may be assigned one
by the cluster's monitors. A daemon may only hold one rank at a time, and only
gives up a rank when the ``ceph-mds`` process stops.

If a rank is not associated with any daemon, that rank is considered ``failed``.
Once a rank is assigned to a daemon, the rank is considered ``up``.

Each ``ceph-mds`` daemon has a *name* that is assigned statically by the
administrator when the daemon is first configured. Each daemon's *name* is
typically that of the hostname where the process runs.

A ``ceph-mds`` daemon may be assigned to a specific file system by
setting its ``mds_join_fs`` configuration option to the file system's
``name``.
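
For example (a sketch; the daemon and file system names are illustrative), to
pin the daemon ``mds.b`` to the file system ``myfs``:

::

    ceph config set mds.b mds_join_fs myfs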

When a ``ceph-mds`` daemon starts, it is also assigned an integer ``GID``,
which is unique to this current daemon's process. In other words, when a
``ceph-mds`` daemon is restarted, it runs as a new process and is assigned a
*new* ``GID`` that is different from that of the previous process.
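
A daemon's ``GID`` is visible in the fsmap. In the ``ceph fs dump`` output
shown later in this document, for example, ``mds.a{0:20384}`` denotes the
daemon named ``a`` holding rank 0 with ``GID`` 20384.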

Referring to MDS daemons
------------------------

Most administrative commands that refer to a ``ceph-mds`` daemon (MDS)
accept a flexible argument format that may specify a ``rank``, a ``GID``
or a ``name``.

Where a ``rank`` is used, it may optionally be qualified by
a leading file system ``name`` or ``GID``. If a daemon is a standby (i.e.
it is not currently assigned a ``rank``), then it may only be
referred to by ``GID`` or ``name``.

For example, say we have an MDS daemon with ``name`` 'myhost' and
``GID`` 5446, and which is assigned ``rank`` 0 for the file system 'myfs'
with ``FSCID`` 3. Any of the following are suitable forms of the ``fail``
command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # File System name and rank

Managing failover
-----------------

If an MDS daemon stops communicating with the cluster's monitors, the monitors
will wait ``mds_beacon_grace`` seconds (default 15) before marking the daemon as
*laggy*. If a standby MDS is available, the monitor will immediately replace the
laggy daemon.
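
For example (a sketch; the value shown is illustrative), the grace period can
be raised through the central configuration database to tolerate slower
beacons before failover:

::

    ceph config set global mds_beacon_grace 30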

Each file system may specify a minimum number of standby daemons in order to be
considered healthy. This number includes daemons in the ``standby-replay`` state
waiting for a ``rank`` to fail. Note that a ``standby-replay`` daemon will not
be assigned to take over a failure for another ``rank`` or a failure in a
different CephFS file system. The pool of standby daemons not in ``replay``
counts towards any file system's count.
Each file system may set the desired number of standby daemons by:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.
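
For example, to require at least one available standby for a file system named
``cephfs`` (an illustrative name):

::

    ceph fs set cephfs standby_count_wanted 1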


.. _mds-standby-replay:

Configuring standby-replay
--------------------------

Each CephFS file system may be configured to add ``standby-replay`` daemons.
These standby daemons follow the active MDS's metadata journal in order to
reduce failover time in the event that the active MDS becomes unavailable. Each
active MDS may have only one ``standby-replay`` daemon following it.

Configuring ``standby-replay`` on a file system is done with:

::

    ceph fs set <fs name> allow_standby_replay <bool>

Once set, the monitors will assign available standby daemons to follow the
active MDSs in that file system.
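
For example (the file system name is illustrative), to enable
``standby-replay`` on the file system ``cephfs``:

::

    ceph fs set cephfs allow_standby_replay true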

Once an MDS has entered the ``standby-replay`` state, it will only be used as a
standby for the ``rank`` that it is following. If another ``rank`` fails, this
``standby-replay`` daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if ``standby-replay``
is used then *every* active MDS should have a ``standby-replay`` daemon.
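
For example, a file system with ``max_mds`` set to 2 would ideally run four
MDS daemons: two active and two in ``standby-replay``.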

.. _mds-join-fs:

Configuring MDS file system affinity
------------------------------------

You might elect to dedicate an MDS to a particular file system. Or, perhaps you
have MDSs that run on better hardware that should be preferred over a last-resort
standby on modest or over-provisioned systems. To express this preference,
CephFS provides an MDS configuration option, ``mds_join_fs``, which enforces
this affinity.

When failing over MDS daemons, a cluster's monitors will prefer standby daemons
whose ``mds_join_fs`` is equal to the ``name`` of the file system with the failed
``rank``. If no standby exists with ``mds_join_fs`` equal to that file system
``name``, the monitors will choose an unqualified standby (one with no setting
for ``mds_join_fs``) as the replacement, or any other available standby as a
last resort. Note that this does not change the behavior that ``standby-replay``
daemons are always selected before other standbys.

Further, the monitors will regularly examine the CephFS file systems, even when
stable, to check whether a standby with stronger affinity is available to replace
an MDS with lower affinity. This process also applies to ``standby-replay``
daemons: if a regular standby has stronger affinity than a ``standby-replay``
MDS, it will replace that MDS.

For example, given this stable and healthy file system:

::

    $ ceph fs dump
    dumped fsmap epoch 399
    ...
    Filesystem 'cephfs' (27)
    ...
    e399
    max_mds 1
    in 0
    up {0=20384}
    failed
    damaged
    stopped
    ...
    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]

    Standby daemons:

    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

You may set ``mds_join_fs`` on the standby to enforce your preference: ::

    $ ceph config set mds.b mds_join_fs cephfs

After automatic failover: ::

    $ ceph fs dump
    dumped fsmap epoch 405
    e405
    ...
    Filesystem 'cephfs' (27)
    ...
    max_mds 1
    in 0
    up {0=10420}
    failed
    damaged
    stopped
    ...
    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

    Standby daemons:

    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]

Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
output, the file system name from ``mds_join_fs`` is shown as the file system
identifier (27). If the file system is recreated with the same name, the
standby will follow the new file system as expected.

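To confirm that the affinity has been recorded, you can check the fsmap for
the ``join_fscid`` field, for example (the ``grep`` pattern here is
illustrative):

::

    $ ceph fs dump | grep join_fscid
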
Finally, if the file system is degraded or undersized, no failover will occur
to enforce ``mds_join_fs``.