.. _mds-standby:

Terminology
-----------

A Ceph cluster may have zero or more CephFS *file systems*.  Each CephFS
file system has a human-readable name (set at creation time with ``fs new``)
and an integer ID.  The ID is called the file system cluster ID,
or *FSCID*.

Each CephFS file system has a number of *ranks*, numbered beginning with
zero.  By default there is one rank per file system.  A rank may be thought
of as a metadata shard.  Controlling the number of ranks in a file system is
described in :doc:`/cephfs/multimds`.

Each CephFS ``ceph-mds`` process (a *daemon*) initially starts up
without a rank.  It may be assigned one by the monitor cluster.
A daemon may hold only one rank at a time.  Daemons only give up
a rank when the ``ceph-mds`` process stops.

If a rank is not associated with a daemon, the rank is
considered *failed*.  Once a rank is assigned to a daemon,
the rank is considered *up*.

A daemon has a *name* that is set statically by the administrator
when the daemon is first configured.  Typical configurations
use the hostname where the daemon runs as the daemon name.

A ``ceph-mds`` daemon can be assigned to a particular file system by
setting the ``mds_join_fs`` configuration option to the file system
name.

Each time a daemon starts up, it is also assigned a *GID*, which
is unique to this particular process lifetime of the daemon.  The
GID is an integer.

Referring to MDS daemons
------------------------

Most of the administrative commands that refer to an MDS daemon
accept a flexible argument format that may contain a rank, a GID
or a name.

Where a rank is used, it may optionally be qualified with
a leading file system name or ID.  If a daemon is a standby (i.e.
it is not currently assigned a rank), then it may only be
referred to by GID or name.

For example, if an MDS daemon named 'myhost' had GID 5446 and was
assigned rank 0 in the file system 'myfs' with FSCID 3, then any of
the following would be suitable forms of the ``fail`` command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # File system name and rank

Managing failover
-----------------

If an MDS daemon stops communicating with the monitor, the monitor will wait
``mds_beacon_grace`` seconds (default 15 seconds) before marking the daemon as
*laggy*. If a standby is available, the monitor will immediately replace the
laggy daemon.

Each file system may specify the number of standby daemons it needs in order
to be considered healthy. This number includes daemons in standby-replay
waiting for a rank to fail (remember that a standby-replay daemon will not be
assigned to take over a failure for another rank or a failure in another
CephFS file system). Standby daemons that are not in standby-replay count
towards every file system's total. Each file system may set the number of
standby daemons wanted using:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.
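
As a rough illustration of the accounting described above, the following shell
sketch models when the standby health check would report a shortfall. It is
not Ceph code: the function name ``standby_health`` and its three arguments
are hypothetical, and it assumes that the total for a file system is the
number of its own standby-replay daemons plus the shared pool of ordinary
standbys.

```shell
#!/bin/sh
# Hypothetical model of the standby_count_wanted health check (not Ceph code).
# Usage: standby_health <wanted> <standby_replay_for_fs> <pooled_standbys>
standby_health() {
    wanted=$1
    replay=$2
    pool=$3

    # A wanted count of 0 disables the check entirely.
    if [ "$wanted" -eq 0 ]; then
        echo "disabled"
        return 0
    fi

    # Assumption: this file system's standby-replay daemons and the shared
    # pool of ordinary standbys both count towards the total.
    available=$((replay + pool))
    if [ "$available" -ge "$wanted" ]; then
        echo "ok"
    else
        echo "insufficient: have $available, want $wanted"
    fi
}

standby_health 2 1 1   # one standby-replay daemon plus one pooled standby
standby_health 2 0 1   # only one pooled standby
standby_health 0 0 0   # check disabled
```

On a running cluster, the actual state of the check can be inspected with
``ceph health detail``.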


.. _mds-standby-replay:

Configuring standby-replay
--------------------------

Each CephFS file system may be configured to add standby-replay daemons.  These
standby daemons follow the active MDS's metadata journal to reduce failover
time in the event the active MDS becomes unavailable. Each active MDS may have
only one standby-replay daemon following it.

Configuring standby-replay on a file system is done using:

::

    ceph fs set <fs name> allow_standby_replay <bool>

Once set, the monitors will assign available standby daemons to follow the
active MDSs in that file system.

Once an MDS has entered the standby-replay state, it will only be used as a
standby for the rank that it is following. If another rank fails, this
standby-replay daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if standby-replay
is used then every active MDS should have a standby-replay daemon.

.. _mds-join-fs:

Configuring MDS file system affinity
------------------------------------

You may want to have an MDS used for a particular file system. Or, perhaps you
have larger MDSs on better hardware that should be preferred over a last-resort
standby on lesser or over-provisioned hardware. To express this preference,
CephFS provides an MDS configuration option called ``mds_join_fs``, which
enforces this *affinity*.

As part of any failover, the Ceph monitors will prefer standby daemons with
``mds_join_fs`` equal to the name of the file system with the failed rank.  If
no standby exists with ``mds_join_fs`` equal to the file system name, the
monitors will choose a *vanilla* standby (no setting for ``mds_join_fs``) as
the replacement, or any other available standby as a last resort. Note that
this does not change the behavior that ``standby-replay`` daemons are always
selected before other standbys are considered.
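
The preference order above can be sketched as a small shell function. This is
only a model of the described behavior, not the monitors' implementation: the
function name ``pick_standby`` and its input format (one standby per line as
``<name> <mds_join_fs>``, with ``-`` standing for an unset option) are
hypothetical, and standby-replay daemons are deliberately left out.

```shell
#!/bin/sh
# Hypothetical model of affinity-based standby selection (not Ceph code).
# stdin: one standby per line, "<name> <mds_join_fs>", "-" meaning unset.
# $1:    name of the file system with the failed rank.
pick_standby() {
    awk -v want="$1" '
        # 1st choice: mds_join_fs matches the file system name.
        $2 == want && matching == ""            { matching = $1 }
        # 2nd choice: a "vanilla" standby with no mds_join_fs set.
        $2 == "-" && vanilla == ""              { vanilla = $1 }
        # Last resort: a standby that prefers some other file system.
        $2 != want && $2 != "-" && other == ""  { other = $1 }
        END {
            if (matching != "")     { print matching }
            else if (vanilla != "") { print vanilla }
            else if (other != "")   { print other }
        }'
}

# Three standbys: one vanilla, one preferring "cephfs", one preferring "otherfs".
printf 'a -\nb cephfs\nc otherfs\n' | pick_standby cephfs
```

With the sample input, the standby whose ``mds_join_fs`` names ``cephfs`` is
chosen even though a vanilla standby is listed first.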

Even further, the monitors will regularly examine the CephFS file systems when
stable to check if a standby with stronger affinity is available to replace an
MDS with lower affinity. This process is also done for standby-replay daemons:
if a regular standby has stronger affinity than the standby-replay MDS, it will
replace the standby-replay MDS.

For example, given this stable and healthy file system:

::

    $ ceph fs dump
    dumped fsmap epoch 399
    ...
    Filesystem 'cephfs' (27)
    ...
    e399
    max_mds 1
    in      0
    up      {0=20384}
    failed
    damaged
    stopped
    ...
    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]

    Standby daemons:

    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]


You may set ``mds_join_fs`` on the standby to enforce your preference: ::

    $ ceph config set mds.b mds_join_fs cephfs

After automatic failover: ::

    $ ceph fs dump
    dumped fsmap epoch 405
    e405
    ...
    Filesystem 'cephfs' (27)
    ...
    max_mds 1
    in      0
    up      {0=10420}
    failed
    damaged
    stopped
    ...
    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

    Standby daemons:

    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]

Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
output, the file system name from ``mds_join_fs`` is replaced by the file
system identifier (27). If the file system is recreated with the same name, the
standby will follow the new file system as expected.
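
When scripting around ``ceph fs dump``, the daemon lines shown above can be
split into name, rank and GID. The following is a minimal sketch that assumes
the ``[mds.<name>{<rank>:<GID>} ...]`` layout seen in the sample output, with
rank ``-1`` for a standby; the helper name ``parse_mds_line`` is hypothetical
and the layout is inferred from the example rather than from a format
specification.

```shell
#!/bin/sh
# Hypothetical helper: print "<name> <rank> <gid>" for one daemon line
# copied from `ceph fs dump`. Prints nothing if the line does not match.
parse_mds_line() {
    printf '%s\n' "$1" |
        sed -n 's/^\[mds\.\([^{]*\){\(-\{0,1\}[0-9][0-9]*\):\([0-9][0-9]*\)}.*/\1 \2 \3/p'
}

# Daemon lines taken from the example output after failover.
parse_mds_line '[mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]'
parse_mds_line '[mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]'
```

Because ``mds.a`` appears with a different GID after the failover than before
it, scripts that need a stable identifier should use the daemon name rather
than the GID.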

Finally, if the file system is degraded or undersized, no failover will occur
to enforce ``mds_join_fs``.