
.. _mds-standby:

Terminology
-----------

A Ceph cluster may have zero or more CephFS *file systems*.  CephFS
file systems have a human readable name (set in ``fs new``)
and an integer ID.  The ID is called the file system cluster ID,
or *FSCID*.

Each CephFS file system has a number of *ranks*, one by default,
which start at zero.  A rank may be thought of as a metadata shard.
Controlling the number of ranks in a file system is described
in :doc:`/cephfs/multimds`.

Each CephFS ceph-mds process (a *daemon*) initially starts up
without a rank.  It may be assigned one by the monitor cluster.
A daemon may only hold one rank at a time.  Daemons only give up
a rank when the ceph-mds process stops.

If a rank is not associated with a daemon, the rank is
considered *failed*.  Once a rank is assigned to a daemon,
the rank is considered *up*.

A daemon has a *name* that is set statically by the administrator
when the daemon is first configured.  Typical configurations
use the hostname where the daemon runs as the daemon name.

A ceph-mds daemon can be assigned to a particular file system by
setting the ``mds_join_fs`` configuration option to the file system
name.

Each time a daemon starts up, it is also assigned a *GID*, which
is unique to this particular process lifetime of the daemon.  The
GID is an integer.

Referring to MDS daemons
------------------------

Most of the administrative commands that refer to an MDS daemon
accept a flexible argument format that may contain a rank, a GID
or a name.

Where a rank is used, this may optionally be qualified with
a leading file system name or ID.  If a daemon is a standby (i.e.
it is not currently assigned a rank), then it may only be
referred to by GID or name.

For example, if we had an MDS daemon which was called 'myhost',
had GID 5446, and was assigned rank 0 in the file system 'myfs'
which had FSCID 3, then any of the following would be suitable
forms of the 'fail' command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # File System name and rank
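
The argument forms above can be modeled with a short Python sketch. This is
an illustration only, not the monitor's code: the ``Daemon`` class, the
``resolve`` function, and the lookup order (rank first, then GID, then name)
are assumptions made for the example.

::

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Daemon:
        name: str
        gid: int
        fs_name: Optional[str] = None  # None for a standby
        fscid: Optional[int] = None    # None for a standby
        rank: Optional[int] = None     # None for a standby


    def resolve(spec, daemons):
        """Return the daemon that ``spec`` refers to, or None."""
        # 1. Try "<fs name or FSCID>:<rank>" or an unqualified rank.
        fs_part, sep, rank_part = spec.rpartition(":")
        if rank_part.isdigit():
            rank = int(rank_part)
            for d in daemons:
                if d.rank != rank:
                    continue
                if not sep or fs_part in (d.fs_name, str(d.fscid)):
                    return d
        # 2. Try a GID.
        if spec.isdigit():
            for d in daemons:
                if d.gid == int(spec):
                    return d
        # 3. Fall back to the daemon name.
        for d in daemons:
            if d.name == spec:
                return d
        return None

With a daemon ``Daemon("myhost", 5446, "myfs", 3, 0)`` in the list, each of
the five forms shown above resolves to it. A standby, which holds no rank,
can only be matched by steps 2 and 3.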

Managing failover
-----------------

If an MDS daemon stops communicating with the monitor, the monitor will wait
``mds_beacon_grace`` seconds (default 15 seconds) before marking the daemon as
*laggy*. If a standby is available, the monitor will immediately replace the
laggy daemon.
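
The timing rule can be sketched in Python as follows. The function names, the
use of plain timestamps in seconds, and the strict ``>`` comparison at the
boundary are assumptions for illustration; only the 15 second default comes
from the text above.

::

    MDS_BEACON_GRACE = 15.0  # seconds; the default stated above


    def is_laggy(last_beacon, now, grace=MDS_BEACON_GRACE):
        """True once more than ``grace`` seconds have passed since the
        daemon was last heard from."""
        return (now - last_beacon) > grace


    def replacement_for(last_beacon, now, standbys, grace=MDS_BEACON_GRACE):
        """Return the standby that takes over, or None if nothing changes."""
        if not is_laggy(last_beacon, now, grace):
            return None
        # A laggy daemon is replaced immediately when a standby exists.
        return standbys[0] if standbys else None

In this model a daemon last heard from 10 seconds ago keeps its rank, while
one silent for 16 seconds is replaced if the list of standbys is not empty.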

Each file system may specify the number of standby daemons it needs in order
to be considered healthy. This number includes daemons in standby-replay
waiting for a rank to fail (remember that a standby-replay daemon will not be
assigned to take over a failure for another rank or a failure in another
CephFS file system). The pool of standby daemons not in replay counts towards
any file system's count. Each file system may set the number of standby
daemons wanted using:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.
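
The counting rule described in this section can be sketched in Python. The
function name and parameters are hypothetical; the sketch only restates the
arithmetic and is not the monitor's health check code.

::

    def standby_shortfall(wanted, standby_replay_in_fs, pooled_standbys):
        """Return how many standbys a file system is missing.

        wanted               -- the file system's standby_count_wanted
        standby_replay_in_fs -- standby-replay daemons following this
                                file system's ranks
        pooled_standbys      -- standbys not in replay; these count
                                towards every file system
        """
        if wanted == 0:
            # A count of 0 disables the health check.
            return 0
        available = standby_replay_in_fs + pooled_standbys
        return max(0, wanted - available)

A result greater than zero corresponds to the file system not being
considered healthy under this check.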


.. _mds-standby-replay:

Configuring standby-replay
--------------------------

Each CephFS file system may be configured to add standby-replay daemons.  These
standby daemons follow the active MDS's metadata journal to reduce failover
time in the event the active MDS becomes unavailable. Each active MDS may have
only one standby-replay daemon following it.

Configuring standby-replay on a file system is done using:

::

    ceph fs set <fs name> allow_standby_replay <bool>

Once set, the monitors will assign available standby daemons to follow the
active MDSs in that file system.

Once an MDS has entered the standby-replay state, it will only be used as a
standby for the rank that it is following. If another rank fails, this
standby-replay daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if standby-replay
is used then every active MDS should have a standby-replay daemon.

.. _mds-join-fs:

Configuring MDS file system affinity
------------------------------------

You may want to have an MDS used for a particular file system. Or, perhaps you
have larger MDSs on better hardware that should be preferred over a last-resort
standby on lesser or over-provisioned hardware. To express this preference,
CephFS provides a configuration option for MDS called ``mds_join_fs`` which
enforces this *affinity*.

As part of any failover, the Ceph monitors will prefer standby daemons with
``mds_join_fs`` equal to the name of the file system with the failed rank.  If
no standby exists with ``mds_join_fs`` equal to the file system name, the
monitors will choose a *vanilla* standby (no setting for ``mds_join_fs``) for
the replacement, or any other available standby as a last resort. Note that
this does not change the behavior that ``standby-replay`` daemons are always
selected before looking at other standbys.
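
The preference order can be summarized with a small Python model. The
``Standby`` class and ``choose_replacement`` function are hypothetical names
used only for this sketch, which is not the monitors' implementation.

::

    from dataclasses import dataclass
    from typing import Optional, Tuple


    @dataclass
    class Standby:
        name: str
        join_fs: Optional[str] = None  # value of mds_join_fs, if set
        # (fs name, rank) being followed when in standby-replay
        following: Optional[Tuple[str, int]] = None


    def choose_replacement(fs_name, rank, standbys):
        """Pick a standby for the failed ``rank`` of ``fs_name``."""

        def preference(s):
            if s.following == (fs_name, rank):
                return 0  # standby-replay daemon for this rank
            if s.join_fs == fs_name:
                return 1  # mds_join_fs names this file system
            if s.join_fs is None:
                return 2  # vanilla standby
            return 3      # any other standby, as a last resort

        # A standby-replay daemon following a different rank is never used.
        candidates = [s for s in standbys
                      if s.following in (None, (fs_name, rank))]
        return min(candidates, key=preference, default=None)

For a failed rank 0 of ``cephfs``, a standby with ``join_fs="cephfs"`` is
chosen over a vanilla standby, and a standby-replay daemon following
``("cephfs", 0)`` is chosen over both.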

In addition, the monitors will regularly examine the CephFS file systems when
stable to check if a standby with stronger affinity is available to replace an
MDS with lower affinity. This process is also done for standby-replay daemons:
if a regular standby has stronger affinity than the standby-replay MDS, it will
replace the standby-replay MDS.

For example, given this stable and healthy file system:

::

    $ ceph fs dump
    dumped fsmap epoch 399
    ...
    Filesystem 'cephfs' (27)
    ...
    e399
    max_mds 1
    in      0
    up      {0=20384}
    failed
    damaged
    stopped
    ...
    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]

    Standby daemons:

    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]


You may set ``mds_join_fs`` on the standby to enforce your preference: ::

    $ ceph config set mds.b mds_join_fs cephfs

After automatic failover: ::

    $ ceph fs dump
    dumped fsmap epoch 405
    e405
    ...
    Filesystem 'cephfs' (27)
    ...
    max_mds 1
    in      0
    up      {0=10420}
    failed
    damaged
    stopped
    ...
    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

    Standby daemons:

    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]

Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
output, the file system name from ``mds_join_fs`` is changed to the file system
identifier (27). If the file system is recreated with the same name, the
standby will follow the new file system as expected.

Finally, if the file system is degraded or undersized, no failover will occur
to enforce ``mds_join_fs``.
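
The affinity comparison that drives this kind of failover can be sketched in
Python. The numeric levels and the function names are assumptions chosen for
illustration; they only encode the ordering described in this section.

::

    def affinity(join_fs, fs_name):
        """Rank how strongly a daemon prefers ``fs_name``."""
        if join_fs == fs_name:
            return 2  # mds_join_fs names this file system
        if join_fs is None:
            return 1  # vanilla: mds_join_fs is not set
        return 0      # mds_join_fs names a different file system


    def should_replace(active_join_fs, standby_join_fs, fs_name, fs_stable):
        """True if the standby should take over to enforce affinity."""
        if not fs_stable:
            # No failover happens while the file system is degraded
            # or undersized.
            return False
        return (affinity(standby_join_fs, fs_name)
                > affinity(active_join_fs, fs_name))

In the example above, ``mds.a`` has no ``mds_join_fs`` and ``mds.b`` has it
set to ``cephfs``, so ``should_replace(None, "cephfs", "cephfs", True)`` is
true in this model and ``mds.b`` takes over rank 0.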