Terminology
-----------

A Ceph cluster may have zero or more CephFS *filesystems*.  CephFS
filesystems have a human readable name (set in ``fs new``)
and an integer ID.  The ID is called the filesystem cluster ID,
or *FSCID*.

Each CephFS filesystem has a number of *ranks*, one by default,
which start at zero.  A rank may be thought of as a metadata shard.
Controlling the number of ranks in a filesystem is described
in :doc:`/cephfs/multimds`.

Each CephFS ceph-mds process (a *daemon*) initially starts up
without a rank.  It may be assigned one by the monitor cluster.
A daemon may only hold one rank at a time.  Daemons only give up
a rank when the ceph-mds process stops.

If a rank is not associated with a daemon, the rank is
considered *failed*.  Once a rank is assigned to a daemon,
the rank is considered *up*.

A daemon has a *name* that is set statically by the administrator
when the daemon is first configured.  Typical configurations
use the hostname where the daemon runs as the daemon name.

Each time a daemon starts up, it is also assigned a *GID*, which
is unique to this particular process lifetime of the daemon.  The
GID is an integer.

Referring to MDS daemons
------------------------

Most of the administrative commands that refer to an MDS daemon
accept a flexible argument format that may contain a rank, a GID
or a name.

Where a rank is used, this may optionally be qualified with
a leading filesystem name or ID.  If a daemon is a standby (i.e.
it is not currently assigned a rank), then it may only be
referred to by GID or name.

For example, if we had an MDS daemon which was called 'myhost',
had GID 5446, and was assigned rank 0 in the filesystem 'myfs'
which had FSCID 3, then any of the following would be suitable
forms of the 'fail' command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # Filesystem name and rank
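
The argument forms above can be told apart by syntax alone. The following
is a minimal Python sketch of that idea, not Ceph's actual parser: the
function name ``classify_mds_arg`` and the tuples it returns are
hypothetical. It assumes that a numeric qualifier before the colon is an
FSCID. A bare integer could be either a rank or a GID, so the sketch
reports it as ambiguous instead of guessing how the monitor resolves it.

.. code-block:: python

    def classify_mds_arg(arg):
        """Classify an MDS daemon argument by syntax alone (illustrative)."""
        if ":" in arg:
            fs, _, rank = arg.partition(":")
            if not rank.isdecimal():
                raise ValueError("rank must be a non-negative integer: %r" % arg)
            # Assumption: a numeric qualifier is an FSCID, anything else a name.
            kind = "fscid" if fs.isdecimal() else "fs_name"
            return ("qualified_rank", kind, fs, int(rank))
        if arg.isdecimal():
            # An unqualified rank or a GID; only cluster state can tell which.
            return ("rank_or_gid", int(arg))
        return ("name", arg)

For instance, ``classify_mds_arg("myfs:0")`` yields a qualified rank with a
filesystem name, while ``classify_mds_arg("myhost")`` yields a daemon name.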

Managing failover
-----------------

If an MDS daemon stops communicating with the monitor, the monitor will
wait ``mds_beacon_grace`` seconds (default 15 seconds) before marking
the daemon as *laggy*.
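
The grace period can be pictured as a simple timeout comparison. This is a
minimal Python sketch, not the monitor's implementation: the function
``is_laggy`` and its parameters are hypothetical, and only the 15 second
default comes from the text above.

.. code-block:: python

    import time

    MDS_BEACON_GRACE = 15.0  # seconds; the default quoted above


    def is_laggy(last_beacon, now=None, grace=MDS_BEACON_GRACE):
        """Return True if no beacon has arrived within the grace period.

        last_beacon and now are timestamps in seconds on the same clock.
        """
        if now is None:
            now = time.monotonic()
        return (now - last_beacon) > grace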

Each file system may specify the number of standby daemons it needs in order
to be considered healthy. This number includes daemons in standby-replay that
are waiting for a rank to fail (remember that a standby-replay daemon will not
be assigned to take over a failure for another rank or a failure in another
CephFS file system). The pool of standby daemons not in replay counts towards
the count of any file system.
Set the number of standby daemons wanted for a file system using:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.
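
The counting rule described above can be sketched as follows. This is a
simplified Python model based only on the description in this section, not
the monitor's code: the function name ``standby_shortfall`` and its
parameters are hypothetical.

.. code-block:: python

    def standby_shortfall(wanted, plain_standbys, replay_followers):
        """Return how many standbys a file system is missing (0 is healthy).

        wanted           -- the file system's standby_count_wanted
        plain_standbys   -- standbys not in replay, shared by all file systems
        replay_followers -- standby-replay daemons following this file system
        """
        if wanted == 0:
            return 0  # a count of 0 disables the check
        available = plain_standbys + replay_followers
        return max(0, wanted - available)

A file system that wants two standbys but has only one plain standby
available and no standby-replay follower would be short by one.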


Configuring standby daemons
---------------------------

There are four configuration settings that control how a daemon
will behave while in standby:

::

    mds_standby_for_name
    mds_standby_for_rank
    mds_standby_for_fscid
    mds_standby_replay

These may be set in the ceph.conf on the host where the MDS daemon
runs (as opposed to on the monitor).  The daemon loads these settings
when it starts, and sends them to the monitor.

By default, if none of these settings are used, all MDS daemons
which do not hold a rank will be used as standbys for any rank.

The settings which associate a standby daemon with a particular
name or rank do not guarantee that the daemon will *only* be used
for that rank.  They mean that when several standbys are available,
the associated standby daemon will be used.  If a rank is failed,
and a standby is available, it will be used even if it is associated
with a different rank or named daemon.
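
The preference described in the paragraph above can be sketched in Python.
This is an illustrative model only: the ``Standby`` class and
``pick_standby`` function are hypothetical, and the sketch deliberately
ignores standby replay and ``mon_force_standby_active``, which are
described later in this document.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class Standby:
        name: str
        for_name: Optional[str] = None  # mds_standby_for_name, None if unset
        for_rank: Optional[int] = None  # mds_standby_for_rank, None if unset


    def pick_standby(standbys: List[Standby], failed_rank: int,
                     last_holder: str) -> Optional[Standby]:
        """Prefer an associated standby, else fall back to any standby."""
        for s in standbys:
            matches_rank = s.for_rank == failed_rank
            matches_name = s.for_name is not None and s.for_name == last_holder
            if matches_rank or matches_name:
                return s
        # No associated standby: use whichever one is available, if any.
        return standbys[0] if standbys else None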

mds_standby_replay
~~~~~~~~~~~~~~~~~~

If this is set to true, then the standby daemon will continuously read
the metadata journal of an up rank.  This will give it
a warm metadata cache, and speed up the process of failing over
if the daemon serving the rank fails.

An up rank may only have one standby replay daemon assigned to it.
If two daemons are both set to be standby replay, then one of them
will arbitrarily win, and the other will become a normal non-replay
standby.

Once a daemon has entered the standby replay state, it will only be
used as a standby for the rank that it is following.  If another rank
fails, this standby replay daemon will not be used as a replacement,
even if no other standbys are available.

*Historical note:* In Ceph prior to v10.2.1, this setting was always treated
as true, even when set to ``false``, if ``mds_standby_for_*`` was also set.

mds_standby_for_name
~~~~~~~~~~~~~~~~~~~~

Set this to make the standby daemon only take over a failed rank
if the last daemon to hold it matches this name.

mds_standby_for_rank
~~~~~~~~~~~~~~~~~~~~

Set this to make the standby daemon only take over the specified
rank.  If another rank fails, this daemon will not be used to
replace it.

Use in conjunction with ``mds_standby_for_fscid`` to be specific
about which filesystem's rank you are targeting, if you have
multiple filesystems.

mds_standby_for_fscid
~~~~~~~~~~~~~~~~~~~~~

If ``mds_standby_for_rank`` is set, this is simply a qualifier to
say which filesystem's rank is referred to.

If ``mds_standby_for_rank`` is not set, then setting FSCID will
cause this daemon to target any rank in the specified FSCID.  Use
this if you have a daemon that you want to use for any rank, but
only within a particular filesystem.
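
The way ``mds_standby_for_rank`` and ``mds_standby_for_fscid`` combine can
be sketched as a small predicate. This is a Python illustration of the
rules stated in this section, not Ceph code: the function name ``targets``
and its parameters are hypothetical, and ``None`` stands for "not set".

.. code-block:: python

    def targets(failed_fscid, failed_rank, for_rank=None, for_fscid=None):
        """Return True if the standby settings target the failed rank."""
        if for_fscid is not None and for_fscid != failed_fscid:
            return False  # restricted to a different filesystem
        if for_rank is not None:
            return for_rank == failed_rank
        # No rank set: any rank, within the given filesystem if one is set.
        return True

With only ``for_fscid=1`` set, the predicate accepts any rank in the
filesystem with FSCID 1 and rejects ranks in every other filesystem.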

mon_force_standby_active
~~~~~~~~~~~~~~~~~~~~~~~~

This setting is used on monitor hosts.  It defaults to true.

If it is false, then daemons configured with ``mds_standby_replay = true``
will **only** become active if the rank/name that they have
been configured to follow fails.  On the other hand, if this
setting is true, then a daemon configured with ``mds_standby_replay = true``
may be assigned some other rank.

Examples
--------

These are example ceph.conf snippets.  In practice you can either
copy a ceph.conf with all daemons' configuration to all your servers,
or you can have a different file on each server that contains just
that server's daemons' configuration.

Simple pair
~~~~~~~~~~~

Two MDS daemons 'a' and 'b' acting as a pair, where whichever one is not
currently assigned a rank will be the standby replay follower
of the other.

::

    [mds.a]
    mds standby replay = true
    mds standby for rank = 0

    [mds.b]
    mds standby replay = true
    mds standby for rank = 0
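
To check what a snippet like the one above declares, it can be read with
Python's standard ``configparser`` module. This is a convenience sketch
under assumptions: ``configparser`` is not Ceph's own parser and may not
accept every valid ceph.conf, and the helper ``load_standby_settings`` is
hypothetical. Option names are normalized to underscores because Ceph
treats spaces and underscores in option names as equivalent.

.. code-block:: python

    import configparser

    SNIPPET = """
    [mds.a]
    mds standby replay = true
    mds standby for rank = 0

    [mds.b]
    mds standby replay = true
    mds standby for rank = 0
    """


    def load_standby_settings(text):
        """Return {section: {option: value}} for every [mds.*] section."""
        parser = configparser.ConfigParser()
        # Strip the indentation used in SNIPPET before parsing.
        cleaned = "\n".join(line.strip() for line in text.splitlines())
        parser.read_string(cleaned)
        settings = {}
        for section in parser.sections():
            if section.startswith("mds."):
                settings[section] = {
                    key.replace(" ", "_"): value
                    for key, value in parser[section].items()
                }
        return settings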

Floating standby
~~~~~~~~~~~~~~~~

Three MDS daemons 'a', 'b' and 'c', in a filesystem that has
``max_mds`` set to 2.

::
    
    # No explicit configuration required: whichever daemon is
    # not assigned a rank will go into 'standby' and take over
    # for whichever other daemon fails.

Two MDS clusters
~~~~~~~~~~~~~~~~

Four MDS daemons serving two filesystems, where two daemons act
as a pair for one filesystem and the other two act as a pair
for the other filesystem.

::

    [mds.a]
    mds standby for fscid = 1

    [mds.b]
    mds standby for fscid = 1

    [mds.c]
    mds standby for fscid = 2

    [mds.d]
    mds standby for fscid = 2