.. _mds-standby:

Terminology
-----------

A Ceph cluster may have zero or more CephFS *file systems*. Each CephFS
file system has a human-readable name (set in ``fs new``)
and an integer ID. The ID is called the file system cluster ID,
or *FSCID*.

Each CephFS file system has a number of *ranks*, one by default,
which start at zero. A rank may be thought of as a metadata shard.
Controlling the number of ranks in a file system is described
in :doc:`/cephfs/multimds`.

Each CephFS ceph-mds process (a *daemon*) initially starts up
without a rank. It may be assigned one by the monitor cluster.
A daemon may only hold one rank at a time. Daemons only give up
a rank when the ceph-mds process stops.

If a rank is not associated with a daemon, the rank is
considered *failed*. Once a rank is assigned to a daemon,
the rank is considered *up*.

A daemon has a *name* that is set statically by the administrator
when the daemon is first configured. Typical configurations
use the hostname where the daemon runs as the daemon name.

A ceph-mds daemon can be assigned to a particular file system by
setting the ``mds_join_fs`` configuration option to the file system
name.

Each time a daemon starts up, it is also assigned a *GID*, which
is unique to this particular process lifetime of the daemon. The
GID is an integer.

Referring to MDS daemons
------------------------

Most of the administrative commands that refer to an MDS daemon
accept a flexible argument format that may contain a rank, a GID
or a name.

Where a rank is used, this may optionally be qualified with
a leading file system name or ID. If a daemon is a standby (i.e.
it is not currently assigned a rank), then it may only be
referred to by GID or name.

For example, if we had an MDS daemon which was called 'myhost',
had GID 5446, and was assigned rank 0 in the file system 'myfs'
which had FSCID 3, then any of the following would be suitable
forms of the 'fail' command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # File System name and rank

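The argument forms above can be illustrated with a small Python sketch. It is
not the monitor's implementation: the ``Daemon`` record and the ``resolve``
function are hypothetical names, and the order in which a bare integer is
tried (rank first, then GID) is an assumption made for this example only.

.. code-block:: python

    from typing import List, NamedTuple, Optional


    class Daemon(NamedTuple):
        """Hypothetical record of one ceph-mds daemon."""
        name: str
        gid: int
        fs_name: Optional[str]  # None while the daemon is a standby
        fscid: Optional[int]    # None while the daemon is a standby
        rank: Optional[int]     # None while the daemon is a standby


    def resolve(spec: str, daemons: List[Daemon]) -> Optional[Daemon]:
        """Return the daemon that ``spec`` refers to, or None."""
        # Qualified rank: "<file system name or FSCID>:<rank>".
        if ":" in spec:
            fs, _, rank_text = spec.partition(":")
            if not rank_text.isdigit():
                return None
            rank = int(rank_text)
            for d in daemons:
                if d.rank == rank and fs in (d.fs_name, str(d.fscid)):
                    return d
            return None
        # Bare integer: an unqualified rank or a GID.
        # Assumption: try the rank first, then the GID. With several
        # file systems an unqualified rank is ambiguous; this sketch
        # simply returns the first match.
        if spec.isdigit():
            number = int(spec)
            for d in daemons:
                if d.rank == number:
                    return d
            for d in daemons:
                if d.gid == number:
                    return d
            return None
        # Anything else is treated as a daemon name.
        for d in daemons:
            if d.name == spec:
                return d
        return None

With a table holding 'myhost' (GID 5446, rank 0 of 'myfs', FSCID 3), all five
forms shown above resolve to the same daemon. A standby has no rank, so it can
only be found through the GID or name branches.
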
Managing failover
-----------------

If an MDS daemon stops communicating with the monitor, the monitor will wait
``mds_beacon_grace`` seconds (default 15 seconds) before marking the daemon as
*laggy*. If a standby is available, the monitor will immediately replace the
laggy daemon.

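As a rough illustration of the grace period, the following Python sketch marks
a daemon as laggy once its last message is older than the grace value. The
function name, the ``last_seen`` table and the strict ``>`` comparison are
assumptions for this example; only the 15 second default comes from the text
above.

.. code-block:: python

    from typing import Dict, List

    MDS_BEACON_GRACE = 15.0  # seconds, the default quoted above


    def laggy_daemons(last_seen: Dict[int, float], now: float,
                      grace: float = MDS_BEACON_GRACE) -> List[int]:
        """Return the GIDs whose last message is older than ``grace``.

        ``last_seen`` maps a GID to the time (in seconds) at which the
        monitor last heard from that daemon. Hypothetical bookkeeping,
        not the monitor's real data structure.
        """
        return sorted(gid for gid, seen in last_seen.items()
                      if now - seen > grace)

For instance, with ``last_seen = {5446: 100.0, 5500: 110.0}`` and ``now = 116.0``,
only GID 5446 has been silent for longer than 15 seconds.
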
Each file system may specify the number of standby daemons it needs in order
to be considered healthy. This number includes daemons in standby-replay
waiting for a rank to fail (remember that a standby-replay daemon will not be
assigned to take over a failure for another rank or a failure in another
CephFS file system). The pool of standby daemons not in replay counts towards
any file system's count. Each file system may set the number of standby
daemons wanted using:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.


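The counting rule above can be sketched in Python as follows. This is an
illustration under stated assumptions, not the monitor's code:
``standby_shortfall`` and its parameter names are hypothetical.

.. code-block:: python

    def standby_shortfall(wanted: int, regular_standbys: int,
                          standby_replay_for_fs: int) -> int:
        """Return how many standbys a file system is missing.

        ``wanted`` is the file system's standby_count_wanted value.
        ``regular_standbys`` is the shared pool of standbys not in replay,
        which counts towards every file system. ``standby_replay_for_fs``
        counts only the standby-replay daemons following this file system.
        """
        if wanted == 0:
            return 0  # a count of 0 disables the health check
        available = regular_standbys + standby_replay_for_fs
        return max(0, wanted - available)

A result of 0 means the file system would be considered healthy in this
respect. For example, ``standby_shortfall(2, 1, 0)`` reports one missing
standby, while ``standby_shortfall(2, 1, 1)`` reports none.
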
.. _mds-standby-replay:

Configuring standby-replay
--------------------------

Each CephFS file system may be configured to add standby-replay daemons. These
standby daemons follow the active MDS's metadata journal to reduce failover
time in the event the active MDS becomes unavailable. Each active MDS may have
only one standby-replay daemon following it.

Configuring standby-replay on a file system is done using:

::

    ceph fs set <fs name> allow_standby_replay <bool>

Once set, the monitors will assign available standby daemons to follow the
active MDSs in that file system.

Once an MDS has entered the standby-replay state, it will only be used as a
standby for the rank that it is following. If another rank fails, this
standby-replay daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if standby-replay
is used then every active MDS should have a standby-replay daemon.

.. _mds-join-fs:

Configuring MDS file system affinity
------------------------------------

You may want to have an MDS used for a particular file system. Or, perhaps you
have larger MDSs on better hardware that should be preferred over a last-resort
standby on lesser or over-provisioned hardware. To express this preference,
CephFS provides a configuration option for MDS called ``mds_join_fs`` which
enforces this *affinity*.

As part of any failover, the Ceph monitors will prefer standby daemons with
``mds_join_fs`` equal to the name of the file system with the failed rank. If
no standby exists with ``mds_join_fs`` equal to the file system name, they will
choose a *vanilla* standby (no setting for ``mds_join_fs``) for the replacement
or any other available standby as a last resort. Note that this does not change
the behavior that ``standby-replay`` daemons are always selected before looking
at other standbys.

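The preference order described above can be written down as a short Python
sketch. It is a simplified model, not the monitor's algorithm: the ``Standby``
record, its ``following`` field and ``pick_replacement`` are hypothetical names
introduced for illustration.

.. code-block:: python

    from typing import List, NamedTuple, Optional, Tuple


    class Standby(NamedTuple):
        """Hypothetical record of one standby daemon."""
        name: str
        join_fs: Optional[str] = None  # value of mds_join_fs, if set
        # (file system name, rank) when the daemon is in standby-replay
        following: Optional[Tuple[str, int]] = None


    def pick_replacement(fs_name: str, rank: int,
                         standbys: List[Standby]) -> Optional[Standby]:
        """Choose a replacement for a failed rank, or None."""
        # 1. A standby-replay daemon following this rank always wins.
        for s in standbys:
            if s.following == (fs_name, rank):
                return s
        # Standby-replay daemons following other ranks are never used.
        pool = [s for s in standbys if s.following is None]
        preferences = (
            lambda s: s.join_fs == fs_name,  # 2. matching mds_join_fs
            lambda s: s.join_fs is None,     # 3. vanilla standby
            lambda s: True,                  # 4. any other, last resort
        )
        for accepts in preferences:
            for s in pool:
                if accepts(s):
                    return s
        return None

Given standbys ``c`` (``mds_join_fs`` set to another file system), ``a``
(vanilla) and ``b`` (``mds_join_fs`` set to ``cephfs``), a failed rank in
``cephfs`` would go to ``b`` in this model, then to ``a`` if ``b`` were
absent, and to ``c`` only if neither were available.
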
Even further, the monitors will regularly examine the CephFS file systems when
stable to check if a standby with stronger affinity is available to replace an
MDS with lower affinity. This process is also done for standby-replay daemons:
if a regular standby has stronger affinity than the standby-replay MDS, it will
replace the standby-replay MDS.

For example, given this stable and healthy file system:

::

    $ ceph fs dump
    dumped fsmap epoch 399
    ...
    Filesystem 'cephfs' (27)
    ...
    e399
    max_mds 1
    in 0
    up {0=20384}
    failed
    damaged
    stopped
    ...
    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]

    Standby daemons:

    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]


You may set ``mds_join_fs`` on the standby to enforce your preference: ::

    $ ceph config set mds.b mds_join_fs cephfs

After automatic failover: ::

    $ ceph fs dump
    dumped fsmap epoch 405
    e405
    ...
    Filesystem 'cephfs' (27)
    ...
    max_mds 1
    in 0
    up {0=10420}
    failed
    damaged
    stopped
    ...
    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

    Standby daemons:

    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]

Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
output, the file system name from ``mds_join_fs`` is changed to the file system
identifier (27). If the file system is recreated with the same name, the
standby will follow the new file system as expected.

Finally, if the file system is degraded or undersized, no failover will occur
to enforce ``mds_join_fs``.