Terminology
-----------

A Ceph cluster may have zero or more CephFS *filesystems*. CephFS
filesystems have a human readable name (set in ``fs new``)
and an integer ID. The ID is called the filesystem cluster ID,
or *FSCID*.

Each CephFS filesystem has a number of *ranks*, one by default,
which start at zero. A rank may be thought of as a metadata shard.
Controlling the number of ranks in a filesystem is described
in :doc:`/cephfs/multimds`.

Each CephFS ceph-mds process (a *daemon*) initially starts up
without a rank. It may be assigned one by the monitor cluster.
A daemon may only hold one rank at a time. Daemons only give up
a rank when the ceph-mds process stops.

If a rank is not associated with a daemon, the rank is
considered *failed*. Once a rank is assigned to a daemon,
the rank is considered *up*.

A daemon has a *name* that is set statically by the administrator
when the daemon is first configured. Typical configurations
use the hostname where the daemon runs as the daemon name.

Each time a daemon starts up, it is also assigned a *GID*, which
is unique to this particular process lifetime of the daemon. The
GID is an integer.

Referring to MDS daemons
------------------------

Most of the administrative commands that refer to an MDS daemon
accept a flexible argument format that may contain a rank, a GID
or a name.

Where a rank is used, this may optionally be qualified with
a leading filesystem name or ID. If a daemon is a standby (i.e.
it is not currently assigned a rank), then it may only be
referred to by GID or name.

For example, if we had an MDS daemon which was called 'myhost',
had GID 5446, and was assigned rank 0 in the filesystem 'myfs'
which had FSCID 3, then any of the following would be suitable
forms of the 'fail' command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # Filesystem name and rank

Managing failover
-----------------

If an MDS daemon stops communicating with the monitor, the monitor will
wait ``mds_beacon_grace`` seconds (default 15 seconds) before marking
the daemon as *laggy*.

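As a rough sketch of how the grace period might be adjusted, the following
ceph.conf fragment raises ``mds_beacon_grace`` above its default. The value
of 30 is purely illustrative, and placing the option in ``[global]`` is an
assumption made here so that the monitors and the MDS daemons would read the
same value; check the configuration reference for your release before relying
on it.

::

    [global]
    # Illustrative value only; the default is 15 seconds.
    mds beacon grace = 30
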
Each file system may specify the number of standby daemons it needs in order
to be considered healthy. This number includes daemons in standby-replay
waiting for a rank to fail (remember that a standby-replay daemon will not be
assigned to take over a failure for another rank or a failure in another
CephFS file system). The pool of standby daemons not in standby-replay counts
towards any file system's count. Each file system may set the number of
standby daemons wanted using:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.

Configuring standby daemons
---------------------------

There are four configuration settings that control how a daemon
will behave while in standby:

::

    mds_standby_for_name
    mds_standby_for_rank
    mds_standby_for_fscid
    mds_standby_replay

These may be set in the ceph.conf on the host where the MDS daemon
runs (as opposed to on the monitor). The daemon loads these settings
when it starts, and sends them to the monitor.

By default, if none of these settings are used, all MDS daemons
which do not hold a rank will be used as standbys for any rank.

The settings which associate a standby daemon with a particular
name or rank do not guarantee that the daemon will *only* be used
for that rank. They mean that when several standbys are available,
the associated standby daemon will be used. If a rank is failed,
and a standby is available, it will be used even if it is associated
with a different rank or named daemon.

mds_standby_replay
~~~~~~~~~~~~~~~~~~

If this is set to true, then the standby daemon will continuously read
the metadata journal of an up rank. This will give it
a warm metadata cache, and speed up the process of failing over
if the daemon serving the rank fails.

An up rank may only have one standby replay daemon assigned to it.
If two daemons are both set to be standby replay then one of them
will arbitrarily win, and the other will become a normal non-replay
standby.

Once a daemon has entered the standby replay state, it will only be
used as a standby for the rank that it is following. If another rank
fails, this standby replay daemon will not be used as a replacement,
even if no other standbys are available.

*Historical note:* In Ceph prior to v10.2.1, this setting was always
treated as true (even when set to ``false``) whenever
``mds_standby_for_*`` was also set.

mds_standby_for_name
~~~~~~~~~~~~~~~~~~~~

Set this to make the standby daemon only take over a failed rank
if the last daemon to hold it matches this name.

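A minimal sketch of this setting, assuming two hypothetical daemons named
'a' and 'b': daemon 'b' is configured to follow 'a', so it is the preferred
replacement when the rank last held by 'a' fails.

::

    [mds.b]
    # 'a' is a hypothetical daemon name used for illustration
    mds standby for name = a
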
mds_standby_for_rank
~~~~~~~~~~~~~~~~~~~~

Set this to make the standby daemon only take over the specified
rank. If another rank fails, this daemon will not be used to
replace it.

Use in conjunction with ``mds_standby_for_fscid`` to be specific
about which filesystem's rank you are targeting, if you have
multiple filesystems.

mds_standby_for_fscid
~~~~~~~~~~~~~~~~~~~~~

If ``mds_standby_for_rank`` is set, this is simply a qualifier to
say which filesystem's rank is referred to.

If ``mds_standby_for_rank`` is not set, then setting FSCID will
cause this daemon to target any rank in the specified FSCID. Use
this if you have a daemon that you want to use for any rank, but
only within a particular filesystem.

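To illustrate how the rank and FSCID settings combine, here is a sketch for
a hypothetical daemon 'c' that should stand by for rank 1 of the filesystem
whose FSCID is 2. The daemon name, rank and FSCID are all made-up values.

::

    [mds.c]
    # Hypothetical values: rank 1 within the filesystem with FSCID 2
    mds standby for rank = 1
    mds standby for fscid = 2

Leaving out the ``mds standby for rank`` line would instead make 'c' a
standby for any rank in that filesystem, as described above.
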
mon_force_standby_active
~~~~~~~~~~~~~~~~~~~~~~~~

This setting is used on monitor hosts. It defaults to true.

If it is false, then daemons configured with standby_replay=true
will **only** become active if the rank/name that they have
been configured to follow fails. On the other hand, if this
setting is true, then a daemon configured with standby_replay=true
may be assigned some other rank.

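As a sketch, the stricter behaviour could be requested with a fragment like
the one below in the ceph.conf used by the monitors. Placing it under
``[mon]`` is an assumption based on the statement that the setting is used
on monitor hosts.

::

    [mon]
    # Standby-replay daemons then only take over the rank/name they follow
    mon force standby active = false
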
Examples
--------

These are example ceph.conf snippets. In practice you can either
copy a ceph.conf with all daemons' configuration to all your servers,
or you can have a different file on each server that contains just
that server's daemons' configuration.

Simple pair
~~~~~~~~~~~

Two MDS daemons 'a' and 'b' acting as a pair, where whichever one is not
currently assigned a rank will be the standby replay follower
of the other.

::

    [mds.a]
    mds standby replay = true
    mds standby for rank = 0

    [mds.b]
    mds standby replay = true
    mds standby for rank = 0

Floating standby
~~~~~~~~~~~~~~~~

Three MDS daemons 'a', 'b' and 'c', in a filesystem that has
``max_mds`` set to 2.

::

    # No explicit configuration required: whichever daemon is
    # not assigned a rank will go into 'standby' and take over
    # for whichever other daemon fails.

Two MDS clusters
~~~~~~~~~~~~~~~~

With two filesystems, I have four MDS daemons, and I want two
to act as a pair for one filesystem and two to act as a pair
for the other filesystem.

::

    [mds.a]
    mds standby for fscid = 1

    [mds.b]
    mds standby for fscid = 1

    [mds.c]
    mds standby for fscid = 2

    [mds.d]
    mds standby for fscid = 2