.. _mds-standby:

Terminology
-----------

A Ceph cluster may have zero or more CephFS *file systems*. Each CephFS
file system has a human-readable name (set in ``fs new``)
and an integer ID. The ID is called the file system cluster ID,
or *FSCID*.

Each CephFS file system has a number of *ranks*, one by default,
which start at zero. A rank may be thought of as a metadata shard.
Controlling the number of ranks in a file system is described
in :doc:`/cephfs/multimds`.

Each CephFS ``ceph-mds`` process (a *daemon*) initially starts up
without a rank. It may be assigned one by the monitor cluster.
A daemon may only hold one rank at a time. Daemons only give up
a rank when the ``ceph-mds`` process stops.

If a rank is not associated with a daemon, the rank is
considered *failed*. Once a rank is assigned to a daemon,
the rank is considered *up*.

A daemon has a *name* that is set statically by the administrator
when the daemon is first configured. Typical configurations
use the hostname where the daemon runs as the daemon name.

A ``ceph-mds`` daemon can be assigned to a particular file system by
setting the ``mds_join_fs`` configuration option to the file system
name.

Each time a daemon starts up, it is also assigned a *GID*, which
is unique to this particular process lifetime of the daemon. The
GID is an integer.

Referring to MDS daemons
------------------------

Most of the administrative commands that refer to an MDS daemon
accept a flexible argument format that may contain a rank, a GID
or a name.

Where a rank is used, this may optionally be qualified with
a leading file system name or ID. If a daemon is a standby (i.e.
it is not currently assigned a rank), then it may only be
referred to by GID or name.

For example, if we had an MDS daemon which was called 'myhost',
had GID 5446, and was assigned rank 0 in the file system 'myfs'
which had FSCID 3, then any of the following would be suitable
forms of the 'fail' command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # File system name and rank

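The argument forms above can be told apart by syntax alone. The following is a
minimal Python sketch of that classification, not Ceph's own parser: the names
``MdsSpec`` and ``classify_mds_spec`` are hypothetical, and the sketch makes no
attempt to decide whether a bare integer is a rank or a GID, since that depends
on the cluster's current state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MdsSpec:
    """Syntactic form of an MDS argument (hypothetical helper type)."""
    kind: str                      # "qualified_rank", "integer", or "name"
    fs: Optional[str] = None       # file system name or FSCID, if given
    number: Optional[int] = None   # rank or GID, if numeric
    name: Optional[str] = None     # daemon name, if non-numeric


def classify_mds_spec(arg: str) -> MdsSpec:
    """Classify an argument such as '5446', 'myhost', '0', '3:0' or 'myfs:0'."""
    if ":" in arg:
        # "<fs name or FSCID>:<rank>"
        fs, _, rank = arg.partition(":")
        if not fs or not rank.isdecimal():
            raise ValueError(f"malformed qualified rank: {arg!r}")
        return MdsSpec(kind="qualified_rank", fs=fs, number=int(rank))
    if arg.isdecimal():
        # A bare integer may be a rank or a GID; this sketch does not
        # resolve the ambiguity because it has no view of the cluster.
        return MdsSpec(kind="integer", number=int(arg))
    return MdsSpec(kind="name", name=arg)
```

For instance, ``classify_mds_spec("myfs:0")`` yields a qualified rank with
``fs="myfs"`` and ``number=0``, while ``classify_mds_spec("myhost")`` yields a
daemon name.
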
Managing failover
-----------------

If an MDS daemon stops communicating with the monitor, the monitor will wait
``mds_beacon_grace`` seconds (default 15 seconds) before marking the daemon as
*laggy*. If a standby is available, the monitor will immediately replace the
laggy daemon.

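The timing rule can be illustrated with a small Python sketch. It only models
the comparison described above and is not the monitor's implementation: the
function name ``is_laggy`` and its parameters are hypothetical, and whether the
real check is strict at exactly the grace boundary is an assumption here.

```python
MDS_BEACON_GRACE = 15.0  # seconds; the default quoted in the text above


def is_laggy(last_beacon: float, now: float,
             grace: float = MDS_BEACON_GRACE) -> bool:
    """Return True if the daemon has been silent for longer than the grace.

    `last_beacon` and `now` are timestamps in seconds on the same clock.
    Treating a gap of exactly `grace` seconds as not laggy is an
    assumption of this sketch.
    """
    return (now - last_beacon) > grace
```

With the default grace, a daemon last heard from 10 seconds ago is not laggy,
while one last heard from 16 seconds ago is.
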
Each file system may specify a number of standby daemons to be considered
healthy. This number includes daemons in standby-replay waiting for a rank to
fail (remember that a standby-replay daemon will not be assigned to take over a
failure for another rank or a failure in another CephFS file system). The
pool of standby daemons not in replay counts towards any file system count.
Each file system may set the number of standby daemons wanted using:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.


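The counting rule for ``standby_count_wanted`` can be sketched as follows. This
is an illustration of the rule as stated above, not the monitor's actual health
check: the function ``standby_shortfall`` and its parameter names are
hypothetical.

```python
def standby_shortfall(wanted: int,
                      standby_replay_for_fs: int,
                      pooled_standbys: int) -> int:
    """Return how many standbys a file system is missing (0 means healthy).

    wanted                -- the file system's standby_count_wanted
    standby_replay_for_fs -- standby-replay daemons following this file system
    pooled_standbys       -- standbys not in replay; these count towards
                             every file system
    """
    if wanted == 0:
        # A count of 0 disables the health check.
        return 0
    available = standby_replay_for_fs + pooled_standbys
    return max(0, wanted - available)
```

For example, a file system that wants two standbys but has only one
standby-replay daemon and an empty pool is one standby short.
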
.. _mds-standby-replay:

Configuring standby-replay
--------------------------

Each CephFS file system may be configured to add standby-replay daemons. These
standby daemons follow the active MDS's metadata journal to reduce failover
time in the event the active MDS becomes unavailable. Each active MDS may have
only one standby-replay daemon following it.

Configuring standby-replay on a file system is done using:

::

    ceph fs set <fs name> allow_standby_replay <bool>

Once set, the monitors will assign available standby daemons to follow the
active MDSs in that file system.

Once an MDS has entered the standby-replay state, it will only be used as a
standby for the rank that it is following. If another rank fails, this
standby-replay daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if standby-replay
is used then every active MDS should have a standby-replay daemon.

.. _mds-join-fs:

Configuring MDS file system affinity
------------------------------------

You may want to have an MDS used for a particular file system. Or, perhaps you
have larger MDSs on better hardware that should be preferred over a last-resort
standby on lesser or over-provisioned hardware. To express this preference,
CephFS provides a configuration option for MDS called ``mds_join_fs`` which
enforces this *affinity*.

As part of any failover, the Ceph monitors will prefer standby daemons with
``mds_join_fs`` equal to the name of the file system with the failed rank. If no
standby exists with ``mds_join_fs`` equal to the file system name, the monitors
will choose a *vanilla* standby (no setting for ``mds_join_fs``) for the
replacement, or any other available standby as a last resort. Note that this
does not change the behavior that ``standby-replay`` daemons are always
selected before looking at other standbys.

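The preference order described above can be summarized in a short Python
sketch. It is a simplified model of the stated rules, not the monitors' code:
the ``Standby`` type and the ``pick_replacement`` function are hypothetical,
and ties within one preference level are broken by list order, which is an
assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Standby:
    """A standby daemon as seen by this sketch (hypothetical type)."""
    name: str
    join_fs: Optional[str] = None                # value of mds_join_fs, if set
    replaying: Optional[Tuple[str, int]] = None  # (fs name, rank) if standby-replay


def pick_replacement(fs_name: str, rank: int,
                     standbys: Sequence[Standby]) -> Optional[Standby]:
    """Pick a replacement for a failed rank, or None if nothing is eligible."""
    # 1. A standby-replay daemon following exactly this rank always wins.
    for s in standbys:
        if s.replaying == (fs_name, rank):
            return s
    # Standby-replay daemons following other ranks are never eligible.
    regular = [s for s in standbys if s.replaying is None]
    preferences = (
        lambda s: s.join_fs == fs_name,  # 2. affinity for this file system
        lambda s: s.join_fs is None,     # 3. "vanilla" standby
        lambda s: True,                  # 4. any other standby, last resort
    )
    for matches in preferences:
        for s in regular:
            if matches(s):
                return s
    return None
```

Given standbys ``a`` (``mds_join_fs`` set to another file system), ``b``
(vanilla) and ``c`` (``mds_join_fs`` set to ``cephfs``), a failed rank in
``cephfs`` would be taken over by ``c`` in this model.
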
In addition, the monitors will regularly examine the CephFS file systems when
they are stable to check if a standby with stronger affinity is available to
replace an MDS with lower affinity. This process is also done for
standby-replay daemons: if a regular standby has stronger affinity than the
standby-replay MDS, it will replace the standby-replay MDS.

For example, given this stable and healthy file system:

::

    $ ceph fs dump
    dumped fsmap epoch 399
    ...
    Filesystem 'cephfs' (27)
    ...
    e399
    max_mds 1
    in 0
    up {0=20384}
    failed
    damaged
    stopped
    ...
    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]

    Standby daemons:

    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]


You may set ``mds_join_fs`` on the standby to enforce your preference: ::

    $ ceph config set mds.b mds_join_fs cephfs

After automatic failover: ::

    $ ceph fs dump
    dumped fsmap epoch 405
    e405
    ...
    Filesystem 'cephfs' (27)
    ...
    max_mds 1
    in 0
    up {0=10420}
    failed
    damaged
    stopped
    ...
    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

    Standby daemons:

    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]

Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
output, the file system name from ``mds_join_fs`` is changed to the file system
identifier (27). If the file system is recreated with the same name, the
standby will follow the new file system as expected.

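When checking such output by script, the daemon entries can be picked apart
with a regular expression. The sketch below is written against the entry
layout shown in the examples above only. The function name
``parse_mds_entry`` is hypothetical, and the human-readable dump format may
differ between releases, so treat this as an illustration rather than a stable
interface.

```python
import re
from typing import Dict, Optional

# Matches entries such as:
#   [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [...]]
_MDS_ENTRY = re.compile(
    r"\[mds\.(?P<name>[^{]+)"
    r"\{(?P<rank>-?\d+):(?P<gid>\d+)\}"
    r" state (?P<state>\S+)"
    r" seq (?P<seq>\d+)"
    r"(?: join_fscid=(?P<join_fscid>-?\d+))?"
)


def parse_mds_entry(line: str) -> Optional[Dict[str, object]]:
    """Extract name, rank, GID, state and join_fscid from one dump entry."""
    m = _MDS_ENTRY.search(line)
    if m is None:
        return None
    join_fscid = m.group("join_fscid")
    return {
        "name": m.group("name"),
        "rank": int(m.group("rank")),   # -1 in the examples for a standby
        "gid": int(m.group("gid")),
        "state": m.group("state"),
        "join_fscid": int(join_fscid) if join_fscid is not None else None,
    }
```

Applied to the ``mds.b`` entry from the second dump, this returns rank ``0``,
GID ``10420``, state ``up:active`` and ``join_fscid`` ``27``.
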
Finally, if the file system is degraded or undersized, no failover will occur
to enforce ``mds_join_fs``.