.. _cephfs-multimds:

Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS file system is configured for a single active MDS daemon
by default. To scale metadata performance for large scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.

Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS file system has a *max_mds* setting, which controls how many ranks
will be created. The actual number of ranks in the file system will only be
increased if a spare daemon is available to take on the new rank. For example,
if there is only one MDS daemon running, and max_mds is set to two, no second
rank will be created. (Note that such a configuration is not Highly Available
(HA) because no standby is available to take over for a failed rank. The
cluster will complain via health warnings when configured this way.)

Set ``max_mds`` to the desired number of ranks. In the following examples
the "fsmap" line of "ceph status" is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set <fs_name> max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.

Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is at most one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.

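As an illustration, one plausible layout for a cluster of three MDS daemons is
two active ranks plus one standby; the ``standby_count_wanted`` file system
setting controls how many standby daemons the monitors expect before raising a
health warning:

::

    ceph fs set <fs_name> max_mds 2
    ceph fs set <fs_name> standby_count_wanted 1
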
Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reducing the number of ranks is as simple as reducing ``max_mds``:

::

    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set <fs_name> max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    ...
    # fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby

The cluster will automatically stop extra ranks incrementally until ``max_mds``
is reached.

See :doc:`/cephfs/administration` for more details on which forms ``<role>``
can take.

Note: a stopped rank will first enter the stopping state for a period of
time while it hands off its share of the metadata to the remaining active
daemons. This phase can take from seconds to minutes. If the MDS appears to
be stuck in the stopping state then that should be investigated as a possible
bug.

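To watch a rank move through ``up:stopping``, the "fsmap" line of ``ceph
status`` used in the examples above is usually sufficient; ``ceph fs status``
gives a similar per-daemon view of MDS states:

::

    ceph fs status <fs_name>
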
If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
standby will take over and the cluster monitors will again try to stop
the daemon.

When a daemon finishes stopping, it will respawn itself and go back to being a
standby.

.. _cephfs-pinning:

Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit impact of
users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``. Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.

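Depending on the client, the current pin can usually be read back with
``getfattr`` by requesting the attribute by name (it may not show up in a
plain ``getfattr -d`` listing); a quick sketch:

::

    getfattr -n ceph.dir.pin path/to/dir
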
A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # a and b are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1
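
Since ``-1`` means "not pinned", the same mechanism can be used to remove an
export pin again, for example:

::

    setfattr -n ceph.dir.pin -v -1 a/b
    # a/b should now inherit its pin from a/ (rank 1) again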

.. _cephfs-ephemeral-pinning:

Setting subtree partitioning policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to set up **automatic** static partitioning of subtrees via
a set of **policies**. In CephFS, this automatic static partitioning is
referred to as **ephemeral pinning**. Any directory (inode) which is
ephemerally pinned will be automatically assigned to a particular rank
according to a consistent hash of its inode number. The set of all
ephemerally pinned directories should be uniformly distributed across all
ranks.

Ephemerally pinned directories are so named because the pin may not persist
once the directory inode is dropped from cache. However, an MDS failover does
not affect the ephemeral nature of the pinned directory. The MDS records what
subtrees are ephemerally pinned in its journal so MDS failovers do not drop
this information.

A directory is either ephemerally pinned or not. Which rank it is pinned to is
derived from its inode number and a consistent hash. This means that
ephemerally pinned directories are somewhat evenly spread across the MDS
cluster. The **consistent hash** also minimizes redistribution when the MDS
cluster grows or shrinks. So, growing an MDS cluster may automatically increase
your metadata throughput with no other administrative intervention.

Presently, there are two types of ephemeral pinning:

**Distributed Ephemeral Pins**: This policy causes a directory to fragment
(even well below the normal fragmentation thresholds) and distribute its
fragments as ephemerally pinned subtrees. This has the effect of distributing
immediate children across a range of MDS ranks. The canonical example use-case
would be the ``/home`` directory: we want every user's home directory to be
spread across the entire MDS cluster. This can be set via:

::

    setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home

**Random Ephemeral Pins**: This policy indicates any descendant sub-directory
may be ephemerally pinned. This is set through the extended attribute
``ceph.dir.pin.random`` with the value set to the fraction of directories
that should be pinned. For example:

::

    setfattr -n ceph.dir.pin.random -v 0.5 /cephfs/tmp

This would cause any directory loaded into cache or created under
``/cephfs/tmp`` to be ephemerally pinned 50 percent of the time.

It is recommended to only set this to small values, like ``.001`` (that is,
0.1% of directories). Having too many subtrees may degrade performance. For
this reason, the config ``mds_export_ephemeral_random_max`` enforces a cap on
the maximum of this fraction (default: ``.01``). The MDS returns ``EINVAL``
when attempting to set a value beyond this config.

Both random and distributed ephemeral pin policies are off by default in
Octopus. The features may be enabled via the
``mds_export_ephemeral_random`` and ``mds_export_ephemeral_distributed``
configuration options.

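A minimal sketch of enabling both policies, assuming the usual ``ceph config``
mechanism for MDS options:

::

    ceph config set mds mds_export_ephemeral_distributed true
    ceph config set mds mds_export_ephemeral_random true
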
Ephemeral pins may override parent export pins and vice versa. What determines
which policy is followed is the rule of the closest parent: if a closer parent
directory has a conflicting policy, that policy is used instead. For example:

::

    mkdir -p foo/bar1/baz foo/bar2
    setfattr -n ceph.dir.pin -v 0 foo
    setfattr -n ceph.dir.pin.distributed -v 1 foo/bar1

The ``foo/bar1/baz`` directory will be ephemerally pinned because the
``foo/bar1`` policy overrides the export pin on ``foo``. The ``foo/bar2``
directory will obey the pin on ``foo`` normally.

For the reverse situation:

::

    mkdir -p home/{patrick,john}
    setfattr -n ceph.dir.pin.distributed -v 1 home
    setfattr -n ceph.dir.pin -v 2 home/patrick

The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``.

Dynamic subtree partitioning with Balancer on specific ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The CephFS file system provides the ``bal_rank_mask`` option to enable the balancer
to dynamically rebalance subtrees within particular active MDS ranks. This
allows administrators to employ both the dynamic subtree partitioning and
static pinning schemes in different active MDS ranks so that metadata loads
are optimized based on user demand. For instance, in realistic cloud
storage environments, where a lot of subvolumes are allotted to multiple
computing nodes (e.g., VMs and containers), some subvolumes that require
high performance are managed by static partitioning, whereas most subvolumes
that experience a moderate workload are managed by the balancer. As the balancer
evenly spreads the metadata workload across all active MDS ranks, the performance
of statically pinned subvolumes may inevitably be affected or degraded. If this
option is enabled, subtrees managed by the balancer are not affected by
statically pinned subtrees.

This option can be configured with the ``ceph fs set`` command. For example:

::

    ceph fs set <fs_name> bal_rank_mask <hex>

Each bit of the ``<hex>`` mask represents a rank. If the ``<hex>`` is
set to ``0x3``, the balancer runs on active ranks ``0`` and ``1``. For example:

::

    ceph fs set <fs_name> bal_rank_mask 0x3

If the ``bal_rank_mask`` is set to ``-1`` or ``all``, all active ranks are masked
and utilized by the balancer. As an example:

::

    ceph fs set <fs_name> bal_rank_mask -1

On the other hand, if the balancer needs to be disabled,
the ``bal_rank_mask`` should be set to ``0x0``. For example:

::

    ceph fs set <fs_name> bal_rank_mask 0x0
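
To confirm which mask is currently in effect, the file system's MDS map can be
inspected; the setting should appear in the output of, for example:

::

    ceph fs get <fs_name>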