Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS filesystem is configured for a single active MDS daemon
by default. To scale metadata performance for large scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.


Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS filesystem has a ``max_mds`` setting, which controls
how many ranks will be created. The actual number of ranks in the
filesystem will only be increased if a spare daemon is available to
take on the new rank. For example, if there is only one MDS daemon
running and ``max_mds`` is set to two, no second rank will be created.
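
Before raising the value, you can check the current setting; a minimal
sketch, assuming the filesystem is named ``cephfs_a`` as in the
deactivation example further below:

::

    ceph fs get cephfs_a | grep max_mds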

Set ``max_mds`` to the desired number of ranks. In the following examples
the ``fsmap`` line of ``ceph status`` is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set cephfs_a max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.
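
To follow the transition, you can watch the ``fsmap`` line until the new
rank reports ``up:active``; a minimal sketch:

::

    watch -n 1 "ceph status | grep fsmap"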

Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.
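
As a worked example (the daemon names and counts here are illustrative,
not taken from the examples above): with five MDS daemons in total,
setting ``max_mds`` to 3 leaves two standbys, so the filesystem can
withstand two server failures:

::

    # five daemons total: three active ranks, two standbys
    ceph fs set cephfs_a max_mds 3
    # fsmap (illustrative): 3/3/3 up {0=a=up:active,1=b=up:active,2=c=up:active}, 2 up:standby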

Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All ranks, including the rank(s) to be removed, must first be active. This
means that you must have at least ``max_mds`` MDS daemons available.

First, set ``max_mds`` to a lower number. For example, to go back to
having just a single active MDS:

::

    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set cephfs_a max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:active}, 1 up:standby

Note that we still have two active MDSs: the ranks still exist even though
we have decreased ``max_mds``, because ``max_mds`` only restricts the creation
of new ranks. This is visible in the fsmap line above, where ``2/2/1`` shows
two ranks still up while ``max_mds`` is now 1.

Next, use the ``ceph mds deactivate <rank>`` command to remove the
unneeded rank:

::

    ceph mds deactivate cephfs_a:1
    telling mds.1:1 172.21.9.34:6806/837679928 to deactivate

    # fsmap e11: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e12: 1/1/1 up {0=a=up:active}, 1 up:standby
    # fsmap e13: 1/1/1 up {0=a=up:active}, 2 up:standby

The deactivated rank will first enter the stopping state for a period
of time while it hands off its share of the metadata to the remaining
active daemons. This phase can take from seconds to minutes. If the
MDS appears to be stuck in the stopping state, that should be investigated
as a possible bug.

If an MDS daemon crashes or is killed while in the 'stopping' state, a
standby will take over and the rank will go back to 'active'. You can
try to deactivate it again once it has come back up.
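
A minimal sketch of such a retry, reusing the rank from the example
above:

::

    # confirm the rank has returned to up:active, then retry
    ceph status | grep fsmap
    ceph mds deactivate cephfs_a:1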

When a daemon finishes stopping, it will respawn itself and go
back to being a standby.


Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users, but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit the impact
of users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``. Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.
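
To unpin a directory, the attribute can accordingly be set back to ``-1``;
a minimal sketch:

::

    setfattr -n ceph.dir.pin -v -1 path/to/dir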

A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # "a" and "a/b" are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # "a/b" is now pinned to rank 0; "a" and the rest of its children are still pinned to rank 1
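
To inspect a pin after setting it, the attribute can be read back with
``getfattr``; a minimal sketch, assuming the directories from the example
above (the output lines, prefixed with ``#``, are illustrative):

::

    getfattr -n ceph.dir.pin a/b
    # file: a/b
    # ceph.dir.pin="0"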