.. _cephfs-multimds:

Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS file system is configured for a single active MDS daemon
by default. To scale metadata performance for large scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.


Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS file system has a *max_mds* setting, which controls how many ranks
will be created. The actual number of ranks in the file system will only be
increased if a spare daemon is available to take on the new rank. For example,
if there is only one MDS daemon running, and max_mds is set to two, no second
rank will be created. (Note that such a configuration is not Highly Available
(HA) because no standby is available to take over for a failed rank. The
cluster will complain via health warnings when configured this way.)

Set ``max_mds`` to the desired number of ranks. In the following examples
the "fsmap" line of "ceph status" is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set <fs_name> max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.

Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is at most one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.
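
For example, a cluster with three MDS daemons and ``max_mds`` set to ``2``
keeps two ranks active and one daemon in reserve. The fsmap line below is
illustrative only, following the format of the examples above:

::

    # fsmap e12: 2/2/2 up {0=a=up:active,1=b=up:active}, 1 up:standby

To withstand two simultaneous server failures with the same ``max_mds``, run
four daemons so that two standbys remain available.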

Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reducing the number of ranks is as simple as reducing ``max_mds``:

::

    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set <fs_name> max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    ...
    # fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby

The cluster will automatically stop extra ranks incrementally until ``max_mds``
is reached.

See :doc:`/cephfs/administration` for more details on which forms ``<role>``
can take.

Note: stopped ranks will first enter the stopping state for a period of
time while they hand off their share of the metadata to the remaining active
daemons. This phase can take from seconds to minutes. If the MDS appears to
be stuck in the stopping state then that should be investigated as a possible
bug.
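
While a rank is winding down it is reported as ``up:stopping``. One way to
follow its progress (a sketch; the exact output format varies by release) is
to re-run the status commands until the stopping rank disappears:

::

    # per-rank states for a single file system
    ceph fs status <fs_name>

    # or the cluster-wide view, including the fsmap line
    ceph status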

If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
standby will take over and the cluster monitors will again try to stop
the daemon.

When a daemon finishes stopping, it will respawn itself and go back to being a
standby.


Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit impact of
users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``. Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.
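
To remove a pin, set the attribute back to its default value so that the
subtree is handled by the dynamic balancer again (a minimal sketch;
``path/to/dir`` is a placeholder path):

::

    # un-pin the subtree
    setfattr -n ceph.dir.pin -v -1 path/to/dir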

A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # a and b are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1
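
To check where a pinned subtree actually ended up, an MDS daemon can be asked
to dump the subtrees it currently holds. This sketch assumes a daemon named
``a`` (as in the fsmap examples above) and that the ``get subtrees`` command
is available via ``ceph tell`` in your release:

::

    # dump this daemon's subtree map as JSON; export-pinned directories show
    # up as subtree boundaries
    ceph tell mds.a get subtrees | less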


Setting subtree partitioning policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to set up **automatic** static partitioning of subtrees via
a set of **policies**. In CephFS, this automatic static partitioning is
referred to as **ephemeral pinning**. Any directory (inode) which is
ephemerally pinned will be automatically assigned to a particular rank
according to a consistent hash of its inode number. The set of all
ephemerally pinned directories should be uniformly distributed across all
ranks.

Ephemerally pinned directories are so named because the pin may not persist
once the directory inode is dropped from cache. However, an MDS failover does
not affect the ephemeral nature of the pinned directory. The MDS records what
subtrees are ephemerally pinned in its journal so MDS failovers do not drop
this information.

A directory is either ephemerally pinned or not. Which rank it is pinned to is
derived from its inode number and a consistent hash. This means that
ephemerally pinned directories are somewhat evenly spread across the MDS
cluster. The **consistent hash** also minimizes redistribution when the MDS
cluster grows or shrinks. So, growing an MDS cluster may automatically increase
your metadata throughput with no other administrative intervention.

Presently, there are two types of ephemeral pinning:

**Distributed Ephemeral Pins**: This policy indicates that **all** of a
directory's immediate children should be ephemerally pinned. The canonical
example would be the ``/home`` directory: we want every user's home directory
to be spread across the entire MDS cluster. This can be set via:

::

    setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home


**Random Ephemeral Pins**: This policy indicates any descendant sub-directory
may be ephemerally pinned. This is set through the extended attribute
``ceph.dir.pin.random`` with the value set to the percentage of directories
that should be pinned. For example:

::

    setfattr -n ceph.dir.pin.random -v 0.5 /cephfs/tmp

This would cause any directory loaded into cache or created under ``/tmp`` to
be ephemerally pinned 50 percent of the time.

It is recommended to only set this to small values, like ``.001`` or ``0.1%``.
Having too many subtrees may degrade performance. For this reason, the config
``mds_export_ephemeral_random_max`` enforces a cap on the maximum of this
percentage (default: ``.01``). The MDS returns ``EINVAL`` when attempting to
set a value beyond this config.

Both random and distributed ephemeral pin policies are off by default in
Octopus. The features may be enabled via the
``mds_export_ephemeral_random`` and ``mds_export_ephemeral_distributed``
configuration options.
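
One way to enable both features cluster-wide is through the central
configuration store, using the option names listed above (a sketch; adjust to
your deployment):

::

    ceph config set mds mds_export_ephemeral_distributed true
    ceph config set mds mds_export_ephemeral_random true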

Ephemeral pins may override parent export pins and vice versa. What determines
which policy is followed is the rule of the closest parent: if a closer parent
directory has a conflicting policy, use that one instead. For example:

::

    mkdir -p foo/bar1/baz foo/bar2
    setfattr -n ceph.dir.pin -v 0 foo
    setfattr -n ceph.dir.pin.distributed -v 1 foo/bar1

The ``foo/bar1/baz`` directory will be ephemerally pinned because the
``foo/bar1`` policy overrides the export pin on ``foo``. The ``foo/bar2``
directory will obey the pin on ``foo`` normally.

For the reverse situation:

::

    mkdir -p home/{patrick,john}
    setfattr -n ceph.dir.pin.distributed -v 1 home
    setfattr -n ceph.dir.pin -v 2 home/patrick

The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``.

If a directory has an export pin and an ephemeral pin policy, the export pin
applies to the directory itself and the policy to its children. So:

::

    mkdir -p home/{patrick,john}
    setfattr -n ceph.dir.pin -v 0 home
    setfattr -n ceph.dir.pin.distributed -v 1 home

The home directory inode (and all of its directory fragments) will always be
located on rank 0. All children including ``home/patrick`` and ``home/john``
will be ephemerally pinned according to the distributed policy. This may only
matter for some obscure performance advantages. All the same, it's mentioned
here so the override policy is clear.