.. _cephfs-multimds:

Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS file system is configured for a single active MDS daemon
by default.  To scale metadata performance for large-scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads.  Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.


Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS file system has a *max_mds* setting, which controls how many ranks
will be created.  The actual number of ranks in the file system will only be
increased if a spare daemon is available to take on the new rank. For example,
if there is only one MDS daemon running, and max_mds is set to two, no second
rank will be created. (Note that such a configuration is not Highly Available
(HA) because no standby is available to take over for a failed rank. The
cluster will complain via health warnings when configured this way.)

Set ``max_mds`` to the desired number of ranks.  In the following examples
the "fsmap" line of "ceph status" is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set <fs_name> max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.
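
When scripting this change, it can help to check the fsmap line for the
expected state. The following is a minimal sketch rather than a Ceph tool: the
helper name ``count_state`` is hypothetical, and it assumes the fsmap line has
the format shown in the example above.

::

    # Hypothetical helper, not part of Ceph: count the ranks in a given state.
    # Usage: count_state <state> <fsmap line>
    count_state() {
        printf '%s\n' "$2" |
            sed 's/.*{\(.*\)}.*/\1/' |   # keep only the rank list in braces
            grep -o "up:$1" |
            wc -l |
            tr -d ' '
    }

    fsmap='fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby'
    count_state active "$fsmap"   # prints 2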

Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.
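
The arithmetic above can be sketched as a small shell function. This is only
an illustration: the name ``ha_max_mds`` is hypothetical, and it assumes one
MDS daemon per server and one standby for each failure to be tolerated.

::

    # Hypothetical helper, not part of Ceph: largest max_mds that still
    # leaves one standby per server failure to be tolerated.
    # Usage: ha_max_mds <total MDS daemons> <failures to tolerate>
    ha_max_mds() {
        total=$1
        failures=$2
        ranks=$((total - failures))
        if [ "$ranks" -lt 1 ]; then
            echo "not enough MDS daemons for $failures standby(s)" >&2
            return 1
        fi
        echo "$ranks"
    }

    ha_max_mds 5 2   # prints 3

With five MDS daemons and two failures to withstand, three ranks can be
active while two daemons remain on standby.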

Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reducing the number of ranks is as simple as reducing ``max_mds``:

::
    
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set <fs_name> max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    ...
    # fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby

The cluster will automatically stop extra ranks incrementally until ``max_mds``
is reached.

See :doc:`/cephfs/administration` for more details on which forms ``<role>``
can take.

Note: a stopped rank first enters the stopping state for a period of
time while it hands off its share of the metadata to the remaining active
daemons.  This phase can take from seconds to minutes.  If the MDS appears to
be stuck in the stopping state, that should be investigated as a possible
bug.

If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
standby will take over and the cluster monitors will again try to stop
the daemon.

When a daemon finishes stopping, it will respawn itself and go back to being a
standby.


Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit impact of
users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``.  Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.

A directory's export pin is inherited from its closest parent with a set export
pin.  In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # a and b are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1
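
The closest-parent rule can be modelled without a file system. The sketch
below is purely illustrative and does not talk to CephFS: the ``PINS`` table
and the ``effective_pin`` function are hypothetical names that mirror the
``a`` and ``a/b`` example above.

::

    # Hypothetical model of export pin inheritance ("path=rank" per line).
    PINS='a=1
    a/b=0'

    # Usage: effective_pin <relative path>
    effective_pin() {
        path=$1
        while :; do
            rank=$(printf '%s\n' "$PINS" |
                awk -F= -v p="$path" '$1 == p { print $2 }')
            if [ -n "$rank" ]; then
                echo "$rank"
                return 0
            fi
            case $path in
                */*) path=${path%/*} ;;          # step up to the parent
                *)   echo "-1"; return 0 ;;      # no pinned ancestor
            esac
        done
    }

    effective_pin a/b/c   # prints 0 (inherited from a/b)
    effective_pin a/x     # prints 1 (inherited from a)
    effective_pin z       # prints -1 (not pinned)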


Setting subtree partitioning policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to set up **automatic** static partitioning of subtrees via
a set of **policies**. In CephFS, this automatic static partitioning is
referred to as **ephemeral pinning**. Any directory (inode) which is
ephemerally pinned will be automatically assigned to a particular rank
according to a consistent hash of its inode number. The set of all
ephemerally pinned directories should be uniformly distributed across all
ranks.

Ephemerally pinned directories are so named because the pin may not persist
once the directory inode is dropped from cache. However, an MDS failover does
not affect the ephemeral nature of the pinned directory. The MDS records what
subtrees are ephemerally pinned in its journal so MDS failovers do not drop
this information.

A directory is either ephemerally pinned or not. Which rank it is pinned to is
derived from its inode number and a consistent hash. This means that
ephemerally pinned directories are somewhat evenly spread across the MDS
cluster. The **consistent hash** also minimizes redistribution when the MDS
cluster grows or shrinks. So, growing an MDS cluster may automatically increase
your metadata throughput with no other administrative intervention.

Presently, there are two types of ephemeral pinning:

**Distributed Ephemeral Pins**: This policy indicates that **all** of a
directory's immediate children should be ephemerally pinned. The canonical
example would be the ``/home`` directory: we want every user's home directory
to be spread across the entire MDS cluster. This can be set via:

::

    setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home


**Random Ephemeral Pins**: This policy indicates any descendant sub-directory
may be ephemerally pinned. This is set through the extended attribute
``ceph.dir.pin.random`` with the value set to the percentage of directories
that should be pinned, written as a fraction (``0.5`` means 50 percent). For
example:

::

    setfattr -n ceph.dir.pin.random -v 0.5 /cephfs/tmp

This would cause any directory loaded into cache or created under ``/tmp`` to
be ephemerally pinned 50 percent of the time.

It is recommended to only set this to small values, like ``.001`` (``0.1%``).
Having too many subtrees may degrade performance. For this reason, the config
``mds_export_ephemeral_random_max`` enforces a cap on the maximum of this
percentage (default: ``.01``). The MDS returns ``EINVAL`` when attempting to
set a value beyond this config.
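
A script can check a value against the cap before calling ``setfattr``. The
following is a minimal sketch: ``within_random_cap`` is a hypothetical helper,
and its default cap of ``0.01`` is assumed to match the
``mds_export_ephemeral_random_max`` setting on your cluster.

::

    # Hypothetical helper, not part of Ceph: succeed only if a random-pin
    # value lies between 0 and the cap.
    # Usage: within_random_cap <value> [cap]
    within_random_cap() {
        value=$1
        cap=${2:-0.01}   # assumed to match mds_export_ephemeral_random_max
        awk -v v="$value" -v c="$cap" \
            'BEGIN { exit !(v + 0 >= 0 && v + 0 <= c + 0) }'
    }

    if within_random_cap 0.001; then
        echo "0.001 is within the cap"
    else
        echo "0.001 exceeds the cap"
    fi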

Both random and distributed ephemeral pin policies are off by default in
Octopus. The features may be enabled via the
``mds_export_ephemeral_random`` and ``mds_export_ephemeral_distributed``
configuration options.
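
As a sketch, assuming the cluster uses the centralized configuration database
(``ceph config set``), both policies could be enabled for all MDS daemons as
follows. Confirm the option names against your release before relying on this.

::

    # Assumes the options are applied to the "mds" section as a whole.
    ceph config set mds mds_export_ephemeral_random true
    ceph config set mds mds_export_ephemeral_distributed true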

Ephemeral pins may override parent export pins and vice versa. What determines
which policy is followed is the rule of the closest parent: if a closer parent
directory has a conflicting policy, use that one instead. For example:

::

    mkdir -p foo/bar1/baz foo/bar2
    setfattr -n ceph.dir.pin -v 0 foo
    setfattr -n ceph.dir.pin.distributed -v 1 foo/bar1

The ``foo/bar1/baz`` directory will be ephemerally pinned because the
``foo/bar1`` policy overrides the export pin on ``foo``. The ``foo/bar2``
directory will obey the pin on ``foo`` normally.

For the reverse situation:

::

    mkdir -p home/{patrick,john}
    setfattr -n ceph.dir.pin.distributed -v 1 home
    setfattr -n ceph.dir.pin -v 2 home/patrick

The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``.

If a directory has an export pin and an ephemeral pin policy, the export pin
applies to the directory itself and the policy to its children. So:

::

    mkdir -p home/{patrick,john}
    setfattr -n ceph.dir.pin -v 0 home
    setfattr -n ceph.dir.pin.distributed -v 1 home

The home directory inode (and all of its directory fragments) will always be
located on rank 0. All children including ``home/patrick`` and ``home/john``
will be ephemerally pinned according to the distributed policy. This may only
matter for some obscure performance advantages. All the same, it's mentioned
here so the override policy is clear.