Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS filesystem is configured for a single active MDS daemon
by default. To scale metadata performance for large scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.


Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS filesystem has a ``max_mds`` setting, which controls
how many ranks will be created. The actual number of ranks in the
filesystem will only be increased if a spare daemon is available to
take on the new rank. For example, if there is only one MDS daemon
running and ``max_mds`` is set to two, no second rank will be created.
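
Before raising the value, you can check the current setting; a minimal
sketch, assuming the filesystem is named ``cephfs_a`` as in the
deactivation example further below:

::

    ceph fs get cephfs_a | grep max_mds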

Set ``max_mds`` to the desired number of ranks. In the following examples
the ``fsmap`` line of ``ceph status`` is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set cephfs_a max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.
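
To follow the transition, you can watch the ``fsmap`` line until the new
rank reports ``up:active``; a minimal sketch:

::

    watch -n 1 "ceph status | grep fsmap"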

Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.
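
As a worked example (the daemon names and counts here are illustrative,
not taken from the examples above): with five MDS daemons in total,
setting ``max_mds`` to 3 leaves two standbys, so the filesystem can
withstand two server failures:

::

    # five daemons total: three active ranks, two standbys
    ceph fs set cephfs_a max_mds 3
    # fsmap (illustrative): 3/3/3 up {0=a=up:active,1=b=up:active,2=c=up:active}, 2 up:standby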

Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All ranks, including the rank(s) to be removed, must first be active. This
means that you must have at least ``max_mds`` MDS daemons available.

First, set ``max_mds`` to a lower number. For example, to go back to
having just a single active MDS:

::

    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set cephfs_a max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:active}, 1 up:standby

Note that we still have two active MDSs: the ranks still exist even though
we have decreased ``max_mds``, because ``max_mds`` only restricts the creation
of new ranks. This is visible in the fsmap line above, where ``2/2/1`` shows
two ranks still up while ``max_mds`` is now 1.

Next, use the ``ceph mds deactivate <rank>`` command to remove the
unneeded rank:

::

    ceph mds deactivate cephfs_a:1
    telling mds.1:1 172.21.9.34:6806/837679928 to deactivate

    # fsmap e11: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e12: 1/1/1 up {0=a=up:active}, 1 up:standby
    # fsmap e13: 1/1/1 up {0=a=up:active}, 2 up:standby

The deactivated rank will first enter the stopping state for a period
of time while it hands off its share of the metadata to the remaining
active daemons. This phase can take from seconds to minutes. If the
MDS appears to be stuck in the stopping state, that should be investigated
as a possible bug.

If an MDS daemon crashes or is killed while in the 'stopping' state, a
standby will take over and the rank will go back to 'active'. You can
try to deactivate it again once it has come back up.
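
A minimal sketch of such a retry, reusing the rank from the example
above:

::

    # confirm the rank has returned to up:active, then retry
    ceph status | grep fsmap
    ceph mds deactivate cephfs_a:1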

When a daemon finishes stopping, it will respawn itself and go
back to being a standby.


Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users, but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit the impact
of users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``. Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.
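
To unpin a directory, the attribute can accordingly be set back to ``-1``;
a minimal sketch:

::

    setfattr -n ceph.dir.pin -v -1 path/to/dir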

A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # "a" and "a/b" are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # "a/b" is now pinned to rank 0; "a" and the rest of its children are still pinned to rank 1
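
To inspect a pin after setting it, the attribute can be read back with
``getfattr``; a minimal sketch, assuming the directories from the example
above (the output lines, prefixed with ``#``, are illustrative):

::

    getfattr -n ceph.dir.pin a/b
    # file: a/b
    # ceph.dir.pin="0"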