Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS filesystem is configured for a single active MDS daemon
by default. To scale metadata performance for large-scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.


Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS filesystem has a *max_mds* setting, which controls
how many ranks will be created. The actual number of ranks
in the filesystem will only be increased if a spare daemon is
available to take on the new rank. For example, if there is only one
MDS daemon running, and max_mds is set to two, no second rank will be
created.
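
You can check the current value of ``max_mds`` by dumping the filesystem's
MDS map with ``ceph fs get`` and looking for the ``max_mds`` field. This is
only a quick sketch: ``cephfs_a`` stands in for your filesystem's name and
the exact output formatting varies between releases.

::

    ceph fs get cephfs_a | grep max_mds
    # max_mds 1
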

Set ``max_mds`` to the desired number of ranks. In the following examples
the "fsmap" line of "ceph status" is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set cephfs_a max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.
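
If you want to watch the transition as it happens, one simple approach (a
sketch assuming a shell with the standard ``watch`` utility available) is to
poll the fsmap line of the cluster status until the new rank reports
up:active:

::

    watch -n 1 "ceph status | grep fsmap"
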

Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.
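
One way to check how many standby daemons are currently available (a sketch;
the exact output format differs between releases) is the fsmap summary printed
by ``ceph mds stat``, where the trailing ``up:standby`` count shows the spare
daemons:

::

    ceph mds stat
    # e5: 1/1/1 up {0=a=up:active}, 2 up:standby
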

Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All ranks, including the rank(s) to be removed, must first be active. This
means that you must have at least ``max_mds`` MDS daemons available.

First, set max_mds to a lower number. For example, to go back to having just
a single active MDS:

::

    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set cephfs_a max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:active}, 1 up:standby

Note that we still have two active MDSs: the ranks still exist even though
we have decreased max_mds, because max_mds only restricts creation
of new ranks.

Next, use the ``ceph mds deactivate <rank>`` command to remove the
unneeded rank:

::

    ceph mds deactivate cephfs_a:1
    telling mds.1:1 172.21.9.34:6806/837679928 to deactivate

    # fsmap e11: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e12: 1/1/1 up {0=a=up:active}, 1 up:standby
    # fsmap e13: 1/1/1 up {0=a=up:active}, 2 up:standby

The deactivated rank will first enter the stopping state for a period
of time while it hands off its share of the metadata to the remaining
active daemons. This phase can take from seconds to minutes. If the
MDS appears to be stuck in the stopping state, that should be investigated
as a possible bug.
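
If a rank does seem stuck in 'stopping', one starting point for investigation
(a sketch, assuming you can reach the admin socket on the host running that
daemon; ``mds.c`` is the daemon holding rank 1 in the example above) is to
look at the operations it still has in flight:

::

    ceph daemon mds.c dump_ops_in_flight
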

If an MDS daemon crashes or is killed while in the 'stopping' state, a
standby will take over and the rank will go back to 'active'. You can
try to deactivate it again once it has come back up.

When a daemon finishes stopping, it will respawn itself and go
back to being a standby.


Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users, but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit the
impact of users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``. Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.

A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # a and b are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # a/b is now pinned to rank 0 and a/ and the rest of its children
    # are still pinned to rank 1
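
To remove a pin, set ``ceph.dir.pin`` back to ``-1``. Continuing the sketch
above:

::

    setfattr -n ceph.dir.pin -v -1 a/b
    # a/b no longer carries its own pin, so it inherits the pin of its
    # closest pinned parent (a) and is again pinned to rank 1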