.. _cephfs-multimds:

Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS file system is configured for a single active MDS daemon
by default. To scale metadata performance for large scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.


Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS file system has a *max_mds* setting, which controls how many ranks
will be created. The actual number of ranks in the file system will only be
increased if a spare daemon is available to take on the new rank. For example,
if there is only one MDS daemon running, and max_mds is set to two, no second
rank will be created. (Note that such a configuration is not Highly Available
(HA) because no standby is available to take over for a failed rank. The
cluster will complain via health warnings when configured this way.)

Set ``max_mds`` to the desired number of ranks. In the following examples
the "fsmap" line of "ceph status" is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set <fs_name> max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby


The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.
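
To watch the transition, re-run ``ceph status`` and look at the fsmap line
shown above. As a sketch (assuming your release also provides the more
detailed ``ceph fs status`` command; ``<fs_name>`` is a placeholder):

::

    # the fsmap line shows rank 1 moving from up:creating to up:active
    ceph status

    # optional: per-rank detail for a single file system
    ceph fs status <fs_name>

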

Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is at most one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.
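
As a concrete (hypothetical) illustration: a cluster running five MDS daemons
that should survive two simultaneous server failures would keep two daemons as
standbys and run at most three active ranks:

::

    # 5 MDS daemons in total, 2 kept as standbys => at most 3 active ranks
    ceph fs set <fs_name> max_mds 3

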

Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reducing the number of ranks is as simple as reducing ``max_mds``:

::

    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set <fs_name> max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    ...
    # fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby

The cluster will automatically stop extra ranks incrementally until ``max_mds``
is reached.
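
To confirm that the change has been applied, you can inspect the file system's
MDS map; the sketch below assumes the ``ceph fs get`` command and uses
``<fs_name>`` as a placeholder:

::

    # print the file system's MDS map and check the max_mds field
    ceph fs get <fs_name> | grep max_mds

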

See :doc:`/cephfs/administration` for more details on which forms ``<role>``
can take.

Note: a stopped rank will first enter the stopping state for a period of
time while it hands off its share of the metadata to the remaining active
daemons. This phase can take from seconds to minutes. If the MDS appears to
be stuck in the stopping state then that should be investigated as a possible
bug.

If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
standby will take over and the cluster monitors will again try to stop
the daemon.

When a daemon finishes stopping, it will respawn itself and go back to being a
standby.


Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit the
impact of users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``. Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.
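
Since ``-1`` is the unpinned default, setting the attribute back to ``-1``
should remove an existing pin and return the subtree to the dynamic balancer
(a sketch; ``path/to/dir`` is a placeholder):

::

    # undo a previous pin by restoring the default value
    setfattr -n ceph.dir.pin -v -1 path/to/dir

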

A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # a and b are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1

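
To verify where a pin ended up, one option (depending on your client and Ceph
version) is to read the attribute back and ask an MDS which subtrees it is
authoritative for; the daemon name ``mds.a`` below is a placeholder:

::

    # read the pin back from a client (vxattr read support varies by client version)
    getfattr -n ceph.dir.pin a/b

    # dump the subtree map of one MDS daemon
    ceph tell mds.a get subtrees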