.. _cephfs-multimds:

Configuring multiple active MDS daemons
---------------------------------------

*Also known as: multi-mds, active-active MDS*

Each CephFS file system is configured for a single active MDS daemon
by default. To scale metadata performance for large scale systems, you
may enable multiple active MDS daemons, which will share the metadata
workload with one another.

When should I use multiple active MDS daemons?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should configure multiple active MDS daemons when your metadata performance
is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically,
a single application running on a single client will not benefit from an
increased number of MDS daemons unless the application is doing a lot of
metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons
are those with many clients, perhaps working on many separate directories.


Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS file system has a *max_mds* setting, which controls how many ranks
will be created. The actual number of ranks in the file system will only be
increased if a spare daemon is available to take on the new rank. For example,
if there is only one MDS daemon running, and max_mds is set to two, no second
rank will be created. (Note that such a configuration is not Highly Available
(HA) because no standby is available to take over for a failed rank. The
cluster will complain via health warnings when configured this way.)

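The current value can be checked before changing it; ``ceph fs get`` prints the
file system's MDS map, which includes ``max_mds`` (the ``grep`` filter here is
just for illustration):

::

    ceph fs get <fs_name> | grep max_mds
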
Set ``max_mds`` to the desired number of ranks. In the following examples
the "fsmap" line of "ceph status" is shown to illustrate the expected
result of commands.

::

    # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

    ceph fs set <fs_name> max_mds 2

    # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the 'creating' state
and then enter the 'active' state.

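If you prefer a tabular view while the new rank comes up, ``ceph fs status``
shows each rank and its current state (the exact columns vary by release):

::

    ceph fs status <fs_name>
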
Standby daemons
~~~~~~~~~~~~~~~

Even with multiple active MDS daemons, a highly available system **still
requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is at most one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
you wish to withstand.

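As a related knob (not required for multiple active MDS daemons), the number of
standby daemons the monitors expect before raising a health warning can be tuned
per file system with the ``standby_count_wanted`` setting, for example:

::

    ceph fs set <fs_name> standby_count_wanted 2
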
Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reducing the number of ranks is as simple as reducing ``max_mds``:

::

    # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
    ceph fs set <fs_name> max_mds 1
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
    ...
    # fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby

The cluster will automatically stop extra ranks incrementally until ``max_mds``
is reached.

See :doc:`/cephfs/administration` for more details on which forms ``<role>`` can
take.

Note: a stopped rank will first enter the stopping state for a period of
time while it hands off its share of the metadata to the remaining active
daemons. This phase can take from seconds to minutes. If the MDS appears to
be stuck in the stopping state then that should be investigated as a possible
bug.

If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
standby will take over and the cluster monitors will again try to stop
the daemon.

When a daemon finishes stopping, it will respawn itself and go back to being a
standby.


.. _cephfs-pinning:

Manually pinning directory trees to a particular rank
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In multiple active metadata server configurations, a balancer runs which works
to spread metadata load evenly across the cluster. This usually works well
enough for most users but sometimes it is desirable to override the dynamic
balancer with explicit mappings of metadata to particular ranks. This can allow
the administrator or users to evenly spread application load or limit impact of
users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an ``export pin``, an
extended attribute of directories. The name of this extended attribute is
``ceph.dir.pin``. Users can set this attribute using standard commands:

::

    setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree
to. A default value of ``-1`` indicates the directory is not pinned.

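The attribute can be read back with ``getfattr`` (from the same ``attr`` package
as ``setfattr``) by querying the vxattr by name:

::

    getfattr -n ceph.dir.pin path/to/dir
    # expected to print the current pin, e.g. ceph.dir.pin="2"
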
A directory's export pin is inherited from its closest parent with a set export
pin. In this way, setting the export pin on a directory affects all of its
children. However, the parent's pin can be overridden by setting the child
directory's export pin. For example:

::

    mkdir -p a/b
    # "a" and "a/b" both start without an export pin set
    setfattr -n ceph.dir.pin -v 1 a/
    # a and b are now pinned to rank 1
    setfattr -n ceph.dir.pin -v 0 a/b
    # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1


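To undo an export pin, set the attribute back to the default of ``-1``; the
directory then follows its closest pinned ancestor again (or the balancer if it
has none):

::

    setfattr -n ceph.dir.pin -v -1 a/b
    # a/b again inherits the pin on a/ (rank 1)
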
.. _cephfs-ephemeral-pinning:

Setting subtree partitioning policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to set up **automatic** static partitioning of subtrees via
a set of **policies**. In CephFS, this automatic static partitioning is
referred to as **ephemeral pinning**. Any directory (inode) which is
ephemerally pinned will be automatically assigned to a particular rank
according to a consistent hash of its inode number. The set of all
ephemerally pinned directories should be uniformly distributed across all
ranks.

Ephemerally pinned directories are so named because the pin may not persist
once the directory inode is dropped from cache. However, an MDS failover does
not affect the ephemeral nature of the pinned directory. The MDS records what
subtrees are ephemerally pinned in its journal so MDS failovers do not drop
this information.

A directory is either ephemerally pinned or not. Which rank it is pinned to is
derived from its inode number and a consistent hash. This means that
ephemerally pinned directories are somewhat evenly spread across the MDS
cluster. The **consistent hash** also minimizes redistribution when the MDS
cluster grows or shrinks. So, growing an MDS cluster may automatically increase
your metadata throughput with no other administrative intervention.

Presently, there are two types of ephemeral pinning:

**Distributed Ephemeral Pins**: This policy causes a directory to fragment
(even well below the normal fragmentation thresholds) and distribute its
fragments as ephemerally pinned subtrees. This has the effect of distributing
immediate children across a range of MDS ranks. The canonical example use-case
would be the ``/home`` directory: we want every user's home directory to be
spread across the entire MDS cluster. This can be set via:

::

    setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home


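As with ``ceph.dir.pin``, the policy attribute can likewise be queried by name
with ``getfattr``:

::

    getfattr -n ceph.dir.pin.distributed /cephfs/home
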
**Random Ephemeral Pins**: This policy indicates any descendant sub-directory
may be ephemerally pinned. This is set through the extended attribute
``ceph.dir.pin.random`` with the value set to the fraction of directories
that should be pinned. For example:

::

    setfattr -n ceph.dir.pin.random -v 0.5 /cephfs/tmp

This would cause any directory loaded into cache or created under
``/cephfs/tmp`` to be ephemerally pinned 50 percent of the time.

It is recommended to only set this to small values, like ``.001`` (i.e. ``0.1%``).
Having too many subtrees may degrade performance. For this reason, the config
``mds_export_ephemeral_random_max`` enforces a cap on the maximum of this
fraction (default: ``.01``). The MDS returns ``EINVAL`` when attempting to
set a value beyond this config.

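If a genuinely larger fraction is required, the cap itself is an MDS
configuration option and can be raised with ``ceph config set`` (doing so is
generally discouraged for the performance reasons above), for example:

::

    ceph config set mds mds_export_ephemeral_random_max 0.02
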
Both random and distributed ephemeral pin policies are off by default in
Octopus. The features may be enabled via the
``mds_export_ephemeral_random`` and ``mds_export_ephemeral_distributed``
configuration options.

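For example, both policies could be enabled cluster-wide with ``ceph config set``:

::

    ceph config set mds mds_export_ephemeral_random true
    ceph config set mds mds_export_ephemeral_distributed true
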
Ephemeral pins may override parent export pins and vice versa. What determines
which policy is followed is the rule of the closest parent: if a closer parent
directory has a conflicting policy, use that one instead. For example:

::

    mkdir -p foo/bar1/baz foo/bar2
    setfattr -n ceph.dir.pin -v 0 foo
    setfattr -n ceph.dir.pin.distributed -v 1 foo/bar1

The ``foo/bar1/baz`` directory will be ephemerally pinned because the
``foo/bar1`` policy overrides the export pin on ``foo``. The ``foo/bar2``
directory will obey the pin on ``foo`` normally.

For the reverse situation:

::

    mkdir -p home/{patrick,john}
    setfattr -n ceph.dir.pin.distributed -v 1 home
    setfattr -n ceph.dir.pin -v 2 home/patrick

The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``.

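To see how subtrees have actually been distributed, each rank's subtree map can
be inspected with the MDS ``get subtrees`` admin command (the output is verbose
JSON; filtering with ``jq`` is common), for example for rank 0:

::

    ceph tell mds.<fs_name>:0 get subtrees
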


Dynamic subtree partitioning with Balancer on specific ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The CephFS file system provides the ``bal_rank_mask`` option to enable the balancer
to dynamically rebalance subtrees within particular active MDS ranks. This
allows administrators to employ both the dynamic subtree partitioning and
static pinning schemes in different active MDS ranks so that metadata loads
are optimized based on user demand. For instance, in realistic cloud
storage environments, where a lot of subvolumes are allotted to multiple
computing nodes (e.g., VMs and containers), some subvolumes that require
high performance are managed by static partitioning, whereas most subvolumes
that experience a moderate workload are managed by the balancer. As the balancer
evenly spreads the metadata workload to all active MDS ranks, the performance of
statically pinned subvolumes may inevitably be affected or degraded. If this
option is enabled, subtrees managed by the balancer do not interfere with
statically pinned subtrees.

This option can be configured with the ``ceph fs set`` command. For example:

::

    ceph fs set <fs_name> bal_rank_mask <hex>

Each bit of the ``<hex>`` number represents a rank. If the ``<hex>`` is
set to ``0x3``, the balancer runs on active ranks ``0`` and ``1``. For example:

::

    ceph fs set <fs_name> bal_rank_mask 0x3

If the ``bal_rank_mask`` is set to ``-1`` or ``all``, all active ranks are masked
and utilized by the balancer. As an example:

::

    ceph fs set <fs_name> bal_rank_mask -1

On the other hand, if the balancer needs to be disabled,
the ``bal_rank_mask`` should be set to ``0x0``. For example:

::

    ceph fs set <fs_name> bal_rank_mask 0x0
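
The currently configured mask is part of the file system's MDS map and should be
visible in the output of ``ceph fs get``, for example:

::

    ceph fs get <fs_name> | grep bal_rank_mask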