=======================
MDS Cache Configuration
=======================

The Metadata Server coordinates a distributed cache among all MDS and CephFS
clients. The cache serves to improve metadata access latency and allow clients
to safely (coherently) mutate metadata state (e.g. via `chmod`). The MDS issues
**capabilities** and **directory entry leases** to indicate what state clients
may cache and what manipulations clients may perform (e.g. writing to a file).
The MDS and clients both try to enforce a cache size. The mechanism for
specifying the MDS cache size is described below. Note that the MDS cache size
is not a hard limit. The MDS always allows clients to look up new metadata,
which is loaded into the cache. This is an essential policy, as it avoids
deadlock in client requests (some requests may rely on held capabilities before
capabilities are released).
When the MDS cache is too large, the MDS will **recall** client state so cache
items become unpinned and eligible to be dropped. The MDS can only drop cache
state when no clients refer to the metadata to be dropped. Also described below
is how to configure the MDS recall settings for your workload's needs. This is
necessary if the internal throttles on the MDS recall cannot keep up with the
client workload.

MDS Cache Size
--------------
You can limit the size of the Metadata Server (MDS) cache by a byte count. This
is done through the `mds_cache_memory_limit` configuration:

.. confval:: mds_cache_memory_limit

In addition, you can specify a cache reservation by using the
`mds_cache_reservation` parameter for MDS operations:

.. confval:: mds_cache_reservation

The cache reservation is limited as a percentage of the memory and is set to
5% by default. The intent of this parameter is to have the MDS maintain an
extra reserve of memory for its cache for new metadata operations to use. As a
consequence, the MDS should in general operate below its memory limit because
it will recall old state from clients in order to drop unused metadata in its
cache.
If the MDS cannot keep its cache under the target size, the MDS will send a
health alert to the Monitors indicating the cache is too large. This is
controlled by the `mds_health_cache_threshold` configuration, which is by
default 150% of the maximum cache size:

.. confval:: mds_health_cache_threshold

Because the cache limit is not a hard limit, potential bugs in the CephFS
client, MDS, or misbehaving applications might cause the MDS to exceed its
cache size. The health warnings are intended to help the operator detect this
situation and make necessary adjustments or investigate buggy clients.
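As an illustration of how these settings interact, the arithmetic can be
sketched in Python. The 4 GiB limit below is an arbitrary example value, and
the 5% reservation and 150% threshold mirror the defaults described above:

```python
# Illustrative arithmetic for the MDS cache thresholds described above.
# The 4 GiB limit is an arbitrary example, not a recommendation.

GiB = 1024 ** 3

mds_cache_memory_limit = 4 * GiB   # example operator-chosen limit (bytes)
mds_cache_reservation = 0.05       # default: keep 5% of the limit in reserve
mds_health_cache_threshold = 1.5   # default: warn at 150% of the limit

# The MDS aims to keep cache usage below the limit minus the reservation.
target_usage = mds_cache_memory_limit * (1 - mds_cache_reservation)

# A health alert is sent to the Monitors once usage crosses this point.
warning_usage = mds_cache_memory_limit * mds_health_cache_threshold

print(int(target_usage), int(warning_usage))
```

With these example values, the MDS tries to stay below roughly 3.8 GiB of
cache and the health warning fires near 6 GiB.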
MDS Cache Trimming
------------------

There are two configurations for throttling the rate of cache trimming in the
MDS:

.. confval:: mds_cache_trim_threshold

.. confval:: mds_cache_trim_decay_rate

The intent of the throttle is to prevent the MDS from spending too much time
trimming its cache. This may limit its ability to handle client requests or
perform other upkeep.
The trim configurations control an internal **decay counter**. Anytime metadata
is trimmed from the cache, the counter is incremented. The threshold sets the
maximum size of the counter, while the decay rate indicates the exponential
half-life for the counter. If the MDS is continually removing items from its
cache, it will reach a steady state of ``-ln(0.5)/rate*threshold`` items
removed per second.
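To make that formula concrete, here is a small Python sketch; the threshold
and rate values are hypothetical examples, not the shipped defaults:

```python
import math

# Sketch of the steady-state cache trimming rate described above.
# These values are hypothetical examples, not Ceph's shipped defaults.
threshold = 65536   # mds_cache_trim_threshold (items)
rate = 1.0          # mds_cache_trim_decay_rate (half-life, seconds)

# Items removed per second once the decay counter sits at its threshold.
steady_state = -math.log(0.5) / rate * threshold
print(round(steady_state))  # roughly 45426 items per second
```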
.. note:: Increasing the value of the configuration setting
   ``mds_cache_trim_decay_rate`` leads to the MDS spending less time
   trimming the cache. To increase the cache trimming rate, set a lower
   value.

The defaults are conservative and may need to be changed for production MDS
with large cache sizes.

MDS Recall
----------
The MDS limits its recall of client state (capabilities/leases) to prevent
creating too much work for itself handling release messages from clients. This
is controlled via the following configurations.

The maximum number of capabilities to recall from a single client in a given
recall event:

.. confval:: mds_recall_max_caps
The threshold and decay rate for the decay counter on a session:

.. confval:: mds_recall_max_decay_threshold

.. confval:: mds_recall_max_decay_rate

The session decay counter controls the rate of recall for an individual
session. The behavior of the counter works the same as for cache trimming
above. Each capability that is recalled increments the counter.
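Since the same decay-counter mechanism appears throughout this page, a toy
model may help build intuition. This is a sketch of the concept (exponential
decay with a configurable half-life), not Ceph's actual implementation:

```python
import math

class DecayCounter:
    """Toy model of an MDS decay counter; not the real Ceph code."""

    def __init__(self, half_life):
        self.half_life = half_life  # e.g. mds_recall_max_decay_rate
        self.value = 0.0

    def hit(self, count=1):
        # e.g. one recalled capability increments the counter
        self.value += count

    def decay(self, elapsed):
        # After one half-life elapses, the counter falls to 50%.
        self.value *= math.exp(-math.log(2) * elapsed / self.half_life)

c = DecayCounter(half_life=2.5)
c.hit(1000)   # a burst of 1000 recalled capabilities
c.decay(2.5)  # one half-life later...
print(round(c.value))  # 500
```

Against a threshold such as ``mds_recall_max_decay_threshold``, recall would
be throttled whenever the counter's current value exceeds that threshold.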
There is also a global decay counter that throttles recall across all sessions:

.. confval:: mds_recall_global_max_decay_threshold

Its decay rate is the same as ``mds_recall_max_decay_rate``. Any recalled
capability for any session also increments this counter.
If clients are slow to release state, the warning "failing to respond to cache
pressure" or ``MDS_HEALTH_CLIENT_RECALL`` will be reported. Each session's rate
of release is monitored by another decay counter configured by:

.. confval:: mds_recall_warning_threshold

.. confval:: mds_recall_warning_decay_rate

Each time a capability is released, the counter is incremented. If clients do
not release capabilities quickly enough while there is cache pressure, this
counter indicates that the client is slow to release state.
Some workloads and client behaviors may require faster recall of client state
to keep up with capability acquisition. It is recommended to increase the above
counters as needed to resolve any slow recall warnings in the cluster health
state.

MDS Cap Acquisition Throttle
----------------------------
A trivial "find" command on a large directory hierarchy will cause the client
to receive caps significantly faster than it will release them. The MDS will
try to have the client reduce its caps below the ``mds_max_caps_per_client``
limit, but the recall throttles prevent it from catching up to the pace of
acquisition. So the readdir is throttled to control cap acquisition, via the
following configurations:
The threshold and decay rate for the readdir cap acquisition decay counter:

.. confval:: mds_session_cap_acquisition_throttle

.. confval:: mds_session_cap_acquisition_decay_rate

The cap acquisition decay counter controls the rate of cap acquisition via
readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in
the result.
The ratio of ``mds_max_caps_per_client`` that a client must exceed before
readdir may be throttled by the cap acquisition throttle:

.. confval:: mds_session_max_caps_throttle_ratio

The timeout in seconds after which a client request is retried due to cap
acquisition throttling:

.. confval:: mds_cap_acquisition_throttle_retry_request_timeout
If the number of caps acquired by the client per session is greater than
``mds_session_max_caps_throttle_ratio * mds_max_caps_per_client`` and the cap
acquisition decay counter is greater than
``mds_session_cap_acquisition_throttle``, the readdir is throttled. The
readdir request is retried after
``mds_cap_acquisition_throttle_retry_request_timeout`` seconds.

Session Liveness
----------------
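Putting those two conditions together, the throttling decision can be sketched
as follows. This is a simplification with example values (not necessarily the
shipped defaults), not the MDS's actual code:

```python
# Sketch of the readdir cap-acquisition throttle condition described above.
# All values are examples, not necessarily Ceph's shipped defaults.

mds_max_caps_per_client = 1_048_576            # 1M caps
mds_session_max_caps_throttle_ratio = 1.1
mds_session_cap_acquisition_throttle = 500_000

def readdir_throttled(session_caps, acquisition_counter):
    """Return True when a readdir should be deferred for this session."""
    over_ratio = session_caps > (
        mds_session_max_caps_throttle_ratio * mds_max_caps_per_client)
    over_rate = acquisition_counter > mds_session_cap_acquisition_throttle
    return over_ratio and over_rate

print(readdir_throttled(2_000_000, 600_000))  # True: both limits exceeded
print(readdir_throttled(2_000_000, 100_000))  # False: acquisition counter is low
```

A throttled readdir would then be retried after
``mds_cap_acquisition_throttle_retry_request_timeout`` seconds.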
The MDS also keeps track of whether sessions are quiescent. If a client session
is not utilizing its capabilities or is otherwise quiet, the MDS will begin
recalling state from the session even if it's not under cache pressure. This
helps the MDS avoid future work when the cluster workload is hot and cache
pressure is forcing the MDS to recall state. The expectation is that a client
not utilizing its capabilities is unlikely to use those capabilities anytime
soon.
Determining whether a given session is quiescent is controlled by the following
configuration variables:

.. confval:: mds_session_cache_liveness_magnitude

.. confval:: mds_session_cache_liveness_decay_rate
The configuration ``mds_session_cache_liveness_decay_rate`` indicates the
half-life for the decay counter tracking the use of capabilities by the client.
Each time a client manipulates or acquires a capability, the MDS will increment
the counter. This is a rough but effective way to monitor the utilization of
the client cache.
The ``mds_session_cache_liveness_magnitude`` is a base-2 magnitude difference
between the liveness decay counter and the number of capabilities outstanding
for the session. So if the client has ``1*2^20`` (1M) capabilities outstanding
and only uses **less** than ``1*2^(20-mds_session_cache_liveness_magnitude)``
(1K using defaults), the MDS will consider the client to be quiescent and
begin recalling state.

Capability Limit
----------------
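That comparison can be expressed numerically. The sketch below illustrates the
shift described above (a magnitude of 10 matches the 1M-versus-1K example); it
is not the MDS's actual code:

```python
# Sketch of the session liveness check described above; illustrative only.
mds_session_cache_liveness_magnitude = 10  # matches the 1M vs. 1K example

def session_quiescent(caps_outstanding, liveness_counter):
    """A session looks quiescent when its liveness decay counter is more
    than 2^magnitude times smaller than its outstanding cap count."""
    cutoff = caps_outstanding / (1 << mds_session_cache_liveness_magnitude)
    return liveness_counter < cutoff

print(session_quiescent(1 << 20, 900))    # True: ~1M caps, counter under 1K
print(session_quiescent(1 << 20, 5000))   # False: session is actively using caps
```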
The MDS also tries to prevent a single client from acquiring too many
capabilities. This helps prevent recovery from taking a long time in some
situations. It is not generally necessary for a client to have such a large
cache. The limit is configured via:

.. confval:: mds_max_caps_per_client

It is not recommended to set this value above 5M, but it may be helpful with
some workloads.