=======================
MDS Cache Configuration
=======================

The Metadata Server coordinates a distributed cache among all MDS and CephFS
clients. The cache serves to improve metadata access latency and allow clients
to safely (coherently) mutate metadata state (e.g. via ``chmod``). The MDS issues
**capabilities** and **directory entry leases** to indicate what state clients
may cache and what manipulations clients may perform (e.g. writing to a file).

The MDS and clients both try to enforce a cache size. The mechanism for
specifying the MDS cache size is described below. Note that the MDS cache size
is not a hard limit. The MDS always allows clients to look up new metadata,
which is loaded into the cache. This is an essential policy as it avoids
deadlock in client requests (some requests may rely on held capabilities before
capabilities are released).

When the MDS cache is too large, the MDS will **recall** client state so cache
items become unpinned and eligible to be dropped. The MDS can only drop cache
state when no clients refer to the metadata to be dropped. Also described below
is how to configure the MDS recall settings for your workload's needs. This is
necessary if the internal throttles on the MDS recall cannot keep up with the
client workload.


MDS Cache Size
--------------

You can limit the size of the Metadata Server (MDS) cache by a byte count. This
is done through the ``mds_cache_memory_limit`` configuration:

.. confval:: mds_cache_memory_limit

In addition, you can specify a cache reservation by using the
``mds_cache_reservation`` parameter for MDS operations:

.. confval:: mds_cache_reservation

The cache reservation is expressed as a percentage of the memory limit and is
set to 5% by default. The intent of this parameter is to have the MDS maintain
an extra reserve of memory for its cache for new metadata operations to use. As
a consequence, the MDS should in general operate below its memory limit because
it will recall old state from clients in order to drop unused metadata in its
cache.

If the MDS cannot keep its cache under the target size, the MDS will send a
health alert to the Monitors indicating the cache is too large. This is
controlled by the ``mds_health_cache_threshold`` configuration, which is by
default 150% of the maximum cache size:

.. confval:: mds_health_cache_threshold

Because the cache limit is not a hard limit, potential bugs in the CephFS
client or MDS, or misbehaving applications, might cause the MDS to exceed its
cache size. The health warnings are intended to help the operator detect this
situation and make necessary adjustments or investigate buggy clients.

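The relationship between the three settings in this section can be sketched in
a few lines of Python. This is an illustration only, not MDS code: the function
name is hypothetical, the 4 GiB limit is an arbitrary example value, and
treating ``limit * (1 - reservation)`` as the level the MDS tries to stay under
is an assumption drawn from the description of the reservation above.

```python
def cache_thresholds(memory_limit, reservation=0.05, health_threshold=1.5):
    """Return (recall_target, health_alert_level) in bytes.

    The reservation and health_threshold defaults are the 5% and 150%
    quoted in the text. The recall_target formula is an assumption.
    """
    recall_target = memory_limit * (1.0 - reservation)
    health_alert_level = memory_limit * health_threshold
    return recall_target, health_alert_level


limit = 4 * 2**30  # example value only: 4 GiB
target, alert = cache_thresholds(limit)
print(f"recall target: {target / 2**30:.2f} GiB")
print(f"health alert:  {alert / 2**30:.2f} GiB")
```

The gap between the two values is the range in which the cache is over its
target but has not yet triggered a health alert.
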
MDS Cache Trimming
------------------

There are two configurations for throttling the rate of cache trimming in the MDS:

.. confval:: mds_cache_trim_threshold

.. confval:: mds_cache_trim_decay_rate

The intent of the throttle is to prevent the MDS from spending too much time
trimming its cache, which may limit its ability to handle client requests or
perform other upkeep.

The trim configurations control an internal **decay counter**. Any time metadata
is trimmed from the cache, the counter is incremented. The threshold sets the
maximum size of the counter, while the decay rate indicates the exponential
half-life for the counter. If the MDS is continually removing items from its
cache, it will reach a steady state of ``-ln(0.5)/rate*threshold`` items removed
per second.

.. note:: Increasing the value of the configuration setting
   ``mds_cache_trim_decay_rate`` leads to the MDS spending less time
   trimming the cache. To increase the cache trimming rate, set a lower
   value.

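As a worked example of the steady-state expression ``-ln(0.5)/rate*threshold``,
the following Python sketch evaluates it for a few decay rates. The function
name and the threshold of 65536 are illustrative choices, not Ceph defaults.

```python
import math


def steady_state_trim_rate(threshold, decay_rate):
    """Items removed per second at steady state: -ln(0.5) / rate * threshold."""
    return -math.log(0.5) / decay_rate * threshold


threshold = 65536  # example value only
for decay_rate in (0.5, 1.0, 2.0):
    rate = steady_state_trim_rate(threshold, decay_rate)
    print(f"decay_rate={decay_rate}: about {rate:.0f} items/s")
```

The printed rates fall as ``decay_rate`` grows, which is the behavior the note
above describes: a lower value trims faster.
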
The defaults are conservative and may need to be changed for production MDS with
large cache sizes.


MDS Recall
----------

The MDS limits its recall of client state (capabilities/leases) to prevent creating
too much work for itself handling release messages from clients. This is controlled
via the following configurations:


The maximum number of capabilities to recall from a single client in a given recall
event:

.. confval:: mds_recall_max_caps

The threshold and decay rate for the decay counter on a session:

.. confval:: mds_recall_max_decay_threshold

.. confval:: mds_recall_max_decay_rate

The session decay counter controls the rate of recall for an individual
session. The counter behaves the same way as the one for cache trimming
above. Each capability that is recalled increments the counter.

There is also a global decay counter that throttles recall across all sessions:

.. confval:: mds_recall_global_max_decay_threshold

Its decay rate is the same as ``mds_recall_max_decay_rate``. Any recalled
capability for any session also increments this counter.

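The interaction of the per-recall cap limit, the session counter, and the global
counter can be modeled with a toy decay counter in Python. This is a sketch
under stated assumptions, not the MDS implementation: the ``DecayCounter`` class
and ``recall_caps`` function are hypothetical, the way remaining headroom is
computed from the thresholds is an assumption, and the numbers are chosen for
readability rather than taken from the Ceph defaults.

```python
class DecayCounter:
    """Toy decay counter: the value halves every `half_life` seconds."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last_update = 0.0

    def get(self, now):
        # Apply exponential decay for the time elapsed since the last update.
        elapsed = now - self.last_update
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now
        return self.value

    def hit(self, now, amount):
        self.get(now)
        self.value += amount


def recall_caps(now, wanted, session, global_counter,
                max_caps, session_threshold, global_threshold):
    """Return how many caps to recall from one session at time `now`."""
    # Assumption: recall may proceed until either counter reaches its threshold.
    headroom = min(session_threshold - session.get(now),
                   global_threshold - global_counter.get(now))
    count = max(0, min(wanted, max_caps, int(headroom)))
    # Every recalled capability increments both counters.
    session.hit(now, count)
    global_counter.hit(now, count)
    return count


# Example values only; these are not the Ceph defaults.
session = DecayCounter(half_life=1.0)
global_counter = DecayCounter(half_life=1.0)
limits = dict(max_caps=50, session_threshold=100, global_threshold=1000)

for t in (0.0, 0.0, 0.0, 1.0):
    n = recall_caps(t, 80, session, global_counter, **limits)
    print(f"t={t:.1f}s recalled {n} caps")
```

In this model the third request at ``t=0.0`` recalls nothing because the session
counter has reached its threshold, and recall resumes once the counter has
decayed.
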
If clients are slow to release state, the warning "failing to respond to cache
pressure" or ``MDS_HEALTH_CLIENT_RECALL`` will be reported. Each session's rate
of release is monitored by another decay counter configured by:

.. confval:: mds_recall_warning_threshold

.. confval:: mds_recall_warning_decay_rate

Each time a capability is released, the counter is incremented. If clients do
not release capabilities quickly enough and there is cache pressure, the
counter will indicate whether the client is slow to release state.

Some workloads and client behaviors may require faster recall of client state
to keep up with capability acquisition. It is recommended to increase the above
counters as needed to resolve any slow recall warnings in the cluster health
state.


MDS Cap Acquisition Throttle
----------------------------

A trivial "find" command on a large directory hierarchy will cause the client
to receive caps significantly faster than it will release them. The MDS will try
to have the client reduce its caps below the ``mds_max_caps_per_client`` limit,
but the recall throttles prevent it from catching up to the pace of acquisition.
The readdir is therefore throttled to control cap acquisition via the following
configurations:


The threshold and decay rate for the readdir cap acquisition decay counter:

.. confval:: mds_session_cap_acquisition_throttle

.. confval:: mds_session_cap_acquisition_decay_rate

The cap acquisition decay counter controls the rate of cap acquisition via
readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in
the result.

The ratio of ``mds_max_caps_per_client`` that a client must exceed before readdir
may be throttled by the cap acquisition throttle:

.. confval:: mds_session_max_caps_throttle_ratio

The timeout in seconds after which a client request is retried due to cap
acquisition throttling:

.. confval:: mds_cap_acquisition_throttle_retry_request_timeout

If the number of caps acquired by the client per session is greater than the
``mds_session_max_caps_throttle_ratio`` fraction of ``mds_max_caps_per_client``
and the cap acquisition decay counter is greater than
``mds_session_cap_acquisition_throttle``, the readdir is throttled. The readdir
request is retried after ``mds_cap_acquisition_throttle_retry_request_timeout``
seconds.


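The throttling condition can be written as a small predicate. The Python sketch
below is illustrative only: the function name is hypothetical, the parameter
names merely echo the configuration options, the settings are example values
rather than Ceph defaults, and reading the ratio as a fraction of
``mds_max_caps_per_client`` is an interpretation of the description above.

```python
def readdir_is_throttled(session_caps, acquisition_counter, *,
                         max_caps_per_client, max_caps_throttle_ratio,
                         cap_acquisition_throttle):
    """Return True when both throttling conditions hold for a session."""
    # Condition 1: the session holds more than the configured share of the limit.
    over_cap_ratio = session_caps > max_caps_per_client * max_caps_throttle_ratio
    # Condition 2: the readdir cap acquisition decay counter exceeds its threshold.
    over_acquisition = acquisition_counter > cap_acquisition_throttle
    return over_cap_ratio and over_acquisition


# Example settings only; these are not the Ceph defaults.
settings = dict(max_caps_per_client=1_000_000,
                max_caps_throttle_ratio=0.9,
                cap_acquisition_throttle=100_000)

print(readdir_is_throttled(950_000, 200_000, **settings))
print(readdir_is_throttled(950_000, 50_000, **settings))
print(readdir_is_throttled(500_000, 200_000, **settings))
```

Only the first call satisfies both conditions. In the other two, either the
decay counter or the number of held caps is below its limit, so the readdir
proceeds without delay.
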
Session Liveness
----------------

The MDS also keeps track of whether sessions are quiescent. If a client session
is not utilizing its capabilities or is otherwise quiet, the MDS will begin
recalling state from the session even if it's not under cache pressure. This
helps the MDS avoid future work when the cluster workload is hot and cache
pressure is forcing the MDS to recall state. The expectation is that a client
not utilizing its capabilities is unlikely to use those capabilities anytime
in the near future.

Determining whether a given session is quiescent is controlled by the following
configuration variables:

.. confval:: mds_session_cache_liveness_magnitude

.. confval:: mds_session_cache_liveness_decay_rate

The configuration ``mds_session_cache_liveness_decay_rate`` indicates the
half-life for the decay counter tracking the use of capabilities by the client.
Each time a client manipulates or acquires a capability, the MDS will increment
the counter. This is a rough but effective way to monitor the utilization of the
client cache.

The ``mds_session_cache_liveness_magnitude`` is a base-2 magnitude difference
between the liveness decay counter and the number of capabilities outstanding
for the session. So if the client has ``1*2^20`` (1M) capabilities outstanding
and uses **fewer** than ``1*2^(20-mds_session_cache_liveness_magnitude)`` (1K
using defaults), the MDS will consider the client to be quiescent and begin
recall.


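The quiescence test described above amounts to comparing the liveness counter
against the outstanding capability count scaled down by a power of two. A
minimal Python sketch follows; the function name is hypothetical, and the
magnitude of 10 is inferred from the 1M to 1K example in the text.

```python
def session_is_quiescent(outstanding_caps, liveness_counter, magnitude):
    """True when cap usage is at least `magnitude` powers of two below
    the number of capabilities the session holds."""
    return liveness_counter < outstanding_caps / 2**magnitude


caps = 2**20  # 1M capabilities outstanding, as in the example above
print(session_is_quiescent(caps, liveness_counter=500, magnitude=10))
print(session_is_quiescent(caps, liveness_counter=5000, magnitude=10))
```

With these inputs the cut-off is ``2^(20-10) = 1024``, so a counter value of 500
marks the session as quiescent while a value of 5000 does not.
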
Capability Limit
----------------

The MDS also tries to prevent a single client from acquiring too many
capabilities. This helps prevent recovery from taking a long time in some
situations. It is not generally necessary for a client to have such a large
cache. The limit is configured via:

.. confval:: mds_max_caps_per_client

Setting this value above 5M is not recommended, but it may be helpful with
some workloads.