=======================
MDS Cache Configuration
=======================

The Metadata Server coordinates a distributed cache among all MDS and CephFS
clients. The cache serves to improve metadata access latency and allow clients
to safely (coherently) mutate metadata state (e.g. via `chmod`). The MDS issues
**capabilities** and **directory entry leases** to indicate what state clients
may cache and what manipulations clients may perform (e.g. writing to a file).

The MDS and clients both try to enforce a cache size. The mechanism for
specifying the MDS cache size is described below. Note that the MDS cache size
is not a hard limit. The MDS always allows clients to look up new metadata,
which is loaded into the cache. This is an essential policy as it avoids
deadlock in client requests (some requests may rely on held capabilities before
capabilities are released).

When the MDS cache is too large, the MDS will **recall** client state so cache
items become unpinned and eligible to be dropped. The MDS can only drop cache
state when no clients refer to the metadata to be dropped. Also described below
is how to configure the MDS recall settings for your workload's needs. This is
necessary if the internal throttles on the MDS recall cannot keep up with the
client workload.


MDS Cache Size
--------------

You can limit the size of the Metadata Server (MDS) cache by a byte count. This
is done through the `mds_cache_memory_limit` configuration:

.. confval:: mds_cache_memory_limit

In addition, you can specify a cache reservation by using the
`mds_cache_reservation` parameter for MDS operations:

.. confval:: mds_cache_reservation

The cache reservation is expressed as a percentage of the memory limit and is
set to 5% by default. The intent of this parameter is to have the MDS maintain
an extra reserve of memory for its cache for new metadata operations to use. As
a consequence, the MDS should in general operate below its memory limit because
it will recall old state from clients in order to drop unused metadata in its
cache.

If the MDS cannot keep its cache under the target size, the MDS will send a
health alert to the Monitors indicating the cache is too large. This is
controlled by the `mds_health_cache_threshold` configuration, which defaults to
150% of the maximum cache size:

.. confval:: mds_health_cache_threshold

Because the cache limit is not a hard limit, potential bugs in the CephFS
client, MDS, or misbehaving applications might cause the MDS to exceed its
cache size. The health warnings are intended to help the operator detect this
situation and make necessary adjustments or investigate buggy clients.

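As a rough illustration of how the three cache size settings relate, the
following Python sketch computes the byte values implied by a memory limit.
The 4 GiB limit is a hypothetical example value; the 5% reservation and 150%
health threshold are the defaults stated above. Treating the reservation as a
fraction subtracted from the limit is an assumption about how the trimming
target is derived, not a description of the actual MDS implementation.

```python
GIB = 1024 ** 3


def cache_thresholds(memory_limit, reservation=0.05, health_threshold=1.5):
    """Return (trim_target_bytes, health_warning_bytes) for a memory limit.

    Assumption: the reserve is a fraction of the limit that is kept free, so
    the MDS aims to stay below memory_limit * (1 - reservation).
    """
    trim_target = int(memory_limit * (1 - reservation))
    health_warning = int(memory_limit * health_threshold)
    return trim_target, health_warning


# Hypothetical 4 GiB limit, used only as an example value.
target, warning = cache_thresholds(4 * GIB)
print(f"trim target:    {target / GIB:.2f} GiB")
print(f"health warning: {warning / GIB:.2f} GiB")
```

Under these assumptions, the gap between the trim target and the health
warning level is the range in which the MDS is over its target but has not yet
raised an alert.
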
MDS Cache Trimming
------------------

There are two configurations for throttling the rate of cache trimming in the MDS:

.. confval:: mds_cache_trim_threshold

.. confval:: mds_cache_trim_decay_rate

The intent of the throttle is to prevent the MDS from spending too much time
trimming its cache. Excessive trimming may limit its ability to handle client
requests or perform other upkeep.

The trim configurations control an internal **decay counter**. Any time metadata
is trimmed from the cache, the counter is incremented. The threshold sets the
maximum size of the counter while the decay rate indicates the exponential
half-life for the counter. If the MDS is continually removing items from its
cache, it will reach a steady state of ``-ln(0.5)/rate*threshold`` items
removed per second.

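The steady-state expression above can be evaluated directly. The sketch below
is only an illustration: the function name and the sample values are
hypothetical, and ``decay_rate`` is read as the counter's half-life in seconds,
as the paragraph above describes.

```python
import math


def steady_state_trim_rate(threshold, decay_rate):
    """Items trimmed per second once the decay counter sits at its threshold.

    Evaluates -ln(0.5) / decay_rate * threshold, with decay_rate taken to be
    the half-life of the counter in seconds.
    """
    return -math.log(0.5) / decay_rate * threshold


# Hypothetical settings, chosen only to show the trend.
for half_life in (0.5, 1.0, 2.0):
    rate = steady_state_trim_rate(threshold=100_000, decay_rate=half_life)
    print(f"decay_rate={half_life}: ~{rate:,.0f} items/s")
```

The loop shows that a larger decay rate yields a lower steady-state trimming
rate for the same threshold.
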
.. note:: Increasing the value of the configuration setting
          ``mds_cache_trim_decay_rate`` leads to the MDS spending less time
          trimming the cache. To increase the cache trimming rate, set a lower
          value.

The defaults are conservative and may need to be changed for a production MDS
with a large cache size.


MDS Recall
----------

The MDS limits its recall of client state (capabilities/leases) to prevent
creating too much work for itself handling release messages from clients. This
is controlled via the following configurations:

The maximum number of capabilities to recall from a single client in a given recall
event:

.. confval:: mds_recall_max_caps

The threshold and decay rate for the decay counter on a session:

.. confval:: mds_recall_max_decay_threshold

.. confval:: mds_recall_max_decay_rate

The session decay counter controls the rate of recall for an individual
session. The counter behaves the same way as the one for cache trimming
above. Each capability that is recalled increments the counter.

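To make the decay counter behavior concrete, here is a minimal Python model of
a counter with an exponential half-life, plus a check against a threshold.
This is a toy sketch, not the MDS implementation: the class, the
``recall_is_throttled`` helper, the explicit ``now`` timestamps, and all
numeric values are hypothetical, and the exact way the MDS compares the counter
with its threshold is assumed here.

```python
class DecayCounter:
    """Toy counter whose value halves every `half_life` seconds."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last = 0.0  # time of the last update, in seconds

    def _decay(self, now):
        # Apply exponential decay for the time elapsed since the last update.
        # Timestamps are expected to be non-decreasing.
        elapsed = now - self.last
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.last = now

    def hit(self, now, amount=1):
        """Record `amount` events (e.g. recalled capabilities) at time `now`."""
        self._decay(now)
        self.value += amount

    def get(self, now):
        """Return the decayed value at time `now`."""
        self._decay(now)
        return self.value


def recall_is_throttled(counter, now, threshold):
    """Assumed rule: stop recalling while the counter is at or above threshold."""
    return counter.get(now) >= threshold


# Hypothetical session: half-life of 2 seconds, threshold of 1000.
session = DecayCounter(half_life=2.0)
session.hit(now=0.0, amount=1200)
print(recall_is_throttled(session, now=0.0, threshold=1000))
print(recall_is_throttled(session, now=2.0, threshold=1000))
```

In this model, a burst of recalls pushes the counter over the threshold, and
recall resumes once enough time has passed for the counter to decay below it.
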
There is also a global decay counter that throttles recall across all sessions:

.. confval:: mds_recall_global_max_decay_threshold

Its decay rate is the same as ``mds_recall_max_decay_rate``. Any recalled
capability for any session also increments this counter.

If clients are slow to release state, the warning "failing to respond to cache
pressure" or ``MDS_HEALTH_CLIENT_RECALL`` will be reported. Each session's rate
of release is monitored by another decay counter configured by:

.. confval:: mds_recall_warning_threshold

.. confval:: mds_recall_warning_decay_rate

Each time a capability is released, the counter is incremented. If clients do
not release capabilities quickly enough and there is cache pressure, the
counter will indicate if the client is slow to release state.

Some workloads and client behaviors may require faster recall of client state
to keep up with capability acquisition. It is recommended to increase the above
counters as needed to resolve any slow recall warnings in the cluster health
state.


MDS Cap Acquisition Throttle
----------------------------

A trivial "find" command on a large directory hierarchy will cause the client
to receive caps significantly faster than it will release them. The MDS will
try to have the client reduce its caps below the ``mds_max_caps_per_client``
limit, but the recall throttles prevent it from catching up to the pace of
acquisition. Therefore, readdir is throttled to control cap acquisition via the
following configurations:

The threshold and decay rate for the readdir cap acquisition decay counter:

.. confval:: mds_session_cap_acquisition_throttle

.. confval:: mds_session_cap_acquisition_decay_rate

The cap acquisition decay counter controls the rate of cap acquisition via
readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in
the result.

The ratio of ``mds_max_caps_per_client`` that a client must exceed before
readdir may be throttled by the cap acquisition throttle:

.. confval:: mds_session_max_caps_throttle_ratio

The timeout in seconds after which a client request is retried due to cap
acquisition throttling:

.. confval:: mds_cap_acquisition_throttle_retry_request_timeout

If the number of caps held by a client session exceeds
``mds_max_caps_per_client`` scaled by ``mds_session_max_caps_throttle_ratio``,
and the cap acquisition decay counter is greater than
``mds_session_cap_acquisition_throttle``, the readdir is throttled. The readdir
request is retried after ``mds_cap_acquisition_throttle_retry_request_timeout``
seconds.


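The two conditions for throttling a readdir can be written as a small
predicate. This is a sketch of the rule as described above, not the MDS source:
the function and parameter names are hypothetical, and the numeric values in
the example are made up for illustration.

```python
def readdir_is_throttled(session_caps, acquisition_counter,
                         max_caps_per_client, throttle_ratio,
                         acquisition_throttle):
    """Return True when both throttling conditions described above hold."""
    # Condition 1: the session holds more than the configured share of
    # mds_max_caps_per_client.
    over_cap_share = session_caps > max_caps_per_client * throttle_ratio
    # Condition 2: the readdir cap acquisition decay counter is above the
    # configured throttle value.
    over_counter = acquisition_counter > acquisition_throttle
    return over_cap_share and over_counter


# Hypothetical values, for illustration only.
limits = dict(max_caps_per_client=1_000_000, throttle_ratio=1.1,
              acquisition_throttle=500_000)
print(readdir_is_throttled(1_200_000, 600_000, **limits))
print(readdir_is_throttled(1_200_000, 100_000, **limits))
```

Because both conditions must hold, a session that holds many caps but is
acquiring them slowly is not throttled in this sketch.
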
Session Liveness
----------------

The MDS also keeps track of whether sessions are quiescent. If a client session
is not utilizing its capabilities or is otherwise quiet, the MDS will begin
recalling state from the session even if it's not under cache pressure. This
helps the MDS avoid future work when the cluster workload is hot and cache
pressure is forcing the MDS to recall state. The expectation is that a client
not utilizing its capabilities is unlikely to use those capabilities anytime
in the near future.

Determining whether a given session is quiescent is controlled by the following
configuration variables:

.. confval:: mds_session_cache_liveness_magnitude

.. confval:: mds_session_cache_liveness_decay_rate

The configuration ``mds_session_cache_liveness_decay_rate`` indicates the
half-life for the decay counter tracking the use of capabilities by the client.
Each time a client manipulates or acquires a capability, the MDS will increment
the counter. This is a rough but effective way to monitor the utilization of the
client cache.

The ``mds_session_cache_liveness_magnitude`` is a base-2 magnitude difference
of the liveness decay counter and the number of capabilities outstanding for
the session. So if the client has ``1*2^20`` (1M) capabilities outstanding and
its liveness counter is **less** than
``1*2^(20-mds_session_cache_liveness_magnitude)`` (1K using defaults), the MDS
will consider the client to be quiescent and begin recall.


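The quiescence test above amounts to a comparison between the liveness counter
and the outstanding capability count shifted down by the magnitude. The
following Python sketch restates that comparison; the function name is
hypothetical, and the default magnitude of 10 is inferred from the "1K using
defaults" example above rather than taken from the option's definition.

```python
def is_quiescent(caps_outstanding, liveness_counter, magnitude=10):
    """Return True if the session looks idle relative to the caps it holds.

    The session is treated as quiescent when the liveness counter is below
    caps_outstanding / 2**magnitude.
    """
    return liveness_counter < caps_outstanding / 2 ** magnitude


# The example from the text: 2^20 caps outstanding, cut-off at 2^10 = 1024.
print(is_quiescent(2 ** 20, liveness_counter=1000))
print(is_quiescent(2 ** 20, liveness_counter=2000))
```

A larger magnitude lowers the cut-off, so a session must be quieter before it
is considered quiescent in this sketch.
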
Capability Limit
----------------

The MDS also tries to prevent a single client from acquiring too many
capabilities. This helps prevent recovery from taking a long time in some
situations. It is not generally necessary for a client to have such a large
cache. The limit is configured via:

.. confval:: mds_max_caps_per_client

Setting this value above 5M is not recommended, although it may be helpful with
some workloads.