=======================
MDS Cache Configuration
=======================

The Metadata Server coordinates a distributed cache among all MDS daemons and
CephFS clients. The cache serves to improve metadata access latency and allow
clients to safely (coherently) mutate metadata state (e.g. via ``chmod``). The
MDS issues **capabilities** and **directory entry leases** to indicate what
state clients may cache and what manipulations clients may perform (e.g.
writing to a file).

The MDS and clients both try to enforce a cache size. The mechanism for
specifying the MDS cache size is described below. Note that the MDS cache size
is not a hard limit. The MDS always allows clients to look up new metadata,
which is then loaded into the cache. This policy is essential because it avoids
deadlock in client requests (some requests may rely on held capabilities before
capabilities are released).

When the MDS cache is too large, the MDS will **recall** client state so that
cache items become unpinned and eligible to be dropped. The MDS can only drop
cache state when no clients refer to the metadata to be dropped. How to
configure the MDS recall settings for your workload is also described below.
Tuning these settings is necessary if the internal throttles on MDS recall
cannot keep up with the client workload.


MDS Cache Size
--------------

You can limit the size of the Metadata Server (MDS) cache by a byte count. This
is done through the ``mds_cache_memory_limit`` configuration:

.. confval:: mds_cache_memory_limit

In addition, you can specify a cache reservation by using the
``mds_cache_reservation`` parameter for MDS operations:

.. confval:: mds_cache_reservation

The cache reservation is expressed as a percentage of the cache memory limit
and is set to 5% by default. The intent of this parameter is to have the MDS
maintain an extra reserve of memory in its cache for new metadata operations
to use. As a consequence, the MDS should in general operate below its memory
limit because it will recall old state from clients in order to drop unused
metadata in its cache.
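
As a rough illustration of the arithmetic, the sketch below computes how much
memory the reservation sets aside and the cache size the MDS would aim to stay
under. The 4 GiB limit is a hypothetical value chosen for the example, and the
helper name is not part of Ceph; the exact accounting inside the MDS may
differ.

.. code-block:: python

    GiB = 1024 ** 3

    def trim_target_bytes(memory_limit: int, reservation: float) -> float:
        """Cache size left over once the reserve is set aside (simplified model)."""
        return memory_limit * (1.0 - reservation)

    limit = 4 * GiB      # hypothetical mds_cache_memory_limit
    reservation = 0.05   # the 5% default mentioned above

    reserve = limit * reservation
    target = trim_target_bytes(limit, reservation)
    print(f"reserve: {reserve / GiB:.2f} GiB, target: {target / GiB:.2f} GiB")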

If the MDS cannot keep its cache under the target size, the MDS will send a
health alert to the Monitors indicating the cache is too large. This is
controlled by the ``mds_health_cache_threshold`` configuration, which by
default is 150% of the maximum cache size:

.. confval:: mds_health_cache_threshold

Because the cache limit is not a hard limit, bugs in the CephFS client or MDS,
or misbehaving applications, might cause the MDS to exceed its cache size. The
health warnings are intended to help the operator detect this situation and
make necessary adjustments or investigate buggy clients.
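
The 150% default can be read as a simple comparison against the configured
limit. The function below is a minimal sketch of that check under this
assumption; its name and the sample values are hypothetical, and the MDS may
apply additional conditions before raising the alert.

.. code-block:: python

    GiB = 1024 ** 3

    def cache_too_large(cache_bytes: float, memory_limit: float,
                        health_threshold: float = 1.5) -> bool:
        """True when the cache exceeds the limit scaled by the health threshold."""
        return cache_bytes > memory_limit * health_threshold

    limit = 4 * GiB  # hypothetical mds_cache_memory_limit
    print(cache_too_large(5 * GiB, limit))  # over the limit, under 150% of it
    print(cache_too_large(7 * GiB, limit))  # beyond 150% of the limit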

MDS Cache Trimming
------------------

There are two configurations for throttling the rate of cache trimming in the MDS:

.. confval:: mds_cache_trim_threshold

.. confval:: mds_cache_trim_decay_rate

The intent of the throttle is to prevent the MDS from spending too much time
trimming its cache, which may limit its ability to handle client requests or
perform other upkeep.

The trim configurations control an internal **decay counter**. Any time
metadata is trimmed from the cache, the counter is incremented. The threshold
sets the maximum size of the counter while the decay rate indicates the
exponential half-life of the counter. If the MDS is continually removing items
from its cache, it will reach a steady state of ``-ln(0.5)/rate*threshold``
items removed per second.

.. note:: Increasing the value of the configuration setting
          ``mds_cache_trim_decay_rate`` leads to the MDS spending less time
          trimming the cache. To increase the cache trimming rate, set a lower
          value.
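
The steady-state expression can be evaluated directly to compare candidate
settings. The snippet below is only a worked example of the formula above; the
threshold and decay rate values are hypothetical and are not claimed to be the
defaults of any particular release.

.. code-block:: python

    import math

    def steady_state_trim_rate(threshold: float, decay_rate: float) -> float:
        """Items trimmed per second at steady state: -ln(0.5)/rate*threshold."""
        return -math.log(0.5) / decay_rate * threshold

    # Hypothetical settings: same threshold, two different decay rates.
    base = steady_state_trim_rate(threshold=262144, decay_rate=1.0)
    slower = steady_state_trim_rate(threshold=262144, decay_rate=2.0)

    print(f"decay_rate=1.0 -> {base:.0f} items/s")
    print(f"decay_rate=2.0 -> {slower:.0f} items/s")

Doubling the decay rate halves the steady-state trim rate, which is why a
lower value is needed to trim faster.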

The defaults are conservative and may need to be changed for production MDS
daemons with large cache sizes.


MDS Recall
----------

The MDS limits its recall of client state (capabilities/leases) to avoid
creating too much work for itself handling release messages from clients. This
is controlled via the following configurations:


The maximum number of capabilities to recall from a single client in a given recall
event:

.. confval:: mds_recall_max_caps

The threshold and decay rate for the decay counter on a session:

.. confval:: mds_recall_max_decay_threshold

.. confval:: mds_recall_max_decay_rate

The session decay counter controls the rate of recall for an individual
session. The counter behaves the same way as the one for cache trimming
above. Each capability that is recalled increments the counter.
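
To make the counter behavior concrete, here is a simplified model of an
exponentially decaying counter with a half-life. It is a sketch for intuition
only, not the Ceph implementation: the class, the ``would_exceed`` helper, and
the sample numbers are all hypothetical, and time is passed in explicitly so
the example is deterministic.

.. code-block:: python

    import math

    class DecayCounter:
        """Simplified counter whose value halves every `half_life` seconds."""

        def __init__(self, half_life: float) -> None:
            self.half_life = half_life
            self.value = 0.0
            self.last = 0.0  # time of the last update, in seconds

        def _decay(self, now: float) -> None:
            elapsed = now - self.last
            if elapsed > 0:
                self.value *= math.exp(-math.log(2) * elapsed / self.half_life)
                self.last = now

        def hit(self, now: float, amount: float = 1.0) -> float:
            """Decay to `now`, then add `amount` (e.g. caps recalled)."""
            self._decay(now)
            self.value += amount
            return self.value

        def get(self, now: float) -> float:
            self._decay(now)
            return self.value

    def would_exceed(counter: DecayCounter, now: float, amount: float,
                     threshold: float) -> bool:
        """Assumed throttle check: would recalling `amount` more pass the threshold?"""
        return counter.get(now) + amount > threshold

    session = DecayCounter(half_life=1.5)   # hypothetical decay rate
    session.hit(now=0.0, amount=1000)       # 1000 caps recalled at t=0
    half_life_later = session.get(now=1.5)  # value after one half-life
    print(half_life_later)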

There is also a global decay counter that throttles recall across all sessions:

.. confval:: mds_recall_global_max_decay_threshold

Its decay rate is the same as ``mds_recall_max_decay_rate``. Any recalled
capability for any session also increments this counter.

If clients are slow to release state, the warning "failing to respond to cache
pressure" or ``MDS_HEALTH_CLIENT_RECALL`` will be reported. Each session's rate
of release is monitored by another decay counter configured by:

.. confval:: mds_recall_warning_threshold

.. confval:: mds_recall_warning_decay_rate

Each time a capability is released, the counter is incremented. If clients do
not release capabilities quickly enough and there is cache pressure, the
counter will indicate whether the client is slow to release state.

Some workloads and client behaviors may require faster recall of client state
to keep up with capability acquisition. It is recommended to increase the above
counters as needed to resolve any slow recall warnings in the cluster health
state.


MDS Cap Acquisition Throttle
----------------------------

A trivial "find" command on a large directory hierarchy will cause the client
to receive caps significantly faster than it will release them. The MDS will
try to have the client reduce its caps below the ``mds_max_caps_per_client``
limit, but the recall throttles prevent it from catching up to the pace of
acquisition. The readdir is therefore throttled to control cap acquisition via
the following configurations:


The threshold and decay rate for the readdir cap acquisition decay counter:

.. confval:: mds_session_cap_acquisition_throttle

.. confval:: mds_session_cap_acquisition_decay_rate

The cap acquisition decay counter controls the rate of cap acquisition via
readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in
the result.

The ratio of ``mds_max_caps_per_client`` that a client must exceed before
readdir may be throttled by the cap acquisition throttle:

.. confval:: mds_session_max_caps_throttle_ratio

The timeout in seconds after which a client request is retried due to cap
acquisition throttling:

.. confval:: mds_cap_acquisition_throttle_retry_request_timeout

If the number of caps held by a client session is greater than
``mds_max_caps_per_client`` multiplied by
``mds_session_max_caps_throttle_ratio`` and the cap acquisition decay counter
is greater than ``mds_session_cap_acquisition_throttle``, the readdir is
throttled. The readdir request is retried after
``mds_cap_acquisition_throttle_retry_request_timeout`` seconds.
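
The two conditions can be summarized as a single predicate. The sketch below
assumes the cap count is compared against ``mds_max_caps_per_client`` scaled by
the throttle ratio; the function name and all sample values are hypothetical
and the actual MDS check may differ in detail.

.. code-block:: python

    def readdir_throttled(session_caps: int,
                          acquisition_counter: float,
                          max_caps_per_client: int,
                          max_caps_throttle_ratio: float,
                          acquisition_throttle: float) -> bool:
        """True when both throttle conditions described above hold."""
        over_ratio = session_caps > max_caps_per_client * max_caps_throttle_ratio
        acquiring_fast = acquisition_counter > acquisition_throttle
        return over_ratio and acquiring_fast

    # Hypothetical settings for illustration only.
    settings = dict(max_caps_per_client=1_000_000,
                    max_caps_throttle_ratio=1.1,
                    acquisition_throttle=500_000)

    print(readdir_throttled(1_200_000, 600_000, **settings))  # both conditions met
    print(readdir_throttled(1_200_000, 100_000, **settings))  # counter below throttle
    print(readdir_throttled(900_000, 600_000, **settings))    # caps below the ratio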


Session Liveness
----------------

The MDS also keeps track of whether sessions are quiescent. If a client session
is not utilizing its capabilities or is otherwise quiet, the MDS will begin
recalling state from the session even if it's not under cache pressure. This
helps the MDS avoid future work when the cluster workload is hot and cache
pressure is forcing the MDS to recall state. The expectation is that a client
not utilizing its capabilities is unlikely to use those capabilities anytime
in the near future.

Determining whether a given session is quiescent is controlled by the following
configuration variables:

.. confval:: mds_session_cache_liveness_magnitude

.. confval:: mds_session_cache_liveness_decay_rate

The configuration ``mds_session_cache_liveness_decay_rate`` indicates the
half-life for the decay counter tracking the use of capabilities by the client.
Each time a client manipulates or acquires a capability, the MDS will increment
the counter. This is a rough but effective way to monitor the utilization of the
client cache.

The ``mds_session_cache_liveness_magnitude`` is a base-2 magnitude difference
between the liveness decay counter and the number of capabilities outstanding
for the session. For example, if the client has ``1*2^20`` (1M) capabilities
outstanding and its liveness counter is **less** than
``1*2^(20-mds_session_cache_liveness_magnitude)`` (1K using defaults), the MDS
will consider the client to be quiescent and begin recall.
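
The comparison can be written as a bit shift of the outstanding capability
count. The function below is a minimal sketch of that reading; its name is
hypothetical, and the magnitude of 10 is assumed because it reproduces the 1K
figure quoted above.

.. code-block:: python

    def is_quiescent(liveness_counter: float, caps_outstanding: int,
                     magnitude: int) -> bool:
        """True when cap usage is `magnitude` powers of two below caps held."""
        return liveness_counter < (caps_outstanding >> magnitude)

    caps = 1 << 20                # 1M capabilities outstanding
    cutoff = caps >> 10           # 2^(20-10) with an assumed magnitude of 10
    print(cutoff)
    print(is_quiescent(500, caps, magnitude=10))   # rarely used caps
    print(is_quiescent(5000, caps, magnitude=10))  # actively used caps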


Capability Limit
----------------

The MDS also tries to prevent a single client from acquiring too many
capabilities. This helps prevent recovery from taking a long time in some
situations.  It is not generally necessary for a client to have such a large
cache. The limit is configured via:

.. confval:: mds_max_caps_per_client

It is not recommended to set this value above 5M but it may be helpful with
some workloads.