================================
Ceph file system client eviction
================================

When a file system client is unresponsive or otherwise misbehaving, it
may be necessary to forcibly terminate its access to the file system.  This
process is called *eviction*.

Evicting a CephFS client prevents it from communicating further with MDS
daemons and OSD daemons.  If a client was doing buffered IO to the file system,
any un-flushed data will be lost.

Clients may either be evicted automatically (if they fail to communicate
promptly with the MDS), or manually (by the system administrator).

The client eviction process applies to clients of all kinds, including
FUSE mounts, kernel mounts, nfs-ganesha gateways, and any process using
libcephfs.

Automatic client eviction
=========================

There are three situations in which a client may be evicted automatically.

#. On an active MDS daemon, if a client has not communicated with the MDS for
   more than ``session_autoclose`` seconds (a file system variable, 300 seconds
   by default), then it will be evicted automatically.

#. On an active MDS daemon, if a client has not responded to cap revoke messages
   for more than ``mds_cap_revoke_eviction_timeout`` seconds (a configuration
   option), then it will be evicted automatically.  This is disabled by default.

#. During MDS startup (including on failover), the MDS passes through a
   state called ``reconnect``.  During this state, it waits for all the
   clients to connect to the new MDS daemon.  If any clients fail to do
   so within the time window (``mds_reconnect_timeout``, 45 seconds by default)
   then they will be evicted.

A warning message is sent to the cluster log if any of these situations
arises.
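
The two active-MDS conditions above can be modelled as a small decision
function. This is only an illustrative sketch, not MDS code: the
``SessionState`` class and ``should_autoevict`` function are hypothetical
names, and representing the disabled cap revoke timeout as ``None`` is an
assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    """Hypothetical snapshot of one client session on an active MDS."""
    seconds_since_last_message: float
    seconds_waiting_on_cap_revoke: float = 0.0


def should_autoevict(
    session: SessionState,
    session_autoclose: float = 300.0,
    cap_revoke_eviction_timeout: Optional[float] = None,
) -> Optional[str]:
    """Return the name of the setting that triggers eviction, or None.

    ``None`` for ``cap_revoke_eviction_timeout`` models "disabled"
    (an assumption of this sketch).
    """
    if session.seconds_since_last_message > session_autoclose:
        return "session_autoclose"
    if (
        cap_revoke_eviction_timeout is not None
        and session.seconds_waiting_on_cap_revoke > cap_revoke_eviction_timeout
    ):
        return "mds_cap_revoke_eviction_timeout"
    return None
```

With the defaults, a session that has been silent for 301 seconds is reported
as evictable, while a slow cap revoke alone is not, because that check is off
unless a timeout is supplied.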

Manual client eviction
======================

Sometimes, the administrator may want to evict a client manually.  This
could happen if a client has died and the administrator does not
want to wait for its session to time out, or it could happen if
a client is misbehaving and the administrator does not have access to
the client node to unmount it.

It is useful to inspect the list of clients first:

::

    ceph tell mds.0 client ls

    [
        {
            "id": 4305,
            "num_leases": 0,
            "num_caps": 3,
            "state": "open",
            "replay_requests": 0,
            "completed_requests": 0,
            "reconnecting": false,
            "inst": "client.4305 172.21.9.34:0/422650892",
            "client_metadata": {
                "ceph_sha1": "ae81e49d369875ac8b569ff3e3c456a31b8f3af5",
                "ceph_version": "ceph version 12.0.0-1934-gae81e49 (ae81e49d369875ac8b569ff3e3c456a31b8f3af5)",
                "entity_id": "0",
                "hostname": "senta04",
                "mount_point": "/tmp/tmpcMpF1b/mnt.0",
                "pid": "29377",
                "root": "/"
            }
        }
    ]
    

Once you have identified the client you want to evict, you can evict it
by its unique ID, or by other attributes that identify it:

::
    
    # These all work
    ceph tell mds.0 client evict id=4305
    ceph tell mds.0 client evict client_metadata.=4305
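
When the session list is long, it may be easier to pick out the client ID
with a short script. The sketch below assumes the JSON layout shown in the
``client ls`` example above; ``find_client_ids`` and ``evict_command`` are
hypothetical helper names, and the script only builds the command string
rather than running it.

```python
import json
from typing import List

# Trimmed copy of the sample ``client ls`` output shown above.
SAMPLE = """
[
    {
        "id": 4305,
        "state": "open",
        "inst": "client.4305 172.21.9.34:0/422650892",
        "client_metadata": {"hostname": "senta04", "root": "/"}
    }
]
"""


def find_client_ids(client_ls_json: str, hostname: str) -> List[int]:
    """Return the IDs of all sessions whose metadata reports this hostname."""
    sessions = json.loads(client_ls_json)
    return [
        s["id"]
        for s in sessions
        if s.get("client_metadata", {}).get("hostname") == hostname
    ]


def evict_command(mds_rank: int, client_id: int) -> str:
    """Build (but do not run) the eviction command for one client ID."""
    return f"ceph tell mds.{mds_rank} client evict id={client_id}"


commands = [evict_command(0, cid) for cid in find_client_ids(SAMPLE, "senta04")]
```

Reviewing the generated commands before running them is a deliberate choice
here, since eviction discards any un-flushed data on the client.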


Advanced: Un-blacklisting a client
==================================

Ordinarily, a blacklisted client may not reconnect to the servers: it
must be unmounted and then mounted anew.

However, in some situations it may be useful to permit a client that
was evicted to attempt to reconnect.

Because CephFS uses the RADOS OSD blacklist to control client eviction,
CephFS clients can be permitted to reconnect by removing them from
the blacklist:

::

    $ ceph osd blacklist ls
    listed 1 entries
    127.0.0.1:0/3710147553 2018-03-19 11:32:24.716146
    $ ceph osd blacklist rm 127.0.0.1:0/3710147553
    un-blacklisting 127.0.0.1:0/3710147553


Doing this may put data integrity at risk if other clients have accessed
files that the blacklisted client was doing buffered IO to.  It is also not
guaranteed to result in a fully functional client -- the best way to get
a fully healthy client back after an eviction is to unmount the client
and do a fresh mount.

If you are trying to reconnect clients in this way, you may also
find it useful to set ``client_reconnect_stale`` to true in the
FUSE client, to prompt the client to try to reconnect.
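
If the blacklist holds several entries, it can help to check which one
belongs to the evicted client before removing anything. The following sketch
assumes that the ``ceph osd blacklist ls`` output looks like the sample above
(one ``address date time`` entry per line) and that the blacklisted address
is the one in the session's ``inst`` field; both helper functions are
hypothetical.

```python
from typing import List, Tuple


def parse_blacklist_ls(output: str) -> List[Tuple[str, str]]:
    """Return (address, expiry) pairs from ``blacklist ls`` style output."""
    entries = []
    for line in output.splitlines():
        parts = line.split(maxsplit=1)
        # Keep only lines that begin with an ip:port/nonce style address;
        # this skips the "listed N entries" summary line.
        if len(parts) == 2 and ":" in parts[0] and "/" in parts[0]:
            entries.append((parts[0], parts[1].strip()))
    return entries


def blacklisted_address(inst: str, entries: List[Tuple[str, str]]) -> str:
    """Return the session's address if it is blacklisted, else ''.

    ``inst`` is assumed to have the form "client.<id> <address>".
    """
    address = inst.split()[-1]
    return address if any(addr == address for addr, _ in entries) else ""


SAMPLE_OUTPUT = """listed 1 entries
127.0.0.1:0/3710147553 2018-03-19 11:32:24.716146
"""
entries = parse_blacklist_ls(SAMPLE_OUTPUT)
```

The matched address can then be passed to ``ceph osd blacklist rm`` as shown
above.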

Advanced: Configuring blacklisting
==================================

If you are experiencing frequent client evictions, due to slow
client hosts or an unreliable network, and you cannot fix the underlying
issue, then you may want to ask the MDS to be less strict.

It is possible to respond to slow clients by simply dropping their
MDS sessions, while still permitting them to re-open sessions and
to continue talking to OSDs.  To enable this mode, set
``mds_session_blacklist_on_timeout`` to false on your MDS nodes.

For the equivalent behaviour on manual evictions, set
``mds_session_blacklist_on_evict`` to false.

Note that if blacklisting is disabled, then evicting a client will
only have an effect on the MDS you send the command to.  On a system
with multiple active MDS daemons, you would need to send an
eviction command to each active daemon.  When blacklisting is enabled 
(the default), sending an eviction command to just a single
MDS is sufficient, because the blacklist propagates it to the others.
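
The rule above, one command when blacklisting is enabled and one command per
active daemon when it is not, can be sketched as follows. The function name
and the list of active ranks are hypothetical; the command string follows the
``client evict`` example earlier in this document.

```python
from typing import List


def eviction_commands(
    client_id: int,
    active_ranks: List[int],
    blacklist_on_evict: bool = True,
) -> List[str]:
    """Build the eviction commands needed for one client.

    With blacklisting enabled, a single MDS is enough because the
    blacklist carries the eviction to the others. Without it, every
    active MDS must be told separately.
    """
    if not active_ranks:
        raise ValueError("at least one active MDS rank is required")
    targets = active_ranks[:1] if blacklist_on_evict else active_ranks
    return [
        f"ceph tell mds.{rank} client evict id={client_id}"
        for rank in targets
    ]
```

For example, with three active ranks and blacklisting disabled, the function
returns three commands, one for each of ``mds.0``, ``mds.1`` and ``mds.2``.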

.. _background_blacklisting_and_osd_epoch_barrier:

Background: Blacklisting and OSD epoch barrier
==============================================

After a client is blacklisted, it is necessary to make sure that
other clients and MDS daemons have the latest OSDMap (including
the blacklist entry) before they try to access any data objects
that the blacklisted client might have been accessing.

This is ensured using an internal "osdmap epoch barrier" mechanism.

The purpose of the barrier is to ensure that when we hand out any
capabilities which might allow touching the same RADOS objects, the
clients receiving those capabilities have an OSD map recent enough
to avoid racing with cancelled operations (from ENOSPC) or
blacklisted clients (from evictions).

More specifically, the cases where an epoch barrier is set are:

 * Client eviction (where the client is blacklisted and other clients
   must wait for a post-blacklist epoch to touch the same objects).
 * OSD map full flag handling in the client (where the client may
   cancel some OSD ops from a pre-full epoch, so other clients must
   wait until the full epoch or later before touching the same objects).
 * MDS startup, because we don't persist the barrier epoch, so we must
   assume that the latest OSD map is always required after a restart.

Note that this is a global value for simplicity. We could maintain this on
a per-inode basis. But we don't, because:

 * It would be more complicated.
 * It would use an extra 4 bytes of memory for every inode.
 * It would not be much more efficient as, almost always, everyone has
   the latest OSD map. And, in most cases everyone will breeze through this
   barrier rather than waiting.
 * This barrier is done in very rare cases, so any benefit from per-inode
   granularity would only very rarely be seen.

The epoch barrier is transmitted along with all capability messages, and
instructs the receiver of the message to avoid sending any more RADOS
operations to OSDs until it has seen this OSD epoch.  This mainly applies
to clients (doing their data writes directly to files), but also applies
to the MDS because things like file size probing and file deletion are
done directly from the MDS.
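
The barrier behaviour described above can be illustrated with a toy model.
This is a simplified sketch and not the actual MDS or client implementation:
the class names are hypothetical, and keeping the barrier monotonic with
``max()`` is an assumption of the example.

```python
class MdsBarrier:
    """Toy model of the single, global barrier epoch held by an MDS."""

    def __init__(self, latest_osdmap_epoch: int) -> None:
        # The barrier is not persisted, so on startup the model
        # assumes the latest OSD map is required.
        self.barrier = latest_osdmap_epoch

    def raise_barrier(self, epoch: int) -> None:
        """Called after an eviction or full-flag handling."""
        self.barrier = max(self.barrier, epoch)


class Receiver:
    """Toy model of a client (or MDS) that receives capability messages."""

    def __init__(self, osdmap_epoch: int) -> None:
        self.osdmap_epoch = osdmap_epoch
        self.required_epoch = 0

    def on_cap_message(self, barrier: int) -> None:
        # Every capability message carries the sender's barrier epoch.
        self.required_epoch = max(self.required_epoch, barrier)

    def on_osdmap(self, epoch: int) -> None:
        self.osdmap_epoch = max(self.osdmap_epoch, epoch)

    def may_send_osd_ops(self) -> bool:
        # RADOS operations are held back until the barrier epoch is seen.
        return self.osdmap_epoch >= self.required_epoch
```

In this model, a receiver at epoch 10 that is handed a barrier of 12 must
hold its RADOS operations until it has received OSD map epoch 12 or later. In
the common case where everyone already has the latest map, the check passes
immediately.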