Commit | Line | Data |
---|---|---|
7c673cae | 1 | |
9f95a23c TL |
2 | ================================ |
3 | Ceph file system client eviction | |
4 | ================================ | |
7c673cae | 5 | |
9f95a23c TL |
6 | When a file system client is unresponsive or otherwise misbehaving, it |
7 | may be necessary to forcibly terminate its access to the file system. This | |
7c673cae FG |
8 | process is called *eviction*. |
9 | ||
31f18b77 | 10 | Evicting a CephFS client prevents it from communicating further with MDS |
9f95a23c | 11 | daemons and OSD daemons. If a client was doing buffered IO to the file system, |
31f18b77 FG |
12 | any un-flushed data will be lost. |
13 | ||
14 | Clients may either be evicted automatically (if they fail to communicate | |
15 | promptly with the MDS), or manually (by the system administrator). | |
16 | ||
17 | The client eviction process applies to clients of all kinds, including | |
18 | FUSE mounts, kernel mounts, nfs-ganesha gateways, and any process using | |
19 | libcephfs. | |
20 | ||
21 | Automatic client eviction | |
22 | ========================= | |
23 | ||
11fdf7f2 | 24 | There are three situations in which a client may be evicted automatically. |
31f18b77 | 25 | |
11fdf7f2 TL |
26 | #. On an active MDS daemon, if a client has not communicated with the MDS for over |
27 | ``session_autoclose`` (a file system variable) seconds (300 seconds by | |
28 | default), then it will be evicted automatically. | |
31f18b77 | 29 | |
11fdf7f2 TL |
30 | #. On an active MDS daemon, if a client has not responded to cap revoke messages |
31 | for over ``mds_cap_revoke_eviction_timeout`` (a configuration option) seconds, | |
32 | then it will be evicted automatically. This is disabled by default. | |
91327a77 | 33 | |
11fdf7f2 TL |
34 | #. During MDS startup (including on failover), the MDS passes through a |
35 | state called ``reconnect``. During this state, it waits for all the | |
36 | clients to connect to the new MDS daemon. If any clients fail to do | |
37 | so within the time window (``mds_reconnect_timeout``, 45 seconds by default) | |
38 | then they will be evicted. | |
31f18b77 FG |
39 | |
40 | A warning message is sent to the cluster log if any of these situations | |
41 | arises. | |
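
The three conditions above can be summarised as a small decision function. This is only an illustrative sketch, not MDS code: ``SessionView``, ``eviction_reason`` and their fields are hypothetical names, the defaults are the ones quoted in the list, and treating a timeout of 0 as "disabled" for the cap revoke check is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults quoted in the list above.
SESSION_AUTOCLOSE = 300.0      # seconds
MDS_RECONNECT_TIMEOUT = 45.0   # seconds


@dataclass
class SessionView:
    """Hypothetical snapshot of one client session (not an MDS structure)."""
    idle_seconds: float                                # time since the client last talked to the MDS
    oldest_unanswered_revoke: Optional[float] = None   # age of the oldest unanswered cap revoke
    reconnected: bool = True                           # only meaningful during ``reconnect``


def eviction_reason(session, mds_in_reconnect=False, reconnect_elapsed=0.0,
                    cap_revoke_eviction_timeout=0.0):
    """Return why the session would be evicted, or None if it would not.

    A cap_revoke_eviction_timeout of 0 models "disabled" (sketch assumption).
    """
    if mds_in_reconnect:
        # Situation 3: the client missed the reconnect window after MDS startup.
        if not session.reconnected and reconnect_elapsed > MDS_RECONNECT_TIMEOUT:
            return "reconnect timeout"
        return None
    # Situation 1: no communication for longer than session_autoclose.
    if session.idle_seconds > SESSION_AUTOCLOSE:
        return "session autoclose"
    # Situation 2: cap revoke unanswered for too long (only if enabled).
    if (cap_revoke_eviction_timeout > 0
            and session.oldest_unanswered_revoke is not None
            and session.oldest_unanswered_revoke > cap_revoke_eviction_timeout):
        return "cap revoke timeout"
    return None
```

For example, ``eviction_reason(SessionView(idle_seconds=301))`` reports the autoclose case, while a session that is only slow to answer a cap revoke is left alone unless a non-zero timeout is passed in.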
7c673cae | 42 | |
31f18b77 FG |
43 | Manual client eviction |
44 | ====================== | |
7c673cae | 45 | |
31f18b77 | 46 | Sometimes, the administrator may want to evict a client manually. This |
11fdf7f2 | 47 | could happen if a client has died and the administrator does not |
31f18b77 FG |
48 | want to wait for its session to time out, or it could happen if |
49 | a client is misbehaving and the administrator does not have access to | |
50 | the client node to unmount it. | |
7c673cae | 51 | |
31f18b77 | 52 | It is useful to inspect the list of clients first: |
7c673cae FG |
53 | |
54 | :: | |
55 | ||
31f18b77 FG |
56 | ceph tell mds.0 client ls |
57 | ||
7c673cae | 58 | [ |
31f18b77 FG |
59 | { |
60 | "id": 4305, | |
61 | "num_leases": 0, | |
62 | "num_caps": 3, | |
63 | "state": "open", | |
64 | "replay_requests": 0, | |
65 | "completed_requests": 0, | |
66 | "reconnecting": false, | |
67 | "inst": "client.4305 172.21.9.34:0/422650892", | |
68 | "client_metadata": { | |
69 | "ceph_sha1": "ae81e49d369875ac8b569ff3e3c456a31b8f3af5", | |
70 | "ceph_version": "ceph version 12.0.0-1934-gae81e49 (ae81e49d369875ac8b569ff3e3c456a31b8f3af5)", | |
71 | "entity_id": "0", | |
72 | "hostname": "senta04", | |
73 | "mount_point": "/tmp/tmpcMpF1b/mnt.0", | |
74 | "pid": "29377", | |
75 | "root": "/" | |
76 | } | |
77 | } | |
78 | ] | |
79 | ||
80 | ||
81 | ||
82 | Once you have identified the client you want to evict, you can | |
83 | evict it using its unique ID or other attributes that identify it: | |
7c673cae FG |
84 | |
85 | :: | |
31f18b77 FG |
86 | |
87 | # These all work | |
88 | ceph tell mds.0 client evict id=4305 | |
89 | ceph tell mds.0 client evict client_metadata.=4305 | |
90 | ||
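
When there are many sessions, picking the ID out of the ``client ls`` output by hand is error-prone. The following is a minimal sketch, assuming the JSON shape shown in the listing above; ``find_client_ids`` and ``evict_command`` are hypothetical helper names, and the sketch only builds the ``client evict id=...`` command shown above without running it.

```python
import json


def find_client_ids(client_ls_json, hostname):
    """Return the session IDs whose client_metadata.hostname matches."""
    sessions = json.loads(client_ls_json)
    return [s["id"] for s in sessions
            if s.get("client_metadata", {}).get("hostname") == hostname]


def evict_command(mds, client_id):
    """Build (but do not run) the eviction command as an argument list."""
    return ["ceph", "tell", mds, "client", "evict", f"id={client_id}"]


# Trimmed copy of the sample session from the listing above.
SAMPLE = """
[
  {
    "id": 4305,
    "state": "open",
    "inst": "client.4305 172.21.9.34:0/422650892",
    "client_metadata": {"hostname": "senta04", "root": "/"}
  }
]
"""

ids = find_client_ids(SAMPLE, "senta04")
commands = [evict_command("mds.0", client_id) for client_id in ids]
for cmd in commands:
    print(" ".join(cmd))
```

Returning an argument list rather than a string keeps the command safe to hand to ``subprocess.run`` later, should you choose to execute it.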
7c673cae | 91 | |
31f18b77 FG |
92 | Advanced: Un-blacklisting a client |
93 | ================================== | |
7c673cae | 94 | |
31f18b77 FG |
95 | Ordinarily, a blacklisted client may not reconnect to the servers: it |
96 | must be unmounted and then mounted anew. | |
7c673cae | 97 | |
31f18b77 FG |
98 | However, in some situations it may be useful to permit a client that |
99 | was evicted to attempt to reconnect. | |
7c673cae | 100 | |
31f18b77 FG |
101 | Because CephFS uses the RADOS OSD blacklist to control client eviction, |
102 | CephFS clients can be permitted to reconnect by removing them from | |
103 | the blacklist: | |
7c673cae FG |
104 | |
105 | :: | |
106 | ||
11fdf7f2 TL |
107 | $ ceph osd blacklist ls |
108 | listed 1 entries | |
109 | 127.0.0.1:0/3710147553 2018-03-19 11:32:24.716146 | |
110 | $ ceph osd blacklist rm 127.0.0.1:0/3710147553 | |
111 | un-blacklisting 127.0.0.1:0/3710147553 | |
112 | ||
7c673cae | 113 | |
31f18b77 FG |
114 | Doing this may put data integrity at risk if other clients have accessed |
115 | files that the blacklisted client was doing buffered IO to. It is also not | |
116 | guaranteed to result in a fully functional client -- the best way to get | |
117 | a fully healthy client back after an eviction is to unmount the client | |
118 | and do a fresh mount. | |
7c673cae | 119 | |
31f18b77 FG |
120 | If you are trying to reconnect clients in this way, you may also |
121 | find it useful to set ``client_reconnect_stale`` to true in the | |
122 | FUSE client, to prompt the client to try to reconnect. | |
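
If you need to review blacklist entries before removing any, the ``ceph osd blacklist ls`` output can be parsed with a few lines of Python. This is a sketch that assumes the exact text layout shown in the example earlier in this section (address, date, time); other Ceph releases may print a different layout. ``parse_blacklist_ls`` and ``rm_command`` are hypothetical helper names, and the timestamp is returned as printed without interpreting its meaning.

```python
from datetime import datetime

# Output copied from the example earlier in this section.
SAMPLE_OUTPUT = """listed 1 entries
127.0.0.1:0/3710147553 2018-03-19 11:32:24.716146
"""


def parse_blacklist_ls(output):
    """Return (address, timestamp) pairs from ``ceph osd blacklist ls`` text."""
    entries = []
    for line in output.splitlines():
        parts = line.split()
        # Entry lines have three fields and an address containing "/".
        # This skips the "listed N entries" summary line.
        if len(parts) != 3 or "/" not in parts[0]:
            continue
        stamp = datetime.strptime(parts[1] + " " + parts[2],
                                  "%Y-%m-%d %H:%M:%S.%f")
        entries.append((parts[0], stamp))
    return entries


def rm_command(address):
    """Build (but do not run) the un-blacklist command for one address."""
    return ["ceph", "osd", "blacklist", "rm", address]


entries = parse_blacklist_ls(SAMPLE_OUTPUT)
for address, stamp in entries:
    print(address, stamp.isoformat(), " ".join(rm_command(address)))
```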
7c673cae | 123 | |
31f18b77 FG |
124 | Advanced: Configuring blacklisting |
125 | ================================== | |
7c673cae | 126 | |
31f18b77 FG |
127 | If you are experiencing frequent client evictions, due to slow |
128 | client hosts or an unreliable network, and you cannot fix the underlying | |
129 | issue, then you may want to ask the MDS to be less strict. | |
7c673cae | 130 | |
31f18b77 FG |
131 | It is possible to respond to slow clients by simply dropping their |
132 | MDS sessions, while permitting them to re-open sessions and | |
133 | to continue talking to OSDs. To enable this mode, set | |
134 | ``mds_session_blacklist_on_timeout`` to false on your MDS nodes. | |
7c673cae | 135 | |
31f18b77 FG |
136 | For the equivalent behaviour on manual evictions, set |
137 | ``mds_session_blacklist_on_evict`` to false. | |
138 | ||
139 | Note that if blacklisting is disabled, then evicting a client will | |
140 | only have an effect on the MDS you send the command to. On a system | |
141 | with multiple active MDS daemons, you would need to send an | |
142 | eviction command to each active daemon. When blacklisting is enabled | |
b32b8144 | 143 | (the default), sending an eviction command to just a single |
31f18b77 FG |
144 | MDS is sufficient, because the blacklist propagates it to the others. |
145 | ||
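
The rule in the note above (one command when blacklisting is enabled, one per active MDS when it is not) can be sketched as follows. ``eviction_commands`` is a hypothetical helper, the list of active MDS names is supplied by the caller, and the commands are only built, not executed.

```python
def eviction_commands(client_id, active_mds, blacklist_on_evict=True):
    """Build the eviction commands needed to evict one client everywhere.

    active_mds: names of the active MDS daemons, e.g. ["mds.0", "mds.1"].
    blacklist_on_evict: mirrors ``mds_session_blacklist_on_evict``.
    """
    if not active_mds:
        raise ValueError("at least one active MDS is required")
    # With blacklisting enabled, a single MDS is enough; otherwise every
    # active MDS must be told separately.
    targets = active_mds[:1] if blacklist_on_evict else active_mds
    return [["ceph", "tell", mds, "client", "evict", f"id={client_id}"]
            for mds in targets]


for cmd in eviction_commands(4305, ["mds.0", "mds.1"], blacklist_on_evict=False):
    print(" ".join(cmd))
```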
b32b8144 | 146 | .. _background_blacklisting_and_osd_epoch_barrier: |
7c673cae | 147 | |
b32b8144 FG |
148 | Background: Blacklisting and OSD epoch barrier |
149 | ============================================== | |
7c673cae | 150 | |
b32b8144 FG |
151 | After a client is blacklisted, it is necessary to make sure that |
152 | other clients and MDS daemons have the latest OSDMap (including | |
153 | the blacklist entry) before they try to access any data objects | |
154 | that the blacklisted client might have been accessing. | |
155 | ||
156 | This is ensured using an internal "osdmap epoch barrier" mechanism. | |
157 | ||
158 | The purpose of the barrier is to ensure that when we hand out any | |
159 | capabilities which might allow touching the same RADOS objects, the | |
160 | clients receiving those capabilities have a sufficiently recent | |
161 | OSD map to not race with cancelled operations (from ENOSPC) or | |
162 | blacklisted clients (from evictions). | |
163 | ||
164 | More specifically, the cases where an epoch barrier is set are: | |
165 | ||
166 | * Client eviction (where the client is blacklisted and other clients | |
167 | must wait for a post-blacklist epoch to touch the same objects). | |
168 | * OSD map full flag handling in the client (where the client may | |
169 | cancel some OSD ops from a pre-full epoch, so other clients must | |
170 | wait until the full epoch or later before touching the same objects). | |
171 | * MDS startup, because we don't persist the barrier epoch and so must | |
172 | assume that the latest OSD map is always required after a restart. | |
173 | ||
174 | Note that this is a global value for simplicity. We could maintain this on | |
175 | a per-inode basis, but we don't, because: | |
176 | ||
177 | * It would be more complicated. | |
178 | * It would use an extra 4 bytes of memory for every inode. | |
11fdf7f2 TL |
179 | * It would not be much more efficient because, almost always, everyone has
180 | the latest OSD map, so in most cases everyone will breeze through this | |
181 | barrier rather than waiting. | |
b32b8144 FG |
182 | * This barrier is done in very rare cases, so any benefit from per-inode |
183 | granularity would only very rarely be seen. | |
184 | ||
185 | The epoch barrier is transmitted along with all capability messages, and | |
186 | instructs the receiver of the message to avoid sending any more RADOS | |
187 | operations to OSDs until it has seen this OSD epoch. This mainly applies | |
188 | to clients (doing their data writes directly to files), but also applies | |
189 | to the MDS because things like file size probing and file deletion are | |
190 | done directly from the MDS. |
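
The receiver-side behaviour described above can be modelled in a few lines. This is a toy sketch, not Ceph client or MDS code: ``BarrierAwareClient`` and its methods are hypothetical names, and RADOS operations are represented by plain strings.

```python
class BarrierAwareClient:
    """Toy receiver that holds RADOS ops until its OSD map reaches the barrier."""

    def __init__(self, osdmap_epoch):
        self.osdmap_epoch = osdmap_epoch  # newest OSD map epoch seen so far
        self.barrier = 0                  # single global barrier, as in the text
        self.pending = []                 # ops held back by the barrier
        self.sent = []                    # ops released to the OSDs

    def handle_cap_message(self, epoch_barrier):
        # The barrier travels with capability messages; it never moves backwards.
        self.barrier = max(self.barrier, epoch_barrier)

    def submit(self, op):
        # Send immediately if the map is recent enough, otherwise queue the op.
        if self.osdmap_epoch >= self.barrier:
            self.sent.append(op)
        else:
            self.pending.append(op)

    def handle_osdmap(self, epoch):
        # A newer map may satisfy the barrier and release the queued ops.
        self.osdmap_epoch = max(self.osdmap_epoch, epoch)
        if self.osdmap_epoch >= self.barrier:
            self.sent.extend(self.pending)
            self.pending.clear()


client = BarrierAwareClient(osdmap_epoch=10)
client.handle_cap_message(epoch_barrier=12)  # e.g. set after another client was evicted
client.submit("write A")                     # held: epoch 10 is older than the barrier
client.handle_osdmap(12)                     # barrier satisfied, "write A" is released
```

Because the barrier is one global value, the receiver needs only a single integer comparison per operation, which matches the simplicity argument made above.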