.. _msgr2-protocol:

msgr2 protocol
==============

This is a revision of the legacy Ceph on-wire protocol that was
implemented by the SimpleMessenger. It addresses performance and
security issues.

Goals
-----

This protocol revision has several goals relative to the original protocol:

* *Flexible handshaking*. The original protocol's negotiation was not
  flexible enough to accommodate features that were not required.
* *Encryption*. We will incorporate encryption over the wire.
* *Performance*. We would like to provide for protocol features
  (e.g., padding) that keep computation and memory copies out of the
  fast path where possible.
* *Signing*. We will allow for traffic to be signed (but not
  necessarily encrypted). This may not be implemented in the initial version.

Definitions
-----------

* *client* (C): the party initiating a (TCP) connection
* *server* (S): the party accepting a (TCP) connection
* *connection*: an instance of a (TCP) connection between two processes.
* *entity*: a Ceph entity instantiation, e.g. 'osd.0'. Each entity
  has one or more unique entity_addr_t's by virtue of the 'nonce'
  field, which is typically a pid or random value.
* *session*: a stateful session between two entities in which message
  exchange is ordered and lossless. A session might span multiple
  connections if there is an interruption (TCP connection disconnect).
* *frame*: a discrete message sent between the peers. Each frame
  consists of a tag (type code), payload, and (if signing
  or encryption is enabled) some other fields. See below for the
  structure.
* *tag*: a type code associated with a frame. The tag
  determines the structure of the payload.

Phases
------

A connection has four distinct phases:

#. banner
#. authentication frame exchange
#. message flow handshake frame exchange
#. message frame exchange

Banner
------

Both the client and server, upon connecting, send a banner::

  "ceph %x %x\n", protocol_features_supported, protocol_features_required

The protocol features are a new, distinct namespace. Initially no
features are defined or required, so this will be "ceph 0 0\n".

If the remote party advertises required features we don't support, we
can disconnect.


.. ditaa:: +---------+        +--------+
           | Client  |        | Server |
           +---------+        +--------+
                | send banner      |
                |----+        +----|
                |    |        |    |
                |    +--------+--->|
                | send banner |    |
                |<------------+    |
                |                  |

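The banner exchange can be sketched in a few lines of Python. This is a minimal illustration, not the Ceph implementation: the helper names (``make_banner``, ``parse_banner``, ``peer_is_acceptable``) are hypothetical, and the only thing taken from the text above is the ``"ceph %x %x\n"`` format and the rule that we may disconnect when the peer requires features we do not support.

```python
def make_banner(supported, required):
    """Format the banner as "ceph %x %x\\n" (hex, no 0x prefix)."""
    return b"ceph %x %x\n" % (supported, required)


def parse_banner(line):
    """Return (supported, required) parsed from a peer banner line."""
    if not line.endswith(b"\n"):
        raise ValueError("banner must end with a newline")
    parts = line.decode("ascii").split()
    if len(parts) != 3 or parts[0] != "ceph":
        raise ValueError("malformed banner")
    return int(parts[1], 16), int(parts[2], 16)


def peer_is_acceptable(our_supported, peer_required):
    """True when every feature bit the peer requires is one we support."""
    return (peer_required & ~our_supported) == 0


# Initially no features are defined, so both sides send "ceph 0 0\n".
banner = make_banner(0, 0)
peer_supported, peer_required = parse_banner(banner)
```

If ``peer_is_acceptable`` returns false, the simplest reaction consistent with the text is to close the TCP connection.
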
Frame format
------------

All further data sent or received is contained in a frame. Each frame has
the form::

  frame_len (le32)
  tag (TAG_* le32)
  frame_header_checksum (le32)
  payload
  [payload padding -- only present after stream auth phase]
  [signature -- only present after stream auth phase]


* The frame_header_checksum is over just the frame_len and tag values (8 bytes).

* frame_len includes everything after the frame_len le32 up to the end of the
  frame (all payloads, signatures, and padding).

* The payload format and length are determined by the tag.

* The signature portion is only present if the authentication phase
  has completed (TAG_AUTH_DONE has been sent) and signatures are
  enabled.

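The layout above can be exercised with a small encoder and decoder. This is a sketch under stated assumptions: the text does not name the checksum algorithm, so ``zlib.crc32`` is used purely as a stand-in, and the function names are hypothetical. What it does take from the text is the field order, the little-endian 32-bit widths, the checksum covering only ``frame_len`` and ``tag``, and ``frame_len`` counting everything after its own four bytes.

```python
import struct
import zlib

HEADER = struct.Struct("<III")  # frame_len, tag, frame_header_checksum


def header_checksum(frame_len, tag):
    # Stand-in checksum over the 8 bytes of frame_len and tag.
    # ASSUMPTION: the real algorithm is not specified in this section.
    return zlib.crc32(struct.pack("<II", frame_len, tag)) & 0xFFFFFFFF


def encode_frame(tag, payload, padding=b"", signature=b""):
    body = payload + padding + signature
    # frame_len covers tag (4) + checksum (4) + body, but not itself.
    frame_len = 8 + len(body)
    return HEADER.pack(frame_len, tag, header_checksum(frame_len, tag)) + body


def decode_frame(buf):
    """Return (tag, body, remaining_bytes) for the first frame in buf."""
    if len(buf) < HEADER.size:
        raise ValueError("short header")
    frame_len, tag, csum = HEADER.unpack_from(buf)
    if csum != header_checksum(frame_len, tag):
        raise ValueError("bad frame header checksum")
    if frame_len < 8:
        raise ValueError("frame_len too small")
    end = 4 + frame_len
    if len(buf) < end:
        raise ValueError("truncated frame")
    return tag, buf[HEADER.size:end], buf[end:]


frame = encode_frame(1, b"hello")
tag, body, rest = decode_frame(frame + b"xx")
```

Verifying the header checksum before trusting ``frame_len`` lets a receiver reject a corrupted length without waiting for (or allocating) a bogus amount of data.
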
Hello
-----

* TAG_HELLO: client->server and server->client::

    __u8 entity_type
    entity_addr_t peer_socket_address

  - We immediately share our entity type and the address of the peer (which can be useful
    for detecting our effective IP address, especially in the presence of NAT).


Authentication
--------------

* TAG_AUTH_REQUEST: client->server::

    __le32 method;  // CEPH_AUTH_{NONE, CEPHX, ...}
    __le32 num_preferred_modes;
    list<__le32> mode  // CEPH_CON_MODE_*
    method specific payload

* TAG_AUTH_BAD_METHOD server -> client: reject client-selected auth method::

    __le32 method
    __le32 negative error result code
    __le32 num_methods
    list<__le32> allowed_methods // CEPH_AUTH_{NONE, CEPHX, ...}
    __le32 num_modes
    list<__le32> allowed_modes   // CEPH_CON_MODE_*

  - Returns the attempted auth method, an error code (-EOPNOTSUPP if
    the method is unsupported), and the lists of allowed authentication
    methods and connection modes.

* TAG_AUTH_REPLY_MORE: server->client::

    __le32 len;
    method specific payload

* TAG_AUTH_REQUEST_MORE: client->server::

    __le32 len;
    method specific payload

* TAG_AUTH_DONE: (server->client)::

    __le64 global_id
    __le32 connection mode // CEPH_CON_MODE_*
    method specific payload

  - The server is the one to decide authentication has completed and what
    the final connection mode will be.


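As an illustration of the TAG_AUTH_BAD_METHOD payload, the sketch below packs and unpacks the fields in the order listed above, then shows one way a client might choose its next method from the returned list. The helper names are hypothetical, the numeric method and mode values are placeholders rather than real CEPH_AUTH_* or CEPH_CON_MODE_* constants, and treating the result code as a signed 32-bit value is an assumption based on it being described as negative.

```python
import errno
import struct


def _pack_u32_list(values):
    values = list(values)
    # Count (le32) followed by that many le32 entries.
    return struct.pack("<I%dI" % len(values), len(values), *values)


def _unpack_u32_list(buf, off):
    (count,) = struct.unpack_from("<I", buf, off)
    off += 4
    values = list(struct.unpack_from("<%dI" % count, buf, off))
    return values, off + 4 * count


def encode_auth_bad_method(method, result, allowed_methods, allowed_modes):
    # ASSUMPTION: the negative error code is carried as a signed le32.
    return (struct.pack("<Ii", method, result)
            + _pack_u32_list(allowed_methods)
            + _pack_u32_list(allowed_modes))


def decode_auth_bad_method(buf):
    method, result = struct.unpack_from("<Ii", buf, 0)
    methods, off = _unpack_u32_list(buf, 8)
    modes, off = _unpack_u32_list(buf, off)
    return {"method": method, "result": result,
            "allowed_methods": methods, "allowed_modes": modes}


def pick_next_method(preferred, allowed, already_tried):
    """First method we prefer that the server allows and we have not tried."""
    for m in preferred:
        if m in allowed and m not in already_tried:
            return m
    return None


# Placeholder values: method 7 is rejected, methods 1 and 2 are allowed.
payload = encode_auth_bad_method(7, -errno.EOPNOTSUPP, [1, 2], [1])
reply = decode_auth_bad_method(payload)
retry_with = pick_next_method([7, 2, 1], reply["allowed_methods"], {7})
```

Returning ``None`` from ``pick_next_method`` corresponds to the case where no mutually acceptable method exists and the client has to give up.
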
Example of authentication phase interaction when the client uses an
allowed authentication method:

.. ditaa:: +---------+        +--------+
           | Client  |        | Server |
           +---------+        +--------+
                | auth request     |
                |----------------->|
                |<-----------------|
                |        auth more |
                |                  |
                | auth more        |
                |----------------->|
                |<-----------------|
                |        auth done |


Example of authentication phase interaction when the client uses a forbidden
authentication method as the first attempt:

.. ditaa:: +---------+        +--------+
           | Client  |        | Server |
           +---------+        +--------+
                | auth request     |
                |----------------->|
                |<-----------------|
                |       bad method |
                |                  |
                | auth request     |
                |----------------->|
                |<-----------------|
                |        auth more |
                |                  |
                | auth more        |
                |----------------->|
                |<-----------------|
                |        auth done |


Post-auth frame format
----------------------

The frame format is fixed (see above), but can take three different
forms, depending on the AUTH_DONE flags:

* If neither FLAG_SIGNED nor FLAG_ENCRYPTED is specified, things are simple::

    frame_len
    tag
    payload
    payload_padding (out to auth block_size)

  - The padding is some number of bytes < the auth block_size that
    brings the total length of the payload + payload_padding to a
    multiple of block_size. It does not include the frame_len or tag. Padding
    content can be zeros or (better) random bytes.

* If FLAG_SIGNED has been specified::

    frame_len
    tag
    payload
    payload_padding (out to auth block_size)
    signature (sig_size bytes)

  Here the padding just makes life easier for the signature. It can be
  random data, which acts as an additional confounder. Note also that the
  signature input must include some state from the session key and the
  previous message.

* If FLAG_ENCRYPTED has been specified::

    frame_len
    tag
    {
      payload
      payload_padding (out to auth block_size)
    } ^ stream cipher

  Note that the padding ensures that the total frame is a multiple of
  the auth method's block_size so that the message can be sent out over
  the wire without waiting for the next frame in the stream.


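The padding rule shared by the three forms above is simple modular arithmetic. The sketch below is illustrative only: the function names are hypothetical, and ``block_size`` is whatever the negotiated auth method dictates.

```python
import os


def padding_len(payload_len, block_size):
    """Bytes needed (always < block_size) to reach a multiple of block_size."""
    return -payload_len % block_size


def pad_payload(payload, block_size, random_fill=True):
    n = padding_len(len(payload), block_size)
    # Random bytes are preferred over zeros, as noted above.
    filler = os.urandom(n) if random_fill else bytes(n)
    return payload + filler


padded = pad_payload(b"x" * 17, 16)
```

For example, with a 16-byte block a 5-byte payload gets 11 bytes of padding, while a 16-byte payload gets none. The length of the padded payload does not count the frame_len or tag fields.
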
Message flow handshake
----------------------

In this phase the peers identify each other and (if desired) reconnect to
an established session.

* TAG_CLIENT_IDENT (client->server): identify ourselves::

    __le32 num_addrs
    entity_addrvec_t*num_addrs entity addrs
    entity_addr_t target entity addr
    __le64 gid (numeric part of osd.0, client.123456, ...)
    __le64 global_seq
    __le64 features supported (CEPH_FEATURE_* bitmask)
    __le64 features required (CEPH_FEATURE_* bitmask)
    __le64 flags (CEPH_MSG_CONNECT_* bitmask)
    __le64 cookie

  - The client sends first, and the server replies in kind. If this is a
    new session, the client and server can proceed to the message exchange.
  - The target addr is who the client is trying to connect *to*, so
    that the server side can close the connection if the client is
    talking to the wrong daemon.
  - type.gid (entity_name_t) is set here, by combining the type shared in the hello
    frame with the gid here. This means we don't need it
    in the header of every message. It also means that we can't send
    messages "from" other entity_name_t's. The current
    implementations set this at the top of _send_message etc. so this
    shouldn't break any existing functionality. Implementations will
    likely want to mask this against what the authenticated credential
    allows.
  - cookie is the client cookie used to identify a session, and can be used
    to reconnect to an existing session.
  - We've dropped the 'protocol_version' field from msgr1.

* TAG_IDENT_MISSING_FEATURES (server->client): complain about a TAG_CLIENT_IDENT
  with too few features::

    __le64 features we require that the peer didn't advertise

* TAG_SERVER_IDENT (server->client): accept client ident and identify server::

    __le32 num_addrs
    entity_addrvec_t*num_addrs entity addrs
    __le64 gid (numeric part of osd.0, client.123456, ...)
    __le64 global_seq
    __le64 features supported (CEPH_FEATURE_* bitmask)
    __le64 features required (CEPH_FEATURE_* bitmask)
    __le64 flags (CEPH_MSG_CONNECT_* bitmask)
    __le64 cookie

  - The server cookie can be used by the client if it is later disconnected
    and wants to reconnect and resume the session.

* TAG_RECONNECT (client->server): reconnect to an established session::

    __le32 num_addrs
    entity_addr_t * num_addrs
    __le64 client_cookie
    __le64 server_cookie
    __le64 global_seq
    __le64 connect_seq
    __le64 msg_seq (the last msg seq received)

* TAG_RECONNECT_OK (server->client): acknowledge a reconnect attempt::

    __le64 msg_seq (last msg seq received)

  - Once the client receives this, the client can proceed to message exchange.
  - Once the server sends this, the server can proceed to message exchange.

* TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq

* TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq

* TAG_RECONNECT_WAIT (server only): fail reconnect due to connect race.

  - Indicates that the server is already connecting to the client, and
    that direction should win the race. The client should wait for that
    connection to complete.

* TAG_RESET_SESSION (server only): ask client to reset session::

    __u8 full

  - The full flag indicates whether the peer should do a full reset, i.e., drop
    its message queue.


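The choice between TAG_SERVER_IDENT and TAG_IDENT_MISSING_FEATURES comes down to a bitmask comparison. The sketch below assumes the server compares its own required features against the client's advertised supported features; the function names are hypothetical and the feature bits are made-up values, not real CEPH_FEATURE_* constants.

```python
import struct

U64_MASK = 0xFFFFFFFFFFFFFFFF


def missing_features(our_required, peer_supported):
    """Feature bits we require that the peer did not advertise."""
    return our_required & ~peer_supported & U64_MASK


def answer_client_ident(server_required, client_supported):
    """Return (tag_name, payload) for the server's reply.

    The TAG_SERVER_IDENT payload is omitted (None) to keep the sketch short.
    """
    missing = missing_features(server_required, client_supported)
    if missing:
        # Single __le64: the features we require that the peer lacks.
        return "TAG_IDENT_MISSING_FEATURES", struct.pack("<Q", missing)
    return "TAG_SERVER_IDENT", None


tag_name, payload = answer_client_ident(0b1011, 0b0011)
```

Here the client lacks bit 3, so the reply carries ``0b1000`` and the client learns exactly which features it would need to proceed.
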
Example of failure scenarios:

* The client's first client_ident message is lost, and then the client reconnects.

.. ditaa:: +---------+           +--------+
           | Client  |           | Server |
           +---------+           +--------+
                |                     |
    c_cookie(a) | client_ident(a)     |
                |-------------X       |
                |                     |
                | client_ident(a)     |
                |-------------------->|
                |<--------------------|
                |     server_ident(b) | s_cookie(b)
                |                     |
                | session established |
                |                     |


* Server's server_ident message is lost, and then client reconnects.

.. ditaa:: +---------+           +--------+
           | Client  |           | Server |
           +---------+           +--------+
                |                     |
    c_cookie(a) | client_ident(a)     |
                |-------------------->|
                |        X------------|
                |     server_ident(b) | s_cookie(b)
                |                     |
                |                     |
                | client_ident(a)     |
                |-------------------->|
                |<--------------------|
                |     server_ident(c) | s_cookie(c)
                |                     |
                | session established |
                |                     |


* Server's server_ident message is lost, and then server reconnects.

.. ditaa:: +---------+           +--------+
           | Client  |           | Server |
           +---------+           +--------+
                |                     |
    c_cookie(a) | client_ident(a)     |
                |-------------------->|
                |        X------------|
                |     server_ident(b) | s_cookie(b)
                |                     |
                |                     |
                |     reconnect(a, b) |
                |<--------------------|
                |-------------------->|
                | reset_session(F)    |
                |                     |
                |     client_ident(a) | c_cookie(a)
                |<--------------------|
                |-------------------->|
    s_cookie(c) | server_ident(c)     |
                |                     |


* Connection failure after session is established, and then client reconnects.

.. ditaa:: +---------+           +--------+
           | Client  |           | Server |
           +---------+           +--------+
                |                     |
    c_cookie(a) | session established | s_cookie(b)
                |<------------------->|
                |        X------------|
                |                     |
                | reconnect(a, b)     |
                |-------------------->|
                |<--------------------|
                |        reconnect_ok |
                |                     |


* Connection failure after session is established because server reset,
  and then client reconnects.

.. ditaa:: +---------+           +--------+
           | Client  |           | Server |
           +---------+           +--------+
                |                     |
    c_cookie(a) | session established | s_cookie(b)
                |<------------------->|
                |        X------------| reset
                |                     |
                | reconnect(a, b)     |
                |-------------------->|
                |<--------------------|
                |  reset_session(RC*) |
                |                     |
    c_cookie(c) | client_ident(c)     |
                |-------------------->|
                |<--------------------|
                |     server_ident(d) | s_cookie(d)
                |                     |

RC* means that the reset session full flag depends on the policy.resetcheck
of the connection.


* Connection failure after session is established because client reset,
  and then client reconnects.

.. ditaa:: +---------+           +--------+
           | Client  |           | Server |
           +---------+           +--------+
                |                     |
    c_cookie(a) | session established | s_cookie(b)
                |<------------------->|
          reset |        X------------|
                |                     |
    c_cookie(c) | client_ident(c)     |
                |-------------------->|
                |<--------------------| reset if policy.resetcheck
                |     server_ident(d) | s_cookie(d)
                |                     |


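For the scenario in which the client reconnects to an established session and receives reconnect_ok, one plausible use of the returned msg_seq is to trim and replay the sender's queue. The sketch below is an assumption about how an implementation might keep a session lossless across connections; the class and method names are hypothetical, and starting sequence numbers at 1 is an arbitrary choice for the example.

```python
from collections import deque


class OutgoingSession:
    """Tracks sent-but-unacknowledged messages for one session."""

    def __init__(self):
        self.next_seq = 1        # ASSUMPTION: sequence numbers start at 1
        self.unacked = deque()   # (seq, message) pairs in send order

    def send(self, message):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked.append((seq, message))
        return seq

    def discard_through(self, seq):
        # Drop everything the peer has confirmed receiving.
        while self.unacked and self.unacked[0][0] <= seq:
            self.unacked.popleft()

    def on_reconnect_ok(self, peer_msg_seq):
        """Return the messages to resend, in order, on the new connection."""
        self.discard_through(peer_msg_seq)
        return list(self.unacked)


session = OutgoingSession()
for m in (b"a", b"b", b"c", b"d"):
    session.send(m)
to_resend = session.on_reconnect_ok(2)   # peer last received seq 2
```

With the peer reporting seq 2 as the last message received, only messages 3 and 4 are replayed, which preserves ordering without duplicating delivered messages.
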
Message exchange
----------------

Once a session is established, we can exchange messages.

* TAG_MSG: a message::

    ceph_msg_header2
    front
    middle
    data_pre_padding
    data

  - The ceph_msg_header2 is modified from ceph_msg_header:

    * include an ack_seq. This avoids the need for a TAG_ACK
      message most of the time.
    * remove the src field, which we now get from the message flow
      handshake (TAG_CLIENT_IDENT or TAG_SERVER_IDENT).
    * specifies the data_pre_padding length, which can be used to
      adjust the alignment of the data payload. (NOTE: is this
      useful?)

* TAG_ACK: acknowledge receipt of message(s)::

    __le64 seq

  - This is only used for stateful sessions.

* TAG_KEEPALIVE2: check for connection liveness::

    ceph_timespec stamp

  - Time stamp is local to sender.

* TAG_KEEPALIVE2_ACK: reply to a keepalive2::

    ceph_timespec stamp

  - Time stamp is from the TAG_KEEPALIVE2 we are responding to.

* TAG_CLOSE: terminate a connection

  Indicates that a connection should be terminated. This is equivalent
  to a hangup or reset (i.e., should trigger ms_handle_reset). It
  isn't strictly necessary or useful as we could just disconnect the
  TCP connection.


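Because TAG_KEEPALIVE2_ACK echoes the stamp from the keepalive it answers, and that stamp is local to the sender, the sender can measure round-trip time without synchronized clocks. The sketch below assumes ``ceph_timespec`` is encoded as two little-endian 32-bit fields (seconds, then nanoseconds); that layout is not stated in this document, and the function names are hypothetical.

```python
import struct
import time

# ASSUMPTION: ceph_timespec is {le32 tv_sec; le32 tv_nsec}.
TIMESPEC = struct.Struct("<II")
NSEC_PER_SEC = 10**9


def encode_keepalive2(now_ns=None):
    """Payload for TAG_KEEPALIVE2, stamped with the sender's local clock."""
    if now_ns is None:
        now_ns = time.time_ns()
    sec, nsec = divmod(now_ns, NSEC_PER_SEC)
    return TIMESPEC.pack(sec & 0xFFFFFFFF, nsec)


def make_keepalive2_ack(keepalive_payload):
    """Payload for TAG_KEEPALIVE2_ACK: echo the received stamp unchanged."""
    TIMESPEC.unpack(keepalive_payload)  # raises struct.error on a bad length
    return bytes(keepalive_payload)


def rtt_seconds(ack_payload, now_ns):
    """Round-trip time as seen by the original sender."""
    sec, nsec = TIMESPEC.unpack(ack_payload)
    return (now_ns - (sec * NSEC_PER_SEC + nsec)) / 1e9


sent = encode_keepalive2(5_000_000_123)
ack = make_keepalive2_ack(sent)
rtt = rtt_seconds(ack, 5_250_000_123)
```

Only the sender ever interprets the stamp, so the receiver's clock plays no part in the measurement.
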
Example of protocol interaction (WIP)
_____________________________________


.. ditaa:: +---------+         +--------+
           | Client  |         | Server |
           +---------+         +--------+
                | send banner       |
                |----+         +----|
                |    |         |    |
                |    +---------+--->|
                | send banner  |    |
                |<-------------+    |
                |                   |
                | send new stream   |
                |------------------>|
                | auth request      |
                |------------------>|
                |<------------------|
                |        bad method |
                |                   |
                | auth request      |
                |------------------>|
                |<------------------|
                |         auth more |
                |                   |
                | auth more         |
                |------------------>|
                |<------------------|
                |         auth done |
                |                   |