Commit | Line | Data |
---|---|---|
11fdf7f2 TL |
1 | .. _msgr2-protocol: |
2 | ||
7c673cae FG |
3 | msgr2 protocol |
4 | ============== | |
5 | ||
6 | This is a revision of the legacy Ceph on-wire protocol that was | |
7 | implemented by the SimpleMessenger. It addresses performance and | |
8 | security issues. | |
9 | ||
11fdf7f2 TL |
10 | Goals |
11 | ----- | |
12 | ||
13 | This protocol revision has several goals relative to the original protocol: | |
14 | ||
15 | * *Flexible handshaking*. The original protocol's negotiation was | |
16 | not flexible enough to allow for features that are supported but | |
17 | not required. | |
18 | * *Encryption*. We will incorporate encryption over the wire. | |
19 | * *Performance*. We would like to provide for protocol features | |
20 | (e.g., padding) that keep computation and memory copies out of the | |
21 | fast path where possible. | |
22 | * *Signing*. We will allow for traffic to be signed (but not | |
23 | necessarily encrypted). This may not be implemented in the initial version. | |
24 | ||
7c673cae FG |
25 | Definitions |
26 | ----------- | |
27 | ||
28 | * *client* (C): the party initiating a (TCP) connection | |
29 | * *server* (S): the party accepting a (TCP) connection | |
30 | * *connection*: an instance of a (TCP) connection between two processes. | |
31 | * *entity*: a Ceph entity instantiation, e.g. 'osd.0'. Each entity | |
32 | has one or more unique entity_addr_t's by virtue of the 'nonce' | |
33 | field, which is typically a pid or random value. | |
7c673cae FG |
34 | * *session*: a stateful session between two entities in which message |
35 | exchange is ordered and lossless. A session might span multiple | |
11fdf7f2 | 36 | connections if there is an interruption (TCP connection disconnect). |
7c673cae | 37 | * *frame*: a discrete message sent between the peers. Each frame |
11fdf7f2 | 38 | consists of a tag (type code), payload, and (if signing |
7c673cae FG |
39 | or encryption is enabled) some other fields. See below for the |
40 | structure. | |
11fdf7f2 | 41 | * *tag*: a type code associated with a frame. The tag |
7c673cae FG |
42 | determines the structure of the payload. |
43 | ||
44 | Phases | |
45 | ------ | |
46 | ||
11fdf7f2 | 47 | A connection has four distinct phases: |
7c673cae FG |
48 | |
49 | #. banner | |
11fdf7f2 TL |
50 | #. authentication frame exchange |
51 | #. message flow handshake frame exchange | |
52 | #. message frame exchange | |
7c673cae FG |
53 | |
54 | Banner | |
55 | ------ | |
56 | ||
57 | Both the client and server, upon connecting, send a banner:: | |
58 | ||
59 | "ceph %x %x\n", protocol_features_supported, protocol_features_required | |
60 | ||
61 | The protocol features are a new, distinct namespace. Initially no | |
62 | features are defined or required, so this will be "ceph 0 0\n". | |
63 | ||
64 | If the remote party advertises required features we don't support, we | |
65 | can disconnect. | |
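The banner exchange and the required-feature check described above can be sketched as follows. This is a minimal illustration, not the Ceph implementation: the helper names (`make_banner`, `parse_banner`, `missing_features`) are hypothetical, and the feature bit values used in any call are placeholders, since no protocol features are defined initially.

```python
from typing import Tuple


def make_banner(supported: int, required: int) -> bytes:
    """Format the banner as "ceph %x %x\\n" (lowercase hex, no 0x prefix)."""
    return f"ceph {supported:x} {required:x}\n".encode("ascii")


def parse_banner(line: bytes) -> Tuple[int, int]:
    """Parse a peer banner into (supported, required) feature bitmasks."""
    word, supported, required = line.decode("ascii").rstrip("\n").split(" ")
    if word != "ceph":
        raise ValueError("not a msgr2 banner")
    return int(supported, 16), int(required, 16)


def missing_features(our_supported: int, peer_required: int) -> int:
    """Return the bits the peer requires that we do not support."""
    return peer_required & ~our_supported
```

A non-zero result from `missing_features` corresponds to the case where we can disconnect.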
66 | ||
11fdf7f2 TL |
67 | |
68 | .. ditaa:: +---------+ +--------+ | |
69 | | Client | | Server | | |
70 | +---------+ +--------+ | |
71 | | send banner | | |
72 | |----+ +----| | |
73 | | | | | | |
74 | | +-------+--->| | |
75 | | send banner| | | |
76 | |<-----------+ | | |
77 | | | | |
78 | ||
7c673cae FG |
79 | Frame format |
80 | ------------ | |
81 | ||
82 | All further data sent or received is contained by a frame. Each frame has | |
83 | the form:: | |
84 | ||
7c673cae | 85 | frame_len (le32) |
11fdf7f2 TL |
86 | tag (TAG_* le32) |
87 | frame_header_checksum (le32) | |
7c673cae FG |
88 | payload |
89 | [payload padding -- only present after stream auth phase] | |
90 | [signature -- only present after stream auth phase] | |
91 | ||
11fdf7f2 TL |
92 | |
93 | * The frame_header_checksum is over just the frame_len and tag values (8 bytes). | |
94 | ||
7c673cae FG |
95 | * frame_len includes everything after the frame_len le32 up to the end of the |
96 | frame (all payloads, signatures, and padding). | |
97 | ||
98 | * The payload format and length are determined by the tag. | |
99 | ||
11fdf7f2 TL |
100 | * The signature portion is only present if the authentication phase |
101 | has completed (TAG_AUTH_DONE has been sent) and signatures are | |
102 | enabled. | |
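To make the field layout concrete, here is a rough sketch of encoding and decoding a frame before the authentication phase has completed (so no padding or signature is present). The checksum algorithm is not named in this document, so `zlib.crc32` is used purely as a stand-in assumption. The function names and the tag value are hypothetical.

```python
import struct
import zlib

# frame_len (le32), tag (le32), frame_header_checksum (le32)
HEADER = struct.Struct("<III")


def header_checksum(frame_len: int, tag: int) -> int:
    """Checksum over just the frame_len and tag values (8 bytes).

    ASSUMPTION: plain CRC32 stands in for the real checksum function.
    """
    return zlib.crc32(struct.pack("<II", frame_len, tag))


def encode_frame(tag: int, payload: bytes) -> bytes:
    # frame_len covers everything after the frame_len field itself:
    # the tag (4 bytes), the checksum (4 bytes), and the payload.
    frame_len = 4 + 4 + len(payload)
    return HEADER.pack(frame_len, tag, header_checksum(frame_len, tag)) + payload


def decode_frame(buf: bytes):
    """Return (tag, payload, remaining_bytes) for the first frame in buf."""
    frame_len, tag, checksum = HEADER.unpack_from(buf, 0)
    if header_checksum(frame_len, tag) != checksum:
        raise ValueError("frame header checksum mismatch")
    end = 4 + frame_len
    if len(buf) < end:
        raise ValueError("truncated frame")
    return tag, buf[HEADER.size:end], buf[end:]
```

Verifying the header checksum before trusting frame_len lets a receiver reject a corrupted length before it tries to read that many bytes.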
7c673cae | 103 | |
11fdf7f2 TL |
104 | Hello |
105 | ----- | |
7c673cae | 106 | |
11fdf7f2 TL |
107 | * TAG_HELLO: client->server and server->client:: |
108 | ||
109 | __u8 entity_type | |
110 | entity_addr_t peer_socket_address | |
7c673cae | 111 | |
11fdf7f2 TL |
112 | - We immediately share our entity type and the address of the peer (which can be useful |
113 | for detecting our effective IP address, especially in the presence of NAT). | |
7c673cae | 114 | |
7c673cae | 115 | |
11fdf7f2 TL |
116 | Authentication |
117 | -------------- | |
7c673cae | 118 | |
11fdf7f2 | 119 | * TAG_AUTH_REQUEST: client->server:: |
7c673cae | 120 | |
11fdf7f2 TL |
121 | __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...} |
122 | __le32 num_preferred_modes; | |
123 | list<__le32> mode // CEPH_CON_MODE_* | |
124 | method specific payload | |
7c673cae | 125 | |
11fdf7f2 | 126 | * TAG_AUTH_BAD_METHOD server -> client: reject client-selected auth method:: |
7c673cae FG |
127 | |
128 | __le32 method | |
11fdf7f2 TL |
129 | __le32 negative error result code |
130 | __le32 num_methods | |
131 | list<__le32> allowed_methods // CEPH_AUTH_{NONE, CEPHX, ...} | |
132 | __le32 num_modes | |
133 | list<__le32> allowed_modes // CEPH_CON_MODE_* | |
134 | ||
135 | - Returns the attempted auth method, an error code (-EOPNOTSUPP if | |
136 | the method is unsupported), and the lists of allowed authentication | |
137 | methods and connection modes. | |
7c673cae | 138 | |
11fdf7f2 | 139 | * TAG_AUTH_REPLY_MORE: server->client:: |
7c673cae FG |
140 | |
141 | __le32 len; | |
142 | method specific payload | |
143 | ||
11fdf7f2 TL |
144 | * TAG_AUTH_REQUEST_MORE: client->server:: |
145 | ||
146 | __le32 len; | |
147 | method specific payload | |
7c673cae | 148 | |
11fdf7f2 | 149 | * TAG_AUTH_DONE: (server->client):: |
7c673cae | 150 | |
11fdf7f2 TL |
151 | __le64 global_id |
152 | __le32 connection mode // CEPH_CON_MODE_* | |
153 | method specific payload | |
7c673cae | 154 | |
11fdf7f2 TL |
155 | - The server is the one to decide authentication has completed and what |
156 | the final connection mode will be. | |
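As an illustration of the length-prefixed lists used in these payloads, the following sketch encodes and decodes a TAG_AUTH_BAD_METHOD payload. It assumes the "negative error result code" is carried as a signed little-endian 32-bit value. The function names are hypothetical and the method and mode numbers used with it are placeholders rather than real CEPH_AUTH_* or CEPH_CON_MODE_* values.

```python
import errno
import struct


def _pack_list(values):
    """Encode a count (__le32) followed by that many __le32 values."""
    return struct.pack("<I", len(values)) + b"".join(
        struct.pack("<I", v) for v in values
    )


def _unpack_list(buf, off):
    """Decode a count-prefixed list of __le32; return (values, new_offset)."""
    (count,) = struct.unpack_from("<I", buf, off)
    off += 4
    values = [struct.unpack_from("<I", buf, off + 4 * i)[0] for i in range(count)]
    return values, off + 4 * count


def encode_auth_bad_method(method, result, allowed_methods, allowed_modes):
    # ASSUMPTION: the result code is a signed le32 (e.g. -EOPNOTSUPP).
    return (
        struct.pack("<Ii", method, result)
        + _pack_list(allowed_methods)
        + _pack_list(allowed_modes)
    )


def decode_auth_bad_method(buf):
    method, result = struct.unpack_from("<Ii", buf, 0)
    methods, off = _unpack_list(buf, 8)
    modes, off = _unpack_list(buf, off)
    return method, result, methods, modes
```

A client receiving this frame could pick a method from the returned list and send a new TAG_AUTH_REQUEST, as in the second diagram below.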
157 | ||
158 | ||
159 | Example of authentication phase interaction when the client uses an | |
160 | allowed authentication method: | |
161 | ||
162 | .. ditaa:: +---------+ +--------+ | |
163 | | Client | | Server | | |
164 | +---------+ +--------+ | |
165 | | auth request | | |
166 | |---------------->| | |
167 | |<----------------| | |
168 | | auth more| | |
169 | | | | |
170 | |auth more | | |
171 | |---------------->| | |
172 | |<----------------| | |
173 | | auth done| | |
174 | ||
175 | ||
176 | Example of authentication phase interaction when the client uses a forbidden | |
177 | authentication method as the first attempt: | |
178 | ||
179 | .. ditaa:: +---------+ +--------+ | |
180 | | Client | | Server | | |
181 | +---------+ +--------+ | |
182 | | auth request | | |
183 | |---------------->| | |
184 | |<----------------| | |
185 | | bad method | | |
186 | | | | |
187 | | auth request | | |
188 | |---------------->| | |
189 | |<----------------| | |
190 | | auth more| | |
191 | | | | |
192 | | auth more | | |
193 | |---------------->| | |
194 | |<----------------| | |
195 | | auth done| | |
196 | ||
197 | ||
198 | Post-auth frame format | |
199 | ---------------------- | |
7c673cae FG |
200 | |
201 | The frame format is fixed (see above), but can take three different | |
202 | forms, depending on the AUTH_DONE flags: | |
203 | ||
204 | * If neither FLAG_SIGNED nor FLAG_ENCRYPTED is specified, things are simple:: | |
205 | ||
7c673cae FG |
206 | frame_len |
207 | tag | |
208 | payload | |
209 | payload_padding (out to auth block_size) | |
210 | ||
11fdf7f2 TL |
211 | - The padding is some number of bytes < the auth block_size that |
212 | brings the total length of the payload + payload_padding to a | |
213 | multiple of block_size. It does not include the frame_len or tag. Padding | |
214 | content can be zeros or (better) random bytes. | |
215 | ||
7c673cae FG |
216 | * If FLAG_SIGNED has been specified:: |
217 | ||
7c673cae FG |
218 | frame_len |
219 | tag | |
220 | payload | |
221 | payload_padding (out to auth block_size) | |
222 | signature (sig_size bytes) | |
223 | ||
224 | Here the padding just makes life easier for the signature. It can be | |
225 | random data, which adds an additional confounder. Note also that the | |
226 | signature input must include some state from the session key and the | |
227 | previous message. | |
228 | ||
229 | * If FLAG_ENCRYPTED has been specified:: | |
230 | ||
7c673cae | 231 | frame_len |
11fdf7f2 | 232 | tag |
7c673cae | 233 | { |
7c673cae FG |
234 | payload |
235 | payload_padding (out to auth block_size) | |
236 | } ^ stream cipher | |
237 | ||
238 | Note that the padding ensures that the total frame is a multiple of | |
239 | the auth method's block_size so that the message can be sent out over | |
240 | the wire without waiting for the next frame in the stream. | |
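The padding rule shared by the three forms above (fewer than block_size bytes, bringing payload plus padding to a multiple of block_size) reduces to a small computation. This is a sketch only. The function names are hypothetical, and the block sizes passed in are arbitrary examples rather than values taken from a real auth method.

```python
import os


def padding_len(payload_len: int, block_size: int) -> int:
    """Bytes of padding (always < block_size) needed to reach a multiple
    of block_size. frame_len and tag are not counted."""
    return (-payload_len) % block_size


def pad_payload(payload: bytes, block_size: int, random_fill: bool = True) -> bytes:
    """Append padding: random bytes (preferred) or zeros."""
    n = padding_len(len(payload), block_size)
    fill = os.urandom(n) if random_fill else bytes(n)
    return payload + fill
```

For example, a 13-byte payload with a 16-byte block size gets 3 bytes of padding, and a payload that is already block-aligned gets none.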
241 | ||
11fdf7f2 | 242 | |
7c673cae FG |
243 | Message flow handshake |
244 | ---------------------- | |
245 | ||
246 | In this phase the peers identify each other and (if desired) reconnect to | |
247 | an established session. | |
248 | ||
11fdf7f2 | 249 | * TAG_CLIENT_IDENT (client->server): identify ourselves:: |
7c673cae | 250 | |
11fdf7f2 TL |
251 | __le32 num_addrs |
252 | entity_addrvec_t*num_addrs entity addrs | |
253 | entity_addr_t target entity addr | |
254 | __le64 gid (numeric part of osd.0, client.123456, ...) | |
255 | __le64 global_seq | |
7c673cae FG |
256 | __le64 features supported (CEPH_FEATURE_* bitmask) |
257 | __le64 features required (CEPH_FEATURE_* bitmask) | |
258 | __le64 flags (CEPH_MSG_CONNECT_* bitmask) | |
11fdf7f2 | 259 | __le64 cookie |
7c673cae | 260 | |
11fdf7f2 TL |
261 | - The client sends first, and the server replies with its own ident (TAG_SERVER_IDENT). If this is a |
262 | new session, the client and server can proceed to the message exchange. | |
263 | - The target addr is who the client is trying to connect *to*, so | |
264 | that the server side can close the connection if the client is | |
265 | talking to the wrong daemon. | |
266 | - type.gid (entity_name_t) is set here, by combining the type shared in the hello | |
267 | frame with the gid here. This means we don't need it | |
268 | in the header of every message. It also means that we can't send | |
269 | messages "from" other entity_name_t's. The current | |
270 | implementations set this at the top of _send_message etc., so this | |
271 | shouldn't break any existing functionality. Implementations will | |
272 | likely want to mask this against what the authenticated credential | |
273 | allows. | |
274 | - cookie is the client cookie used to identify a session, and can be used | |
275 | to reconnect to an existing session. | |
276 | - We've dropped the 'protocol_version' field from msgr1. | |
277 | ||
278 | * TAG_IDENT_MISSING_FEATURES (server->client): complain about a TAG_CLIENT_IDENT | |
279 | with too few features:: | |
280 | ||
281 | __le64 features we require that the peer didn't advertise | |
282 | ||
283 | * TAG_SERVER_IDENT (server->client): accept client ident and identify server:: | |
284 | ||
285 | __le32 num_addrs | |
286 | entity_addrvec_t*num_addrs entity addrs | |
287 | __le64 gid (numeric part of osd.0, client.123456, ...) | |
288 | __le64 global_seq | |
289 | __le64 features supported (CEPH_FEATURE_* bitmask) | |
290 | __le64 features required (CEPH_FEATURE_* bitmask) | |
291 | __le64 flags (CEPH_MSG_CONNECT_* bitmask) | |
292 | __le64 cookie | |
7c673cae | 293 | |
11fdf7f2 TL |
294 | - The server cookie can be used by the client if it is later disconnected |
295 | and wants to reconnect and resume the session. | |
7c673cae | 296 | |
11fdf7f2 | 297 | * TAG_RECONNECT (client->server): reconnect to an established session:: |
7c673cae | 298 | |
11fdf7f2 TL |
299 | __le32 num_addrs |
300 | entity_addr_t * num_addrs | |
301 | __le64 client_cookie | |
302 | __le64 server_cookie | |
7c673cae FG |
303 | __le64 global_seq |
304 | __le64 connect_seq | |
305 | __le64 msg_seq (the last msg seq received) | |
306 | ||
11fdf7f2 | 307 | * TAG_RECONNECT_OK (server->client): acknowledge a reconnect attempt:: |
7c673cae FG |
308 | |
309 | __le64 msg_seq (last msg seq received) | |
310 | ||
11fdf7f2 TL |
311 | - Once the client receives this, the client can proceed to message exchange. |
312 | - Once the server sends this, the server can proceed to message exchange. | |
313 | ||
7c673cae FG |
314 | * TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq |
315 | ||
316 | * TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq | |
317 | ||
318 | * TAG_RECONNECT_WAIT (server only): fail reconnect due to connect race. | |
319 | ||
320 | - Indicates that the server is already connecting to the client, and | |
321 | that direction should win the race. The client should wait for that | |
322 | connection to complete. | |
323 | ||
11fdf7f2 TL |
324 | * TAG_RESET_SESSION (server only): ask client to reset session:: |
325 | ||
326 | __u8 full | |
327 | ||
328 | - The full flag indicates whether the peer should do a full reset, i.e., drop | |
329 | its message queue. | |
330 | ||
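One way to picture how a server might choose among the reconnect replies listed above is the sketch below. It is speculative: the order of the checks, the reading of "stale" as "lower than the value the server has recorded", and the `Session` structure with its `connecting_out` flag are all assumptions made for illustration, not behavior stated in this document.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Session:
    """Hypothetical server-side record of an established session."""
    client_cookie: int
    global_seq: int
    connect_seq: int
    connecting_out: bool = False  # server is already connecting to the client


def reconnect_reply(sessions: Dict[int, Session], server_cookie: int,
                    client_cookie: int, global_seq: int, connect_seq: int) -> str:
    """Pick the tag to send in response to a TAG_RECONNECT (assumed ordering)."""
    session = sessions.get(server_cookie)
    if session is None or session.client_cookie != client_cookie:
        # The server no longer knows this session (e.g. it was reset).
        return "TAG_RESET_SESSION"
    if global_seq < session.global_seq:
        return "TAG_RECONNECT_RETRY_GLOBAL"
    if session.connecting_out:
        # Connect race: the server-to-client direction should win.
        return "TAG_RECONNECT_WAIT"
    if connect_seq < session.connect_seq:
        return "TAG_RECONNECT_RETRY_SESSION"
    return "TAG_RECONNECT_OK"
```

The `sessions` mapping is keyed by server cookie here only for brevity; a real implementation may index sessions differently.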
331 | ||
332 | Example of failure scenarios: | |
333 | ||
334 | * The client's first client_ident message is lost, and then the client reconnects. | |
335 | ||
336 | .. ditaa:: +---------+ +--------+ | |
337 | | Client | | Server | | |
338 | +---------+ +--------+ | |
339 | | | | |
340 | c_cookie(a) | client_ident(a) | | |
341 | |-------------X | | |
342 | | | | |
343 | | client_ident(a) | | |
344 | |-------------------->| | |
345 | |<--------------------| | |
346 | | server_ident(b) | s_cookie(b) | |
347 | | | | |
348 | | session established | | |
349 | | | | |
350 | ||
351 | ||
352 | * The server's server_ident message is lost, and then the client reconnects. | |
353 | ||
354 | .. ditaa:: +---------+ +--------+ | |
355 | | Client | | Server | | |
356 | +---------+ +--------+ | |
357 | | | | |
358 | c_cookie(a) | client_ident(a) | | |
359 | |-------------------->| | |
360 | | X------------| | |
361 | | server_ident(b) | s_cookie(b) | |
362 | | | | |
363 | | | | |
364 | | client_ident(a) | | |
365 | |-------------------->| | |
366 | |<--------------------| | |
367 | | server_ident(c) | s_cookie(c) | |
368 | | | | |
369 | | session established | | |
370 | | | | |
371 | ||
372 | ||
373 | * The server's server_ident message is lost, and then the server reconnects. | |
374 | ||
375 | .. ditaa:: +---------+ +--------+ | |
376 | | Client | | Server | | |
377 | +---------+ +--------+ | |
378 | | | | |
379 | c_cookie(a) | client_ident(a) | | |
380 | |-------------------->| | |
381 | | X------------| | |
382 | | server_ident(b) | s_cookie(b) | |
383 | | | | |
384 | | | | |
385 | | reconnect(a, b) | | |
386 | |<--------------------| | |
387 | |-------------------->| | |
388 | | reset_session(F) | | |
389 | | | | |
390 | | client_ident(a) | c_cookie(a) | |
391 | |<--------------------| | |
392 | |-------------------->| | |
393 | s_cookie(c) | server_ident(c) | | |
394 | | | | |
395 | ||
396 | ||
397 | * The connection fails after the session is established, and then the client reconnects. | |
398 | ||
399 | .. ditaa:: +---------+ +--------+ | |
400 | | Client | | Server | | |
401 | +---------+ +--------+ | |
402 | | | | |
403 | c_cookie(a) | session established | s_cookie(b) | |
404 | |<------------------->| | |
405 | | X------------| | |
406 | | | | |
407 | | reconnect(a, b) | | |
408 | |-------------------->| | |
409 | |<--------------------| | |
410 | | reconnect_ok | | |
411 | | | | |
412 | ||
413 | ||
414 | * The connection fails after the session is established because the server was reset, | |
415 | and then the client reconnects. | |
416 | ||
417 | .. ditaa:: +---------+ +--------+ | |
418 | | Client | | Server | | |
419 | +---------+ +--------+ | |
420 | | | | |
421 | c_cookie(a) | session established | s_cookie(b) | |
422 | |<------------------->| | |
423 | | X------------| reset | |
424 | | | | |
425 | | reconnect(a, b) | | |
426 | |-------------------->| | |
427 | |<--------------------| | |
428 | | reset_session(RC*) | | |
429 | | | | |
430 | c_cookie(c) | client_ident(c) | | |
431 | |-------------------->| | |
432 | |<--------------------| | |
433 | | server_ident(d) | s_cookie(d) | |
434 | | | | |
435 | ||
436 | RC* means that the reset session full flag depends on the policy.resetcheck | |
437 | of the connection. | |
438 | ||
439 | ||
440 | * The connection fails after the session is established because the client was reset, | |
441 | and then the client reconnects. | |
442 | ||
443 | .. ditaa:: +---------+ +--------+ | |
444 | | Client | | Server | | |
445 | +---------+ +--------+ | |
446 | | | | |
447 | c_cookie(a) | session established | s_cookie(b) | |
448 | |<------------------->| | |
449 | reset | X------------| | |
450 | | | | |
451 | c_cookie(c) | client_ident(c) | | |
452 | |-------------------->| | |
453 | |<--------------------| reset if policy.resetcheck | |
454 | | server_ident(d) | s_cookie(d) | |
455 | | | | |
456 | ||
457 | ||
7c673cae FG |
458 | Message exchange |
459 | ---------------- | |
460 | ||
11fdf7f2 | 461 | Once a session is established, we can exchange messages. |
7c673cae FG |
462 | |
463 | * TAG_MSG: a message:: | |
464 | ||
465 | ceph_msg_header2 | |
466 | front | |
467 | middle | |
11fdf7f2 | 468 | data_pre_padding |
7c673cae FG |
469 | data |
470 | ||
11fdf7f2 TL |
471 | - The ceph_msg_header2 is modified from ceph_msg_header: |
472 | * includes an ack_seq. This avoids the need for a TAG_ACK | |
473 | message most of the time. | |
474 | * removes the src field, which we now get from the message flow | |
475 | handshake (TAG_IDENT). | |
476 | * specifies the data_pre_padding length, which can be used to | |
477 | adjust the alignment of the data payload. (NOTE: is this | |
478 | useful?) | |
7c673cae FG |
479 | |
480 | * TAG_ACK: acknowledge receipt of message(s):: | |
481 | ||
482 | __le64 seq | |
483 | ||
484 | - This is only used for stateful sessions. | |
485 | ||
486 | * TAG_KEEPALIVE2: check for connection liveness:: | |
487 | ||
488 | ceph_timespec stamp | |
489 | ||
490 | - Time stamp is local to sender. | |
491 | ||
492 | * TAG_KEEPALIVE2_ACK: reply to a keepalive2:: | |
493 | ||
494 | ceph_timespec stamp | |
495 | ||
496 | - Time stamp is from the TAG_KEEPALIVE2 we are responding to. | |
497 | ||
11fdf7f2 TL |
498 | * TAG_CLOSE: terminate a connection |
499 | ||
500 | Indicates that a connection should be terminated. This is equivalent | |
501 | to a hangup or reset (i.e., should trigger ms_handle_reset). It | |
502 | isn't strictly necessary or useful as we could just disconnect the | |
503 | TCP connection. | |
504 | ||
505 | ||
506 | Example of protocol interaction (WIP) | |
507 | _____________________________________ | |
508 | ||
509 | ||
510 | .. ditaa:: +---------+ +--------+ | |
511 | | Client | | Server | | |
512 | +---------+ +--------+ | |
513 | | send banner | | |
514 | |----+ +------| | |
515 | | | | | | |
516 | | +-------+----->| | |
517 | | send banner| | | |
518 | |<-----------+ | | |
519 | | | | |
520 | | send new stream | | |
521 | |------------------>| | |
522 | | auth request | | |
523 | |------------------>| | |
524 | |<------------------| | |
525 | | bad method | | |
526 | | | | |
527 | | auth request | | |
528 | |------------------>| | |
529 | |<------------------| | |
530 | | auth more | | |
531 | | | | |
532 | | auth more | | |
533 | |------------------>| | |
534 | |<------------------| | |
535 | | auth done | | |
536 | | | | |
537 | ||
7c673cae | 538 |