msgr2 protocol (msgr2.0 and msgr2.1)
====================================
6 This is a revision of the legacy Ceph on-wire protocol that was
7 implemented by the SimpleMessenger. It addresses performance and
13 This protocol revision has several goals relative to the original protocol:
* *Flexible handshaking*. The original protocol's negotiation was not
  flexible enough to accommodate optional features, i.e. features that
  are supported but not required.
18 * *Encryption*. We will incorporate encryption over the wire.
19 * *Performance*. We would like to provide for protocol features
20 (e.g., padding) that keep computation and memory copies out of the
21 fast path where possible.
22 * *Signing*. We will allow for traffic to be signed (but not
23 necessarily encrypted). This is not implemented.
28 * *client* (C): the party initiating a (TCP) connection
29 * *server* (S): the party accepting a (TCP) connection
30 * *connection*: an instance of a (TCP) connection between two processes.
* *entity*: a Ceph entity instantiation, e.g. 'osd.0'. Each entity
  has one or more unique entity_addr_t's by virtue of the 'nonce'
  field, which is typically a pid or random value.
34 * *session*: a stateful session between two entities in which message
35 exchange is ordered and lossless. A session might span multiple
36 connections if there is an interruption (TCP connection disconnect).
37 * *frame*: a discrete message sent between the peers. Each frame
38 consists of a tag (type code), payload, and (if signing
39 or encryption is enabled) some other fields. See below for the
41 * *tag*: a type code associated with a frame. The tag
42 determines the structure of the payload.
A connection has four distinct phases:

#. banner
#. authentication frame exchange
#. message flow handshake frame exchange
#. message frame exchange
57 Both the client and server, upon connecting, send a banner::
60 __le16 banner payload length
63 A banner payload has the form::
65 __le64 peer_supported_features
66 __le64 peer_required_features
68 This is a new, distinct feature bit namespace (CEPH_MSGR2_*).
69 Currently, only CEPH_MSGR2_FEATURE_REVISION_1 is defined. It is
70 supported but not required, so that msgr2.0 and msgr2.1 peers
71 can talk to each other.
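
The banner payload and the feature check can be sketched as follows. This is
only an illustration: the bit position of CEPH_MSGR2_FEATURE_REVISION_1 and
the helper names are assumptions, not taken from this document.

```python
import struct

# Assumption: REVISION_1 occupies bit 0 of the CEPH_MSGR2_* namespace.
CEPH_MSGR2_FEATURE_REVISION_1 = 1 << 0


def encode_banner_payload(supported, required):
    """Pack peer_supported_features and peer_required_features as two __le64."""
    return struct.pack("<QQ", supported, required)


def decode_banner_payload(payload):
    """Unpack a 16-byte banner payload into (supported, required)."""
    supported, required = struct.unpack("<QQ", payload)
    return supported, required


def missing_features(our_supported, peer_required):
    """Bits the peer requires that we do not support (0 means compatible)."""
    return peer_required & ~our_supported


def negotiated_revision(our_supported, peer_supported):
    """Hypothetical helper: 1 (msgr2.1) only if both sides advertise REVISION_1."""
    both = our_supported & peer_supported
    return 1 if both & CEPH_MSGR2_FEATURE_REVISION_1 else 0
```

Because REVISION_1 is advertised as supported but never as required,
``missing_features`` stays zero between a msgr2.0 peer and a msgr2.1 peer,
and the pair simply falls back to revision 0.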
73 If the remote party advertises required features we don't support, we
77 .. ditaa:: +---------+ +--------+
79 +---------+ +--------+
91 After the banners are exchanged, all further communication happens
92 in frames. The exact format of the frame depends on the connection
93 mode (msgr2.0-crc, msgr2.0-secure, msgr2.1-crc or msgr2.1-secure).
94 All connections start in crc mode (either msgr2.0-crc or msgr2.1-crc,
95 depending on peer_supported_features from the banner).
97 Each frame has a 32-byte preamble::
100 __u8 number of segments
102 __le32 segment length
103 __le16 segment alignment
108 An empty frame has one empty segment. A non-empty frame can have
109 between one and four segments, all segments except the last may be
If there are fewer than four segments, the unused (trailing) segment
length and segment alignment fields are zeroed.
115 The reserved bytes are zeroed.
117 The preamble checksum is CRC32-C. It covers everything up to
118 itself (28 bytes) and is calculated and verified irrespective of
119 the connection mode (i.e. even if the frame is encrypted).
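
As a rough sketch, the preamble could be assembled and verified like this.
The field order (tag byte first, two zeroed bytes before the crc), the crc
seed of 0 and the absence of a final inversion are assumptions made for the
example; only the 32-byte size, the segment fields and the 28-byte crc
coverage come from the text above.

```python
import struct


def crc32c(data, seed=0):
    """Bitwise CRC32-C (Castagnoli) update, no final inversion (assumed)."""
    crc = seed & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc


def build_preamble(tag, segments):
    """segments: list of one to four (length, alignment) pairs."""
    if not 1 <= len(segments) <= 4:
        raise ValueError("a frame has between one and four segments")
    body = struct.pack("<BB", tag, len(segments))
    for i in range(4):
        # Unused (trailing) segment fields are zeroed.
        length, align = segments[i] if i < len(segments) else (0, 0)
        body += struct.pack("<IH", length, align)
    body += b"\x00\x00"  # assumed: two zeroed bytes preceding the crc
    assert len(body) == 28
    return body + struct.pack("<I", crc32c(body))


def verify_preamble(preamble):
    """Check the size and the crc over the first 28 bytes."""
    if len(preamble) != 32:
        return False
    (stored,) = struct.unpack("<I", preamble[28:])
    return stored == crc32c(preamble[:28])
```

A receiver would run ``verify_preamble`` before trusting any segment length,
regardless of the connection mode.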
123 A msgr2.0-crc frame has the form::
128 } * number of segments
138 late_flags is used for frame abortion. After transmitting the
139 preamble and the first segment, the sender can fill the remaining
140 segments with zeros and set a flag to indicate that the receiver must
141 drop the frame. This allows the sender to avoid extra buffering
142 when a frame that is being put on the wire is revoked (i.e. yanked
143 out of the messenger): payload buffers can be unpinned and handed
144 back to the user immediately, without making a copy or blocking
145 until the whole frame is transmitted. Currently this is used only
146 by the kernel client, see ceph_msg_revoke().
148 The segment checksum is CRC32-C. For "used" empty segments, it is
149 set to (__le32)-1. For unused (trailing) segments, it is zeroed.
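
The three cases for the segment checksum can be illustrated with a short
sketch. It assumes the crc is seeded with all ones and not inverted at the
end; that convention is not stated in this document, but it happens to
yield (__le32)-1 for an empty segment without special-casing it.

```python
def crc32c(data, seed=0xFFFFFFFF):
    """Bitwise CRC32-C (Castagnoli) update, no final inversion (assumed)."""
    crc = seed & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc


def segment_crcs(segments):
    """Return the four crc slots for a list of one to four segment payloads."""
    if not 1 <= len(segments) <= 4:
        raise ValueError("a frame has between one and four segments")
    # Used segments: an empty payload leaves the all-ones seed untouched.
    crcs = [crc32c(payload) for payload in segments]
    # Unused (trailing) segments: zeroed.
    crcs += [0] * (4 - len(segments))
    return crcs
```

For a 20+70+0+350 frame the third slot would therefore be 0xFFFFFFFF, while
for a 20+70 frame the third and fourth slots would both be zero.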
151 The crcs are calculated just to protect against bit errors.
152 No authenticity guarantees are provided, unlike in msgr1 which
153 attempted to provide some authenticity guarantee by optionally
154 signing segment lengths and crcs with the session key.
158 1. As part of introducing a structure for a generic frame with
159 variable number of segments suitable for both control and
160 message frames, msgr2.0 moved the crc of the first segment of
161 the message frame (ceph_msg_header2) into the epilogue.
163 As a result, ceph_msg_header2 can no longer be safely
164 interpreted before the whole frame is read off the wire.
165 This is a regression from msgr1, because in order to scatter
166 the payload directly into user-provided buffers and thus avoid
167 extra buffering and copying when receiving message frames,
168 ceph_msg_header2 must be available in advance -- it stores
169 the transaction id which the user buffers are keyed on.
170 The implementation has to choose between forgoing this
171 optimization or acting on an unverified segment.
173 2. late_flags is not covered by any crc. Since it stores the
174 abort flag, a single bit flip can result in a completed frame
175 being dropped (causing the sender to hang waiting for a reply)
176 or, worse, in an aborted frame with garbage segment payloads
179 This was the case with msgr1 and got carried over to msgr2.0.
183 Differences from msgr2.0-crc:
185 1. The crc of the first segment is stored at the end of the
186 first segment, not in the epilogue. The epilogue stores up to
187 three crcs, not up to four.
189 If the first segment is empty, (__le32)-1 crc is not generated.
191 2. The epilogue is generated only if the frame has more than one
192 segment (i.e. at least one of second to fourth segments is not
193 empty). Rationale: If the frame has only one segment, it cannot
194 be aborted and there are no crcs to store in the epilogue.
196 3. Unchecksummed late_flags is replaced with late_status which
197 builds in bit error detection by using a 4-bit nibble per flag
198 and two code words that are Hamming Distance = 4 apart (and not
199 all zeros or ones). This comes at the expense of having only
200 one reserved flag, of course.
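
To illustrate the late_status encoding, here is a minimal sketch. The two
code words (0x1 for aborted, 0xe for complete) and the use of the low nibble
are assumptions chosen to satisfy the properties listed above; the document
itself does not give the values.

```python
# Assumed code words: Hamming distance 4, neither all zeros nor all ones.
LATE_STATUS_ABORTED = 0x1
LATE_STATUS_COMPLETE = 0xE
LATE_STATUS_MASK = 0x0F  # assumed: the abort flag lives in the low nibble


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def decode_late_status(byte):
    """Return True if aborted, False if complete; raise on a corrupted nibble."""
    code = byte & LATE_STATUS_MASK
    if code == LATE_STATUS_ABORTED:
        return True
    if code == LATE_STATUS_COMPLETE:
        return False
    raise ValueError("late_status nibble is not a valid code word: %#x" % code)
```

With a distance of 4 between the code words, any one, two or three bit flips
in the nibble produce an invalid value, so the receiver can fail the
connection instead of silently dropping or accepting the frame.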
204 * A 0+0+0+0 frame (empty, no epilogue)::
208 * A 20+0+0+0 frame (no epilogue)::
211 segment1 payload (20 bytes)
217 segment2 payload (70 bytes)
220 * A 20+70+0+350 frame::
223 segment1 payload (20 bytes)
225 segment2 payload (70 bytes)
226 segment4 payload (350 bytes)
239 * TAG_HELLO: client->server and server->client::
242 entity_addr_t peer_socket_address
244 - We immediately share our entity type and the address of the peer (which can be useful
245 for detecting our effective IP address, especially in the presence of NAT).
251 * TAG_AUTH_REQUEST: client->server::
253 __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...}
254 __le32 num_preferred_modes;
255 list<__le32> mode // CEPH_CON_MODE_*
256 method specific payload
258 * TAG_AUTH_BAD_METHOD server -> client: reject client-selected auth method::
261 __le32 negative error result code
263 list<__le32> allowed_methods // CEPH_AUTH_{NONE, CEPHX, ...}
265 list<__le32> allowed_modes // CEPH_CON_MODE_*
  - Returns the attempted auth method, an error code (-EOPNOTSUPP if
268 the method is unsupported), and the list of allowed authentication
271 * TAG_AUTH_REPLY_MORE: server->client::
274 method specific payload
276 * TAG_AUTH_REQUEST_MORE: client->server::
279 method specific payload
281 * TAG_AUTH_DONE: (server->client)::
284 __le32 connection mode // CEPH_CON_MODE_*
285 method specific payload
287 - The server is the one to decide authentication has completed and what
288 the final connection mode will be.
291 Example of authentication phase interaction when the client uses an
292 allowed authentication method:
294 .. ditaa:: +---------+ +--------+
295 | Client | | Server |
296 +---------+ +--------+
308 Example of authentication phase interaction when the client uses a forbidden
309 authentication method as the first attempt:
311 .. ditaa:: +---------+ +--------+
312 | Client | | Server |
313 +---------+ +--------+

Post-auth frame format
----------------------

333 Depending on the negotiated connection mode from TAG_AUTH_DONE, the
334 connection either stays in crc mode or switches to the corresponding
335 secure mode (msgr2.0-secure or msgr2.1-secure).

msgr2.0-secure mode
___________________

339 A msgr2.0-secure frame has the form::
345 zero padding (out to 16 bytes)
346 } * number of segments
348 } ^ AES-128-GCM cipher
354 zero padding (15 bytes)
356 late_flags has the same meaning as in msgr2.0-crc mode.
358 Each segment and the epilogue are zero padded out to 16 bytes.
Technically, GCM doesn't require any padding because Counter mode
(the C in GCM) essentially turns a block cipher into a stream cipher.
However, if the overall input length is not a multiple of 16 bytes, some
implicit zero padding occurs internally because the GHASH function
used by GCM for generating auth tags only works on 16-byte blocks.
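
The amount of zero padding follows directly from the 16-byte block size. A
small sketch (the helper names are made up for the example):

```python
BLOCK = 16  # GHASH block size in bytes


def pad_len(length, block=BLOCK):
    """Number of zero bytes needed to reach the next multiple of block."""
    return -length % block


def zero_pad(data, block=BLOCK):
    """Return data followed by its zero padding."""
    return data + bytes(pad_len(len(data), block))
```

For instance, a 70-byte segment gets 10 bytes of padding, a 350-byte segment
gets 2, and the 1-byte late_flags in the epilogue gets 15.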
367 1. The sender encrypts the whole frame using a single nonce
368 and generating a single auth tag. Because segment lengths are
369 stored in the preamble, the receiver has no choice but to decrypt
370 and interpret the preamble without verifying the auth tag -- it
371 can't even tell how much to read off the wire to get the auth tag
372 otherwise! This creates a decryption oracle, which, in conjunction
373 with Counter mode malleability, could lead to recovery of sensitive
376 This issue extends to the first segment of the message frame as
377 well. As in msgr2.0-crc mode, ceph_msg_header2 cannot be safely
378 interpreted before the whole frame is read off the wire.
380 2. Deterministic nonce construction with a 4-byte counter field
381 followed by an 8-byte fixed field is used. The initial values are
382 taken from the connection secret -- a random byte string generated
383 during the authentication phase. Because the counter field is
384 only four bytes long, it can wrap and then repeat in under a day,
385 leading to GCM nonce reuse and therefore a potential complete
386 loss of both authenticity and confidentiality for the connection.
387 This was addressed by disconnecting before the counter repeats

msgr2.1-secure mode
___________________

392 Differences from msgr2.0-secure:
394 1. The preamble, the first segment and the rest of the frame are
395 encrypted separately, using separate nonces and generating
396 separate auth tags. This gets rid of unverified plaintext use
397 and keeps msgr2.1-secure mode close to msgr2.1-crc mode, allowing
398 the implementation to receive message frames in a similar fashion
399 (little to no buffering, same scatter/gather logic, etc).
401 In order to reduce the number of en/decryption operations per
402 frame, the preamble is grown by a fixed size inline buffer (48
403 bytes) that the first segment is inlined into, either fully or
404 partially. The preamble auth tag covers both the preamble and the
405 inline buffer, so if the first segment is small enough to be fully
406 inlined, it becomes available after a single decryption operation.
408 2. As in msgr2.1-crc mode, the epilogue is generated only if the
409 frame has more than one segment. The rationale is even stronger,
410 as it would require an extra en/decryption operation.
412 3. For consistency with msgr2.1-crc mode, late_flags is replaced
413 with late_status (the built-in bit error detection isn't really
414 needed in secure mode).
416 4. In accordance with `NIST Recommendation for GCM`_, deterministic
417 nonce construction with a 4-byte fixed field followed by an 8-byte
418 counter field is used. An 8-byte counter field should never repeat
419 but the nonce reuse protection put in place for msgr2.0-secure mode
422 The initial values are the same as in msgr2.0-secure mode.
424 .. _`NIST Recommendation for GCM`: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf
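
The two nonce layouts can be compared with a small sketch. Little-endian
encoding of the counter and the helper names are assumptions made for the
example; the field sizes and their order come from the descriptions above.

```python
import struct


def nonce_msgr20(counter, fixed):
    """msgr2.0-secure: 4-byte counter followed by an 8-byte fixed field."""
    if len(fixed) != 8:
        raise ValueError("fixed field must be 8 bytes")
    return struct.pack("<I", counter & 0xFFFFFFFF) + fixed


def nonce_msgr21(fixed, counter):
    """msgr2.1-secure: 4-byte fixed field followed by an 8-byte counter."""
    if len(fixed) != 4:
        raise ValueError("fixed field must be 4 bytes")
    return fixed + struct.pack("<Q", counter & 0xFFFFFFFFFFFFFFFF)


def seconds_until_wrap(frames_per_second, counter_bits=32):
    """Time for a counter to run through all of its values at a steady rate."""
    return 2 ** counter_bits / frames_per_second
```

At a hypothetical rate of 50,000 frames per second a 32-bit counter is
exhausted in about 85,900 seconds, just under a day, whereas a 64-bit
counter at the same rate would last for millions of years.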
426 As in msgr2.0-secure mode, each segment is zero padded out to
427 16 bytes. If the first segment is fully inlined, its padding goes
428 to the inline buffer. Otherwise, the padding is on the remainder.
429 The corollary to this is that the inline buffer is consumed in
432 The unused portion of the inline buffer is zeroed.
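
The inlining and padding rules for the first segment can be sketched as
follows (a minimal illustration with made-up helper names):

```python
INLINE_SIZE = 48  # size of the inline buffer in bytes
BLOCK = 16        # padding granularity in bytes


def first_segment_layout(seg1_len):
    """Split the first segment between the inline buffer and the remainder.

    Returns (inlined, inline_padding, remainder, remainder_padding), all in
    bytes. The inline buffer is always 48 bytes on the wire; whatever the
    first segment does not fill is zero padding.
    """
    inlined = min(seg1_len, INLINE_SIZE)
    inline_padding = INLINE_SIZE - inlined
    remainder = seg1_len - inlined
    # Only a remainder, if any, is padded out to 16 bytes.
    remainder_padding = -remainder % BLOCK
    return inlined, inline_padding, remainder, remainder_padding
```

A 20-byte first segment is fully inlined with 28 bytes of padding, while a
105-byte first segment fills the inline buffer and leaves a 57-byte
remainder padded with 7 bytes.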
436 * A 0+0+0+0 frame (empty, nothing to inline, no epilogue)::
440 zero padding (48 bytes)
441 } ^ AES-128-GCM cipher
444 * A 20+0+0+0 frame (first segment fully inlined, no epilogue)::
448 segment1 payload (20 bytes)
449 zero padding (28 bytes)
450 } ^ AES-128-GCM cipher
453 * A 0+70+0+0 frame (nothing to inline)::
457 zero padding (48 bytes)
458 } ^ AES-128-GCM cipher
461 segment2 payload (70 bytes)
462 zero padding (10 bytes)
464 } ^ AES-128-GCM cipher
467 * A 20+70+0+350 frame (first segment fully inlined)::
471 segment1 payload (20 bytes)
472 zero padding (28 bytes)
473 } ^ AES-128-GCM cipher
476 segment2 payload (70 bytes)
477 zero padding (10 bytes)
478 segment4 payload (350 bytes)
479 zero padding (2 bytes)
481 } ^ AES-128-GCM cipher
484 * A 105+0+0+0 frame (first segment partially inlined, no epilogue)::
488 segment1 payload (48 bytes)
489 } ^ AES-128-GCM cipher
492 segment1 payload remainder (57 bytes)
493 zero padding (7 bytes)
494 } ^ AES-128-GCM cipher
497 * A 105+70+0+350 frame (first segment partially inlined)::
501 segment1 payload (48 bytes)
502 } ^ AES-128-GCM cipher
505 segment1 payload remainder (57 bytes)
506 zero padding (7 bytes)
507 } ^ AES-128-GCM cipher
510 segment2 payload (70 bytes)
511 zero padding (10 bytes)
512 segment4 payload (350 bytes)
513 zero padding (2 bytes)
515 } ^ AES-128-GCM cipher
521 zero padding (15 bytes)
523 late_status has the same meaning as in msgr2.1-crc mode.

Message flow handshake
----------------------

528 In this phase the peers identify each other and (if desired) reconnect to
529 an established session.
531 * TAG_CLIENT_IDENT (client->server): identify ourselves::
534 entity_addrvec_t*num_addrs entity addrs
535 entity_addr_t target entity addr
536 __le64 gid (numeric part of osd.0, client.123456, ...)
538 __le64 features supported (CEPH_FEATURE_* bitmask)
539 __le64 features required (CEPH_FEATURE_* bitmask)
540 __le64 flags (CEPH_MSG_CONNECT_* bitmask)
  - The client sends first and the server replies in kind (with
    TAG_SERVER_IDENT). If this is a new session, the client and server can
    then proceed to the message exchange.
545 - the target addr is who the client is trying to connect *to*, so
546 that the server side can close the connection if the client is
547 talking to the wrong daemon.
  - type.gid (entity_name_t) is set here by combining the type shared in the
    hello frame with the gid sent here. This means we don't need it in the
    header of every message. It also means that we can't send messages
    "from" other entity_name_t's. The current implementations set this at
    the top of _send_message etc., so this shouldn't break any existing
    functionality. Implementations will
554 likely want to mask this against what the authenticated credential
  - cookie is the client cookie used to identify a session; it can be used
    to reconnect to an existing session.
558 - we've dropped the 'protocol_version' field from msgr1
560 * TAG_IDENT_MISSING_FEATURES (server->client): complain about a TAG_IDENT
561 with too few features::
563 __le64 features we require that the peer didn't advertise
565 * TAG_SERVER_IDENT (server->client): accept client ident and identify server::
568 entity_addrvec_t*num_addrs entity addrs
569 __le64 gid (numeric part of osd.0, client.123456, ...)
571 __le64 features supported (CEPH_FEATURE_* bitmask)
572 __le64 features required (CEPH_FEATURE_* bitmask)
573 __le64 flags (CEPH_MSG_CONNECT_* bitmask)
576 - The server cookie can be used by the client if it is later disconnected
577 and wants to reconnect and resume the session.
579 * TAG_RECONNECT (client->server): reconnect to an established session::
582 entity_addr_t * num_addrs
587 __le64 msg_seq (the last msg seq received)
589 * TAG_RECONNECT_OK (server->client): acknowledge a reconnect attempt::
591 __le64 msg_seq (last msg seq received)
593 - once the client receives this, the client can proceed to message exchange.
594 - once the server sends this, the server can proceed to message exchange.
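
One plausible use of the exchanged msg_seq values, sketched under the
assumption that each side keeps its unacknowledged messages in order: after
a reconnect, everything the peer reports as received is dropped and the rest
is sent again, which keeps the session ordered and lossless. The function
name and the (seq, message) representation are hypothetical.

```python
from collections import deque


def requeue_after_reconnect(sent, peer_msg_seq):
    """Return the messages that still need to be resent, in order.

    sent: iterable of (seq, message) pairs in send order.
    peer_msg_seq: the last msg seq the peer reports having received.
    """
    return deque((seq, msg) for seq, msg in sent if seq > peer_msg_seq)
```

For example, with messages 1 to 3 outstanding and a peer msg_seq of 2, only
message 3 would be queued for retransmission.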
596 * TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq
598 * TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq
600 * TAG_RECONNECT_WAIT (server only): fail reconnect due to connect race.
602 - Indicates that the server is already connecting to the client, and
603 that direction should win the race. The client should wait for that
604 connection to complete.
606 * TAG_RESET_SESSION (server only): ask client to reset session::
610 - full flag indicates whether peer should do a full reset, i.e., drop
614 Example of failure scenarios:
* The client's first client_ident message is lost, and then the client reconnects.
618 .. ditaa:: +---------+ +--------+
619 | Client | | Server |
620 +---------+ +--------+
622 c_cookie(a) | client_ident(a) |
626 |-------------------->|
627 |<--------------------|
628 | server_ident(b) | s_cookie(b)
630 | session established |
634 * Server's server_ident message is lost, and then client reconnects.
636 .. ditaa:: +---------+ +--------+
637 | Client | | Server |
638 +---------+ +--------+
640 c_cookie(a) | client_ident(a) |
641 |-------------------->|
643 | server_ident(b) | s_cookie(b)
647 |-------------------->|
648 |<--------------------|
649 | server_ident(c) | s_cookie(c)
651 | session established |
655 * Server's server_ident message is lost, and then server reconnects.
657 .. ditaa:: +---------+ +--------+
658 | Client | | Server |
659 +---------+ +--------+
661 c_cookie(a) | client_ident(a) |
662 |-------------------->|
664 | server_ident(b) | s_cookie(b)
668 |<--------------------|
669 |-------------------->|
672 | client_ident(a) | c_cookie(a)
673 |<--------------------|
674 |-------------------->|
675 s_cookie(c) | server_ident(c) |
679 * Connection failure after session is established, and then client reconnects.
681 .. ditaa:: +---------+ +--------+
682 | Client | | Server |
683 +---------+ +--------+
685 c_cookie(a) | session established | s_cookie(b)
686 |<------------------->|
690 |-------------------->|
691 |<--------------------|
696 * Connection failure after session is established because server reset,
697 and then client reconnects.
699 .. ditaa:: +---------+ +--------+
700 | Client | | Server |
701 +---------+ +--------+
703 c_cookie(a) | session established | s_cookie(b)
704 |<------------------->|
705 | X------------| reset
708 |-------------------->|
709 |<--------------------|
710 | reset_session(RC*) |
712 c_cookie(c) | client_ident(c) |
713 |-------------------->|
714 |<--------------------|
715 | server_ident(d) | s_cookie(d)
718 RC* means that the reset session full flag depends on the policy.resetcheck
722 * Connection failure after session is established because client reset,
723 and then client reconnects.
725 .. ditaa:: +---------+ +--------+
726 | Client | | Server |
727 +---------+ +--------+
729 c_cookie(a) | session established | s_cookie(b)
730 |<------------------->|
731 reset | X------------|
733 c_cookie(c) | client_ident(c) |
734 |-------------------->|
735 |<--------------------| reset if policy.resetcheck
736 | server_ident(d) | s_cookie(d)
743 Once a session is established, we can exchange messages.
745 * TAG_MSG: a message::
753 - The ceph_msg_header2 is modified from ceph_msg_header:
754 * include an ack_seq. This avoids the need for a TAG_ACK
755 message most of the time.
756 * remove the src field, which we now get from the message flow
757 handshake (TAG_IDENT).
758 * specifies the data_pre_padding length, which can be used to
759 adjust the alignment of the data payload. (NOTE: is this is
762 * TAG_ACK: acknowledge receipt of message(s)::
766 - This is only used for stateful sessions.
768 * TAG_KEEPALIVE2: check for connection liveness::
772 - Time stamp is local to sender.
774 * TAG_KEEPALIVE2_ACK: reply to a keepalive2::
778 - Time stamp is from the TAG_KEEPALIVE2 we are responding to.
780 * TAG_CLOSE: terminate a connection
782 Indicates that a connection should be terminated. This is equivalent
783 to a hangup or reset (i.e., should trigger ms_handle_reset). It
784 isn't strictly necessary or useful as we could just disconnect the

Example of protocol interaction (WIP)
_____________________________________

792 .. ditaa:: +---------+ +--------+
793 | Client | | Server |
794 +---------+ +--------+
803 |------------------>|
805 |------------------>|
806 |<------------------|
810 |------------------>|
811 |<------------------|
815 |------------------>|
816 |<------------------|