.. _msgr2-protocol:
-msgr2 protocol
-==============
+msgr2 protocol (msgr2.0 and msgr2.1)
+====================================
This is a revision of the legacy Ceph on-wire protocol that was
implemented by the SimpleMessenger. It addresses performance and
(e.g., padding) that keep computation and memory copies out of the
fast path where possible.
* *Signing*. We will allow for traffic to be signed (but not
- necessarily encrypted). This may not be implemented in the initial version.
+ necessarily encrypted). This is not implemented.
Definitions
-----------
Both the client and server, upon connecting, send a banner::
- "ceph %x %x\n", protocol_features_suppored, protocol_features_required
+ "ceph v2\n"
+ __le16 banner payload length
+ banner payload
-The protocol features are a new, distinct namespace. Initially no
-features are defined or required, so this will be "ceph 0 0\n".
+A banner payload has the form::
+
+ __le64 peer_supported_features
+ __le64 peer_required_features
+
+This is a new, distinct feature bit namespace (CEPH_MSGR2_*).
+Currently, only CEPH_MSGR2_FEATURE_REVISION_1 is defined. It is
+supported but not required, so that msgr2.0 and msgr2.1 peers
+can talk to each other.
If the remote party advertises required features we don't support, we
can disconnect.
Frame format
------------
-All further data sent or received is contained by a frame. Each frame has
-the form::
+After the banners are exchanged, all further communication happens
+in frames. The exact format of the frame depends on the connection
+mode (msgr2.0-crc, msgr2.0-secure, msgr2.1-crc or msgr2.1-secure).
+All connections start in crc mode (either msgr2.0-crc or msgr2.1-crc,
+depending on peer_supported_features from the banner).
+
+Each frame has a 32-byte preamble::
+
+ __u8 tag
+ __u8 number of segments
+ {
+ __le32 segment length
+ __le16 segment alignment
+ } * 4
+ reserved (2 bytes)
+ __le32 preamble crc
+
+An empty frame has one empty segment. A non-empty frame can have
+between one and four segments, all segments except the last may be
+empty.
+
+If there are less than four segments, unused (trailing) segment
+length and segment alignment fields are zeroed.
+
+The reserved bytes are zeroed.
+
+The preamble checksum is CRC32-C. It covers everything up to
+itself (28 bytes) and is calculated and verified irrespective of
+the connection mode (i.e. even if the frame is encrypted).
+
+### msgr2.0-crc mode
+
+A msgr2.0-crc frame has the form::
+
+ preamble (32 bytes)
+ {
+ segment payload
+ } * number of segments
+ epilogue (17 bytes)
+
+where epilogue is::
+
+ __u8 late_flags
+ {
+ __le32 segment crc
+ } * 4
+
+late_flags is used for frame abortion. After transmitting the
+preamble and the first segment, the sender can fill the remaining
+segments with zeros and set a flag to indicate that the receiver must
+drop the frame. This allows the sender to avoid extra buffering
+when a frame that is being put on the wire is revoked (i.e. yanked
+out of the messenger): payload buffers can be unpinned and handed
+back to the user immediately, without making a copy or blocking
+until the whole frame is transmitted. Currently this is used only
+by the kernel client, see ceph_msg_revoke().
+
+The segment checksum is CRC32-C. For "used" empty segments, it is
+set to (__le32)-1. For unused (trailing) segments, it is zeroed.
+
+The crcs are calculated just to protect against bit errors.
+No authenticity guarantees are provided, unlike in msgr1 which
+attempted to provide some authenticity guarantee by optionally
+signing segment lengths and crcs with the session key.
+
+Issues:
+
+1. As part of introducing a structure for a generic frame with
+ variable number of segments suitable for both control and
+ message frames, msgr2.0 moved the crc of the first segment of
+ the message frame (ceph_msg_header2) into the epilogue.
+
+ As a result, ceph_msg_header2 can no longer be safely
+ interpreted before the whole frame is read off the wire.
+ This is a regression from msgr1, because in order to scatter
+ the payload directly into user-provided buffers and thus avoid
+ extra buffering and copying when receiving message frames,
+ ceph_msg_header2 must be available in advance -- it stores
+ the transaction id which the user buffers are keyed on.
+ The implementation has to choose between forgoing this
+ optimization or acting on an unverified segment.
+
+2. late_flags is not covered by any crc. Since it stores the
+ abort flag, a single bit flip can result in a completed frame
+ being dropped (causing the sender to hang waiting for a reply)
+ or, worse, in an aborted frame with garbage segment payloads
+ being dispatched.
+
+ This was the case with msgr1 and got carried over to msgr2.0.
+
+### msgr2.1-crc mode
+
+Differences from msgr2.0-crc:
+
+1. The crc of the first segment is stored at the end of the
+ first segment, not in the epilogue. The epilogue stores up to
+ three crcs, not up to four.
+
+ If the first segment is empty, (__le32)-1 crc is not generated.
+
+2. The epilogue is generated only if the frame has more than one
+ segment (i.e. at least one of second to fourth segments is not
+ empty). Rationale: If the frame has only one segment, it cannot
+ be aborted and there are no crcs to store in the epilogue.
- frame_len (le32)
- tag (TAG_* le32)
- frame_header_checksum (le32)
- payload
- [payload padding -- only present after stream auth phase]
- [signature -- only present after stream auth phase]
+3. Unchecksummed late_flags is replaced with late_status which
+ builds in bit error detection by using a 4-bit nibble per flag
+ and two code words that are Hamming Distance = 4 apart (and not
+ all zeros or ones). This comes at the expense of having only
+ one reserved flag, of course.
+Some example frames:
-* The frame_header_checksum is over just the frame_len and tag values (8 bytes).
+* A 0+0+0+0 frame (empty, no epilogue)::
-* frame_len includes everything after the frame_len le32 up to the end of the
- frame (all payloads, signatures, and padding).
+ preamble (32 bytes)
-* The payload format and length is determined by the tag.
+* A 20+0+0+0 frame (no epilogue)::
-* The signature portion is only present if the authentication phase
- has completed (TAG_AUTH_DONE has been sent) and signatures are
- enabled.
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ __le32 segment1 crc
+
+* A 0+70+0+0 frame::
+
+ preamble (32 bytes)
+ segment2 payload (70 bytes)
+ epilogue (13 bytes)
+
+* A 20+70+0+350 frame::
+
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ __le32 segment1 crc
+ segment2 payload (70 bytes)
+ segment4 payload (350 bytes)
+ epilogue (13 bytes)
+
+where epilogue is::
+
+ __u8 late_status
+ {
+ __le32 segment crc
+ } * 3
Hello
-----
Post-auth frame format
----------------------
-The frame format is fixed (see above), but can take three different
-forms, depending on the AUTH_DONE flags:
+Depending on the negotiated connection mode from TAG_AUTH_DONE, the
+connection either stays in crc mode or switches to the corresponding
+secure mode (msgr2.0-secure or msgr2.1-secure).
+
+### msgr2.0-secure mode
+
+A msgr2.0-secure frame has the form::
+
+ {
+ preamble (32 bytes)
+ {
+ segment payload
+ zero padding (out to 16 bytes)
+ } * number of segments
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+where epilogue is::
+
+ __u8 late_flags
+ zero padding (15 bytes)
+
+late_flags has the same meaning as in msgr2.0-crc mode.
+
+Each segment and the epilogue are zero padded out to 16 bytes.
+Technically, GCM doesn't require any padding because Counter mode
+(the C in GCM) essentially turns a block cipher into a stream cipher.
+But, if the overall input length is not a multiple of 16 bytes, some
+implicit zero padding would occur internally because GHASH function
+used by GCM for generating auth tags only works on 16-byte blocks.
+
+Issues:
+
+1. The sender encrypts the whole frame using a single nonce
+ and generating a single auth tag. Because segment lengths are
+ stored in the preamble, the receiver has no choice but to decrypt
+ and interpret the preamble without verifying the auth tag -- it
+ can't even tell how much to read off the wire to get the auth tag
+ otherwise! This creates a decryption oracle, which, in conjunction
+ with Counter mode malleability, could lead to recovery of sensitive
+ information.
+
+ This issue extends to the first segment of the message frame as
+ well. As in msgr2.0-crc mode, ceph_msg_header2 cannot be safely
+ interpreted before the whole frame is read off the wire.
+
+2. Deterministic nonce construction with a 4-byte counter field
+ followed by an 8-byte fixed field is used. The initial values are
+ taken from the connection secret -- a random byte string generated
+ during the authentication phase. Because the counter field is
+ only four bytes long, it can wrap and then repeat in under a day,
+ leading to GCM nonce reuse and therefore a potential complete
+ loss of both authenticity and confidentiality for the connection.
+ This was addressed by disconnecting before the counter repeats
+ (CVE-2020-1759).
+
+### msgr2.1-secure mode
+
+Differences from msgr2.0-secure:
+
+1. The preamble, the first segment and the rest of the frame are
+ encrypted separately, using separate nonces and generating
+ separate auth tags. This gets rid of unverified plaintext use
+ and keeps msgr2.1-secure mode close to msgr2.1-crc mode, allowing
+ the implementation to receive message frames in a similar fashion
+ (little to no buffering, same scatter/gather logic, etc).
+
+ In order to reduce the number of en/decryption operations per
+ frame, the preamble is grown by a fixed size inline buffer (48
+ bytes) that the first segment is inlined into, either fully or
+ partially. The preamble auth tag covers both the preamble and the
+ inline buffer, so if the first segment is small enough to be fully
+ inlined, it becomes available after a single decryption operation.
+
+2. As in msgr2.1-crc mode, the epilogue is generated only if the
+ frame has more than one segment. The rationale is even stronger,
+ as it would require an extra en/decryption operation.
+
+3. For consistency with msgr2.1-crc mode, late_flags is replaced
+ with late_status (the built-in bit error detection isn't really
+ needed in secure mode).
+
+4. In accordance with `NIST Recommendation for GCM`_, deterministic
+ nonce construction with a 4-byte fixed field followed by an 8-byte
+ counter field is used. An 8-byte counter field should never repeat
+ but the nonce reuse protection put in place for msgr2.0-secure mode
+ is still there.
+
+ The initial values are the same as in msgr2.0-secure mode.
+
+ .. _`NIST Recommendation for GCM`: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pdf
+
+As in msgr2.0-secure mode, each segment is zero padded out to
+16 bytes. If the first segment is fully inlined, its padding goes
+to the inline buffer. Otherwise, the padding is on the remainder.
+The corollary to this is that the inline buffer is consumed in
+16-byte chunks.
+
+The unused portion of the inline buffer is zeroed.
+
+Some example frames:
+
+* A 0+0+0+0 frame (empty, nothing to inline, no epilogue)::
+
+ {
+ preamble (32 bytes)
+ zero padding (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+* A 20+0+0+0 frame (first segment fully inlined, no epilogue)::
+
+ {
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ zero padding (28 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
-* If neither FLAG_SIGNED or FLAG_ENCRYPTED is specified, things are simple::
+* A 0+70+0+0 frame (nothing to inline)::
- frame_len
- tag
- payload
- payload_padding (out to auth block_size)
+ {
+ preamble (32 bytes)
+ zero padding (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment2 payload (70 bytes)
+ zero padding (10 bytes)
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
- - The padding is some number of bytes < the auth block_size that
- brings the total length of the payload + payload_padding to a
- multiple of block_size. It does not include the frame_len or tag. Padding
- content can be zeros or (better) random bytes.
+* A 20+70+0+350 frame (first segment fully inlined)::
-* If FLAG_SIGNED has been specified::
+ {
+ preamble (32 bytes)
+ segment1 payload (20 bytes)
+ zero padding (28 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment2 payload (70 bytes)
+ zero padding (10 bytes)
+ segment4 payload (350 bytes)
+ zero padding (2 bytes)
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
- frame_len
- tag
- payload
- payload_padding (out to auth block_size)
- signature (sig_size bytes)
+* A 105+0+0+0 frame (first segment partially inlined, no epilogue)::
- Here the padding just makes life easier for the signature. It can be
- random data to add additional confounder. Note also that the
- signature input must include some state from the session key and the
- previous message.
+ {
+ preamble (32 bytes)
+ segment1 payload (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment1 payload remainder (57 bytes)
+ zero padding (7 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
-* If FLAG_ENCRYPTED has been specified::
+* A 105+70+0+350 frame (first segment partially inlined)::
- frame_len
- tag
{
- payload
- payload_padding (out to auth block_size)
- } ^ stream cipher
+ preamble (32 bytes)
+ segment1 payload (48 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment1 payload remainder (57 bytes)
+ zero padding (7 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+ {
+ segment2 payload (70 bytes)
+ zero padding (10 bytes)
+ segment4 payload (350 bytes)
+ zero padding (2 bytes)
+ epilogue (16 bytes)
+ } ^ AES-128-GCM cipher
+ auth tag (16 bytes)
+
+where epilogue is::
- Note that the padding ensures that the total frame is a multiple of
- the auth method's block_size so that the message can be sent out over
- the wire without waiting for the next frame in the stream.
+ __u8 late_status
+ zero padding (15 bytes)
+late_status has the same meaning as in msgr2.1-crc mode.
Message flow handshake
----------------------