Serialization (encode/decode)
=============================

When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write and read from
``bufferlist`` objects representing byte strings.

Terminology
-----------
It is best to think in terms of encoders and decoders rather than
daemons and clients. An encoder serializes a structure into a bufferlist
while a decoder does the opposite.

Encoders and decoders can be referred to collectively as dencoders.

Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client and the decoder by the OSD. However,
these roles can swap -- just imagine the handling of the response: the OSD
encodes the ``MOSDOpReply`` while the RBD client decodes it.

Encoders and decoders operate according to a format which is defined
by a programmer by implementing the ``encode`` and ``decode`` methods.

Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.

The general rule is that a decoder must understand what has been
encoded by an encoder. Most of the problems come from ensuring
that compatibility continues between old decoders and new encoders
as well as new decoders and old encoders. One should assume
that -- unless stated otherwise -- any mix (old/new) is
possible in a cluster. There are two main reasons for that:

1. Upgrades. Although there are recommendations related to the order
   of entity types (mons/osds/clients), it is not mandatory and
   no assumption should be made about it.
2. Huge variability of client versions. It has always been the case
   that kernel (and thus kernel client) upgrades are decoupled
   from Ceph upgrades. Moreover, the proliferation of containerization
   brings this variability even to e.g. ``librbd`` -- user space
   libraries now live inside the containers themselves.

With this being said, there are a few rules limiting the degree
of interoperability between dencoders:

* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally
  every client should be able to talk to any version of daemons.

As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for deprecations of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.

Frameworks
----------
Currently multiple genres of dencoding helpers co-exist:

* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.

Although details vary, the interoperability rules stay the same.

Adding a field to a structure
-----------------------------

You can see examples of this all over the Ceph code, but here's an
example:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;

        void encode(bufferlist &bl)
        {
            ENCODE_START(1, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(1, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            DECODE_FINISH(bl);
        }
    };

The ``ENCODE_START`` macro writes a header that specifies a *version* and
a *compat_version* (both initially 1).  The message version is incremented
whenever a change is made to the encoding.  The compat_version is incremented
only if the change will break existing decoders -- decoders are tolerant
of trailing bytes, so changes that add fields at the end of the structure
do not require incrementing compat_version.

The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle.  This is compared with the
compat_version encoded in the message, and if the message is too new then
an exception will be thrown.  Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.
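The version check described above can be sketched outside of Ceph as
follows. This is a simplified, standalone illustration and not the real
macro: the ``Header`` struct and the ``check_compat`` function are
hypothetical names, and ``std::runtime_error`` merely stands in for
whatever exception the actual decoder raises.

.. code-block:: cpp

    #include <cstdint>
    #include <stdexcept>

    // Hypothetical stand-in for the two version bytes written by an encoder.
    struct Header {
        std::uint8_t struct_v;       // version of the format that was encoded
        std::uint8_t struct_compat;  // oldest decoder version able to read it
    };

    // Mimics the comparison described in the text. `supported_v` is the
    // newest version this decoder understands (the role played by the first
    // argument of DECODE_START). Returns the encoded version so that the
    // caller can branch on it.
    inline std::uint8_t check_compat(const Header& h, std::uint8_t supported_v)
    {
        if (supported_v < h.struct_compat) {
            // The encoder declared that decoders this old cannot cope.
            throw std::runtime_error("encoding is too new for this decoder");
        }
        return h.struct_v;
    }

In this sketch a decoder supporting version 1 accepts a header marked
``(struct_v=2, struct_compat=1)``, while a decoder supporting version 2
rejects a header marked ``(struct_v=3, struct_compat=3)``.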

In practice, changes to encoding usually involve simply adding the desired fields
at the end of the ``encode`` and ``decode`` functions, and incrementing
the versions in ``ENCODE_START`` and ``DECODE_START``.  For example, here's how
to add a third field to ``AcmeClass``:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;
        std::vector<std::string> member3;

        void encode(bufferlist &bl)
        {
            ENCODE_START(2, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ::encode(member3, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(2, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            if (struct_v >= 2) {
                ::decode(member3, bl);
            }
            DECODE_FINISH(bl);
        }
    };

Note that the compat_version did not change because the encoded message
will still be decodable by versions of the code that only understand
version 1 -- they will just ignore the trailing bytes where we encode ``member3``.
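Both compatibility directions can be tried out with a small toy model
that does not depend on Ceph at all. The sketch below is an assumption-laden
simplification: ``Bytes``, ``Acme``, ``encode_acme`` and ``decode_acme``
are hypothetical names, the fields are single bytes, and the ``version``
and ``supported_v`` parameters simulate older and newer revisions of the
same code.

.. code-block:: cpp

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;

    struct Acme {
        std::uint8_t member1 = 0;
        std::uint8_t member2 = 0;  // added in version 2 of the toy format
    };

    // Encode as [struct_v][struct_compat][member1][member2 if struct_v >= 2].
    // `version` selects which revision of the encoder is being simulated.
    inline Bytes encode_acme(const Acme& a, std::uint8_t version)
    {
        Bytes out{version, 1, a.member1};  // compat_version stays at 1
        if (version >= 2) {
            out.push_back(a.member2);  // new fields are only ever appended
        }
        return out;
    }

    // `supported_v` is the newest format version this decoder revision knows.
    inline Acme decode_acme(const Bytes& in, std::uint8_t supported_v)
    {
        if (in.size() < 3) {
            throw std::runtime_error("truncated input");
        }
        const std::uint8_t struct_v = in[0];
        const std::uint8_t struct_compat = in[1];
        if (supported_v < struct_compat) {
            throw std::runtime_error("encoding is too new for this decoder");
        }
        Acme a;
        a.member1 = in[2];
        // Decode member2 only if the encoder wrote it and we know about it.
        if (struct_v >= 2 && supported_v >= 2) {
            if (in.size() < 4) {
                throw std::runtime_error("truncated input");
            }
            a.member2 = in[3];
        }
        // Anything after the known fields is deliberately left unread.
        return a;
    }

Calling ``decode_acme(encode_acme(x, 2), 1)`` models an old decoder reading
a new encoding: ``member1`` is recovered and the appended byte is ignored.
Calling ``decode_acme(encode_acme(x, 1), 2)`` models the opposite case, in
which the new decoder leaves ``member2`` at its default value.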

In the ``decode`` function, decoding the new field is conditional: this is
because we might still be passed older-versioned messages that do not
have the field.  The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.

Into the weeds
--------------

The append-extendability of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.

These macros implement the extensibility mechanism. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.

.. code-block:: cpp

        /**
         * start encoding block
         *
         * @param v current (code) version of the encoding
         * @param compat oldest code version that can decode it
         * @param bl bufferlist to encode to
         *
         */
        #define ENCODE_START(v, compat, bl)                             \
          __u8 struct_v = v;                                            \
          __u8 struct_compat = compat;                                  \
          ceph_le32 struct_len;                                         \
          auto filler = (bl).append_hole(sizeof(struct_v) +             \
            sizeof(struct_compat) + sizeof(struct_len));                \
          const auto starting_bl_len = (bl).length();                   \
          using ::ceph::encode;                                         \
          do {

The ``struct_len`` field allows the decoder to skip all the bytes that were
left undecoded by the user-provided ``decode`` implementation.
Similarly, the decoder tracks how much input has been consumed by the
user-provided ``decode`` method.

.. code-block:: cpp

        #define DECODE_START(v, bl)                                     \
          unsigned struct_end = 0;                                      \
          __u32 struct_len;                                             \
          decode(struct_len, bl);                                       \
          ...                                                           \
          struct_end = bl.get_off() + struct_len;                       \
          }                                                             \
          do {


The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist iterator is critical because dencoders
tend to be nested; leaving it where the user code stopped would work only
for the very last ``decode`` call in a nested structure.

.. code-block:: cpp

        #define DECODE_FINISH(bl)                                       \
          } while (false);                                              \
          if (struct_end) {                                             \
            ...                                                         \
            if (bl.get_off() < struct_end)                              \
              bl += struct_end - bl.get_off();                          \
          }


This cooperative mechanism allows later revisions of an encoder
to generate a longer byte stream (e.g. by adding a new field at the end)
without worrying that the extra bytes will crash older decoder revisions.
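
Why skipping to ``struct_end`` matters for consecutive or nested structures
can be seen in a standalone toy model. The sketch below does not use any
Ceph type: ``Bytes``, ``append_block`` and ``decode_first_field`` are
hypothetical names, and the length is stored in a single byte for brevity,
whereas the macros above use a 32-bit ``struct_len``.

.. code-block:: cpp

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;

    // Append one block: [struct_v][struct_compat][struct_len][payload...].
    inline void append_block(Bytes& out, std::uint8_t v, const Bytes& payload)
    {
        if (payload.size() > 255) {
            throw std::length_error("payload too long for a 1-byte length");
        }
        out.push_back(v);
        out.push_back(1);  // compat_version, unchanged by appended fields
        out.push_back(static_cast<std::uint8_t>(payload.size()));
        out.insert(out.end(), payload.begin(), payload.end());
    }

    // Simulates a decoder revision that only knows the first payload byte.
    // It reads that byte and then jumps to struct_end, so `pos` lands on
    // whatever follows the block even if the encoder appended more fields.
    inline std::uint8_t decode_first_field(const Bytes& in, std::size_t& pos)
    {
        if (pos > in.size() || in.size() - pos < 3) {
            throw std::runtime_error("truncated header");
        }
        const std::size_t struct_len = in[pos + 2];
        pos += 3;
        if (struct_len < 1 || in.size() - pos < struct_len) {
            throw std::runtime_error("truncated payload");
        }
        const std::size_t struct_end = pos + struct_len;
        const std::uint8_t field = in[pos++];
        if (pos < struct_end) {
            pos = struct_end;  // discard the residue this decoder does not know
        }
        return field;
    }

If two blocks are written back to back and the first one carries extra
trailing fields, the second call to ``decode_first_field`` still starts at
the correct offset. Without the jump to ``struct_end`` it would start in
the middle of the first block's residue.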