Serialization (encode/decode)
=============================

When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write to and read from
``bufferlist`` objects representing byte strings.

It is best to think not in terms of daemons and clients but in terms of
encoders and decoders. An encoder serializes a structure into a bufferlist
while a decoder does the opposite.

Encoders and decoders can be referred to collectively as dencoders.

Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client and the decoder by the OSD. However,
these roles can be reversed -- consider the handling of the response: the OSD
encodes the ``MOSDOpReply`` while the RBD client decodes it.

The encoder and decoder operate according to a format that the programmer
defines by implementing the ``encode`` and ``decode`` methods.

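
As a rough illustration of the two roles, the following self-contained
sketch serializes a structure into a byte string and reads it back. It does
not use Ceph's ``bufferlist`` or ``MOSDOp``: ``Bytes``, ``Request``,
``put_u32`` and ``get_u32`` are hypothetical stand-ins, and the
little-endian, length-prefixed layout is an assumption made for the example.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for bufferlist: a growable byte string.
    using Bytes = std::vector<std::uint8_t>;

    // Append a 32-bit unsigned integer in little-endian byte order.
    void put_u32(Bytes& out, std::uint32_t v) {
      for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
    }

    // Read a 32-bit little-endian integer at `pos` and advance `pos`.
    std::uint32_t get_u32(const Bytes& in, std::size_t& pos) {
      if (pos > in.size() || in.size() - pos < 4)
        throw std::out_of_range("buffer too short");
      std::uint32_t v = 0;
      for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(in[pos++]) << (8 * i);
      return v;
    }

    // Hypothetical request structure; the field names are illustrative only.
    struct Request {
      std::uint32_t object_id = 0;
      std::string payload;

      // Encoder role: structure -> bytes.
      void encode(Bytes& out) const {
        assert(payload.size() <= std::numeric_limits<std::uint32_t>::max());
        put_u32(out, object_id);
        put_u32(out, static_cast<std::uint32_t>(payload.size()));
        out.insert(out.end(), payload.begin(), payload.end());
      }

      // Decoder role: bytes -> structure, reading fields in the order
      // in which the encoder wrote them.
      void decode(const Bytes& in, std::size_t& pos) {
        object_id = get_u32(in, pos);
        const std::uint32_t len = get_u32(in, pos);
        if (in.size() - pos < len)
          throw std::out_of_range("buffer too short");
        payload.assign(reinterpret_cast<const char*>(in.data()) + pos, len);
        pos += len;
      }
    };

Whichever side calls ``encode`` acts as the encoder for that exchange, and
the side calling ``decode`` acts as the decoder, regardless of whether it is
a daemon or a client.
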

Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.

The general rule is that a decoder must understand what has been
encoded by an encoder. Most problems come from ensuring
that compatibility holds between old decoders and new encoders
as well as between new decoders and old encoders. One should assume
that -- unless explicitly stated otherwise -- any mix (old/new) is
possible in a cluster. There are two main reasons for that:

1. Upgrades. Although there are recommendations related to the order
   of entity types (mons/osds/clients), it is not mandatory and
   no assumption should be made about it.
2. Huge variability of client versions. Kernel upgrades (and thus
   kernel client upgrades) have always been decoupled
   from Ceph upgrades. Moreover, the proliferation of containerization
   brings this variability even to e.g. ``librbd`` -- user space
   libraries now live inside the containers themselves.

With this being said, there are a few rules limiting the degree
of interoperability between dencoders:

* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally
  every client should be able to talk to any version of daemons.

As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for deprecations of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.

Currently several families of dencoding helpers coexist:

* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance-optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.

Although details vary, the interoperability rules stay the same.


Adding a field to a structure
-----------------------------

You can see examples of this all over the Ceph code, but here's an
example:

.. code-block:: cpp

    class AcmeClass {
        // declarations of member1 and member2 not shown

        void encode(bufferlist &bl) {
            ENCODE_START(1, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl) {
            DECODE_START(1, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            DECODE_FINISH(bl);
        }
    };

The ``ENCODE_START`` macro writes a header that specifies a *version* and
a *compat_version* (both initially 1). The message version is incremented
whenever a change is made to the encoding. The compat_version is incremented
only if the change will break existing decoders -- decoders are tolerant
of trailing bytes, so changes that add fields at the end of the structure
do not require incrementing compat_version.

The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle. This is compared with the
compat_version encoded in the message, and if the message is too new then
an exception will be thrown. Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.

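
The comparison described above can be modelled in a few lines. This is a
simplified sketch, not Ceph's macro: ``Header``, ``check_decodable`` and
``can_decode`` are hypothetical names, and the exception type is chosen
only for the example.

.. code-block:: cpp

    #include <cassert>
    #include <cstdint>
    #include <stdexcept>

    // Simplified model of the two version bytes written by the encoder.
    struct Header {
      std::uint8_t struct_v;       // version of the encoding that was written
      std::uint8_t struct_compat;  // oldest decoder version able to read it
    };

    // `supported` is the newest encoding version this decoder understands.
    // The decoder only refuses input whose compat_version is newer than that.
    void check_decodable(const Header& h, std::uint8_t supported) {
      assert(h.struct_compat <= h.struct_v);  // compat never exceeds version
      if (supported < h.struct_compat)
        throw std::runtime_error("encoding is too new for this decoder");
    }

    // Convenience wrapper that reports the outcome as a boolean.
    bool can_decode(const Header& h, std::uint8_t supported) {
      try {
        check_decodable(h, supported);
        return true;
      } catch (const std::runtime_error&) {
        return false;
      }
    }

In this model a version-1 decoder accepts a header of ``{2, 1}`` (a field
was appended, compat_version unchanged) but rejects ``{3, 3}`` (a breaking
change raised compat_version).
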
In practice, changes to encoding usually involve simply adding the desired fields
at the end of the ``encode`` and ``decode`` functions, and incrementing
the versions in ``ENCODE_START`` and ``DECODE_START``. For example, here's how
to add a third field to ``AcmeClass``:

.. code-block:: cpp

    class AcmeClass {
        // declarations of member1 and member2 not shown
        std::vector<std::string> member3;

        void encode(bufferlist &bl) {
            ENCODE_START(2, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ::encode(member3, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl) {
            DECODE_START(2, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            if (struct_v >= 2) {
                ::decode(member3, bl);
            }
            DECODE_FINISH(bl);
        }
    };

Note that the compat_version did not change because the encoded message
will still be decodable by versions of the code that only understand
version 1 -- they will just ignore the trailing bytes where we encode ``member3``.

In the ``decode`` function, decoding the new field is conditional: this is
because we might still be passed older-versioned messages that do not
have the field. The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.

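
The effect of the version check can be tried out with a small standalone
model. It is only a sketch under simplifying assumptions: the header is
reduced to a single version byte, all members are single bytes, and
``Bytes``, ``get_u8`` and ``Acme`` are hypothetical names rather than Ceph
types. The ``version`` parameter of ``encode`` exists solely to imitate an
older encoder.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;  // stand-in for bufferlist

    // Read one byte at `pos`, failing loudly on truncated input.
    std::uint8_t get_u8(const Bytes& in, std::size_t& pos) {
      if (pos >= in.size())
        throw std::out_of_range("buffer too short");
      return in[pos++];
    }

    struct Acme {
      std::uint8_t member1 = 0;
      std::uint8_t member2 = 0;
      std::uint8_t member3 = 0;  // added in version 2 of the encoding

      // `version` selects which revision of the encoder to imitate (1 or 2).
      void encode(Bytes& out, std::uint8_t version = 2) const {
        assert(version == 1 || version == 2);
        out.push_back(version);
        out.push_back(member1);
        out.push_back(member2);
        if (version >= 2)
          out.push_back(member3);
      }

      // Version-2 decoder: reads member3 only if the encoder wrote it.
      void decode(const Bytes& in, std::size_t& pos) {
        const std::uint8_t struct_v = get_u8(in, pos);
        member1 = get_u8(in, pos);
        member2 = get_u8(in, pos);
        if (struct_v >= 2)
          member3 = get_u8(in, pos);
        else
          member3 = 0;  // field absent in old encodings: fall back to a default
      }
    };

Feeding the decoder a version-1 encoding leaves ``member3`` at its default
instead of reading past the end of the structure.
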
The append-extensibility of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros provide.

These macros implement the extensibility mechanism. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.


.. code-block:: cpp

    /**
     * start encoding block
     *
     * @param v current (code) version of the encoding
     * @param compat oldest code version that can decode it
     * @param bl bufferlist to encode to
     *
     */
    #define ENCODE_START(v, compat, bl)                  \
      __u8 struct_v = v;                                 \
      __u8 struct_compat = compat;                       \
      ceph_le32 struct_len;                              \
      auto filler = (bl).append_hole(sizeof(struct_v) +  \
        sizeof(struct_compat) + sizeof(struct_len));     \
      const auto starting_bl_len = (bl).length();        \
      using ::ceph::encode;                              \
      do {

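
The "reserve a hole, then fill it in" idea behind ``append_hole`` can be
sketched without Ceph's types. In this simplified model ``Bytes``,
``start_block`` and ``finish_block`` are hypothetical names, and the header
layout (one byte each for the two versions, then a 32-bit little-endian
length) is an assumption that mirrors the three fields named above.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;  // stand-in for bufferlist

    // struct_v (1 byte) + struct_compat (1 byte) + struct_len (4 bytes)
    constexpr std::size_t kHeaderSize = 1 + 1 + 4;

    // Reserve room for the header and return its offset (the "hole").
    // The payload size is not known yet, so the bytes are left zeroed.
    std::size_t start_block(Bytes& out) {
      const std::size_t hole = out.size();
      out.resize(out.size() + kHeaderSize);
      return hole;
    }

    // Fill the hole once the payload has been appended and its size is known.
    void finish_block(Bytes& out, std::size_t hole,
                      std::uint8_t v, std::uint8_t compat) {
      const std::size_t payload = out.size() - (hole + kHeaderSize);
      assert(payload <= std::numeric_limits<std::uint32_t>::max());
      out[hole] = v;
      out[hole + 1] = compat;
      for (int i = 0; i < 4; ++i)  // struct_len, little-endian
        out[hole + 2 + i] = static_cast<std::uint8_t>(payload >> (8 * i));
    }

Writing the length after the payload means the encoder never has to know in
advance how many bytes the user-provided ``encode`` body will produce.
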
The ``struct_len`` field allows the decoder to skip all the bytes that were
left undecoded by the user-provided ``decode`` implementation.
Analogously, the decoder tracks how much input has been consumed by the
user-provided ``decode`` methods.


.. code-block:: cpp

    #define DECODE_START(bl)                        \
      unsigned struct_end = 0;                      \
      __u32 struct_len;                             \
      decode(struct_len, bl);                       \
      /* ... */                                     \
      struct_end = bl.get_off() + struct_len;       \
      /* ... */

The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist iterator is critical because dencoders
tend to be nested; leaving it where the user-provided code stopped would work
only for the very last ``decode`` call in a nested structure.


.. code-block:: cpp

    #define DECODE_FINISH(bl)                       \
      /* ... */                                     \
      if (bl.get_off() < struct_end)                \
        bl += struct_end - bl.get_off();            \
      /* ... */

This cooperative mechanism allows later revisions of an encoder
to emit a longer byte stream (e.g. because a new field was added at the end)
without worrying that the extra bytes will crash older decoder revisions.

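
To see why skipping matters for nested structures, consider the following
standalone model of an old decoder reading a block written by a newer
encoder. It is a sketch, not the Ceph macros: ``Reader``, ``decode_start``,
``decode_finish`` and ``decode_inner_v1`` are hypothetical names, and the
header layout (version byte, compat byte, 32-bit little-endian length) is
assumed for the example.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;  // stand-in for bufferlist

    // Minimal cursor over a byte string (stand-in for a bufferlist iterator).
    struct Reader {
      const Bytes& buf;
      std::size_t off = 0;
      explicit Reader(const Bytes& b) : buf(b) {}
      std::uint8_t u8() {
        if (off >= buf.size())
          throw std::out_of_range("buffer too short");
        return buf[off++];
      }
    };

    // Read the block header and return the offset at which the block ends.
    std::size_t decode_start(Reader& r, std::uint8_t supported,
                             std::uint8_t& struct_v) {
      struct_v = r.u8();
      const std::uint8_t struct_compat = r.u8();
      if (supported < struct_compat)
        throw std::runtime_error("encoding is too new for this decoder");
      std::uint32_t struct_len = 0;
      for (int i = 0; i < 4; ++i)
        struct_len |= static_cast<std::uint32_t>(r.u8()) << (8 * i);
      if (struct_len > r.buf.size() - r.off)
        throw std::out_of_range("block runs past end of buffer");
      return r.off + struct_len;
    }

    // Skip whatever part of the block the decoder did not consume.
    void decode_finish(Reader& r, std::size_t struct_end) {
      assert(struct_end <= r.buf.size());
      if (r.off > struct_end)
        throw std::out_of_range("decoded past end of block");
      r.off = struct_end;
    }

    // Version-1 decoder of an inner structure: it knows one field only.
    std::uint8_t decode_inner_v1(Reader& r) {
      std::uint8_t struct_v = 0;
      const std::size_t struct_end = decode_start(r, 1, struct_v);
      const std::uint8_t field1 = r.u8();
      decode_finish(r, struct_end);  // jump over fields added by newer encoders
      return field1;
    }

If the inner block was written by a version-2 encoder with an extra field,
``decode_finish`` moves the cursor past it, so the enclosing decoder
continues at the correct offset instead of misreading the leftover bytes as
its own next field.
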