Serialization (encode/decode)
=============================

When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write to and read from
``bufferlist`` objects representing byte strings.

Terminology
-----------
It is best to think not in the domain of daemons and clients but
of encoders and decoders. An encoder serializes a structure into a
bufferlist while a decoder does the opposite.

Encoders and decoders can be referred to collectively as dencoders.

Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client and the decoder by the OSD. However,
these roles can swap -- just imagine handling of the response: the OSD
encodes the ``MOSDOpReply`` while RBD clients decode it.

Encoders and decoders operate according to a format which is defined
by a programmer by implementing the ``encode`` and ``decode`` methods.

Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.

The general rule is that a decoder must understand what has been
encoded by an encoder. Most of the problems come from ensuring
that compatibility continues between old decoders and new encoders
as well as new decoders and old encoders. One should assume
that -- unless explicitly stated otherwise -- any mix (old/new) is
possible in a cluster. There are two main reasons for that:

1. Upgrades. Although there are recommendations related to the order
   of entity types (mons/osds/clients), it is not mandatory and
   no assumption should be made about it.
2. Huge variability of client versions. It has always been the case
   that kernel (and thus kernel client) upgrades are decoupled
   from Ceph upgrades. Moreover, the proliferation of containerization
   brings this variability even to e.g. ``librbd`` -- user space
   libraries now live inside the containers themselves.

With this being said, there are a few rules limiting the degree
of interoperability between dencoders:

* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally
  every client should be able to talk to any version of daemons.

As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for deprecations of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.

Frameworks
----------
Currently multiple genres of dencoding helpers co-exist:

* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.

Although details vary, the interoperability rules stay the same.

Adding a field to a structure
-----------------------------
You can see examples of this all over the Ceph code, but here's an
example:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;

        void encode(bufferlist &bl)
        {
            ENCODE_START(1, 1, bl);   // version 1, compat_version 1
            ::encode(member1, bl);
            ::encode(member2, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(1, bl);      // newest version this code understands
            ::decode(member1, bl);
            ::decode(member2, bl);
            DECODE_FINISH(bl);
        }
    };

The ``ENCODE_START`` macro writes a header that specifies a *version* and
a *compat_version* (both initially 1). The message version is incremented
whenever a change is made to the encoding. The compat_version is incremented
only if the change will break existing decoders -- decoders are tolerant
of trailing bytes, so changes that add fields at the end of the structure
do not require incrementing compat_version.

The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle. This is compared with the
compat_version encoded in the message, and if the message is too new then
an exception will be thrown. Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.
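The version check described above can be sketched outside of Ceph as
follows. This is only an illustration of the rule, not the actual macro:
``ToyHeader``, ``check_decodable`` and ``can_decode`` are hypothetical names,
and ``std::runtime_error`` stands in for whatever exception the real
``DECODE_START`` throws.

.. code-block:: cpp

    #include <cassert>
    #include <cstdint>
    #include <stdexcept>

    // Hypothetical stand-in for the two version bytes written by ENCODE_START.
    struct ToyHeader {
        std::uint8_t struct_v;       // version of the format used by the encoder
        std::uint8_t struct_compat;  // oldest decoder version able to read it
    };

    // A decoder that understands versions up to `supported_v` rejects the
    // payload only when the encoded compat_version is newer than that.
    inline void check_decodable(std::uint8_t supported_v, const ToyHeader &h)
    {
        if (supported_v < h.struct_compat) {
            throw std::runtime_error("encoding is too new for this decoder");
        }
    }

    // Convenience wrapper that reports the outcome instead of throwing.
    inline bool can_decode(std::uint8_t supported_v, const ToyHeader &h)
    {
        try {
            check_decodable(supported_v, h);
            return true;
        } catch (const std::runtime_error &) {
            return false;
        }
    }

Under these assumptions, a version-1 decoder accepts a payload written with
``ENCODE_START(2, 1, bl)`` because the compat_version is still 1, but it
would reject a payload whose compat_version had been raised to 2.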

In practice, changes to encoding usually involve simply adding the desired fields
at the end of the ``encode`` and ``decode`` functions, and incrementing
the versions in ``ENCODE_START`` and ``DECODE_START``. For example, here's how
to add a third field to ``AcmeClass``:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;
        std::vector<std::string> member3;

        void encode(bufferlist &bl)
        {
            ENCODE_START(2, 1, bl);   // version bumped to 2, compat_version stays 1
            ::encode(member1, bl);
            ::encode(member2, bl);
            ::encode(member3, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(2, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            if (struct_v >= 2) {      // member3 is absent in version-1 encodings
                ::decode(member3, bl);
            }
            DECODE_FINISH(bl);
        }
    };

Note that the compat_version did not change because the encoded message
will still be decodable by versions of the code that only understand
version 1 -- they will just ignore the trailing bytes where we encode ``member3``.
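To make the "ignore the trailing bytes" behaviour concrete, here is a
self-contained sketch that does not use Ceph at all. Everything in it is
an assumption made for illustration: ``Bytes``, ``put_u32``, ``get_u32``,
``encode_v2`` and ``decode_v1`` are hypothetical helpers, the members are
reduced to 32-bit integers, and the header is modelled as one version byte,
one compat byte and a 4-byte little-endian length.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;

    // Little-endian helpers for the toy wire format.
    inline void put_u32(Bytes &out, std::uint32_t v)
    {
        for (int i = 0; i < 4; ++i)
            out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
    }

    inline std::uint32_t get_u32(const Bytes &in, std::size_t &off)
    {
        if (off > in.size() || in.size() - off < 4)
            throw std::out_of_range("buffer too short");
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(in[off + i]) << (8 * i);
        off += 4;
        return v;
    }

    // Version-2 encoder: header, then member1 and the newly appended member3.
    inline Bytes encode_v2(std::uint32_t member1, std::uint32_t member3)
    {
        Bytes out;
        out.push_back(2);   // struct_v
        out.push_back(1);   // struct_compat: version-1 decoders still work
        put_u32(out, 8);    // struct_len: two 32-bit fields follow
        put_u32(out, member1);
        put_u32(out, member3);
        return out;
    }

    // Version-1 decoder: knows only member1. It leaves `off` just past the
    // whole structure, skipping the bytes it does not understand.
    inline std::uint32_t decode_v1(const Bytes &in, std::size_t &off)
    {
        const std::uint8_t supported_v = 1;
        (void)in.at(off++);                           // struct_v, unused here
        const std::uint8_t struct_compat = in.at(off++);
        if (supported_v < struct_compat)
            throw std::runtime_error("encoding too new");
        const std::uint32_t struct_len = get_u32(in, off);
        const std::size_t struct_end = off + struct_len;
        if (struct_end > in.size())
            throw std::out_of_range("struct_len runs past the buffer");
        const std::uint32_t member1 = get_u32(in, off);
        off = struct_end;   // discard member3 and anything appended later
        return member1;
    }

Feeding the output of ``encode_v2`` to ``decode_v1`` yields ``member1``
unchanged and leaves the offset at the end of the buffer, which is the
property that lets a newer encoder append fields without raising
compat_version.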

In the ``decode`` function, decoding the new field is conditional: this is
because we might still be passed older-versioned messages that do not
have the field. The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.

Into the weeds
--------------
The append-extensibility of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.

They implement extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.

.. code-block:: cpp

    /**
     * start encoding block
     *
     * @param v current (code) version of the encoding
     * @param compat oldest code version that can decode it
     * @param bl bufferlist to encode to
     *
     */
    #define ENCODE_START(v, compat, bl) \
      __u8 struct_v = v; \
      __u8 struct_compat = compat; \
      ceph_le32 struct_len; \
      auto filler = (bl).append_hole(sizeof(struct_v) + \
        sizeof(struct_compat) + sizeof(struct_len)); \
      const auto starting_bl_len = (bl).length(); \
      using ::ceph::encode; \
      do {

The ``struct_len`` field allows the decoder to eat all the bytes that were
left undecoded in the user-provided ``decode`` implementation.
Analogously, the decoder tracks how much input has been decoded in the
user-provided ``decode`` methods.

.. code-block:: cpp

    #define DECODE_START(v, bl) \
      unsigned struct_end = 0; \
      __u32 struct_len; \
      decode(struct_len, bl); \
      ... \
      struct_end = bl.get_off() + struct_len; \
      } \
      do {


The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist iterator is critical as dencoders tend
to be nested; just leaving it intact would work only for the very last
``decode`` call in a nested structure.

.. code-block:: cpp

    #define DECODE_FINISH(bl) \
      } while (false); \
      if (struct_end) { \
        ... \
        if (bl.get_off() < struct_end) \
          bl += struct_end - bl.get_off(); \
      }


This entire cooperative mechanism allows an encoder (in its later revisions)
to generate a longer byte stream (due to e.g. adding a new field at the end)
without worrying that the residue will crash older decoder revisions.
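The nesting argument can be illustrated with a toy model. It is a sketch
under stated assumptions rather than Ceph code: ``encode_block`` and
``decode_block`` are hypothetical function counterparts of the
``ENCODE_START``/``ENCODE_FINISH`` and ``DECODE_START``/``DECODE_FINISH``
pairs, ``Bytes`` and ``OldView`` are made-up types, and the header layout
(version byte, compat byte, 4-byte little-endian length) is assumed.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;

    inline void put_u32(Bytes &out, std::uint32_t v)
    {
        for (int i = 0; i < 4; ++i)
            out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
    }

    inline std::uint32_t get_u32(const Bytes &in, std::size_t &off)
    {
        if (off > in.size() || in.size() - off < 4)
            throw std::out_of_range("buffer too short");
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(in[off + i]) << (8 * i);
        off += 4;
        return v;
    }

    // Reserve the header, run the body, then back-patch struct_len once the
    // encoded size is known.
    template <typename Body>
    void encode_block(Bytes &out, std::uint8_t v, std::uint8_t compat, Body body)
    {
        out.push_back(v);
        out.push_back(compat);
        const std::size_t len_pos = out.size();
        put_u32(out, 0);                          // hole for struct_len
        const std::size_t start = out.size();
        body(out);
        const auto len = static_cast<std::uint32_t>(out.size() - start);
        for (int i = 0; i < 4; ++i)
            out[len_pos + i] = static_cast<std::uint8_t>(len >> (8 * i));
    }

    // Run the body, then jump to struct_end so that unknown trailing bytes
    // are consumed before the caller continues.
    template <typename Body>
    void decode_block(const Bytes &in, std::size_t &off,
                      std::uint8_t supported_v, Body body)
    {
        const std::uint8_t struct_v = in.at(off++);
        const std::uint8_t struct_compat = in.at(off++);
        if (supported_v < struct_compat)
            throw std::runtime_error("encoding too new");
        const std::uint32_t struct_len = get_u32(in, off);
        const std::size_t struct_end = off + struct_len;
        if (struct_end > in.size())
            throw std::out_of_range("struct_len runs past the buffer");
        body(struct_v);
        if (off > struct_end)
            throw std::out_of_range("decoded past the end of the structure");
        off = struct_end;
    }

    // Newer encoder: the inner structure grew a second field (version 2);
    // the outer structure stores the inner one followed by `tail`.
    inline Bytes encode_outer_new(std::uint32_t a, std::uint32_t b_new,
                                  std::uint32_t tail)
    {
        Bytes out;
        encode_block(out, 1, 1, [&](Bytes &o) {
            encode_block(o, 2, 1, [&](Bytes &i) {
                put_u32(i, a);
                put_u32(i, b_new);
            });
            put_u32(o, tail);
        });
        return out;
    }

    struct OldView { std::uint32_t a; std::uint32_t tail; };

    // Older decoder: understands only version 1 of the inner structure.
    inline OldView decode_outer_old(const Bytes &in)
    {
        OldView r{0, 0};
        std::size_t off = 0;
        decode_block(in, off, 1, [&](std::uint8_t) {
            decode_block(in, off, 1, [&](std::uint8_t) {
                r.a = get_u32(in, off);   // b_new stays undecoded
            });
            r.tail = get_u32(in, off);    // valid only because inner was skipped
        });
        return r;
    }

In this model the older decoder recovers both ``a`` and ``tail`` from a
buffer produced by the newer encoder. Without the final jump to
``struct_end`` in ``decode_block``, the read of ``tail`` would instead pick
up the bytes of the field the old decoder does not know about.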