Serialization (encode/decode)
=============================

When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write and read from
``bufferlist`` objects representing byte strings.

Terminology
-----------
It is best to think not in terms of daemons and clients but in terms of
encoders and decoders. An encoder serializes a structure into a bufferlist,
while a decoder does the opposite.

Encoders and decoders can be referred to collectively as dencoders.

Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client and the decoder by the OSD. However,
these roles can swap -- consider the handling of the response: the OSD encodes
the ``MOSDOpReply`` while the RBD client decodes it.

An encoder and a decoder operate according to a format that the programmer
defines by implementing the ``encode`` and ``decode`` methods.

Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.

The general rule is that a decoder must understand what has been
encoded by an encoder. Most of the problems come from ensuring
that compatibility continues between old decoders and new encoders
as well as new decoders and old encoders. One should assume
that -- unless explicitly stated otherwise -- any mix (old/new) is
possible in a cluster. There are two main reasons for that:

1. Upgrades. Although there are recommendations related to the order
   of entity types (mons/osds/clients), it is not mandatory and
   no assumption should be made about it.
2. Huge variability of client versions. It has always been the case
   that kernel (and thus kernel client) upgrades are decoupled
   from Ceph upgrades. Moreover, the proliferation of containerization
   brings the variability even to e.g. ``librbd`` -- user space
   libraries now live in the container itself.

With this being said, there are a few rules limiting the degree
of interoperability between dencoders:

* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally
  every client should be able to talk to any version of daemons.

As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for deprecations of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.

Frameworks
----------
Currently multiple kinds of dencoding helpers coexist:

* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.

Although details vary, the interoperability rules stay the same.

Adding a field to a structure
-----------------------------

You can see examples of this all over the Ceph code, but here's an
example:

.. code-block:: cpp

    class AcmeClass
    {
      int member1;
      std::string member2;

      void encode(bufferlist &bl)
      {
        ENCODE_START(1, 1, bl);
        ::encode(member1, bl);
        ::encode(member2, bl);
        ENCODE_FINISH(bl);
      }

      void decode(bufferlist::iterator &bl)
      {
        DECODE_START(1, bl);
        ::decode(member1, bl);
        ::decode(member2, bl);
        DECODE_FINISH(bl);
      }
    };

The ``ENCODE_START`` macro writes a header that specifies a *version* and
a *compat_version* (both initially 1). The message version is incremented
whenever a change is made to the encoding. The compat_version is incremented
only if the change will break existing decoders -- decoders are tolerant
of trailing bytes, so changes that add fields at the end of the structure
do not require incrementing compat_version.

The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle. This is compared with the
compat_version encoded in the message, and if the message is too new then
an exception will be thrown. Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.

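The compatibility check described above can be sketched in isolation. The snippet below is only an illustration of the comparison, not the code behind ``DECODE_START``; the ``Header`` struct and the ``check_compat`` and ``can_decode`` helpers are hypothetical names introduced for this example.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the two version bytes that an encoder writes.
struct Header {
  std::uint8_t struct_v;       // version of the format the encoder produced
  std::uint8_t struct_compat;  // oldest decoder version able to read it
};

// Hypothetical helper: a decoder that understands versions up to
// `supported_v` rejects any message whose compat version is newer.
inline void check_compat(std::uint8_t supported_v, const Header& h) {
  if (supported_v < h.struct_compat) {
    throw std::runtime_error(
        "decoder understands v" + std::to_string(supported_v) +
        " but the message requires at least v" +
        std::to_string(h.struct_compat));
  }
}

// Convenience wrapper that turns the exception into a boolean.
inline bool can_decode(std::uint8_t supported_v, const Header& h) {
  try {
    check_compat(supported_v, h);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Under these assumptions, a decoder that supports version 1 accepts a message with version 2 and compat_version 1 (a field was appended), but rejects a message with version 2 and compat_version 2 (the layout changed incompatibly).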
In practice, changes to encoding usually involve simply adding the desired fields
at the end of the ``encode`` and ``decode`` functions, and incrementing
the versions in ``ENCODE_START`` and ``DECODE_START``. For example, here's how
to add a third field to ``AcmeClass``:

.. code-block:: cpp

    class AcmeClass
    {
      int member1;
      std::string member2;
      std::vector<std::string> member3;

      void encode(bufferlist &bl)
      {
        ENCODE_START(2, 1, bl);
        ::encode(member1, bl);
        ::encode(member2, bl);
        ::encode(member3, bl);
        ENCODE_FINISH(bl);
      }

      void decode(bufferlist::iterator &bl)
      {
        DECODE_START(2, bl);
        ::decode(member1, bl);
        ::decode(member2, bl);
        if (struct_v >= 2) {
          ::decode(member3, bl);
        }
        DECODE_FINISH(bl);
      }
    };

Note that the compat_version did not change because the encoded message
will still be decodable by versions of the code that only understand
version 1 -- they will just ignore the trailing bytes where we encode ``member3``.

In the ``decode`` function, decoding the new field is conditional: this is
because we might still be passed older-versioned messages that do not
have the field. The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.

Into the weeds
--------------

The append-extensibility of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.

These macros implement the extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.

.. code-block:: cpp

    /**
     * start encoding block
     *
     * @param v current (code) version of the encoding
     * @param compat oldest code version that can decode it
     * @param bl bufferlist to encode to
     *
     */
    #define ENCODE_START(v, compat, bl)                          \
      __u8 struct_v = v;                                         \
      __u8 struct_compat = compat;                               \
      ceph_le32 struct_len;                                      \
      auto filler = (bl).append_hole(sizeof(struct_v) +          \
        sizeof(struct_compat) + sizeof(struct_len));             \
      const auto starting_bl_len = (bl).length();                \
      using ::ceph::encode;                                      \
      do {

The ``struct_len`` field allows the decoder to eat all the bytes that were
left undecoded by the user-provided ``decode`` implementation.
Analogously, the decoder tracks how much input has been consumed by the
user-provided ``decode`` method.

.. code-block:: cpp

    #define DECODE_START(v, bl)                                  \
      unsigned struct_end = 0;                                   \
      __u32 struct_len;                                          \
      decode(struct_len, bl);                                    \
      ...                                                        \
      struct_end = bl.get_off() + struct_len;                    \
      }                                                          \
      do {

The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist iterator is critical because dencoders
tend to be nested; just leaving it intact would work only for the very last
``decode`` call in a nested structure.

.. code-block:: cpp

    #define DECODE_FINISH(bl)                                    \
      } while (false);                                           \
      if (struct_end) {                                          \
        ...                                                      \
        if (bl.get_off() < struct_end)                           \
          bl += struct_end - bl.get_off();                       \
      }

This cooperative mechanism allows later revisions of an encoder
to emit more bytes (for example, after adding a new field at the end)
without worrying that the residue will crash older decoder revisions.
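
To make the nesting argument concrete, here is a minimal, self-contained sketch that imitates the mechanism with a plain byte vector instead of a ``bufferlist``. It is not Ceph code: ``Bytes``, ``begin_block``, ``end_block``, ``begin_decode`` and the inner/outer structures are all hypothetical names chosen for this illustration, and the header layout (version, compat, 32-bit little-endian length) is an assumption modelled on the three fields described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a 32-bit value in little-endian order.
inline void put_u32(Bytes& b, std::uint32_t v) {
  for (int i = 0; i < 4; ++i)
    b.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Read a 32-bit little-endian value and advance the offset.
inline std::uint32_t get_u32(const Bytes& b, std::size_t& off) {
  if (off + 4 > b.size()) throw std::out_of_range("short buffer");
  std::uint32_t v = 0;
  for (int i = 0; i < 4; ++i)
    v |= static_cast<std::uint32_t>(b[off + i]) << (8 * i);
  off += 4;
  return v;
}

// Write version, compat and a placeholder length; return the length's position.
inline std::size_t begin_block(Bytes& b, std::uint8_t v, std::uint8_t compat) {
  b.push_back(v);
  b.push_back(compat);
  const std::size_t len_pos = b.size();
  put_u32(b, 0);  // hole, patched by end_block()
  return len_pos;
}

// Patch the placeholder with the number of payload bytes written since.
inline void end_block(Bytes& b, std::size_t len_pos) {
  const auto len = static_cast<std::uint32_t>(b.size() - (len_pos + 4));
  for (int i = 0; i < 4; ++i)
    b[len_pos + i] = static_cast<std::uint8_t>(len >> (8 * i));
}

// Read the header, check compatibility, and return the offset of the block end.
inline std::size_t begin_decode(const Bytes& b, std::size_t& off,
                                std::uint8_t supported_v) {
  if (off + 2 > b.size()) throw std::out_of_range("short buffer");
  off += 1;  // struct_v is not needed by the decoders in this sketch
  const std::uint8_t struct_compat = b[off++];
  if (supported_v < struct_compat) throw std::runtime_error("message too new");
  const std::uint32_t struct_len = get_u32(b, off);
  if (off + struct_len > b.size()) throw std::out_of_range("bad length");
  return off + struct_len;
}

// "New" encoder for the inner structure: version 2 appends `extra`.
inline void encode_inner_v2(Bytes& b, std::uint32_t a, std::uint32_t extra) {
  const std::size_t p = begin_block(b, 2, 1);
  put_u32(b, a);
  put_u32(b, extra);  // unknown to version 1 decoders
  end_block(b, p);
}

// "Old" decoder for the inner structure: it only knows the first field.
inline std::uint32_t decode_inner_v1(const Bytes& b, std::size_t& off) {
  const std::size_t struct_end = begin_decode(b, off, 1);
  const std::uint32_t a = get_u32(b, off);
  off = struct_end;  // skip the trailing bytes this decoder does not understand
  return a;
}

struct Outer {
  std::uint32_t inner_a;
  std::uint32_t tail;
};

// The outer structure nests the inner one and then encodes its own field.
inline Bytes encode_outer(std::uint32_t a, std::uint32_t extra,
                          std::uint32_t tail) {
  Bytes b;
  const std::size_t p = begin_block(b, 1, 1);
  encode_inner_v2(b, a, extra);
  put_u32(b, tail);
  end_block(b, p);
  return b;
}

inline Outer decode_outer(const Bytes& b) {
  std::size_t off = 0;
  const std::size_t struct_end = begin_decode(b, off, 1);
  Outer o{};
  o.inner_a = decode_inner_v1(b, off);  // must leave `off` at the inner end
  o.tail = get_u32(b, off);
  off = struct_end;
  return o;
}
```

In this sketch the old inner decoder reads only its one known field and then jumps to the recorded end of the inner block. Without that jump, the outer decoder would read the unknown ``extra`` value as its own ``tail`` field, which is the failure mode that advancing the iterator prevents.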