Serialization (encode/decode)
=============================

When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write and read from
``bufferlist`` objects representing byte strings.

Terminology
-----------
It is best to think not in terms of daemons and clients but in terms of
encoders and decoders. An encoder serializes a structure into a bufferlist,
while a decoder does the opposite.

Encoders and decoders can be referred to collectively as dencoders.

Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client and the decoder by the OSD. However,
these roles can swap: when the response is handled, the OSD encodes
the ``MOSDOpReply`` and the RBD client decodes it.

The encoder and decoder operate according to a format that the programmer
defines by implementing the ``encode`` and ``decode`` methods.
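
The encoder/decoder pairing can be illustrated with a minimal, self-contained
sketch. It does not use the Ceph API: ``ToyOp`` is a hypothetical stand-in for
a message such as ``MOSDOp``, a ``std::string`` stands in for a ``bufferlist``,
and the fixed-width little-endian layout is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a Ceph message; not the real MOSDOp.
struct ToyOp {
  uint32_t object_id = 0;
  std::string payload;

  // Encoder: append the fields to a byte string in a fixed order.
  void encode(std::string& bl) const {
    put_u32(object_id, bl);
    put_u32(static_cast<uint32_t>(payload.size()), bl);
    bl.append(payload);
  }

  // Decoder: read the fields back in exactly the same order.
  void decode(const std::string& bl, std::size_t& off) {
    object_id = get_u32(bl, off);
    const uint32_t len = get_u32(bl, off);
    if (bl.size() - off < len) throw std::out_of_range("short payload");
    payload = bl.substr(off, len);
    off += len;
  }

 private:
  // Write a 32-bit value as four little-endian bytes.
  static void put_u32(uint32_t v, std::string& bl) {
    for (int i = 0; i < 4; ++i)
      bl.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }

  // Read four little-endian bytes and advance the offset.
  static uint32_t get_u32(const std::string& bl, std::size_t& off) {
    if (bl.size() < off + 4) throw std::out_of_range("short u32");
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
      v |= static_cast<uint32_t>(static_cast<unsigned char>(bl[off + i])) << (8 * i);
    off += 4;
    return v;
  }
};
```

The side that calls ``encode`` plays the encoder role and the side that calls
``decode`` plays the decoder role, regardless of whether it is a daemon or a
client.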

Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.

The general rule is that a decoder must understand what has been
encoded by an encoder. Most of the problems come from ensuring
that compatibility continues between old decoders and new encoders
as well as between new decoders and old encoders. One should assume
that, unless explicitly stated otherwise, any mix (old/new) is
possible in a cluster. There are two main reasons for that:

1. Upgrades. Although there are recommendations about the order in
   which entity types (mons/osds/clients) should be upgraded, that order
   is not mandatory and no assumption should be made about it.
2. Huge variability of client versions. Kernel upgrades (and thus
   kernel client upgrades) have always been decoupled from Ceph
   upgrades. Moreover, the proliferation of containerization
   brings this variability even to e.g. ``librbd``: user space
   libraries now live inside the containers themselves.

With this being said, there are a few rules limiting the degree
of interoperability between dencoders:

* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally
  every client should be able to talk to any version of the daemons.

As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for the deprecation of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.

Frameworks
----------
Currently multiple genres of dencoding helpers co-exist:

* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.

Although details vary, the interoperability rules stay the same.

Adding a field to a structure
-----------------------------

You can see examples of this all over the Ceph code, but here's an
example:

.. code-block:: cpp

    class AcmeClass
    {
      int member1;
      std::string member2;

      void encode(bufferlist &bl)
      {
        ENCODE_START(1, 1, bl);
        ::encode(member1, bl);
        ::encode(member2, bl);
        ENCODE_FINISH(bl);
      }

      void decode(bufferlist::iterator &bl)
      {
        DECODE_START(1, bl);
        ::decode(member1, bl);
        ::decode(member2, bl);
        DECODE_FINISH(bl);
      }
    };

The ``ENCODE_START`` macro writes a header that specifies a *version* and
a *compat_version* (both initially 1). The message version is incremented
whenever a change is made to the encoding. The compat_version is incremented
only if the change will break existing decoders -- decoders are tolerant
of trailing bytes, so changes that add fields at the end of the structure
do not require incrementing compat_version.

The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle. This is compared with the
compat_version encoded in the message, and if the message is too new then
an exception will be thrown. Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.
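
The compatibility check described above can be sketched in isolation. This is
a toy model, not the Ceph macros: ``make_header`` and ``check_header`` are
hypothetical helpers, the header is reduced to two bytes (the length field is
left out), and ``std::runtime_error`` stands in for whatever exception the real
decoder throws.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Toy header: one byte of struct_v followed by one byte of struct_compat.
// (The length field that the real header also carries is omitted here.)
inline std::string make_header(uint8_t struct_v, uint8_t struct_compat) {
  std::string bl;
  bl.push_back(static_cast<char>(struct_v));
  bl.push_back(static_cast<char>(struct_compat));
  return bl;
}

// Returns the encoded struct_v if a decoder that understands versions up to
// `supported_v` may decode the data; throws if the data is too new.
inline uint8_t check_header(const std::string& bl, uint8_t supported_v) {
  if (bl.size() < 2) throw std::runtime_error("truncated header");
  const auto struct_v = static_cast<uint8_t>(bl[0]);
  const auto struct_compat = static_cast<uint8_t>(bl[1]);
  if (supported_v < struct_compat)
    throw std::runtime_error("encoding requires a newer decoder");
  return struct_v;
}
```

With these helpers, ``check_header(make_header(2, 1), 1)`` succeeds because a
version 1 decoder is still allowed to read a version 2 encoding whose
compat_version is 1, whereas ``check_header(make_header(3, 3), 2)`` throws.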

In practice, changes to encoding usually involve simply adding the desired fields
at the end of the ``encode`` and ``decode`` functions, and incrementing
the versions in ``ENCODE_START`` and ``DECODE_START``. For example, here's how
to add a third field to ``AcmeClass``:

.. code-block:: cpp

    class AcmeClass
    {
      int member1;
      std::string member2;
      std::vector<std::string> member3;

      void encode(bufferlist &bl)
      {
        ENCODE_START(2, 1, bl);
        ::encode(member1, bl);
        ::encode(member2, bl);
        ::encode(member3, bl);
        ENCODE_FINISH(bl);
      }

      void decode(bufferlist::iterator &bl)
      {
        DECODE_START(2, bl);
        ::decode(member1, bl);
        ::decode(member2, bl);
        if (struct_v >= 2) {
          ::decode(member3, bl);
        }
        DECODE_FINISH(bl);
      }
    };

Note that the compat_version did not change because the encoded message
will still be decodable by versions of the code that only understand
version 1 -- they will just ignore the trailing bytes where we encode ``member3``.

In the ``decode`` function, decoding the new field is conditional: this is
because we might still be passed older-versioned messages that do not
have the field. The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.
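
The effect of the ``struct_v`` guard can be seen in a small standalone model.
Everything here is an assumption made for illustration: ``ToyAcme``,
``encode_v1``, ``encode_v2`` and ``decode_v2`` are hypothetical, each field is
a single byte, and the header is reduced to the version byte alone.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct ToyAcme {
  uint8_t member1 = 0;
  uint8_t member2 = 0;
  uint8_t member3 = 0;  // added in version 2
};

// Hypothetical version 1 encoder: knows nothing about member3.
inline std::vector<uint8_t> encode_v1(const ToyAcme& a) {
  return {1, a.member1, a.member2};
}

// Hypothetical version 2 encoder: appends member3 at the end.
inline std::vector<uint8_t> encode_v2(const ToyAcme& a) {
  return {2, a.member1, a.member2, a.member3};
}

// Version 2 decoder: accepts both encodings.
inline ToyAcme decode_v2(const std::vector<uint8_t>& bl) {
  if (bl.size() < 3) throw std::runtime_error("truncated");
  const uint8_t struct_v = bl[0];
  ToyAcme a;
  a.member1 = bl[1];
  a.member2 = bl[2];
  if (struct_v >= 2) {  // only present in newer encodings
    if (bl.size() < 4) throw std::runtime_error("truncated");
    a.member3 = bl[3];
  }
  return a;
}
```

When ``decode_v2`` is handed the output of ``encode_v1``, ``member3`` simply
keeps its default value; without the guard the decoder would read past the end
of the older encoding.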

Into the weeds
--------------

The append-extendability of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.

They implement the extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.

.. code-block:: cpp

    /**
     * start encoding block
     *
     * @param v current (code) version of the encoding
     * @param compat oldest code version that can decode it
     * @param bl bufferlist to encode to
     *
     */
    #define ENCODE_START(v, compat, bl)                          \
      __u8 struct_v = v;                                         \
      __u8 struct_compat = compat;                               \
      ceph_le32 struct_len;                                      \
      auto filler = (bl).append_hole(sizeof(struct_v) +          \
        sizeof(struct_compat) + sizeof(struct_len));             \
      const auto starting_bl_len = (bl).length();                \
      using ::ceph::encode;                                      \
      do {

The ``struct_len`` field allows the decoder to skip all the bytes that were
left undecoded in the user-provided ``decode`` implementation.
Analogously, the decoder tracks how much input has been decoded in the
user-provided ``decode`` methods.

.. code-block:: cpp

    #define DECODE_START(v, bl)                                  \
      unsigned struct_end = 0;                                   \
      __u32 struct_len;                                          \
      decode(struct_len, bl);                                    \
      ...                                                        \
      struct_end = bl.get_off() + struct_len;                    \
      }                                                          \
      do {

The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist is critical as dencoders tend to be nested;
just leaving it intact would work only for the very last ``decode`` call
in a nested structure.

.. code-block:: cpp

    #define DECODE_FINISH(bl)                                    \
      } while (false);                                           \
      if (struct_end) {                                          \
        ...                                                      \
        if (bl.get_off() < struct_end)                           \
          bl += struct_end - bl.get_off();                       \
      }

This cooperative mechanism allows later revisions of an encoder
to emit more bytes (for example, by adding a new field at the end)
without worrying that the extra bytes will crash older decoder revisions.
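
The need to advance past the undecoded bytes can be demonstrated with a toy
model of two inner structures encoded back to back. This is a sketch under
stated assumptions rather than the Ceph implementation: ``encode_block``,
``decode_inner_v1`` and ``decode_pair_v1`` are hypothetical helpers, and the
header uses a one-byte length instead of the 32-bit ``struct_len``.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Wrap a payload in a toy header: struct_v, struct_compat, struct_len.
// The one-byte length is a simplification; the real field is 32 bits wide.
inline void encode_block(uint8_t v, uint8_t compat, const Bytes& payload, Bytes& bl) {
  if (payload.size() > 255) throw std::length_error("payload too long for toy header");
  bl.push_back(v);
  bl.push_back(compat);
  bl.push_back(static_cast<uint8_t>(payload.size()));
  bl.insert(bl.end(), payload.begin(), payload.end());
}

// Old (version 1) decoder for the inner structure: it understands a single
// one-byte field and skips whatever else the encoder appended.
inline uint8_t decode_inner_v1(const Bytes& bl, std::size_t& off) {
  if (bl.size() - off < 3) throw std::runtime_error("truncated header");
  const uint8_t struct_compat = bl[off + 1];
  const uint8_t struct_len = bl[off + 2];
  off += 3;
  if (struct_compat > 1) throw std::runtime_error("encoding too new");
  if (struct_len < 1 || bl.size() - off < struct_len)
    throw std::runtime_error("truncated payload");
  const std::size_t struct_end = off + struct_len;
  const uint8_t field = bl[off];
  off = struct_end;  // discard bytes this decoder does not understand
  return field;
}

// Outer structure: two inner structures, one after the other.
struct Pair {
  uint8_t first;
  uint8_t second;
};

inline Pair decode_pair_v1(const Bytes& bl) {
  std::size_t off = 0;
  Pair p{};
  p.first = decode_inner_v1(bl, off);
  p.second = decode_inner_v1(bl, off);
  return p;
}
```

If a newer encoder writes each inner structure with an extra trailing byte,
the old decoder still lands on the second header because ``off`` is moved to
``struct_end``. Without that step the second call would start reading at the
leftover byte of the first structure.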