

Serialization (encode/decode)
=============================

When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write and read from
``bufferlist`` objects representing byte strings.

Terminology
-----------
It is best to think not in the domain of daemons and clients but
encoders and decoders. An encoder serializes a structure into a bufferlist
while a decoder does the opposite.

Encoders and decoders can be referred to collectively as dencoders.

Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client and the decoder by the OSD. However,
these roles can swap: when the response is handled, the OSD encodes
the ``MOSDOpReply`` while the RBD client decodes it.

Encoders and decoders operate according to a format that the programmer
defines by implementing the ``encode`` and ``decode`` methods.

Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.

The general rule is that a decoder must understand what has been
encoded by an encoder. Most problems come from ensuring
that compatibility continues between old decoders and new encoders
as well as between new decoders and old encoders. One should assume
that -- unless explicitly stated otherwise -- any mix (old/new) is
possible in a cluster. There are two main reasons for that:

1. Upgrades. Although there are recommendations related to the order
   of entity types (mons/osds/clients), it is not mandatory and
   no assumption should be made about it.
2. Huge variability of client versions. Kernel upgrades (and thus
   kernel client upgrades) have always been decoupled
   from Ceph upgrades. Moreover, the proliferation of containerization
   brings this variability even to e.g. ``librbd`` -- user space
   libraries now live inside the containers themselves.

With that said, there are a few rules limiting the degree
of interoperability between dencoders:

* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally
  every client should be able to talk to any version of the daemons.

As the underlying reasons are the same, the rules dencoders
follow are virtually the same as those for deprecating our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.

Frameworks
----------
Several families of dencoding helpers currently coexist:

* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.

Although details vary, the interoperability rules stay the same.

Adding a field to a structure
-----------------------------

You can see examples of this all over the Ceph code, but here's an
example:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;

        void encode(bufferlist &bl)
        {
            ENCODE_START(1, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(1, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            DECODE_FINISH(bl);
        }
    };

The ``ENCODE_START`` macro writes a header that specifies a *version* and
a *compat_version* (both initially 1).  The message version is incremented
whenever a change is made to the encoding.  The compat_version is incremented
only if the change will break existing decoders -- decoders are tolerant
of trailing bytes, so changes that add fields at the end of the structure
do not require incrementing compat_version.

The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle.  This is compared with the
compat_version encoded in the message, and if the message is too new then
an exception will be thrown.  Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.
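
The version check described above can be modeled in a few lines. The sketch
below is not Ceph code: ``Header`` and ``can_decode`` are hypothetical names
introduced only for illustration, and the real check lives inside
``DECODE_START``.

.. code-block:: cpp

    #include <cassert>
    #include <cstdint>

    // Hypothetical model of the two version bytes written by ENCODE_START.
    struct Header {
        std::uint8_t struct_v;       // version of the encoding that was written
        std::uint8_t struct_compat;  // oldest decoder version able to read it
    };

    // Returns true when a decoder that understands versions up to
    // `decoder_v` may read a payload carrying header `h`. This mirrors the
    // rule in the text: the decoder's version is compared with the encoded
    // compat_version, not with the encoded version.
    inline bool can_decode(std::uint8_t decoder_v, const Header& h)
    {
        return decoder_v >= h.struct_compat;
    }

Under this model, a version 1 decoder accepts ``Header{2, 1}`` (a newer
encoding that only appended fields) and rejects ``Header{3, 3}`` (an encoding
whose change broke older decoders).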

In practice, changes to encoding usually involve simply adding the desired fields
at the end of the ``encode`` and ``decode`` functions, and incrementing
the versions in ``ENCODE_START`` and ``DECODE_START``.  For example, here's how
to add a third field to ``AcmeClass``:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;
        std::vector<std::string> member3;

        void encode(bufferlist &bl)
        {
            ENCODE_START(2, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ::encode(member3, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(2, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            if (struct_v >= 2) {
                ::decode(member3, bl);
            }
            DECODE_FINISH(bl);
        }
    };

Note that the compat_version did not change because the encoded message
will still be decodable by versions of the code that only understand
version 1 -- they will just ignore the trailing bytes where we encode ``member3``.

In the ``decode`` function, decoding the new field is conditional: this is
because we might still be passed older-versioned messages that do not
have the field.  The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.

Into the weeds
--------------

The append-extensibility of our dencoders results from the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros provide.

These macros implement the extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.

.. code-block:: cpp

        /**
         * start encoding block
         *
         * @param v current (code) version of the encoding
         * @param compat oldest code version that can decode it
         * @param bl bufferlist to encode to
         *
         */
        #define ENCODE_START(v, compat, bl)                             \
          __u8 struct_v = v;                                            \
          __u8 struct_compat = compat;                                  \
          ceph_le32 struct_len;                                         \
          auto filler = (bl).append_hole(sizeof(struct_v) +             \
            sizeof(struct_compat) + sizeof(struct_len));                \
          const auto starting_bl_len = (bl).length();                   \
          using ::ceph::encode;                                         \
          do {
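
The macro above reserves a hole for the header because the payload length is
not known until all fields have been encoded. The following is a minimal
sketch of that pattern on a plain byte vector. ``Bytes``, ``kHeaderSize``,
``start_block`` and ``finish_block`` are hypothetical names, not Ceph API; the
six-byte size is taken from the three ``sizeof`` terms above, and the
little-endian length is an assumption based on the ``ceph_le32`` type.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;

    // struct_v (1 byte) + struct_compat (1 byte) + struct_len (4 bytes).
    constexpr std::size_t kHeaderSize = 6;

    // Reserve room for the header and return the offset of the hole.
    inline std::size_t start_block(Bytes& out)
    {
        const std::size_t hole = out.size();
        out.resize(out.size() + kHeaderSize);
        return hole;
    }

    // Fill the hole once the payload has been appended and its length is known.
    inline void finish_block(Bytes& out, std::size_t hole,
                             std::uint8_t v, std::uint8_t compat)
    {
        const auto len =
            static_cast<std::uint32_t>(out.size() - hole - kHeaderSize);
        out[hole] = v;
        out[hole + 1] = compat;
        for (int i = 0; i < 4; ++i)  // length stored least significant byte first
            out[hole + 2 + i] = static_cast<std::uint8_t>(len >> (8 * i));
    }

Calling ``start_block``, appending three payload bytes and then calling
``finish_block(out, hole, 2, 1)`` yields a nine-byte buffer whose header reads
version 2, compat 1, length 3.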

The ``struct_len`` field allows the decoder to skip all the bytes that were
left undecoded by the user-provided ``decode`` implementation.
Correspondingly, the decoder tracks how much input has been consumed by the
user-provided ``decode`` method.

.. code-block:: cpp

        #define DECODE_START(v, bl)                                     \
          unsigned struct_end = 0;                                      \
          __u32 struct_len;                                             \
          decode(struct_len, bl);                                       \
          ...                                                           \
          struct_end = bl.get_off() + struct_len;                       \
          }                                                             \
          do {


The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist iterator is critical because dencoders
tend to be nested; leaving it where it stopped would work only for the very
last ``decode`` call in a nested structure.

.. code-block:: cpp

        #define DECODE_FINISH(bl)                                       \
          } while (false);                                              \
          if (struct_end) {                                             \
            ...                                                         \
            if (bl.get_off() < struct_end)                              \
              bl += struct_end - bl.get_off();                          \
          }


This cooperative mechanism allows later revisions of an encoder
to emit a longer byte stream (e.g. by adding a new field at the end)
without worrying that the extra bytes will crash older decoder revisions.
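
To make the nested case concrete, here is a toy model that does not use any
Ceph headers. A version 2 encoder writes an inner structure with a new
trailing field, and a version 1 decoder reads the field it knows and then
jumps to the end of the block, so the enclosing structure's next field is
still found at the right offset. All names (``Cursor``, ``encode_inner_v2``,
``decode_inner_v1``) are hypothetical, and the one-byte fields are a
simplification.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Bytes = std::vector<std::uint8_t>;

    // Toy cursor standing in for a bufferlist iterator.
    struct Cursor {
        const Bytes& data;
        std::size_t off = 0;

        explicit Cursor(const Bytes& d) : data(d) {}

        std::uint8_t u8()
        {
            if (off >= data.size()) throw std::out_of_range("short buffer");
            return data[off++];
        }
        std::uint32_t le32()
        {
            std::uint32_t v = 0;
            for (int i = 0; i < 4; ++i)
                v |= static_cast<std::uint32_t>(u8()) << (8 * i);
            return v;
        }
    };

    // Version 2 encoder of a toy inner structure: field `a` plus new field `b`.
    inline void encode_inner_v2(Bytes& out, std::uint8_t a, std::uint8_t b)
    {
        out.push_back(2);  // struct_v
        out.push_back(1);  // struct_compat: version 1 decoders can still read it
        const std::uint32_t len = 2;  // two payload bytes follow the header
        for (int i = 0; i < 4; ++i)
            out.push_back(static_cast<std::uint8_t>(len >> (8 * i)));
        out.push_back(a);
        out.push_back(b);
    }

    // Version 1 decoder: knows only `a`, then skips to the end of the block.
    inline std::uint8_t decode_inner_v1(Cursor& c)
    {
        c.u8();  // struct_v, unused by this old decoder
        const std::uint8_t struct_compat = c.u8();
        if (struct_compat > 1) throw std::runtime_error("encoding too new");
        const std::uint32_t struct_len = c.le32();
        const std::size_t struct_end = c.off + struct_len;
        if (struct_end > c.data.size()) throw std::out_of_range("bad length");
        const std::uint8_t a = c.u8();
        if (c.off < struct_end) c.off = struct_end;  // discard unknown bytes
        return a;
    }

If the encoded inner structure is followed by one more byte belonging to the
outer structure, ``decode_inner_v1`` returns ``a`` and leaves the cursor at
offset 8, exactly where that outer byte starts. Without the final skip the
outer decoder would read ``b`` instead.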