Thrift Compact protocol encoding
================================

<!--
--------------------------------------------------------------------

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--------------------------------------------------------------------
-->

This document describes the wire encoding for RPC using the Thrift *compact protocol*.

The information here is _mostly_ based on the Java implementation in the Apache Thrift library (version 0.9.1) and
[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
should behave the same.

For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).

# Contents

* Compact protocol
  * Base types
  * Message
  * Struct
  * List and Set
  * Map
* BNF notation used in this document

# Compact protocol

## Base types

### Integer encoding

The _compact protocol_ uses two encodings for ints: the _zigzag int_ and the _var int_.

Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1,
-2, 2 or -3 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:
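
```scala
// `>>` is an arithmetic shift: `n >> 31` is all ones for negative n, zero otherwise
def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
// `>>>` is a logical shift; `-(n & 1)` is all ones when the lowest bit is set
def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
```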

The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each
byte forms the number, where the first byte has the least significant bits (so the 7-bit groups are in little endian
order).
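
The two steps can be sketched as follows. This is a minimal illustration, assuming the byte order used by the Java implementation, which writes the least significant 7-bit group first. The object and method names (`VarInt`, `encodeVarInt32`, `decodeVarInt32`) are made up for this example and are not part of any Thrift API.

```scala
object VarInt {
  // Zigzag transform for int32, as in the formulas above.
  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
  def zigzagToInt(n: Int): Int = (n >>> 1) ^ -(n & 1)

  // Encodes the 32 bits of `value`, read as an unsigned number, as a var int.
  // Assumption: the least significant 7-bit group is emitted first.
  def encodeVarInt32(value: Int): Array[Byte] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
    var v = value
    while ((v & ~0x7F) != 0) {
      out += ((v & 0x7F) | 0x80).toByte // continuation bit set: more bytes follow
      v >>>= 7
    }
    out += v.toByte // last byte, continuation bit clear
    out.toArray
  }

  // Decodes a var int from the start of `bytes`; returns (value, bytes consumed).
  def decodeVarInt32(bytes: Array[Byte]): (Int, Int) = {
    var result = 0
    var shift = 0
    var i = 0
    var more = true
    while (more) {
      val b = bytes(i)
      result |= (b & 0x7F) << shift
      more = (b & 0x80) != 0
      shift += 7
      i += 1
    }
    (result, i)
  }
}
```

For example, the int32 value 150 becomes the zigzag int 300, which this sketch encodes as the two bytes `0xAC 0x02`.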

Var ints are sometimes used directly inside the compact protocol to represent non-negative numbers.

To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply
uses a single byte as in the binary protocol.

### Enum encoding

The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.

### Binary encoding

Binary is sent as follows:

```
Compact protocol, binary data, 1+ bytes:
+--------+...+--------+--------+...+--------+
| byte length         | bytes               |
+--------+...+--------+--------+...+--------+
```

Where:

* `byte length` is the length of the byte array, using var int encoding (must be >= 0).
* `bytes` are the bytes of the byte array.

### String encoding

*String*s are first encoded to UTF-8, and then sent as binary.
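
A minimal sketch of the binary and string encodings described above. The names `BinaryEncoding`, `encodeBinary` and `encodeString` are hypothetical, and the var int helper assumes the least significant 7-bit group is written first, as the Java implementation does.

```scala
import java.nio.charset.StandardCharsets

object BinaryEncoding {
  // Var int encoder (assumption: least significant 7-bit group first).
  def encodeVarInt32(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: encodeVarInt32(value >>> 7)

  // Binary: byte length as a var int, followed by the raw bytes.
  def encodeBinary(bytes: Array[Byte]): Array[Byte] =
    encodeVarInt32(bytes.length) ++ bytes

  // String: UTF-8 bytes, sent as binary.
  def encodeString(s: String): Array[Byte] =
    encodeBinary(s.getBytes(StandardCharsets.UTF_8))
}
```

Note that the length prefix counts bytes, not characters: `encodeString("é")` carries a length of 2 because `é` occupies two bytes in UTF-8.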

### Double encoding

Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
layout. Most run-times provide a library to make this conversion. The binary protocol then encodes the int64 in 8 bytes
in big endian order, whereas the compact protocol, as implemented in Java, encodes it in 8 bytes in little endian order.
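
As an illustration, the conversion can be sketched with the JDK's `Double.doubleToLongBits` and `ByteBuffer`. The helper names are hypothetical, and the byte order is left as an explicit parameter so that the assumption is visible at the call site.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object DoubleEncoding {
  // Converts the double to its IEEE 754 bit pattern and lays it out in 8 bytes.
  def encodeDouble(d: Double, order: ByteOrder): Array[Byte] = {
    val bits = java.lang.Double.doubleToLongBits(d)
    ByteBuffer.allocate(8).order(order).putLong(bits).array()
  }

  // Reverses encodeDouble; `bytes` must hold at least 8 bytes.
  def decodeDouble(bytes: Array[Byte], order: ByteOrder): Double =
    java.lang.Double.longBitsToDouble(ByteBuffer.wrap(bytes).order(order).getLong())
}
```

The value `1.0` has the bit pattern `0x3FF0000000000000`, so it is laid out as `3F F0 00 00 00 00 00 00` in big endian order and as the reverse in little endian order.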

### Boolean encoding

Booleans are encoded differently depending on whether they are a field value (in a struct) or an element value (in a set,
list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an int8;
true as `1` and false as `0`.

## Message

A `Message` on the wire looks as follows:

```
Compact protocol Message (4+ bytes):
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
|pppppppp|mmmvvvvv| seq id              | name length         | name                |
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
```

Where:

* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82.
* `mmm` is the message type, an unsigned 3 bit integer.
* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`.
* `seq id` is the sequence id, a signed 32 bit integer encoded as a var int.
* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0).
* `name` is the method name to invoke, a UTF-8 encoded string.

Message types are encoded with the following values:

* _Call_: 1
* _Reply_: 2
* _Exception_: 3
* _Oneway_: 4
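
The message header layout above can be sketched as follows. The names (`MessageHeader`, `encode`) are hypothetical. The sketch assumes that the sequence id is written as a plain var int without the zigzag step, which is how we read the description above, and that var ints carry the least significant 7-bit group first.

```scala
import java.nio.charset.StandardCharsets

object MessageHeader {
  val ProtocolId = 0x82
  val Version = 1

  // Message types from the list above.
  val Call = 1
  val Reply = 2
  val Exception = 3
  val Oneway = 4

  // Var int encoder (assumption: least significant 7-bit group first).
  def encodeVarInt32(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: encodeVarInt32(value >>> 7)

  def encode(messageType: Int, seqId: Int, name: String): Array[Byte] = {
    val nameBytes = name.getBytes(StandardCharsets.UTF_8)
    // mmmvvvvv: message type in the high 3 bits, version in the low 5 bits.
    val typeAndVersion = ((messageType & 0x07) << 5) | (Version & 0x1F)
    Array(ProtocolId.toByte, typeAndVersion.toByte) ++
      encodeVarInt32(seqId) ++
      encodeVarInt32(nameBytes.length) ++
      nameBytes
  }
}
```

With these assumptions, a _Call_ to a method named `ping` with sequence id 1 starts with the bytes `82 21 01 04`, followed by the four bytes of the name.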

## Struct

A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
is followed by the encoded field value. The encoding can be summarized by the following BNF:

```
struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id
```

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore it is also
possible to handle unknown fields while decoding; these are simply ignored. While decoding, the field type can be used to
determine how to decode the field value.

Note that the field name is not encoded, so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
a different field-type than what is expected. Theoretically this could be detected at the cost of some additional
checking. Other implementations may perform this check and then either ignore the field, or return a protocol exception.

A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An *Exception* is encoded exactly the same as a struct.

### Struct encoding

```
Compact protocol field header (short form) and field value:
+--------+--------+...+--------+
|ddddtttt| field value         |
+--------+--------+...+--------+

Compact protocol field header (2 to 4 bytes, long form) and field value:
+--------+--------+...+--------+--------+...+--------+
|0000tttt| field id            | field value         |
+--------+--------+...+--------+--------+...+--------+

Compact protocol stop field:
+--------+
|00000000|
+--------+
```

Where:

* `dddd` is the field id delta, an unsigned 4 bit integer, strictly positive.
* `tttt` is the field-type id, an unsigned 4 bit integer.
* `field id` is the field id, a signed 16 bit integer encoded as a zigzag int.
* `field value` is the encoded field value.

The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the
first field of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive).

The following field-types can be encoded:

* `BOOLEAN_TRUE`, encoded as `1`
* `BOOLEAN_FALSE`, encoded as `2`
* `BYTE`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for both structs and union fields, encoded as `12`

Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no
length (0 bytes).
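
The choice between the short and long form of the field header can be sketched like this. The names (`FieldHeader`, `encode`) are hypothetical; `previousFieldId` is taken to be 0 for the first field, which yields the "just `current-field-id`" rule, and var ints are assumed to carry the least significant 7-bit group first.

```scala
object FieldHeader {
  // A few field-types from the list above.
  val BooleanTrue = 1
  val I32 = 5
  val Struct = 12

  // The stop field is a single zero byte.
  val StopField: Byte = 0

  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)

  // Var int encoder (assumption: least significant 7-bit group first).
  def encodeVarInt32(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: encodeVarInt32(value >>> 7)

  // Returns only the header bytes; the field value (if any) follows separately.
  def encode(fieldType: Int, fieldId: Int, previousFieldId: Int): Array[Byte] = {
    val delta = fieldId - previousFieldId
    if (delta >= 1 && delta <= 15)
      Array(((delta << 4) | fieldType).toByte) // short form: ddddtttt
    else
      fieldType.toByte +: encodeVarInt32(intToZigZag(fieldId)) // long form: 0000tttt + field id
  }
}
```

For instance, an `I32` field with id 1 at the start of a struct gets the one-byte header `0x15`, while an `I32` field with id 20 following field 1 needs the long form `05 28`. A boolean field that is true is complete after its header byte, since the value is carried by the field-type itself.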

## List and Set

Lists and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the
encoded elements.

```
Compact protocol list header (1 byte, short form) and elements:
+--------+--------+...+--------+
|sssstttt| elements            |
+--------+--------+...+--------+

Compact protocol list header (2+ bytes, long form) and elements:
+--------+--------+...+--------+--------+...+--------+
|1111tttt| size                | elements            |
+--------+--------+...+--------+--------+...+--------+
```

Where:

* `ssss` is the size, a 4 bit unsigned int, values `0` - `14`
* `tttt` is the element-type, a 4 bit unsigned int
* `size` is the size, a var int (int32), positive values `15` or higher
* `elements` are the encoded elements

The short form should be used when the length is in the range 0 - 14 (inclusive).
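
A minimal sketch of the list/set header, with hypothetical names (`ListHeader`, `encode`). The element-type is passed in as its numeric code, and var ints are assumed to carry the least significant 7-bit group first.

```scala
object ListHeader {
  // Var int encoder (assumption: least significant 7-bit group first).
  def encodeVarInt32(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: encodeVarInt32(value >>> 7)

  // Returns only the header bytes; the encoded elements follow separately.
  def encode(elementType: Int, size: Int): Array[Byte] = {
    require(size >= 0, "size must not be negative")
    if (size <= 14)
      Array(((size << 4) | elementType).toByte) // short form: sssstttt
    else
      (0xF0 | elementType).toByte +: encodeVarInt32(size) // long form: 1111tttt + size
  }
}
```

With element-type code 12, a list of 3 elements has the header `0x3C`, and a list of 20 elements has the header `FC 14`.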

The following element-types are used (they use the same numbering as the field-types, except that a single `BOOL` type
replaces `BOOLEAN_TRUE` and `BOOLEAN_FALSE`):

* `BOOL`, encoded as `2`
* `BYTE`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for structs and union fields, encoded as `12`

The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

## Map

Maps are encoded with a header indicating the size, the element-type of the keys and the element-type of the values,
followed by the encoded key value pairs. The encoding follows this BNF:

```
map           ::= empty-map | non-empty-map
empty-map     ::= `0`
non-empty-map ::= size key-element-type value-element-type (key value)+
```

```
Compact protocol map header (1 byte, empty map):
+--------+
|00000000|
+--------+

Compact protocol map header (2+ bytes, non empty map) and key value pairs:
+--------+...+--------+--------+--------+...+--------+
| size                |kkkkvvvv| key value pairs     |
+--------+...+--------+--------+--------+...+--------+
```

Where:

* `size` is the size, a var int (int32), strictly positive values
* `kkkk` is the key element-type, a 4 bit unsigned int
* `vvvv` is the value element-type, a 4 bit unsigned int
* `key value pairs` are the encoded keys and values

The element-types are the same as for lists. The full list is included in the 'List and Set' section.

The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).
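
The map header can be sketched in the same way. The names (`MapHeader`, `encode`) are hypothetical, the element-types are passed in as their numeric codes, and var ints are assumed to carry the least significant 7-bit group first.

```scala
object MapHeader {
  // Var int encoder (assumption: least significant 7-bit group first).
  def encodeVarInt32(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: encodeVarInt32(value >>> 7)

  // Returns only the header bytes; the encoded key value pairs follow separately.
  def encode(keyType: Int, valueType: Int, size: Int): Array[Byte] = {
    require(size >= 0, "size must not be negative")
    if (size == 0)
      Array(0.toByte) // empty map: a single zero byte, no type byte
    else
      encodeVarInt32(size) :+ ((keyType << 4) | valueType).toByte // size, then kkkkvvvv
  }
}
```

With key element-type code 3 and value element-type code 12, a map of 2 entries has the header `02 3C`, while an empty map is just `00` regardless of its types.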

# BNF notation used in this document

The following BNF notation is used:

* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
* a pipe `|` between items represents choice; the first matching item is selected
* parentheses `(` and `)` are used for grouping multiple items