# DOM

Document Object Model (DOM) is an in-memory representation of JSON for query and manipulation. The basic usage of DOM is described in [Tutorial](doc/tutorial.md). This section describes some details and more advanced usage.

[TOC]

# Template {#Template}

In the tutorial, `Value` and `Document` were used. Like `std::string`, these are actually `typedef`s of template classes:

~~~~~~~~~~cpp
namespace rapidjson {

template <typename Encoding, typename Allocator = MemoryPoolAllocator<> >
class GenericValue {
    // ...
};

template <typename Encoding, typename Allocator = MemoryPoolAllocator<> >
class GenericDocument : public GenericValue<Encoding, Allocator> {
    // ...
};

typedef GenericValue<UTF8<> > Value;
typedef GenericDocument<UTF8<> > Document;

} // namespace rapidjson
~~~~~~~~~~

Users can customize these template parameters.

## Encoding {#Encoding}

The `Encoding` parameter specifies the encoding of JSON string values in memory. Possible options are `UTF8`, `UTF16`, and `UTF32`. Note that these three types are also template classes. `UTF8<>` is `UTF8<char>`, which means using `char` to store the characters. Refer to [Encoding](doc/encoding.md) for details.

Suppose a Windows application queries localization strings stored in JSON files. Unicode-enabled functions in Windows use UTF-16 (wide character) encoding. No matter what encoding is used in the JSON files, we can store the strings in UTF-16 in memory.

~~~~~~~~~~cpp
#include <cstdio>
#include <windows.h>                  // MessageBoxW
#include "rapidjson/document.h"
#include "rapidjson/encodedstream.h"  // AutoUTFInputStream
#include "rapidjson/filereadstream.h" // FileReadStream

using namespace rapidjson;

typedef GenericDocument<UTF16<> > WDocument;
typedef GenericValue<UTF16<> > WValue;

FILE* fp = fopen("localization.json", "rb"); // non-Windows use "r"

char readBuffer[256];
FileReadStream bis(fp, readBuffer, sizeof(readBuffer));

AutoUTFInputStream<unsigned, FileReadStream> eis(bis);  // wraps bis into eis

WDocument d;
d.ParseStream<0, AutoUTF<unsigned> >(eis);
fclose(fp); // the file is no longer needed once parsing is done

const WValue locale(L"ja"); // Japanese

MessageBoxW(hWnd, d[locale].GetString(), L"Test", MB_OK);
~~~~~~~~~~

## Allocator {#Allocator}

The `Allocator` parameter defines which allocator class is used when allocating/deallocating memory for `Document`/`Value`. `Document` owns or references an `Allocator` instance. `Value`, on the other hand, does not, in order to reduce memory consumption.

The default allocator used in `GenericDocument` is `MemoryPoolAllocator`. This allocator allocates memory sequentially and cannot deallocate blocks one by one. This makes it very suitable for parsing a JSON into a DOM tree.

Another allocator is `CrtAllocator`, where CRT is short for C RunTime library. This allocator simply calls the standard `malloc()`/`realloc()`/`free()`. When there are a lot of add and remove operations, this allocator may be preferred. However, it is far less efficient than `MemoryPoolAllocator`.

# Parsing {#Parsing}

`Document` provides several functions for parsing. In the list below, (1) is the fundamental function, while the others are helpers which call (1).

~~~~~~~~~~cpp
using namespace rapidjson;

// (1) Fundamental
template <unsigned parseFlags, typename SourceEncoding, typename InputStream>
GenericDocument& GenericDocument::ParseStream(InputStream& is);

// (2) Using the same Encoding for stream
template <unsigned parseFlags, typename InputStream>
GenericDocument& GenericDocument::ParseStream(InputStream& is);

// (3) Using default parse flags
template <typename InputStream>
GenericDocument& GenericDocument::ParseStream(InputStream& is);

// (4) In situ parsing
template <unsigned parseFlags>
GenericDocument& GenericDocument::ParseInsitu(Ch* str);

// (5) In situ parsing, using default parse flags
GenericDocument& GenericDocument::ParseInsitu(Ch* str);

// (6) Normal parsing of a string
template <unsigned parseFlags, typename SourceEncoding>
GenericDocument& GenericDocument::Parse(const Ch* str);

// (7) Normal parsing of a string, using same Encoding of Document
template <unsigned parseFlags>
GenericDocument& GenericDocument::Parse(const Ch* str);

// (8) Normal parsing of a string, using default parse flags
GenericDocument& GenericDocument::Parse(const Ch* str);
~~~~~~~~~~

The examples in [tutorial](doc/tutorial.md) use (8) for normal parsing of a string. The examples in [stream](doc/stream.md) use the first three. *In situ* parsing is described below.

The `parseFlags` are a combination of the following bit-flags:

Parse flags                   | Meaning
------------------------------|-----------------------------------
`kParseNoFlags`               | No flag is set.
`kParseDefaultFlags`          | Default parse flags. It is equal to macro `RAPIDJSON_PARSE_DEFAULT_FLAGS`, which is defined as `kParseNoFlags`.
`kParseInsituFlag`            | In situ (destructive) parsing.
`kParseValidateEncodingFlag`  | Validate encoding of JSON strings.
`kParseIterativeFlag`         | Iterative (constant complexity in terms of function call stack size) parsing.
`kParseStopWhenDoneFlag`      | After parsing a complete JSON root from the stream, stop further processing of the rest of the stream. When this flag is used, the parser will not generate the `kParseErrorDocumentRootNotSingular` error. Use this flag to parse multiple JSONs in the same stream.
`kParseFullPrecisionFlag`     | Parse numbers in full precision (slower). If this flag is not set, normal precision (faster) is used. Normal precision has a maximum error of 3 [ULP](http://en.wikipedia.org/wiki/Unit_in_the_last_place).
`kParseCommentsFlag`          | Allow one-line `// ...` and multi-line `/* ... */` comments (relaxed JSON syntax).
`kParseNumbersAsStringsFlag`  | Parse numerical type values as strings.
`kParseTrailingCommasFlag`    | Allow trailing commas at the end of objects and arrays (relaxed JSON syntax).
`kParseNanAndInfFlag`         | Allow parsing `NaN`, `Inf`, `Infinity`, `-Inf` and `-Infinity` as `double` values (relaxed JSON syntax).
`kParseEscapedApostropheFlag` | Allow escaped apostrophe `\'` in strings (relaxed JSON syntax).

By using a non-type template parameter instead of a function parameter, the C++ compiler can generate code optimized for the specified combination, improving speed and reducing code size (if only a single specialization is used). The downside is that the flags must be determined at compile time.

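The effect of passing flags as a template argument can be illustrated without RapidJSON. The sketch below is a toy under stated assumptions, not the library's parser: `SketchFlag` and `CountDigits` are hypothetical names. Because `flags` is a compile-time constant in each instantiation, the tests on it can be resolved by the compiler, and every distinct flag combination produces its own specialization, e.g. `CountDigits<kSketchSkipSpaces | kSketchAllowUnderscore>("1_000 000")`.

```cpp
#include <cassert>

// Hypothetical flags for this sketch; they are not RapidJSON's enumerators.
enum SketchFlag {
    kSketchNoFlags = 0,
    kSketchSkipSpaces = 1,      // tolerate ' ' between digits
    kSketchAllowUnderscore = 2  // tolerate '_' as a digit separator
};

// Counts the decimal digits in `text`; returns -1 on any character that the
// selected flags do not permit.
template <unsigned flags>
int CountDigits(const char* text) {
    assert(text != 0);
    int count = 0;
    for (; *text != '\0'; ++text) {
        const char c = *text;
        if (c >= '0' && c <= '9')
            ++count;
        else if ((flags & kSketchSkipSpaces) != 0 && c == ' ')
            continue;  // branch can be dropped when the flag is not set
        else if ((flags & kSketchAllowUnderscore) != 0 && c == '_')
            continue;  // likewise resolved per instantiation
        else
            return -1;
    }
    return count;
}
```
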
The `SourceEncoding` parameter defines the encoding of the stream. This can differ from the `Encoding` of the `Document`. See the [Transcoding and Validation](#TranscodingAndValidation) section for details.

`InputStream` is the type of the input stream.

## Parse Error {#ParseError}

When parsing succeeds, the `Document` contains the parse results. When there is an error, the original DOM is *unchanged*, and the error state can be obtained via `bool HasParseError()`, `ParseErrorCode GetParseError()` and `size_t GetErrorOffset()`.

Parse Error Code                            | Description
--------------------------------------------|---------------------------------------------------
`kParseErrorNone`                           | No error.
`kParseErrorDocumentEmpty`                  | The document is empty.
`kParseErrorDocumentRootNotSingular`        | The document root must not be followed by other values.
`kParseErrorValueInvalid`                   | Invalid value.
`kParseErrorObjectMissName`                 | Missing a name for object member.
`kParseErrorObjectMissColon`                | Missing a colon after a name of object member.
`kParseErrorObjectMissCommaOrCurlyBracket`  | Missing a comma or `}` after an object member.
`kParseErrorArrayMissCommaOrSquareBracket`  | Missing a comma or `]` after an array element.
`kParseErrorStringUnicodeEscapeInvalidHex`  | Incorrect hex digit after `\u` escape in string.
`kParseErrorStringUnicodeSurrogateInvalid`  | The surrogate pair in string is invalid.
`kParseErrorStringEscapeInvalid`            | Invalid escape character in string.
`kParseErrorStringMissQuotationMark`        | Missing a closing quotation mark in string.
`kParseErrorStringInvalidEncoding`          | Invalid encoding in string.
`kParseErrorNumberTooBig`                   | Number too big to be stored in `double`.
`kParseErrorNumberMissFraction`             | Missing fraction part in number.
`kParseErrorNumberMissExponent`             | Missing exponent in number.

The error offset is defined as the number of characters from the beginning of the stream. Currently RapidJSON does not keep track of line numbers.

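Since only an offset is reported, a caller that wants a line and column for diagnostics has to derive them from the input text. A minimal sketch, assuming the whole JSON text is available as a `std::string` and that the offset counts `char` units; `LineColumn` and `OffsetToLineColumn` are hypothetical helpers, not part of RapidJSON. The offset passed in would typically be the value returned by `GetErrorOffset()`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type; not part of RapidJSON.
struct LineColumn {
    std::size_t line;    // 1-based
    std::size_t column;  // 1-based, counted in char units (bytes for UTF-8 text)
};

// Converts a character offset into a line/column pair by counting the '\n'
// characters that precede it. Precondition: offset <= json.size().
inline LineColumn OffsetToLineColumn(const std::string& json, std::size_t offset) {
    assert(offset <= json.size());
    LineColumn pos = {1, 1};
    for (std::size_t i = 0; i < offset; ++i) {
        if (json[i] == '\n') {
            ++pos.line;
            pos.column = 1;
        } else {
            ++pos.column;
        }
    }
    return pos;
}
```
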
To get an error message, RapidJSON provides English messages in `rapidjson/error/en.h`. Users can customize them for other locales, or use a custom localization system.

Here is an example of parse error handling.

~~~~~~~~~~cpp
#include <cstdio>
#include "rapidjson/document.h"
#include "rapidjson/error/en.h"

using namespace rapidjson;

// ...
Document d;
if (d.Parse(json).HasParseError()) {
    fprintf(stderr, "\nError(offset %u): %s\n",
        (unsigned)d.GetErrorOffset(),
        GetParseError_En(d.GetParseError()));
    // ...
}
~~~~~~~~~~

## In Situ Parsing {#InSituParsing}

From [Wikipedia](http://en.wikipedia.org/wiki/In_situ):

> *In situ* ... is a Latin phrase that translates literally to "on site" or "in position". It means "locally", "on site", "on the premises" or "in place" to describe an event where it takes place, and is used in many different contexts.
> ...
> (In computer science) An algorithm is said to be an in situ algorithm, or in-place algorithm, if the extra amount of memory required to execute the algorithm is O(1), that is, does not exceed a constant no matter how large the input. For example, heapsort is an in situ sorting algorithm.

In the normal parsing process, a large overhead comes from decoding JSON strings and copying them to other buffers. *In situ* parsing decodes each JSON string at the place where it is stored. This is possible in JSON because the decoded string is never longer than its encoded form in the JSON text. In this context, decoding a JSON string means processing the escapes, such as `"\n"`, `"\u1234"`, etc., and adding a null terminator (`'\0'`) at the end of the string.

The following diagrams compare normal and *in situ* parsing. The JSON string values contain pointers to the decoded strings.

![normal parsing](diagram/normalparsing.png)

In normal parsing, the decoded strings are copied to freshly allocated buffers. `"\\n"` (2 characters) is decoded as `"\n"` (1 character). `"\\u0073"` (6 characters) is decoded as `"s"` (1 character).

![in situ parsing](diagram/insituparsing.png)

*In situ* parsing just modifies the original JSON. Updated characters are highlighted in the diagram. If the JSON string does not contain any escape character, such as `"msg"`, the parsing process merely replaces the closing double quotation mark with a null character.

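The in-place decoding step can be sketched on a single string. This is a simplified illustration, not RapidJSON's implementation: `DecodeStringInSitu` is a hypothetical function, and it handles only the single-character escapes, not `\uXXXX`. The write position never overtakes the read position, which is why no second buffer is needed.

```cpp
#include <cassert>
#include <cstddef>

// Decodes the JSON string whose opening quote is at `open_quote`, writing the
// result over the same bytes. Returns a pointer to the decoded, null-terminated
// text and stores its length, or returns a null pointer if the string is
// unterminated or uses an escape that this sketch does not support.
inline char* DecodeStringInSitu(char* open_quote, std::size_t* decoded_length) {
    assert(open_quote != 0 && *open_quote == '"' && decoded_length != 0);
    char* read = open_quote + 1;  // next encoded character to consume
    char* write = read;           // next decoded character to store
    char* const begin = write;
    for (;;) {
        char c = *read++;
        if (c == '"') break;      // closing quotation mark
        if (c == '\0') return 0;  // buffer ended before the string did
        if (c == '\\') {
            switch (*read++) {
                case '"':  c = '"';  break;
                case '\\': c = '\\'; break;
                case '/':  c = '/';  break;
                case 'b':  c = '\b'; break;
                case 'f':  c = '\f'; break;
                case 'n':  c = '\n'; break;
                case 'r':  c = '\r'; break;
                case 't':  c = '\t'; break;
                default:   return 0;  // includes \uXXXX, omitted in this sketch
            }
        }
        *write++ = c;
    }
    // Without escapes, `write` now points at the closing quote itself.
    *write = '\0';
    *decoded_length = static_cast<std::size_t>(write - begin);
    return begin;
}
```
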
Since *in situ* parsing modifies the input, the parsing API needs `char*` instead of `const char*`.

~~~~~~~~~~cpp
#include <cstdio>
#include <cstdlib>
#include "rapidjson/document.h"

using namespace rapidjson;

// Read whole file into a buffer
FILE* fp = fopen("test.json", "r");
fseek(fp, 0, SEEK_END);
size_t filesize = (size_t)ftell(fp);
fseek(fp, 0, SEEK_SET);
char* buffer = (char*)malloc(filesize + 1);
size_t readLength = fread(buffer, 1, filesize, fp);
buffer[readLength] = '\0';
fclose(fp);

// In situ parsing the buffer into d, buffer will also be modified
Document d;
d.ParseInsitu(buffer);

// Query/manipulate the DOM here...

free(buffer);
// Note: At this point, d may have dangling pointers pointing to the deallocated buffer.
~~~~~~~~~~

The JSON strings are marked as const-strings, but they may not be truly "constant": their lifetime depends on the JSON buffer.

*In situ* parsing minimizes allocation overhead and memory copying. Generally this improves cache locality, which is an important performance factor on modern computers.

There are some limitations of *in situ* parsing:

1. The whole JSON must be in memory.
2. The source encoding of the stream and the target encoding of the document must be the same.
3. The buffer needs to be retained until the document is no longer used.
4. If the DOM needs to be used for a long period after parsing, and there are few JSON strings in the DOM, retaining the buffer may waste memory.

*In situ* parsing is most suitable for short-term JSON that only needs to be processed once and then released from memory. In practice, these situations are very common, for example, deserializing JSON to C++ objects, processing web requests represented in JSON, etc.

## Transcoding and Validation {#TranscodingAndValidation}

RapidJSON supports conversion between Unicode formats (officially termed UCS Transformation Format) internally. During DOM parsing, the source encoding of the stream can be different from the encoding of the DOM. For example, the source stream may contain UTF-8 JSON while the DOM uses UTF-16 encoding. There is example code in [EncodedInputStream](doc/stream.md).

When writing JSON from a DOM to an output stream, transcoding can also be used. An example is in [EncodedOutputStream](doc/stream.md).

During transcoding, the source string is decoded into Unicode code points, and then the code points are encoded in the target format. During decoding, the byte sequence in the source string is validated. If it is not a valid sequence, the parser stops with a `kParseErrorStringInvalidEncoding` error.

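The decode-then-encode pipeline can be sketched for one direction, UTF-8 to UTF-16. The functions below are hypothetical and independent of RapidJSON's own encoding classes; they only illustrate why validation falls out of transcoding: a code point cannot be produced from an ill-formed byte sequence, so the conversion fails at that point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes one UTF-8 sequence starting at in[pos] and advances pos. Returns
// false for an invalid lead or continuation byte, a truncated sequence, an
// overlong form, a surrogate, or a value above U+10FFFF.
inline bool DecodeUtf8(const std::string& in, std::size_t& pos, unsigned& codepoint) {
    assert(pos < in.size());
    const unsigned char lead = static_cast<unsigned char>(in[pos++]);
    unsigned extra = 0, minimum = 0;
    if (lead < 0x80) { codepoint = lead; return true; }
    if ((lead & 0xE0) == 0xC0)      { codepoint = lead & 0x1F; extra = 1; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { codepoint = lead & 0x0F; extra = 2; minimum = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { codepoint = lead & 0x07; extra = 3; minimum = 0x10000; }
    else return false;
    for (; extra > 0; --extra) {
        if (pos >= in.size()) return false;
        const unsigned char c = static_cast<unsigned char>(in[pos++]);
        if ((c & 0xC0) != 0x80) return false;
        codepoint = (codepoint << 6) | (c & 0x3F);
    }
    if (codepoint < minimum || codepoint > 0x10FFFF) return false;
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF) return false;
    return true;
}

// Encodes one code point as one UTF-16 unit or as a surrogate pair.
inline void EncodeUtf16(unsigned codepoint, std::u16string& out) {
    if (codepoint < 0x10000) {
        out.push_back(static_cast<char16_t>(codepoint));
    } else {
        const unsigned v = codepoint - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
}

// Transcodes the whole string; returns false at the first invalid sequence,
// in which case `out` holds only the part converted so far.
inline bool TranscodeUtf8ToUtf16(const std::string& in, std::u16string& out) {
    out.clear();
    for (std::size_t pos = 0; pos < in.size();) {
        unsigned codepoint = 0;
        if (!DecodeUtf8(in, pos, codepoint)) return false;
        EncodeUtf16(codepoint, out);
    }
    return true;
}
```
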
When the source encoding of the stream is the same as the encoding of the DOM, by default the parser will *not* validate the sequence. Users may use `kParseValidateEncodingFlag` to force validation.

# Techniques {#Techniques}

Some techniques for using the DOM API are discussed here.

## DOM as SAX Event Publisher

In RapidJSON, stringifying a DOM with `Writer` may look a little bit weird.

~~~~~~~~~~cpp
// ...
Writer<StringBuffer> writer(buffer);
d.Accept(writer);
~~~~~~~~~~

Actually, `Value::Accept()` is responsible for publishing SAX events about the value to the handler. With this design, `Value` and `Writer` are decoupled. `Value` can generate SAX events, and `Writer` can handle those events.

Users may create custom handlers for transforming the DOM into other formats, for example, a handler which converts the DOM into XML.

For more about SAX events and handlers, please refer to [SAX](doc/sax.md).

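The decoupling between the value and the handler can be shown with a toy model. This sketch assumes a drastically reduced DOM (`Node` holds either an `int` or an array) and a reduced event set (`Int`, `StartArray`, `EndArray`); all names are hypothetical and it does not use RapidJSON. The same `Accept()` drives two unrelated handlers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy DOM node: either an integer or an array of nodes.
struct Node {
    bool is_array;
    int number;                  // used when !is_array
    std::vector<Node> elements;  // used when is_array

    // Publishes events describing this node. The node only requires that the
    // handler provides Int(), StartArray() and EndArray().
    template <typename Handler>
    void Accept(Handler& handler) const {
        if (!is_array) {
            handler.Int(number);
            return;
        }
        handler.StartArray();
        for (std::size_t i = 0; i < elements.size(); ++i)
            elements[i].Accept(handler);
        handler.EndArray();
    }
};

inline Node MakeInt(int n) {
    Node v;
    v.is_array = false;
    v.number = n;
    return v;
}

inline Node MakeArray(const std::vector<Node>& elements) {
    Node v;
    v.is_array = true;
    v.number = 0;
    v.elements = elements;
    return v;
}

// Handler 1: turns the events into JSON text, the role Writer plays.
struct TextHandler {
    std::string text;
    bool need_comma = false;
    void Int(int n) { Separate(); text += std::to_string(n); need_comma = true; }
    void StartArray() { Separate(); text += '['; need_comma = false; }
    void EndArray() { text += ']'; need_comma = true; }
    void Separate() { if (need_comma) text += ','; }
};

// Handler 2: ignores the values and measures the nesting depth.
struct DepthHandler {
    int depth = 0;
    int max_depth = 0;
    void Int(int) {}
    void StartArray() { if (++depth > max_depth) max_depth = depth; }
    void EndArray() { assert(depth > 0); --depth; }
};
```

Swapping the handler changes what is produced without touching `Node`, which is the same reason a `Writer` can be replaced by a custom handler.
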
## User Buffer {#UserBuffer}

Some applications may try to avoid memory allocations whenever possible.

`MemoryPoolAllocator` can support this by letting the user provide a buffer. The buffer can be on the program stack, or a "scratch buffer" which is statically allocated (a static/global array) for storing temporary data.

`MemoryPoolAllocator` will use the user buffer to satisfy allocations. When the user buffer is used up, it will allocate a chunk of memory from the base allocator (by default the `CrtAllocator`).

Here is an example of using stack memory. The first allocator is for storing values, while the second allocator is for storing temporary data during parsing.

~~~~~~~~~~cpp
typedef GenericDocument<UTF8<>, MemoryPoolAllocator<>, MemoryPoolAllocator<> > DocumentType;
char valueBuffer[4096];
char parseBuffer[1024];
MemoryPoolAllocator<> valueAllocator(valueBuffer, sizeof(valueBuffer));
MemoryPoolAllocator<> parseAllocator(parseBuffer, sizeof(parseBuffer));
DocumentType d(&valueAllocator, sizeof(parseBuffer), &parseAllocator);
d.Parse(json);
~~~~~~~~~~

If the total size of allocations during parsing is less than 4096+1024 bytes, this code does not invoke any heap allocation (via `new` or `malloc()`) at all.

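The fallback behaviour described in this section can be sketched as a small pool. `SketchPool` is a hypothetical class, not RapidJSON's `MemoryPoolAllocator`; it assumes an 8-byte-aligned user buffer and a fixed 64 KiB heap chunk size, and it omits reallocation. Allocation is a pointer bump, there is no per-block free, and the heap is touched only after the user buffer runs out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

const std::size_t kSketchChunkSize = 64 * 1024;  // assumed heap chunk size

class SketchPool {
public:
    // The user buffer is borrowed, never freed by the pool.
    SketchPool(void* user_buffer, std::size_t user_size)
        : cursor_(static_cast<char*>(user_buffer)), remaining_(user_size), used_(0) {
        assert(reinterpret_cast<std::uintptr_t>(user_buffer) % 8 == 0);
    }
    ~SketchPool() {
        for (std::size_t i = 0; i < chunks_.size(); ++i)
            std::free(chunks_[i]);  // everything is released at once
    }
    SketchPool(const SketchPool&) = delete;
    SketchPool& operator=(const SketchPool&) = delete;

    // Hands out the next `size` bytes, rounded up to a multiple of 8.
    void* Malloc(std::size_t size) {
        size = (size + 7) & ~static_cast<std::size_t>(7);
        if (size > remaining_) {
            // Current region exhausted: fall back to the heap. Any bytes left
            // in the old region are simply abandoned.
            const std::size_t chunk = size > kSketchChunkSize ? size : kSketchChunkSize;
            void* p = std::malloc(chunk);
            if (p == 0) return 0;
            chunks_.push_back(p);
            cursor_ = static_cast<char*>(p);
            remaining_ = chunk;
        }
        void* result = cursor_;
        cursor_ += size;
        remaining_ -= size;
        used_ += size;
        return result;
    }

    std::size_t Size() const { return used_; }  // bytes handed out so far
    std::size_t HeapChunks() const { return chunks_.size(); }

private:
    char* cursor_;               // next free byte in the current region
    std::size_t remaining_;      // bytes left in the current region
    std::size_t used_;           // total bytes handed out
    std::vector<void*> chunks_;  // heap chunks owned by the pool
};
```
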
Users can query the current memory consumption in bytes via `MemoryPoolAllocator::Size()`, and then determine a suitable size for the user buffer.