.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _format_integration_testing:

Integration Testing
===================

Our strategy for integration testing between Arrow implementations is:

* Test datasets are specified in a custom human-readable, JSON-based format
  designed exclusively for Arrow's integration tests
* Each implementation provides a testing executable capable of converting
  between the JSON and the binary Arrow file representation
* The test executable is also capable of validating the contents of a binary
  file against a corresponding JSON file

Running integration tests
-------------------------

The integration test data generator and runner are implemented inside
the :ref:`Archery <archery>` utility.

The integration tests are run using the ``archery integration`` command.

.. code-block:: shell

   archery integration --help

In order to run integration tests, you'll first need to build each component
you want to include. See the respective developer docs for C++, Java, etc.
for instructions on building those.

Some languages may require additional build options to enable integration
testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON``
to your CMake command.

Depending on which components you have built, you can enable and add them to
the archery test run. For example, if you only have the C++ project built, run:

.. code-block:: shell

   archery integration --with-cpp=1

For Java, it may look like:

.. code-block:: shell

   VERSION=0.11.0-SNAPSHOT
   export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar
   archery integration --with-cpp=1 --with-java=1

To run all tests, including Flight integration tests, do:

.. code-block:: shell

   archery integration --with-all --run-flight

Note that we run these tests in continuous integration, and the CI job uses
docker-compose. You may also run the docker-compose job locally, or at least
refer to it if you have questions about how to build other languages or enable
certain tests.

See :ref:`docker-builds` for more information about the project's
``docker-compose`` configuration.

JSON test data format
---------------------

A JSON representation of Arrow columnar data is provided for
cross-language integration testing purposes.
This representation is `not canonical <https://lists.apache.org/thread.html/6947fb7666a0f9cc27d9677d2dad0fb5990f9063b7cf3d80af5e270f%40%3Cdev.arrow.apache.org%3E>`_
but it provides a human-readable way of verifying language implementations.

See `here <https://github.com/apache/arrow/tree/master/docs/source/format/integration_json_examples>`_
for some examples of this JSON data.

.. can we check in more examples, e.g. from the generated_*.json test files?

The high-level structure of a JSON integration test file is as follows:

**Data file** ::

    {
      "schema": /*Schema*/,
      "batches": [ /*RecordBatch*/ ],
      "dictionaries": [ /*DictionaryBatch*/ ]
    }

All files contain ``schema`` and ``batches``, while ``dictionaries`` is only
present if there are dictionary type fields in the schema.

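The required and optional top-level members can be checked mechanically. Below is a minimal sketch in Python (standard library only; the helper name and the inline document are ours, purely illustrative):

```python
import json

def check_top_level(doc):
    """Check the top-level layout of an integration-test JSON document.

    Per the structure above: "schema" and "batches" are required,
    while "dictionaries" is optional.
    """
    if "schema" not in doc or "batches" not in doc:
        raise ValueError('both "schema" and "batches" are required')
    if not isinstance(doc["batches"], list):
        raise ValueError('"batches" must be a JSON array')
    if "dictionaries" in doc and not isinstance(doc["dictionaries"], list):
        raise ValueError('"dictionaries", when present, must be a JSON array')
    return True

# A tiny in-memory example; real files are produced by the test generators.
doc = json.loads('{"schema": {"fields": []}, "batches": []}')
check_top_level(doc)   # passes: "dictionaries" may be absent
```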
**Schema** ::

    {
      "fields" : [
        /* Field */
      ],
      "metadata" : /* Metadata */
    }

**Field** ::

    {
      "name" : "name_of_the_field",
      "nullable" : /* boolean */,
      "type" : /* Type */,
      "children" : [ /* Field */ ],
      "dictionary": {
        "id": /* integer */,
        "indexType": /* Type */,
        "isOrdered": /* boolean */
      },
      "metadata" : /* Metadata */
    }

The ``dictionary`` attribute is present if and only if the ``Field`` corresponds to a
dictionary type, and its ``id`` maps onto a column in the ``DictionaryBatch``. In this
case the ``type`` attribute describes the value type of the dictionary.

For primitive types, ``children`` is an empty array.

**Metadata** ::

    null |
    [ {
      "key": /* string */,
      "value": /* string */
    } ]

A key-value mapping of custom metadata. It may be omitted or null, in which
case it is considered equivalent to ``[]`` (no metadata). Duplicate keys are
not forbidden here.

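Because ``metadata`` may be omitted, ``null``, or a list of key/value pairs, a consumer typically normalizes it before comparison. A possible sketch (the helper name is ours, not part of any Arrow API):

```python
def normalize_metadata(field_json):
    """Return the metadata of a Field as a list of (key, value) pairs.

    Omitted or null metadata is treated as equivalent to [] (no metadata).
    Duplicate keys are preserved, since they are not forbidden.
    """
    metadata = field_json.get("metadata") or []
    return [(pair["key"], pair["value"]) for pair in metadata]

normalize_metadata({"name": "f", "metadata": None})   # -> []
normalize_metadata({"name": "f"})                     # -> []
normalize_metadata({"name": "f",
                    "metadata": [{"key": "k", "value": "v"},
                                 {"key": "k", "value": "w"}]})
# -> [("k", "v"), ("k", "w")]  (duplicates kept)
```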
**Type**: ::

    {
      "name" : "null|struct|list|largelist|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map"
    }

A ``Type`` will have other fields as defined in
`Schema.fbs <https://github.com/apache/arrow/tree/master/format/Schema.fbs>`_
depending on its name.

Int: ::

    {
      "name" : "int",
      "bitWidth" : /* integer */,
      "isSigned" : /* boolean */
    }

FloatingPoint: ::

    {
      "name" : "floatingpoint",
      "precision" : "HALF|SINGLE|DOUBLE"
    }

FixedSizeBinary: ::

    {
      "name" : "fixedsizebinary",
      "byteWidth" : /* byte width */
    }

Decimal: ::

    {
      "name" : "decimal",
      "precision" : /* integer */,
      "scale" : /* integer */
    }

Timestamp: ::

    {
      "name" : "timestamp",
      "unit" : "$TIME_UNIT",
      "timezone": "$timezone"
    }

``$TIME_UNIT`` is one of ``"SECOND|MILLISECOND|MICROSECOND|NANOSECOND"``

"timezone" is an optional string.

Duration: ::

    {
      "name" : "duration",
      "unit" : "$TIME_UNIT"
    }

Date: ::

    {
      "name" : "date",
      "unit" : "DAY|MILLISECOND"
    }

Time: ::

    {
      "name" : "time",
      "unit" : "$TIME_UNIT",
      "bitWidth": /* integer: 32 or 64 */
    }

Interval: ::

    {
      "name" : "interval",
      "unit" : "YEAR_MONTH|DAY_TIME"
    }

Union: ::

    {
      "name" : "union",
      "mode" : "SPARSE|DENSE",
      "typeIds" : [ /* integer */ ]
    }

The ``typeIds`` field in ``Union`` lists the codes used to denote which member
of the union is active in each array slot. Note that in general these
discriminants are not identical to the index of the corresponding child array.

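To illustrate why the discriminants need a lookup, here is a sketch of resolving a union type code to a child index (the function name is illustrative, not from any Arrow implementation):

```python
def child_index_for_type_id(type_ids, type_id):
    """Map a union discriminant (as stored in the TYPE_ID buffer) to the
    index of the corresponding child array, using the "typeIds" list."""
    return type_ids.index(type_id)

# "typeIds": [5, 9] means code 5 selects child 0 and code 9 selects child 1;
# the codes themselves need not be 0 and 1.
assert child_index_for_type_id([5, 9], 9) == 1
```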
List: ::

    {
      "name": "list"
    }

The type that the list is a "list of" will be included in the ``Field``'s
"children" member, as a single ``Field`` there. For example, for a list of
``int32``, ::

    {
      "name": "list_nullable",
      "type": {
        "name": "list"
      },
      "nullable": true,
      "children": [
        {
          "name": "item",
          "type": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 32
          },
          "nullable": true,
          "children": []
        }
      ]
    }

FixedSizeList: ::

    {
      "name": "fixedsizelist",
      "listSize": /* integer */
    }

This type likewise comes with a length-1 "children" array.

Struct: ::

    {
      "name": "struct"
    }

The ``Field``'s "children" contains an array of ``Fields`` with meaningful
names and types.

Map: ::

    {
      "name": "map",
      "keysSorted": /* boolean */
    }

The ``Field``'s "children" contains a single ``struct`` field, which itself
contains 2 children, named "key" and "value".

Null: ::

    {
      "name": "null"
    }

Extension types are, as in the IPC format, represented as their underlying
storage type plus some dedicated field metadata to reconstruct the extension
type. For example, assuming a "uuid" extension type backed by a
FixedSizeBinary(16) storage, here is how a "uuid" field would be represented::

    {
      "name" : "name_of_the_field",
      "nullable" : /* boolean */,
      "type" : {
        "name" : "fixedsizebinary",
        "byteWidth" : 16
      },
      "children" : [],
      "metadata" : [
        {"key": "ARROW:extension:name", "value": "uuid"},
        {"key": "ARROW:extension:metadata", "value": "uuid-serialized"}
      ]
    }

**RecordBatch**::

    {
      "count": /* integer number of rows */,
      "columns": [ /* FieldData */ ]
    }

**DictionaryBatch**::

    {
      "id": /* integer */,
      "data": [ /* RecordBatch */ ]
    }

**FieldData**::

    {
      "name": "field_name",
      "count": "field_length",
      "$BUFFER_TYPE": /* BufferData */
      ...
      "$BUFFER_TYPE": /* BufferData */
      "children": [ /* FieldData */ ]
    }

The "name" member of a ``Field`` in the ``Schema`` corresponds to the "name"
of a ``FieldData`` contained in the "columns" of a ``RecordBatch``.
For nested types (list, struct, etc.), ``Field``'s "children" each have a
"name" that corresponds to the "name" of a ``FieldData`` inside the
"children" of that ``FieldData``.
For ``FieldData`` inside of a ``DictionaryBatch``, the "name" field does not
correspond to anything.

Here ``$BUFFER_TYPE`` is one of ``VALIDITY``, ``OFFSET`` (for
variable-length types, such as strings and lists), ``TYPE_ID`` (for unions),
or ``DATA``.

``BufferData`` is encoded based on the type of buffer:

* ``VALIDITY``: a JSON array of 1 (valid) and 0 (null). Data for non-nullable
  ``Field`` still has a ``VALIDITY`` array, even though all values are 1.
* ``OFFSET``: a JSON array of integers for 32-bit offsets or
  string-formatted integers for 64-bit offsets
* ``TYPE_ID``: a JSON array of integers
* ``DATA``: a JSON array of encoded values

The value encoding for ``DATA`` depends on the logical type:

* For boolean type: an array of 1 (true) and 0 (false).
* For integer-based types (including timestamps): an array of JSON numbers.
* For 64-bit integers: an array of integers formatted as JSON strings,
  so as to avoid loss of precision.
* For floating point types: an array of JSON numbers. Values are limited
  to 3 decimal places to avoid loss of precision.
* For binary types, an array of uppercase hex-encoded strings, so as
  to represent arbitrary binary data.
* For UTF-8 string types, an array of JSON strings.

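As a sketch of the rules above, here is one way the ``DATA`` encoding might look in Python. This is a simplified illustration covering only a few types, not the actual generator code:

```python
def encode_data(logical_type, values):
    """Encode Python values as a JSON-ready DATA array per the rules above.

    Simplified: handles booleans, 32-bit and 64-bit integers, and binary.
    """
    if logical_type == "bool":
        # Booleans become 1 (true) and 0 (false).
        return [1 if v else 0 for v in values]
    if logical_type == "int64":
        # 64-bit integers become strings to avoid losing precision in JSON.
        return [str(v) for v in values]
    if logical_type == "int32":
        # 32-bit integers are plain JSON numbers.
        return list(values)
    if logical_type == "binary":
        # Arbitrary bytes become uppercase hex strings.
        return [v.hex().upper() for v in values]
    raise NotImplementedError(logical_type)

encode_data("bool", [True, False])    # -> [1, 0]
encode_data("int64", [2**62])         # -> ["4611686018427387904"]
encode_data("binary", [b"\x0a\xff"])  # -> ["0AFF"]
```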
For "list" and "largelist" types, ``BufferData`` has ``VALIDITY`` and
``OFFSET``, and the rest of the data is inside "children". These child
``FieldData`` contain all of the same attributes as non-child data, so in
the example of a list of ``int32``, the child data has ``VALIDITY`` and
``DATA``.

For "fixedsizelist", there is no ``OFFSET`` member because the offsets are
implied by the field's "listSize".

Note that the "count" for these child data may not match the parent "count".
For example, if a ``RecordBatch`` has 7 rows and contains a ``FixedSizeList``
of ``listSize`` 4, then the data inside the "children" of that ``FieldData``
will have count 28.

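The arithmetic in the example above is simply the parent count times ``listSize``; as a one-line sketch (the helper name is ours):

```python
def fixed_size_list_child_count(parent_count, list_size):
    """Number of child values backing a FixedSizeList column."""
    return parent_count * list_size

# 7 rows of FixedSizeList with listSize 4 -> 28 child values.
assert fixed_size_list_child_count(7, 4) == 28
```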
For "null" type, ``BufferData`` does not contain any buffers.