Commit | Line | Data |
---|---|---|
1d09f67e TL |
1 | .. Licensed to the Apache Software Foundation (ASF) under one |
2 | .. or more contributor license agreements. See the NOTICE file | |
3 | .. distributed with this work for additional information | |
4 | .. regarding copyright ownership. The ASF licenses this file | |
5 | .. to you under the Apache License, Version 2.0 (the | |
6 | .. "License"); you may not use this file except in compliance | |
7 | .. with the License. You may obtain a copy of the License at | |
8 | ||
9 | .. http://www.apache.org/licenses/LICENSE-2.0 | |
10 | ||
11 | .. Unless required by applicable law or agreed to in writing, | |
12 | .. software distributed under the License is distributed on an | |
13 | .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
14 | .. KIND, either express or implied. See the License for the | |
15 | .. specific language governing permissions and limitations | |
16 | .. under the License. | |
17 | ||
18 | .. _format_integration_testing: | |
19 | ||
20 | Integration Testing | |
21 | =================== | |
22 | ||
23 | Our strategy for integration testing between Arrow implementations is: | |
24 | ||
25 | * Test datasets are specified in a custom human-readable, JSON-based format | |
26 | designed exclusively for Arrow's integration tests | |
27 | * Each implementation provides a testing executable capable of converting | |
28 | between the JSON and the binary Arrow file representation | |
29 | * The test executable is also capable of validating the contents of a binary | |
30 | file against a corresponding JSON file | |
31 | ||
32 | Running integration tests | |
33 | ------------------------- | |
34 | ||
35 | The integration test data generator and runner are implemented inside | |
36 | the :ref:`Archery <archery>` utility. | |
37 | ||
38 | The integration tests are run using the ``archery integration`` command. | |
39 | ||
40 | .. code-block:: shell | |
41 | ||
42 | archery integration --help | |
43 | ||
44 | In order to run integration tests, you'll first need to build each component | |
45 | you want to include. See the respective developer docs for C++, Java, etc. | |
46 | for instructions on building those. | |
47 | ||
48 | Some languages may require additional build options to enable integration | |
49 | testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON`` | |
50 | to your CMake command. | |
51 | ||
52 | Depending on which components you have built, you can enable and add them to | |
53 | the archery test run. For example, if you only have the C++ project built, run: | |
54 | ||
55 | .. code-block:: shell | |
56 | ||
57 | archery integration --with-cpp=1 | |
58 | ||
59 | ||
60 | For Java, it may look like: | |
61 | ||
62 | .. code-block:: shell | |
63 | ||
64 | VERSION=0.11.0-SNAPSHOT | |
65 | export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar | |
66 | archery integration --with-cpp=1 --with-java=1 | |
67 | ||
68 | To run all tests, including Flight integration tests, do: | |
69 | ||
70 | .. code-block:: shell | |
71 | ||
72 | archery integration --with-all --run-flight | |
73 | ||
74 | Note that we run these tests in continuous integration, and the CI job uses | |
75 | docker-compose. You may also run the docker-compose job locally, or at least | |
76 | refer to it if you have questions about how to build other languages or enable | |
77 | certain tests. | |
78 | ||
79 | See :ref:`docker-builds` for more information about the project's | |
80 | ``docker-compose`` configuration. | |
81 | ||
82 | JSON test data format | |
83 | --------------------- | |
84 | ||
85 | A JSON representation of Arrow columnar data is provided for | |
86 | cross-language integration testing purposes. | |
87 | This representation is `not canonical <https://lists.apache.org/thread.html/6947fb7666a0f9cc27d9677d2dad0fb5990f9063b7cf3d80af5e270f%40%3Cdev.arrow.apache.org%3E>`_ | |
88 | but it provides a human-readable way of verifying language implementations. | |
89 | ||
90 | See `here <https://github.com/apache/arrow/tree/master/docs/source/format/integration_json_examples>`_ | |
91 | for some examples of this JSON data. | |
92 | ||
93 | .. can we check in more examples, e.g. from the generated_*.json test files? | |
94 | ||
95 | The high-level structure of a JSON integration test file is as follows: | |
96 | ||
97 | **Data file** :: | |
98 | ||
99 | { | |
100 | "schema": /*Schema*/, | |
101 | "batches": [ /*RecordBatch*/ ], | |
102 | "dictionaries": [ /*DictionaryBatch*/ ] | |
103 | } | |
104 | ||
105 | All files contain ``schema`` and ``batches``, while ``dictionaries`` is only | |
106 | present if there are dictionary type fields in the schema. | |
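
As an illustration only, the top-level layout described above can be checked with a few lines of Python. The helper name ``check_top_level`` is hypothetical; it is not part of Archery or of any Arrow implementation.

```python
import json


def check_top_level(doc):
    """Return a list of problems found in the top-level keys of a data file."""
    problems = []
    # "schema" and "batches" are present in every file.
    for key in ("schema", "batches"):
        if key not in doc:
            problems.append(f"missing required key: {key}")
    # "dictionaries" is optional; when present it must be a JSON array.
    if "dictionaries" in doc and not isinstance(doc["dictionaries"], list):
        problems.append("'dictionaries' must be an array")
    return problems


example = json.loads('{"schema": {"fields": []}, "batches": []}')
print(check_top_level(example))
```

A file without dictionary-typed fields, like ``example`` here, passes the check even though ``dictionaries`` is absent.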
107 | ||
108 | **Schema** :: | |
109 | ||
110 | { | |
111 | "fields" : [ | |
112 | /* Field */ | |
113 | ], | |
114 | "metadata" : /* Metadata */ | |
115 | } | |
116 | ||
117 | **Field** :: | |
118 | ||
119 | { | |
120 | "name" : "name_of_the_field", | |
121 | "nullable" : /* boolean */, | |
122 | "type" : /* Type */, | |
123 | "children" : [ /* Field */ ], | |
124 | "dictionary": { | |
125 | "id": /* integer */, | |
126 | "indexType": /* Type */, | |
127 | "isOrdered": /* boolean */ | |
128 | }, | |
129 | "metadata" : /* Metadata */ | |
130 | } | |
131 | ||
132 | The ``dictionary`` attribute is present if and only if the ``Field`` corresponds to a | |
133 | dictionary type, and its ``id`` maps onto a column in the ``DictionaryBatch``. In this | |
134 | case the ``type`` attribute describes the value type of the dictionary. | |
135 | ||
136 | For primitive types, ``children`` is an empty array. | |
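
The link between a dictionary-typed ``Field`` and its ``DictionaryBatch`` can be sketched as below. This is a minimal sketch: the function name ``dictionary_for_field`` and the sample document are made up for illustration, and the ``data`` member is left as an empty placeholder.

```python
def dictionary_for_field(field, doc):
    """Return the DictionaryBatch entry referenced by a field, or None."""
    spec = field.get("dictionary")
    if spec is None:
        # The "dictionary" attribute is absent: not a dictionary type.
        return None
    for batch in doc.get("dictionaries", []):
        if batch["id"] == spec["id"]:
            return batch
    raise KeyError(f"no DictionaryBatch with id {spec['id']}")


# Hypothetical sample document with a single dictionary.
doc = {
    "schema": {"fields": []},
    "batches": [],
    "dictionaries": [{"id": 0, "data": []}],
}
dict_field = {
    "name": "dict0",
    "nullable": True,
    "type": {"name": "utf8"},  # value type of the dictionary
    "children": [],
    "dictionary": {
        "id": 0,
        "indexType": {"name": "int", "bitWidth": 8, "isSigned": True},
        "isOrdered": False,
    },
}
plain_field = {"name": "f0", "nullable": True,
               "type": {"name": "bool"}, "children": []}

print(dictionary_for_field(dict_field, doc)["id"])
print(dictionary_for_field(plain_field, doc))
```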
137 | ||
138 | **Metadata** :: | |
139 | ||
140 | null | | |
141 | [ { | |
142 | "key": /* string */, | |
143 | "value": /* string */ | |
144 | } ] | |
145 | ||
146 | A key-value mapping of custom metadata. It may be omitted or null, in which case it is | |
147 | considered equivalent to ``[]`` (no metadata). Duplicated keys are not forbidden here. | |
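
A reader of the format might normalize metadata as in the following sketch. The helper ``normalize_metadata`` is a hypothetical name; it returns a list of pairs rather than a ``dict`` so that duplicated keys survive.

```python
def normalize_metadata(metadata):
    """Return metadata as a list of (key, value) pairs."""
    # Omitted or null metadata is equivalent to an empty list.
    if metadata is None:
        return []
    # Keep a list of pairs (not a dict) because keys may repeat.
    return [(entry["key"], entry["value"]) for entry in metadata]


field = {"name": "f0", "nullable": True}  # "metadata" omitted
print(normalize_metadata(field.get("metadata")))

duplicated = [{"key": "k", "value": "a"}, {"key": "k", "value": "b"}]
print(normalize_metadata(duplicated))
```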
148 | ||
149 | **Type**: :: | |
150 | ||
151 | { | |
152 | "name" : "null|struct|list|largelist|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map" | |
153 | } | |
154 | ||
155 | A ``Type`` will have other fields as defined in | |
156 | `Schema.fbs <https://github.com/apache/arrow/tree/master/format/Schema.fbs>`_ | |
157 | depending on its name. | |
158 | ||
159 | Int: :: | |
160 | ||
161 | { | |
162 | "name" : "int", | |
163 | "bitWidth" : /* integer */, | |
164 | "isSigned" : /* boolean */ | |
165 | } | |
166 | ||
167 | FloatingPoint: :: | |
168 | ||
169 | { | |
170 | "name" : "floatingpoint", | |
171 | "precision" : "HALF|SINGLE|DOUBLE" | |
172 | } | |
173 | ||
174 | FixedSizeBinary: :: | |
175 | ||
176 | { | |
177 | "name" : "fixedsizebinary", | |
178 | "byteWidth" : /* byte width */ | |
179 | } | |
180 | ||
181 | Decimal: :: | |
182 | ||
183 | { | |
184 | "name" : "decimal", | |
185 | "precision" : /* integer */, | |
186 | "scale" : /* integer */ | |
187 | } | |
188 | ||
189 | Timestamp: :: | |
190 | ||
191 | { | |
192 | "name" : "timestamp", | |
193 | "unit" : "$TIME_UNIT", | |
194 | "timezone": "$timezone" | |
195 | } | |
196 | ||
197 | ``$TIME_UNIT`` is one of ``"SECOND|MILLISECOND|MICROSECOND|NANOSECOND"``. | |
198 | ||
199 | "timezone" is an optional string. | |
200 | ||
201 | Duration: :: | |
202 | ||
203 | { | |
204 | "name" : "duration", | |
205 | "unit" : "$TIME_UNIT" | |
206 | } | |
207 | ||
208 | Date: :: | |
209 | ||
210 | { | |
211 | "name" : "date", | |
212 | "unit" : "DAY|MILLISECOND" | |
213 | } | |
214 | ||
215 | Time: :: | |
216 | ||
217 | { | |
218 | "name" : "time", | |
219 | "unit" : "$TIME_UNIT", | |
220 | "bitWidth": /* integer: 32 or 64 */ | |
221 | } | |
222 | ||
223 | Interval: :: | |
224 | ||
225 | { | |
226 | "name" : "interval", | |
227 | "unit" : "YEAR_MONTH|DAY_TIME" | |
228 | } | |
229 | ||
230 | Union: :: | |
231 | ||
232 | { | |
233 | "name" : "union", | |
234 | "mode" : "SPARSE|DENSE", | |
235 | "typeIds" : [ /* integer */ ] | |
236 | } | |
237 | ||
238 | The ``typeIds`` field in ``Union`` lists the codes used to denote which member of | |
239 | the union is active in each array slot. Note that in general these discriminants are not identical | |
240 | to the index of the corresponding child array. | |
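
To make the distinction between discriminants and child positions concrete, here is a small sketch. It assumes, following ``Schema.fbs``, that ``typeIds[i]`` is the code used for the i-th child; the function name and the sample values are hypothetical.

```python
def child_index_for_type_id(union_type, type_id):
    """Map a union discriminant to the position of its child field."""
    # Assumption: typeIds[i] is the discriminant of the i-th child, so the
    # child position is the index of the discriminant within typeIds.
    return union_type["typeIds"].index(type_id)


union_type = {"name": "union", "mode": "DENSE", "typeIds": [5, 7]}
type_id_buffer = [7, 5, 7]  # hypothetical TYPE_ID buffer for three slots
print([child_index_for_type_id(union_type, t) for t in type_id_buffer])
```

Here the discriminants 5 and 7 select children 0 and 1 respectively, so the codes and the child indices differ.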
241 | ||
242 | List: :: | |
243 | ||
244 | { | |
245 | "name": "list" | |
246 | } | |
247 | ||
248 | The type that the list is a "list of" will be included in the ``Field``'s | |
249 | "children" member, as a single ``Field`` there. For example, for a list of | |
250 | ``int32``, :: | |
251 | ||
252 | { | |
253 | "name": "list_nullable", | |
254 | "type": { | |
255 | "name": "list" | |
256 | }, | |
257 | "nullable": true, | |
258 | "children": [ | |
259 | { | |
260 | "name": "item", | |
261 | "type": { | |
262 | "name": "int", | |
263 | "isSigned": true, | |
264 | "bitWidth": 32 | |
265 | }, | |
266 | "nullable": true, | |
267 | "children": [] | |
268 | } | |
269 | ] | |
270 | } | |
271 | ||
272 | FixedSizeList: :: | |
273 | ||
274 | { | |
275 | "name": "fixedsizelist", | |
276 | "listSize": /* integer */ | |
277 | } | |
278 | ||
279 | This type likewise comes with a length-1 "children" array. | |
280 | ||
281 | Struct: :: | |
282 | ||
283 | { | |
284 | "name": "struct" | |
285 | } | |
286 | ||
287 | The ``Field``'s "children" contains an array of ``Fields`` with meaningful | |
288 | names and types. | |
289 | ||
290 | Map: :: | |
291 | ||
292 | { | |
293 | "name": "map", | |
294 | "keysSorted": /* boolean */ | |
295 | } | |
296 | ||
297 | The ``Field``'s "children" contains a single ``struct`` field, which itself | |
298 | contains 2 children, named "key" and "value". | |
299 | ||
300 | Null: :: | |
301 | ||
302 | { | |
303 | "name": "null" | |
304 | } | |
305 | ||
306 | Extension types are, as in the IPC format, represented as their underlying | |
307 | storage type plus some dedicated field metadata to reconstruct the extension | |
308 | type. For example, assuming a "uuid" extension type backed by a | |
309 | FixedSizeBinary(16) storage, here is how a "uuid" field would be represented:: | |
310 | ||
311 | { | |
312 | "name" : "name_of_the_field", | |
313 | "nullable" : /* boolean */, | |
314 | "type" : { | |
315 | "name" : "fixedsizebinary", | |
316 | "byteWidth" : 16 | |
317 | }, | |
318 | "children" : [], | |
319 | "metadata" : [ | |
320 | {"key": "ARROW:extension:name", "value": "uuid"}, | |
321 | {"key": "ARROW:extension:metadata", "value": "uuid-serialized"} | |
322 | ] | |
323 | } | |
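
A consumer could recognize such a field by inspecting its metadata, as in this sketch. The helper name ``extension_info`` is hypothetical, and treating a missing ``ARROW:extension:metadata`` entry as an empty string is an assumption made for the example.

```python
def extension_info(field):
    """Return (extension name, serialized metadata), or None for plain fields."""
    # Metadata may be omitted or null; both mean "no metadata".
    entries = field.get("metadata") or []
    name = None
    serialized = ""  # assumption: default when the metadata key is absent
    for entry in entries:
        if entry["key"] == "ARROW:extension:name":
            name = entry["value"]
        elif entry["key"] == "ARROW:extension:metadata":
            serialized = entry["value"]
    if name is None:
        return None
    return name, serialized


uuid_field = {
    "name": "name_of_the_field",
    "nullable": True,
    "type": {"name": "fixedsizebinary", "byteWidth": 16},
    "children": [],
    "metadata": [
        {"key": "ARROW:extension:name", "value": "uuid"},
        {"key": "ARROW:extension:metadata", "value": "uuid-serialized"},
    ],
}
print(extension_info(uuid_field))
```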
324 | ||
325 | **RecordBatch**:: | |
326 | ||
327 | { | |
328 | "count": /* integer number of rows */, | |
329 | "columns": [ /* FieldData */ ] | |
330 | } | |
331 | ||
332 | **DictionaryBatch**:: | |
333 | ||
334 | { | |
335 | "id": /* integer */, | |
336 | "data": [ /* RecordBatch */ ] | |
337 | } | |
338 | ||
339 | **FieldData**:: | |
340 | ||
341 | { | |
342 | "name": "field_name", | |
343 | "count": /* integer field length */, | |
344 | "$BUFFER_TYPE": /* BufferData */, | |
345 | ... | |
346 | "$BUFFER_TYPE": /* BufferData */, | |
347 | "children": [ /* FieldData */ ] | |
348 | } | |
349 | ||
350 | The "name" member of a ``Field`` in the ``Schema`` corresponds to the "name" | |
351 | of a ``FieldData`` contained in the "columns" of a ``RecordBatch``. | |
352 | For nested types (list, struct, etc.), ``Field``'s "children" each have a | |
353 | "name" that corresponds to the "name" of a ``FieldData`` inside the | |
354 | "children" of that ``FieldData``. | |
355 | For ``FieldData`` inside of a ``DictionaryBatch``, the "name" field does not | |
356 | correspond to anything. | |
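
The name correspondence described above can be verified recursively. The following is only a sketch; ``names_match`` is a hypothetical helper, and the sample ``Field`` and ``FieldData`` are trimmed to the attributes the check needs.

```python
def names_match(field, field_data):
    """Check that a Field and a FieldData agree on names, recursively."""
    if field["name"] != field_data["name"]:
        return False
    children = field.get("children", [])
    child_data = field_data.get("children", [])
    if len(children) != len(child_data):
        return False
    # Each child Field is compared with the child FieldData at the same position.
    return all(names_match(f, d) for f, d in zip(children, child_data))


schema_field = {
    "name": "list_nullable",
    "children": [{"name": "item", "children": []}],
}
column = {
    "name": "list_nullable",
    "count": 1,
    "children": [{"name": "item", "count": 1}],
}
print(names_match(schema_field, column))
```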
357 | ||
358 | Here ``$BUFFER_TYPE`` is one of ``VALIDITY``, ``OFFSET`` (for | |
359 | variable-length types, such as strings and lists), ``TYPE_ID`` (for unions), | |
360 | or ``DATA``. | |
361 | ||
362 | ``BufferData`` is encoded based on the type of buffer: | |
363 | ||
364 | * ``VALIDITY``: a JSON array of 1 (valid) and 0 (null). Data for non-nullable | |
365 | ``Field`` still has a ``VALIDITY`` array, even though all values are 1. | |
366 | * ``OFFSET``: a JSON array of integers for 32-bit offsets or | |
367 | string-formatted integers for 64-bit offsets | |
368 | * ``TYPE_ID``: a JSON array of integers | |
369 | * ``DATA``: a JSON array of encoded values | |
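
As a hedged example of how these buffers fit together, the sketch below builds the ``FieldData`` of a string column. The function name ``encode_utf8_column`` is hypothetical, and writing an empty string into ``DATA`` for null slots is an assumption of this example, not a rule of the format.

```python
def encode_utf8_column(name, values, large=False):
    """Build a FieldData dict for a utf8 (or largeutf8) column."""
    validity = [0 if v is None else 1 for v in values]
    offsets = [0]
    data = []
    for v in values:
        text = "" if v is None else v  # assumption: placeholder for nulls
        # Offsets count bytes of the UTF-8 encoding, not characters.
        offsets.append(offsets[-1] + len(text.encode("utf-8")))
        data.append(text)
    if large:
        # 64-bit offsets are written as string-formatted integers.
        offsets = [str(o) for o in offsets]
    return {
        "name": name,
        "count": len(values),
        "VALIDITY": validity,
        "OFFSET": offsets,
        "DATA": data,
    }


column = encode_utf8_column("s", ["ab", None, "\u00e9"])
print(column["VALIDITY"], column["OFFSET"])
```

The ``OFFSET`` array has one more entry than the column has values, and a null slot repeats the previous offset.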
370 | ||
371 | The value encoding for ``DATA`` is different depending on the logical | |
372 | type: | |
373 | ||
374 | * For boolean type: an array of 1 (true) and 0 (false). | |
375 | * For integer-based types (including timestamps): an array of JSON numbers. | |
376 | * For 64-bit integers: an array of integers formatted as JSON strings, | |
377 | so as to avoid loss of precision. | |
378 | * For floating point types: an array of JSON numbers. Values are limited | |
379 | to 3 decimal places to avoid loss of precision. | |
380 | * For binary types: an array of uppercase hex-encoded strings, so as | |
381 | to represent arbitrary binary data. | |
382 | * For UTF-8 string types, an array of JSON strings. | |
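
The encoding rules in the list above can be summarized in code. This is a sketch under stated assumptions: the function ``encode_data_values`` and its ``kind`` labels (``"bool"``, ``"int64"``, ``"float"``, ``"binary"``) are invented for the example and are not identifiers used by the format.

```python
def encode_data_values(values, kind):
    """Encode Python values as a JSON-ready DATA array for a given kind."""
    if kind == "bool":
        return [1 if v else 0 for v in values]
    if kind == "int64":
        # 64-bit integers become strings to avoid loss of precision.
        return [str(v) for v in values]
    if kind == "float":
        # Values are limited to 3 decimal places.
        return [round(v, 3) for v in values]
    if kind == "binary":
        # Arbitrary bytes become uppercase hex strings.
        return [bytes(v).hex().upper() for v in values]
    # Narrower integers and UTF-8 strings are written as-is.
    return list(values)


print(encode_data_values([True, False], "bool"))
print(encode_data_values([2**63 - 1], "int64"))
print(encode_data_values([b"\x00\xff"], "binary"))
```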
383 | ||
384 | For "list" and "largelist" types, ``BufferData`` has ``VALIDITY`` and | |
385 | ``OFFSET``, and the rest of the data is inside "children". These child | |
386 | ``FieldData`` contain all of the same attributes as non-child data, so in | |
387 | the example of a list of ``int32``, the child data has ``VALIDITY`` and | |
388 | ``DATA``. | |
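
For the list of ``int32`` case, the nesting might be produced as in the sketch below. The function name ``encode_int32_list_column``, the child name ``item``, and the use of 0 as the ``DATA`` placeholder for null child values are assumptions made for illustration.

```python
def encode_int32_list_column(name, rows):
    """Build a FieldData dict for a list<int32> column."""
    validity, offsets, child_values = [], [0], []
    for row in rows:
        if row is None:
            validity.append(0)
        else:
            validity.append(1)
            child_values.extend(row)
        # Each offset marks where the next list starts in the child data.
        offsets.append(len(child_values))
    child = {
        "name": "item",
        "count": len(child_values),
        "VALIDITY": [0 if v is None else 1 for v in child_values],
        "DATA": [0 if v is None else v for v in child_values],
    }
    # The parent holds only VALIDITY and OFFSET; values live in "children".
    return {
        "name": name,
        "count": len(rows),
        "VALIDITY": validity,
        "OFFSET": offsets,
        "children": [child],
    }


column = encode_int32_list_column("list_nullable", [[1, 2], None, [3]])
print(column["OFFSET"], column["children"][0]["DATA"])
```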
389 | ||
390 | For "fixedsizelist", there is no ``OFFSET`` member because the offsets are | |
391 | implied by the field's "listSize". | |
392 | ||
393 | Note that the "count" for these child data may not match the parent "count". | |
394 | For example, if a ``RecordBatch`` has 7 rows and contains a ``FixedSizeList`` | |
395 | of ``listSize`` 4, then the data inside the "children" of that ``FieldData`` | |
396 | will have count 28. | |
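
The arithmetic behind this example is simply the parent count multiplied by the list size. The helper below is a hypothetical name used only to spell that out.

```python
def fixed_size_list_child_count(parent_count, list_size):
    """Number of child values held by a FixedSizeList column."""
    # Every parent slot, valid or null, occupies list_size child slots.
    return parent_count * list_size


print(fixed_size_list_child_count(7, 4))
```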
397 | ||
398 | For "null" type, ``BufferData`` does not contain any buffers. |