.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. default-domain:: cpp
.. highlight:: cpp

======
Arrays
======

.. seealso::
   :doc:`Array API reference <api/array>`

The central type in Arrow is the class :class:`arrow::Array`. An array
represents a known-length sequence of values all having the same type.
Internally, those values are represented by one or several buffers, the
number and meaning of which depend on the array's data type, as documented
in :ref:`the Arrow data layout specification <format_layout>`.

Those buffers consist of the value data itself and an optional bitmap buffer
that indicates which array entries are null values. The bitmap buffer
can be entirely omitted if the array is known to have zero null values.

Each data type has a concrete subclass of :class:`arrow::Array` that helps
you access individual values of the array.

Building an array
=================

Available strategies
--------------------

As Arrow objects are immutable, they cannot be populated directly like,
for example, a ``std::vector``. Instead, several strategies can be used:

* if the data already exists in memory with the right layout, you can wrap
  said memory inside :class:`arrow::Buffer` instances and then construct
  a :class:`arrow::ArrayData` describing the array;

  .. seealso:: :ref:`cpp_memory_management`

* otherwise, the :class:`arrow::ArrayBuilder` base class and its concrete
  subclasses help you build up array data incrementally, without having to
  deal with details of the Arrow format yourself.

Using ArrayBuilder and its subclasses
-------------------------------------

To build an ``Int64`` Arrow array, we can use the :class:`arrow::Int64Builder`
class. In the following example, we build an array of the range 1 to 8 where
the element that should hold the value 4 is nulled::

   arrow::Int64Builder builder;
   builder.Append(1);
   builder.Append(2);
   builder.Append(3);
   builder.AppendNull();
   builder.Append(5);
   builder.Append(6);
   builder.Append(7);
   builder.Append(8);

   auto maybe_array = builder.Finish();
   if (!maybe_array.ok()) {
      // ... do something on array building failure
   }
   std::shared_ptr<arrow::Array> array = *maybe_array;

The resulting Array (which can be cast to the concrete :class:`arrow::Int64Array`
subclass if you want to access its values) then consists of two
:class:`arrow::Buffer`\s.
The first buffer holds the null bitmap, which consists here of a single byte with
the bits ``1|1|1|1|0|1|1|1``. As we use `least-significant bit (LSB) numbering`_,
this indicates that the fourth entry in the array is null. The second
buffer is simply an ``int64_t`` array containing all the above values.
As the fourth entry is null, the value at that position in the buffer is
undefined.

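To make the bit numbering concrete, here is a minimal sketch that uses only
the standard library. ``IsValid`` is a hypothetical helper written for this
illustration, not an Arrow function, and it assumes a bitmap that starts at
entry 0 with no offset.

```cpp
#include <cstdint>

// Returns true if the validity bit for entry `i` is set, i.e. the entry is
// not null. With LSB numbering, entry i lives in byte i / 8, at bit position
// i % 8 counted from the least-significant bit.
// Hypothetical helper for illustration; not part of the Arrow API.
bool IsValid(const uint8_t* bitmap, int64_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}
```

For the byte above, written most-significant bit first as ``0b11110111``
(``0xF7``), ``IsValid`` returns false only for index 3, the fourth entry.
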
Here is how you could access the concrete array's contents::

   // Cast the Array to its actual type to access its data
   auto int64_array = std::static_pointer_cast<arrow::Int64Array>(array);

   // Get the pointer to the null bitmap.
   const uint8_t* null_bitmap = int64_array->null_bitmap_data();

   // Get the pointer to the actual data
   const int64_t* data = int64_array->raw_values();

   // Alternatively, given an array index, query its null bit and value directly
   int64_t index = 2;
   if (!int64_array->IsNull(index)) {
      int64_t value = int64_array->Value(index);
   }

.. note::
   :class:`arrow::Int64Array` (respectively :class:`arrow::Int64Builder`) is
   just a ``typedef``, provided for convenience, of ``arrow::NumericArray<Int64Type>``
   (respectively ``arrow::NumericBuilder<Int64Type>``).

.. _least-significant bit (LSB) numbering: https://en.wikipedia.org/wiki/Bit_numbering

Performance
-----------

While it is possible to build an array value-by-value as in the example above,
to attain the highest performance it is recommended to use the bulk appending
methods (usually named ``AppendValues``) in the concrete :class:`arrow::ArrayBuilder`
subclasses.

If you know the number of elements in advance, it is also recommended to
presize the working area by calling the :func:`~arrow::ArrayBuilder::Resize`
or :func:`~arrow::ArrayBuilder::Reserve` methods.

Here is how one could rewrite the above example to take advantage of those
APIs::

   arrow::Int64Builder builder;
   // Make place for 8 values in total
   builder.Reserve(8);
   // Bulk append the given values (with a null in 4th place as indicated by the
   // validity vector)
   std::vector<bool> validity = {true, true, true, false, true, true, true, true};
   std::vector<int64_t> values = {1, 2, 3, 0, 5, 6, 7, 8};
   builder.AppendValues(values, validity);

   auto maybe_array = builder.Finish();

If you still must append values one by one, some concrete builder subclasses
have methods marked "Unsafe" that assume the working area has been correctly
presized, and offer higher performance in exchange::

   arrow::Int64Builder builder;
   // Make place for 8 values in total
   builder.Reserve(8);
   builder.UnsafeAppend(1);
   builder.UnsafeAppend(2);
   builder.UnsafeAppend(3);
   builder.UnsafeAppendNull();
   builder.UnsafeAppend(5);
   builder.UnsafeAppend(6);
   builder.UnsafeAppend(7);
   builder.UnsafeAppend(8);

   auto maybe_array = builder.Finish();

Size Limitations and Recommendations
====================================

Some array types are structurally limited to 32-bit sizes. This is the case
at least for list arrays (which can hold up to 2^31 elements) and for string
and binary arrays (which can hold up to 2 GB of binary data). Some other array
types can hold up to 2^63 elements in the C++ implementation, but other Arrow
implementations can have a 32-bit size limitation for those array types as well.

For these reasons, it is recommended that huge data be chunked in subsets of
more reasonable size.

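As a rough illustration of that recommendation, the sketch below splits a
logical range into consecutive pieces no longer than a chosen maximum. It uses
only the standard library; ``PlanChunks`` and its parameters are hypothetical
names introduced here, and the maximum chunk length is something you would
pick yourself rather than a value defined by Arrow.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Splits the logical range [0, total_length) into consecutive
// (offset, length) pairs, each at most max_chunk_length values long.
// Hypothetical helper for illustration; not part of the Arrow API.
std::vector<std::pair<int64_t, int64_t>> PlanChunks(int64_t total_length,
                                                    int64_t max_chunk_length) {
  std::vector<std::pair<int64_t, int64_t>> ranges;
  if (max_chunk_length <= 0) return ranges;  // nothing sensible to plan
  int64_t offset = 0;
  while (offset < total_length) {
    // The last piece may be shorter than the maximum.
    const int64_t length = std::min(max_chunk_length, total_length - offset);
    ranges.emplace_back(offset, length);
    offset += length;
  }
  return ranges;
}
```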
Chunked Arrays
==============

A :class:`arrow::ChunkedArray` is, like an array, a logical sequence of values;
but unlike a simple array, a chunked array does not require the entire sequence
to be physically contiguous in memory. Also, the constituents of a chunked array
need not have the same size, but they must all have the same data type.

A chunked array is constructed by aggregating any number of arrays. Here we'll
build a chunked array with the same logical values as in the example above,
but in two separate chunks::

   std::vector<std::shared_ptr<arrow::Array>> chunks;
   std::shared_ptr<arrow::Array> array;

   // Build first chunk
   arrow::Int64Builder builder;
   builder.Append(1);
   builder.Append(2);
   builder.Append(3);
   if (!builder.Finish(&array).ok()) {
      // ... do something on array building failure
   }
   chunks.push_back(std::move(array));

   // Build second chunk
   builder.Reset();
   builder.AppendNull();
   builder.Append(5);
   builder.Append(6);
   builder.Append(7);
   builder.Append(8);
   if (!builder.Finish(&array).ok()) {
      // ... do something on array building failure
   }
   chunks.push_back(std::move(array));

   auto chunked_array = std::make_shared<arrow::ChunkedArray>(std::move(chunks));

   assert(chunked_array->num_chunks() == 2);
   // Logical length in number of values
   assert(chunked_array->length() == 8);
   assert(chunked_array->null_count() == 1);

Slicing
=======

As with physical memory buffers, it is possible to make zero-copy slices
of arrays and chunked arrays, to obtain an array or chunked array referring
to some logical subsequence of the data. This is done by calling the
:func:`arrow::Array::Slice` and :func:`arrow::ChunkedArray::Slice` methods,
respectively.

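To give an intuition for what "zero-copy" means here, the following is a toy
model built only on the standard library. ``Int64View`` is a hypothetical type
invented for this sketch and is not how :class:`arrow::Array` is declared; it
only illustrates the general idea that a slice can share the underlying storage
and differ merely in its offset and length. It performs no bounds checking.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// A toy stand-in for an array: shared storage plus an offset and a length.
// Hypothetical type for illustration only; not part of the Arrow API.
struct Int64View {
  std::shared_ptr<const std::vector<int64_t>> storage;
  int64_t offset;
  int64_t length;

  // Reads the i-th value of this view (no bounds checking in this sketch).
  int64_t Value(int64_t i) const { return (*storage)[offset + i]; }

  // Returns a view on `slice_length` values starting at `slice_offset`,
  // relative to this view. No values are copied; only the shared pointer is.
  Int64View Slice(int64_t slice_offset, int64_t slice_length) const {
    return Int64View{storage, offset + slice_offset, slice_length};
  }
};
```

Slicing such a view at offset 2 with length 3 over the values 1 to 8 yields a
view whose first value is 3 and whose ``storage`` pointer is the same as the
original's.
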