Commit | Line | Data |
---|---|---|
1d09f67e TL |
1 | .. Licensed to the Apache Software Foundation (ASF) under one |
2 | .. or more contributor license agreements. See the NOTICE file | |
3 | .. distributed with this work for additional information | |
4 | .. regarding copyright ownership. The ASF licenses this file | |
5 | .. to you under the Apache License, Version 2.0 (the | |
6 | .. "License"); you may not use this file except in compliance | |
7 | .. with the License. You may obtain a copy of the License at | |
8 | ||
9 | .. http://www.apache.org/licenses/LICENSE-2.0 | |
10 | ||
11 | .. Unless required by applicable law or agreed to in writing, | |
12 | .. software distributed under the License is distributed on an | |
13 | .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
14 | .. KIND, either express or implied. See the License for the | |
15 | .. specific language governing permissions and limitations | |
16 | .. under the License. | |
17 | ||
18 | .. default-domain:: cpp | |
19 | .. highlight:: cpp | |
20 | ||
21 | ====== | |
22 | Arrays | |
23 | ====== | |
24 | ||
25 | .. seealso:: | |
26 | :doc:`Array API reference <api/array>` | |
27 | ||
28 | The central type in Arrow is the class :class:`arrow::Array`. An array | |
29 | represents a known-length sequence of values all having the same type. | |
30 | Internally, those values are represented by one or several buffers, the | |
31 | number and meaning of which depend on the array's data type, as documented | |
32 | in :ref:`the Arrow data layout specification <format_layout>`. | |
33 | ||
34 | Those buffers consist of the value data itself and an optional bitmap buffer | |
35 | that indicates which array entries are null values. The bitmap buffer | |
36 | can be entirely omitted if the array is known to have zero null values. | |
37 | ||
38 | There are concrete subclasses of :class:`arrow::Array` for each data type | |
39 | that help you access individual values of the array. | |
40 | ||
41 | Building an array | |
42 | ================= | |
43 | ||
44 | Available strategies | |
45 | -------------------- | |
46 | ||
47 | As Arrow objects are immutable, they cannot be populated directly the way a | |
48 | ``std::vector`` can, for example. Instead, several strategies can be used: | |
49 | ||
50 | * if the data already exists in memory with the right layout, you can wrap | |
51 | said memory inside :class:`arrow::Buffer` instances and then construct | |
52 | a :class:`arrow::ArrayData` describing the array; | |
53 | ||
54 | .. seealso:: :ref:`cpp_memory_management` | |
55 | ||
56 | * otherwise, the :class:`arrow::ArrayBuilder` base class and its concrete | |
57 | subclasses help you build up array data incrementally, without having to | |
58 | deal with details of the Arrow format yourself. | |
59 | ||
60 | Using ArrayBuilder and its subclasses | |
61 | ------------------------------------- | |
62 | ||
63 | To build an ``Int64`` Arrow array, we can use the :class:`arrow::Int64Builder` | |
64 | class. In the following example, we build an array of the range 1 to 8 where | |
65 | the element that should hold the value 4 is nulled:: | |
66 | ||
67 | arrow::Int64Builder builder; | |
68 | builder.Append(1); | |
69 | builder.Append(2); | |
70 | builder.Append(3); | |
71 | builder.AppendNull(); | |
72 | builder.Append(5); | |
73 | builder.Append(6); | |
74 | builder.Append(7); | |
75 | builder.Append(8); | |
76 | ||
77 | auto maybe_array = builder.Finish(); | |
78 | if (!maybe_array.ok()) { | |
79 | // ... do something on array building failure | |
80 | } | |
81 | std::shared_ptr<arrow::Array> array = *maybe_array; | |
82 | ||
83 | The resulting Array (which can be cast to the concrete :class:`arrow::Int64Array` | |
84 | subclass if you want to access its values) then consists of two | |
85 | :class:`arrow::Buffer`\s. | |
86 | The first buffer holds the null bitmap, which consists here of a single byte with | |
87 | the bits ``1|1|1|1|0|1|1|1``. As we use `least-significant bit (LSB) numbering`_, | |
88 | this indicates that the fourth entry in the array is null. The second | |
89 | buffer is simply an ``int64_t`` array containing all the above values. | |
90 | As the fourth entry is null, the value at that position in the buffer is | |
91 | undefined. | |
92 | ||
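The bit arithmetic behind that description can be sketched with the standard
library alone. The names ``IsValidBit`` and ``kExampleBitmap`` are hypothetical
and this is not Arrow's own implementation; the sketch only assumes the LSB
numbering described above, where entry ``i`` maps to bit ``i % 8`` of byte
``i / 8``:

```cpp
#include <cstdint>

// Hypothetical helper (not part of Arrow): returns true if entry `i` is
// marked valid (non-null) in a validity bitmap that uses LSB numbering.
inline bool IsValidBit(const uint8_t* bitmap, int64_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// The single-byte bitmap from the example above: written from the most
// significant bit down it reads 1|1|1|1|0|1|1|1, that is 0xF7.
static const uint8_t kExampleBitmap[] = {0xF7};
```

With this bitmap, ``IsValidBit(kExampleBitmap, 3)`` is false (the fourth entry
is null), and the helper returns true for every other index from 0 to 7.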
93 | Here is how you could access the concrete array's contents:: | |
94 | ||
95 | // Cast the Array to its actual type to access its data | |
96 | auto int64_array = std::static_pointer_cast<arrow::Int64Array>(array); | |
97 | ||
98 | // Get the pointer to the null bitmap. | |
99 | const uint8_t* null_bitmap = int64_array->null_bitmap_data(); | |
100 | ||
101 | // Get the pointer to the actual data | |
102 | const int64_t* data = int64_array->raw_values(); | |
103 | ||
104 | // Alternatively, given an array index, query its null bit and value directly | |
105 | int64_t index = 2; | |
106 | if (!int64_array->IsNull(index)) { | |
107 | int64_t value = int64_array->Value(index); | |
108 | } | |
109 | ||
110 | .. note:: | |
111 | :class:`arrow::Int64Array` (respectively :class:`arrow::Int64Builder`) is | |
112 | just a ``typedef``, provided for convenience, of ``arrow::NumericArray<Int64Type>`` | |
113 | (respectively ``arrow::NumericBuilder<Int64Type>``). | |
114 | ||
115 | .. _least-significant bit (LSB) numbering: https://en.wikipedia.org/wiki/Bit_numbering | |
116 | ||
117 | Performance | |
118 | ----------- | |
119 | ||
120 | While it is possible to build an array value-by-value as in the example above, | |
121 | to attain highest performance it is recommended to use the bulk appending | |
122 | methods (usually named ``AppendValues``) in the concrete :class:`arrow::ArrayBuilder` | |
123 | subclasses. | |
124 | ||
125 | If you know the number of elements in advance, it is also recommended to | |
126 | presize the working area by calling the :func:`~arrow::ArrayBuilder::Resize` | |
127 | or :func:`~arrow::ArrayBuilder::Reserve` methods. | |
128 | ||
129 | Here is how one could rewrite the above example to take advantage of those | |
130 | APIs:: | |
131 | ||
132 | arrow::Int64Builder builder; | |
133 | // Make place for 8 values in total | |
134 | builder.Reserve(8); | |
135 | // Bulk append the given values (with a null in 4th place as indicated by the | |
136 | // validity vector) | |
137 | std::vector<bool> validity = {true, true, true, false, true, true, true, true}; | |
138 | std::vector<int64_t> values = {1, 2, 3, 0, 5, 6, 7, 8}; | |
139 | builder.AppendValues(values, validity); | |
140 | ||
141 | auto maybe_array = builder.Finish(); | |
142 | ||
143 | If you still must append values one by one, some concrete builder subclasses | |
144 | have methods marked "Unsafe" that assume the working area has been correctly | |
145 | presized, and offer higher performance in exchange:: | |
146 | ||
147 | arrow::Int64Builder builder; | |
148 | // Make place for 8 values in total | |
149 | builder.Reserve(8); | |
150 | builder.UnsafeAppend(1); | |
151 | builder.UnsafeAppend(2); | |
152 | builder.UnsafeAppend(3); | |
153 | builder.UnsafeAppendNull(); | |
154 | builder.UnsafeAppend(5); | |
155 | builder.UnsafeAppend(6); | |
156 | builder.UnsafeAppend(7); | |
157 | builder.UnsafeAppend(8); | |
158 | ||
159 | auto maybe_array = builder.Finish(); | |
160 | ||
161 | Size Limitations and Recommendations | |
162 | ==================================== | |
163 | ||
164 | Some array types are structurally limited to 32-bit sizes. This is the case, | |
165 | at least, for list arrays (which can hold up to 2^31 elements), string arrays | |
166 | and binary arrays (which can hold up to 2GB of binary data). Some other array | |
167 | types can hold up to 2^63 elements in the C++ implementation, but other Arrow | |
168 | implementations can have a 32-bit size limitation for those array types as well. | |
169 | ||
170 | For these reasons, it is recommended that huge data be chunked into subsets of | |
171 | more reasonable size. | |
172 | ||
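One possible way to plan such chunking is to compute the ranges up front. The
sketch below uses only the standard library; the helper name ``ChunkRanges``
and its parameters are hypothetical, and the choice of maximum chunk size is
left to the caller:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow): splits `total` logical values
// into consecutive (offset, length) ranges of at most `max_chunk` values.
// Assumes total >= 0 and max_chunk > 0.
inline std::vector<std::pair<int64_t, int64_t>> ChunkRanges(int64_t total,
                                                            int64_t max_chunk) {
  std::vector<std::pair<int64_t, int64_t>> ranges;
  for (int64_t offset = 0; offset < total; offset += max_chunk) {
    // The last range may be shorter than max_chunk.
    ranges.emplace_back(offset, std::min(max_chunk, total - offset));
  }
  return ranges;
}
```

Each range could then be appended through its own builder pass, producing one
array per range.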
173 | Chunked Arrays | |
174 | ============== | |
175 | ||
176 | A :class:`arrow::ChunkedArray` is, like an array, a logical sequence of values; | |
177 | but unlike a simple array, a chunked array does not require the entire sequence | |
178 | to be physically contiguous in memory. Also, the constituents of a chunked array | |
179 | need not have the same size, but they must all have the same data type. | |
180 | ||
181 | A chunked array is constructed by aggregating any number of arrays. Here we'll | |
182 | build a chunked array with the same logical values as in the example above, | |
183 | but in two separate chunks:: | |
184 | ||
185 | std::vector<std::shared_ptr<arrow::Array>> chunks; | |
186 | std::shared_ptr<arrow::Array> array; | |
187 | ||
188 | // Build first chunk | |
189 | arrow::Int64Builder builder; | |
190 | builder.Append(1); | |
191 | builder.Append(2); | |
192 | builder.Append(3); | |
193 | if (!builder.Finish(&array).ok()) { | |
194 | // ... do something on array building failure | |
195 | } | |
196 | chunks.push_back(std::move(array)); | |
197 | ||
198 | // Build second chunk | |
199 | builder.Reset(); | |
200 | builder.AppendNull(); | |
201 | builder.Append(5); | |
202 | builder.Append(6); | |
203 | builder.Append(7); | |
204 | builder.Append(8); | |
205 | if (!builder.Finish(&array).ok()) { | |
206 | // ... do something on array building failure | |
207 | } | |
208 | chunks.push_back(std::move(array)); | |
209 | ||
210 | auto chunked_array = std::make_shared<arrow::ChunkedArray>(std::move(chunks)); | |
211 | ||
212 | assert(chunked_array->num_chunks() == 2); | |
213 | // Logical length in number of values | |
214 | assert(chunked_array->length() == 8); | |
215 | assert(chunked_array->null_count() == 1); | |
216 | ||
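Because the chunks are not contiguous, reading a value at a logical position
first requires finding which chunk holds it. The following is a minimal sketch
of that lookup using only the standard library; ``ResolveIndex`` is a
hypothetical helper, not an Arrow API, and it takes the chunk lengths as a
plain vector:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow): maps a logical index to a
// (chunk index, index within chunk) pair, given each chunk's length.
// Returns (-1, -1) if the logical index is out of range.
inline std::pair<int64_t, int64_t> ResolveIndex(
    const std::vector<int64_t>& chunk_lengths, int64_t logical_index) {
  if (logical_index < 0) return {-1, -1};
  for (int64_t c = 0; c < static_cast<int64_t>(chunk_lengths.size()); ++c) {
    if (logical_index < chunk_lengths[c]) return {c, logical_index};
    // Skip past this chunk and keep searching in the next one.
    logical_index -= chunk_lengths[c];
  }
  return {-1, -1};
}
```

For the two chunks built above (lengths 3 and 5), logical index 3 resolves to
chunk 1, position 0, which is where the null entry lives.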
217 | Slicing | |
218 | ======= | |
219 | ||
220 | As with physical memory buffers, it is possible to make zero-copy slices | |
221 | of arrays and chunked arrays, to obtain an array or chunked array referring | |
222 | to some logical subsequence of the data. This is done by calling the | |
223 | :func:`arrow::Array::Slice` and :func:`arrow::ChunkedArray::Slice` methods, | |
224 | respectively. | |
225 |
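To illustrate what "zero-copy" means here, the sketch below models a slice as
shared storage plus an offset and a length. ``Int64View`` is a hypothetical
type written for this illustration with the standard library only; it is not
how Arrow implements slicing, merely the general idea:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical illustration (not Arrow's implementation): a view that shares
// ownership of the underlying storage and only records an offset and a length.
struct Int64View {
  std::shared_ptr<const std::vector<int64_t>> storage;
  int64_t offset;
  int64_t length;

  // Reads the i-th value of the view. Assumes 0 <= i < length.
  int64_t Value(int64_t i) const {
    return (*storage)[static_cast<std::size_t>(offset + i)];
  }

  // Returns a view on `len` values starting at `off` within this view.
  // No values are copied. Assumes 0 <= off and off + len <= length.
  Int64View Slice(int64_t off, int64_t len) const {
    return Int64View{storage, offset + off, len};
  }
};
```

A slice of a slice simply accumulates offsets, and every view keeps the same
storage alive for as long as it exists.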