.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. default-domain:: cpp
.. highlight:: cpp

High-Level Overview
===================

The Arrow C++ library is composed of several parts, each of which serves
a specific purpose.

The physical layer
------------------

**Memory management** abstractions provide a uniform API over memory that
may be allocated through various means, such as heap allocation, the memory
mapping of a file, or a static memory area. In particular, the **buffer**
abstraction represents a contiguous area of physical data.

The one-dimensional layer
-------------------------

**Data types** govern the *logical* interpretation of *physical* data.
Many operations in Arrow are parameterized, at compile time or at runtime,
by a data type.
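
To make the split between physical data and its logical interpretation
concrete, here is a minimal sketch in plain C++. It does not use the Arrow
API: ``InterpretAs`` is a hypothetical helper, and the type is chosen at
compile time through a template parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy illustration only: InterpretAs is hypothetical and not part of Arrow.
// Reads the same physical bytes as a sequence of fixed-width values of type T.
template <typename T>
std::vector<T> InterpretAs(const std::vector<std::uint8_t>& bytes) {
  // Trailing bytes that do not fill a whole T are ignored.
  std::vector<T> values(bytes.size() / sizeof(T));
  if (!values.empty()) {
    // memcpy avoids the undefined behavior of casting the byte pointer.
    std::memcpy(values.data(), bytes.data(), values.size() * sizeof(T));
  }
  return values;
}
```

With eight bytes all set to ``0xFF``, ``InterpretAs<std::int32_t>`` yields two
values equal to -1, while ``InterpretAs<std::uint16_t>`` yields four values
equal to 65535: the bytes are identical and only the data type differs.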

**Arrays** assemble one or several buffers with a data type, so that they
can be viewed as a logically contiguous sequence of values (possibly nested).
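
As an illustration of several buffers forming one array, the toy model below
pairs a validity bitmap with a values buffer to represent nullable 32-bit
integers. ``ToyInt32Array`` is a hypothetical name and not an Arrow class. The
two-buffer layout with least-significant-bit numbering is an assumption
modelled on the Arrow columnar format for primitive types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model, not the Arrow API: a nullable int32 "array" made of two buffers.
struct ToyInt32Array {
  std::vector<std::uint8_t> validity;  // one bit per slot, 1 = valid, 0 = null
  std::vector<std::int32_t> values;    // one entry per slot, valid or not

  std::size_t length() const { return values.size(); }

  // Bit i of the bitmap, counting from the least significant bit of each byte.
  bool IsValid(std::size_t i) const {
    return ((validity[i / 8] >> (i % 8)) & 1) != 0;
  }

  // Raw value of slot i; only meaningful when IsValid(i) is true.
  std::int32_t Value(std::size_t i) const { return values[i]; }
};
```

With a validity byte of ``0x05`` and three value slots, slots 0 and 2 are
valid and slot 1 is null.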

**Chunked arrays** are a generalization of arrays, combining several arrays
of the same type into a longer logical sequence of values.
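
The sketch below shows, under simplifying assumptions, how several chunks can
be addressed as one logical sequence. ``ToyChunkedArray`` is a hypothetical
stand-in, not an Arrow class, and it stores plain 64-bit integers without null
support.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy model, not the Arrow API: several same-type chunks seen as one sequence.
struct ToyChunkedArray {
  std::vector<std::vector<std::int64_t>> chunks;

  // Total number of values across all chunks.
  std::size_t length() const {
    std::size_t n = 0;
    for (const auto& chunk : chunks) n += chunk.size();
    return n;
  }

  // Value at logical position i, found by walking the chunks in order.
  std::int64_t Get(std::size_t i) const {
    for (const auto& chunk : chunks) {
      if (i < chunk.size()) return chunk[i];
      i -= chunk.size();  // skip this chunk and keep looking
    }
    throw std::out_of_range("index past the end of the chunked array");
  }
};
```

The linear walk keeps the example short. A lookup structure over cumulative
chunk lengths would be the natural refinement when there are many chunks.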

The two-dimensional layer
-------------------------

**Schemas** describe a logical collection of several pieces of data,
each with a distinct name and type, and optional metadata.

**Tables** are collections of chunked arrays that conform to a schema. They
are the most capable dataset-providing abstraction in Arrow.

**Record batches** are collections of contiguous arrays, described
by a schema. They allow incremental construction or serialization of tables.
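
One way to picture the relationship between these three abstractions is the
toy model below, in which appending a record batch to a table adds one chunk
to each column. All names (``ToyArray``, ``ToyRecordBatch``, ``ToyTable``) are
hypothetical and not part of the Arrow API. The schema is reduced to a list of
field names, and every column holds 64-bit integers for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy model, not the Arrow API. Every column holds int64 values for brevity.
using ToyArray = std::vector<std::int64_t>;

struct ToyRecordBatch {
  std::vector<ToyArray> columns;  // one contiguous array per schema field
};

struct ToyTable {
  std::vector<std::string> field_names;        // stands in for the schema
  std::vector<std::vector<ToyArray>> columns;  // per field: a list of chunks

  explicit ToyTable(std::vector<std::string> names)
      : field_names(std::move(names)), columns(field_names.size()) {}

  // Grow the table incrementally: each array of the batch becomes a new
  // chunk of the corresponding column.
  void Append(const ToyRecordBatch& batch) {
    if (batch.columns.size() != columns.size()) {
      throw std::invalid_argument("batch does not match the table schema");
    }
    for (std::size_t i = 0; i < columns.size(); ++i) {
      columns[i].push_back(batch.columns[i]);
    }
  }

  // Row count, taken from the chunks of the first column.
  std::size_t num_rows() const {
    std::size_t n = 0;
    if (!columns.empty()) {
      for (const auto& chunk : columns[0]) n += chunk.size();
    }
    return n;
  }
};
```

Appending a two-row batch and then a one-row batch to a two-field table gives
a table of three rows in which each column consists of two chunks.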

The compute layer
-----------------

**Datums** are flexible dataset references, able to hold, for example, an
array or a table reference.

**Kernels** are specialized computation functions running in a loop over a
given set of datums representing the input and output parameters of the
functions.

The IO layer
------------

**Streams** allow untyped sequential or seekable access over external data
of various kinds (for example compressed or memory-mapped).

The Inter-Process Communication (IPC) layer
-------------------------------------------

A **messaging format** allows interchange of Arrow data between processes, using
as few copies as possible.

The file formats layer
----------------------

Arrow data can be read from and written to various file formats, for
example **Parquet**, **CSV**, **ORC** or the Arrow-specific **Feather** format.

The devices layer
-----------------

Basic **CUDA** integration is provided, making it possible to describe Arrow
data backed by GPU-allocated memory.

The filesystem layer
--------------------

A filesystem abstraction allows reading and writing data from different storage
backends, such as the local filesystem or an S3 bucket.