]> git.proxmox.com Git - ceph.git/blame - ceph/src/arrow/docs/source/java/algorithm.rst
import quincy 17.2.0
[ceph.git] / ceph / src / arrow / docs / source / java / algorithm.rst
CommitLineData
1d09f67e
TL
1.. Licensed to the Apache Software Foundation (ASF) under one
2.. or more contributor license agreements. See the NOTICE file
3.. distributed with this work for additional information
4.. regarding copyright ownership. The ASF licenses this file
5.. to you under the Apache License, Version 2.0 (the
6.. "License"); you may not use this file except in compliance
7.. with the License. You may obtain a copy of the License at
8
9.. http://www.apache.org/licenses/LICENSE-2.0
10
11.. Unless required by applicable law or agreed to in writing,
12.. software distributed under the License is distributed on an
13.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14.. KIND, either express or implied. See the License for the
15.. specific language governing permissions and limitations
16.. under the License.
17
18Java Algorithms
19===============
20
21Arrow's Java library provides algorithms for some commonly-used
22functionalities. The algorithms are provided in the ``org.apache.arrow.algorithm``
23package of the ``algorithm`` module.
24
25Comparing Vector Elements
26-------------------------
27
28Comparing vector elements is the basic for many algorithms. Vector
29elements can be compared in one of the two ways:
30
311. **Equality comparison**: there are two possible results for this type of comparisons: ``equal`` and ``unequal``.
32Currently, this type of comparison is supported through the ``org.apache.arrow.vector.compare.VectorValueEqualizer``
33interface.
34
352. **Ordering comparison**: there are three possible results for this type of comparisons: ``less than``, ``equal to ``
36and ``greater than``. This comparison is supported by the abstract class ``org.apache.arrow.algorithm.sort.VectorValueComparator``.
37
38We provide default implementations to compare vector elements. However, users can also define ways
39for customized comparisons.
40
41Vector Element Search
42---------------------
43
44A search algorithm tries to find a particular value in a vector. When successful, a vector index is
45returned; otherwise, a ``-1`` is returned. The following search algorithms are provided:
46
471. **Linear search**: this algorithm simply traverses the vector from the beginning, until a match is
48found, or the end of the vector is reached. So it takes ``O(n)`` time, where ``n`` is the number of elements
49in the vector. This algorithm is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#linearSearch``.
50
512. **Binary search**: this represents a more efficient search algorithm, as it runs in ``O(log(n))`` time.
52However, it is only applicable to sorted vectors. To get a sorted vector,
53one can use one of our sorting algorithms, which will be discussed in the next section. This algorithm
54is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#binarySearch``.
55
563. **Parallel search**: when the vector is large, it takes a long time to traverse the elements to search
57for a value. To make this process faster, one can split the vector into multiple partitions, and perform the
58search for each partition in parallel. This is supported by ``org.apache.arrow.algorithm.search.ParallelSearcher``.
59
604. **Range search**: for many scenarios, there can be multiple matching values in the vector.
61If the vector is sorted, the matching values reside in a contiguous region in the vector. The
62range search algorithm tries to find the upper/lower bound of the region in ``O(log(n))`` time.
63An implementation is provided in ``org.apache.arrow.algorithm.search.VectorRangeSearcher``.
64
65Vector Sorting
66--------------
67
68Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria must
69be specified by some ordering comparison operation. The sorting algorithms can be
70classified into the following categories:
71
721. **In-place sorter**: an in-place sorter performs the sorting by manipulating the original
73vector, without creating any new vector. So it just returns the original vector after the sorting operations.
74Currently, we have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter`` for in-place
75sorting in ``O(nlog(n))`` time. As the name suggests, it only supports fixed width vectors.
76
772. **Out-of-place sorter**: an out-of-place sorter does not mutate the original vector. Instead,
78it copies vector elements to a new vector in sorted order, and returns the new vector.
79We have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.FixedWidthOutOfPlaceVectorSorter``
80and ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.VariableWidthOutOfPlaceVectorSorter``
81for fixed width and variable width vectors, respectively. Both algorithms run in ``O(nlog(n))`` time.
82
833. **Index sorter**: this sorter does not actually sort the vector. Instead, it returns an integer
84vector, which correspond to indices of vector elements in sorted order. With the index vector, one can
85easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k``th
86smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter``,
87which runs in ``O(nlog(n))`` time. It is applicable to vectors of any type.
88
89Other Algorithms
90----------------
91
92Other algorithms include vector deduplication, dictionary encoding, etc., in the ``algorithm`` module.