]>
Commit | Line | Data |
---|---|---|
1d09f67e TL |
1 | .. Licensed to the Apache Software Foundation (ASF) under one |
2 | .. or more contributor license agreements. See the NOTICE file | |
3 | .. distributed with this work for additional information | |
4 | .. regarding copyright ownership. The ASF licenses this file | |
5 | .. to you under the Apache License, Version 2.0 (the | |
6 | .. "License"); you may not use this file except in compliance | |
7 | .. with the License. You may obtain a copy of the License at | |
8 | ||
9 | .. http://www.apache.org/licenses/LICENSE-2.0 | |
10 | ||
11 | .. Unless required by applicable law or agreed to in writing, | |
12 | .. software distributed under the License is distributed on an | |
13 | .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
14 | .. KIND, either express or implied. See the License for the | |
15 | .. specific language governing permissions and limitations | |
16 | .. under the License. | |
17 | ||
18 | Java Algorithms | |
19 | =============== | |
20 | ||
21 | Arrow's Java library provides algorithms for some commonly-used | |
22 | functionalities. The algorithms are provided in the ``org.apache.arrow.algorithm`` | |
23 | package of the ``algorithm`` module. | |
24 | ||
25 | Comparing Vector Elements | |
26 | ------------------------- | |
27 | ||
28 | Comparing vector elements is the basic for many algorithms. Vector | |
29 | elements can be compared in one of the two ways: | |
30 | ||
31 | 1. **Equality comparison**: there are two possible results for this type of comparisons: ``equal`` and ``unequal``. | |
32 | Currently, this type of comparison is supported through the ``org.apache.arrow.vector.compare.VectorValueEqualizer`` | |
33 | interface. | |
34 | ||
35 | 2. **Ordering comparison**: there are three possible results for this type of comparisons: ``less than``, ``equal to `` | |
36 | and ``greater than``. This comparison is supported by the abstract class ``org.apache.arrow.algorithm.sort.VectorValueComparator``. | |
37 | ||
38 | We provide default implementations to compare vector elements. However, users can also define ways | |
39 | for customized comparisons. | |
40 | ||
41 | Vector Element Search | |
42 | --------------------- | |
43 | ||
44 | A search algorithm tries to find a particular value in a vector. When successful, a vector index is | |
45 | returned; otherwise, a ``-1`` is returned. The following search algorithms are provided: | |
46 | ||
47 | 1. **Linear search**: this algorithm simply traverses the vector from the beginning, until a match is | |
48 | found, or the end of the vector is reached. So it takes ``O(n)`` time, where ``n`` is the number of elements | |
49 | in the vector. This algorithm is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#linearSearch``. | |
50 | ||
51 | 2. **Binary search**: this represents a more efficient search algorithm, as it runs in ``O(log(n))`` time. | |
52 | However, it is only applicable to sorted vectors. To get a sorted vector, | |
53 | one can use one of our sorting algorithms, which will be discussed in the next section. This algorithm | |
54 | is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#binarySearch``. | |
55 | ||
56 | 3. **Parallel search**: when the vector is large, it takes a long time to traverse the elements to search | |
57 | for a value. To make this process faster, one can split the vector into multiple partitions, and perform the | |
58 | search for each partition in parallel. This is supported by ``org.apache.arrow.algorithm.search.ParallelSearcher``. | |
59 | ||
60 | 4. **Range search**: for many scenarios, there can be multiple matching values in the vector. | |
61 | If the vector is sorted, the matching values reside in a contiguous region in the vector. The | |
62 | range search algorithm tries to find the upper/lower bound of the region in ``O(log(n))`` time. | |
63 | An implementation is provided in ``org.apache.arrow.algorithm.search.VectorRangeSearcher``. | |
64 | ||
65 | Vector Sorting | |
66 | -------------- | |
67 | ||
68 | Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria must | |
69 | be specified by some ordering comparison operation. The sorting algorithms can be | |
70 | classified into the following categories: | |
71 | ||
72 | 1. **In-place sorter**: an in-place sorter performs the sorting by manipulating the original | |
73 | vector, without creating any new vector. So it just returns the original vector after the sorting operations. | |
74 | Currently, we have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter`` for in-place | |
75 | sorting in ``O(nlog(n))`` time. As the name suggests, it only supports fixed width vectors. | |
76 | ||
77 | 2. **Out-of-place sorter**: an out-of-place sorter does not mutate the original vector. Instead, | |
78 | it copies vector elements to a new vector in sorted order, and returns the new vector. | |
79 | We have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.FixedWidthOutOfPlaceVectorSorter`` | |
80 | and ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.VariableWidthOutOfPlaceVectorSorter`` | |
81 | for fixed width and variable width vectors, respectively. Both algorithms run in ``O(nlog(n))`` time. | |
82 | ||
83 | 3. **Index sorter**: this sorter does not actually sort the vector. Instead, it returns an integer | |
84 | vector, which correspond to indices of vector elements in sorted order. With the index vector, one can | |
85 | easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k``th | |
86 | smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter``, | |
87 | which runs in ``O(nlog(n))`` time. It is applicable to vectors of any type. | |
88 | ||
89 | Other Algorithms | |
90 | ---------------- | |
91 | ||
92 | Other algorithms include vector deduplication, dictionary encoding, etc., in the ``algorithm`` module. |