.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

=================
Fuzzing Arrow C++
=================

To make the handling of invalid input more robust, we have enabled
fuzz testing on several parts of the Arrow C++ feature set, currently:

* the IPC stream format
* the IPC file format
* the Parquet file format

We welcome any contribution to expand the scope of fuzz testing and cover
areas ingesting potentially invalid or malicious data.

Fuzz Targets and Utilities
==========================

By passing the ``-DARROW_FUZZING=ON`` CMake option, you will build
the fuzz targets corresponding to the aforementioned Arrow features, as well
as additional related utilities.

Generating the seed corpus
--------------------------

Fuzzing essentially explores the domain space by randomly mutating previously
tested inputs, without having any high-level understanding of the area being
fuzz-tested. However, the domain space is so huge that this strategy alone
may fail to actually produce any "interesting" inputs.

To guide the process, it is therefore important to provide a *seed corpus*
of valid (or invalid, but remarkable) inputs from which the fuzzing
infrastructure can derive new inputs for testing. A script is provided
to automate that task. Assuming the fuzzing executables can be found in
``build/debug``, the seed corpus can be generated as follows:

.. code-block:: shell

   $ ./build-support/fuzzing/generate_corpuses.sh build/debug

Continuous fuzzing infrastructure
=================================

The process of fuzz testing is computationally intensive and therefore
benefits from dedicated computing facilities. Arrow C++ is exercised by
the `OSS-Fuzz`_ continuous fuzzing infrastructure operated by Google.

Issues found by OSS-Fuzz are reported to, and only visible to, a limited set of
`core developers <https://github.com/google/oss-fuzz/blob/master/projects/arrow/project.yaml>`_.
If you are an Arrow core developer and want to be added to that list, you can
ask on the :ref:`mailing-list <contributing>`.

.. _OSS-Fuzz: https://google.github.io/oss-fuzz/

Reproducing locally
===================

When a crash is found by fuzzing, it is often useful to download the data
used to produce the crash, and use it to reproduce the crash so as to debug
and investigate.

Assuming you are in a subdirectory inside ``cpp``, the following command
builds the fuzz targets with debug information and the various sanitizer
checks enabled:

.. code-block:: shell

   $ cmake .. -GNinja \
       -DCMAKE_BUILD_TYPE=Debug \
       -DARROW_USE_ASAN=on \
       -DARROW_USE_UBSAN=on \
       -DARROW_FUZZING=on

Then, assuming you have downloaded the crashing data file (let's call it
``testcase-arrow-ipc-file-fuzz-123465``), you can reproduce the crash
by running the affected fuzz target on that file:

.. code-block:: shell

   $ build/debug/arrow-ipc-file-fuzz testcase-arrow-ipc-file-fuzz-123465

You may want to run that command under a debugger to inspect the
program state more closely.
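
When several crash files have been downloaded, replaying them one at a time
gets tedious. The following is a minimal sketch, not part of the Arrow tree,
of a shell helper that runs one fuzz target on each file and reports which
inputs make it exit with a non-zero status. The function name
``replay_testcases`` and the ``testcases/`` directory mentioned afterwards are
hypothetical, and the sketch assumes the fuzz target accepts a single input
file as its argument, as in the command above.

```shell
# Hypothetical helper (not part of Arrow): run a fuzz target on each
# given testcase file and report the inputs that make it fail.
# Usage: replay_testcases <fuzz-target> <testcase-file>...
replay_testcases() {
    target=$1
    shift
    failures=0
    for testcase in "$@"; do
        # Assumption: a sanitizer report or a crash makes the target
        # exit with a non-zero status.
        if ! "$target" "$testcase" > /dev/null 2>&1; then
            echo "FAILED: $testcase"
            failures=$((failures + 1))
        fi
    done
    # After the shift, $# is the number of testcase files.
    echo "$failures of $# testcase(s) failed"
    # Return success only if no testcase failed.
    [ "$failures" -eq 0 ]
}
```

For example, ``replay_testcases build/debug/arrow-ipc-file-fuzz testcases/*``
would replay every file in a ``testcases/`` directory. An input reported as
failing can then be rerun on its own under a debugger, for instance with
``gdb --args build/debug/arrow-ipc-file-fuzz <file>`` if GDB is available.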