.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

=================
Fuzzing Arrow C++
=================

To make the handling of invalid input more robust, we have enabled
fuzz testing on several parts of the Arrow C++ feature set, currently:

* the IPC stream format
* the IPC file format
* the Parquet file format

We welcome any contribution to expand the scope of fuzz testing and cover
areas ingesting potentially invalid or malicious data.

Fuzz Targets and Utilities
==========================

By passing the ``-DARROW_FUZZING=ON`` CMake option, you will build
the fuzz targets corresponding to the aforementioned Arrow features, as well
as additional related utilities.
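
For readers who have not written one, a fuzz target is essentially a function
that receives an arbitrary byte buffer and hands it to the code under test.
The following is a minimal sketch, assuming the libFuzzer entry-point
convention (``LLVMFuzzerTestOneInput``) supported by OSS-Fuzz.
``ParseToyMessage`` and its ``TOY1`` format are hypothetical stand-ins used
for illustration only; they are not Arrow code and not one of Arrow's actual
fuzz targets.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <cstring>

   // Hypothetical toy format (NOT an Arrow format): a 4-byte magic "TOY1",
   // a 1-byte payload length, then that many payload bytes.
   // Returns true if the buffer holds a well-formed toy message.
   bool ParseToyMessage(const uint8_t* data, size_t size) {
     constexpr size_t kHeaderSize = 5;
     if (size < kHeaderSize) return false;
     if (std::memcmp(data, "TOY1", 4) != 0) return false;
     const size_t payload_size = data[4];
     // The length field is untrusted: check it against the bytes available.
     if (size - kHeaderSize < payload_size) return false;
     return true;
   }

   // libFuzzer entry point: called once per generated input.
   // Invalid input must be rejected gracefully, never crash.
   extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
     ParseToyMessage(data, size);
     return 0;
   }

When such a file is compiled with Clang's ``-fsanitize=fuzzer`` option,
libFuzzer supplies ``main`` and calls the entry point repeatedly with
mutated inputs.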

Generating the seed corpus
--------------------------

Fuzzing essentially explores the input space by randomly mutating previously
tested inputs, without having any high-level understanding of the area being
fuzz-tested. However, the input space is so huge that this strategy alone
may fail to actually produce any "interesting" inputs.

To guide the process, it is therefore important to provide a *seed corpus*
of valid (or invalid, but remarkable) inputs from which the fuzzing
infrastructure can derive new inputs for testing. A script is provided
to automate that task. Assuming the fuzzing executables can be found in
``build/debug``, the seed corpus can be generated as follows:

.. code-block:: shell

   $ ./build-support/fuzzing/generate_corpuses.sh build/debug
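
To illustrate why the seed corpus matters, here is a deliberately simplified
sketch of mutation-based input generation. It is not how libFuzzer or
OSS-Fuzz are implemented (real fuzzers chain many mutations and use coverage
feedback), and ``ReachesDeepPath``, ``MutateOneByte`` and ``CountDeepHits``
are hypothetical names invented for this example.

.. code-block:: cpp

   #include <cstddef>
   #include <random>
   #include <string>

   // Toy stand-in for a fuzz target (hypothetical, not Arrow code): only
   // inputs starting with the magic bytes "TOY1" get past the header check.
   bool ReachesDeepPath(const std::string& input) {
     return input.size() >= 4 && input.compare(0, 4, "TOY1") == 0;
   }

   // Overwrite one randomly chosen byte of the input with a random value.
   std::string MutateOneByte(std::string input, std::mt19937& rng) {
     if (input.empty()) return input;
     std::uniform_int_distribution<std::size_t> pick_pos(0, input.size() - 1);
     std::uniform_int_distribution<int> pick_byte(0, 255);
     input[pick_pos(rng)] = static_cast<char>(pick_byte(rng));
     return input;
   }

   // Count how many single-byte mutations of `seed` pass the header check.
   int CountDeepHits(const std::string& seed, int trials, std::mt19937& rng) {
     int hits = 0;
     for (int i = 0; i < trials; ++i) {
       if (ReachesDeepPath(MutateOneByte(seed, rng))) ++hits;
     }
     return hits;
   }

Starting from a seed that already carries the magic bytes, most mutations
still reach the deeper code path. Starting from an arbitrary seed, no
single-byte mutation can, because four bytes would have to change at once.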

Continuous fuzzing infrastructure
=================================

The process of fuzz testing is computationally intensive and therefore
benefits from dedicated computing facilities. Arrow C++ is exercised by
the `OSS-Fuzz`_ continuous fuzzing infrastructure operated by Google.

Issues found by OSS-Fuzz are reported to, and visible only to, a limited set of
`core developers <https://github.com/google/oss-fuzz/blob/master/projects/arrow/project.yaml>`_.
If you are an Arrow core developer and want to be added to that list, you can
ask on the :ref:`mailing-list <contributing>`.

.. _OSS-Fuzz: https://google.github.io/oss-fuzz/

Reproducing locally
===================

When a crash is found by fuzzing, it is often useful to download the data
that triggered it and use that data to reproduce the crash locally, so that
you can debug and investigate.

Assuming you are in a subdirectory inside ``cpp``, the following command
configures a build of the fuzz targets with debug information and the
various sanitizer checks enabled:

.. code-block:: shell

   $ cmake .. -GNinja \
       -DCMAKE_BUILD_TYPE=Debug \
       -DARROW_USE_ASAN=on \
       -DARROW_USE_UBSAN=on \
       -DARROW_FUZZING=on

Then, once the fuzz targets are built and assuming you have downloaded the
crashing data file (let's call it
``testcase-arrow-ipc-file-fuzz-123465``), you can reproduce the crash
by running the affected fuzz target on that file:

.. code-block:: shell

   $ build/debug/arrow-ipc-file-fuzz testcase-arrow-ipc-file-fuzz-123465

You may want to run that command under a debugger so as to inspect the
program state more closely.