## Fuzzing Checklist

The following checklist is intended for authors and reviewers of fuzzing papers. It lists recommendations for a fair and reproducible evaluation that should generally be followed, although individual points may not apply in specific scenarios and require consideration on a case-by-case basis. This checklist is not exhaustive, but it aims to point out common pitfalls so that they can be avoided.

More details on each category of checkmarks are available in Section 5 of our [paper](http://filled-in-after-paper-is-published).

- Artifact
  - Documentation
    - [ ] Is the process of building/preparing the artifact documented?
    - [ ] Is the interface used for interaction with the fuzzer documented?
    - [ ] If the approach extends another fuzzer, are the interfaces (i.e., hooked functions, added compiler passes) documented?
    - [ ] Are the individual experiments documented, including how to compare their results against the paper?
    - [ ] If the new approach is based on another fuzzer, is the version of the baseline documented?
    - [ ] Are the versions of the targets and other fuzzers used during the evaluation specified?
  - Completeness
    - [ ] Are the other fuzzers, or instructions on setting them up, provided?
    - [ ] Are all experiments required to back the paper's main claims included?
  - Reusability
    - [ ] Can the fuzzer be executed independently of the underlying system, e.g., through virtualization or container engines?
    - [ ] Are there external dependencies (e.g., tarballs downloaded via HTTPS) that may become unavailable in the future?
    - [ ] Is the commit history of the projects the fuzzer is based on available and not squashed?

- Targets used for the evaluation
  - [ ] Are the targets suitable to show the strengths of the approach? For example, are parsers targeted if the approach is related to grammar fuzzing?
  - [ ] Is it documented how the targets need to be prepared for fuzzing?
  - [ ] Are all modifications applied to the targets (e.g., patches, changes to the runtime environment) documented?
  - [ ] Are targets used that are also tested by related work (to allow comparability)?
  - [ ] If applicable to the approach, are benchmarks such as FuzzBench used?
  - [ ] If using benchmarks, are benchmarks that inject artificial bugs avoided?

- Competitors used for comparison
  - [ ] Is the approach compared against state-of-the-art tools in the respective field?
  - [ ] Is the approach compared against its baseline, i.e., the fuzzer it is based on (if any)?
  - [ ] If some of the state-of-the-art fuzzers failed on some targets, are the reasons sufficiently documented?

- Setup used for evaluation
  - [ ] Is the hardware used (e.g., CPU, RAM) documented?
  - [ ] Is a sufficiently long (>= 24h) runtime used for comparison?
  - [ ] Is the number of repetitions documented and sufficiently large (>= 10)?
  - [ ] Did all fuzzers have access to the same amount of computation time? This requires particular care if a tool requires precomputation(s); a sketch of such a setup follows this list.
  - [ ] Is the runtime long enough that all fuzzers flatline towards the end of the fuzzing runs?
  - [ ] Is it documented how many instances of each fuzzer have been run in parallel?
  - [ ] Is it documented how many CPUs have been available to each fuzzing process (e.g., via CPU pinning)?
  - [ ] Is the setup (e.g., available cores, hardware) suitable for the requirements of all fuzzers compared in the evaluation?
  - Seeds:
    - [ ] Are the seeds used documented?
    - [ ] Are the seeds uninformed?
    - [ ] Are the seeds publicly available?
    - [ ] Were all fuzzers provided with the same set of seeds?
    - [ ] If informed seeds are used, is the initial coverage they achieve visible, e.g., in plots?
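
For the items on computation time and CPU pinning above, the following is a minimal sketch of one way to give every fuzzer an identical, pinned compute budget across repeated trials. The fuzzer commands, flags, and output paths are hypothetical placeholders and not tied to any particular artifact; only `taskset(1)` and `timeout(1)` are standard Linux tools.

```python
# Minimal sketch: launch repeated, time-limited trials, each pinned to its
# own CPU core, so that every fuzzer receives an identical compute budget.
# Fuzzer commands, flags, and output paths are hypothetical placeholders.
import subprocess

TRIALS = 10                  # repetitions per fuzzer (>= 10 recommended)
BUDGET = "24h"               # identical wall-clock budget for every trial
FUZZERS = {
    "baseline":     ["./baseline_fuzzer", "--target", "./target_bin"],
    "new-approach": ["./new_fuzzer", "--target", "./target_bin"],
}

procs = []
core = 0
for name, cmd in FUZZERS.items():
    for trial in range(TRIALS):
        out_dir = f"results/{name}/trial-{trial}"
        # taskset(1) pins the process to one core; timeout(1) enforces the budget.
        full_cmd = ["taskset", "-c", str(core), "timeout", BUDGET] + cmd + ["-o", out_dir]
        procs.append(subprocess.Popen(full_cmd))
        core += 1

# Wait for all trials so that no fuzzer is stopped early.
for p in procs:
    p.wait()
```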

- Evaluation Metrics
  - [ ] Are standard metrics, such as coverage over time or the number of (deduplicated) bugs, used?
  - [ ] Is it specified how coverage was collected?
  - [ ] Is a collision-free encoding used for coverage collection?
  - [ ] If collecting coverage for JIT-based emulation targets, are basic blocks reported instead of translation blocks (jitted blocks)?
  - [ ] Is the evaluation free of metrics that the new approach is naturally optimized for?
  - Bug finding:
    - [ ] Are targets used that are relevant (have a sufficient user base, forks, GitHub stars, ...)?
    - [ ] Are the targets still maintained (not deprecated) and receiving updates?
    - [ ] Are the targets known for having unfixed bugs?
    - [ ] If reproducing existing CVEs, are the CVEs themselves valid and not duplicates or rejected?
    - [ ] Is an uninstrumented binary used for crash deduplication and reproduction?
    - [ ] Is the process of deduplicating and triaging crashes described (see the sketch after this list)?
    - [ ] Are CVE IDs provided (if possible, anonymously)?
    - [ ] Do the CVEs reference different bugs rather than duplicates?
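
For the deduplication item above, the following is a minimal sketch of one common (though coarse) heuristic: bucketing crashes by the innermost frames of their symbolized stack traces. It assumes each crash was already reproduced on an uninstrumented binary and its frames extracted; all crash IDs and frame names are illustrative placeholders.

```python
# Minimal sketch: bucket crashes by their innermost stack frames, a common
# (if coarse) deduplication heuristic. All crash IDs and frame names below
# are illustrative placeholders.
from collections import defaultdict

def dedup_key(frames, depth=3):
    """Use the innermost `depth` frames of the stack trace as the bucket key."""
    return tuple(frames[:depth])

crashes = {
    "crash-0001": ["png_read_chunk", "png_handle_iCCP", "png_read_info", "main"],
    "crash-0002": ["png_read_chunk", "png_handle_iCCP", "png_read_info", "main"],
    "crash-0003": ["inflate", "png_handle_zTXt", "png_read_info", "main"],
}

buckets = defaultdict(list)
for crash_id, frames in crashes.items():
    buckets[dedup_key(frames)].append(crash_id)

print(f"{len(buckets)} unique crash buckets from {len(crashes)} raw crashes")
for key, ids in buckets.items():
    print(" -> ".join(key), ids)
```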

- Statistical Evaluation
  - [ ] Are measures of uncertainty, such as intervals in plots, used?
  - [ ] Is a sufficient number of trials used (>= 10)?
  - [ ] Are reasonable statistical tests used to check for the significance of the results?
  - [ ] Are reasonable statistical tests used to study the effect size (see the sketch after this list)?
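
For the two items on statistical tests, the following is a minimal sketch of one common choice in fuzzing evaluations: the Mann-Whitney U test for significance combined with the Vargha-Delaney A12 effect size. The coverage values are made-up placeholders, and SciPy is assumed to be available.

```python
# Minimal sketch: significance (Mann-Whitney U) and effect size
# (Vargha-Delaney A12) for the final coverage of two fuzzers over repeated
# trials. The coverage numbers are made-up placeholders, not real results.
from scipy.stats import mannwhitneyu

cov_new      = [1510, 1492, 1533, 1489, 1540, 1512, 1498, 1527, 1505, 1519]
cov_baseline = [1431, 1468, 1442, 1455, 1420, 1473, 1438, 1461, 1449, 1427]

# Non-parametric two-sided test; no normality assumption on the samples.
_, p_value = mannwhitneyu(cov_new, cov_baseline, alternative="two-sided")

def a12(x, y):
    """Vargha-Delaney A12: probability that a random trial of x beats one of y
    (ties count half); 0.5 means no effect."""
    wins = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (wins + 0.5 * ties) / (len(x) * len(y))

print(f"p = {p_value:.4f}, A12 = {a12(cov_new, cov_baseline):.2f}")
```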

- Other
  - [ ] Are threats to validity considered and discussed?