## Fuzzing Checklist

The following checklist is intended for authors and reviewers of fuzzing papers. It lists a number of recommendations for a fair and reproducible evaluation that should be followed in general, even though we stress that individual points may not apply in specific scenarios and require consideration on a case-by-case basis. This checklist is not exhaustive but aims to point out common pitfalls and help avoid them.

More details regarding each category of checks are available in Section 5 of our [paper](http://filled-in-after-paper-is-published).

- Artifact
  - Documentation
    - [ ] Is the process of building/preparing the artifact documented?
    - [ ] Is the interface used for interaction with the fuzzer documented?
    - [ ] If the approach extends another fuzzer, are the interfaces (i.e., hooked functions, added compiler passes) documented?
    - [ ] Are the individual experiments documented, including how to compare their results against the paper?
    - [ ] If the new approach is based on another fuzzer, is the version of the baseline documented?
    - [ ] Are the versions of the targets and other fuzzers used during evaluation specified?
  - Completeness
    - [ ] Are the other fuzzers, or instructions on setting them up, provided?
    - [ ] Are all experiments required to back the paper's main claims included?
  - Reusability
    - [ ] Can the fuzzer be executed independently of the underlying system, e.g., through virtualization or container engines?
    - [ ] Are there external dependencies (e.g., tarballs downloaded via HTTPS) that may be unavailable in the future? (See the pinned-download sketch after this section.)
    - [ ] Is the commit history of projects the fuzzer is based on available and not squashed?

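For the reusability items above, one common mitigation for external dependencies is to vendor them with pinned checksums. The following is a minimal sketch, not part of the checklist itself: the URL, file name, and hash are hypothetical placeholders standing in for an artifact's real dependencies.

```python
#!/usr/bin/env python3
"""Sketch: pin external dependencies by URL and SHA-256 so the artifact
still builds reproducibly if the upstream file moves or changes."""

import hashlib
import sys
import urllib.request

# Hypothetical dependency list: (URL, expected SHA-256 of the tarball).
DEPENDENCIES = [
    ("https://example.org/target-1.2.3.tar.gz",
     "0000000000000000000000000000000000000000000000000000000000000000"),
]

def fetch_and_verify(url: str, expected_sha256: str) -> bytes:
    """Download a dependency and fail loudly if its checksum drifts."""
    data = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        sys.exit(f"checksum mismatch for {url}: got {digest}")
    return data

if __name__ == "__main__":
    for url, sha in DEPENDENCIES:
        blob = fetch_and_verify(url, sha)
        # Store the blob next to the artifact so future builds do not
        # depend on the upstream server staying online.
        with open(url.rsplit("/", 1)[-1], "wb") as fh:
            fh.write(blob)
```
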
- Targets used for the evaluation
  - [ ] Are the targets suitable to show the strengths of the approach? For example, are parsers targeted if the approach is related to grammar fuzzing?
  - [ ] Is it documented how the targets need to be prepared for fuzzing?
  - [ ] Are all modifications (e.g., patches, changes to the runtime environment) applied to targets documented?
  - [ ] Are targets used that are also tested by related work (to allow comparability)?
  - [ ] If applicable for the approach, are benchmarks such as FuzzBench used?
  - [ ] If using benchmarks, are benchmarks that inject artificial bugs avoided?

- Competitors used for comparison
  - [ ] Is the approach compared against state-of-the-art tools in the respective field?
  - [ ] Is the baseline, i.e., the fuzzer the new approach is based on (if any), compared against?
  - [ ] If some of the state-of-the-art fuzzers failed on some targets, are the reasons sufficiently documented?

- Setup used for evaluation
  - [ ] Is the used hardware (e.g., CPU, RAM) documented?
  - [ ] Is a sufficiently long (>= 24h) runtime used for comparison?
  - [ ] Is the number of repetitions documented and sufficiently large (>= 10)?
  - [ ] Did all fuzzers have access to the same amount of computation time? This requires particular thought if a tool requires precomputation(s).
  - [ ] Is the runtime sufficient such that all fuzzers flatline towards the end of the fuzzing runs?
  - [ ] Is it documented how many instances of each fuzzer have been run in parallel?
  - [ ] Is it documented how many CPUs have been available to each fuzzing process (e.g., via CPU pinning)? (See the runner sketch after this section.)
  - [ ] Is the setup (e.g., available cores, hardware) suitable for the requirements of all fuzzers compared in the evaluation?
  - [ ] Seeds:
    - [ ] Are the used seeds documented?
    - [ ] Are the used seeds uninformed?
    - [ ] Are the seeds publicly available?
    - [ ] Were all fuzzers provided with the same set of seeds?
    - [ ] If informed seeds are used, is the initial coverage achieved by those visible from, e.g., plots?

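Several of the setup items above (equal computation time, repeated trials, identical seeds, CPU pinning) can be enforced by the campaign driver itself. Below is a minimal, hypothetical sketch of such a driver; the fuzzer names and command lines are placeholders rather than the interfaces of any real artifact, and the pinning relies on Linux's `os.sched_setaffinity`.

```python
#!/usr/bin/env python3
"""Sketch of a fair campaign runner: every fuzzer gets the same seed corpus,
the same wall-clock budget, the same number of trials, and one pinned CPU."""

import os
import subprocess
import time
from pathlib import Path

TRIALS = 10                   # >= 10 repetitions for meaningful statistics
BUDGET_SECS = 24 * 60 * 60    # >= 24h of wall-clock time per trial
SEED_CORPUS = Path("seeds")   # the same documented seed set for every fuzzer

# Hypothetical invocations; "{out}" is replaced by the per-trial output dir.
FUZZERS = {
    "baseline": ["./baseline-fuzzer", "-i", str(SEED_CORPUS), "-o", "{out}", "./target"],
    "new-approach": ["./new-fuzzer", "--seeds", str(SEED_CORPUS), "--out", "{out}", "./target"],
}

def start(name: str, argv: list[str], trial: int, cpu: int) -> subprocess.Popen:
    out_dir = Path("results") / name / f"trial-{trial:02d}"
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = [arg.replace("{out}", str(out_dir)) for arg in argv]
    # Pin the child process to a single CPU so no fuzzer silently uses more cores.
    return subprocess.Popen(cmd, preexec_fn=lambda: os.sched_setaffinity(0, {cpu}))

if __name__ == "__main__":
    for trial in range(TRIALS):
        procs = [start(name, argv, trial, cpu)
                 for cpu, (name, argv) in enumerate(FUZZERS.items())]
        time.sleep(BUDGET_SECS)       # identical wall-clock budget for all fuzzers
        for proc in procs:
            proc.terminate()
            proc.wait()
```
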
- Evaluation Metrics
  - [ ] Are standard metrics, such as coverage over time or the number of (deduplicated) bugs, used?
  - [ ] Is it specified how coverage was collected?
  - [ ] Is a collision-free encoding used for coverage collection?
  - [ ] If collecting coverage for JIT-based emulation targets, are basic blocks reported instead of translation blocks (jitted blocks)?
  - [ ] Is no metric used that the new approach is naturally optimized for?
  - [ ] Bug finding:
    - [ ] Are targets used that are relevant (have a sufficient user base, forks, GitHub stars, ...)?
    - [ ] Are the targets not deprecated and still receiving updates?
    - [ ] Are the targets known for having unfixed bugs?
    - [ ] If reproducing existing CVEs, are the CVEs themselves valid and not duplicates or rejected?
    - [ ] Is an uninstrumented binary used for crash deduplication and reproduction?
    - [ ] Is the process of deduplicating and triaging crashes described? (See the stack-hash sketch after this section.)
    - [ ] Are CVE IDs provided (if possible, anonymously)?
    - [ ] Are the CVEs referencing different bugs and not duplicates?

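One widely used way to address the crash-deduplication items above is to bucket crashes by a hash of the top stack frames, taken from re-running each crashing input on an uninstrumented build. The sketch below is hypothetical: `parse_backtrace` is a placeholder for whatever actually extracts frames (e.g., from an ASan report or a debugger), and the report directory name is made up.

```python
#!/usr/bin/env python3
"""Sketch: deduplicate crashes by hashing the top N frames of each backtrace."""

import hashlib
from collections import defaultdict
from pathlib import Path

TOP_FRAMES = 3  # common choice; document whichever value is actually used

def parse_backtrace(report: str) -> list[str]:
    """Placeholder: extract frame lines (e.g., '#0 foo() at bar.c:42') from a report."""
    return [line.strip() for line in report.splitlines() if line.lstrip().startswith("#")]

def bucket_id(report: str) -> str:
    """Identify a crash bucket by the hash of its top frames."""
    frames = parse_backtrace(report)[:TOP_FRAMES]
    return hashlib.sha1("\n".join(frames).encode()).hexdigest()[:12]

def deduplicate(report_dir: Path) -> dict[str, list[Path]]:
    buckets: dict[str, list[Path]] = defaultdict(list)
    for report in sorted(report_dir.glob("*.txt")):
        buckets[bucket_id(report.read_text(errors="replace"))].append(report)
    return buckets

if __name__ == "__main__":
    for bucket, crashes in deduplicate(Path("crash-reports")).items():
        print(f"{bucket}: {len(crashes)} crash(es), e.g. {crashes[0].name}")
```

Whatever scheme is used, the paper should state it explicitly, since bucket counts can differ considerably between stack hashing, coverage-based deduplication, and manual triage.
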
- Statistical Evaluation
  - [ ] Are measures of uncertainty, such as intervals in plots, used?
  - [ ] Is a sufficient number of trials used (>= 10)?
  - [ ] Are reasonable statistical tests used to check for the significance of the results? (See the sketch after this section.)
  - [ ] Are reasonable statistical tests used to study the effect size?

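As an illustration of the last two items, the sketch below runs a Mann-Whitney U test for significance and computes the Vargha-Delaney A12 effect size over the final coverage of repeated trials. The coverage numbers are made-up placeholders; only the shape of the comparison is the point.

```python
#!/usr/bin/env python3
"""Sketch: significance (Mann-Whitney U) and effect size (Vargha-Delaney A12)."""

from scipy.stats import mannwhitneyu

# Hypothetical final edge coverage of 10 trials per fuzzer (placeholder data).
new_approach = [1510, 1492, 1533, 1478, 1521, 1499, 1540, 1512, 1488, 1527]
baseline = [1432, 1465, 1420, 1450, 1447, 1439, 1461, 1428, 1455, 1444]

def a12(x: list[float], y: list[float]) -> float:
    """Vargha-Delaney A12: probability that a random trial of x beats one of y."""
    wins = sum(1.0 if xi > yi else 0.5 if xi == yi else 0.0
               for xi in x for yi in y)
    return wins / (len(x) * len(y))

if __name__ == "__main__":
    stat, p_value = mannwhitneyu(new_approach, baseline, alternative="two-sided")
    print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.4f}")
    print(f"Vargha-Delaney A12 = {a12(new_approach, baseline):.2f}")
```
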
- Other
  - [ ] Are threats to validity considered and discussed?