ClawProBench

Transparent live-first benchmark harness for evaluating model capability inside the OpenClaw runtime.
102 active scenarios, 162 catalog scenarios, deterministic grading, and OpenClaw-native coverage.

ClawProBench focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is the core profile; broader active coverage remains available through intelligence, coverage, native, and full.

The current worktree inventory reports 102 active scenarios and 162 total catalog scenarios (60 incubating) via python3 run.py inventory --json and python3 run.py inventory --benchmark-status all --json.

Leaderboard

Browse the public leaderboard and benchmark cases at suyoumo.github.io/bench.

📢 Updates

v1.0.3 - Reviewed leaderboard, detail, and raw-result consistency across 40+ benchmark models; fixed confirmed data mismatches for doubao-seed-code, qwen3.6-plus, qwen3-max-2026-01-23, astron-code-latest, and ERNIE-4.5-Turbo.
v1.0.2 - Added kimi-for-coding, gemma4-31b, and kimi-k2-thinking; improved image download flows for easier mobile-device browsing.
v1.0.1 - Added qwen3-coder-next, doubao-seed-code, qwen3-max-2026-01-23, and qwen3.6plus rerun with bailiancodingplan; added model image download and benchmark sharing to Twitter; fixed completed-report resume overwrite, tool_use_14 graceful fallback on skills inventory load failure, tool_use_17 invalid JSON and missing-file tolerance, and audit_scenario_quality.py compatibility.
v1.0.0 - ClawProBench released with 102 tasks across 6 domains, with 3-try runs, checkpoint resume, and cross-environment resume support.

Evaluation Logic

Default ranking path: core
Extended active capability suite: intelligence
Native-only slice: native
Multi-trial runs are supported via --trials N
Reports expose avg_score, max_score, coverage-aware summaries, cost, latency, and resume metadata
Interrupted runs can continue with --continue or --resume-from, and execution failures can be re-queued with --rerun-execution-failures

Quick Start

We recommend using uv for fast, reliable Python environment setup:

pip install uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt

Before running the benchmark, make sure your local OpenClaw runtime is available:

openclaw --help
openclaw agents list --json

Inspect the benchmark catalog and validate the scenario set:

python3 run.py inventory
python3 run.py inventory --json
python3 run.py dry

Run a one-trial smoke on the default ranking benchmark:

python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 1 \
  --cleanup-agents

Run the full default benchmark:

python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 3 \
  --cleanup-agents

Compare generated reports:

python3 run.py compare --results-dir results

For isolated same-host runs, the harness also supports:

--openclaw-profile
--openclaw-state-dir
--openclaw-config-path
--openclaw-gateway-port
--openclaw-binary

Benchmark Profiles

Profile	Active scenarios	Purpose
`core`	26	Default ranking suite
`intelligence`	95	Extended active capability benchmark
`coverage`	7	Lower-stakes breadth and regression slice
`native`	36	Active OpenClaw-native slice only
`full`	102	Union of all active scenarios

The benchmark catalog also includes 60 incubating scenarios that can be inspected with --benchmark-status all.

OpenClaw Runtime

Live runs expect a working local openclaw CLI plus the auth and config required by the surfaces exercised by the selected scenarios. If your binary is not on PATH, set OPENCLAW_BINARY or pass --openclaw-binary.

config/openclaw.json.template is provided as a reference template for local OpenClaw configuration and isolated-run setups.

Repo Map

run.py: CLI entrypoint for inventory, dry, run, and compare
harness/: loader, runner, scoring, reporting, and live OpenClaw bridge
scenarios/: benchmark tasks in YAML
datasets/: seeded live-task data and optional setup / teardown scripts
custom_checks/: scenario-specific grading logic
tests/: regression coverage for loader, runner, scoring, and reporting
docs/: public assets plus evaluation validation and benchmark-profile policy

Generated Output

Benchmark reports are written to results/. They are generated runtime artifacts and are intentionally ignored by version control in this repo layout.

Citation

If you use ClawProBench in your research, please cite:

@misc{clawprobench2026,
  title={ClawProBench — a transparent benchmark for true intelligence in real-world AI agents.},
  author={suyoumo},
  year={2026},
  url={https://github.com/suyoumo/ClawProBench}
}

Contribution

We welcome issues, documentation fixes, scenario improvements, grader hardening, and benchmark-engine contributions. See CONTRIBUTING.md for setup and validation guidance.

Acknowledgements

This project was informed by prior open-source work on agent evaluation, benchmark design, and real-world task assessment.

We drew ideas from projects such as PinchBench, Claw-Eval, AgencyBench, and related agent-benchmark efforts, especially in areas like task design, evaluation methodology, harness structure, and public benchmark presentation.

Some tasks in this repository are adapted and reworked from earlier public benchmark-style task sets into the OpenClaw runtime and grading framework.

Contributors

Public contributor list: waiting.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ClawProBench

Leaderboard

📢 Updates

Evaluation Logic

Quick Start

Benchmark Profiles

OpenClaw Runtime

Repo Map

Generated Output

Citation

Contribution

Acknowledgements

Contributors

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
config		config
custom_checks		custom_checks
datasets		datasets
docs		docs
fixtures		fixtures
frameworks		frameworks
harness		harness
mock_tools		mock_tools
results		results
scenarios		scenarios
scripts		scripts
tests		tests
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

ClawProBench

Leaderboard

📢 Updates

Evaluation Logic

Quick Start

Benchmark Profiles

OpenClaw Runtime

Repo Map

Generated Output

Citation

Contribution

Acknowledgements

Contributors

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages