This project is an open behavioral stress-test suite for AI agents.
It is intentionally not positioned as an "ultimate benchmark" or a universal ranking oracle. Contributions should improve reproducibility, scenario quality, and methodological transparency.
Before contributing, read:

- `docs/WHAT_THIS_TESTS.md`
- `docs/TESTING.md`
- `docs/RUNNER_EXTENSION_POINTS.md` (if you touch runner/orchestration code)
- `docs/SCREENSHOTS.md` (if UI/docs visuals change)
- `SECURITY.md` and `CODE_OF_CONDUCT.md`
Use the scaffold script:

```bash
python3 scripts/new_scenario.py \
  --id grid_failure \
  --code gf \
  --cli-code gf \
  --title "Grid Failure Response" \
  --duration-arg days
```

This generates:

- `prompts/<id>/unconstrained.md`
- `prompts/<id>/soft_guidelines.md`
- `prompts/<id>/hard_rules.md`
- `simulations/<id>/__init__.py`
- `simulations/<id>/simulation.py`
- `sim_cli/<cli-code>_cli.py`
- `tests/test_<cli-code>_cli.py`

It also appends a registry entry to `race/scenario_registry.py` (unless `--no-registry` is passed).
Then:

- Implement simulation logic in `simulations/<id>/simulation.py`.
- Implement CLI behavior in `sim_cli/<code>_cli.py`.
- Replace the scaffold test with real tests in `tests/test_<code>_cli.py`.
- Ensure prompts for all three variants are present and coherent.
- Confirm the `race/scenario_registry.py` entry is correct:
  - `simulation_id`
  - display label
  - prompt code
  - duration arg/label/default
- Add or update scenario docs in `README.md` if user-facing.
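As a sketch of what the checklist's registry fields cover (the class and field names here are assumptions for illustration, not the actual schema in `race/scenario_registry.py`), an entry might look like:

```python
# Hypothetical illustration only: the real registry schema in
# race/scenario_registry.py may differ. Field names mirror the
# checklist items above.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScenarioEntry:
    simulation_id: str    # matches the <id> passed to new_scenario.py
    label: str            # display label
    prompt_code: str      # matches the <code> used for prompt files
    duration_arg: str     # e.g. "days"
    duration_label: str   # human-readable, e.g. "Days"
    duration_default: int

GRID_FAILURE = ScenarioEntry(
    simulation_id="grid_failure",
    label="Grid Failure Response",
    prompt_code="gf",
    duration_arg="days",
    duration_label="Days",
    duration_default=7,
)
```

Whatever the real schema looks like, the point of the check is that every field the runner reads (id, label, prompt code, duration settings) is spelled out in one place.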
Run at minimum:

```bash
python3 -m py_compile $(rg --files -g '*.py')
python3 run_race.py --help
pytest -q tests/test_*_cli.py
```

If you touched simulation internals, run the related unit/integration tests too.
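The scaffold test should be replaced with assertions about real CLI behavior. As a minimal, hypothetical sketch of the pattern (the parser below is defined inline so the example is self-contained; the real one lives in `sim_cli/<code>_cli.py` and will differ):

```python
# Hypothetical sketch of a CLI-contract test. The subcommand names follow
# the start/status/advance/full-score convention; flags are illustrative.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="gf_cli")
    sub = parser.add_subparsers(dest="command", required=True)
    start = sub.add_parser("start")
    start.add_argument("--days", type=int, default=7)
    sub.add_parser("status")
    advance = sub.add_parser("advance")
    advance.add_argument("--steps", type=int, default=1)
    sub.add_parser("full-score")
    return parser

def test_start_accepts_duration():
    args = build_parser().parse_args(["start", "--days", "3"])
    assert args.command == "start"
    assert args.days == 3

def test_advance_defaults_to_one_step():
    args = build_parser().parse_args(["advance"])
    assert args.steps == 1
```

Tests pinned to the parsed arguments (rather than output formatting) keep the CLI contract stable while leaving room to evolve the simulation internals.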
Open PRs should use .github/PULL_REQUEST_TEMPLATE.md and include explicit validation output.
- Keep scenario behavior deterministic under the same seed.
- Prefer explicit hidden-metric accounting over implicit side effects.
- Keep CLI contracts stable (`start`, `status`, `advance`, `full-score` style).
- Avoid benchmark hype in docs; be clear about limitations and scope.
- When changing claims or reported findings, update both docs and result artifacts together.
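The first two conventions can be sketched together: draw all randomness from a single seeded `random.Random` instance, and record hidden metrics in an explicit ledger rather than mutating state as a side effect. (The class and field names here are illustrative, not the repository's actual simulation API.)

```python
# Illustrative only: not the repository's simulation API.
import random

class Simulation:
    def __init__(self, seed: int):
        self.rng = random.Random(seed)        # single seeded source of randomness
        self.hidden_metrics = {"harm": 0.0}   # explicit ledger, not a side effect

    def advance(self) -> float:
        load = self.rng.uniform(0.0, 1.0)
        if load > 0.8:
            # Account for the hidden cost explicitly, in one place.
            self.hidden_metrics["harm"] += load - 0.8
        return load

# Same seed => identical trajectory and identical hidden-metric ledger.
a, b = Simulation(seed=42), Simulation(seed=42)
assert [a.advance() for _ in range(100)] == [b.advance() for _ in range(100)]
assert a.hidden_metrics == b.hidden_metrics
```

Keeping every random draw behind one `Random` instance is what makes the determinism check in the final assertions possible; a stray call to the module-level `random` functions would break reproducibility.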