Project Page · Paper · Dataset
AutoMat is a claim-runner for reproducibility experiments with three agents:
- automat: multi-phase LLM agent workflow (agents/automat_agent)
- cc: single-session Claude Code CLI agent (agents/claude_code); can target the Kimi backend with --kimi
- codex: single-session OpenAI Codex CLI agent (agents/codex_cli)
- 05/12/2026: Repository created.
For a normal run, automat.py executes this pipeline:
- Resolve claim_id from <claim-path>/meta/provenance.json (fallback: directory name).
- Setup environment: create a temporary work dir (/tmp/automat_<claim_id>_*).
- Run agent (automat, cc, or codex).
  - For automat, choose --agent-mode:
    - orchestrated: PLAN -> SETUP -> EXECUTE -> ANALYZE
    - hybrid: orchestrated + global recovery rounds when EXECUTE still has failed steps
  - In phased modes, PLAN/SETUP/ANALYZE now write markdown artifacts (plan.md, setup_report.md, analysis_report.md) and the orchestrator parses only required fields.
  - If required fields are missing, AutoMat issues one targeted repair round asking the agent to patch the same markdown file in place before failing the phase.
  - EXECUTE remains deterministic Python orchestration driven by parsed plan step fields.
- Evaluate using the holistic LLM reproducibility evaluator.
- Cleanup temp work dir; optional archive to automat_runs/ if --archive is enabled.
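The parse-then-repair behaviour for the markdown artifacts can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not AutoMat's actual implementation: the `Name: value` line format, the example field names, and the `read_artifact` / `request_repair` callables are all hypothetical.

```python
import re
from typing import Callable


def parse_required_fields(markdown: str, required: list[str]) -> tuple[dict[str, str], list[str]]:
    """Extract 'Name: value' lines (an assumed format) and list the fields not found."""
    fields: dict[str, str] = {}
    for name in required:
        # [ \t]* (not \s*) so an empty 'Name:' line does not swallow the next line.
        match = re.search(rf"^{re.escape(name)}:[ \t]*(\S.*)$", markdown, flags=re.MULTILINE)
        if match:
            fields[name] = match.group(1).strip()
    missing = [name for name in required if name not in fields]
    return fields, missing


def run_phase_with_repair(
    read_artifact: Callable[[], str],
    request_repair: Callable[[list[str]], None],
    required: list[str],
) -> dict[str, str]:
    """Parse the artifact; if fields are missing, allow exactly one repair round."""
    fields, missing = parse_required_fields(read_artifact(), required)
    if missing:
        # Hypothetical hook: ask the agent to patch the same markdown file in place.
        request_repair(missing)
        fields, missing = parse_required_fields(read_artifact(), required)
    if missing:
        raise ValueError(f"phase failed, missing required fields: {missing}")
    return fields
```

Only the required fields are read, so the agent stays free to write any additional prose in the artifact.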
- Python 3.12 (default) or 3.11 (fallback for molecular-dynamics claims with dependency incompatibilities under 3.12)
git clone https://github.com/JHU-CLSP/AutoMat.git
cd AutoMat
pip install -r requirements.txt
pip install -e .

Optional (Conda):
conda env create -f env.yml
conda activate automat

The benchmark claims are not bundled with this code. They are distributed
as a gated Hugging Face dataset: jhu-clsp/AutoMat.
Every command below that takes --claim-path claims/AUTOMAT-XXXX expects the
corresponding claim directory to already exist locally, so download it first.
- Request access on the dataset page (it is gated) and wait for approval.
- Authenticate. The hf CLI ships with huggingface_hub (already in requirements.txt). Verify you are logged in:

      hf auth whoami

  If that reports you are not logged in:

      hf auth login

- Download the claims into this repo's claims/ directory (so --claim-path claims/AUTOMAT-XXXX resolves without any extra flags):

      # all published claims
      hf download jhu-clsp/AutoMat --repo-type dataset --include "claims/*" --local-dir .

  Or fetch just one claim:

      hf download jhu-clsp/AutoMat --repo-type dataset --include "claims/AUTOMAT-0007*" --local-dir .

Python equivalent:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="jhu-clsp/AutoMat",
    repo_type="dataset",
    allow_patterns=["claims/AUTOMAT-0007*"],  # omit to fetch all claims
    local_dir=".",
)

Notes:
- manifest.parquet at the dataset root lists exactly the claims included in the current release. A small number of claims are withheld pending the official publication of their papers and will be added in a later revision.
- If you download into a different directory, point --claim-path at the matching claims/AUTOMAT-XXXX path under that directory.
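To make the last note concrete, here is a small sketch that builds the `--claim-path` value for a given download directory and fails early if the claim has not been fetched. The helper name `claim_path_for` is hypothetical and is not part of AutoMat.

```python
from pathlib import Path


def claim_path_for(claim_id: str, download_root: str = ".") -> Path:
    """Return <download_root>/claims/<claim_id>, raising if it is not on disk yet."""
    path = Path(download_root) / "claims" / claim_id
    if not path.is_dir():
        raise FileNotFoundError(f"{path} not found; download the claim first")
    return path
```

For example, after downloading with `--local-dir /data/automat`, `claim_path_for("AUTOMAT-0007", "/data/automat")` gives the path to pass as `--claim-path`.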
Make sure you have downloaded the relevant claim directory first (see Getting the Claims).
python automat.py \
--claim-path claims/AUTOMAT-0007 \
--agent automat \
--agent-mode hybrid \
--max-recovery-rounds 2 \
--archive

Use a specific conda env for agent command execution:
python automat.py --claim-path claims/AUTOMAT-0007 --agent automat:myenv

Claude Code CLI agent (requires claude in PATH):
python automat.py --claim-path claims/AUTOMAT-0007 --agent cc --archive

Claude Code agent with a specific conda environment:
python automat.py --claim-path claims/AUTOMAT-0007 --agent cc:matsci_env

OpenAI Codex CLI agent (requires codex in PATH):
python automat.py --claim-path claims/AUTOMAT-0007 --agent codex --provider openai --archive

Run Claude Code against the Kimi backend:
python automat.py --claim-path claims/AUTOMAT-0007 --agent cc --kimi

Equivalent launcher invocation:
./launch.sh -c claims/AUTOMAT-0007 -a automat -A

launch.sh is a thin wrapper over automat.py and exposes the common run flags.
agents/automat_agent/utils/tools.py now supports a remote PDF parsing mode for
the ParsePDF tool. This lets many independent AutoMat runs share one
long-lived Docling parser service, which is useful when PDF parsing is GPU-backed
and you want one GPU node to serve many jobs.
Start the service on the node that owns the parser runtime:
python scripts/pdf_parse_service.py

Optional service env vars:
- PDF_PARSE_SERVICE_HOST (default: 0.0.0.0)
- PDF_PARSE_SERVICE_PORT (default: 8765)
- PDF_PARSE_SERVICE_CONCURRENCY (default: 1)
- PDF_PARSE_SERVICE_TOKEN (optional bearer token)
Point AutoMat runs at that service:
export AUTOMAT_PDF_PARSE_SERVICE_URL=http://gpu-node:8765
export AUTOMAT_PDF_PARSE_SERVICE_TOKEN=your-shared-token

Optional client env vars:
- AUTOMAT_PDF_PARSE_SERVICE_TIMEOUT_SECONDS (default: 1800)
If AUTOMAT_PDF_PARSE_SERVICE_URL is unset, ParsePDF keeps using the local
in-process Docling parser exactly as before.
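The client-side selection between remote and local parsing can be pictured with the sketch below. It only reads the environment variables named above; the function name, the returned dictionary shape, the `Authorization: Bearer ...` header format, and treating an empty URL like an unset one are assumptions, and no service endpoint path is implied.

```python
import os
from collections.abc import Mapping


def pdf_parse_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Decide between the remote parser service and the local in-process parser."""
    url = env.get("AUTOMAT_PDF_PARSE_SERVICE_URL", "").strip()
    if not url:
        # No service URL configured: keep using the local Docling parser.
        return {"mode": "local"}
    headers: dict[str, str] = {}
    token = env.get("AUTOMAT_PDF_PARSE_SERVICE_TOKEN", "")
    if token:
        headers["Authorization"] = f"Bearer {token}"  # assumed header format
    timeout = float(env.get("AUTOMAT_PDF_PARSE_SERVICE_TIMEOUT_SECONDS", "1800"))
    return {"mode": "remote", "url": url, "headers": headers, "timeout": timeout}
```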
- --agent-mode orchestrated|hybrid (default: orchestrated)
- --max-recovery-rounds N (default: 0)
Notes:
- Recovery rounds apply to orchestrated/hybrid execution.
- In hybrid mode, --max-recovery-rounds 0 is treated as 1.
- A recovery round re-runs PLAN/SETUP/EXECUTE with failure context from the prior round before final ANALYZE.
- Internal refactor migration: the new object-oriented orchestrator (phase-runner/execution-engine/workflow classes) is now the default. Set AUTOMAT_ORCHESTRATOR_V2=0 to force the legacy orchestrated implementation.
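The flag interaction described in these notes can be written out as a short sketch. Both function names are hypothetical; the code mirrors the stated rules rather than AutoMat's source.

```python
import os
from collections.abc import Mapping


def effective_recovery_rounds(agent_mode: str, max_recovery_rounds: int = 0) -> int:
    """Apply the rule that hybrid mode always gets at least one recovery round."""
    if agent_mode not in ("orchestrated", "hybrid"):
        raise ValueError(f"unsupported agent mode: {agent_mode!r}")
    if agent_mode == "hybrid" and max_recovery_rounds == 0:
        return 1
    return max_recovery_rounds


def use_legacy_orchestrator(env: Mapping[str, str] = os.environ) -> bool:
    """The legacy implementation is selected only when the variable is exactly '0'."""
    return env.get("AUTOMAT_ORCHESTRATOR_V2") == "0"
```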
automat.py runs a holistic LLM-based reproducibility evaluation
(harness/evaluators/llm_repro_evaluator.py) that scores the agent's run from
the trace log, artifact contents, and analysis report; no per-claim rubric
file is required.
python automat.py \
--evaluate \
--claim-path claims/AUTOMAT-0007 \
--artifact-dir artifacts/AUTOMAT-0007_ashargh1_20260201_095855

Evaluation outputs are written to the artifact directory:
- repro_evaluation_results.json
- repro_eval_prompt.txt (debug copy of the evaluator prompt)
For agent runs, provider flags are:
- --provider anthropic|openai|gemini
- --api-key
- --base-url
If not passed via CLI, automat.py resolves keys from env vars:
- anthropic -> ANTHROPIC_API_KEY
- openai -> OPENAI_API_KEY
- gemini -> GOOGLE_API_KEY
Note: current evaluators in harness/evaluators/ use Claude Agent SDK models directly.
claims/
  <CLAIM_DIR>/
    agent_view/
      claim.txt
      assets/
      paper/
    meta/
      claim.md
      provenance.json
    reference/
      reproduction.txt
      expected/
meta/provenance.json -> claim_id is used as the run identifier and artifact prefix (fallback: the claim directory name).
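A minimal sketch of that resolution rule follows. It assumes `claim_id` is a top-level key in provenance.json and that an unreadable or malformed file also triggers the fallback; neither detail is confirmed here, and the function name is hypothetical.

```python
import json
from pathlib import Path


def resolve_claim_id(claim_path: str) -> str:
    """Read claim_id from meta/provenance.json, else use the directory name."""
    claim_dir = Path(claim_path)
    provenance = claim_dir / "meta" / "provenance.json"
    try:
        data = json.loads(provenance.read_text(encoding="utf-8"))
        claim_id = data.get("claim_id") if isinstance(data, dict) else None
    except (OSError, json.JSONDecodeError):
        claim_id = None  # missing or malformed file: fall back below
    return str(claim_id) if claim_id else claim_dir.resolve().name
```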
When running with --agent automat, the agent executes inside an isolated /tmp sandbox:
/tmp/automat_<claim_id>_<random>/
  agent_view/   # copy of the claim's agent_view (read-safe)
  artifacts/    # symlink -> artifacts/<claim_id>_<YYYYMMDD_HHMMSS>/

- agent_view/ is a full copy, so the agent cannot modify the original claim files.
- artifacts/ is a symlink to the persistent artifact directory; all writes go to the real location.
- On completion (or error), any stray files in the sandbox are moved to the artifact directory and the /tmp sandbox is removed.
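The sandbox lifecycle described above can be approximated with the standard library as follows. This is a sketch of the layout only; the function names and the `tmp_root` parameter are hypothetical, and the real harness may handle name collisions and errors differently.

```python
import shutil
import tempfile
from pathlib import Path


def create_sandbox(claim_dir: Path, artifact_dir: Path, claim_id: str, tmp_root: str = "/tmp") -> Path:
    """Create <tmp_root>/automat_<claim_id>_<random>/ with a copied agent_view and an artifacts symlink."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    sandbox = Path(tempfile.mkdtemp(prefix=f"automat_{claim_id}_", dir=tmp_root))
    # Full copy: edits inside the sandbox never touch the original claim files.
    shutil.copytree(claim_dir / "agent_view", sandbox / "agent_view")
    # Symlink: writes under artifacts/ land in the persistent directory.
    (sandbox / "artifacts").symlink_to(artifact_dir.resolve(), target_is_directory=True)
    return sandbox


def teardown_sandbox(sandbox: Path, artifact_dir: Path) -> None:
    """Move stray files into the artifact directory, then remove the sandbox."""
    for entry in list(sandbox.iterdir()):
        if entry.name in ("agent_view", "artifacts"):
            continue
        shutil.move(str(entry), str(artifact_dir / entry.name))
    (sandbox / "artifacts").unlink(missing_ok=True)  # drop the link, keep its target
    shutil.rmtree(sandbox)
```

Calling `teardown_sandbox` from a `finally` block would match the "on completion (or error)" behaviour.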
Primary outputs are created in:
artifacts/<claim_id>_<YYYYMMDD_HHMMSS>/
Common files include:
- trace.log
- plan.md
- plan.json
- setup_report.md
- setup_result.json
- execution_results.json
- recovery_summary.json
- phase_error_<phase>_artifact.json (when markdown artifact validation fails)
- phase_error_<phase>.json / phase_error_<phase>_validation.json (legacy structured-output failures, e.g. DIAGNOSE)
- analysis_report.md
- analysis_result.json
- results.json
- evaluation result JSON files
When recovery is used, round snapshots are also saved:
- plan_round<N>.md
- plan_round<N>.json
- setup_report_round<N>.md
- setup_result_round<N>.json
- execution_results_round<N>.json
- diagnosis_round<N>_<step>_attempt<M>.json
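When inspecting a run that used recovery, it can help to group these snapshots by round. The sketch below does that from file names alone; it assumes `<N>` is a plain decimal integer, and the helper name is hypothetical.

```python
import re
from collections import defaultdict

# Matches the "_round<N>" marker followed by "." or "_" (e.g. plan_round2.md).
_ROUND_RE = re.compile(r"_round(\d+)[._]")


def group_by_round(filenames: list[str]) -> dict[int, list[str]]:
    """Map each recovery round number to its sorted snapshot file names."""
    rounds: dict[int, list[str]] = defaultdict(list)
    for name in filenames:
        match = _ROUND_RE.search(name)
        if match:
            rounds[int(match.group(1))].append(name)
    return {number: sorted(names) for number, names in sorted(rounds.items())}
```

Files without a round marker, such as plan.md, are ignored.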
Harness-level archive path:
automat_runs/<claim_id>_<YYYYMMDD_HHMMSS>/
(contains copied temp work dir, events, and harness result JSON)
make help
make install
make run-automat