AutoMat: Reproducibility Benchmark for Computational Materials Science

🌐 Project Page · 📄 Paper · 🤗 Dataset

AutoMat is a claim-runner for reproducibility experiments with three agents:

automat: multi-phase LLM agent workflow (agents/automat_agent)
cc: single-session Claude Code CLI agent (agents/claude_code); can target the Kimi backend with --kimi
codex: single-session OpenAI Codex CLI agent (agents/codex_cli)

Updates

05/12/2026 — Repository created.

Current Workflow

For a normal run, automat.py executes this pipeline:

1. Resolve claim_id from <claim-path>/meta/provenance.json (fallback: directory name).
2. Set up the environment: create a temporary work dir (/tmp/automat_<claim_id>_*).
3. Run the agent (automat, cc, or codex).
   - For automat, choose --agent-mode:
     - orchestrated: PLAN → SETUP → EXECUTE → ANALYZE
     - hybrid: orchestrated + global recovery rounds when EXECUTE still has failed steps
   - In phased modes, PLAN/SETUP/ANALYZE write markdown artifacts (plan.md, setup_report.md, analysis_report.md) and the orchestrator parses only the required fields.
   - If required fields are missing, AutoMat issues one targeted repair round asking the agent to patch the same markdown file in place before failing the phase.
   - EXECUTE remains deterministic Python orchestration driven by the parsed plan step fields.
4. Evaluate using the holistic LLM reproducibility evaluator.
5. Clean up the temp work dir; optionally archive it to automat_runs/ if --archive is enabled.
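
The validate-then-repair behaviour for the phased markdown artifacts can be sketched as follows. This is only an illustration of the control flow, not AutoMat's implementation: the `name: value` line format, the field names, and the helpers `parse_fields`, `run_phase_with_repair`, and `request_repair` are all assumptions made for the example.

```python
from typing import Callable


def parse_fields(markdown: str, required: list[str]) -> tuple[dict[str, str], list[str]]:
    """Collect `name: value` lines (hypothetical format) and list missing required fields."""
    found: dict[str, str] = {}
    for line in markdown.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            found[key.strip().lower()] = value.strip()
    missing = [name for name in required if name.lower() not in found]
    return found, missing


def run_phase_with_repair(
    markdown: str,
    required: list[str],
    request_repair: Callable[[str, list[str]], str],
) -> dict[str, str]:
    """Parse required fields; allow exactly one repair round before failing the phase."""
    fields, missing = parse_fields(markdown, required)
    if missing:
        # One targeted repair round: the (hypothetical) callback returns the patched file text.
        markdown = request_repair(markdown, missing)
        fields, missing = parse_fields(markdown, required)
    if missing:
        raise ValueError(f"phase failed, still missing fields: {missing}")
    return fields
```

Passing the repair step in as a callable keeps the sketch independent of any particular agent backend; in a real run that callback would be whatever asks the agent to patch the file in place.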

Installation

Prerequisites

Python 3.12 (default) or 3.11 (fallback for molecular-dynamics claims with dependency incompatibilities under 3.12)

Setup

git clone https://github.com/JHU-CLSP/AutoMat.git
cd AutoMat
pip install -r requirements.txt
pip install -e .

Optional (Conda):

conda env create -f env.yml
conda activate automat

Getting the Claims (Hugging Face)

The benchmark claims are not bundled with this code. They are distributed as a gated Hugging Face dataset: jhu-clsp/AutoMat. Every command below that takes --claim-path claims/AUTOMAT-XXXX expects the corresponding claim directory to already exist locally, so download it first.

Request access on the dataset page (it is gated) and wait for approval.
Authenticate. The hf CLI ships with huggingface_hub (already in requirements.txt). Verify you are logged in:
```
hf auth whoami
```
If that reports you are not logged in:
```
hf auth login
```

Download the claims into this repo's claims/ directory (so --claim-path claims/AUTOMAT-XXXX resolves without any extra flags):

# all published claims
hf download jhu-clsp/AutoMat --repo-type dataset --include "claims/*" --local-dir .

Or fetch just one claim:

hf download jhu-clsp/AutoMat --repo-type dataset --include "claims/AUTOMAT-0007*" --local-dir .

Python equivalent:

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="jhu-clsp/AutoMat",
    repo_type="dataset",
    allow_patterns=["claims/AUTOMAT-0007*"],  # omit to fetch all claims
    local_dir=".",
)

Notes:

manifest.parquet at the dataset root lists exactly the claims included in the current release. A small number of claims are withheld pending the official publication of their papers and will be added in a later revision.
If you download into a different directory, point --claim-path at the matching claims/AUTOMAT-XXXX path under that directory.
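
To confirm which claims are available locally before launching a run, a small check like the one below may help. It is a sketch that assumes claim directories follow the claims/AUTOMAT-XXXX naming used in the commands above; the function name `list_local_claims` is hypothetical and not part of the repository.

```python
from pathlib import Path


def list_local_claims(root: str = ".") -> list[str]:
    """Return the sorted names of claim directories found under <root>/claims."""
    claims_dir = Path(root) / "claims"
    if not claims_dir.is_dir():
        return []
    # Only directories count as claims; stray files in claims/ are ignored.
    return sorted(p.name for p in claims_dir.glob("AUTOMAT-*") if p.is_dir())


if __name__ == "__main__":
    for name in list_local_claims("."):
        print(name)
```

If you downloaded into a different directory, pass that directory as `root`.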

Running Claims

Make sure you have downloaded the relevant claim directory first — see Getting the Claims.

Recommended: run `automat.py` directly

python automat.py \
  --claim-path claims/AUTOMAT-0007 \
  --agent automat \
  --agent-mode hybrid \
  --max-recovery-rounds 2 \
  --archive

Use a specific conda env for agent command execution:

python automat.py --claim-path claims/AUTOMAT-0007 --agent automat:myenv

Claude Code CLI agent (requires claude in PATH):

python automat.py --claim-path claims/AUTOMAT-0007 --agent cc --archive

Claude Code agent with a specific conda environment:

python automat.py --claim-path claims/AUTOMAT-0007 --agent cc:matsci_env

OpenAI Codex CLI agent (requires codex in PATH):

python automat.py --claim-path claims/AUTOMAT-0007 --agent codex --provider openai --archive

Run Claude Code against the Kimi backend:

python automat.py --claim-path claims/AUTOMAT-0007 --agent cc --kimi
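
The `--agent` values above combine an agent name with an optional conda environment (`automat:myenv`, `cc:matsci_env`). A minimal sketch of how such a value could be split is shown here. The names `AgentSpec` and `parse_agent_spec` are hypothetical, and accepting the `:env` suffix for every agent is an assumption; the examples above only show it for automat and cc.

```python
from typing import NamedTuple, Optional

KNOWN_AGENTS = ("automat", "cc", "codex")


class AgentSpec(NamedTuple):
    name: str
    conda_env: Optional[str]


def parse_agent_spec(value: str) -> AgentSpec:
    """Split 'agent' or 'agent:conda_env' into its parts."""
    name, sep, env = value.partition(":")
    if name not in KNOWN_AGENTS:
        raise ValueError(f"unknown agent {name!r}; expected one of {KNOWN_AGENTS}")
    if sep and not env:
        raise ValueError("conda environment name is empty after ':'")
    return AgentSpec(name=name, conda_env=env or None)
```

For example, `parse_agent_spec("cc:matsci_env")` yields the agent name `cc` and the environment `matsci_env`, while `parse_agent_spec("codex")` leaves the environment unset.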

Launcher wrapper (`launch.sh`)

./launch.sh -c claims/AUTOMAT-0007 -a automat -A

launch.sh is a thin wrapper over automat.py and exposes the common run flags.

Shared PDF Parsing Service

agents/automat_agent/utils/tools.py now supports a remote PDF parsing mode for the ParsePDF tool. This lets many independent AutoMat runs share one long-lived Docling parser service, which is useful when PDF parsing is GPU-backed and you want one GPU node to serve many jobs.

Start the service on the node that owns the parser runtime:

python scripts/pdf_parse_service.py

Optional service env vars:

PDF_PARSE_SERVICE_HOST (default: 0.0.0.0)
PDF_PARSE_SERVICE_PORT (default: 8765)
PDF_PARSE_SERVICE_CONCURRENCY (default: 1)
PDF_PARSE_SERVICE_TOKEN (optional bearer token)

Point AutoMat runs at that service:

export AUTOMAT_PDF_PARSE_SERVICE_URL=http://gpu-node:8765
export AUTOMAT_PDF_PARSE_SERVICE_TOKEN=your-shared-token

Optional client env vars:

AUTOMAT_PDF_PARSE_SERVICE_TIMEOUT_SECONDS (default: 1800)

If AUTOMAT_PDF_PARSE_SERVICE_URL is unset, ParsePDF keeps using the local in-process Docling parser exactly as before.
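
The client-side switch between remote and local parsing can be pictured with the sketch below. It reads only the environment variables documented above. The class and function names are hypothetical, and sending the token as an `Authorization: Bearer` header is an assumption based on the "bearer token" wording, not a confirmed detail of the service.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PdfServiceConfig:
    url: str
    token: Optional[str]
    timeout_seconds: float

    def headers(self) -> dict[str, str]:
        # Assumed header shape for a bearer token; empty when no token is configured.
        return {"Authorization": f"Bearer {self.token}"} if self.token else {}


def load_pdf_service_config(env: Mapping[str, str] = os.environ) -> Optional[PdfServiceConfig]:
    """Return remote-service settings, or None to signal the local in-process parser."""
    url = env.get("AUTOMAT_PDF_PARSE_SERVICE_URL", "").strip()
    if not url:
        return None
    return PdfServiceConfig(
        url=url.rstrip("/"),
        token=env.get("AUTOMAT_PDF_PARSE_SERVICE_TOKEN") or None,
        timeout_seconds=float(env.get("AUTOMAT_PDF_PARSE_SERVICE_TIMEOUT_SECONDS", "1800")),
    )
```

Returning `None` when the URL is unset mirrors the documented fallback to the local Docling parser.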

AutoMat Agent Mode and Recovery Flags

--agent-mode orchestrated|hybrid (default: orchestrated)
--max-recovery-rounds N (default: 0)

Notes:

Recovery rounds apply to orchestrated/hybrid execution.
In hybrid mode, --max-recovery-rounds 0 is treated as 1.
A recovery round re-runs PLAN/SETUP/EXECUTE with failure context from the prior round before final ANALYZE.
Internal refactor migration: the new object-oriented orchestrator (phase-runner/execution-engine/workflow classes) is now the default. Set AUTOMAT_ORCHESTRATOR_V2=0 to force the legacy orchestrated implementation.
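
The interaction between `--agent-mode` and `--max-recovery-rounds` described in the notes can be summarized in a few lines. This is a sketch of the documented rule only; the function name `effective_recovery_rounds` is hypothetical, and rejecting negative values is an assumption.

```python
def effective_recovery_rounds(agent_mode: str = "orchestrated", max_recovery_rounds: int = 0) -> int:
    """Return the number of recovery rounds actually used for a given mode."""
    if agent_mode not in ("orchestrated", "hybrid"):
        raise ValueError(f"unsupported agent mode: {agent_mode!r}")
    if max_recovery_rounds < 0:
        raise ValueError("max_recovery_rounds must be non-negative")
    # Documented rule: hybrid mode treats 0 as 1 so at least one recovery round can run.
    if agent_mode == "hybrid" and max_recovery_rounds == 0:
        return 1
    return max_recovery_rounds
```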

Evaluation

automat.py runs a holistic LLM-based reproducibility evaluation (harness/evaluators/llm_repro_evaluator.py) that scores the agent's run from the trace log, artifact contents, and analysis report — no per-claim rubric file is required.

Re-evaluate an existing artifact directory

python automat.py \
  --evaluate \
  --claim-path claims/AUTOMAT-0007 \
  --artifact-dir artifacts/AUTOMAT-0007_ashargh1_20260201_095855

Evaluation outputs are written to the artifact directory:

repro_evaluation_results.json
repro_eval_prompt.txt (debug copy of the evaluator prompt)

Provider and API Key Configuration

For agent runs, provider flags are:

--provider anthropic|openai|gemini
--api-key
--base-url

If not passed via CLI, automat.py resolves keys from env vars:

anthropic -> ANTHROPIC_API_KEY
openai -> OPENAI_API_KEY
gemini -> GOOGLE_API_KEY
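
The key lookup above amounts to a small mapping with the CLI value taking precedence. The sketch below illustrates that order; `resolve_api_key` is a hypothetical name, and returning `None` when nothing is set is an assumption about how a missing key is reported.

```python
import os
from typing import Mapping, Optional

PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def resolve_api_key(
    provider: str,
    cli_api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the --api-key value; otherwise read the provider's environment variable."""
    if cli_api_key:
        return cli_api_key
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    return env.get(PROVIDER_ENV_VARS[provider]) or None
```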

Note: current evaluators in harness/evaluators/ use Claude Agent SDK models directly.

Claim Pack Layout (current repository format)

claims/
  <CLAIM_DIR>/
    agent_view/
      claim.txt
      assets/
      paper/
    meta/
      claim.md
      provenance.json
    reference/
      reproduction.txt
      expected/

meta/provenance.json -> claim_id is used as the run identifier and artifact prefix (fallback: the claim directory name).
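
A minimal sketch of that lookup is given below. The function name `resolve_claim_id` is hypothetical, and falling back to the directory name when the file is unreadable or malformed (not just when it is absent) is an assumption.

```python
import json
from pathlib import Path
from typing import Union


def resolve_claim_id(claim_path: Union[str, Path]) -> str:
    """Read claim_id from meta/provenance.json, falling back to the claim directory name."""
    claim_dir = Path(claim_path)
    provenance = claim_dir / "meta" / "provenance.json"
    try:
        data = json.loads(provenance.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        data = {}
    claim_id = data.get("claim_id") if isinstance(data, dict) else None
    # resolve() normalizes inputs such as "." so the name is never empty.
    return str(claim_id) if claim_id else claim_dir.resolve().name
```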

Agent Sandbox Isolation

When running with --agent automat, the agent executes inside an isolated /tmp sandbox:

/tmp/automat_<claim_id>_<random>/
  agent_view/    # copy of the claim's agent_view (read-safe)
  artifacts/     # symlink -> artifacts/<claim_id>_<YYYYMMDD_HHMMSS>/

agent_view/ is a full copy, so the agent cannot modify the original claim files.
artifacts/ is a symlink to the persistent artifact directory; all writes go to the real location.
On completion (or error), any stray files in the sandbox are moved to the artifact directory and the /tmp sandbox is removed.
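
The sandbox lifecycle described above can be approximated with the standard library as follows. This is a sketch under stated assumptions, not the harness code: `create_sandbox` and `teardown_sandbox` are hypothetical names, and `tempfile.mkdtemp` uses the system temp directory, which is /tmp on most Linux systems but not guaranteed.

```python
import shutil
import tempfile
from pathlib import Path


def create_sandbox(claim_dir: Path, claim_id: str, artifact_dir: Path) -> Path:
    """Create the temp sandbox with a copied agent_view and a symlinked artifacts dir."""
    sandbox = Path(tempfile.mkdtemp(prefix=f"automat_{claim_id}_"))
    # Full copy: edits inside the sandbox never touch the original claim files.
    shutil.copytree(claim_dir / "agent_view", sandbox / "agent_view")
    artifact_dir.mkdir(parents=True, exist_ok=True)
    # Symlink: writes to sandbox/artifacts land in the persistent artifact directory.
    (sandbox / "artifacts").symlink_to(artifact_dir.resolve(), target_is_directory=True)
    return sandbox


def teardown_sandbox(sandbox: Path, artifact_dir: Path) -> None:
    """Move stray files into the artifact directory, then remove the sandbox."""
    for entry in sandbox.iterdir():
        if entry.name in ("agent_view", "artifacts"):
            continue
        shutil.move(str(entry), str(artifact_dir / entry.name))
    # rmtree unlinks the artifacts symlink itself and leaves its target intact.
    shutil.rmtree(sandbox)
```

Calling `teardown_sandbox` from a `finally` block would cover the "on completion (or error)" case.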

Runtime Outputs

AutoMat agent runs (`--agent automat`)

Primary outputs are created in:

artifacts/<claim_id>_<YYYYMMDD_HHMMSS>/

Common files include:

trace.log
plan.md
plan.json
setup_report.md
setup_result.json
execution_results.json
recovery_summary.json
phase_error_<phase>_artifact.json (when markdown artifact validation fails)
phase_error_<phase>.json / phase_error_<phase>_validation.json (legacy structured-output failures, e.g. DIAGNOSE)
analysis_report.md
analysis_result.json
results.json
evaluation result JSON files (e.g. repro_evaluation_results.json)

When recovery is used, round snapshots are also saved:

plan_round<N>.md
plan_round<N>.json
setup_report_round<N>.md
setup_result_round<N>.json
execution_results_round<N>.json
diagnosis_round<N>_<step>_attempt<M>.json
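
Because artifact directories end in a `YYYYMMDD_HHMMSS` timestamp, the most recent run for a claim can be located by parsing that suffix, for example to pass it to `--artifact-dir` when re-evaluating. The helper below is a sketch; `latest_artifact_dir` is a hypothetical name, and matching on the claim-id prefix (so that names with extra segments before the timestamp are also found) is an assumption.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

_STAMP = re.compile(r"_(\d{8}_\d{6})$")


def latest_artifact_dir(claim_id: str, root: str = "artifacts") -> Optional[Path]:
    """Return the newest artifact directory for claim_id, judged by its name's timestamp."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    best: Optional[tuple[datetime, Path]] = None
    for path in root_path.iterdir():
        if not path.is_dir() or not path.name.startswith(f"{claim_id}_"):
            continue
        match = _STAMP.search(path.name)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # digits present but not a valid date/time
        if best is None or stamp > best[0]:
            best = (stamp, path)
    return best[1] if best else None
```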

Optional run archive (`--archive`)

Harness-level archive path:

automat_runs/<claim_id>_<YYYYMMDD_HHMMSS>/

The archive contains a copy of the temp work dir, the run events, and the harness result JSON.

Useful Make Targets

make help
make install
make run-automat
