feat: QED Math Environment with MCP tools and LLM-judge rubric #865

Open
rycerzes wants to merge 37 commits into huggingface:main from rycerzes:feat/qed-math-env

@rycerzes rycerzes commented Jun 25, 2026

Contributor

Summary

Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. It deeply integrates both MCP tools (the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the MCP + Rubrics features.

Supersedes #446 (closed). This PR is that branch rebased cleanly onto current main, plus a pass that aligns the per-rollout reward signal with QED-Nano's actual defaults (see Fidelity alignment below).

What's included

Environment (qed_math_env)

  • QEDMathEnvironment — extends MCPEnvironment; manages problem lifecycle, dataset loading, and proof submission
  • MathProofRubric — extends openenv.core.rubrics.base.Rubric; LLM-judge grading via OpenAI-compatible endpoints with <score>N</score> parsing, retry logic, and optional score thresholding
  • QEDMathEnv client — extends MCPToolClient with typed ProblemObservation / ProofSubmissionObservation models
  • 3 MCP tools: get_problem, submit_proof, get_grading_guidelines
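
The `<score>N</score>` parsing and score normalization described above could look roughly like the sketch below. It is not the PR's `MathProofRubric` code: the names `SCORE_RE`, `parse_score`, and `normalize` are hypothetical, and taking the last in-range tag when several appear is an assumption.

```python
import re
from typing import Optional

# Matches <score>N</score>, tolerating whitespace around the integer.
SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")

MAX_SCORE = 7  # the PR describes a 0-7 grading scale


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last in-range score tag, or None if none is usable.

    Assumption: a None result is what the caller would treat as a
    retryable grading failure.
    """
    for raw in reversed(SCORE_RE.findall(judge_output)):
        value = int(raw)
        if 0 <= value <= MAX_SCORE:
            return value
    return None


def normalize(score: int) -> float:
    """Map a 0-7 score onto a [0, 1] reward."""
    return score / MAX_SCORE
```

Returning `None` rather than raising keeps the retry decision with the caller, which seems to fit the retry/failure paths the test table mentions.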

Key features

  • Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
  • Answer-mode verification: process-pool math_verify \boxed{} checking (no LLM call needed)
  • Reward shaping: discount factor (γ^tokens), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
  • Reasoning stripping: configurable delimiters (e.g. </think>) removed before grading
  • Multi-step problems: configurable max attempts with per-attempt feedback
  • Verifier metrics: QED-Nano-compatible metrics (verifier/rollouts/*, verifier/failures/*, latency, token counts) in observation metadata
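
As a rough illustration of the reward-shaping bullet, the sketch below combines the three pieces it names. Only the thresholding rule (1–5 collapses to 1) and the γ^tokens discount come from the description; the linear ramp inside the buffer zone, the way the penalty is subtracted, and every parameter name and default are assumptions, not the env's actual formula.

```python
def threshold_score(score: int) -> int:
    """Collapse mid-range scores 1-5 to 1; 0, 6 and 7 pass through."""
    return 1 if 1 <= score <= 5 else score


def length_penalty(tokens: int, max_tokens: int, buffer_tokens: int) -> float:
    """Assumed shape: 0 before the buffer zone, linear ramp up to 1 at max_tokens."""
    if buffer_tokens <= 0:
        raise ValueError("buffer_tokens must be positive")
    start = max_tokens - buffer_tokens
    if tokens <= start:
        return 0.0
    return min(1.0, (tokens - start) / buffer_tokens)


def shaped_reward(
    score: int,
    tokens: int,
    *,
    gamma: float = 1.0,         # hypothetical default: no discount
    max_tokens: int = 1000,     # hypothetical limit
    buffer_tokens: int = 200,   # hypothetical buffer width
    use_threshold: bool = False,
) -> float:
    """Normalize a 0-7 score, discount by gamma**tokens, subtract the penalty."""
    if use_threshold:
        score = threshold_score(score)
    base = score / 7
    return base * gamma**tokens - length_penalty(tokens, max_tokens, buffer_tokens)
```

With these assumed defaults, a perfect proof of 900 tokens sits halfway into the buffer zone and would lose half its reward.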

Fidelity alignment with QED-Nano

This branch reconciles the port against the upstream CMU-AIRe/QED-Nano source (rollouts.py, conf/base.yaml, evaluator prompts):

  • Evaluator prompts: ship v0/v1 alongside v2; default v1 (full 0–7 range, the QED-Nano default) instead of v2 (which constrains scores to {0,1,6,7}).
  • Reasoning stripping now defaults to ["</think>"], so only the post-think answer is graded — matching upstream.
  • Grader sampling params (temperature=1.0, optional max_output_tokens) are forwarded, and real provider token usage is captured (heuristic only as a fallback).
  • Proof vs. answer routing is auto-detected from grading-schema presence (upstream's if "schema" in problem) when a row sets no explicit type.
  • Answer-mode rewards are selectable via answer_reward_preset (pure_success default, base adds wrong/no_answer/unparsable penalties), keyed on verifier status.
  • is_correct success threshold aligned to upstream score == 7.
  • Stale refs fixed: source link meta-pytorch → CMU-AIRe; build refs → huggingface/OpenEnv.
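
Two of the items above, reasoning stripping and proof/answer routing, are simple enough to sketch. The function name `remove_reasoning` appears in the commit list, but the body here is a guess: returning the text unchanged when no delimiter is found, the `type` field name, and `detect_mode` itself are assumptions for illustration.

```python
from typing import Any, Mapping, Optional, Sequence


def remove_reasoning(
    text: str, delimiters: Optional[Sequence[str]] = ("</think>",)
) -> str:
    """Keep only what follows the last occurrence of the first matching delimiter.

    Assumed fallback: if no delimiter is present, the text is returned as-is.
    """
    if not delimiters:
        return text
    for delimiter in delimiters:
        if delimiter in text:
            return text.rsplit(delimiter, 1)[1].strip()
    return text


def detect_mode(problem: Mapping[str, Any]) -> str:
    """Honor an explicit type; otherwise route on grading-schema presence."""
    explicit = problem.get("type")  # hypothetical field name
    if explicit:
        return explicit
    # Mirrors the upstream check quoted above: `if "schema" in problem`.
    return "proof" if "schema" in problem else "answer"
```

For example, `remove_reasoning("<think>scratch work</think>Final proof.")` yields only the text after the closing tag, so the judge never sees the scratch work.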

Out of scope (by design): QED-Nano's Reasoning-Cache (RC) training loop — the iterative summarize→refine cycle, separate summarization model, and stream batching — is training orchestration, not an environment concern (and would conflict with the "controls only for orchestration" + dual-API invariants). The env already provides what RC needs: grading against original_problem and reasoning stripping.

Testing

tests/envs/test_qed_math_environment.py: 142 unit tests pass (5 server-integration tests marked @pytest.mark.integration):

| Area | Coverage |
| --- | --- |
| MCP tools | ListToolsAction / CallToolAction for all 3 tools |
| Rubric grading | mocked grade(), score→reward normalization, retry/failure paths |
| Answer verification | correct/wrong/missing \boxed{} via math_verify service |
| Reward shaping | discount factor, length penalty, score thresholding |
| Reasoning stripping | </think> handling + fallback |
| Config defaults | v1 default, v0/v1 prompt loading, </think> default, grader sampling |
| Mode routing | proof/answer auto-detection from schema presence |
| Answer rewards | pure_success / base presets |
| Success threshold | score == 7 semantics |
| Token usage | real-usage capture with heuristic fallback |
| Dataset loading | local JSONL/JSON, problem-ID selection, \boxed{} wrapping |
| Multi-step / original_problem | attempt tracking + RC-stream grading target |
| Verifier metrics | GradingResult + submission payload |
PYTHONPATH=src:envs uv run pytest tests/envs/test_qed_math_environment.py -v

Lint (ruff format / ruff check src tests), usort, and scripts/sync_env_docs.py --check all pass.


CC: @burtenshaw


Note

Medium Risk
Large additive training surface (LLM judge credentials, reward shaping, and verifier pool behavior) that directly affects rollout rewards, though it is isolated to a new env rather than core platform code.

Overview
Introduces a new qed_math_env OpenEnv package (QED-Nano port) for math proof/answer rollouts: agents use MCP tools get_problem, submit_proof, and get_grading_guidelines, with a FastAPI server and QEDMathEnv MCPToolClient.

Proof grading goes through MathProofRubric (OpenAI-compatible LLM judge, 0–7 <score> parsing, optional thresholding, v0/v1/v2 prompts defaulting to v1). Answer problems use a MathVerifierService process pool (math_verify, \boxed{} extraction) with queue backpressure, retries, worker restart, and a gold-answer cache.

The server loads problems from local JSON/JSONL, Hugging Face Hub, or bootstrap data; normalizes field aliases; auto-routes proof vs answer from rubric/schema presence; supports multi-step attempts, </think> reasoning stripping, grading on original_problem, and QED-Nano-style reward shaping (discount + length penalty) using output_length_tokens injected by the training harness on the HTTP step body—not via MCP. submit_proof observations carry TrackIO-compatible verifier_metrics. step_async is overridden for long judge calls and the OpenEnv #455 event-loop issue.

Packaging includes openenv.yaml, Docker/uv setup, and docs/source/environments/qed_math.md.

Reviewed by Cursor Bugbot for commit 225d439. Bugbot is set up for automated code reviews on this repo.

rycerzes added 30 commits June 25, 2026 20:29
- add QEDMathAction and QEDMathObservation dataclasses in models.py
- implement QEDMathEnv client with reset, step, get_problem,
submit_proof
- main env class
- mcp server tools
- step & reset logic
- impl MathProofRubric
- LLM Grading Logic
- rubric config in env
- map problem data structure
- support multiple types of problems
- wss handling
- client methods
- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.
- implement metrics aggregation
dockerfile
…solution handling

- update tests for answer mode
- integrate with QED Math environment
- Fix MCP client parsing bug where reset/step observations were coerced
into base Observation, dropping env-specific fields
- tests
Ship the QED-Nano v0 and v1 evaluator prompt templates verbatim alongside
the existing v2. v1 is the QED-Nano default (full 0-7 score range); v2
constrains scores to {0,1,6,7}; v0 additionally uses the reference solution.
…Nano

Bring the per-rollout reward signal in line with QED-Nano:

- Default prompt_name v2 -> v1 (matches QED-Nano base.yaml; full 0-7 range).
- Default reasoning_delimiters None -> ["</think>"] so only the post-think
  answer is graded, as upstream does.
- Forward grader sampling params (grader_temperature=1.0, optional
  grader_max_output_tokens) to the Responses/Chat calls, and capture real
  token usage from the provider response (heuristic fallback only when absent).
- Auto-detect proof vs answer mode from grading-schema presence, mirroring
  upstream's `if "schema" in problem` routing, when no explicit type is set.
- Add answer_reward_preset (pure_success default, base adds wrong/no_answer/
  unparsable penalties) keyed on verifier status; infra errors stay neutral.
- Align is_correct success threshold to upstream score==7 (default 7).
- Fix stale meta-pytorch/OpenEnv issue link in a docstring.
The repo moved from meta-pytorch/OpenEnv to huggingface/OpenEnv. Update the
Docker base image (ghcr.io/huggingface/openenv-base), the openenv-core git
dependency in pyproject.toml, and the matching uv.lock source entries.
rycerzes added 2 commits June 25, 2026 21:31
Add tests for the new defaults and behavior: v1 default + v0/v1 prompt
loading, default </think> reasoning stripping, proof/answer auto-detection
from schema presence, answer reward presets (pure_success/base), success
threshold == 7, grader sampling-param forwarding, and real token-usage
capture.
Update the config tables and prose for the new defaults (v1 prompt, </think>
stripping, grader sampling params, answer_reward_preset, success threshold 7,
proof/answer auto-detection) and clarify token-usage metrics. Fix the broken
QED-Nano source link (meta-pytorch -> CMU-AIRe) and add the generated
docs/source/environments/qed_math.md stub.

@burtenshaw burtenshaw left a comment

Collaborator

Thanks @rycerzes !

Comment thread envs/qed_math_env/server/math_verify_service.py
Comment thread envs/qed_math_env/server/qed_math_environment.py

@burtenshaw burtenshaw left a comment

Collaborator

A few blockers from a focused review:

  • envs/qed_math_env[dev] is missing pytest-asyncio. With the env's own dev extra, the async tests are not runnable (48 failed, 94 passed, 5 errors); they only pass if I inject --with pytest-asyncio.
  • Focused lint/format still fails on PR-owned files: unused imports in qed_math_environment.py, ruff-format diffs in several QED/example files, and trailing whitespace in prompt files.
  • openenv.yaml says prompt_name: v2, but the runtime/tests/docs now say the default is v1.

Verified locally with the missing async plugin injected: PYTHONPATH=src:envs uv run --project envs/qed_math_env --extra dev --with pytest-asyncio pytest tests/envs/test_qed_math_environment.py -q -rs -m "not integration" -> 142 passed, 5 deselected.

Comment thread envs/qed_math_env/server/qed_math_environment.py Outdated
Comment thread envs/qed_math_env/server/qed_math_environment.py
Comment thread envs/qed_math_env/server/math_verify_service.py
@rycerzes

Contributor Author

Thanks @burtenshaw! Addressed in 3d38287:

  • Added pytest-asyncio to dev extras
  • Removed unused top-level imports
  • Fixed prompt_name: v2 → v1 in openenv.yaml
  • numeric_precision + float_rounding now forwarded to math_verify.verify() (were silently ignored)
  • Fixed timeout conflation in verify_answer — asyncio client timeout was being reused as the in-worker compute limit

@burtenshaw burtenshaw left a comment

Collaborator

Re-checked latest 3d38287.

Previous blockers are fixed:

  • envs/qed_math_env[dev] now installs pytest-asyncio.
  • focused ruff check passes.
  • openenv.yaml now uses prompt_name: v1.
  • non-integration QED suite now passes from the env project: 142 passed, 5 deselected.

Still outstanding:

  • uv run ruff format --check ... examples/qed_math_inference.py still wants to reformat the example.
  • git diff --check hf/main...HEAD still reports PR-owned whitespace / blank-line issues, mostly in evaluator prompt files.
  • I reproduced Cursor's current step_async reward/done issue: submit_proof_payload returns done=True, reward=1.0, but await env.step_async(CallToolAction(...submit_proof...)) returns a top-level CallToolObservation with reward=None, done=False.
  • I also reproduced Cursor's invalid problem_id issue: reset(problem_id="definitely_missing") silently selected bootstrap_000001.
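
The last two outstanding items describe behaviors that could be fixed along the lines of the sketch below. Everything here is hypothetical: `select_problem`, `UnknownProblemError`, `ToolObservation`, and `lift_reward_and_done` are stand-ins, not names from the PR or from OpenEnv, and the real `step_async` fix may look quite different.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


class UnknownProblemError(KeyError):
    """Raised when a requested problem_id is not in the loaded dataset."""


def select_problem(
    problems: Mapping[str, Dict[str, Any]], problem_id: Optional[str]
) -> Dict[str, Any]:
    """Fail loudly on an unknown id instead of silently picking a default."""
    if problem_id is None:
        return next(iter(problems.values()))  # no id given: default problem
    if problem_id not in problems:
        raise UnknownProblemError(problem_id)
    return problems[problem_id]


@dataclass
class ToolObservation:
    """Toy stand-in for a tool-call observation wrapping a result payload."""

    result: Dict[str, Any]
    reward: Optional[float] = None
    done: bool = False


def lift_reward_and_done(obs: ToolObservation) -> ToolObservation:
    """Copy reward/done from the tool payload onto the top-level observation."""
    return ToolObservation(
        result=obs.result,
        reward=obs.result.get("reward", obs.reward),
        done=bool(obs.result.get("done", obs.done)),
    )
```

Under this sketch, `reset(problem_id="definitely_missing")` would surface an error to the caller, and a `submit_proof` payload carrying `done=True, reward=1.0` would be reflected on the returned observation.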

No new alignment/RFC blocker from my pass; MCP and rubric boundaries still match the OpenEnv invariants.

@burtenshaw burtenshaw added the New Environment and size: extra-large labels Jul 1, 2026
Comment thread envs/qed_math_env/server/qed_math_environment.py
@rycerzes rycerzes force-pushed the feat/qed-math-env branch from 03f32da to 1cdd1f1 July 1, 2026 11:56
Comment thread envs/qed_math_env/server/qed_math_environment.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

result = env.call_tool("get_problem")
result = env.call_tool("submit_proof", proof="By induction...")
```
"""

Client recv timeout too short

High Severity

QEDMathEnv inherits the WebSocket client’s default message_timeout_s of 60 seconds while submit_proof grading on the server allows up to 600 seconds. Long LLM judge calls can finish on the server but the default client disconnects waiting for the response, breaking the documented quick-start flow.
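
The failure mode can be reproduced in miniature with plain `asyncio`, independent of the actual client. In this sketch the durations are scaled down from the 60 s / 600 s figures, and `slow_grade` and `call_with_timeout` are hypothetical names. One difference from the real case: `asyncio.wait_for` cancels the local task, whereas a remote server would keep grading after the client gives up.

```python
import asyncio


async def slow_grade(duration_s: float) -> str:
    """Stand-in for a long LLM-judge call on the server."""
    await asyncio.sleep(duration_s)
    return "graded"


async def call_with_timeout(duration_s: float, timeout_s: float) -> str:
    """Wait for the grade, giving up once the client-side timeout elapses."""
    try:
        return await asyncio.wait_for(slow_grade(duration_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "client timed out"


if __name__ == "__main__":
    # Client timeout shorter than the grading time: the result is lost.
    print(asyncio.run(call_with_timeout(duration_s=0.2, timeout_s=0.02)))
    # Client timeout longer than the grading time: the result arrives.
    print(asyncio.run(call_with_timeout(duration_s=0.01, timeout_s=1.0)))
```

The takeaway is that the client-side wait needs to be at least as long as the longest grading call the server permits.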

def get_state_sync(self) -> State:
    """Synchronous helper for code paths that do not use async/await."""
    with self.sync() as client:
        return client.state()

Sync state uses new session

Medium Severity

get_state_sync opens a separate .sync() client and reads state() from that new connection instead of the active session. On the WebSocket server each session owns its own environment, so the returned episode_id and step_count do not reflect the instance used for reset and submit_proof.
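
A toy model makes the per-session behavior concrete. None of these classes exist in the PR; `ToyServer`, `ToyEnv`, and `ToyClient` only mimic the "one environment per connection" rule described above.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class ToyEnv:
    """Per-session environment state."""

    episode_id: int
    step_count: int = 0

    def step(self) -> None:
        self.step_count += 1


class ToyServer:
    """Every connection gets its own fresh environment."""

    def __init__(self) -> None:
        self._ids = count(1)

    def connect(self) -> ToyEnv:
        return ToyEnv(episode_id=next(self._ids))


class ToyClient:
    def __init__(self, server: ToyServer) -> None:
        self._server = server
        self._session = server.connect()  # the active session

    def step(self) -> None:
        self._session.step()

    def state_from_new_session(self) -> ToyEnv:
        # Mirrors the reported behavior: a new connection, hence a new env.
        return self._server.connect()

    def state_from_active_session(self) -> ToyEnv:
        # Reads state from the session that actually ran the episode.
        return self._session
```

After two steps, `state_from_active_session()` reports `step_count == 2`, while `state_from_new_session()` reports a different `episode_id` and `step_count == 0`.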
