feat: QED Math Environment with MCP tools and LLM-judge rubric by rycerzes · Pull Request #865 · huggingface/OpenEnv

rycerzes · 2026-06-25T16:02:31Z

Summary

Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. It deeply integrates both MCP tools (the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the MCP + Rubrics features.

Supersedes #446 (closed). This PR is that branch rebased cleanly onto current main, plus a pass that aligns the per-rollout reward signal with QED-Nano's actual defaults (see Fidelity alignment below).

What's included

Environment (`qed_math_env`)

- `QEDMathEnvironment`: extends `MCPEnvironment`; manages problem lifecycle, dataset loading, and proof submission
- `MathProofRubric`: extends `openenv.core.rubrics.base.Rubric`; LLM-judge grading via OpenAI-compatible endpoints with `<score>N</score>` parsing, retry logic, and optional score thresholding
- `QEDMathEnv` client: extends `MCPToolClient` with typed `ProblemObservation` / `ProofSubmissionObservation` models
- 3 MCP tools: `get_problem`, `submit_proof`, `get_grading_guidelines`
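
As a rough illustration of the `<score>N</score>` parsing and score-to-reward normalization mentioned above, the sketch below assumes that the last well-formed tag wins and that a 0–7 score maps linearly onto [0, 1]. The names `parse_score` and `normalize_reward` are hypothetical and are not taken from the PR.

```python
import re
from typing import Optional

# Matches <score>N</score> where N is a single digit 0-7 (inner whitespace tolerated).
_SCORE_RE = re.compile(r"<score>\s*([0-7])\s*</score>")
MAX_SCORE = 7


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last valid score tag in the judge output, or None if absent."""
    matches = _SCORE_RE.findall(judge_output)
    return int(matches[-1]) if matches else None


def normalize_reward(score: int, max_score: int = MAX_SCORE) -> float:
    """Map an integer score in [0, max_score] to a reward in [0, 1] (assumed linear)."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} outside 0..{max_score}")
    return score / max_score
```

A `None` result from the parser is the kind of case the retry logic would presumably handle; how the PR actually reacts to an unparsable judge response is not shown here.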

Key features

- Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
- Answer-mode verification: process-pool `math_verify` `\boxed{}` checking (no LLM call needed)
- Reward shaping: discount factor (γ^tokens), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
- Reasoning stripping: configurable delimiters (e.g. `</think>`) removed before grading
- Multi-step problems: configurable max attempts with per-attempt feedback
- Verifier metrics: QED-Nano-compatible metrics (`verifier/rollouts/*`, `verifier/failures/*`, latency, token counts) in observation metadata
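
The reward-shaping bullet above can be sketched roughly as follows. Only the γ^tokens discount and the 1–5 → 1 collapse come from the description. The linear ramp inside the buffer zone, the additive way the penalty is combined, and every function signature here are assumptions for illustration, not the PR's implementation.

```python
from typing import Optional


def threshold_score(score: int) -> int:
    """Collapse partial-credit scores 1-5 to 1; 0, 6 and 7 are assumed unchanged."""
    return 1 if 1 <= score <= 5 else score


def length_penalty(tokens: int, max_tokens: int, buffer_tokens: int) -> float:
    """Assumed shape: 0 below the buffer zone, linear ramp to -1 at max_tokens."""
    start = max_tokens - buffer_tokens
    if tokens <= start:
        return 0.0
    if tokens >= max_tokens:
        return -1.0
    return -(tokens - start) / buffer_tokens


def shaped_reward(
    score: int,
    tokens: int,
    gamma: float = 1.0,
    max_tokens: Optional[int] = None,
    buffer_tokens: int = 0,
    use_threshold: bool = False,
) -> float:
    """Normalize a 0-7 score, discount by gamma**tokens, then add the length penalty."""
    if use_threshold:
        score = threshold_score(score)
    reward = (score / 7) * gamma**tokens
    if max_tokens is not None and buffer_tokens > 0:
        reward += length_penalty(tokens, max_tokens, buffer_tokens)
    return reward
```

With `gamma=1.0` and no `max_tokens`, the shaped reward reduces to the plain normalized score, which makes it easy to turn shaping off when comparing runs.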

Fidelity alignment with QED-Nano

This branch reconciles the port against the upstream CMU-AIRe/QED-Nano source (rollouts.py, conf/base.yaml, evaluator prompts):

- Evaluator prompts: ship v0/v1 alongside v2; default v1 (full 0–7 range, the QED-Nano default) instead of v2 (which constrains scores to {0,1,6,7}).
- Reasoning stripping now defaults to `["</think>"]`, so only the post-think answer is graded, matching upstream.
- Grader sampling params (`temperature=1.0`, optional `max_output_tokens`) are forwarded, and real provider token usage is captured (heuristic only as a fallback).
- Proof vs. answer routing is auto-detected from grading-schema presence (upstream's `if "schema" in problem`) when a row sets no explicit type.
- Answer-mode rewards are selectable via `answer_reward_preset` (`pure_success` default, `base` adds wrong/no_answer/unparsable penalties), keyed on verifier status.
- `is_correct` success threshold aligned to upstream `score == 7`.
- Stale refs fixed: source link meta-pytorch → CMU-AIRe; build refs → huggingface/OpenEnv.
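
To make the reasoning-stripping default concrete, here is a minimal sketch of what a function like `remove_reasoning` might do. The signature and the fallback (returning the text unchanged when no delimiter is found) are assumptions. The PR only states that the post-think answer is graded.

```python
from typing import Optional, Sequence


def remove_reasoning(
    text: str, delimiters: Optional[Sequence[str]] = ("</think>",)
) -> str:
    """Return only the text after the last occurrence of any delimiter.

    Assumed fallback: if no delimiter is configured or found, return the
    text unchanged so that the whole output is graded.
    """
    if not delimiters:
        return text
    cut = -1
    for delim in delimiters:
        idx = text.rfind(delim)
        if idx != -1:
            cut = max(cut, idx + len(delim))
    return text if cut == -1 else text[cut:].strip()
```

For example, `remove_reasoning("<think>scratch work</think> By induction on n.")` keeps only `"By induction on n."`.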

Out of scope (by design): QED-Nano's Reasoning-Cache (RC) training loop — the iterative summarize→refine cycle, separate summarization model, and stream batching — is training orchestration, not an environment concern (and would conflict with the "controls only for orchestration" + dual-API invariants). The env already provides what RC needs: grading against original_problem and reasoning stripping.

Testing

tests/envs/test_qed_math_environment.py — 142 unit tests pass (5 server-integration tests marked @pytest.mark.integration):

| Area | Coverage |
| --- | --- |
| MCP tools | `ListToolsAction` / `CallToolAction` for all 3 tools |
| Rubric grading | mocked `grade()`, score→reward normalization, retry/failure paths |
| Answer verification | correct/wrong/missing `\boxed{}` via `math_verify` service |
| Reward shaping | discount factor, length penalty, score thresholding |
| Reasoning stripping | `</think>` handling + fallback |
| Config defaults | v1 default, v0/v1 prompt loading, `</think>` default, grader sampling |
| Mode routing | proof/answer auto-detection from schema presence |
| Answer rewards | `pure_success` / `base` presets |
| Success threshold | `score == 7` semantics |
| Token usage | real-usage capture with heuristic fallback |
| Dataset loading | local JSONL/JSON, problem-ID selection, `\boxed{}` wrapping |
| Multi-step / original_problem | attempt tracking + RC-stream grading target |
| Verifier metrics | `GradingResult` + submission payload |

PYTHONPATH=src:envs uv run pytest tests/envs/test_qed_math_environment.py -v

Lint (ruff format / ruff check src tests), usort, and scripts/sync_env_docs.py --check all pass.

CC: @burtenshaw

Note

Medium Risk
Large additive training surface (LLM judge credentials, reward shaping, and verifier pool behavior) that directly affects rollout rewards, though it is isolated to a new env rather than core platform code.

Overview
Introduces a new qed_math_env OpenEnv package (QED-Nano port) for math proof/answer rollouts: agents use MCP tools get_problem, submit_proof, and get_grading_guidelines, with a FastAPI server and QEDMathEnv MCPToolClient.

Proof grading goes through MathProofRubric (OpenAI-compatible LLM judge, 0–7 <score> parsing, optional thresholding, v0/v1/v2 prompts defaulting to v1). Answer problems use a MathVerifierService process pool (math_verify, \boxed{} extraction) with queue backpressure, retries, worker restart, and a gold-answer cache.
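
The `\boxed{}` extraction step mentioned above has to cope with nested braces such as `\boxed{\frac{1}{2}}`, which a simple non-greedy regex gets wrong. The following brace-counting sketch is a hand-written illustration of that idea. The function name `extract_boxed` is hypothetical, and the PR itself delegates verification to `math_verify`.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, honoring nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are just inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as unparsable
```

Returning `None` for unbalanced input gives the caller a clear signal to report an unparsable answer instead of guessing.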

The server loads problems from local JSON/JSONL, Hugging Face Hub, or bootstrap data; normalizes field aliases; auto-routes proof vs answer from rubric/schema presence; supports multi-step attempts, </think> reasoning stripping, grading on original_problem, and QED-Nano-style reward shaping (discount + length penalty) using output_length_tokens injected by the training harness on the HTTP step body—not via MCP. submit_proof observations carry TrackIO-compatible verifier_metrics. step_async is overridden for long judge calls and the OpenEnv #455 event-loop issue.
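
The proof/answer auto-routing described above can be pictured with a short sketch. The `"schema"` key mirrors the upstream check quoted in the PR description. The `"type"` key and the function name `detect_mode` are assumptions made for illustration.

```python
from typing import Any, Mapping


def detect_mode(problem: Mapping[str, Any]) -> str:
    """Return "proof" or "answer" for a dataset row."""
    explicit = problem.get("type")  # assumed field name for an explicit type
    if explicit in ("proof", "answer"):
        return explicit
    # Mirrors the upstream routing quoted in the PR: if "schema" in problem
    return "proof" if "schema" in problem else "answer"
```

An explicit type takes precedence, so a row can still force answer mode even when it carries a grading schema.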

Packaging includes openenv.yaml, Docker/uv setup, and docs/source/environments/qed_math.md.

Reviewed by Cursor Bugbot for commit 225d439. Bugbot is set up for automated code reviews on this repo.


- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.


…Nano

Bring the per-rollout reward signal in line with QED-Nano:
- Default prompt_name v2 -> v1 (matches QED-Nano base.yaml; full 0-7 range).
- Default reasoning_delimiters None -> ["</think>"] so only the post-think answer is graded, as upstream does.
- Forward grader sampling params (grader_temperature=1.0, optional grader_max_output_tokens) to the Responses/Chat calls, and capture real token usage from the provider response (heuristic fallback only when absent).
- Auto-detect proof vs answer mode from grading-schema presence, mirroring upstream's `if "schema" in problem` routing, when no explicit type is set.
- Add answer_reward_preset (pure_success default, base adds wrong/no_answer/unparsable penalties) keyed on verifier status; infra errors stay neutral.
- Align is_correct success threshold to upstream score==7 (default 7).
- Fix stale meta-pytorch/OpenEnv issue link in a docstring.

Add tests for the new defaults and behavior: v1 default + v0/v1 prompt loading, default </think> reasoning stripping, proof/answer auto-detection from schema presence, answer reward presets (pure_success/base), success threshold == 7, grader sampling-param forwarding, and real token-usage capture.

Update the config tables and prose for the new defaults (v1 prompt, </think> stripping, grader sampling params, answer_reward_preset, success threshold 7, proof/answer auto-detection) and clarify token-usage metrics. Fix the broken QED-Nano source link (meta-pytorch -> CMU-AIRe) and add the generated docs/source/environments/qed_math.md stub.

burtenshaw

Thanks @rycerzes!

burtenshaw

A few blockers from a focused review:

- `envs/qed_math_env[dev]` is missing `pytest-asyncio`. With the env's own dev extra, the async tests are not runnable (48 failed, 94 passed, 5 errors); they only pass if I inject `--with pytest-asyncio`.
- Focused lint/format still fails on PR-owned files: unused imports in `qed_math_environment.py`, ruff-format diffs in several QED/example files, and trailing whitespace in prompt files.
- `openenv.yaml` says `prompt_name: v2`, but the runtime/tests/docs now say the default is v1.

Verified locally with the missing async plugin injected: PYTHONPATH=src:envs uv run --project envs/qed_math_env --extra dev --with pytest-asyncio pytest tests/envs/test_qed_math_environment.py -q -rs -m "not integration" -> 142 passed, 5 deselected.

rycerzes · 2026-06-30T11:07:38Z

Thanks @burtenshaw! Addressed in 3d38287:

- Added `pytest-asyncio` to dev extras
- Removed unused top-level imports
- Fixed `prompt_name: v2` → `v1` in `openenv.yaml`
- `numeric_precision` + `float_rounding` now forwarded to `math_verify.verify()` (were silently ignored)
- Fixed timeout conflation in `verify_answer`: the asyncio client timeout was being reused as the in-worker compute limit

burtenshaw

Re-checked latest 3d38287.

Previous blockers are fixed:

- `envs/qed_math_env[dev]` now installs `pytest-asyncio`.
- Focused `ruff check` passes.
- `openenv.yaml` now uses `prompt_name: v1`.
- Non-integration QED suite now passes from the env project: 142 passed, 5 deselected.

Still outstanding:

- `uv run ruff format --check ... examples/qed_math_inference.py` still wants to reformat the example.
- `git diff --check hf/main...HEAD` still reports PR-owned whitespace / blank-line issues, mostly in evaluator prompt files.
- I reproduced Cursor's current `step_async` reward/done issue: `submit_proof_payload` returns `done=True, reward=1.0`, but `await env.step_async(CallToolAction(...submit_proof...))` returns a top-level `CallToolObservation` with `reward=None, done=False`.
- I also reproduced Cursor's invalid `problem_id` issue: `reset(problem_id="definitely_missing")` silently selected `bootstrap_000001`.
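
For the invalid `problem_id` case, one way to avoid a silent fallback is to fail loudly when the requested ID is not in the loaded dataset. This is only a sketch under assumptions: the helper name `select_problem`, the `"problem_id"` field, the choice of `KeyError`, and the first-row default are all hypothetical and may differ from the actual fix.

```python
from typing import Any, Dict, List, Optional


def select_problem(
    problems: List[Dict[str, Any]], problem_id: Optional[str] = None
) -> Dict[str, Any]:
    """Pick a problem by ID, raising instead of falling back when the ID is unknown."""
    if not problems:
        raise ValueError("no problems loaded")
    if problem_id is None:
        return problems[0]  # assumed default when the caller requests nothing specific
    for problem in problems:
        if problem.get("problem_id") == problem_id:
            return problem
    raise KeyError(f"unknown problem_id: {problem_id!r}")
```

With this shape, `select_problem(problems, "definitely_missing")` raises instead of quietly returning `bootstrap_000001`.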

No new alignment/RFC blocker from my pass; MCP and rubric boundaries still match the OpenEnv invariants.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

cursor · 2026-07-01T12:59:41Z

+        result = env.call_tool("get_problem")
+        result = env.call_tool("submit_proof", proof="By induction...")
+    ```
+    """


Client recv timeout too short

High Severity

QEDMathEnv inherits the WebSocket client’s default message_timeout_s of 60 seconds while submit_proof grading on the server allows up to 600 seconds. Long LLM judge calls can finish on the server but the default client disconnects waiting for the response, breaking the documented quick-start flow.
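
The mismatch described above can be reproduced in miniature with plain `asyncio`. In this sketch, scaled-down delays stand in for the 60 s client timeout and the up-to-600 s grading window. `slow_grade` and `call_with_timeout` are hypothetical helpers, not part of the OpenEnv client.

```python
import asyncio


async def slow_grade(delay_s: float) -> str:
    """Stand-in for a long-running LLM-judge call on the server."""
    await asyncio.sleep(delay_s)
    return "graded"


async def call_with_timeout(delay_s: float, message_timeout_s: float) -> str:
    """Wait for the grade, giving up after the client-side message timeout."""
    try:
        return await asyncio.wait_for(slow_grade(delay_s), timeout=message_timeout_s)
    except asyncio.TimeoutError:
        return "client timed out"


if __name__ == "__main__":
    # Client limit shorter than the grading time: the client gives up first.
    print(asyncio.run(call_with_timeout(delay_s=0.05, message_timeout_s=0.01)))
    # Client limit at least as long as the grading time: the result arrives.
    print(asyncio.run(call_with_timeout(delay_s=0.01, message_timeout_s=0.5)))
```

Note that `asyncio.wait_for` cancels the awaited task on timeout, whereas in the reported scenario the server keeps grading after the client has disconnected. The point of the sketch is only that the client-side limit should be at least as large as the server-side one.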

cursor · 2026-07-01T12:59:41Z

+    def get_state_sync(self) -> State:
+        """Synchronous helper for code paths that do not use async/await."""
+        with self.sync() as client:
+            return client.state()


Sync state uses new session

Medium Severity

get_state_sync opens a separate .sync() client and reads state() from that new connection instead of the active session. On the WebSocket server each session owns its own environment, so the returned episode_id and step_count do not reflect the instance used for reset and submit_proof.
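
A toy model may help show why reading state through a fresh connection is misleading when each session owns its own environment. `ToyServer` and `ToySession` are hypothetical stand-ins. They do not reproduce the OpenEnv WebSocket server or client API.

```python
import itertools
from typing import Dict, Union


class ToySession:
    """One connection with its own environment state."""

    def __init__(self, episode_id: str) -> None:
        self.episode_id = episode_id
        self.step_count = 0

    def step(self) -> None:
        self.step_count += 1

    def state(self) -> Dict[str, Union[str, int]]:
        return {"episode_id": self.episode_id, "step_count": self.step_count}


class ToyServer:
    """Hands every new connection a fresh, independent session."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)

    def connect(self) -> ToySession:
        return ToySession(episode_id=f"episode-{next(self._ids)}")


server = ToyServer()
active = server.connect()
active.step()
active.step()

# Opening a second connection to "read state" sees a different episode.
fresh_state = server.connect().state()
active_state = active.state()
```

Here `active_state` reports two steps on `episode-1`, while `fresh_state` reports zero steps on `episode-2`, which is the discrepancy the review comment describes.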

rycerzes added 30 commits June 25, 2026 20:29

initial qed-math OpenEnv scaffold with models and client

c8795ed

- add QEDMathAction and QEDMathObservation dataclasses in models.py - implement QEDMathEnv client with reset, step, get_problem, submit_proof

implement MCP server tools and QEDMath environment

2c9d1b0

- main env class - mcp server tools - step & reset logic

rubric implementation

a9fc15f

- impl MathProofRubric - LLM Grading Logic - rubric config in env

data pipeline integration

d3b38c1

- map problem data structure - support multiple types of problems

grading prompts

15112a0

API & Client

98970f1

- wss handling - client methods

fix deps

dfb4ff3

fix eval prompt

58650c2

original_problem field for grading in QED-Nano RC stream

89609fe

add verifier metrics to grading results

8d66115

- implement metrics aggregation

docs

0601674

dockerfile

tests for QED Math Env

365935e

add QED Math inference example

2440eca

- fix dockerfile

improve async step handling

7c1b18b

- refer to huggingface#456

increase timeout for LLM judge calls and adjust WebSocket ping settings

603e051

add feedback output for incorrect answers in QED Math inference

fc1dbc8

refactor get_problem method to include metadata and adjust reference …

220d961

…solution handling - update tests for answer mode

refactor math expression parsing and simplify async calls in tests

56aacd6

impl process-based math verification service

0ef8d55

- integrate with QED Math environment

gold answer caching and request admission control

23ae83e

- tests

worker pool restart and health probe reporting

51d7091

preserve custom observation fields during parsing

4756ff2

- Fix MCP client parsing bug where reset/step observations were coerced into base Observation, dropping env-specific fields - tests

metrics tracking for MathVerifierService and QEDMathEnvironment

b0ede2d

update env docs

662bc05

refactor: remove output_length_tokens param from submit_proof and rel…

7914510

…ated methods

chore(lint): usort + ruff format

04348b7

feat(qed_math): add upstream v0/v1 evaluator prompts

5e2d4e2

Ship the QED-Nano v0 and v1 evaluator prompt templates verbatim alongside the existing v2. v1 is the QED-Nano default (full 0-7 score range); v2 constrains scores to {0,1,6,7}; v0 additionally uses the reference solution.

fix(qed_math): point build refs at canonical huggingface/OpenEnv

1c923bf

The repo moved from meta-pytorch/OpenEnv to huggingface/OpenEnv. Update the Docker base image (ghcr.io/huggingface/openenv-base), the openenv-core git dependency in pyproject.toml, and the matching uv.lock source entries.

rycerzes added 2 commits June 25, 2026 21:31

burtenshaw approved these changes Jun 29, 2026

Merge branch 'main' into feat/qed-math-env

f5509d6

cursor Bot reviewed Jun 29, 2026

Comment thread envs/qed_math_env/server/math_verify_service.py

Comment thread envs/qed_math_env/server/qed_math_environment.py

burtenshaw reviewed Jun 29, 2026

fix: add pytest-asyncio, forward precision params, collapse verify ti…

dc94a69

…meout

cursor Bot reviewed Jun 30, 2026

Comment thread envs/qed_math_env/server/qed_math_environment.py Outdated

Comment thread envs/qed_math_env/server/qed_math_environment.py

Comment thread envs/qed_math_env/server/math_verify_service.py

burtenshaw reviewed Jul 1, 2026

burtenshaw added the New Environment and size: extra-large labels Jul 1, 2026

fix: propagate reward/done in step_async, raise on unknown problem_id

835034c

cursor Bot reviewed Jul 1, 2026

Comment thread envs/qed_math_env/server/qed_math_environment.py

fix: release verifier pool on close, collapse no-op imports, fix docs…

1cdd1f1

…trings

rycerzes force-pushed the feat/qed-math-env branch from 03f32da to 1cdd1f1 on July 1, 2026 11:56

cursor Bot reviewed Jul 1, 2026

Comment thread envs/qed_math_env/server/qed_math_environment.py

fix: nested boxed regex, stale token count, queue metrics, dict gradi…

225d439

…ng_guidelines

cursor Bot reviewed Jul 1, 2026

rycerzes wants to merge 37 commits into huggingface:main from rycerzes:feat/qed-math-env
