feat: QED Math Environment with MCP tools and LLM-judge rubric #865
rycerzes wants to merge 37 commits into
- add `QEDMathAction` and `QEDMathObservation` dataclasses in `models.py`
- implement `QEDMathEnv` client with `reset`, `step`, `get_problem`, `submit_proof`
- main env class
- mcp server tools
- step & reset logic
- impl `MathProofRubric`
- LLM Grading Logic
- rubric config in env
- map problem data structure
- support multiple types of problems
- wss handling
- client methods
- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.
- implement metrics aggregation
- fix dockerfile
- refer to huggingface#456
…solution handling - update tests for answer mode
- integrate with QED Math environment
- Fix MCP client parsing bug where reset/step observations were coerced into base `Observation`, dropping env-specific fields
- tests
Ship the QED-Nano v0 and v1 evaluator prompt templates verbatim alongside
the existing v2. v1 is the QED-Nano default (full 0-7 score range); v2
constrains scores to {0,1,6,7}; v0 additionally uses the reference solution.
…Nano
Bring the per-rollout reward signal in line with QED-Nano:
- Default `prompt_name` v2 -> v1 (matches QED-Nano `base.yaml`; full 0-7 range).
- Default `reasoning_delimiters` None -> `["</think>"]` so only the post-think answer is graded, as upstream does.
- Forward grader sampling params (`grader_temperature=1.0`, optional `grader_max_output_tokens`) to the Responses/Chat calls, and capture real token usage from the provider response (heuristic fallback only when absent).
- Auto-detect proof vs answer mode from grading-schema presence, mirroring upstream's `if "schema" in problem` routing, when no explicit type is set.
- Add `answer_reward_preset` (`pure_success` default, `base` adds wrong/no_answer/unparsable penalties) keyed on verifier status; infra errors stay neutral.
- Align `is_correct` success threshold to upstream `score==7` (default 7).
- Fix stale meta-pytorch/OpenEnv issue link in a docstring.
The repo moved from meta-pytorch/OpenEnv to huggingface/OpenEnv. Update the Docker base image (ghcr.io/huggingface/openenv-base), the openenv-core git dependency in pyproject.toml, and the matching uv.lock source entries.
Add tests for the new defaults and behavior: v1 default + v0/v1 prompt loading, default </think> reasoning stripping, proof/answer auto-detection from schema presence, answer reward presets (pure_success/base), success threshold == 7, grader sampling-param forwarding, and real token-usage capture.
Update the config tables and prose for the new defaults (v1 prompt, </think> stripping, grader sampling params, answer_reward_preset, success threshold 7, proof/answer auto-detection) and clarify token-usage metrics. Fix the broken QED-Nano source link (meta-pytorch -> CMU-AIRe) and add the generated docs/source/environments/qed_math.md stub.
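A minimal sketch of the reasoning stripping and reward shaping these commits describe. The names `remove_reasoning` and `length_penalty` come from the commit message; their signatures, the `gamma`, `max_tokens`, and `buffer_tokens` parameters, the linear ramp inside the buffer zone, and the clamp at zero are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Sequence


def remove_reasoning(text: str, delimiters: Sequence[str] = ("</think>",)) -> str:
    """Keep only the text after the last reasoning delimiter, if one is present."""
    for delim in delimiters:
        if delim in text:
            text = text.rsplit(delim, 1)[1]
    return text.strip()


def length_penalty(num_tokens: int, max_tokens: int, buffer_tokens: int) -> float:
    """Assumed shape: 0 below the buffer zone, rising linearly to 1 at max_tokens."""
    start = max_tokens - buffer_tokens
    if num_tokens <= start:
        return 0.0
    if num_tokens >= max_tokens:
        return 1.0
    return (num_tokens - start) / buffer_tokens


def shaped_reward(base: float, num_tokens: int, gamma: float = 1.0,
                  max_tokens: int = 1000, buffer_tokens: int = 200) -> float:
    """Apply a per-token discount (gamma ** tokens), then subtract the length penalty."""
    discounted = base * (gamma ** num_tokens)
    penalty = length_penalty(num_tokens, max_tokens, buffer_tokens)
    # Assumption: the shaped reward is clamped so it never goes below zero.
    return max(0.0, discounted - penalty)
```

With these assumed defaults, a fully correct 900-token answer sits halfway through the buffer zone, so its reward is halved.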
burtenshaw
left a comment
A few blockers from a focused review:
- `envs/qed_math_env[dev]` is missing `pytest-asyncio`. With the env's own dev extra, the async tests are not runnable (48 failed, 94 passed, 5 errors); they only pass if I inject `--with pytest-asyncio`.
- Focused lint/format still fails on PR-owned files: unused imports in `qed_math_environment.py`, ruff-format diffs in several QED/example files, and trailing whitespace in prompt files.
- `openenv.yaml` says `prompt_name: v2`, but the runtime/tests/docs now say the default is `v1`.
Verified locally with the missing async plugin injected: `PYTHONPATH=src:envs uv run --project envs/qed_math_env --extra dev --with pytest-asyncio pytest tests/envs/test_qed_math_environment.py -q -rs -m "not integration"` -> 142 passed, 5 deselected.
Thanks @burtenshaw! Addressed in 3d38287:
burtenshaw
left a comment
Re-checked latest 3d38287.
Previous blockers are fixed:
- `envs/qed_math_env[dev]` now installs `pytest-asyncio`.
- focused `ruff check` passes.
- `openenv.yaml` now uses `prompt_name: v1`.
- non-integration QED suite now passes from the env project: `142 passed, 5 deselected`.
Still outstanding:
- `uv run ruff format --check ... examples/qed_math_inference.py` still wants to reformat the example.
- `git diff --check hf/main...HEAD` still reports PR-owned whitespace / blank-line issues, mostly in evaluator prompt files.
- I reproduced Cursor's current `step_async` reward/done issue: `submit_proof_payload` returns `done=True, reward=1.0`, but `await env.step_async(CallToolAction(...submit_proof...))` returns a top-level `CallToolObservation` with `reward=None, done=False`.
- I also reproduced Cursor's invalid `problem_id` issue: `reset(problem_id="definitely_missing")` silently selected `bootstrap_000001`.
No new alignment/RFC blocker from my pass; MCP and rubric boundaries still match the OpenEnv invariants.
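One way to avoid the silent fallback on an unknown `problem_id` is to fail loudly instead. The sketch below is illustrative only: `select_problem`, the dict-based problem store, and the choice to raise `KeyError` are assumptions, not the PR's code.

```python
import random
from typing import Mapping, Optional


def select_problem(problems: Mapping[str, dict], problem_id: Optional[str] = None,
                   rng: Optional[random.Random] = None) -> dict:
    """Return the requested problem, or a random one only when no ID is given."""
    if not problems:
        raise ValueError("no problems loaded")
    if problem_id is None:
        rng = rng or random.Random()
        return problems[rng.choice(sorted(problems))]
    if problem_id not in problems:
        # Surface the mistake instead of silently substituting another problem.
        raise KeyError(f"unknown problem_id: {problem_id!r}")
    return problems[problem_id]
```

Random selection stays available for callers that pass no ID, so only the mistyped-ID path changes behavior.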
03f32da to 1cdd1f1
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 225d439.
    result = env.call_tool("get_problem")
    result = env.call_tool("submit_proof", proof="By induction...")
    ```
    """
Client recv timeout too short
High Severity
QEDMathEnv inherits the WebSocket client’s default message_timeout_s of 60 seconds while submit_proof grading on the server allows up to 600 seconds. Long LLM judge calls can finish on the server but the default client disconnects waiting for the response, breaking the documented quick-start flow.
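The mismatch can be reproduced in miniature with plain `asyncio`. This sketch does not use the real client: `fake_grade` stands in for the server-side judge call, and the durations are scaled down, so only the relationship between a short client timeout and a longer grading time is modeled.

```python
import asyncio


async def fake_grade(duration_s: float) -> str:
    """Stand-in for a slow server-side LLM judge call."""
    await asyncio.sleep(duration_s)
    return "graded"


async def call_with_timeout(duration_s: float, client_timeout_s: float) -> str:
    """Wait for the grade, giving up when the client-side timeout expires first."""
    try:
        return await asyncio.wait_for(fake_grade(duration_s), timeout=client_timeout_s)
    except asyncio.TimeoutError:
        return "client timed out"


def demo() -> tuple:
    # Scaled down: 0.05 s of grading against a 0.01 s and a 0.5 s client timeout.
    short = asyncio.run(call_with_timeout(0.05, 0.01))
    long = asyncio.run(call_with_timeout(0.05, 0.5))
    return short, long
```

A client-side timeout at least as long as the server's grading budget avoids the first outcome.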
    def get_state_sync(self) -> State:
        """Synchronous helper for code paths that do not use async/await."""
        with self.sync() as client:
            return client.state()
Sync state uses new session
Medium Severity
get_state_sync opens a separate .sync() client and reads state() from that new connection instead of the active session. On the WebSocket server each session owns its own environment, so the returned episode_id and step_count do not reflect the instance used for reset and submit_proof.
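A toy model of the per-session behavior makes the problem concrete. `ToyServer`, `ToySession`, and `SessionState` are hypothetical stand-ins, not OpenEnv classes; the only property borrowed from the report is that each connection owns its own environment state.

```python
import itertools
from dataclasses import dataclass


@dataclass
class SessionState:
    episode_id: int
    step_count: int = 0


class ToySession:
    """One connection, holding the state of its own environment instance."""

    def __init__(self, state: SessionState) -> None:
        self._state = state

    def step(self) -> None:
        self._state.step_count += 1

    def state(self) -> SessionState:
        return self._state


class ToyServer:
    """Hypothetical server that creates a fresh environment per connection."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)

    def connect(self) -> ToySession:
        return ToySession(SessionState(episode_id=next(self._ids)))


server = ToyServer()
active = server.connect()
active.step()
active.step()

# Pattern from the report: a new connection sees a new session, not the active one.
fresh_view = server.connect().state()
# Reading through the session that did the work returns the expected counters.
active_view = active.state()
```

Here `active_view.step_count` is 2 while `fresh_view.step_count` is 0, which mirrors the stale `episode_id` and `step_count` described above.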


Summary
Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. It deeply integrates both MCP tools (the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the MCP + Rubrics features.
What's included
Environment (`qed_math_env`)
- `QEDMathEnvironment`: extends `MCPEnvironment`; manages problem lifecycle, dataset loading, and proof submission
- `MathProofRubric`: extends `openenv.core.rubrics.base.Rubric`; LLM-judge grading via OpenAI-compatible endpoints with `<score>N</score>` parsing, retry logic, and optional score thresholding
- `QEDMathEnv` client: extends `MCPToolClient` with typed `ProblemObservation` / `ProofSubmissionObservation` models
- MCP tools: `get_problem`, `submit_proof`, `get_grading_guidelines`
Key features
- `math_verify` `\boxed{}` checking (no LLM call needed)
- … `γ^tokens`), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
- … `</think>`) removed before grading
- … `verifier/rollouts/*`, `verifier/failures/*`, latency, token counts) in observation metadata
Fidelity alignment with QED-Nano
This branch reconciles the port against the upstream `CMU-AIRe/QED-Nano` source (`rollouts.py`, `conf/base.yaml`, evaluator prompts):
- … `v0`/`v1` alongside `v2`; default `v1` (full 0–7 range, the QED-Nano default) instead of `v2` (which constrains scores to `{0,1,6,7}`).
- … `["</think>"]`, so only the post-think answer is graded, matching upstream.
- … `temperature=1.0`, optional `max_output_tokens`) are forwarded, and real provider token usage is captured (heuristic only as a fallback).
- … `if "schema" in problem`) when a row sets no explicit type.
- `answer_reward_preset` (`pure_success` default, `base` adds wrong/no_answer/unparsable penalties), keyed on verifier status.
- `is_correct` success threshold aligned to upstream `score == 7`.
- … `meta-pytorch` → `CMU-AIRe`; build refs → `huggingface/OpenEnv`.
Testing
`tests/envs/test_qed_math_environment.py`: 142 unit tests pass (5 server-integration tests marked `@pytest.mark.integration`):
- `ListToolsAction` / `CallToolAction` for all 3 tools
- `grade()`, score→reward normalization, retry/failure paths
- `\boxed{}` via `math_verify` service
- `</think>` handling + fallback
- `</think>` default, grader sampling
- `pure_success` / `base` presets
- `score == 7` semantics
- `\boxed{}` wrapping
- `GradingResult` + submission payload
Lint (`ruff format` / `ruff check src tests`), `usort`, and `scripts/sync_env_docs.py --check` all pass.
CC: @burtenshaw
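To make the grading pipeline in this description concrete, here is a hedged sketch of `<score>N</score>` parsing, the optional thresholding (1–5 collapsed to 1), normalization of the 0–7 score to a reward, and the `score == 7` success check. All function names, the regular expression, and the choice to take the last in-range match are assumptions; `MathProofRubric` may differ on each of these points.

```python
import re
from typing import Optional

SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")


def parse_score(judge_output: str, max_score: int = 7) -> Optional[int]:
    """Return the last in-range <score>N</score> value, or None if none is found."""
    valid = [int(m) for m in SCORE_RE.findall(judge_output) if int(m) <= max_score]
    return valid[-1] if valid else None


def threshold_score(score: int) -> int:
    """Collapse partial-credit scores 1-5 to 1; leave 0, 6, and 7 unchanged."""
    return 1 if 1 <= score <= 5 else score


def to_reward(score: int, max_score: int = 7) -> float:
    """Normalize a 0-7 score to a reward in [0, 1]."""
    return score / max_score


def is_correct(score: int, success_threshold: int = 7) -> bool:
    """With the default threshold of 7 on a 0-7 scale, this is score == 7."""
    return score >= success_threshold
```

A judge reply ending in `<score>4</score>` would parse to 4, become 1 under thresholding, and not count as correct.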
Note
Medium Risk
Large additive training surface (LLM judge credentials, reward shaping, and verifier pool behavior) that directly affects rollout rewards, though it is isolated to a new env rather than core platform code.
Overview
Introduces a new `qed_math_env` OpenEnv package (QED-Nano port) for math proof/answer rollouts: agents use MCP tools `get_problem`, `submit_proof`, and `get_grading_guidelines`, with a FastAPI server and `QEDMathEnv` `MCPToolClient`.
Proof grading goes through `MathProofRubric` (OpenAI-compatible LLM judge, 0–7 `<score>` parsing, optional thresholding, v0/v1/v2 prompts defaulting to v1). Answer problems use a `MathVerifierService` process pool (`math_verify`, `\boxed{}` extraction) with queue backpressure, retries, worker restart, and a gold-answer cache.
The server loads problems from local JSON/JSONL, Hugging Face Hub, or bootstrap data; normalizes field aliases; auto-routes proof vs answer from rubric/schema presence; supports multi-step attempts, `</think>` reasoning stripping, grading on `original_problem`, and QED-Nano-style reward shaping (discount + length penalty) using `output_length_tokens` injected by the training harness on the HTTP step body, not via MCP. `submit_proof` observations carry TrackIO-compatible `verifier_metrics`. `step_async` is overridden for long judge calls and the OpenEnv #455 event-loop issue.
Packaging includes `openenv.yaml`, Docker/uv setup, and `docs/source/environments/qed_math.md`.
Reviewed by Cursor Bugbot for commit 225d439. Bugbot is set up for automated code reviews on this repo.
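The proof/answer auto-routing and the answer reward presets described in this PR can be sketched as follows. Only the `"schema" in problem` test and the preset names `pure_success` and `base` come from the PR text; the `problem_type` field, the status strings, and the penalty values are hypothetical placeholders.

```python
# Assumed penalty values: the PR names these statuses but not the numbers.
BASE_PENALTIES = {"wrong": -0.5, "no_answer": -0.5, "unparsable": -0.5}


def detect_mode(problem: dict) -> str:
    """Use an explicit type if the row sets one, else route on grading-schema presence."""
    explicit = problem.get("problem_type")  # hypothetical field name
    if explicit:
        return explicit
    return "proof" if "schema" in problem else "answer"


def answer_reward(status: str, preset: str = "pure_success") -> float:
    """Map a verifier status to a reward under the chosen preset."""
    if status == "correct":
        return 1.0
    if status == "infra_error":
        # Infra errors stay neutral under both presets.
        return 0.0
    if preset == "base":
        return BASE_PENALTIES.get(status, 0.0)
    return 0.0
```

Under `pure_success` every non-correct status yields zero, while `base` distinguishes a wrong or missing answer from an infrastructure failure.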