[2.0] Add nanowm_rollout_speedup + nanowm_rollout_stability (Nano World Models) #150
Open
wenhaochai wants to merge 8 commits into
…> CSGO rollout -> speedup + LPIPS guardrail) + bf16 reference.patch
…SIGN, evaluator (patch policy validated: accepts ref, rejects metric-edit+env-leak), orchestrate+modal_app (Modal GPU, judge CPU), harbor/app, docker agent+judge, infra patch, evaluate.sh. Patch-policy + smoke validated on CPU; GPU runner validating on H100
…, score 22.4, qmult 1.0) — end-to-end H100 local backend
…frame tail-drift at iso-wall-clock. Reference (stab=0.20) reliably beats baseline (t~2.5/22 clips). Reuses framework; patch policy + smoke validated
…il-drift reduction @iso-wall-clock, score 5.54 > baseline)
The CI validator (scripts/validate_problems.py -> get_language_config) only
supports {python, cpp, rust}; `language: patch` raised
`ValueError: Unsupported language: patch` and crashed the whole validate
step with a traceback before any per-problem logic ran.
Mirror vllm_llm_serving_optimization (#145), the canonical Modal-GPU 2.0
task: declare `language: python`, keep the real solution in reference.patch,
and make reference.py a docstring-only placeholder. Also align the tag to
`systems` (matches #145 and the sibling systems-optimization tasks
duckdb_e2e_query_optimization / vector_db_ann).
Submission contract is unchanged: agents still submit /app/solution.patch;
the static patch policy + Modal-GPU scoring in {speedup,stability}_eval are
untouched. The judge stays CPU-only (no runtime.docker.gpu), exactly like
#145. Verified locally: language resolves to python, reference.py is found,
both evaluators return 1.0 on the no-GPU smoke path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Two paired Frontier-CS 2.0 tasks built from Nano World Models (arXiv:2605.23993,
simchowitzlabpublic/nano-world-model, MIT), a frozen NanoWM-L/2 CSGO diffusion world model. Same shape as
`vllm_llm_serving_optimization` (#145): submit a Python-only patch to source, the judge applies it and measures a continuous
metric with a guardrail, and GPU runs on Modal (CPU judge). No changes to the
shared 2.0 adapter/template; each task is self-contained, with its own per-problem Modal app.
The two are a dual pair on the same frozen model:
`nanowm_rollout_speedup`: minimize inference latency at iso-quality
The agent patches the diffusion sampling layer to speed up a fixed 50-step,
50-frame CSGO rollout without losing quality.
score = clip(100·log2(geomean_speedup), 0, 100) · quality_mult (LPIPS-vs-GT guardrail ≤3%).
The CSGO step↔quality frontier is real (seq@2 = +31% LPIPS vs seq@50, monotonic).
Reference (bf16 autocast) = 1.15× at iso-quality, score 20.1.

`nanowm_rollout_stability`: minimize long-horizon drift at iso-wall-clock
The dual: patch the sampling layer to reduce 80-frame tail drift (mean
LPIPS-vs-GT over frames ≥60) under a fixed wall-clock budget (so drift isn't
bought with compute; that's the speedup axis).
score = clip(100·(base_tail−patched_tail)/base_tail, 0, 100) · wallclock_mult.
Reference (history-stabilization bump) reliably beats baseline (73% per-clip win,
t≈2.5 over 22 clips) = 7.2% drift reduction, score 7.2 (24-clip final role).
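The two score formulas above can be sketched in a few lines. This is a minimal illustration, not the code in `evaluator.py`: the function names are hypothetical, `quality_mult` and `wallclock_mult` are taken as inputs because their exact form is not spelled out here, and `tail_drift` assumes frames are 0-indexed.

```python
import math


def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def geomean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def tail_drift(per_frame_lpips, tail_start=60):
    """Mean LPIPS-vs-GT over the tail frames (assumes 0-indexed frames)."""
    tail = per_frame_lpips[tail_start:]
    return sum(tail) / len(tail)


def speedup_score(per_clip_speedups, quality_mult):
    """clip(100 * log2(geomean_speedup), 0, 100) * quality_mult."""
    raw = 100.0 * math.log2(geomean(per_clip_speedups))
    return clip(raw, 0.0, 100.0) * quality_mult


def stability_score(base_tail, patched_tail, wallclock_mult):
    """clip(100 * (base_tail - patched_tail) / base_tail, 0, 100) * wallclock_mult."""
    raw = 100.0 * (base_tail - patched_tail) / base_tail
    return clip(raw, 0.0, 100.0) * wallclock_mult
```

With a geomean speedup of 1.15 and `quality_mult = 1.0`, `speedup_score` comes out at roughly 20, in line with the reference score reported above; a 7.2% tail-drift reduction with `wallclock_mult = 1.0` gives 7.2. A slowdown or a drift increase clips to 0 rather than going negative.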
Validation (Della H100)
Both run the real `evaluator.py` end-to-end via a local-GPU backend (Modal
stand-in): patch-policy validation → patch apply → CSGO rollout → metric →
guardrail → scoring. Both references reliably beat baseline (scores 20.1 / 7.2).
Patch policy accepts the references and rejects metric edits + env-var leakage;
the CPU smoke path (`FRONTIER_NWM_SMOKE=1`) validates the policy + passes the
empty reference for offline CI.
Patch policy
Python-only, ≤256 KB. Allowed: `src/diffusion/**.py`, `src/sample/sampling_utils.py`.
Denied: model/VAE/metric/rollout-harness/data/training; native/build files;
benchmark detection, env-var leakage, timing short-circuits. The rollout
invocation is fixed by the judge; the agent changes sampler internals only.
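A static check along these lines could look like the sketch below. It is an illustration under stated assumptions, not the policy shipped in the evaluators: `check_patch`, `is_allowed_path`, and `ENV_TOKENS` are hypothetical names, the env-var scan is a naive substring match, and the benchmark-detection and timing short-circuit checks are not modeled.

```python
MAX_PATCH_BYTES = 256 * 1024  # the "≤256 KB" limit
ENV_TOKENS = ("os.environ", "getenv")  # assumed markers of env-var access


def is_allowed_path(path):
    """Allow src/diffusion/**.py and src/sample/sampling_utils.py only."""
    if path.startswith("/") or ".." in path.split("/"):
        return False
    if path == "src/sample/sampling_utils.py":
        return True
    return path.startswith("src/diffusion/") and path.endswith(".py")


def strip_prefix(target):
    """Drop a trailing tab-separated timestamp and the a/ or b/ diff prefix."""
    target = target.split("\t")[0].strip()
    if target.startswith(("a/", "b/")):
        target = target[2:]
    return target


def check_patch(patch_text):
    """Return a list of policy violations for a unified diff (empty = accepted)."""
    violations = []
    if len(patch_text.encode("utf-8")) > MAX_PATCH_BYTES:
        violations.append("patch exceeds 256 KB")
    lines = patch_text.splitlines()
    paths = set()
    for i, line in enumerate(lines):
        # A file header is a '--- ' line immediately followed by a '+++ ' line.
        is_header = (
            line.startswith("--- ")
            and i + 1 < len(lines)
            and lines[i + 1].startswith("+++ ")
        )
        if is_header:
            for raw in (line[4:], lines[i + 1][4:]):
                target = strip_prefix(raw)
                if target != "/dev/null":
                    paths.add(target)
        elif line.startswith("+") and not line.startswith("+++ "):
            # Added line: flag anything that looks like env-var access.
            if any(tok in line for tok in ENV_TOKENS):
                violations.append("env-var access in added line: " + line[1:].strip())
    for path in sorted(paths):
        if not is_allowed_path(path):
            violations.append("path not allowed: " + path)
    return violations
```

An empty patch yields no violations in this sketch, which matches the smoke path accepting the empty reference. A production policy would more likely parse the Python AST of the patched files than rely on substring matches.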
CI status (`validate-benchmark20`)
This check is expected red for these tasks, exactly like #145, and `main` has
no required status checks, so it is non-blocking.
`scripts/validate_problems.py` runs each task's reference inside the task's Docker image; for a Modal-GPU task the
real image bakes the NanoWM checkout + L/2 CSGO ckpt + held-out subset and is built
locally via `docker/build_images.sh`, so it is not on a public registry CI can
pull. The job therefore fails at `docker pull` with
`pull access denied ... repository does not exist` for both tasks, the same point
#145 fails at. There is no green path for a Modal-GPU 2.0 task without either a
GPU + published image + Modal token in CI, or editing the shared validator (which
this PR deliberately does not touch).
Everything up to that point is now well-formed (this was the one fix in the latest
push): `language: python` resolves, `reference.py` is found, and the evaluator imports
and, on the no-GPU smoke path, returns 1.0 for both tasks locally. Real
correctness is established by the Della H100 end-to-end runs above; once the images are published
to a registry the check goes green unchanged.
Notes for maintainers
`*/modal_app.py`, mirrors #145); the judge stays CPU. End-to-end
Modal deploy is pending Modal credentials; the local-backend H100 runs above are
the validated proof, and Modal only swaps the GPU location. A turnkey deploy+test
script is included in the task working repo.
are baked into the judge image at build time via each task's
`docker/build_images.sh`, not committed here.
(quality peaks at ~5 steps), so speedup is trivial. CSGO is uniquely suited
(monotonic step↔quality from the complex 16-frame-window model).
🤖 Generated with Claude Code