
[2.0] Add nanowm_rollout_speedup + nanowm_rollout_stability (Nano World Models)#150

Open
wenhaochai wants to merge 8 commits into main from problem/nanowm-rollout-tasks

@wenhaochai wenhaochai commented Jun 13, 2026

Contributor

Summary

Two paired Frontier-CS 2.0 tasks built from Nano World Models (arXiv:2605.23993,
simchowitzlabpublic/nano-world-model, MIT) — a frozen NanoWM-L/2 CSGO diffusion
world model. Same shape as vllm_llm_serving_optimization (#145): submit a
Python-only patch to source, the judge applies it and measures a continuous
metric with a guardrail, GPU runs on Modal (CPU judge). No changes to the
shared 2.0 adapter/template: each task is self-contained + per-problem Modal.

The two are a dual pair on the same frozen model:

nanowm_rollout_speedup — minimize inference latency at iso-quality

Agent patches the diffusion sampling layer to speed up a fixed 50-step,
50-frame CSGO rollout without losing quality.
score = clip(100·log2(geomean_speedup),0,100) · quality_mult (LPIPS-vs-GT
guardrail ≤3%). The CSGO step↔quality frontier is real (seq@2 = +31% LPIPS vs
seq@50, monotonic). Reference (bf16 autocast) = 1.15× at iso-quality, score 20.1.
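
A minimal sketch of the speedup score formula above, for illustration only. The function names, the per-clip latency inputs, and especially the shape of `quality_mult` are assumptions: the description only says there is an LPIPS-vs-GT guardrail of ≤3%, so a hard pass/fail gate is used here as a stand-in for whatever the real evaluator.py does.

```python
import math


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup_score(base_latencies, patched_latencies,
                  base_lpips, patched_lpips, max_regression=0.03):
    """score = clip(100 * log2(geomean_speedup), 0, 100) * quality_mult."""
    # Per-clip speedup: baseline latency divided by patched latency.
    speedups = [b / p for b, p in zip(base_latencies, patched_latencies)]
    raw = 100.0 * math.log2(geomean(speedups))
    clipped = min(max(raw, 0.0), 100.0)

    # ASSUMPTION: the guardrail is modeled as a hard gate on relative
    # LPIPS regression; the real multiplier may be graded differently.
    regression = (patched_lpips - base_lpips) / base_lpips
    quality_mult = 1.0 if regression <= max_regression else 0.0
    return clipped * quality_mult
```

Under these assumptions, a uniform 1.15× speedup with unchanged LPIPS gives roughly 20.2, in line with the reference score quoted above; a 2× geomean speedup already saturates the score at 100.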

nanowm_rollout_stability — minimize long-horizon drift at iso-wall-clock

The dual: patch the sampling layer to reduce 80-frame tail drift (mean
LPIPS-vs-GT over frames ≥60) under a fixed wall-clock budget (so drift isn't
bought with compute — that's the speedup axis).
score = clip(100·(base_tail−patched_tail)/base_tail,0,100) · wallclock_mult.
Reference (history-stabilization bump) reliably beats baseline (73% per-clip win,
t≈2.5 over 22 clips) = 7.2% drift reduction, score 7.2 (24-clip final role).
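
The stability score can be sketched the same way. This is a hedged illustration, not the evaluator's code: the helper names, the 0-based frame indexing, and the hard-gate form of `wallclock_mult` are assumptions, since the description only states that the wall-clock budget is fixed.

```python
def tail_drift(per_frame_lpips, tail_start=60):
    """Mean LPIPS-vs-GT over frames >= tail_start.

    ASSUMPTION: frames are indexed from 0, so an 80-frame rollout
    contributes frames 60..79 to the tail.
    """
    tail = per_frame_lpips[tail_start:]
    return sum(tail) / len(tail)


def stability_score(base_tail, patched_tail,
                    base_seconds, patched_seconds, budget_ratio=1.0):
    """score = clip(100 * (base_tail - patched_tail) / base_tail, 0, 100)
    * wallclock_mult."""
    raw = 100.0 * (base_tail - patched_tail) / base_tail
    clipped = min(max(raw, 0.0), 100.0)

    # ASSUMPTION: the budget is modeled as a hard gate relative to the
    # baseline wall-clock time; the real multiplier may be graded.
    within_budget = patched_seconds <= budget_ratio * base_seconds
    wallclock_mult = 1.0 if within_budget else 0.0
    return clipped * wallclock_mult
```

With this form, a 7.2% reduction in tail drift inside the budget scores 7.2, while a patch that increases drift or exceeds the budget scores 0.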

Validation (Della H100)

Both run the real evaluator.py end-to-end via a local-GPU backend (Modal
stand-in): patch-policy validation → patch apply → CSGO rollout → metric →
guardrail → scoring. Both references reliably beat baseline (scores 20.1 / 7.2).
Patch policy accepts the references and rejects metric edits + env-var leakage;
the CPU smoke path (FRONTIER_NWM_SMOKE=1) validates the policy + passes the
empty reference for offline CI.

Patch policy

Python-only, ≤256 KB. Allowed: src/diffusion/**.py,
src/sample/sampling_utils.py. Denied: model/VAE/metric/rollout-harness/data/
training; native/build files; benchmark detection, env-var leakage, timing
short-circuits. The rollout invocation is fixed by the judge; the agent changes
sampler internals only.
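
To make the policy concrete, here is a rough sketch of a static check over a unified diff. It is not the evaluator's implementation: the function names, the `+++`/`---` header parsing, and the crude `os.environ`/`getenv` substring test are illustrative assumptions. Only the 256 KB limit and the two allowed path patterns come from the policy above.

```python
from fnmatch import fnmatch

MAX_PATCH_BYTES = 256 * 1024
# fnmatch's "*" also matches "/", so this covers nested files under
# src/diffusion/, approximating the "src/diffusion/**.py" rule.
ALLOWED_GLOBS = ("src/diffusion/*.py", "src/sample/sampling_utils.py")


def touched_paths(patch_text):
    """Collect file paths from the ---/+++ headers of a unified diff.

    Conservative on purpose: a content line that happens to start with
    '--- ' or '+++ ' is treated as a header and may cause a rejection.
    """
    paths = []
    for line in patch_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            target = line[4:].split("\t")[0].strip()
            if target == "/dev/null":
                continue
            if target.startswith(("a/", "b/")):
                target = target[2:]
            paths.append(target)
    return paths


def check_patch(patch_text):
    """Return (accepted, reason) for a candidate patch."""
    if len(patch_text.encode("utf-8")) > MAX_PATCH_BYTES:
        return False, "patch exceeds 256 KB"
    for path in touched_paths(patch_text):
        if ".." in path.split("/"):
            return False, f"path traversal in {path}"
        if not any(fnmatch(path, glob) for glob in ALLOWED_GLOBS):
            return False, f"path not allowed: {path}"
    # ASSUMPTION: env-var leakage is approximated by a substring scan of
    # added lines; a real policy would need a more careful analysis.
    for line in patch_text.splitlines():
        if line.startswith("+") and not line.startswith("+++ "):
            if "os.environ" in line or "getenv" in line:
                return False, "environment access in added code"
    return True, "ok"
```

An empty patch passes this check, which matches the smoke path accepting the empty reference.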

CI status (validate-benchmark20)

This check is expected red for these tasks, exactly like #145 — and main has
no required status checks, so it is non-blocking. scripts/validate_problems.py
runs each task's reference inside the task's Docker image; for a Modal-GPU task the
real image bakes the NanoWM checkout + L/2 CSGO ckpt + held-out subset and is built
locally via docker/build_images.sh, so it is not on a public registry CI can
pull. The job therefore fails at docker pull with
pull access denied ... repository does not exist for both tasks — the same point
#145 fails at. There is no green path for a Modal-GPU 2.0 task without either a
GPU + published image + Modal token in CI, or editing the shared validator (which
this PR deliberately does not touch).

Everything up to that point is now well-formed (this was the one fix in the latest
push): language: python resolves, reference.py is found, the evaluator imports
and — on the no-GPU smoke path — returns 1.0 for both tasks locally. Real
correctness is the Della H100 end-to-end runs above; once the images are published
to a registry the check goes green unchanged.

Notes for maintainers

  • GPU via Modal (*/modal_app.py, mirrors vllm_llm_serving_optimization #145); the judge stays CPU. End-to-end
    Modal deploy is pending Modal credentials — the local-backend H100 runs above are
    the validated proof; Modal only swaps the GPU location. A turnkey deploy+test
    script is included in the task working repo.
  • Hidden assets (L/2 CSGO ckpt + held-out CSGO episode subset + cached baseline)
    are baked into the judge image at build time via each task's docker/build_images.sh
    — not committed here.
  • An RT-1 domain variant was evaluated and rejected: RT-1 is over-provisioned
    (quality peaks at ~5 steps), so speedup is trivial. CSGO is uniquely suited
    (monotonic step↔quality from the complex 16-frame-window model).

🤖 Generated with Claude Code

wenhaochai and others added 8 commits June 13, 2026 11:22
…> CSGO rollout -> speedup + LPIPS guardrail) + bf16 reference.patch
…SIGN, evaluator (patch policy validated: accepts ref, rejects metric-edit+env-leak), orchestrate+modal_app (Modal GPU, judge CPU), harbor/app, docker agent+judge, infra patch, evaluate.sh. Patch-policy + smoke validated on CPU; GPU runner validating on H100
…, score 22.4, qmult 1.0) — end-to-end H100 local backend
…frame tail-drift at iso-wall-clock. Reference (stab=0.20) reliably beats baseline (t~2.5/22 clips). Reuses framework; patch policy + smoke validated
…il-drift reduction @iso-wall-clock, score 5.54 > baseline)
The CI validator (scripts/validate_problems.py -> get_language_config) only
supports {python, cpp, rust}; `language: patch` raised
`ValueError: Unsupported language: patch` and crashed the whole validate
step with a traceback before any per-problem logic ran.

Mirror vllm_llm_serving_optimization (#145), the canonical Modal-GPU 2.0
task: declare `language: python`, keep the real solution in reference.patch,
and make reference.py a docstring-only placeholder. Also align the tag to
`systems` (matches #145 and the sibling systems-optimization tasks
duckdb_e2e_query_optimization / vector_db_ann).

Submission contract is unchanged: agents still submit /app/solution.patch;
the static patch policy + Modal-GPU scoring in {speedup,stability}_eval are
untouched. The judge stays CPU-only (no runtime.docker.gpu), exactly like
#145. Verified locally: language resolves to python, reference.py is found,
both evaluators return 1.0 on the no-GPU smoke path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>