[2.0] Add nanowm_rollout_speedup + nanowm_rollout_stability (Nano World Models) by wenhaochai · Pull Request #150 · FrontierCS/Frontier-CS

wenhaochai · 2026-06-13T15:26:15Z

Summary

Two paired Frontier-CS 2.0 tasks built from Nano World Models (arXiv:2605.23993,
simchowitzlabpublic/nano-world-model, MIT), both targeting a frozen NanoWM-L/2
CSGO diffusion world model. They follow the same shape as
vllm_llm_serving_optimization (#145): the agent submits a Python-only patch to
the source, the judge applies it and measures a continuous metric subject to a
guardrail, and the GPU work runs on Modal while the judge stays on CPU. There are
no changes to the shared 2.0 adapter/template; each task is self-contained with
its own per-problem Modal app.

The two are a dual pair on the same frozen model:

`nanowm_rollout_speedup` — minimize inference latency at iso-quality

The agent patches the diffusion sampling layer to speed up a fixed 50-step,
50-frame CSGO rollout without losing quality.
score = clip(100·log2(geomean_speedup), 0, 100) · quality_mult, where
quality_mult enforces the LPIPS-vs-GT guardrail (≤3%). The CSGO step↔quality
frontier is real: quality degrades monotonically as steps are cut, and seq@2 is
+31% LPIPS vs seq@50. The reference (bf16 autocast) reaches 1.15× at iso-quality,
score 20.1.
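
As a minimal sketch of the scoring formula above (not the evaluator's actual code), the function below computes clip(100·log2(geomean_speedup), 0, 100) · quality_mult. The names `geomean` and `speedup_score` are hypothetical, and `quality_mult` is taken as an input because the PR description does not spell out how the guardrail multiplier is derived.

```python
import math


def geomean(values):
    """Geometric mean of a non-empty list of strictly positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup_score(speedups, quality_mult):
    """Sketch of: clip(100 * log2(geomean_speedup), 0, 100) * quality_mult.

    `speedups` holds per-clip speedup ratios (baseline time / patched time).
    `quality_mult` is assumed to be computed elsewhere from the LPIPS guardrail.
    """
    raw = 100.0 * math.log2(geomean(speedups))
    clipped = min(max(raw, 0.0), 100.0)
    return clipped * quality_mult
```

Under this formula a 2× geometric-mean speedup saturates the score at 100, and any slowdown scores 0. Calling `speedup_score([1.15], 1.0)` gives roughly 20, in line with the reference's reported 20.1 (the 1.15× figure is itself rounded).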

`nanowm_rollout_stability` — minimize long-horizon drift at iso-wall-clock

The dual: the agent patches the sampling layer to reduce 80-frame tail drift
(mean LPIPS-vs-GT over frames ≥60) under a fixed wall-clock budget, so that drift
reduction cannot be bought with extra compute; that trade-off is the speedup
task's axis.
score = clip(100·(base_tail−patched_tail)/base_tail, 0, 100) · wallclock_mult.
The reference (history-stabilization bump) reliably beats the baseline (73%
per-clip win rate, t≈2.5 over 22 clips), giving a 7.2% drift reduction and a
score of 7.2 (24-clip final role).
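
The drift metric and score above can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's implementation: `tail_drift` and `stability_score` are hypothetical names, frames are assumed to be 0-indexed so that "frames ≥60" is the slice `[60:]`, and `wallclock_mult` is passed in because its definition is not given in the PR description.

```python
def tail_drift(per_frame_lpips, tail_start=60):
    """Mean LPIPS-vs-GT over frames with index >= tail_start (0-indexed, assumed)."""
    tail = per_frame_lpips[tail_start:]
    if not tail:
        raise ValueError("rollout is shorter than tail_start")
    return sum(tail) / len(tail)


def stability_score(base_tail, patched_tail, wallclock_mult):
    """Sketch of: clip(100 * (base_tail - patched_tail) / base_tail, 0, 100) * wallclock_mult."""
    if base_tail <= 0:
        raise ValueError("base_tail must be positive")
    raw = 100.0 * (base_tail - patched_tail) / base_tail
    clipped = min(max(raw, 0.0), 100.0)
    return clipped * wallclock_mult
```

With these definitions, a patch that lowers tail drift by 7.2% relative to the baseline scores about 7.2 when `wallclock_mult` is 1.0, and a patch that increases drift scores 0.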

Validation (Della H100)

Both tasks run the real evaluator.py end-to-end via a local-GPU backend (a
stand-in for Modal): patch-policy validation → patch apply → CSGO rollout →
metric → guardrail → scoring. Both references reliably beat the baseline (scores
20.1 / 7.2). The patch policy accepts the references and rejects metric edits and
env-var leakage; the CPU smoke path (FRONTIER_NWM_SMOKE=1) validates the policy
and passes the empty reference for offline CI.

Patch policy

Python-only, ≤256 KB. Allowed: src/diffusion/**.py,
src/sample/sampling_utils.py. Denied: model/VAE/metric/rollout-harness/data/
training; native/build files; benchmark detection, env-var leakage, timing
short-circuits. The rollout invocation is fixed by the judge; the agent changes
sampler internals only.
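
The path and size parts of this policy could be checked with something like the sketch below. It is only an illustration: `check_patch`, `touched_paths`, and the constants are hypothetical names, the `**.py` glob is interpreted here with `fnmatch` semantics (where `*` also matches `/`), and 256 KB is assumed to mean 256 × 1024 bytes. The real evaluator's checks for benchmark detection, env-var leakage, and timing short-circuits are not modelled.

```python
import posixpath
from fnmatch import fnmatchcase

# Allowlist taken from the policy text; glob semantics are an assumption.
ALLOWED_GLOBS = ("src/diffusion/**.py", "src/sample/sampling_utils.py")
MAX_PATCH_BYTES = 256 * 1024  # assumed meaning of "256 KB"


def touched_paths(patch_text):
    """Collect file paths named in unified-diff '---'/'+++' header lines."""
    paths = set()
    for line in patch_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            target = line[4:].split("\t")[0].strip()
            if target == "/dev/null":
                continue  # file creation or deletion marker
            if target.startswith(("a/", "b/")):
                target = target[2:]
            # Normalise so 'src/diffusion/../x.py' cannot slip past the glob.
            paths.add(posixpath.normpath(target))
    return sorted(paths)


def check_patch(patch_text):
    """Return (ok, problems): ok is True only if size and paths are allowed."""
    if len(patch_text.encode("utf-8")) > MAX_PATCH_BYTES:
        return False, ["<patch exceeds size limit>"]
    bad = [
        p for p in touched_paths(patch_text)
        if not any(fnmatchcase(p, g) for g in ALLOWED_GLOBS)
    ]
    return (not bad), bad
```

The header scan is deliberately conservative: a hunk line that happens to begin with `--- ` or `+++ ` is treated as a path, which can only cause a spurious rejection, never a spurious acceptance.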

CI status (`validate-benchmark20`)

This check is expected to be red for these tasks, exactly as for #145, and
because main has no required status checks it is non-blocking.
scripts/validate_problems.py runs each task's reference inside the task's Docker
image. For a Modal-GPU task, the real image bakes in the NanoWM checkout, the L/2
CSGO ckpt, and the held-out subset, and is built locally via
docker/build_images.sh, so it is not on a public registry that CI can pull from.
The job therefore fails at docker pull with
`pull access denied ... repository does not exist` for both tasks, the same point
at which #145 fails. There is no green path for a Modal-GPU 2.0 task without
either a GPU + published image + Modal token in CI, or editing the shared
validator (which this PR deliberately does not touch).

Everything up to that point is now well-formed (this was the one fix in the latest
push): language: python resolves, reference.py is found, and the evaluator imports
and, on the no-GPU smoke path, returns 1.0 for both tasks locally. Real
correctness is established by the Della H100 end-to-end runs above; once the
images are published to a registry, the check should go green with no further
changes.

Notes for maintainers

GPU via Modal (*/modal_app.py, mirrors #145); the judge stays CPU. The end-to-end
Modal deploy is pending Modal credentials. The local-backend H100 runs above are
the validated proof; Modal only swaps the GPU location. A turnkey deploy+test
script is included in the task working repo.

Hidden assets (L/2 CSGO ckpt + held-out CSGO episode subset + cached baseline)
are baked into the judge image at build time via each task's
docker/build_images.sh and are not committed here.

An RT-1 domain variant was evaluated and rejected: RT-1 is over-provisioned
(quality peaks at ~5 steps), so speedup is trivial. CSGO is uniquely suited
(monotonic step↔quality from the complex 16-frame-window model).

🤖 Generated with Claude Code

…SIGN, evaluator (patch policy validated: accepts ref, rejects metric-edit+env-leak), orchestrate+modal_app (Modal GPU, judge CPU), harbor/app, docker agent+judge, infra patch, evaluate.sh. Patch-policy + smoke validated on CPU; GPU runner validating on H100

The CI validator (scripts/validate_problems.py -> get_language_config) only supports {python, cpp, rust}; `language: patch` raised `ValueError: Unsupported language: patch` and crashed the whole validate step with a traceback before any per-problem logic ran.

Mirror vllm_llm_serving_optimization (#145), the canonical Modal-GPU 2.0 task: declare `language: python`, keep the real solution in reference.patch, and make reference.py a docstring-only placeholder. Also align the tag to `systems` (matches #145 and the sibling systems-optimization tasks duckdb_e2e_query_optimization / vector_db_ann).

Submission contract is unchanged: agents still submit /app/solution.patch; the static patch policy + Modal-GPU scoring in {speedup,stability}_eval are untouched. The judge stays CPU-only (no runtime.docker.gpu), exactly like #145.

Verified locally: language resolves to python, reference.py is found, both evaluators return 1.0 on the no-GPU smoke path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wenhaochai and others added 8 commits June 13, 2026 11:22

7702140 nanowm_rollout_speedup: core - scoring/settings/runner (apply patch -> CSGO rollout -> speedup + LPIPS guardrail) + bf16 reference.patch

02a8d6b nanowm_rollout_speedup: 2.0/README entry

55940fc nanowm_rollout_speedup: record validated reference result (1.17x bf16, score 22.4, qmult 1.0) - end-to-end H100 local backend

5065f7c nanowm_rollout_stability: drift task (dual of speedup) - minimize 80-frame tail-drift at iso-wall-clock. Reference (stab=0.20) reliably beats baseline (t~2.5/22 clips). Reuses framework; patch policy + smoke validated

e5e71a7 nanowm_rollout_stability: validated end-to-end (stab=0.20 ref 5.5% tail-drift reduction @iso-wall-clock, score 5.54 > baseline)

e0ec414 nanowm_rollout_speedup: infra patch = frame-chunked decode (24) for long rollouts

wenhaochai wants to merge 8 commits into main from problem/nanowm-rollout-tasks
