fix(campaign): pin runtime-resolved model for HARNESS_NATIVE_MODEL cells#304
Conversation
A vendor-locked harness (e.g. kimi-code) that supports none of the swept models snaps to the HARNESS_NATIVE_MODEL sentinel and resolves its real model at runtime. runProfileMatrix's preflight asserted a snapshot on the declared model (failing the sentinel), and buildRunRecord recorded the sentinel verbatim — a provenance hole (which model actually ran was lost). Add an observeModel channel on CampaignCostMeter (mirrors observeTokens): the dispatch reports the backend-resolved model it read off the usage/ terminal events; run-campaign surfaces it as CampaignCellResult.resolvedModel. Preflight skips the snapshot assertion only for the sentinel; buildRunRecord substitutes the resolved, snapshot-bearing model and the AgentProfileCell identity, and FAILS LOUD when a sentinel cell reported no/unpinned model — never records the sentinel. The global provenance guard is unchanged. Release: agent-eval 0.102.1 (version trio synced).
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved drewstone PR — 7937bebb
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-02T03:09:16Z
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 225.1s (2 bridge agents) |
| Total | 225.1s |
💰 Value — sound
The change lets vendor-locked harnesses that snap to the HARNESS_NATIVE_MODEL sentinel report their runtime-resolved model via ctx.cost.observeModel, then pins that snapshot-bearing model in RunRecord instead of the bare sentinel — a coherent fix that keeps the global provenance guard intact.
- What it does: It adds
observeModel/resolvedModeltoCampaignCostMeterandresolvedModeltoCampaignCellResult(src/campaign/types.ts:379-401, 524-529), captures the resolved model inexecuteCell(src/campaign/run-campaign.ts:305-317, 412), and inrunProfileMatrixit skips the snapshot assertion for theHARNESS_NATIVE_MODELsentinel only during preflight by probing with a placeholder (src/campaig - Goals it achieves: It unblocks recording eval cells for vendor-locked harnesses (e.g.
kimi-code) that resolve their model at runtime, while preserving the invariant that every recorded row pins a concrete, snapshot-bearing model; it also keeps the harness/model pivot correct by stamping the resolved model into the cell identity. - Assessment: Good. The fix is localized, fail-loud, and follows the existing reporting pattern (
observe/observeTokens) for things the substrate cannot see directly. The version trio is synced (package.json, clients/python/pyproject.toml, clients/python/src/agent_eval_rpc/init.py). Typecheck passes and the new tests in tests/campaign/run-profile-matrix.test.ts:333-412 pass (verified withGOMAXPROCS=1 - Better / existing approach: none — this is the right approach. The change reuses and exports the canonical
modelHasSnapshotfrom run-record.ts. The only cleanup is a pre-existing private copy of the same heuristic in src/trace-analyst/otlp-to-run-records.ts:435-441 that could now import the exported function to fully enforce 'single source of truth'. - Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound
A well-fitting fix that repairs the HARNESS_NATIVE_MODEL provenance hole 0.102.0 introduced by adding an observeModel channel that mirrors the established cost/token pattern and fails loud so the sentinel can never be recorded.
- Integration: The observeModel/resolvedModel channel wires into the exact established pattern (observe/observeTokens/current/tokens on CampaignCostMeter). run-campaign accumulates it per cell, surfaces it on CampaignCellResult, round-trips through the JSON cell cache, and buildRunRecord is the single gated funnel. The intended caller is the consumer-supplied ProfileDispatchFn (the substrate never ships dispatch
- Fit with existing patterns: Fits the grain precisely: the substrate cannot see the LLM call, so the dispatch must report backend-resolved facts, identical to how cost and tokens already work. Resolving up front is impossible since the backend reports the served model during the call. modelHasSnapshot export is a justified single-source-of-truth predicate reused by the resolver.
- Real-world viability: Fail-loud paths are correct (missing resolved model throws, unpinned throws, empty observeModel no-ops so garbage can't poison). Per-cell single-threadedness makes last-write-wins safe. resolvedModel survives resume via JSON cache round-trip. The byProfile.model representative-model assumption (first record) is sound for vendor-locked harnesses and never compromises individual RunRecord provenance
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🎯 Usefulness Audit
🟡 byProfile.model fallback to sentinel is defensive but effectively unreachable [robustness] ``
At run-profile-matrix.ts:529-532,
byProfile.modelfalls back to?? declaredModel(the 'default' sentinel) when profileRecords is empty. For a sentinel profile, buildRunRecord's requireResolvedModel throws before push on every cell, so the only way to reach this with zero records is an empty campaign.cells loop — making the fallback dead in practice. A reviewer may prefer throwing there too (consistent with the fail-loud stance) rather than risking the sentinel surfacing in the byProfile pivo
💰 Value Audit
🟡 Pre-existing duplicate of modelHasSnapshot not consolidated [duplication] ``
The PR exports
modelHasSnapshotfrom src/run-record.ts:477 as the canonical snapshot rule, but an identical private copy remains in src/trace-analyst/otlp-to-run-records.ts:435-441. That file could now import the exported version so the rule is truly single-source. This is leftover duplication, not introduced by the fix, and does not affect correctness.
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
Problem
0.102.0 added snap: a vendor-locked harness (e.g.
kimi-code) that supports none of the swept models snaps to theHARNESS_NATIVE_MODELsentinel ('default') instead of being dropped, and resolves its real model at runtime.But
runProfileMatrixthen broke on that cell two ways:validateRunRecordprobe) asserts the declared model carries a snapshot —'default'has none, so the cell failed before any spend.buildRunRecordrecordedmodel: "default"verbatim — a provenance hole: which Kimi model actually ran is lost, and the global run-record guard (model MUST carry a snapshot) is defeated for that row.The provenance guard is correct and stays global.
Fix
CampaignCostMeter.observeModel(model)(mirrorsobserveTokens). The substrate can't see the LLM call, so only the dispatch — reading the backend's usage/terminal events — knows the resolved model.run-campaignsurfaces it asCampaignCellResult.resolvedModel(round-trips through the cell cache).buildRunRecordsubstitutes the resolved, snapshot-bearing model (and theAgentProfileCellidentity) when the declared model is the sentinel, and fails loud when a sentinel cell reported no resolved model or an unpinned one — never records'default'.modelHasSnapshotexported (single source of truth for the snapshot rule; reused by the resolver).Every recorded cell still pins a real, snapshot-bearing model — the guarantee is preserved, not weakened.
Tests
3 new cases in
run-profile-matrix.test.ts: sentinel + reported resolved model records the resolved id (and the cell identity), sentinel + no resolved model throws, sentinel + unpinned resolved model throws. Full suite green.Release: agent-eval 0.102.1 (version trio synced).