Skip to content

fix(campaign): pin runtime-resolved model for HARNESS_NATIVE_MODEL cells#304

Merged
drewstone merged 1 commit into
mainfrom
fix/harness-native-model-provenance
Jul 2, 2026
Merged

fix(campaign): pin runtime-resolved model for HARNESS_NATIVE_MODEL cells#304
drewstone merged 1 commit into
mainfrom
fix/harness-native-model-provenance

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Problem

0.102.0 added snap: a vendor-locked harness (e.g. kimi-code) that supports none of the swept models snaps to the HARNESS_NATIVE_MODEL sentinel ('default') instead of being dropped, and resolves its real model at runtime.

But runProfileMatrix then broke on that cell two ways:

  1. Preflight (validateRunRecord probe) asserts the declared model carries a snapshot — 'default' has none, so the cell failed before any spend.
  2. buildRunRecord recorded model: "default" verbatim — a provenance hole: which Kimi model actually ran is lost, and the global run-record guard (model MUST carry a snapshot) is defeated for that row.

The provenance guard is correct and stays global.

Fix

  • New reporting channel CampaignCostMeter.observeModel(model) (mirrors observeTokens). The substrate can't see the LLM call, so only the dispatch — reading the backend's usage/terminal events — knows the resolved model. run-campaign surfaces it as CampaignCellResult.resolvedModel (round-trips through the cell cache).
  • Preflight skips the snapshot assertion only for the sentinel (probes with a snapshot-bearing placeholder so every other recordability check still runs).
  • buildRunRecord substitutes the resolved, snapshot-bearing model (and the AgentProfileCell identity) when the declared model is the sentinel, and fails loud when a sentinel cell reported no resolved model or an unpinned one — never records 'default'.
  • modelHasSnapshot exported (single source of truth for the snapshot rule; reused by the resolver).

Every recorded cell still pins a real, snapshot-bearing model — the guarantee is preserved, not weakened.

Tests

3 new cases in run-profile-matrix.test.ts: sentinel + reported resolved model records the resolved id (and the cell identity), sentinel + no resolved model throws, sentinel + unpinned resolved model throws. Full suite green.

Release: agent-eval 0.102.1 (version trio synced).

A vendor-locked harness (e.g. kimi-code) that supports none of the swept
models snaps to the HARNESS_NATIVE_MODEL sentinel and resolves its real
model at runtime. runProfileMatrix's preflight asserted a snapshot on the
declared model (failing the sentinel), and buildRunRecord recorded the
sentinel verbatim — a provenance hole (which model actually ran was lost).

Add an observeModel channel on CampaignCostMeter (mirrors observeTokens):
the dispatch reports the backend-resolved model it read off the usage/
terminal events; run-campaign surfaces it as CampaignCellResult.resolvedModel.
Preflight skips the snapshot assertion only for the sentinel; buildRunRecord
substitutes the resolved, snapshot-bearing model and the AgentProfileCell
identity, and FAILS LOUD when a sentinel cell reported no/unpinned model —
never records the sentinel. The global provenance guard is unchanged.

Release: agent-eval 0.102.1 (version trio synced).

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved drewstone PR — 7937bebb

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-02T03:09:16Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 2 (2 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 225.1s (2 bridge agents)
Total 225.1s

💰 Value — sound

The change lets vendor-locked harnesses that snap to the HARNESS_NATIVE_MODEL sentinel report their runtime-resolved model via ctx.cost.observeModel, then pins that snapshot-bearing model in RunRecord instead of the bare sentinel — a coherent fix that keeps the global provenance guard intact.

  • What it does: It adds observeModel/resolvedModel to CampaignCostMeter and resolvedModel to CampaignCellResult (src/campaign/types.ts:379-401, 524-529), captures the resolved model in executeCell (src/campaign/run-campaign.ts:305-317, 412), and in runProfileMatrix it skips the snapshot assertion for the HARNESS_NATIVE_MODEL sentinel only during preflight by probing with a placeholder (src/campaig
  • Goals it achieves: It unblocks recording eval cells for vendor-locked harnesses (e.g. kimi-code) that resolve their model at runtime, while preserving the invariant that every recorded row pins a concrete, snapshot-bearing model; it also keeps the harness/model pivot correct by stamping the resolved model into the cell identity.
  • Assessment: Good. The fix is localized, fail-loud, and follows the existing reporting pattern (observe/observeTokens) for things the substrate cannot see directly. The version trio is synced (package.json, clients/python/pyproject.toml, clients/python/src/agent_eval_rpc/init.py). Typecheck passes and the new tests in tests/campaign/run-profile-matrix.test.ts:333-412 pass (verified with GOMAXPROCS=1
  • Better / existing approach: none — this is the right approach. The change reuses and exports the canonical modelHasSnapshot from run-record.ts. The only cleanup is a pre-existing private copy of the same heuristic in src/trace-analyst/otlp-to-run-records.ts:435-441 that could now import the exported function to fully enforce 'single source of truth'.
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

A well-fitting fix that repairs the HARNESS_NATIVE_MODEL provenance hole 0.102.0 introduced by adding an observeModel channel that mirrors the established cost/token pattern and fails loud so the sentinel can never be recorded.

  • Integration: The observeModel/resolvedModel channel wires into the exact established pattern (observe/observeTokens/current/tokens on CampaignCostMeter). run-campaign accumulates it per cell, surfaces it on CampaignCellResult, round-trips through the JSON cell cache, and buildRunRecord is the single gated funnel. The intended caller is the consumer-supplied ProfileDispatchFn (the substrate never ships dispatch
  • Fit with existing patterns: Fits the grain precisely: the substrate cannot see the LLM call, so the dispatch must report backend-resolved facts, identical to how cost and tokens already work. Resolving up front is impossible since the backend reports the served model during the call. modelHasSnapshot export is a justified single-source-of-truth predicate reused by the resolver.
  • Real-world viability: Fail-loud paths are correct (missing resolved model throws, unpinned throws, empty observeModel no-ops so garbage can't poison). Per-cell single-threadedness makes last-write-wins safe. resolvedModel survives resume via JSON cache round-trip. The byProfile.model representative-model assumption (first record) is sound for vendor-locked harnesses and never compromises individual RunRecord provenance
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🎯 Usefulness Audit

🟡 byProfile.model fallback to sentinel is defensive but effectively unreachable [robustness] ``

At run-profile-matrix.ts:529-532, byProfile.model falls back to ?? declaredModel (the 'default' sentinel) when profileRecords is empty. For a sentinel profile, buildRunRecord's requireResolvedModel throws before push on every cell, so the only way to reach this with zero records is an empty campaign.cells loop — making the fallback dead in practice. A reviewer may prefer throwing there too (consistent with the fail-loud stance) rather than risking the sentinel surfacing in the byProfile pivo

💰 Value Audit

🟡 Pre-existing duplicate of modelHasSnapshot not consolidated [duplication] ``

The PR exports modelHasSnapshot from src/run-record.ts:477 as the canonical snapshot rule, but an identical private copy remains in src/trace-analyst/otlp-to-run-records.ts:435-441. That file could now import the exported version so the rule is truly single-source. This is leftover duplication, not introduced by the fix, and does not affect correctness.


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260702T035331Z

@drewstone drewstone merged commit ba2328f into main Jul 2, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants