fix(campaign): pin runtime-resolved model for HARNESS_NATIVE_MODEL cells by drewstone · Pull Request #304 · tangle-network/agent-eval

drewstone · 2026-07-02T03:09:07Z

Problem

0.102.0 added snap: a vendor-locked harness (e.g. kimi-code) that supports none of the swept models snaps to the HARNESS_NATIVE_MODEL sentinel ('default') instead of being dropped, and resolves its real model at runtime.

But runProfileMatrix then broke on that cell two ways:

Preflight (validateRunRecord probe) asserts the declared model carries a snapshot — 'default' has none, so the cell failed before any spend.
buildRunRecord recorded model: "default" verbatim — a provenance hole: which Kimi model actually ran is lost, and the global run-record guard (model MUST carry a snapshot) is defeated for that row.

The provenance guard is correct and stays global.

Fix

New reporting channel CampaignCostMeter.observeModel(model) (mirrors observeTokens). The substrate can't see the LLM call, so only the dispatch — reading the backend's usage/terminal events — knows the resolved model. run-campaign surfaces it as CampaignCellResult.resolvedModel (round-trips through the cell cache).
Preflight skips the snapshot assertion only for the sentinel (probes with a snapshot-bearing placeholder so every other recordability check still runs).
buildRunRecord substitutes the resolved, snapshot-bearing model (and the AgentProfileCell identity) when the declared model is the sentinel, and fails loud when a sentinel cell reported no resolved model or an unpinned one — never records 'default'.
modelHasSnapshot exported (single source of truth for the snapshot rule; reused by the resolver).

Every recorded cell still pins a real, snapshot-bearing model — the guarantee is preserved, not weakened.

Tests

3 new cases in run-profile-matrix.test.ts: sentinel + reported resolved model records the resolved id (and the cell identity), sentinel + no resolved model throws, sentinel + unpinned resolved model throws. Full suite green.

Release: agent-eval 0.102.1 (version trio synced).

A vendor-locked harness (e.g. kimi-code) that supports none of the swept models snaps to the HARNESS_NATIVE_MODEL sentinel and resolves its real model at runtime. runProfileMatrix's preflight asserted a snapshot on the declared model (failing the sentinel), and buildRunRecord recorded the sentinel verbatim — a provenance hole (which model actually ran was lost). Add an observeModel channel on CampaignCostMeter (mirrors observeTokens): the dispatch reports the backend-resolved model it read off the usage/ terminal events; run-campaign surfaces it as CampaignCellResult.resolvedModel. Preflight skips the snapshot assertion only for the sentinel; buildRunRecord substitutes the resolved, snapshot-bearing model and the AgentProfileCell identity, and FAILS LOUD when a sentinel cell reported no/unpinned model — never records the sentinel. The global provenance guard is unchanged. Release: agent-eval 0.102.1 (version trio synced).

tangletools

✅ Auto-approved drewstone PR — `7937bebb`

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: drewstone_author · 2026-07-02T03:09:16Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	2 (2 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	225.1s (2 bridge agents)
Total	225.1s

💰 Value — sound

The change lets vendor-locked harnesses that snap to the HARNESS_NATIVE_MODEL sentinel report their runtime-resolved model via ctx.cost.observeModel, then pins that snapshot-bearing model in RunRecord instead of the bare sentinel — a coherent fix that keeps the global provenance guard intact.

What it does: It adds observeModel/resolvedModel to CampaignCostMeter and resolvedModel to CampaignCellResult (src/campaign/types.ts:379-401, 524-529), captures the resolved model in executeCell (src/campaign/run-campaign.ts:305-317, 412), and in runProfileMatrix it skips the snapshot assertion for the HARNESS_NATIVE_MODEL sentinel only during preflight by probing with a placeholder (src/campaig
Goals it achieves: It unblocks recording eval cells for vendor-locked harnesses (e.g. kimi-code) that resolve their model at runtime, while preserving the invariant that every recorded row pins a concrete, snapshot-bearing model; it also keeps the harness/model pivot correct by stamping the resolved model into the cell identity.
Assessment: Good. The fix is localized, fail-loud, and follows the existing reporting pattern (observe/observeTokens) for things the substrate cannot see directly. The version trio is synced (package.json, clients/python/pyproject.toml, clients/python/src/agent_eval_rpc/init.py). Typecheck passes and the new tests in tests/campaign/run-profile-matrix.test.ts:333-412 pass (verified with GOMAXPROCS=1
Better / existing approach: none — this is the right approach. The change reuses and exports the canonical modelHasSnapshot from run-record.ts. The only cleanup is a pre-existing private copy of the same heuristic in src/trace-analyst/otlp-to-run-records.ts:435-441 that could now import the exported function to fully enforce 'single source of truth'.
Model: opencode/kimi-for-coding/k2p7
Bridge attempts: 1

🎯 Usefulness — sound

A well-fitting fix that repairs the HARNESS_NATIVE_MODEL provenance hole 0.102.0 introduced by adding an observeModel channel that mirrors the established cost/token pattern and fails loud so the sentinel can never be recorded.

Integration: The observeModel/resolvedModel channel wires into the exact established pattern (observe/observeTokens/current/tokens on CampaignCostMeter). run-campaign accumulates it per cell, surfaces it on CampaignCellResult, round-trips through the JSON cell cache, and buildRunRecord is the single gated funnel. The intended caller is the consumer-supplied ProfileDispatchFn (the substrate never ships dispatch
Fit with existing patterns: Fits the grain precisely: the substrate cannot see the LLM call, so the dispatch must report backend-resolved facts, identical to how cost and tokens already work. Resolving up front is impossible since the backend reports the served model during the call. modelHasSnapshot export is a justified single-source-of-truth predicate reused by the resolver.
Real-world viability: Fail-loud paths are correct (missing resolved model throws, unpinned throws, empty observeModel no-ops so garbage can't poison). Per-cell single-threadedness makes last-write-wins safe. resolvedModel survives resume via JSON cache round-trip. The byProfile.model representative-model assumption (first record) is sound for vendor-locked harnesses and never compromises individual RunRecord provenance
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

🎯 Usefulness Audit

🟡 byProfile.model fallback to sentinel is defensive but effectively unreachable [robustness] ``

At run-profile-matrix.ts:529-532, byProfile.model falls back to ?? declaredModel (the 'default' sentinel) when profileRecords is empty. For a sentinel profile, buildRunRecord's requireResolvedModel throws before push on every cell, so the only way to reach this with zero records is an empty campaign.cells loop — making the fallback dead in practice. A reviewer may prefer throwing there too (consistent with the fail-loud stance) rather than risking the sentinel surfacing in the byProfile pivo

💰 Value Audit

🟡 Pre-existing duplicate of modelHasSnapshot not consolidated [duplication] ``

The PR exports modelHasSnapshot from src/run-record.ts:477 as the canonical snapshot rule, but an identical private copy remains in src/trace-analyst/otlp-to-run-records.ts:435-441. That file could now import the exported version so the rule is truly single-source. This is leftover duplication, not introduced by the fix, and does not affect correctness.

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260702T035331Z}

tangletools approved these changes Jul 2, 2026

View reviewed changes

tangletools reviewed Jul 2, 2026

View reviewed changes

drewstone merged commit ba2328f into main Jul 2, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(campaign): pin runtime-resolved model for HARNESS_NATIVE_MODEL cells#304

fix(campaign): pin runtime-resolved model for HARNESS_NATIVE_MODEL cells#304
drewstone merged 1 commit into
mainfrom
fix/harness-native-model-provenance

drewstone commented Jul 2, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

drewstone commented Jul 2, 2026

Problem

Fix

Tests

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved drewstone PR — 7937bebb

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟢 Value Audit — sound

💰 Value — sound

🎯 Usefulness — sound

🎯 Usefulness Audit

💰 Value Audit

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved drewstone PR — `7937bebb`