feat(runtime): facade hardening - observeModel seam, shot metadata, bridge error propagation, generic TArtifact #462
Merged
…ridge error propagation, generic TArtifact

Four additive fixes grounded in the tax/gtm migration proofs:

- defineLeaderboard spec.resolveModel: resolve the served model off a shot's raw events; the default dispatch reports it via ctx.cost.observeModel so HARNESS_NATIVE_MODEL-snapped cells (vendor-locked harness x out-of-family model) pass the recordability guard (#455)
- onCellEvents third arg {index, error?, verdict?}: fires per shot after the cell loop settles, including THROWN shots that never reach parse, so shot failures are visible through the facade instead of only as empty zero-token cells (#456)
- bridgeExecutor: parse the SSE stream's unterminated tail (final frame missing its blank line, or a bare JSON error body, which is kimi's access_terminated_error shape) and throw with the upstream message instead of draining to one empty zero-token result; error frames also surface error.type when message is absent (#457)
- LeaderboardSpec<TCase, TArtifact = string> (+ DefinedLeaderboard, LeaderboardBenchmarkAdapter, dispatch/judges/score/parseOutput/export/matrix): structured artifacts flow natively; string default keeps every existing spec unchanged (#458)

All changes are additive: 2-arg onCellEvents callers, string-artifact specs, and terminated SSE streams behave byte-identically.
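The defaulted generic in the last bullet (#458) can be illustrated with a minimal sketch. `MiniSpec`, `scoreAll`, `Persona`, and the case shapes below are hypothetical stand-ins invented for this example; they are not the real `LeaderboardSpec` API, which has more members than shown. The point is only that omitting the second type argument keeps artifacts as strings, while supplying it lets a structured value pass from `parseOutput` to `score` without a JSON round trip.

```typescript
// Hypothetical, simplified stand-in for LeaderboardSpec (assumption, not the real type).
interface MiniSpec<TCase, TArtifact = string> {
  cases: TCase[];
  parseOutput: (raw: string, testCase: TCase) => TArtifact;
  score: (artifact: TArtifact, testCase: TCase) => number;
}

// Parse each raw output into an artifact, then score it against its case.
function scoreAll<TCase, TArtifact = string>(
  spec: MiniSpec<TCase, TArtifact>,
  rawOutputs: string[],
): number[] {
  return spec.cases.map((c, i) =>
    spec.score(spec.parseOutput(rawOutputs[i] ?? "", c), c),
  );
}

// Existing-style spec: the second type argument is omitted, so TArtifact is string.
const stringSpec: MiniSpec<{ expected: string }> = {
  cases: [{ expected: "42" }],
  parseOutput: (raw) => raw.trim(),
  score: (artifact, c) => (artifact === c.expected ? 1 : 0),
};

// Structured spec: the artifact stays an object from parseOutput through score.
interface Persona {
  name: string;
  traits: string[];
}

const personaSpec: MiniSpec<{ minTraits: number }, Persona> = {
  cases: [{ minTraits: 2 }],
  parseOutput: (raw) => JSON.parse(raw) as Persona,
  score: (artifact, c) => (artifact.traits.length >= c.minTraits ? 1 : 0),
};
```

Because the default is `string`, the first spec compiles exactly as it would against a non-generic interface, which is the sense in which the change is additive.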
tangletools approved these changes on Jul 3, 2026
tangletools (Contributor) left a comment:
✅ Auto-approved drewstone PR — 834661d1
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-03T18:40:10Z
Four additive facade/executor fixes, each grounded in a real product migration (tax-agent#358, gtm-agent#491). No existing consumer spec changes: 2-arg `onCellEvents` callers, string-artifact specs, and terminated SSE streams behave byte-identically, as verified by the untouched pre-existing test suite passing unmodified.

#455: `resolveModel` seam (served-model recordability)

`LeaderboardSpec.resolveModel?: (events: readonly SandboxEvent[]) => string | undefined`. The default dispatch calls it per shot and reports a returned value via `ctx.cost.observeModel`, so HARNESS_NATIVE_MODEL-snapped cells (vendor-locked harness × out-of-family model, e.g. claude-code × moonshot/*) pass agent-eval's recordability guard and the RunRecord pins the real snapshot-bearing model. In-family cells never need it. This is tax's old `resolvedModelFrom` pattern lifted into the facade.

#456: per-shot outcome metadata on `onCellEvents`

Third optional arg `LeaderboardIterationInfo { index, error?, verdict? }`. The tap now fires once per shot after the cell's loop settles, including thrown shots, which never reach `parseOutput` and were previously invisible (stub failures surfaced only as `outLen=0`). Existing 2-arg callers are unaffected.

#457: bridgeExecutor upstream-error propagation

`streamBridgeSession`'s SSE parser only handled frames terminated by a blank line; an upstream failure arriving as a bare JSON error body (kimi's `access_terminated_error` shape) or as an unterminated final `data:` frame was dropped with the tail buffer, draining the turn as ONE empty zero-token result. The tail is now parsed and error payloads throw `ValidationError` carrying the upstream message (and `error.type` when `message` is absent). Runs still fail loud; the diagnostic now rides the thrown event instead of requiring a manual bridge curl. Unit-tested against a real local HTTP stub: both error shapes drained to `[]` pre-fix, throw with the upstream message post-fix.

#458: `LeaderboardSpec<TCase, TArtifact = string>`

`TArtifact` generic threaded through `DefinedLeaderboard`, `LeaderboardBenchmarkAdapter`, `score`, `parseOutput`, `dispatch`, `judges`, `export`, and `matrix`. The `string` default means zero behavior/type change for existing adopters; a structured artifact (gtm's `PersonaArtifact`) now flows natively from `parseOutput` → validator → judges → records. gtm's JSON round-trip adapter at `eval/loops/leaderboard.ts` becomes deletable once this lands (not edited here).

Tests

- `src/runtime/define-leaderboard.test.ts`: +3 (snapped-cell pinning incl. the guard-failure negative, thrown-shot metadata, structured-artifact flow); the two behavioral ones fail on pre-fix code with the exact reported symptoms.
- `src/runtime/supervise/bridge-executor.test.ts`: +4 over a real local HTTP server (bare JSON error, unterminated error frame, terminated mid-stream error, healthy-stream regression).

Gate: lint, typecheck (+examples), build, `verify:package`, `docs:check` (API docs regenerated as the last pre-commit step) all green. Full suite 1227 passed / 2 skipped; one pre-existing 5s-timeout flake in `tests/loops/strategy-evolution.test.ts` under parallel load, passes 19/19 in isolation, untouched by this diff.

Closes #455
Closes #456
Closes #457
Closes #458
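For readers unfamiliar with the #457 failure mode, here is a minimal sketch of tail-aware SSE parsing. It is not the `streamBridgeSession` implementation: `parseSseBody`, `parseChunk`, and the `{ error: { message?, type? } }` payload shape are assumptions made for illustration, it works on a fully buffered body rather than a live stream, and it throws a plain `Error` where the PR throws `ValidationError`. It shows the one idea that matters: whatever remains after the last blank-line separator is parsed rather than discarded.

```typescript
// Assumed payload shape for illustration; the real bridge frames may differ.
interface UpstreamError {
  message?: string;
  type?: string;
}
interface Frame {
  error?: UpstreamError;
  [key: string]: unknown;
}

// Throw when a frame carries an error, preferring message and falling back to type.
function throwIfError(frame: Frame): void {
  if (frame.error) {
    throw new Error(
      frame.error.message ?? frame.error.type ?? "unknown upstream error",
    );
  }
}

// Parse one chunk: either "data: ..." line(s) or a bare JSON body.
function parseChunk(chunk: string): Frame | undefined {
  const trimmed = chunk.trim();
  if (trimmed === "") return undefined;
  const dataLines = trimmed
    .split("\n")
    .filter((line) => line.startsWith("data:"))
    .map((line) => line.slice("data:".length).trim());
  const json = dataLines.length > 0 ? dataLines.join("\n") : trimmed;
  try {
    const parsed: unknown = JSON.parse(json);
    return typeof parsed === "object" && parsed !== null
      ? (parsed as Frame)
      : undefined;
  } catch {
    return undefined; // not JSON (for example a "[DONE]" sentinel); skip it
  }
}

function parseSseBody(body: string): Frame[] {
  const chunks = body.replace(/\r\n/g, "\n").split("\n\n");
  // The last element is the unterminated tail ("" when the stream ended cleanly).
  const tail = chunks.pop() ?? "";
  const frames: Frame[] = [];
  for (const chunk of [...chunks, tail]) {
    const frame = parseChunk(chunk);
    if (!frame) continue;
    throwIfError(frame);
    frames.push(frame);
  }
  return frames;
}
```

A parser that looped over `chunks` alone and ignored `tail` would return `[]` for a bare JSON error body, which matches the empty zero-token result described under #457.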