feat(runtime): facade hardening — observeModel seam, shot metadata, bridge error propagation, generic TArtifact #462

Merged
drewstone merged 1 commit into main from feat/facade-hardening-455-458
Jul 3, 2026
@drewstone

Four additive facade/executor fixes, each grounded in a real product migration (tax-agent#358, gtm-agent#491). No existing consumer spec changes: 2-arg onCellEvents callers, string-artifact specs, and terminated SSE streams behave byte-identically, as verified by the pre-existing test suite passing unmodified.

#455 resolveModel seam (served-model recordability)

LeaderboardSpec.resolveModel?: (events: readonly SandboxEvent[]) => string | undefined. The default dispatch calls it per shot and reports a returned value via ctx.cost.observeModel, so HARNESS_NATIVE_MODEL-snapped cells (vendor-locked harness × out-of-family model, e.g. claude-code × moonshot/*) pass agent-eval's recordability guard and the RunRecord pins the real snapshot-bearing model. In-family cells never need it. This is tax's old resolvedModelFrom pattern lifted into the facade.
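
A minimal sketch of what a resolveModel implementation might look like. Only the seam's signature comes from this PR. The SandboxEvent shape (a `data.model` field) and the model string used below are hypothetical, since the real event type is not shown here.

```typescript
// Hypothetical event shape: the real SandboxEvent type is not shown in this PR.
interface SandboxEvent {
  type: string;
  data?: { model?: unknown };
}

// Matches the seam signature described above:
// (events: readonly SandboxEvent[]) => string | undefined
function resolveModel(events: readonly SandboxEvent[]): string | undefined {
  // Walk backwards so the most recent model report wins.
  for (let i = events.length - 1; i >= 0; i--) {
    const model = events[i]?.data?.model;
    if (typeof model === "string" && model.length > 0) return model;
  }
  // Nothing to report: the dispatch has no served model to observe.
  return undefined;
}
```

Returning undefined when no event carries a model keeps in-family cells on their existing path, which matches the "never need it" note above.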

#456 — per-shot outcome metadata on onCellEvents

Third optional arg LeaderboardIterationInfo { index, error?, verdict? }. The tap now fires once per shot after the cell's loop settles — including thrown shots, which never reach parseOutput and were previously invisible (stub failures surfaced only as outLen=0). Existing 2-arg callers are unaffected.
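
As an illustration, a tap that uses the third argument to record thrown shots could look like the sketch below. The LeaderboardIterationInfo field names come from this PR; their types, and the names and types of the first two parameters, are assumptions made for the example.

```typescript
// Field names from the PR; the `unknown` types are assumptions.
interface LeaderboardIterationInfo {
  index: number;
  error?: unknown;
  verdict?: unknown;
}

interface ShotLog {
  index: number;
  threw: boolean;
  message: string | null;
}

function makeShotTap() {
  const shots: ShotLog[] = [];
  // `cellId` and `events` are placeholders for the two existing arguments.
  const onCellEvents = (
    cellId: string,
    events: readonly unknown[],
    info?: LeaderboardIterationInfo,
  ): void => {
    if (!info) return; // called the 2-arg way: no per-shot metadata to log
    const threw = info.error !== undefined;
    const message = !threw
      ? null
      : info.error instanceof Error
        ? info.error.message
        : String(info.error);
    shots.push({ index: info.index, threw, message });
  };
  return { onCellEvents, shots };
}
```

Because the third parameter is optional, a tap written this way still accepts calls that pass only two arguments.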

#457 — bridgeExecutor upstream-error propagation

streamBridgeSession's SSE parser only handled frames terminated by a blank line; an upstream failure arriving as a bare JSON error body (kimi's access_terminated_error shape) or as an unterminated final data: frame was dropped with the tail buffer, draining the turn as ONE empty zero-token result. The tail is now parsed and error payloads throw ValidationError carrying the upstream message (and error.type when message is absent). Runs still fail loud; the diagnostic now rides the thrown event instead of requiring a manual bridge curl. Unit-tested against a real local HTTP stub: both error shapes drained to [] pre-fix, throw with the upstream message post-fix.
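
The tail handling described above can be sketched roughly as follows. This is not the streamBridgeSession code: it works on a fully buffered body instead of a stream, `UpstreamError` stands in for ValidationError, and the `{ "error": { "message", "type" } }` envelope is an assumed payload shape.

```typescript
// Stand-in for the ValidationError the PR mentions.
class UpstreamError extends Error {}

// Returns the upstream message (or error.type as a fallback) when the payload
// is an error envelope, otherwise undefined. The envelope shape is assumed.
function extractError(payload: string): string | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(payload);
  } catch {
    return undefined; // not JSON, so not an error body
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;
  const err = (parsed as { error?: unknown }).error;
  if (typeof err !== "object" || err === null) return undefined;
  const { message, type } = err as { message?: unknown; type?: unknown };
  if (typeof message === "string" && message) return message;
  if (typeof type === "string" && type) return type;
  return "unknown upstream error";
}

function parseSse(body: string): string[] {
  const frames = body.split("\n\n");
  // Whatever follows the last blank line: an unterminated frame or a bare body.
  const tail = (frames.pop() ?? "").trim();
  const out: string[] = [];
  const handleLine = (line: string): void => {
    if (!line.startsWith("data:")) return;
    const payload = line.slice(5).trim();
    const msg = extractError(payload);
    if (msg !== undefined) throw new UpstreamError(msg);
    out.push(payload);
  };
  for (const frame of frames) frame.split("\n").forEach(handleLine);
  if (tail.startsWith("{")) {
    // Bare JSON error body with no "data:" prefix at all.
    const msg = extractError(tail);
    if (msg !== undefined) throw new UpstreamError(msg);
  } else {
    // Final frame that is missing its terminating blank line.
    tail.split("\n").forEach(handleLine);
  }
  return out;
}
```

The point of the sketch is that the tail goes through the same error check as terminated frames instead of being discarded, so a failure surfaces as a thrown error and not as an empty result.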

#458 LeaderboardSpec<TCase, TArtifact = string>

TArtifact generic threaded through DefinedLeaderboard, LeaderboardBenchmarkAdapter, score, parseOutput, dispatch, judges, export, and matrix. The string default means zero behavior/type change for existing adopters; a structured artifact (gtm's PersonaArtifact) now flows natively from parseOutput → validator → judges → records. gtm's JSON round-trip adapter at eval/loops/leaderboard.ts becomes deletable once this lands (not edited here).
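
To show why the string default leaves existing specs untouched, here is a reduced sketch of a spec type with a defaulted second type parameter. The two-field interface is a simplification assumed for the example, not the real LeaderboardSpec, and the PersonaArtifact fields are hypothetical.

```typescript
// Reduced stand-in for the real spec type; only the generic shape is from the PR.
interface LeaderboardSpec<TCase, TArtifact = string> {
  parseOutput: (raw: string, testCase: TCase) => TArtifact;
  score: (artifact: TArtifact, testCase: TCase) => number;
}

// Hypothetical fields; the real PersonaArtifact lives in the gtm codebase.
interface PersonaArtifact {
  name: string;
  painPoints: string[];
}

// Existing-style spec: omits TArtifact, so the artifact is a string.
const stringSpec: LeaderboardSpec<{ expected: string }> = {
  parseOutput: (raw) => raw.trim(),
  score: (artifact, testCase) => (artifact === testCase.expected ? 1 : 0),
};

// Structured spec: the parsed object reaches score without a JSON round trip.
const personaSpec: LeaderboardSpec<{ minPainPoints: number }, PersonaArtifact> = {
  parseOutput: (raw) => JSON.parse(raw) as PersonaArtifact,
  score: (artifact, testCase) =>
    artifact.painPoints.length >= testCase.minPainPoints ? 1 : 0,
};

// Generic helper: TArtifact is inferred from whichever spec is passed in.
function runShot<TCase, TArtifact>(
  spec: LeaderboardSpec<TCase, TArtifact>,
  raw: string,
  testCase: TCase,
): number {
  return spec.score(spec.parseOutput(raw, testCase), testCase);
}
```

Code written against the one-parameter form keeps compiling because the compiler fills in `string` for the omitted parameter.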

Tests

  • src/runtime/define-leaderboard.test.ts: +3 (snapped-cell pinning incl. the guard-failure negative, thrown-shot metadata, structured-artifact flow) — the two behavioral ones fail on pre-fix code with the exact reported symptoms.
  • src/runtime/supervise/bridge-executor.test.ts: +4 over a real local HTTP server (bare JSON error, unterminated error frame, terminated mid-stream error, healthy-stream regression).

Gate: lint, typecheck (+examples), build, verify:package, docs:check (API docs regenerated as the last pre-commit step) all green. Full suite 1227 passed / 2 skipped; one pre-existing 5s-timeout flake in tests/loops/strategy-evolution.test.ts under parallel load, passes 19/19 in isolation, untouched by this diff.

Closes #455 Closes #456 Closes #457 Closes #458

…ridge error propagation, generic TArtifact

Four additive fixes grounded in the tax/gtm migration proofs:

- defineLeaderboard spec.resolveModel: resolve the served model off a shot's
  raw events; the default dispatch reports it via ctx.cost.observeModel so
  HARNESS_NATIVE_MODEL-snapped cells (vendor-locked harness x out-of-family
  model) pass the recordability guard (#455)
- onCellEvents third arg {index, error?, verdict?}: fires per shot after the
  cell loop settles, including THROWN shots that never reach parse — shot
  failures are visible through the facade instead of only as empty
  zero-token cells (#456)
- bridgeExecutor: parse the SSE stream's unterminated tail (final frame
  missing its blank line, or a bare JSON error body — kimi's
  access_terminated_error shape) and throw with the upstream message instead
  of draining to one empty zero-token result; error frames also surface
  error.type when message is absent (#457)
- LeaderboardSpec<TCase, TArtifact = string> (+ DefinedLeaderboard,
  LeaderboardBenchmarkAdapter, dispatch/judges/score/parseOutput/export/
  matrix): structured artifacts flow natively; string default keeps every
  existing spec unchanged (#458)

All changes are additive: 2-arg onCellEvents callers, string-artifact specs,
and terminated SSE streams behave byte-identically.

@tangletools tangletools left a comment

✅ Auto-approved drewstone PR — 834661d1

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-03T18:40:10Z

@drewstone drewstone merged commit 44cafd7 into main Jul 3, 2026
1 check passed
@drewstone drewstone deleted the feat/facade-hardening-455-458 branch July 3, 2026 18:42