Skip to content

feat(runtime): defineLeaderboard — declarative eval leaderboard facade over runProfileMatrix#453

Merged
drewstone merged 2 commits into
mainfrom
feat/define-leaderboard
Jul 3, 2026
Merged

feat(runtime): defineLeaderboard — declarative eval leaderboard facade over runProfileMatrix#453
drewstone merged 2 commits into
mainfrom
feat/define-leaderboard

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

What

defineLeaderboard<TCase>(spec) on @tangle-network/agent-runtime/loops: the declarative facade that turns a product's hand-written matrix runner (~650 lines, e.g. tax's matrix.ts) into ~40 lines of domain code — cases + prompt + score — assembled into ONE runProfileMatrix call.

Design gates held

  • No engine in the facade. Pure assembly of existing primitives: expandProfileAxes (harness×model axis), loopDispatch + naiveDriver (per-cell driven loop), resolveSandboxClient (cli-bridge backend), leaderboard/renderLeaderboardMarkdown (default export). Zero new execution/judging/metering code.
  • Every default overridable; runProfileMatrix stays the escape floor. Level-1 seams: backends, flags, parseOutput, onCellEvents, setup/teardown, export, modelBackend, matrix passthrough (spread last onto the runProfileMatrix call). Level-2: dispatch / judges full replacement.
  • Expressiveness = tax. Async prompt (shell-to-python, resolved once per case), onCellEvents (search-metric capture), parseOutput (marker/echo-guard scoring), custom export, --cases/--harnesses/--models/--shots/--reps + spec-declared flags.
  • toBenchmarkAdapter() returns a structurally BenchmarkAdapter-compatible object (name/preflight/loadTasks/judge/goldArtifact) with no dependency on a bench package.

Footgun fixed by default

The default run dir is fresh per invocation (timestamp+pid under tmpdir). runProfileMatrix caches cells by run dir; a stable default silently resumes a prior FAILED zero-token cell and skips dispatch. Only an explicit --run-dir opts into resume. Also default: bare model ids are snapshot-stamped (--model-snapshot, default @leaderboard) so RunRecord validation can't reject the run.

Version notes

  • devDep @tangle-network/agent-eval 0.100.0 → 0.103.1; peer floor raised >=0.97.0>=0.101.0 (expandProfileAxes/harnessAxisOf/CODING_HARNESSES first shipped in v0.101.0). Additive new subpath otherwise — no fleet bump needed.
  • Minor bump 0.83.0 → 0.84.0.

Proof

  • 10 new offline tests (inProcessSandboxClient, no network/creds): end-to-end matrix run with integrity.verdict === 'real', fresh-run-dir default + explicit --run-dir, --cases subsetting + unknown-id rejection, snapshot stamping, score→judge wrapping (dimensions/notes), onCellEvents seam, spec flags → ctx.args, fail-loud default sandbox backend, toBenchmarkAdapter loadTasks/judge round-trip + duplicate-id preflight.
  • Full suite: 124 files, 1212 passed / 2 skipped. typecheck (src + examples), build, lint, verify:package, docs:check (API docs regenerated) all green.

…e over runProfileMatrix

A product's harness×model leaderboard reduces to cases + prompt + score:
defineLeaderboard assembles expandProfileAxes × loopDispatch/naiveDriver ×
the backend resolvers into ONE runProfileMatrix call, with toBenchmarkAdapter()
exposing the same domain surface in the structural BenchmarkAdapter shape.
No new execution/judging/metering code — every default is overridable
(backends, flags, parseOutput, onCellEvents, export, dispatch, judges, matrix
passthrough) and runProfileMatrix stays public as the escape floor.

The default run dir is FRESH per invocation (timestamp+pid under tmpdir):
runProfileMatrix caches cells by run dir, and a stable default silently
resumes a prior failed zero-token cell — only an explicit --run-dir opts in.

Requires agent-eval >=0.101.0 (expandProfileAxes / harnessAxisOf /
CODING_HARNESSES); peer floor raised accordingly. 10 offline tests via
inProcessSandboxClient. Minor bump to 0.84.0.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved drewstone PR — 0d1936d6

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-03T04:59:34Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 2 (1 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.1s
Interrogation 147.4s (2 bridge agents)
Total 147.5s

💰 Value — sound-with-nits

A declarative facade that assembles existing primitives (expandProfileAxes × loopDispatch × naiveDriver × runProfileMatrix) into one call, collapsing ~650 hand-rolled lines per product into ~40; sound consolidation in the codebase's established grain, with only minor nits.

  • What it does: Adds defineLeaderboard<TCase>(spec) to @tangle-network/agent-runtime/loops: give it cases + prompt + score, and .run(argv) expands a base profile across harness×model axes, runs every (profile, case) cell as a driven loop via loopDispatch+naiveDriver, wraps score as the campaign judge, and emits ONE runProfileMatrix call, then exports a ranked markdown leaderboard. `.toBenchmarkA
  • Goals it achieves: Kill the per-product hand-rolled matrix runner (~650 lines each, per the PR body) and the three recurring footguns it re-hits: stale run-dir cell-cache reuse silently skipping dispatch, zero-token stub cells, and RunRecord rejection of bare model ids. Give products one declarative seam so the domain is just cases+prompt+score, and let a product leaderboard register into the bench BenchmarkAdapter
  • Assessment: Good change, in the grain of the codebase. It is pure assembly — zero new execution/judging/metering code; every moving part (expandProfileAxes, loopDispatch, naiveDriver, resolveSandboxClient, leaderboard/renderLeaderboardMarkdown, runProfileMatrix) is an existing primitive already exported from this package. It directly continues the recent consolidation siblings: loop-dispatch.ts (collapse ru
  • Better / existing approach: Searched: runProfileMatrix, expandProfileAxes, BenchmarkAdapter, toBenchmarkAdapter, loopCampaignDispatch consumers (src/**/*.ts), the bench adapter registry (bench/src/benchmarks/types.ts, bench/src/adapters.ts), and hand-rolled matrix.ts across ~/company. No existing equivalent in agent-runtime — none found. The correct alternative (importing the real BenchmarkAdapter from bench) is
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

A clean declarative facade that collapses the documented ~650-line hand-rolled matrix assembly (tax/legal/gtm) into ~40 lines of domain code, wiring only existing primitives with every default overridable and the documented footguns fixed by default.

  • Integration: Properly exported from src/runtime/index.ts:93-104 onto the /loops surface. Every primitive it assembles is confirmed present: loopDispatch (loop-dispatch.ts:159), naiveDriver (steering-drivers.ts:109), resolveSandboxClient (resolve-sandbox-client.ts:54), leaderboard/renderLeaderboardMarkdown (benchmark-report.ts), and expandProfileAxes/harnessAxisOf/CODING_HARNESSES + runProfileMatrix/JudgeConfig
  • Fit with existing patterns: Matches the codebase's established grain exactly. runPersonaDispatch in src/conversation/run-persona.ts:221 is the same shape (wrap a runner as a ProfileDispatchFn for runProfileMatrix); loopDispatch itself is the same shape; examples/webcode-matrix/webcode-matrix.ts is the canonical hand-rolled assembly. docs/archive/benchmark-matrix-consolidation.md identifies tax/legal/gtm as three duplicated ~
  • Real-world viability: Test coverage hits the load-bearing paths: end-to-end offline run with real metering (integrity verdict='real'), fresh run-dir per invocation (the stale-cell-cache footgun fix), explicit --run-dir resume, case subsetting + unknown-id rejection, snapshot stamping, structured score with dimensions, onCellEvents metric-capture seam, custom flag parsing, fail-loud on the default 'sandbox' backend, and
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/runtime/define-leaderboard.ts

  •    console.log(table)
    

💰 Value Audit

🟡 Local structural BenchmarkAdapter types shadow the canonical bench ones [maintenance] ``

define-leaderboard.ts:96-124 redeclares LeaderboardBenchTask/LeaderboardBenchScore/LeaderboardBenchmarkAdapter as structural mirrors of BenchTask/BenchScore/BenchmarkAdapter in bench/src/benchmarks/types.ts:13-46. This is forced by the dep direction (bench → agent-runtime, not back), so it's defensible, but the two definitions can drift silently: e.g. the canonical BenchmarkAdapter also has optional output?: OutputAdapter<string> and leafClient? (types.ts:46,57) which the local


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260703T052516Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 0d1936d6

Review health 100/100 · Reviewer score 62/100 · Confidence 85/100 · 18 findings (18 low)

glm deepseek aggregate
Readiness 62 74 62
Confidence 85 85 85
Correctness 62 74 62
Security 62 74 62
Testing 62 74 62
Architecture 62 74 62

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision.

🟡 LOW Several newly-listed agent-eval symbols carry no-summary placeholders — docs/api/primitive-catalog.md

New campaign entries discoverEvalFixtures, loadEvalFixture, loadEvalFixtureScenarios, planCampaignRun, planEvalFixtureRun, policyEditProposer all render '(no summary — add a TSDoc line at the declaration)'. These originate from the bumped @tangle-network/agent-eval@0.103.1 dependency, not this repo's source, so the catalog is faithfully generated — but the docs ship public API rows with no description. Track as a follow-up TSDoc task in agent-eval; not a blocker for this PR.

🟡 LOW defineLeaderboard has no TSDoc summary at declaration — docs/api/primitive-catalog.md

Catalog row reads '(no summary — add a TSDoc line at the declaration)'. The module-level comment (define-leaderboard.ts:1-30) documents the facade thoroughly, but the exported function at line 255 has no JSDoc directly above it (line 254 blank, preceded by normalizeScore). Add a one-line TSDoc on the export so the catalog table self-describes instead of showing the placeholder. Cosmetic only — signature and file:line are correct.

🟡 LOW Range-form inconsistency: agent-eval uses caret, sibling Tangle deps use open range — package.json

@tangle-network/agent-interface and @tangle-network/sandbox (both dev and peer) use the >=X <1.0.0 form; only agent-eval devDep was switched to caret ^0.103.1. Cosmetic but breaks the local convention and narrows resolution semantics (caret for 0.x caps at next minor). If the intent was 'latest 0.103.x', keep caret; if intent was 'any compatible', restore >=0.101.0 <1.0.0 to match siblings.

🟡 LOW devDep caret pin narrower than peerDep advertised range creates untested compatibility window — package.json

devDep ^0.103.1 resolves to >=0.103.1 <0.104.0 for 0.x versions, so CI/test only ever exercises 0.103.x. Yet peerDependencies advertises >=0.101.0 <1.0.0 (line 126), claiming 0.101.0–0.102.x are supported. The previous devDep (>=0.100.0 <1.0.0) covered the same breadth as the peer range; the new caret does not. Consumers pinning 0.101.x/0.102.x will hit a peer range the project never tested. Fix: either widen devDep back to >=0.101.0 <1.0.0 to match peerDep, or raise peerDep floor to >=0.103.1 to match what is actually tested. Preferable: align both on the tested floor.

🟡 LOW devDep version pin narrow but within peer range; no minimum-version CI coverage — package.json

devDep changed from >=0.100.0 <1.0.0 to ^0.103.1 (effectively >=0.103.1 <0.104.0). CI tests only the latest version, never the declared peer floor of 0.101.0. If a consumer resolves to 0.101.0–0.102.x and hits a runtime API gap not exercised by tests, CI won't catch it. The imported APIs (expandProfileAxes, runProfileMatrix, etc.) all exist in 0.101.0, so risk is low.

🟡 LOW peerDependency minimum bump (0.97.0 → 0.101.0) is breaking for existing consumers — package.json

Raising the peerDep floor by 4 minor versions drops support for agent-eval 0.97.0–0.100.x. Under 0.x semver this is permitted in a minor bump, but consumers will see ERESOLVE peer warnings on install. No CHANGELOG entry is visible in this shot's scope; recommend the PR description or CHANGELOG explicitly call out the required consumer upgrade so downstream pnpm install failures are not a surprise.

🟡 LOW Transitive sandbox major-series jump (0.3.0 -> 0.9.5) introduced via tcloud@0.4.14 — pnpm-lock.yaml

tcloud@0.4.14 now resolves its sandbox dependency to 0.9.5 instead of 0.3.0, a large minor-series jump inside the 0.x range (semver-major in 0.x terms). The lock file itself is correct and consistent; the only risk is runtime/behavioral drift in sandbox APIs consumed transitively via tcloud. This is not a lock defect — flagging so source-level shots confirm no sandbox 0.3-era APIs are relied upon. No action required within this shot.

🟡 LOW Default scoreJudge declares only 'composite' but passes extra dimensions — src/runtime/define-leaderboard.ts

The default scoreJudge at line 281-292 declares dimensions: [{ key: 'composite' }] but its score() implementation spreads s.dimensions into the returned dimensions object (line 288). Extra dimensions from spec.score() are recorded but undeclared in the judge config. Downstream consumers iterating declared dimensions would miss them. Either declare all known dimensions or document that undeclared dimensions are only carried in the raw data.

🟡 LOW --shots/--reps parse to NaN with no validation, silently unbounding the naive retry cap — src/runtime/define-leaderboard.ts

const shots = Number(args.shots ?? spec.shots ?? 1) (and same for reps) accepts non-numeric input like --shots abc and yields NaN. In naiveDriver.plan (steering-drivers.ts:122), history.length >= NaN is always false, so the retry cap silently disappears — the loop only stops on a valid verdict, potentially hammering the backend indefinitely on a case that never passes. Fix: validate with Number.isFinite(n) && n >= 1 and throw a fail-loud error otherwise, consistent with the other fail-loud guards in this function.

🟡 LOW as never cast on sandboxOverrides hides shape drift — src/runtime/define-leaderboard.ts

The constructed sandboxOverrides: { backend: { type: axis.harness, model: {...} } } is cast via as never to bypass the type system. Tests pass today, but a future agent-eval type change to the expected sandboxOverrides shape will not surface as a compile error here — the facade could silently send a wrong-shaped override to the sandbox. Replace with a structural cast or import the real override type.

🟡 LOW sandboxClient = makeClient() runs unconditionally even when spec.dispatch overrides the default — src/runtime/define-leaderboard.ts

A spec that fully replaces dispatch (LEVEL 2 override) and has no need for the default loop still hits makeClient() — and the default sandbox backend factory throws unless overridden. So a LEVEL-2 dispatch user is forced to also override spec.backends.sandbox (or pass --backend <custom>) just to skip a client they never use. Guard with const sandboxClient = spec.dispatch ? null : makeClient() and pass it through only to the default path.

🟡 LOW teardown is skipped when setup throws — src/runtime/define-leaderboard.ts

await spec.setup?.(ctx) runs BEFORE the try/finally that wraps runProfileMatrix (line 481 vs try at 482). If setup partially succeeds (opens handles, acquires boxes) and then throws, teardown never runs — a resource leak path. The JSDoc says teardown runs 'after the matrix, even on failure,' which is honored for matrix failure but not for setup failure. Wrap setup in the same try, or document that setup must be self-cleaning on throw.

🟡 LOW toBenchmarkAdapter().loadTasks() ignores opts.splitsrc/runtime/define-leaderboard.ts

The adapter accepts split?: string (declared at line 119 to match the BenchmarkAdapter structural contract) but the implementation only honors ids and limit. A registry requesting split: 'holdout' silently receives every case — a quiet contract gap rather than a fail-loud refusal. Either filter by a per-case split field (if the spec exposes one), or throw split is not modeled by this leaderboard so callers don't get a wrong-shaped slice.

🟡 LOW bareModel() has dead null-coalescing fallback — src/runtime/define-leaderboard.ts

model.split('@')[0] ?? model — the ?? model is unreachable because String.prototype.split always returns at least one element. Remove the ?? model or change to model.split('@')[0].

🟡 LOW maxIterations duplicated on naiveDriver and top-level loop options — src/runtime/define-leaderboard.ts

maxIterations: shots appears on both the naiveDriver options (line 445) and the returned loop options object (line 476). Both use the same variable now so consistent, but fragile: if a future edit updates one and not the other, the effective cap becomes the minimum, potentially masking the bug.

🟡 LOW setup() outside try/finally — teardown not called on setup failure — src/runtime/define-leaderboard.ts

setup() at line 481 is called before the try/finally. If setup allocates resources (e.g., opens a fixture connection, starts a local server) and then throws, teardown() never runs. The documented contract says teardown runs 'even on failure' but that refers to matrix failures. Resources allocated in setup before its throw will leak. Fix: wrap setup inside the try block, or wrap the entire (setup + try/finally) in its own try/finally.

🟡 LOW shotNonce is shared across concurrent cells — src/runtime/define-leaderboard.ts

shotNonce is a single mutable counter shared by all cells. When runProfileMatrix dispatches cells concurrently, taskToPrompt reads and increments it without synchronization. Practically benign — different cells have different base prompts so caching isn't an issue — but semantically the nonce should be per-cell or even per-shot to match intent. A non-atomic increment could produce duplicate nonces across cells.

🟡 LOW toBenchmarkAdapter().judge() has dead code guard — src/runtime/define-leaderboard.ts

const [c] = selectCases([task.id]) at line 538 then if (c === undefined) throw at line 539 is dead at runtime. selectCases() always throws for unknown ids (line 272-274) and always returns a populated array for known ids (map preserves length). The guard is needed for TypeScript strict index access but unreachable at runtime. The throw message will never fire.


tangletools · 2026-07-03T05:39:13Z · trace

tangletools
tangletools previously approved these changes Jul 3, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 18 non-blocking findings — 0d1936d6

Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-07-03T05:39:13Z · immutable trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved drewstone PR — 76d764d0

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-03T06:47:22Z

@drewstone drewstone merged commit 3aef976 into main Jul 3, 2026
1 check passed
@drewstone drewstone deleted the feat/define-leaderboard branch July 3, 2026 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants