feat(runtime): defineLeaderboard — declarative eval leaderboard facade over runProfileMatrix by drewstone · Pull Request #453 · tangle-network/agent-runtime

drewstone · 2026-07-03T04:59:24Z

What

defineLeaderboard<TCase>(spec) on @tangle-network/agent-runtime/loops: the declarative facade that turns a product's hand-written matrix runner (~650 lines, e.g. tax's matrix.ts) into ~40 lines of domain code — cases + prompt + score — assembled into ONE runProfileMatrix call.

Design gates held

No engine in the facade. Pure assembly of existing primitives: expandProfileAxes (harness×model axis), loopDispatch + naiveDriver (per-cell driven loop), resolveSandboxClient (cli-bridge backend), leaderboard/renderLeaderboardMarkdown (default export). Zero new execution/judging/metering code.
Every default overridable; runProfileMatrix stays the escape floor. Level-1 seams: backends, flags, parseOutput, onCellEvents, setup/teardown, export, modelBackend, matrix passthrough (spread last onto the runProfileMatrix call). Level-2: dispatch / judges full replacement.
Expressiveness = tax. Async prompt (shell-to-python, resolved once per case), onCellEvents (search-metric capture), parseOutput (marker/echo-guard scoring), custom export, --cases/--harnesses/--models/--shots/--reps + spec-declared flags.
toBenchmarkAdapter() returns a structurally BenchmarkAdapter-compatible object (name/preflight/loadTasks/judge/goldArtifact) with no dependency on a bench package.

Footgun fixed by default

The default run dir is fresh per invocation (timestamp+pid under tmpdir). runProfileMatrix caches cells by run dir; a stable default silently resumes a prior FAILED zero-token cell and skips dispatch. Only an explicit --run-dir opts into resume. Also default: bare model ids are snapshot-stamped (--model-snapshot, default @leaderboard) so RunRecord validation can't reject the run.

Version notes

devDep @tangle-network/agent-eval 0.100.0 → 0.103.1; peer floor raised >=0.97.0 → >=0.101.0 (expandProfileAxes/harnessAxisOf/CODING_HARNESSES first shipped in v0.101.0). Additive new subpath otherwise — no fleet bump needed.
Minor bump 0.83.0 → 0.84.0.

Proof

10 new offline tests (inProcessSandboxClient, no network/creds): end-to-end matrix run with integrity.verdict === 'real', fresh-run-dir default + explicit --run-dir, --cases subsetting + unknown-id rejection, snapshot stamping, score→judge wrapping (dimensions/notes), onCellEvents seam, spec flags → ctx.args, fail-loud default sandbox backend, toBenchmarkAdapter loadTasks/judge round-trip + duplicate-id preflight.
Full suite: 124 files, 1212 passed / 2 skipped. typecheck (src + examples), build, lint, verify:package, docs:check (API docs regenerated) all green.

…e over runProfileMatrix A product's harness×model leaderboard reduces to cases + prompt + score: defineLeaderboard assembles expandProfileAxes × loopDispatch/naiveDriver × the backend resolvers into ONE runProfileMatrix call, with toBenchmarkAdapter() exposing the same domain surface in the structural BenchmarkAdapter shape. No new execution/judging/metering code — every default is overridable (backends, flags, parseOutput, onCellEvents, export, dispatch, judges, matrix passthrough) and runProfileMatrix stays public as the escape floor. The default run dir is FRESH per invocation (timestamp+pid under tmpdir): runProfileMatrix caches cells by run dir, and a stable default silently resumes a prior failed zero-token cell — only an explicit --run-dir opts in. Requires agent-eval >=0.101.0 (expandProfileAxes / harnessAxisOf / CODING_HARNESSES); peer floor raised accordingly. 10 offline tests via inProcessSandboxClient. Minor bump to 0.84.0.

tangletools

✅ Auto-approved drewstone PR — `0d1936d6`

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: drewstone_author · 2026-07-03T04:59:34Z}

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	2 (1 low, 1 weak-concern)
Heuristic	0.0s
Duplication	0.1s
Interrogation	147.4s (2 bridge agents)
Total	147.5s

💰 Value — sound-with-nits

A declarative facade that assembles existing primitives (expandProfileAxes × loopDispatch × naiveDriver × runProfileMatrix) into one call, collapsing ~650 hand-rolled lines per product into ~40; sound consolidation in the codebase's established grain, with only minor nits.

What it does: Adds defineLeaderboard<TCase>(spec) to @tangle-network/agent-runtime/loops: give it cases + prompt + score, and .run(argv) expands a base profile across harness×model axes, runs every (profile, case) cell as a driven loop via loopDispatch+naiveDriver, wraps score as the campaign judge, and emits ONE runProfileMatrix call, then exports a ranked markdown leaderboard. `.toBenchmarkA
Goals it achieves: Kill the per-product hand-rolled matrix runner (~650 lines each, per the PR body) and the three recurring footguns it re-hits: stale run-dir cell-cache reuse silently skipping dispatch, zero-token stub cells, and RunRecord rejection of bare model ids. Give products one declarative seam so the domain is just cases+prompt+score, and let a product leaderboard register into the bench BenchmarkAdapter
Assessment: Good change, in the grain of the codebase. It is pure assembly — zero new execution/judging/metering code; every moving part (expandProfileAxes, loopDispatch, naiveDriver, resolveSandboxClient, leaderboard/renderLeaderboardMarkdown, runProfileMatrix) is an existing primitive already exported from this package. It directly continues the recent consolidation siblings: loop-dispatch.ts (collapse ru
Better / existing approach: Searched: runProfileMatrix, expandProfileAxes, BenchmarkAdapter, toBenchmarkAdapter, loopCampaignDispatch consumers (src/**/*.ts), the bench adapter registry (bench/src/benchmarks/types.ts, bench/src/adapters.ts), and hand-rolled matrix.ts across ~/company. No existing equivalent in agent-runtime — none found. The correct alternative (importing the real BenchmarkAdapter from bench) is
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 2
Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

A clean declarative facade that collapses the documented ~650-line hand-rolled matrix assembly (tax/legal/gtm) into ~40 lines of domain code, wiring only existing primitives with every default overridable and the documented footguns fixed by default.

Integration: Properly exported from src/runtime/index.ts:93-104 onto the /loops surface. Every primitive it assembles is confirmed present: loopDispatch (loop-dispatch.ts:159), naiveDriver (steering-drivers.ts:109), resolveSandboxClient (resolve-sandbox-client.ts:54), leaderboard/renderLeaderboardMarkdown (benchmark-report.ts), and expandProfileAxes/harnessAxisOf/CODING_HARNESSES + runProfileMatrix/JudgeConfig
Fit with existing patterns: Matches the codebase's established grain exactly. runPersonaDispatch in src/conversation/run-persona.ts:221 is the same shape (wrap a runner as a ProfileDispatchFn for runProfileMatrix); loopDispatch itself is the same shape; examples/webcode-matrix/webcode-matrix.ts is the canonical hand-rolled assembly. docs/archive/benchmark-matrix-consolidation.md identifies tax/legal/gtm as three duplicated ~
Real-world viability: Test coverage hits the load-bearing paths: end-to-end offline run with real metering (integrity verdict='real'), fresh run-dir per invocation (the stale-cell-cache footgun fix), explicit --run-dir resume, case subsetting + unknown-id rejection, snapshot stamping, structured score with dimensions, onCellEvents metric-capture seam, custom flag parsing, fail-loud on the default 'sandbox' backend, and
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/runtime/define-leaderboard.ts

```
   console.log(table)
```

💰 Value Audit

🟡 Local structural BenchmarkAdapter types shadow the canonical bench ones [maintenance] ``

define-leaderboard.ts:96-124 redeclares LeaderboardBenchTask/LeaderboardBenchScore/LeaderboardBenchmarkAdapter as structural mirrors of BenchTask/BenchScore/BenchmarkAdapter in bench/src/benchmarks/types.ts:13-46. This is forced by the dep direction (bench → agent-runtime, not back), so it's defensible, but the two definitions can drift silently: e.g. the canonical BenchmarkAdapter also has optional output?: OutputAdapter<string> and leafClient? (types.ts:46,57) which the local

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260703T052516Z}

tangletools · 2026-07-03T05:39:15Z

✅ No Blockers — `0d1936d6`

Review health 100/100 · Reviewer score 62/100 · Confidence 85/100 · 18 findings (18 low)

	glm	deepseek	aggregate
Readiness	62	74	62
Confidence	85	85	85
Correctness	62	74	62
Security	62	74	62
Testing	62	74	62
Architecture	62	74	62

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision.

🟡 LOW Several newly-listed agent-eval symbols carry no-summary placeholders — docs/api/primitive-catalog.md

New campaign entries discoverEvalFixtures, loadEvalFixture, loadEvalFixtureScenarios, planCampaignRun, planEvalFixtureRun, policyEditProposer all render '(no summary — add a TSDoc line at the declaration)'. These originate from the bumped @tangle-network/agent-eval@0.103.1 dependency, not this repo's source, so the catalog is faithfully generated — but the docs ship public API rows with no description. Track as a follow-up TSDoc task in agent-eval; not a blocker for this PR.

🟡 LOW defineLeaderboard has no TSDoc summary at declaration — docs/api/primitive-catalog.md

Catalog row reads '(no summary — add a TSDoc line at the declaration)'. The module-level comment (define-leaderboard.ts:1-30) documents the facade thoroughly, but the exported function at line 255 has no JSDoc directly above it (line 254 blank, preceded by normalizeScore). Add a one-line TSDoc on the export so the catalog table self-describes instead of showing the placeholder. Cosmetic only — signature and file:line are correct.

🟡 LOW Range-form inconsistency: agent-eval uses caret, sibling Tangle deps use open range — package.json

@tangle-network/agent-interface and @tangle-network/sandbox (both dev and peer) use the >=X <1.0.0 form; only agent-eval devDep was switched to caret ^0.103.1. Cosmetic but breaks the local convention and narrows resolution semantics (caret for 0.x caps at next minor). If the intent was 'latest 0.103.x', keep caret; if intent was 'any compatible', restore >=0.101.0 <1.0.0 to match siblings.

🟡 LOW devDep caret pin narrower than peerDep advertised range creates untested compatibility window — package.json

devDep ^0.103.1 resolves to >=0.103.1 <0.104.0 for 0.x versions, so CI/test only ever exercises 0.103.x. Yet peerDependencies advertises >=0.101.0 <1.0.0 (line 126), claiming 0.101.0–0.102.x are supported. The previous devDep (>=0.100.0 <1.0.0) covered the same breadth as the peer range; the new caret does not. Consumers pinning 0.101.x/0.102.x will hit a peer range the project never tested. Fix: either widen devDep back to >=0.101.0 <1.0.0 to match peerDep, or raise peerDep floor to >=0.103.1 to match what is actually tested. Preferable: align both on the tested floor.

🟡 LOW devDep version pin narrow but within peer range; no minimum-version CI coverage — package.json

devDep changed from >=0.100.0 <1.0.0 to ^0.103.1 (effectively >=0.103.1 <0.104.0). CI tests only the latest version, never the declared peer floor of 0.101.0. If a consumer resolves to 0.101.0–0.102.x and hits a runtime API gap not exercised by tests, CI won't catch it. The imported APIs (expandProfileAxes, runProfileMatrix, etc.) all exist in 0.101.0, so risk is low.

🟡 LOW peerDependency minimum bump (0.97.0 → 0.101.0) is breaking for existing consumers — package.json

Raising the peerDep floor by 4 minor versions drops support for agent-eval 0.97.0–0.100.x. Under 0.x semver this is permitted in a minor bump, but consumers will see ERESOLVE peer warnings on install. No CHANGELOG entry is visible in this shot's scope; recommend the PR description or CHANGELOG explicitly call out the required consumer upgrade so downstream pnpm install failures are not a surprise.

🟡 LOW Transitive sandbox major-series jump (0.3.0 -> 0.9.5) introduced via tcloud@0.4.14 — pnpm-lock.yaml

tcloud@0.4.14 now resolves its sandbox dependency to 0.9.5 instead of 0.3.0, a large minor-series jump inside the 0.x range (semver-major in 0.x terms). The lock file itself is correct and consistent; the only risk is runtime/behavioral drift in sandbox APIs consumed transitively via tcloud. This is not a lock defect — flagging so source-level shots confirm no sandbox 0.3-era APIs are relied upon. No action required within this shot.

🟡 LOW Default scoreJudge declares only 'composite' but passes extra dimensions — src/runtime/define-leaderboard.ts

The default scoreJudge at line 281-292 declares dimensions: [{ key: 'composite' }] but its score() implementation spreads s.dimensions into the returned dimensions object (line 288). Extra dimensions from spec.score() are recorded but undeclared in the judge config. Downstream consumers iterating declared dimensions would miss them. Either declare all known dimensions or document that undeclared dimensions are only carried in the raw data.

🟡 LOW --shots/--reps parse to NaN with no validation, silently unbounding the naive retry cap — src/runtime/define-leaderboard.ts

const shots = Number(args.shots ?? spec.shots ?? 1) (and same for reps) accepts non-numeric input like --shots abc and yields NaN. In naiveDriver.plan (steering-drivers.ts:122), history.length >= NaN is always false, so the retry cap silently disappears — the loop only stops on a valid verdict, potentially hammering the backend indefinitely on a case that never passes. Fix: validate with Number.isFinite(n) && n >= 1 and throw a fail-loud error otherwise, consistent with the other fail-loud guards in this function.

🟡 LOW as never cast on sandboxOverrides hides shape drift — src/runtime/define-leaderboard.ts

The constructed sandboxOverrides: { backend: { type: axis.harness, model: {...} } } is cast via as never to bypass the type system. Tests pass today, but a future agent-eval type change to the expected sandboxOverrides shape will not surface as a compile error here — the facade could silently send a wrong-shaped override to the sandbox. Replace with a structural cast or import the real override type.

🟡 LOW sandboxClient = makeClient() runs unconditionally even when spec.dispatch overrides the default — src/runtime/define-leaderboard.ts

A spec that fully replaces dispatch (LEVEL 2 override) and has no need for the default loop still hits makeClient() — and the default sandbox backend factory throws unless overridden. So a LEVEL-2 dispatch user is forced to also override spec.backends.sandbox (or pass --backend <custom>) just to skip a client they never use. Guard with const sandboxClient = spec.dispatch ? null : makeClient() and pass it through only to the default path.

🟡 LOW teardown is skipped when setup throws — src/runtime/define-leaderboard.ts

await spec.setup?.(ctx) runs BEFORE the try/finally that wraps runProfileMatrix (line 481 vs try at 482). If setup partially succeeds (opens handles, acquires boxes) and then throws, teardown never runs — a resource leak path. The JSDoc says teardown runs 'after the matrix, even on failure,' which is honored for matrix failure but not for setup failure. Wrap setup in the same try, or document that setup must be self-cleaning on throw.

🟡 LOW toBenchmarkAdapter().loadTasks() ignores opts.split — src/runtime/define-leaderboard.ts

The adapter accepts split?: string (declared at line 119 to match the BenchmarkAdapter structural contract) but the implementation only honors ids and limit. A registry requesting split: 'holdout' silently receives every case — a quiet contract gap rather than a fail-loud refusal. Either filter by a per-case split field (if the spec exposes one), or throw split is not modeled by this leaderboard so callers don't get a wrong-shaped slice.

🟡 LOW bareModel() has dead null-coalescing fallback — src/runtime/define-leaderboard.ts

model.split('@')[0] ?? model — the ?? model is unreachable because String.prototype.split always returns at least one element. Remove the ?? model or change to model.split('@')[0].

🟡 LOW maxIterations duplicated on naiveDriver and top-level loop options — src/runtime/define-leaderboard.ts

maxIterations: shots appears on both the naiveDriver options (line 445) and the returned loop options object (line 476). Both use the same variable now so consistent, but fragile: if a future edit updates one and not the other, the effective cap becomes the minimum, potentially masking the bug.

🟡 LOW setup() outside try/finally — teardown not called on setup failure — src/runtime/define-leaderboard.ts

setup() at line 481 is called before the try/finally. If setup allocates resources (e.g., opens a fixture connection, starts a local server) and then throws, teardown() never runs. The documented contract says teardown runs 'even on failure' but that refers to matrix failures. Resources allocated in setup before its throw will leak. Fix: wrap setup inside the try block, or wrap the entire (setup + try/finally) in its own try/finally.

🟡 LOW shotNonce is shared across concurrent cells — src/runtime/define-leaderboard.ts

shotNonce is a single mutable counter shared by all cells. When runProfileMatrix dispatches cells concurrently, taskToPrompt reads and increments it without synchronization. Practically benign — different cells have different base prompts so caching isn't an issue — but semantically the nonce should be per-cell or even per-shot to match intent. A non-atomic increment could produce duplicate nonces across cells.

🟡 LOW toBenchmarkAdapter().judge() has dead code guard — src/runtime/define-leaderboard.ts

const [c] = selectCases([task.id]) at line 538 then if (c === undefined) throw at line 539 is dead at runtime. selectCases() always throws for unknown ids (line 272-274) and always returns a populated array for known ids (map preserves length). The guard is needed for TypeScript strict index access but unreachable at runtime. The throw message will never fire.

_{tangletools · 2026-07-03T05:39:13Z · trace}

tangletools

✅ Approved — 18 non-blocking findings — `0d1936d6`

Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-07-03T05:39:13Z · immutable trace}

tangletools

✅ Auto-approved drewstone PR — `76d764d0`

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: drewstone_author · 2026-07-03T06:47:22Z}

tangletools approved these changes Jul 3, 2026

View reviewed changes

tangletools reviewed Jul 3, 2026

View reviewed changes

tangletools previously approved these changes Jul 3, 2026

View reviewed changes

docs(api): regenerate for defineLeaderboard

76d764d

drewstone dismissed tangletools’s stale review via 76d764d July 3, 2026 06:47

tangletools approved these changes Jul 3, 2026

View reviewed changes

drewstone merged commit 3aef976 into main Jul 3, 2026
1 check passed

drewstone deleted the feat/define-leaderboard branch July 3, 2026 06:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(runtime): defineLeaderboard — declarative eval leaderboard facade over runProfileMatrix#453

feat(runtime): defineLeaderboard — declarative eval leaderboard facade over runProfileMatrix#453
drewstone merged 2 commits into
mainfrom
feat/define-leaderboard

drewstone commented Jul 3, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools commented Jul 3, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

drewstone commented Jul 3, 2026

What

Design gates held

Footgun fixed by default

Version notes

Proof

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved drewstone PR — 0d1936d6

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟡 Value Audit — sound-with-nits

💰 Value — sound-with-nits

🎯 Usefulness — sound

🔎 Heuristic Signals

💰 Value Audit

Uh oh!

tangletools commented Jul 3, 2026

✅ No Blockers — 0d1936d6

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Approved — 18 non-blocking findings — 0d1936d6

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved drewstone PR — 76d764d0

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved drewstone PR — `0d1936d6`

✅ No Blockers — `0d1936d6`

✅ Approved — 18 non-blocking findings — `0d1936d6`

✅ Auto-approved drewstone PR — `76d764d0`