From 0d1936d659134df326a4ab0d8c8070e8c0c3c159 Mon Sep 17 00:00:00 2001
From: Drew Stone <drewstone329@gmail.com>
Date: Thu, 2 Jul 2026 22:57:50 -0600
Subject: [PATCH 1/2] =?UTF-8?q?feat(runtime):=20defineLeaderboard=20?=
 =?UTF-8?q?=E2=80=94=20declarative=20eval=20leaderboard=20facade=20over=20?=
 =?UTF-8?q?runProfileMatrix?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A product's harness×model leaderboard reduces to cases + prompt + score:
defineLeaderboard assembles expandProfileAxes × loopDispatch/naiveDriver ×
the backend resolvers into ONE runProfileMatrix call, with toBenchmarkAdapter()
exposing the same domain surface in the structural BenchmarkAdapter shape.
No new execution/judging/metering code — every default is overridable
(backends, flags, parseOutput, onCellEvents, export, dispatch, judges, matrix
passthrough) and runProfileMatrix stays public as the escape floor.

The default run dir is FRESH per invocation (timestamp+pid under tmpdir):
runProfileMatrix caches cells by run dir, so a stable default would silently
resume a prior failed zero-token cell. Only an explicit --run-dir opts in to
that resume.

Requires agent-eval >=0.101.0 (expandProfileAxes / harnessAxisOf /
CODING_HARNESSES); peer floor raised accordingly. 10 offline tests via
inProcessSandboxClient. Minor bump to 0.84.0.
---
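Reviewer note: the fresh-per-invocation run dir described in the commit
message can be sketched roughly as follows. This is a minimal illustration,
not the code in src/runtime/define-leaderboard.ts; the helper name
`resolveLeaderboardRunDir`, its parameters, and the exact directory naming
are assumptions made for the example.

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: the name, signature, and path layout are assumptions
// for illustration, not the patch's actual implementation.
export function resolveLeaderboardRunDir(
  explicitRunDir: string | undefined,
  name: string,
  now: number = Date.now(),
  pid: number = process.pid,
): string {
  // An explicit --run-dir is honored as-is: the caller opts in to reusing
  // whatever cells are already cached under that directory.
  if (explicitRunDir !== undefined) return explicitRunDir;
  // Otherwise timestamp + pid makes the directory unique per invocation,
  // so no cell cached by an earlier (possibly failed) run is picked up.
  return join(tmpdir(), `${name}-${now}-${pid}`);
}
```

Two invocations that pass no run dir get different directories, while two
that pass the same explicit path share its cache.
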
 docs/api/primitive-catalog.md          |  88 ++--
 docs/api/runtime.md                    | 689 +++++++++++++++++++++++++
 docs/canonical-api.md                  |   2 +-
 package.json                           |   6 +-
 pnpm-lock.yaml                         |  53 +-
 src/runtime/define-leaderboard.test.ts | 182 +++++++
 src/runtime/define-leaderboard.ts      | 551 ++++++++++++++++++++
 src/runtime/index.ts                   |  15 +
 8 files changed, 1503 insertions(+), 83 deletions(-)
 create mode 100644 src/runtime/define-leaderboard.test.ts
 create mode 100644 src/runtime/define-leaderboard.ts
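
Reviewer note: the generated `LeaderboardScore` docs in this patch state that
a bare number returned from `score` is shorthand for `{ composite }`, with
`composite` on the [0,1] scale. A minimal sketch of that normalization
follows. The helper name `normalizeScore` is hypothetical, the interface is
mirrored locally to keep the example self-contained, and the range check is
an assumption (the patch may clamp or validate differently).

```typescript
// Local mirror of the documented shape so the sketch compiles on its own.
interface LeaderboardScore {
  composite: number;
  dimensions?: Record<string, number>;
  notes?: string;
}

// Hypothetical helper, not part of the patch: accept either return form of a
// `score` function and always hand back the structured verdict.
export function normalizeScore(raw: number | LeaderboardScore): LeaderboardScore {
  // A bare number is shorthand for `{ composite }`.
  const score: LeaderboardScore = typeof raw === "number" ? { composite: raw } : raw;
  // Assumed guard: reject values outside the documented [0,1] range.
  if (!Number.isFinite(score.composite) || score.composite < 0 || score.composite > 1) {
    throw new RangeError(`composite must be in [0,1], got ${score.composite}`);
  }
  return score;
}
```

Under this sketch, a domain `score` function can return `0.75` for simple
cases and a full object with `dimensions` when it has per-axis detail.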

diff --git a/docs/api/primitive-catalog.md b/docs/api/primitive-catalog.md
index 8320698..01ed065 100644
--- a/docs/api/primitive-catalog.md
+++ b/docs/api/primitive-catalog.md
@@ -7,7 +7,7 @@
 
 # Primitive catalog — the never-stale anti-reinvention inventory
 
-> **GENERATED** from `@tangle-network/agent-runtime@0.83.0` and `@tangle-network/agent-eval@0.100.0` by `scripts/gen-primitive-catalog.mjs`. Do NOT hand-edit — run `pnpm run docs:api`. This is the mechanical companion to the JUDGMENT in `canonical-api.md` (§2 decision table + §1.5 AgentProfile law): that doc says WHICH primitive to reach for and what NOT to build; this catalog proves WHAT exists. Per-symbol signatures + `file:line` live in the per-module pages under `docs/api/`.
+> **GENERATED** from `@tangle-network/agent-runtime@0.84.0` and `@tangle-network/agent-eval@0.103.1` by `scripts/gen-primitive-catalog.mjs`. Do NOT hand-edit — run `pnpm run docs:api`. This is the mechanical companion to the JUDGMENT in `canonical-api.md` (§2 decision table + §1.5 AgentProfile law): that doc says WHICH primitive to reach for and what NOT to build; this catalog proves WHAT exists. Per-symbol signatures + `file:line` live in the per-module pages under `docs/api/`.
 
 ## 1. agent-runtime — own public surface
 
@@ -246,7 +246,7 @@ Import from `@tangle-network/agent-runtime/intelligence` — 63 exports.
 
 ### Recursive atom + loop kernel (alias of ./runtime)
 
-Import from `@tangle-network/agent-runtime/loops` — 409 exports.
+Import from `@tangle-network/agent-runtime/loops` — 419 exports.
 
 | Symbol | Kind | Summary |
 |---|---|---|
@@ -285,6 +285,7 @@ Import from `@tangle-network/agent-runtime/loops` — 409 exports.
 | `decodeToolPart` | function | Decode a part with a specific harness's adapter when known, else try every registered adapter |
 | `defaultSelectWinner` | function | The kernel's winner argmax — best-valid-score, ties broken by earliest index, |
 | `defaultToolDetectors` | function | The default online panel for a tool-call pipe: a worker repeating the same call, or hammering |
+| `defineLeaderboard` | function | _(no summary — add a TSDoc line at the declaration)_ |
 | `definePersona` | function | Build a frozen `Persona`. Fails loud on the executors-supplied invariant: a persona with |
 | `defineStrategy` | function | Author a Strategy from the composable steps — the open, compact way. |
 | `delegate` | function | Delegate an INTENT to a default authoring supervisor and return its `SupervisedResult` unchanged. |
@@ -438,7 +439,14 @@ Import from `@tangle-network/agent-runtime/loops` — 409 exports.
 | `InMemoryRunContextOptions` | interface | Options for the in-memory run context. |
 | `InProcessPromptCtx` | interface | Context handed to each `onPrompt` call. |
 | `Interval` | interface | A 95%-by-default confidence interval. |
+| `LeaderboardBenchmarkAdapter` | interface | Structurally `BenchmarkAdapter` (bench registry shape): `name`, |
+| `LeaderboardBenchScore` | interface | Structurally `BenchScore` (bench registry shape). |
+| `LeaderboardBenchTask` | interface | Structurally `BenchTask` (bench registry shape) — declared locally so this |
+| `LeaderboardFlagSpec` | interface | One extra CLI flag a spec declares. Parsed by `run()` as `--<name> <value>` |
 | `LeaderboardRow` | interface | One leaderboard row — a harness×model profile, every measured column. |
+| `LeaderboardRunContext` | interface | Resolved run configuration handed to `setup` / `teardown` / `export`. |
+| `LeaderboardScenario` | interface | The campaign scenario a case is wrapped into: the case rides along so |
+| `LeaderboardScore` | interface | Structured per-case verdict a `score` function may return (a bare number is |
 | `LoopCampaignDispatchOptions` | interface | Options for adapting plain agent-eval campaign scenarios into runtime `runLoop` cells. |
 | `LoopIterationDispatchPayload` | interface | Where the iteration's worker was placed. `sibling` = a fresh sandbox the |
 | `LoopLineageOptions` | interface | Opt-in box-lineage controls for `runLoop`. Default OFF — with both flags |
@@ -558,7 +566,7 @@ Import from `@tangle-network/agent-runtime/loops` — 409 exports.
 | `WinnerStrategy` | type | Built-in valid-only winner strategies for `selectValidWinner` (selector≠judge): best gated-valid |
 | `WorktreePatchArtifact` | type | Terminal artifact of one worktree-CLI run — the canonical worktree-harness result (the captured |
 
-**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AgentEnvironment`, `AgentEnvironmentCapabilities`, `AgentEnvironmentEvent`, `AgentEnvironmentProvider`, `AgentEnvironmentQuery`, `AgentEnvironmentSummary`, `AgenticOptions`, `AgenticRunResult`, `AgenticTool`, `AgentSession`, `AgentSessionRef`, `AgentTurnInput`, `AgentTurnResult`, `AnalystRegistry`, `AnytimeReport`, `AnytimeStrategySummary`, `ArtifactHandle`, `AuditIntentOptions`, `AuthoredHarness`, `AuthoredStrategy`, `AuthorStrategyOptions`, `BenchmarkConfig`, `BenchmarkLift`, `BenchmarkStrategySummary`, `BenchmarkTaskRow`, `BudgetPool`, `BusStats`, `ChampionPick`, `CheckpointRef`, `CheckpointRequest`, `CreateAgentEnvironmentInput`, `Driver`, `EventBus`, `EvolutionArchiveNode`, `EvolutionBandInfo`, `EvolutionCandidate`, `EvolutionGeneration`, `EvolutionReport`, `ExecRequest`, `ExecResult`, `ForkRequest`, `GitWorkspaceOptions`, `HarvestFailure`, `HarvestReport`, `Inbox`, `InProcessSandboxClientOptions`, `IntentAudit`, `Iteration`, `Leaderboard`, `LeaderboardOptions`, `LoopDecisionPayload`, `LoopDispatchOptions`, `LoopEndedPayload`, `LoopIterationEndedPayload`, `LoopIterationStartedPayload`, `LoopPlanDescription`, `LoopResult`, `LoopSandboxPlacement`, `LoopStartedPayload`, `LoopTraceEmitter`, `LoopWinner`, `McpEnvironmentOptions`, `Observation`, `ObserveOptions`, `OpenSandboxRunOptions`, `PairwiseOptions`, `PatchDeliverableOptions`, `PlacementInfo`, `PromotionGateOptions`, `PromotionVerdict`, `PublishOptions`, `ResourceRequest`, `RouterChatResult`, `RouterChatToolsResult`, `RouterToolLoopResult`, `RunAgenticOptions`, `SandboxRun`, `ShotSpec`, `StrategyEvolutionConfig`, `StrategyResult`, `SuperviseOptions`, `SuperviseSurfaceOptions`, `SupervisorAgentDeps`, `SupervisorOpts`, `SurfaceScore`, `ToolSpec`, `TraceSource`, `ValidationCtx`, `Validator`, `WaterfallCollector`, `WaterfallReport`, `Workspace`, `WorkspaceRequest`, `WorkspaceRun`, `WorktreeCliExecutorOptions`, `WorktreeFanoutOptions`, `AgentEnvironmentStatus`, `AgentSessionStatus`, `ChampionPolicy`, `LoopTraceEvent`, `MakeWorkerAgent`, `WorkspaceCommit`.
+**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AgentEnvironment`, `AgentEnvironmentCapabilities`, `AgentEnvironmentEvent`, `AgentEnvironmentProvider`, `AgentEnvironmentQuery`, `AgentEnvironmentSummary`, `AgenticOptions`, `AgenticRunResult`, `AgenticTool`, `AgentSession`, `AgentSessionRef`, `AgentTurnInput`, `AgentTurnResult`, `AnalystRegistry`, `AnytimeReport`, `AnytimeStrategySummary`, `ArtifactHandle`, `AuditIntentOptions`, `AuthoredHarness`, `AuthoredStrategy`, `AuthorStrategyOptions`, `BenchmarkConfig`, `BenchmarkLift`, `BenchmarkStrategySummary`, `BenchmarkTaskRow`, `BudgetPool`, `BusStats`, `ChampionPick`, `CheckpointRef`, `CheckpointRequest`, `CreateAgentEnvironmentInput`, `DefinedLeaderboard`, `Driver`, `EventBus`, `EvolutionArchiveNode`, `EvolutionBandInfo`, `EvolutionCandidate`, `EvolutionGeneration`, `EvolutionReport`, `ExecRequest`, `ExecResult`, `ForkRequest`, `GitWorkspaceOptions`, `HarvestFailure`, `HarvestReport`, `Inbox`, `InProcessSandboxClientOptions`, `IntentAudit`, `Iteration`, `Leaderboard`, `LeaderboardOptions`, `LeaderboardSpec`, `LoopDecisionPayload`, `LoopDispatchOptions`, `LoopEndedPayload`, `LoopIterationEndedPayload`, `LoopIterationStartedPayload`, `LoopPlanDescription`, `LoopResult`, `LoopSandboxPlacement`, `LoopStartedPayload`, `LoopTraceEmitter`, `LoopWinner`, `McpEnvironmentOptions`, `Observation`, `ObserveOptions`, `OpenSandboxRunOptions`, `PairwiseOptions`, `PatchDeliverableOptions`, `PlacementInfo`, `PromotionGateOptions`, `PromotionVerdict`, `PublishOptions`, `ResourceRequest`, `RouterChatResult`, `RouterChatToolsResult`, `RouterToolLoopResult`, `RunAgenticOptions`, `SandboxRun`, `ShotSpec`, `StrategyEvolutionConfig`, `StrategyResult`, `SuperviseOptions`, `SuperviseSurfaceOptions`, `SupervisorAgentDeps`, `SupervisorOpts`, `SurfaceScore`, `ToolSpec`, `TraceSource`, `ValidationCtx`, `Validator`, `WaterfallCollector`, `WaterfallReport`, `Workspace`, `WorkspaceRequest`, `WorkspaceRun`, `WorktreeCliExecutorOptions`, `WorktreeFanoutOptions`, `AgentEnvironmentStatus`, `AgentSessionStatus`, `ChampionPolicy`, `LoopTraceEvent`, `MakeWorkerAgent`, `WorkspaceCommit`.
 
 ### Environment provider adapters — generic sandbox/compute bridge
 
@@ -824,8 +832,8 @@ Import from `@tangle-network/agent-eval` — 30 exports.
 |---|---|---|
 | `buildAgreementJudge` | function | Build a `JudgeConfig` that scores a produced student artifact against the |
 | `cachedJudge` | function | Wrap a `JudgeConfig` so repeat judgments of the same artifact are served |
-| `calibrateJudge` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `compilerJudge` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `calibrateJudge` | function | Measure judge quality against human gold labels: computes Cohen's κ, Pearson correlation, and MAE over matched item ids. |
+| `compilerJudge` | function | Build a `SandboxJudgeSpec` that scores whether the harness compiles without errors. |
 | `contractJudge` | function | Adapt trace contracts to a campaign `JudgeConfig`. One judge dimension per |
 | `createAntiSlopJudge` | function | Create a reusable Judge function from an anti-slop config. |
 | `createCustomJudge` | function | Create a custom judge with a fully custom prompt. |
@@ -834,16 +842,16 @@ Import from `@tangle-network/agent-eval` — 30 exports.
 | `createSemanticConceptJudge` | function | Factory: pin LLM options once, return a closure that accepts inputs. |
 | `ensembleJudge` | function | Build a campaign-shaped `JudgeConfig` whose `score()` runs every panel |
 | `judgeFamily` | function | Classify a model id into its provider family. Strips a `@snapshot` suffix |
-| `judgeReplayGate` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `judgeSpans` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `linterJudge` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `judgeReplayGate` | function | Confirm a candidate's win with a stronger judge: score baseline and candidate outputs independently, then bootstrap a CI to verify the lift generalises beyond the inner loop. |
+| `judgeSpans` | function | Query judge-kind spans from the trace store, optionally scoped to a single run. |
+| `linterJudge` | function | Build a `SandboxJudgeSpec` that scores the harness by linter rule violations. |
 | `llmJudge` | function | Build a campaign-shaped `JudgeConfig` whose `score()` makes ONE LLM call |
 | `replayTraceThroughJudge` | function | Apply a judge function to every LLM span in a run and record the |
 | `runIntentMatchJudge` | function | Run the intent-match judge. Soft-fails to available=false on error. |
 | `runKeywordCoverageJudge` | function | Score expected concepts against an already-fetched HTML payload + any |
 | `runSemanticConceptJudge` | function | Run the semantic concept judge. Soft-fails to available=false on |
-| `securityJudge` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `testJudge` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `securityJudge` | function | Build a `SandboxJudgeSpec` that scores the harness output for security issues via a security scanner. |
+| `testJudge` | function | Build a `SandboxJudgeSpec` that scores the harness by its test-suite pass rate. |
 | `traceJudge` | function | Wrap a single JudgeFn so its LLM call emits a traced span. |
 | `adversarialJudge` | const | Adversarial judge — red-teams agent responses. |
 | `codeExecutionJudge` | const | Code execution judge — evaluates whether code blocks are valid and runnable. |
@@ -875,11 +883,11 @@ Import from `@tangle-network/agent-eval` — 10 exports.
 | Symbol | Kind | Summary |
 |---|---|---|
 | `gradeSemanticStatus` | function | Grade a semantic-concept-style judge result into a single layer status. |
-| `verifyAgentProfileCell` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `verifyAgentProfileCell` | function | Verify an `AgentProfileCell`'s `cellId` matches the sha256 of its hash-material fields, confirming the record has not been tampered with. |
 | `verifyAttestation` | function | Verify a report against its attestation. Returns a typed outcome rather |
 | `verifyCompletion` | function | Verify whether a run completed the task. `checkCorrectness` is injected — |
 | `verifyManifest` | function | Verify that a signed manifest has not been tampered with. |
-| `MultiLayerVerifier` | class | _(no summary — add a TSDoc line at the declaration)_ |
+| `MultiLayerVerifier` | class | Ordered DAG of verification layers with dependency-based skipping, per-layer findings, soft-fail semantics, and a blended composite score across all passed layers. |
 | `VerificationReport` | interface | Extends the substrate verdict spine: `valid` = `allPass` and `score` = |
 | `LayerStatus` | type | Multi-layer verifier — ordered pipeline of verification layers. |
 
@@ -930,70 +938,78 @@ Import from `@tangle-network/agent-eval` — 49 exports.
 
 ### CAMPAIGN — profile matrix, gates, improvement loop
 
-Import from `@tangle-network/agent-eval/campaign` — 206 exports.
+Import from `@tangle-network/agent-eval/campaign` — 226 exports.
 
 | Symbol | Kind | Summary |
 |---|---|---|
-| `aceProposer` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `aceProposer` | function | Append-only context engineering proposer: grows a skill playbook by appending generation-tagged lessons without merging or overwriting prior entries. |
 | `applySkillPatch` | function | Apply a SkillOpt patch to a text surface. Ops apply in array order against |
 | `buildAnalystSurfaceDispatch` | function | Build the `dispatchWithSurface(surface, scenario, ctx)` the improvement loop |
 | `buildEvidenceVector` | function | The Evidence Bus. For each objective, pair candidate vs baseline by full |
 | `buildLoopProvenanceRecord` | function | Build the durable provenance record from a completed loop result. |
 | `campaignBreakdown` | function | Per-candidate evidence a reflective/patch proposer grounds its next proposal |
 | `campaignMeanComposite` | function | Mean composite across a campaign: per cell, the mean of its judges' |
-| `compareProposers` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `compareProposers` | function | Run a head-to-head lift benchmark across surface proposers on a shared holdout, returning per-proposer lift CIs and pairwise "who wins" verdicts. |
 | `composeGate` | function | Compose gates — all must `ship` for the composite to `ship`. First |
 | `countSentenceEdits` | function | Sentence-level edit distance — count distinct add/remove ops between |
-| `defaultProductionGate` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `defaultRenderDiff` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `defaultProductionGate` | function | Opinionated production gate composing held-out significance, red-team, reward-hacking, and canary checks into a single `Gate.decide` decision. |
+| `defaultRenderDiff` | function | Default surface diff renderer: produces a unified baseline/winner text diff for prompt surfaces or a worktree-ref summary for code surfaces. |
 | `detectScale` | function | Detect the native scale of a set of scores: 0-100 when any magnitude clears |
 | `dimensionRegressions` | function | Per-critical-dimension regression guard. For each dimension, pair the |
+| `discoverEvalFixtures` | function | _(no summary — add a TSDoc line at the declaration)_ |
 | `emitLoopProvenance` | function | Build the provenance record + OTel spans and persist them durably under the |
-| `evolutionaryProposer` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `extractFapoAttributionSignals` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `evolutionaryProposer` | function | Wrap a stateless `Mutator` (GEPA, AxGEPA, reflective-mutation) as a `SurfaceProposer` that mutates the current best surface into N candidates each generation. |
+| `extractFapoAttributionSignals` | function | Scan a findings array and extract FAPO attribution signals — per-level counts and failure clusters used to decide which optimization level to escalate to next. |
 | `extractH2Sections` | function | Extract H2 headings (`## Foo`) from a markdown surface. Exported for |
 | `failureModeRecallJudge` | function | Deterministic, ground-truth judge for analyst findings. Composite = |
-| `fapoEscalationEntry` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `fapoEscalationEntry` | function | Build a `ProposerEntry` that runs the full FAPO escalation policy (prompt → parameter → structural) as a single comparable optimizer entry. |
 | `fapoProposer` | function | Build a FAPO policy proposer from level-specific candidate generators. |
 | `fsCampaignStorage` | function | Node-filesystem storage — the default. Lazily requires `node:fs` so the |
 | `gepaParetoEntry` | function | GEPA with the Pareto frontier + combine-complementary-lessons. |
-| `gepaProposer` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `gepaProposer` | function | GEPA reflective proposer: each generation reflects on the weakest scenarios and dimensions to produce targeted prompt rewrites, optionally combining Pareto-frontier parents. |
 | `gepaReflectionEntry` | function | GEPA, reflection-only (single-parent, no Pareto combine). |
-| `gitWorktreeAdapter` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `gitWorktreeAdapter` | function | Git-backed `WorktreeAdapter`: creates isolated worktrees on fresh branches, commits agent changes, and discards losers. |
 | `haloProposer` | function | Wrap the real halo-engine CLI as a SurfaceProposer (prompt-tier). |
-| `heldOutGate` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `heldOutGate` | function | Composable held-out delta gate: ships only when the candidate's mean composite on `scenarios` beats the baseline by at least `deltaThreshold`. |
 | `heldoutSignificance` | function | Significance of the held-out composite lift: ship only when the paired |
 | `inMemoryCampaignStorage` | function | In-memory storage for filesystem-less runtimes. Artifacts + trace spans |
 | `isProposedCandidate` | function | Type guard: a proposal carrying its rationale vs a bare |
 | `labelTrustRank` | function | Ordinal rank for a label-trust tier; absent ⇒ `unverified` (rank 0). |
 | `llmJudge` | function | Build a campaign-shaped `JudgeConfig` whose `score()` makes ONE LLM call |
+| `loadEvalFixture` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `loadEvalFixtureScenarios` | function | _(no summary — add a TSDoc line at the declaration)_ |
 | `loopProvenanceSpans` | function | Build the loop's OTLP-ingestable spans from a provenance record. One root |
 | `makePlaybackDispatch` | function | Adapt a `PlaybackDriver` into a `runProfileMatrix` dispatch. The artifact the |
 | `memoryCurationProposer` | function | Build the CURATOR proposer. |
-| `openAutoPr` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `openAutoPr` | function | Open a GitHub PR for a gate-approved surface promotion, attaching the manifest hash, gate verdict, and diff as the PR body. |
 | `pairHoldout` | function | Pair candidate vs baseline holdout observations by FULL cellId. `select` |
 | `parameterSweepProposer` | function | Config/parameter-level proposer for FAPO's middle escalation level. |
 | `paretoSignificanceGate` | function | Wrap the bus + a policy as a `Gate`. Plugs into the existing |
-| `parseSkillPatchResponse` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `parseSkillPatchResponse` | function | Parse a SkillOpt LLM response into validated `SkillPatch` objects, throwing `SkillPatchParseError` on malformed JSON and silently dropping ops that violate the edit budget. |
 | `patchEditCount` | function | Total ops in a patch — the edit-budget axis (SkillOpt's "textual learning |
+| `planCampaignRun` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `planEvalFixtureRun` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `policyEditProposer` | function | _(no summary — add a TSDoc line at the declaration)_ |
 | `provenanceRecordPath` | function | Canonical durable paths under the run dir. |
-| `provenanceSpansPath` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `provenanceSpansPath` | function | Canonical path for the durable OTLP spans JSONL file under a loop run directory. |
 | `renderScoreboardMarkdown` | function | Render the scoreboard as a launch-readiness Markdown document — the literal |
+| `resolveRunDir` | function | Resolve a campaign `runDir`. An absolute path is honored as-is (the caller |
 | `resolveWorktreePath` | function | Resolve a `CodeSurface`'s worktreeRef to a directory the measurement can |
-| `runCampaign` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `runEval` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `runImprovementLoop` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `runOptimization` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `runProfileMatrix` | function | _(no summary — add a TSDoc line at the declaration)_ |
-| `runSkillOpt` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `runCampaign` | function | Core campaign orchestrator: fan scenarios through dispatch, score with judges, aggregate bootstrap CIs, and persist reproducible `CampaignResult` records. |
+| `runEval` | function | Simplest evaluation preset: run scenarios through dispatch, score with judges, and return a `CampaignResult` — no optimizer, no gate, no PR. |
+| `runImprovementLoop` | function | Gated-promotion shell over `runOptimization`: scores the winner against the baseline on a holdout set, runs the release gate, and optionally opens a PR. |
+| `runOptimization` | function | Improvement loop body: N generations of propose → campaign → rank, maintaining a Pareto frontier and promoting the top-scoring candidates to the next generation. |
+| `runProfileMatrix` | function | Profile × scenario matrix runner: fan N agent profiles across M scenarios, project each cell to a validated `RunRecord` with real token usage, and enforce the backend-integrity guard before returning. |
+| `runSkillOpt` | function | SkillOpt sequential hill-climb: each epoch reflects on train-scenario weaknesses, proposes bounded patches, accepts the first patch that strictly improves the held-out composite, and anneals the edit  |
 | `scoreboardSummary` | function | Roll the per-requirement rows up into the launch headline counts. |
 | `scoreUserStory` | function | Score one story's produced state against its requirements. Thin wrapper over |
 | `sequentialDecide` | function | `SurfaceProposer.decide` adapter — stops the optimization loop the moment |
 | `sequentialPairedGate` | function | Anytime-valid sequential paired gate. Conforms to the existing `Gate` |
 | `skillOptEntry` | function | SkillOpt patch-mode hill-climb. Runs findings-BLIND: `runSkillOpt` owns its |
-| `skillOptProposer` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `skillOptProposer` | function | SkillOpt proposer: proposes bounded, anchored patch operations (add/delete/replace) on a skill document, conforming to both the patch-native `SkillOptProposer` and the generic `SurfaceProposer` interf |
 | `surfaceContentHash` | function | Stable sha256 (full hex) of a surface's effective text. Code surfaces hash |
-| `surfaceHash` | function | _(no summary — add a TSDoc line at the declaration)_ |
+| `surfaceHash` | function | Short (16-char) sha256 fingerprint of a `MutableSurface`: hashes text content for prompt surfaces, or the worktree + base ref pair for code surfaces. |
+| `tangleTracesRoot` | function | The shared, out-of-repo root for campaign/benchmark run bundles. Keeping run |
 | `traceAnalystProposer` | function | Wrap agent-eval's trace-analyst registry as a SurfaceProposer (prompt-tier). |
 | `userStoryScoreboard` | function | Flatten story verdicts into the per-requirement scoreboard — the literal |
 | `paretoPolicy` | const | The default strategy: symmetric multi-objective Pareto significance. Ship iff |
@@ -1019,7 +1035,6 @@ Import from `@tangle-network/agent-eval/campaign` — 206 exports.
 | `GenerationCandidate` | interface | One scored candidate surface in a generation. `dimensions` + `scenarios` |
 | `GepaProposerConstraints` | interface | `gepaProposer` — a reflective `SurfaceProposer` for prompt-tier surfaces. |
 | `HaloProposerOptions` | interface | `haloProposer` — wraps the REAL halo-engine (Inference.net's hierarchical |
-| `HeldOutGateOptions` | interface | Thin Gate adapter — exposes delta-threshold-on-holdout as a composable |
 | `JudgeConfig` | interface | Pluggable dimensional scorer. `score` is the contract: |
 | `JudgeScore` | interface | The canonical judge verdict shape — one declaration, shared by campaign |
 | `LabeledScenarioWrite` | interface | Required-provenance write. The store rejects writes that |
@@ -1033,6 +1048,7 @@ Import from `@tangle-network/agent-eval/campaign` — 206 exports.
 | `PlaybackContext` | interface | Dispatch context plus the profile under test (which cheap model, etc.). |
 | `PlaybackDriver` | interface | Drives the real product through a story and returns the runtime event stream |
 | `PlaybackStep` | interface | One step of a user story — what the user does. The driver interprets |
+| `PolicyEditProposerOptions` | interface | `policyEditProposer` turns typed analyst policy edits into measured candidate |
 | `ProposeContext` | interface | Everything a proposer may read to plan the next |
 | `ProposedCandidate` | interface | A proposer output carrying the surface AND the WHY behind |
 | `ProposerEntry` | interface | What an optimizer produced: the surface it promoted + what it cost to get |
@@ -1068,7 +1084,7 @@ Import from `@tangle-network/agent-eval/campaign` — 206 exports.
 | `SequentialDecision` | type | Anytime-valid sequential promotion gate — an e-process (betting |
 | `SkillPatchOp` | type | A single bounded edit against a skill surface. |
 
-**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AcceptedEdit`, `ApplySkillPatchResult`, `AxisEvidence`, `BuildAnalystSurfaceDispatchOptions`, `BuildEvidenceVectorOptions`, `BuildLoopProvenanceArgs`, `CampaignAggregates`, `CampaignBreakdown`, `CampaignCellResult`, `CampaignResult`, `CompareProposersOptions`, `DimensionRegression`, `EmitLoopProvenanceArgs`, `EmitLoopProvenanceResult`, `EvidenceVector`, `FailureModeRecallJudgeOptions`, `FapoAttributionSignals`, `FapoFailureCluster`, `FapoProposerOptions`, `FapoReviewInput`, `FapoReviewIssue`, `FapoReviewResult`, `FapoScopeContract`, `GateContext`, `GateResult`, `GenerationRecord`, `GepaProposerOptions`, `GitWorktreeAdapterOptions`, `HeldoutSignificance`, `HeldoutSignificanceOptions`, `JudgeAggregate`, `JudgeDimension`, `LabeledScenarioRecord`, `LabeledScenarioSampleArgs`, `LabeledScenarioStore`, `LlmJudgeOptions`, `LoopProvenanceBackend`, `LoopProvenanceCandidate`, `OpenAutoPrResult`, `OptimizerConfig`, `ParameterCandidate`, `ParameterChange`, `ParameterSweepProposerOptions`, `ParetoSignificanceGateOptions`, `ProfileSummary`, `PromotionObjective`, `ProposePatchesArgs`, `ProposerComparison`, `ProposerPairwise`, `ProposerScore`, `RunImprovementLoopResult`, `RunOptimizationResult`, `RunProfileMatrixOptions`, `RunProfileMatrixResult`, `RunSkillOptResult`, `ScenarioAggregate`, `ScenarioRollup`, `ScoreboardRenderOptions`, `SequentialDecideFn`, `SequentialDecideOptions`, `SequentialObservation`, `SequentialPairedGate`, `SequentialPairedGateOptions`, `SkillOptEpochRecord`, `SkillOptProposer`, `SkillOptProposerOptions`, `SkillPatchRejection`, `TraceSpan`, `WorktreeAdapter`, `JsonPrimitive`, `JsonValue`, `RedactionStatus`, `RunOptimizationOptions`.
+**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AcceptedEdit`, `ApplySkillPatchResult`, `AxisEvidence`, `BuildAnalystSurfaceDispatchOptions`, `BuildEvidenceVectorOptions`, `BuildLoopProvenanceArgs`, `CampaignAggregates`, `CampaignBreakdown`, `CampaignCellResult`, `CampaignResult`, `CampaignRunPlan`, `CampaignRunPlanCell`, `CompareProposersOptions`, `DimensionRegression`, `EmitLoopProvenanceArgs`, `EmitLoopProvenanceResult`, `EvalFixture`, `EvalFixtureFile`, `EvalFixtureLoadOptions`, `EvalFixtureScenario`, `EvidenceVector`, `FailureModeRecallJudgeOptions`, `FapoAttributionSignals`, `FapoFailureCluster`, `FapoProposerOptions`, `FapoReviewInput`, `FapoReviewIssue`, `FapoReviewResult`, `FapoScopeContract`, `GateContext`, `GateResult`, `GenerationRecord`, `GepaProposerOptions`, `GitWorktreeAdapterOptions`, `HeldOutGateOptions`, `HeldoutSignificance`, `HeldoutSignificanceOptions`, `JudgeAggregate`, `JudgeDimension`, `LabeledScenarioRecord`, `LabeledScenarioSampleArgs`, `LabeledScenarioStore`, `LlmJudgeOptions`, `LoadEvalFixtureScenariosOptions`, `LoopProvenanceBackend`, `LoopProvenanceCandidate`, `OpenAutoPrResult`, `OptimizerConfig`, `ParameterCandidate`, `ParameterChange`, `ParameterSweepProposerOptions`, `ParetoSignificanceGateOptions`, `PlanCampaignRunOptions`, `PlanEvalFixtureRunOptions`, `ProfileSummary`, `PromotionObjective`, `ProposePatchesArgs`, `ProposerComparison`, `ProposerPairwise`, `ProposerScore`, `RunImprovementLoopResult`, `RunOptimizationResult`, `RunProfileMatrixOptions`, `RunProfileMatrixResult`, `RunSkillOptResult`, `ScenarioAggregate`, `ScenarioRollup`, `ScoreboardRenderOptions`, `SequentialDecideFn`, `SequentialDecideOptions`, `SequentialObservation`, `SequentialPairedGate`, `SequentialPairedGateOptions`, `SkillOptEpochRecord`, `SkillOptProposer`, `SkillOptProposerOptions`, `SkillPatchRejection`, `TraceSpan`, `WorktreeAdapter`, `EvalFixtureRunPlan`, `EvalFixtureValidationMode`, `JsonPrimitive`, `JsonValue`, `RedactionStatus`, `RunOptimizationOptions`.
 
 ### TOKEN / USAGE — usage extraction + run-record usage types
 
diff --git a/docs/api/runtime.md b/docs/api/runtime.md
index ac4fb41..067d1e5 100644
--- a/docs/api/runtime.md
+++ b/docs/api/runtime.md
@@ -1308,6 +1308,671 @@ Minimum confidence a PROBABILISTIC verdict must clear to end. Default 0.8.
 
 ***
 
+### LeaderboardScore
+
+Defined in: runtime/define-leaderboard.ts:60
+
+Structured per-case verdict a `score` function may return (a bare number is
+ shorthand for `{ composite }`). `composite` is the [0,1] leaderboard score;
+ `dimensions` are recorded as extra judge dimensions.
+
+#### Properties
+
+##### composite
+
+> **composite**: `number`
+
+Defined in: runtime/define-leaderboard.ts:61
+
+##### dimensions?
+
+> `optional` **dimensions?**: `Record`\<`string`, `number`\>
+
+Defined in: runtime/define-leaderboard.ts:62
+
+##### notes?
+
+> `optional` **notes?**: `string`
+
+Defined in: runtime/define-leaderboard.ts:63
+
+***
+
+### LeaderboardScenario
+
+Defined in: runtime/define-leaderboard.ts:68
+
+The campaign scenario a case is wrapped into: the case rides along so
+ judges and hooks can reach the full domain payload, not just its id.
+
+#### Extends
+
+- `Scenario`
+
+#### Type Parameters
+
+##### TCase
+
+`TCase`
+
+#### Properties
+
+##### case
+
+> **case**: `TCase`
+
+Defined in: runtime/define-leaderboard.ts:69
+
+***
+
+### LeaderboardFlagSpec
+
+Defined in: runtime/define-leaderboard.ts:74
+
+One extra CLI flag a spec declares. Parsed by `run()` as `--<name> <value>`
+ and surfaced to every hook via `ctx.args`.
+
+#### Properties
+
+##### default?
+
+> `optional` **default?**: `string`
+
+Defined in: runtime/define-leaderboard.ts:75
+
+##### description
+
+> **description**: `string`
+
+Defined in: runtime/define-leaderboard.ts:76
+
+***
+
+### LeaderboardRunContext
+
+Defined in: runtime/define-leaderboard.ts:80
+
+Resolved run configuration handed to `setup` / `teardown` / `export`.
+
+#### Properties
+
+##### name
+
+> **name**: `string`
+
+Defined in: runtime/define-leaderboard.ts:81
+
+##### backend
+
+> **backend**: `string`
+
+Defined in: runtime/define-leaderboard.ts:83
+
+Execution backend name (`--backend`), a key of `backends`.
+
+##### runDir
+
+> **runDir**: `string`
+
+Defined in: runtime/define-leaderboard.ts:84
+
+##### exportDir
+
+> **exportDir**: `string`
+
+Defined in: runtime/define-leaderboard.ts:85
+
+##### args
+
+> **args**: `Record`\<`string`, `string` \| `undefined`\>
+
+Defined in: runtime/define-leaderboard.ts:87
+
+Every parsed flag (standard + `spec.flags`), by name without `--`.
+
+##### harnesses
+
+> **harnesses**: readonly `HarnessType`[]
+
+Defined in: runtime/define-leaderboard.ts:88
+
+##### models
+
+> **models**: readonly `string`[]
+
+Defined in: runtime/define-leaderboard.ts:90
+
+Snapshot-stamped model ids (`name@snapshot`) — the eval identity models.
+
+##### caseIds
+
+> **caseIds**: readonly `string`[]
+
+Defined in: runtime/define-leaderboard.ts:91
+
+##### shots
+
+> **shots**: `number`
+
+Defined in: runtime/define-leaderboard.ts:92
+
+##### reps
+
+> **reps**: `number`
+
+Defined in: runtime/define-leaderboard.ts:93
+
+***
+
+### LeaderboardBenchTask
+
+Defined in: runtime/define-leaderboard.ts:98
+
+Structurally `BenchTask` (bench registry shape) — declared locally so this
+ module adds no dependency on a benchmark package.
+
+#### Properties
+
+##### id
+
+> **id**: `string`
+
+Defined in: runtime/define-leaderboard.ts:99
+
+##### prompt
+
+> **prompt**: `string`
+
+Defined in: runtime/define-leaderboard.ts:100
+
+##### split?
+
+> `optional` **split?**: `string`
+
+Defined in: runtime/define-leaderboard.ts:101
+
+##### metadata?
+
+> `optional` **metadata?**: `Record`\<`string`, `unknown`\>
+
+Defined in: runtime/define-leaderboard.ts:102
+
+***
+
+### LeaderboardBenchScore
+
+Defined in: runtime/define-leaderboard.ts:106
+
+Structurally `BenchScore` (bench registry shape).
+
+#### Properties
+
+##### resolved
+
+> **resolved**: `boolean`
+
+Defined in: runtime/define-leaderboard.ts:107
+
+##### score
+
+> **score**: `number`
+
+Defined in: runtime/define-leaderboard.ts:108
+
+##### detail?
+
+> `optional` **detail?**: `string`
+
+Defined in: runtime/define-leaderboard.ts:109
+
+***
+
+### LeaderboardBenchmarkAdapter
+
+Defined in: runtime/define-leaderboard.ts:114
+
+Structurally `BenchmarkAdapter` (bench registry shape): `name`,
+ `preflight()`, `loadTasks()`, deterministic `judge()`, `goldArtifact()`.
+
+#### Properties
+
+##### name
+
+> `readonly` **name**: `string`
+
+Defined in: runtime/define-leaderboard.ts:115
+
+#### Methods
+
+##### preflight()
+
+> **preflight**(): `Promise`\<`void`\>
+
+Defined in: runtime/define-leaderboard.ts:116
+
+###### Returns
+
+`Promise`\<`void`\>
+
+##### loadTasks()
+
+> **loadTasks**(`opts?`): `Promise`\<[`LeaderboardBenchTask`](#leaderboardbenchtask)[]\>
+
+Defined in: runtime/define-leaderboard.ts:117
+
+###### Parameters
+
+###### opts?
+
+###### limit?
+
+`number`
+
+###### split?
+
+`string`
+
+###### ids?
+
+`string`[]
+
+###### Returns
+
+`Promise`\<[`LeaderboardBenchTask`](#leaderboardbenchtask)[]\>
+
+##### judge()
+
+> **judge**(`task`, `artifact`): `Promise`\<[`LeaderboardBenchScore`](#leaderboardbenchscore)\>
+
+Defined in: runtime/define-leaderboard.ts:122
+
+###### Parameters
+
+###### task
+
+[`LeaderboardBenchTask`](#leaderboardbenchtask)
+
+###### artifact
+
+`string`
+
+###### Returns
+
+`Promise`\<[`LeaderboardBenchScore`](#leaderboardbenchscore)\>
+
+##### goldArtifact()
+
+> **goldArtifact**(`task`): `Promise`\<`string` \| `undefined`\>
+
+Defined in: runtime/define-leaderboard.ts:123
+
+###### Parameters
+
+###### task
+
+[`LeaderboardBenchTask`](#leaderboardbenchtask)
+
+###### Returns
+
+`Promise`\<`string` \| `undefined`\>
+
+***
+
+### LeaderboardSpec
+
+Defined in: runtime/define-leaderboard.ts:126
+
+#### Type Parameters
+
+##### TCase
+
+`TCase`
+
+#### Properties
+
+##### name
+
+> **name**: `string`
+
+Defined in: runtime/define-leaderboard.ts:128
+
+Leaderboard name — the scenario `kind`, default profile name, and report title.
+
+##### cases
+
+> **cases**: `TCase`[]
+
+Defined in: runtime/define-leaderboard.ts:130
+
+The case corpus. Every case needs a stable string id (see `caseId`).
+
+##### caseId?
+
+> `optional` **caseId?**: (`c`) => `string`
+
+Defined in: runtime/define-leaderboard.ts:133
+
+Stable id extractor. Default: the case's own `id` property (fail-loud
+ when absent or not a string).
+
+###### Parameters
+
+###### c
+
+`TCase`
+
+###### Returns
+
+`string`
+
+##### prompt
+
+> **prompt**: (`c`) => `string` \| `Promise`\<`string`\>
+
+Defined in: runtime/define-leaderboard.ts:136
+
+The per-case task prompt. May be async (e.g. built by shelling out to a
+ reference implementation); resolved ONCE per case before dispatch.
+
+###### Parameters
+
+###### c
+
+`TCase`
+
+###### Returns
+
+`string` \| `Promise`\<`string`\>
+
+##### score
+
+> **score**: (`output`, `c`) => `number` \| [`LeaderboardScore`](#leaderboardscore)
+
+Defined in: runtime/define-leaderboard.ts:140
+
+The domain grader: agent output text → score. Used BOTH as the per-shot
+ validator (a shot with `composite > 0` stops the naive retry loop) and,
+ wrapped as a campaign judge, as the recorded leaderboard score.
+
+###### Parameters
+
+###### output
+
+`string`
+
+###### c
+
+`TCase`
+
+###### Returns
+
+`number` \| [`LeaderboardScore`](#leaderboardscore)
+
+##### axis?
+
+> `optional` **axis?**: `object`
+
+Defined in: runtime/define-leaderboard.ts:144
+
+Harness × model axes for `expandProfileAxes`. Defaults: the canonical
+ `CODING_HARNESSES` × the base profile's `model.default`. `--harnesses` /
+ `--models` override per run.
+
+###### harnesses?
+
+> `optional` **harnesses?**: readonly `HarnessType`[]
+
+###### models?
+
+> `optional` **models?**: readonly `string`[]
+
+##### baseProfile?
+
+> `optional` **baseProfile?**: `AgentProfile`
+
+Defined in: runtime/define-leaderboard.ts:147
+
+Base profile the axes expand over (prompt/tools/skills held fixed).
+ Default: a minimal `{ name, model: { default: <first model> } }`.
+
+##### backends?
+
+> `optional` **backends?**: `Record`\<`string`, (() => [`SandboxClient`](#sandboxclient-3)) \| `undefined`\>
+
+Defined in: runtime/define-leaderboard.ts:157
+
+Execution-backend registry: `--backend <name>` picks the factory that
+yields the `SandboxClient` every cell runs on. Merged over the defaults:
+  - `sandbox` — throws with guidance (a product must supply its real
+    Sandbox-backed client; the facade has no credentials).
+  - `cli-bridge` — `resolveSandboxClient({ backend: 'bridge' })` reading
+    `CLI_BRIDGE_URL` + `BRIDGE_BEARER`/`CLI_BRIDGE_BEARER`; the per-cell
+    harness/model ride in via `sandboxOverrides.backend`.
+
+##### flags?
+
+> `optional` **flags?**: `Record`\<`string`, [`LeaderboardFlagSpec`](#leaderboardflagspec)\>
+
+Defined in: runtime/define-leaderboard.ts:159
+
+Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`.
+
+##### modelBackend?
+
+> `optional` **modelBackend?**: `Record`\<`string`, `unknown`\>
+
+Defined in: runtime/define-leaderboard.ts:163
+
+Extra fields merged into each cell's `backend.model` create override —
+ e.g. `{ provider: 'openai-compat', apiKey, baseUrl }` for a router-backed
+ sandbox. The cell's bare model id is set by the facade from the axis.
+
+##### setup?
+
+> `optional` **setup?**: (`ctx`) => `void` \| `Promise`\<`void`\>
+
+Defined in: runtime/define-leaderboard.ts:165
+
+Runs once before the matrix (fetch fixtures, warm caches).
+
+###### Parameters
+
+###### ctx
+
+[`LeaderboardRunContext`](#leaderboardruncontext)
+
+###### Returns
+
+`void` \| `Promise`\<`void`\>
+
+##### teardown?
+
+> `optional` **teardown?**: (`ctx`) => `void` \| `Promise`\<`void`\>
+
+Defined in: runtime/define-leaderboard.ts:167
+
+Runs once after the matrix, even on failure (reap boxes, close handles).
+
+###### Parameters
+
+###### ctx
+
+[`LeaderboardRunContext`](#leaderboardruncontext)
+
+###### Returns
+
+`void` \| `Promise`\<`void`\>
+
+##### onCellEvents?
+
+> `optional` **onCellEvents?**: (`events`, `c`) => `void`
+
+Defined in: runtime/define-leaderboard.ts:171
+
+Per-cell event tap: the raw sandbox events of each parsed iteration,
+ with the case — the seam for domain metric capture (search counts,
+ citations) without a substrate change.
+
+###### Parameters
+
+###### events
+
+readonly `SandboxEvent`[]
+
+###### c
+
+`TCase`
+
+###### Returns
+
+`void`
+
+##### parseOutput?
+
+> `optional` **parseOutput?**: (`events`, `c`) => `string`
+
+Defined in: runtime/define-leaderboard.ts:175
+
+Output decode override: raw events → the scored output text. Default:
+ the sandbox SDK's `collectAgentResponseText` (final answer text; empty
+ string when the stream carried none — which then scores 0).
+
+###### Parameters
+
+###### events
+
+readonly `SandboxEvent`[]
+
+###### c
+
+`TCase`
+
+###### Returns
+
+`string`
+
+##### export?
+
+> `optional` **export?**: (`result`, `ctx`) => `void` \| `Promise`\<`void`\>
+
+Defined in: runtime/define-leaderboard.ts:178
+
+Result export. Default: write `matrix-result.json` under the run dir and
+ print (+ write) the ranked leaderboard markdown under the export dir.
+
+###### Parameters
+
+###### result
+
+`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>
+
+###### ctx
+
+[`LeaderboardRunContext`](#leaderboardruncontext)
+
+###### Returns
+
+`void` \| `Promise`\<`void`\>
+
+##### dispatch?
+
+> `optional` **dispatch?**: `ProfileDispatchFn`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\>
+
+Defined in: runtime/define-leaderboard.ts:184
+
+LEVEL 2 — full dispatch replacement (in-process products bring their own).
+ The default is `loopDispatch` + `naiveDriver` over the resolved backend.
+
+##### judges?
+
+> `optional` **judges?**: `JudgeConfig`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>[]
+
+Defined in: runtime/define-leaderboard.ts:186
+
+LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge.
+
+##### shots?
+
+> `optional` **shots?**: `number`
+
+Defined in: runtime/define-leaderboard.ts:188
+
+Naive-retry shot cap per cell (`--shots`). Default 1.
+
+##### reps?
+
+> `optional` **reps?**: `number`
+
+Defined in: runtime/define-leaderboard.ts:190
+
+Replicates per cell (`--reps`). Default 1.
+
+##### matrix?
+
+> `optional` **matrix?**: `Partial`\<`RunProfileMatrixOptions`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\>\>
+
+Defined in: runtime/define-leaderboard.ts:194
+
+Passthrough overrides spread onto the final `runProfileMatrix` call
+ (e.g. `maxConcurrency`, `costCeiling`, `integrity`, `storage`) — spread
+ LAST, so anything the facade wired can be overridden.
+
+***
+
+### DefinedLeaderboard
+
+Defined in: runtime/define-leaderboard.ts:197
+
+#### Type Parameters
+
+##### TCase
+
+`TCase`
+
+#### Methods
+
+##### run()
+
+> **run**(`argv?`): `Promise`\<`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>\>
+
+Defined in: runtime/define-leaderboard.ts:211
+
+Parse flags, run the matrix, export, and return the raw result.
+
+Standard flags: `--backend <name>` (default `sandbox`), `--harnesses a,b`,
+`--models m1,m2`, `--cases id1,id2`, `--shots N`, `--reps N`,
+`--model-snapshot <tag>`, `--run-dir <path>`, `--export-dir <path>`,
+plus every `spec.flags` entry. `argv` defaults to `process.argv.slice(2)`.
+
+The default run dir is FRESH per invocation (timestamp+pid under the OS
+tmpdir). `runProfileMatrix` caches cells by run dir, and a stable default
+would silently reuse a prior FAILED zero-token cell and skip dispatch —
+only an explicit `--run-dir` opts into that resume behavior.
+
+###### Parameters
+
+###### argv?
+
+`string`[]
+
+###### Returns
+
+`Promise`\<`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>\>
+
+##### toBenchmarkAdapter()
+
+> **toBenchmarkAdapter**(): [`LeaderboardBenchmarkAdapter`](#leaderboardbenchmarkadapter)
+
+Defined in: runtime/define-leaderboard.ts:213
+
+The same domain surface in the structural `BenchmarkAdapter` shape.
+
+###### Returns
+
+[`LeaderboardBenchmarkAdapter`](#leaderboardbenchmarkadapter)
+
+***
+
 ### HarvestCorpusOptions
 
 Defined in: [runtime/harvest-corpus.ts:28](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/harvest-corpus.ts#L28)
@@ -14650,6 +15315,30 @@ passes. Ground truth — the driver ends directly, no validation. The check read
 
 ***
 
+### defineLeaderboard()
+
+> **defineLeaderboard**\<`TCase`\>(`spec`): [`DefinedLeaderboard`](#definedleaderboard)\<`TCase`\>
+
+Defined in: runtime/define-leaderboard.ts:255
+
+#### Type Parameters
+
+##### TCase
+
+`TCase`
+
+#### Parameters
+
+##### spec
+
+[`LeaderboardSpec`](#leaderboardspec)\<`TCase`\>
+
+#### Returns
+
+[`DefinedLeaderboard`](#definedleaderboard)\<`TCase`\>
+
+***
+
 ### harvestCorpus()
 
 > **harvestCorpus**(`opts`): `Promise`\<[`HarvestReport`](#harvestreport)\>
diff --git a/docs/canonical-api.md b/docs/canonical-api.md
index 260e1bb..9d3c72e 100644
--- a/docs/canonical-api.md
+++ b/docs/canonical-api.md
@@ -2,7 +2,7 @@
 
 <!-- This doc is the JUDGMENT layer: the mental model (§1), the AgentProfile law (§1.5), and the anti-reinvention decision table (§2) — WHICH primitive to reach for and what NOT to build. The export INVENTORY (WHAT exists) and per-symbol signatures + `file:line` are GENERATED into `docs/api/` (TypeDoc + `scripts/gen-primitive-catalog.mjs`, do NOT hand-edit) — that is the mechanical reference: `docs/api/primitive-catalog.md` is the never-stale list of every primitive to reuse. The freshness gate (`pnpm docs:freshness`) FAILS CI if a version pin, a cited `file:line`, a decision-table symbol, or the generated catalog drifts from source — see `docs/MAINTAINING.md`. Keep this file the small, hand-curated spine; never re-list the inventory here — point at the catalog. -->
 
-> **Version 0.83.0.** The export inventory + per-symbol signatures live in the generated `docs/api/` reference: **`docs/api/primitive-catalog.md`** is the never-stale, grouped list of every primitive to reuse (own surface + the agent-eval judge / authenticity / verification / statistics / campaign / token-usage surfaces), with each one's import path and one-line summary read live from source; the per-module pages hold the full signatures. The pinned substrate is agent-eval `>=0.97.0 <1.0.0`; the sandbox substrate that materializes profiles into harness shapes is `@tangle-network/sandbox` (peer `>=0.8.0 <1.0.0`). The neutral contract types (`AgentProfile`, `AgentProfileMcpServer`, `HarnessType`, `ReasoningEffort`, `Part`/`ToolPart`/`ToolState`, plus environment-provider types) are owned by **`@tangle-network/agent-interface`** (peer `>=0.14.0 <1.0.0`) — the single source of truth. Substrate primitives are re-exported through `@tangle-network/agent-eval/contract` (or `/campaign`), not local to this package — the catalog's §2 shows exactly which subpath each lives under.
+> **Version 0.84.0.** The export inventory + per-symbol signatures live in the generated `docs/api/` reference: **`docs/api/primitive-catalog.md`** is the never-stale, grouped list of every primitive to reuse (own surface + the agent-eval judge / authenticity / verification / statistics / campaign / token-usage surfaces), with each one's import path and one-line summary read live from source; the per-module pages hold the full signatures. The pinned substrate is agent-eval `>=0.101.0 <1.0.0`; the sandbox substrate that materializes profiles into harness shapes is `@tangle-network/sandbox` (peer `>=0.8.0 <1.0.0`). The neutral contract types (`AgentProfile`, `AgentProfileMcpServer`, `HarnessType`, `ReasoningEffort`, `Part`/`ToolPart`/`ToolState`, plus environment-provider types) are owned by **`@tangle-network/agent-interface`** (peer `>=0.14.0 <1.0.0`) — the single source of truth. Substrate primitives are re-exported through `@tangle-network/agent-eval/contract` (or `/campaign`), not local to this package — the catalog's §2 shows exactly which subpath each lives under.
 >
 > **`./loops` is the runtime barrel** — `package.json` maps it to `src/runtime/index.ts`. Everything below labelled `/loops` is the recursive-atom + loop-kernel surface.
 >
diff --git a/package.json b/package.json
index 3b07a91..9abcb1d 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-runtime",
-  "version": "0.83.0",
+  "version": "0.84.0",
   "description": "Shared task-lifecycle skeleton for agents: a recursive loop kernel for chat turns, one-shot tasks, and multi-attempt loops, with trace capture and eval-gated self-improvement. Domain behavior lives in adapters; scoring and ship-gates in @tangle-network/agent-eval.",
   "homepage": "https://github.com/tangle-network/agent-runtime#readme",
   "repository": {
@@ -94,7 +94,7 @@
   },
   "devDependencies": {
     "@biomejs/biome": "^2.4.15",
-    "@tangle-network/agent-eval": ">=0.100.0 <1.0.0",
+    "@tangle-network/agent-eval": "^0.103.1",
     "@tangle-network/agent-interface": ">=0.14.0 <1.0.0",
     "@tangle-network/sandbox": ">=0.8.0 <1.0.0",
     "@types/node": "^25.9.3",
@@ -123,7 +123,7 @@
   "license": "MIT",
   "packageManager": "pnpm@10.28.0",
   "peerDependencies": {
-    "@tangle-network/agent-eval": ">=0.97.0 <1.0.0",
+    "@tangle-network/agent-eval": ">=0.101.0 <1.0.0",
     "@tangle-network/agent-interface": ">=0.14.0 <1.0.0",
     "@tangle-network/sandbox": ">=0.8.0 <1.0.0",
     "playwright": "^1.40.0"
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index b9f061f..cce5f93 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -12,8 +12,8 @@ importers:
         specifier: ^2.4.15
         version: 2.4.15
       '@tangle-network/agent-eval':
-        specifier: '>=0.100.0 <1.0.0'
-        version: 0.100.0(typescript@5.9.3)
+        specifier: ^0.103.1
+        version: 0.103.1(typescript@5.9.3)
       '@tangle-network/agent-interface':
         specifier: '>=0.14.0 <1.0.0'
         version: 0.14.0
@@ -636,13 +636,8 @@ packages:
   '@tangle-network/agent-core@0.3.4':
     resolution: {integrity: sha512-Hvz3ABRouNtBmRvGqPxifAO2yuILneJMylWH5jW/jeS2F03RvqkGYuXyGXWWLqosYbb3hVAvSEe4Ykm2FMGEDQ==}
 
-  '@tangle-network/agent-eval@0.100.0':
-    resolution: {integrity: sha512-yBupVJJAqHozhe1BL5xBuDObjvNsoY+XmJo7qfpw/w7rehAXbKliBb4k3XS1G55+GaYPjFA+xwPzlEDQISpMRw==}
-    engines: {node: '>=20'}
-    hasBin: true
-
-  '@tangle-network/agent-integrations@0.29.0':
-    resolution: {integrity: sha512-Avn4oBDTRP5v/3o1xq++uu/9+Rhl2hscIggeFPBGjtVYwhvbsSZL9pRrF3LfjqL9rjx9AocZOdsZC6MXrxKnkg==}
+  '@tangle-network/agent-eval@0.103.1':
+    resolution: {integrity: sha512-9V37IcaRixSfIUkZ50pgU8a5nSVrkVmq5BimNLwVzbi3USwOkkJ9RcecMScpLUnrYNeaoe5Sac8lS6kzL1uTDQ==}
     engines: {node: '>=20'}
     hasBin: true
 
@@ -655,26 +650,6 @@ packages:
   '@tangle-network/agent-interface@0.14.0':
     resolution: {integrity: sha512-9CyGhIpl90E7v4MTm3b1ti3Bp7BfPigk2Nafgi21Lg0U+QxlNB656F2JmVpUuSbOo9aGZPtg5nXu5EBTlV5a1g==}
 
-  '@tangle-network/sandbox@0.3.0':
-    resolution: {integrity: sha512-KfgvKhsUaOpkJe3AD19w7s4hdQekBlXQGoNx0xS4u6vuQk5YnFzBgv+EQeHCkkgETpYOWS2AN+6u/JhSyWStMw==}
-    peerDependencies:
-      '@mastra/core': '*'
-      '@modelcontextprotocol/sdk': '*'
-      ai: '*'
-      openai: ^6.36.0
-      viem: ^2.0.0
-    peerDependenciesMeta:
-      '@mastra/core':
-        optional: true
-      '@modelcontextprotocol/sdk':
-        optional: true
-      ai:
-        optional: true
-      openai:
-        optional: true
-      viem:
-        optional: true
-
   '@tangle-network/sandbox@0.9.5':
     resolution: {integrity: sha512-yvX2OX6uISBVnMQ+v6Upkesa3u8yj6BHxsfcS6p8Vze+M4WBpyhkwA+onzFHuo9rti557ItZn8yDu4a/klljvQ==}
     peerDependencies:
@@ -699,8 +674,8 @@ packages:
     resolution: {integrity: sha512-+TAF9s5t1jOWGyGHvKhIWe2FYmG7puVaxmmg0Et67ylAjGa7GqUAvISXGjG/6dzld7A170V0kQHK0WVdh2Wh0Q==}
     engines: {node: '>=18'}
 
-  '@tangle-network/tcloud@0.4.12':
-    resolution: {integrity: sha512-3Qs90sV0P3LBtrTGC9HW2rwCMDjbScyhZIQU6H2/dVd84S5uKN+tCsURnXE6uu54U766Xa/V3Rcdqqjmgv7AXg==}
+  '@tangle-network/tcloud@0.4.14':
+    resolution: {integrity: sha512-jWYt//cGdLBDOv0luLH6xAGS4gbuOt8uHIkaCWwDDpQ1zp0FUPATHIrA3RMuF0qtQq9Vq00IhLrmCnHdHBP+dg==}
     engines: {node: '>=18'}
     hasBin: true
 
@@ -1629,13 +1604,13 @@ snapshots:
       '@tangle-network/agent-interface': 0.14.0
       zod: 4.4.3
 
-  '@tangle-network/agent-eval@0.100.0(typescript@5.9.3)':
+  '@tangle-network/agent-eval@0.103.1(typescript@5.9.3)':
     dependencies:
       '@asteasolutions/zod-to-openapi': 8.5.0(zod@4.4.3)
       '@ax-llm/ax': 19.0.45(zod@4.4.3)
       '@hono/node-server': 2.0.4(hono@4.12.25)
       '@tangle-network/agent-interface': 0.10.0
-      '@tangle-network/tcloud': 0.4.12(typescript@5.9.3)(zod@4.4.3)
+      '@tangle-network/tcloud': 0.4.14(typescript@5.9.3)(zod@4.4.3)
       hono: 4.12.25
       zod: 4.4.3
     transitivePeerDependencies:
@@ -1647,8 +1622,6 @@ snapshots:
       - typescript
       - utf-8-validate
 
-  '@tangle-network/agent-integrations@0.29.0': {}
-
   '@tangle-network/agent-interface@0.10.0':
     dependencies:
       zod: 4.4.3
@@ -1661,12 +1634,6 @@ snapshots:
     dependencies:
       zod: 4.4.3
 
-  '@tangle-network/sandbox@0.3.0(viem@2.52.2(typescript@5.9.3)(zod@4.4.3))':
-    dependencies:
-      '@tangle-network/agent-integrations': 0.29.0
-    optionalDependencies:
-      viem: 2.52.2(typescript@5.9.3)(zod@4.4.3)
-
   '@tangle-network/sandbox@0.9.5(viem@2.52.2(typescript@5.9.3)(zod@4.4.3))':
     dependencies:
       '@tangle-network/agent-core': 0.3.4
@@ -1676,11 +1643,11 @@ snapshots:
 
   '@tangle-network/tcloud-attestation@0.1.1': {}
 
-  '@tangle-network/tcloud@0.4.12(typescript@5.9.3)(zod@4.4.3)':
+  '@tangle-network/tcloud@0.4.14(typescript@5.9.3)(zod@4.4.3)':
     dependencies:
       '@scure/bip32': 2.2.0
       '@scure/bip39': 2.2.0
-      '@tangle-network/sandbox': 0.3.0(viem@2.52.2(typescript@5.9.3)(zod@4.4.3))
+      '@tangle-network/sandbox': 0.9.5(viem@2.52.2(typescript@5.9.3)(zod@4.4.3))
       '@tangle-network/tcloud-attestation': 0.1.1
       commander: 14.0.3
       viem: 2.52.2(typescript@5.9.3)(zod@4.4.3)
diff --git a/src/runtime/define-leaderboard.test.ts b/src/runtime/define-leaderboard.test.ts
new file mode 100644
index 0000000..9f7a17b
--- /dev/null
+++ b/src/runtime/define-leaderboard.test.ts
@@ -0,0 +1,182 @@
+import { mkdtempSync } from 'node:fs'
+import { tmpdir } from 'node:os'
+import { join } from 'node:path'
+import type { SandboxEvent } from '@tangle-network/sandbox'
+import { describe, expect, it } from 'vitest'
+import { defineLeaderboard, type LeaderboardRunContext } from './define-leaderboard'
+import { inProcessSandboxClient } from './in-process-sandbox-client'
+
+interface FakeCase {
+  id: string
+  answer: string
+}
+
+const CASES: FakeCase[] = [
+  { id: 'case-alpha', answer: 'ALPHA-42' },
+  { id: 'case-beta', answer: 'BETA-7' },
+]
+
+/** Offline backend: echoes the prompt's embedded answer + meters an llm_call,
+ *  so the matrix integrity guard sees a real (non-stub) backend. */
+function fakeBackend() {
+  return inProcessSandboxClient({
+    onPrompt: (prompt): SandboxEvent[] => {
+      const answer = /answer=(\S+)/.exec(prompt)?.[1] ?? 'missing'
+      return [
+        { type: 'llm_call', data: { tokensIn: 12, tokensOut: 6, costUsd: 0.002 } },
+        { type: 'result', data: { finalText: `final answer=${answer}` } },
+      ]
+    },
+  })
+}
+
+function board(overrides: Partial<Parameters<typeof defineLeaderboard<FakeCase>>[0]> = {}) {
+  return defineLeaderboard<FakeCase>({
+    name: 'fake-board',
+    cases: CASES,
+    prompt: async (c) => `solve the task. answer=${c.answer}`,
+    score: (output, c) => (output.includes(c.answer) ? 1 : 0),
+    backends: { inproc: fakeBackend },
+    export: async () => {}, // silence the default table print in tests
+    ...overrides,
+  })
+}
+
+const AXIS = ['--backend', 'inproc', '--harnesses', 'opencode', '--models', 'test-model@2026-01-01']
+
+describe('defineLeaderboard', () => {
+  it('runs the matrix end-to-end offline and scores every (profile, case) cell', async () => {
+    const result = await board().run([...AXIS])
+
+    expect(result.records).toHaveLength(2)
+    expect(Object.keys(result.byScenario).sort()).toEqual(['case-alpha', 'case-beta'])
+    const summaries = Object.values(result.byProfile)
+    expect(summaries).toHaveLength(1)
+    expect(summaries[0]?.meanComposite).toBe(1)
+    expect(summaries[0]?.model).toBe('test-model@2026-01-01')
+    // The fake backend's llm_call events were metered — the run is REAL, not a stub.
+    expect(result.integrity.verdict).toBe('real')
+    for (const r of result.records) expect(r.tokenUsage.input).toBeGreaterThan(0)
+  })
+
+  it('defaults to a FRESH run dir per invocation (no stale cell-cache reuse)', async () => {
+    const dirs: string[] = []
+    const b = board({
+      export: async (_result, ctx: LeaderboardRunContext) => {
+        dirs.push(ctx.runDir)
+      },
+    })
+    await b.run([...AXIS, '--cases', 'case-alpha'])
+    await b.run([...AXIS, '--cases', 'case-alpha'])
+    expect(dirs).toHaveLength(2)
+    expect(dirs[0]).not.toBe(dirs[1])
+    for (const d of dirs) expect(d.startsWith(tmpdir())).toBe(true)
+  })
+
+  it('honors an explicit --run-dir (the opt-in resume path)', async () => {
+    const runDir = mkdtempSync(join(tmpdir(), 'lb-explicit-'))
+    let seen: string | undefined
+    await board({
+      export: async (_r, ctx) => {
+        seen = ctx.runDir
+      },
+    }).run([...AXIS, '--cases', 'case-alpha', '--run-dir', runDir])
+    expect(seen).toBe(runDir)
+  })
+
+  it('subsets cases via --cases and rejects unknown ids', async () => {
+    const result = await board().run([...AXIS, '--cases', 'case-beta'])
+    expect(result.records).toHaveLength(1)
+    expect(Object.keys(result.byScenario)).toEqual(['case-beta'])
+
+    await expect(board().run([...AXIS, '--cases', 'nope'])).rejects.toThrow(/unknown case "nope"/)
+  })
+
+  it('stamps a snapshot onto bare model ids (RunRecord identity requirement)', async () => {
+    const result = await board().run([
+      '--backend',
+      'inproc',
+      '--harnesses',
+      'opencode',
+      '--models',
+      'test-model',
+    ])
+    expect(result.records[0]?.model).toBe('test-model@leaderboard')
+  })
+
+  it('wraps score() as the campaign judge, carrying dimensions and notes', async () => {
+    const result = await board({
+      score: (output, c) => ({
+        composite: output.includes(c.answer) ? 0.5 : 0,
+        dimensions: { exactness: 1 },
+        notes: 'structured',
+      }),
+    }).run([...AXIS, '--cases', 'case-alpha'])
+    expect(Object.values(result.byProfile)[0]?.meanComposite).toBe(0.5)
+    const outcome = result.records[0]?.outcome as { searchScore?: number } | undefined
+    expect(outcome?.searchScore).toBe(0.5)
+  })
+
+  it('feeds each cell raw events + case through onCellEvents (the metric-capture seam)', async () => {
+    const seen: Array<{ id: string; types: string[] }> = []
+    await board({
+      onCellEvents: (events, c) => {
+        seen.push({ id: c.id, types: events.map((e) => (e as { type: string }).type) })
+      },
+    }).run([...AXIS])
+    expect(seen.map((s) => s.id).sort()).toEqual(['case-alpha', 'case-beta'])
+    for (const s of seen) expect(s.types).toContain('llm_call')
+  })
+
+  it('parses spec.flags and surfaces every flag to the hooks via ctx.args', async () => {
+    let args: Record<string, string | undefined> = {}
+    await board({
+      flags: { split: { default: 'dev', description: 'dataset split' } },
+      setup: (ctx) => {
+        args = ctx.args
+      },
+    }).run([...AXIS, '--cases', 'case-alpha', '--split', 'holdout'])
+    expect(args.split).toBe('holdout')
+    expect(args.backend).toBe('inproc')
+    expect(args.harnesses).toBe('opencode')
+  })
+
+  it("fails loud on the default 'sandbox' backend with guidance to supply a real client", async () => {
+    await expect(
+      defineLeaderboard<FakeCase>({
+        name: 'no-backend',
+        cases: CASES,
+        prompt: (c) => c.id,
+        score: () => 0,
+      }).run(['--models', 'm@1']),
+    ).rejects.toThrow(/backends\.sandbox/)
+  })
+
+  it('toBenchmarkAdapter(): loadTasks/judge round-trip in the structural BenchmarkAdapter shape', async () => {
+    const adapter = board().toBenchmarkAdapter()
+    expect(adapter.name).toBe('fake-board')
+    await adapter.preflight()
+
+    const tasks = await adapter.loadTasks()
+    expect(tasks.map((t) => t.id)).toEqual(['case-alpha', 'case-beta'])
+    expect(tasks[0]?.prompt).toContain('answer=ALPHA-42')
+
+    const pass = await adapter.judge(tasks[0] as { id: string; prompt: string }, 'final ALPHA-42')
+    expect(pass).toMatchObject({ resolved: true, score: 1 })
+    const fail = await adapter.judge(tasks[0] as { id: string; prompt: string }, 'wrong')
+    expect(fail).toMatchObject({ resolved: false, score: 0 })
+
+    const subset = await adapter.loadTasks({ ids: ['case-beta'] })
+    expect(subset.map((t) => t.id)).toEqual(['case-beta'])
+    expect(await adapter.goldArtifact(tasks[0] as { id: string; prompt: string })).toBeUndefined()
+
+    // preflight fails loud on duplicate ids — the cheap corpus-integrity check.
+    const dup = defineLeaderboard<FakeCase>({
+      name: 'dup',
+      cases: [CASES[0] as FakeCase, CASES[0] as FakeCase],
+      prompt: (c) => c.id,
+      score: () => 0,
+    }).toBenchmarkAdapter()
+    await expect(dup.preflight()).rejects.toThrow(/duplicate case id/)
+  })
+})
diff --git a/src/runtime/define-leaderboard.ts b/src/runtime/define-leaderboard.ts
new file mode 100644
index 0000000..a31dffa
--- /dev/null
+++ b/src/runtime/define-leaderboard.ts
@@ -0,0 +1,551 @@
+/**
+ * `defineLeaderboard` — the declarative eval-leaderboard facade.
+ *
+ * A product's harness×model leaderboard is always the same assembly: expand a
+ * base profile across the harness×model axes (`expandProfileAxes`), run every
+ * (profile, case) cell as a driven loop (`loopDispatch` + `naiveDriver`), score
+ * with the domain's grader, and emit ONE `runProfileMatrix` call. Each product
+ * hand-rolled that assembly (~650 lines each) and re-hit the same footguns:
+ * stale cell-cache reuse, zero-token stub cells, missing model snapshots.
+ *
+ * This facade IS that assembly, once, with the domain reduced to a declarative
+ * spec: `cases` + `prompt` + `score`. It contains NO execution, judging, or
+ * metering logic of its own — every moving part is an existing primitive, and
+ * every default is overridable:
+ *
+ *   - LEVEL 0 (declarative): `cases` / `prompt` / `score` / `axis`.
+ *   - LEVEL 1 (seams): `backends`, `flags`, `parseOutput`, `onCellEvents`,
+ *     `setup`/`teardown`, `export`, `modelBackend`, `matrix` passthrough.
+ *   - LEVEL 2 (replacement): `dispatch` and `judges` swap out the whole
+ *     loop wiring or scoring; `runProfileMatrix` itself stays public as the
+ *     escape floor — a product overriding everything just writes what it has
+ *     today, no capability removed.
+ *
+ * `toBenchmarkAdapter()` exposes the same domain surface in the structural
+ * `BenchmarkAdapter` shape (`name`/`preflight`/`loadTasks`/`judge`/
+ * `goldArtifact`) so a product leaderboard can register into a benchmark
+ * registry without this module depending on one.
+ *
+ * @experimental
+ */
+import { execFileSync } from 'node:child_process'
+import { mkdirSync, writeFileSync } from 'node:fs'
+import { tmpdir } from 'node:os'
+import { join } from 'node:path'
+import {
+  type AgentProfile,
+  CODING_HARNESSES,
+  expandProfileAxes,
+  type HarnessType,
+  harnessAxisOf,
+} from '@tangle-network/agent-eval'
+import {
+  type JudgeConfig,
+  type ProfileDispatchFn,
+  type RunProfileMatrixOptions,
+  type RunProfileMatrixResult,
+  runProfileMatrix,
+  type Scenario,
+} from '@tangle-network/agent-eval/campaign'
+import { collectAgentResponseText, type SandboxEvent } from '@tangle-network/sandbox'
+import { leaderboard, renderLeaderboardMarkdown } from './benchmark-report'
+import { loopDispatch } from './loop-dispatch'
+import { resolveSandboxClient } from './resolve-sandbox-client'
+import { naiveDriver, type SteeringDecision } from './steering-drivers'
+import type { SandboxClient } from './types'
+
+/** Structured per-case verdict a `score` function may return (a bare number is
+ *  shorthand for `{ composite }`). `composite` is the [0,1] leaderboard score;
+ *  `dimensions` are recorded as extra judge dimensions. */
+export interface LeaderboardScore {
+  composite: number
+  dimensions?: Record<string, number>
+  notes?: string
+}
+
+/** The campaign scenario a case is wrapped into: the case rides along so
+ *  judges and hooks can reach the full domain payload, not just its id. */
+export interface LeaderboardScenario<TCase> extends Scenario {
+  case: TCase
+}
+
+/** One extra CLI flag a spec declares. Parsed by `run()` as `--<name> <value>`
+ *  and surfaced to every hook via `ctx.args`. */
+export interface LeaderboardFlagSpec {
+  default?: string
+  description: string
+}
+
+/** Resolved run configuration handed to `setup` / `teardown` / `export`. */
+export interface LeaderboardRunContext {
+  name: string
+  /** Execution backend name (`--backend`), a key of `backends`. */
+  backend: string
+  runDir: string
+  exportDir: string
+  /** Every parsed flag (standard + `spec.flags`), by name without `--`. */
+  args: Record<string, string | undefined>
+  harnesses: readonly HarnessType[]
+  /** Snapshot-stamped model ids (`name@snapshot`) — the eval identity models. */
+  models: readonly string[]
+  caseIds: readonly string[]
+  shots: number
+  reps: number
+}
+
+/** Structurally `BenchTask` (bench registry shape) — declared locally so this
+ *  module adds no dependency on a benchmark package. */
+export interface LeaderboardBenchTask {
+  id: string
+  prompt: string
+  split?: string
+  metadata?: Record<string, unknown>
+}
+
+/** Structurally `BenchScore` (bench registry shape). */
+export interface LeaderboardBenchScore {
+  resolved: boolean
+  score: number
+  detail?: string
+}
+
+/** Structurally `BenchmarkAdapter` (bench registry shape): `name`,
+ *  `preflight()`, `loadTasks()`, deterministic `judge()`, `goldArtifact()`. */
+export interface LeaderboardBenchmarkAdapter {
+  readonly name: string
+  preflight(): Promise<void>
+  loadTasks(opts?: {
+    limit?: number
+    split?: string
+    ids?: string[]
+  }): Promise<LeaderboardBenchTask[]>
+  judge(task: LeaderboardBenchTask, artifact: string): Promise<LeaderboardBenchScore>
+  goldArtifact(task: LeaderboardBenchTask): Promise<string | undefined>
+}
+
+export interface LeaderboardSpec<TCase> {
+  /** Leaderboard name — the scenario `kind`, default profile name, and report title. */
+  name: string
+  /** The case corpus. Every case needs a stable string id (see `caseId`). */
+  cases: TCase[]
+  /** Stable id extractor. Default: the case's own `id` property (fail-loud
+   *  when absent or not a string). */
+  caseId?: (c: TCase) => string
+  /** The per-case task prompt. May be async (e.g. built by shelling out to a
+   *  reference implementation); resolved ONCE per case before dispatch. */
+  prompt: (c: TCase) => string | Promise<string>
+  /** The domain grader: agent output text → score. Used BOTH as the per-shot
+   *  validator (a shot with `composite > 0` stops the naive retry loop) and,
+   *  wrapped as a campaign judge, as the recorded leaderboard score. */
+  score: (output: string, c: TCase) => number | LeaderboardScore
+  /** Harness × model axes for `expandProfileAxes`. Defaults: the canonical
+   *  `CODING_HARNESSES` × the base profile's `model.default`. `--harnesses` /
+   *  `--models` override per run. */
+  axis?: { harnesses?: readonly HarnessType[]; models?: readonly string[] }
+  /** Base profile the axes expand over (prompt/tools/skills held fixed).
+   *  Default: a minimal `{ name, model: { default: <first model> } }`. */
+  baseProfile?: AgentProfile
+  /**
+   * Execution-backend registry: `--backend <name>` picks the factory that
+   * yields the `SandboxClient` every cell runs on. Merged over the defaults:
+   *   - `sandbox` — throws with guidance (a product must supply its real
+   *     Sandbox-backed client; the facade has no credentials).
+   *   - `cli-bridge` — `resolveSandboxClient({ backend: 'bridge' })` reading
+   *     `CLI_BRIDGE_URL` + `BRIDGE_BEARER`/`CLI_BRIDGE_BEARER`; the per-cell
+   *     harness/model ride in via `sandboxOverrides.backend`.
+   */
+  backends?: Record<string, (() => SandboxClient) | undefined>
+  /** Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`. */
+  flags?: Record<string, LeaderboardFlagSpec>
+  /** Extra fields merged into each cell's `backend.model` create override —
+   *  e.g. `{ provider: 'openai-compat', apiKey, baseUrl }` for a router-backed
+   *  sandbox. The cell's bare model id is set by the facade from the axis. */
+  modelBackend?: Record<string, unknown>
+  /** Runs once before the matrix (fetch fixtures, warm caches). */
+  setup?: (ctx: LeaderboardRunContext) => Promise<void> | void
+  /** Runs once after the matrix, even on failure (reap boxes, close handles). */
+  teardown?: (ctx: LeaderboardRunContext) => Promise<void> | void
+  /** Per-cell event tap: the raw sandbox events of each parsed iteration,
+   *  with the case — the seam for domain metric capture (search counts,
+   *  citations) without a substrate change. */
+  onCellEvents?: (events: readonly SandboxEvent[], c: TCase) => void
+  /** Output decode override: raw events → the scored output text. Default:
+   *  the sandbox SDK's `collectAgentResponseText` (final answer text; empty
+   *  string when the stream carried none — which then scores 0). */
+  parseOutput?: (events: readonly SandboxEvent[], c: TCase) => string
+  /** Result export. Default: write `matrix-result.json` under the run dir and
+   *  print (+ write) the ranked leaderboard markdown under the export dir. */
+  export?: (
+    result: RunProfileMatrixResult<string, LeaderboardScenario<TCase>>,
+    ctx: LeaderboardRunContext,
+  ) => Promise<void> | void
+  /** LEVEL 2 — full dispatch replacement (in-process products bring their own).
+   *  The default is `loopDispatch` + `naiveDriver` over the resolved backend. */
+  dispatch?: ProfileDispatchFn<LeaderboardScenario<TCase>, string>
+  /** LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge. */
+  judges?: JudgeConfig<string, LeaderboardScenario<TCase>>[]
+  /** Naive-retry shot cap per cell (`--shots`). Default 1. */
+  shots?: number
+  /** Replicates per cell (`--reps`). Default 1. */
+  reps?: number
+  /** Passthrough overrides spread onto the final `runProfileMatrix` call
+   *  (e.g. `maxConcurrency`, `costCeiling`, `integrity`, `storage`) — spread
+   *  LAST, so anything the facade wired can be overridden. */
+  matrix?: Partial<RunProfileMatrixOptions<LeaderboardScenario<TCase>, string>>
+}
+
+export interface DefinedLeaderboard<TCase> {
+  /**
+   * Parse flags, run the matrix, export, and return the raw result.
+   *
+   * Standard flags: `--backend <name>` (default `sandbox`), `--harnesses a,b`,
+   * `--models m1,m2`, `--cases id1,id2`, `--shots N`, `--reps N`,
+   * `--model-snapshot <tag>`, `--run-dir <path>`, `--export-dir <path>`,
+   * plus every `spec.flags` entry. `argv` defaults to `process.argv.slice(2)`.
+   *
+   * The default run dir is FRESH per invocation (timestamp+pid under the OS
+   * tmpdir). `runProfileMatrix` caches cells by run dir, and a stable default
+   * would silently reuse a prior FAILED zero-token cell and skip dispatch —
+   * only an explicit `--run-dir` opts into that resume behavior.
+   */
+  run(argv?: string[]): Promise<RunProfileMatrixResult<string, LeaderboardScenario<TCase>>>
+  /** The same domain surface in the structural `BenchmarkAdapter` shape. */
+  toBenchmarkAdapter(): LeaderboardBenchmarkAdapter
+}
+
+/** Read `--name <value>` from argv (first occurrence; `--name=value` is not parsed). */
+function argOf(argv: readonly string[], name: string): string | undefined {
+  const i = argv.indexOf(`--${name}`)
+  if (i >= 0 && i + 1 < argv.length) return argv[i + 1]
+  return undefined
+}
+
+function splitList(v: string | undefined): string[] | undefined {
+  if (v === undefined) return undefined
+  const parts = v
+    .split(',')
+    .map((s) => s.trim())
+    .filter(Boolean)
+  return parts.length > 0 ? parts : undefined
+}
+
+/** RunRecords reject a bare model id: the eval IDENTITY model must carry a
+ *  snapshot (`name@<snapshot>`). An id that already contains `@` is returned as-is. */
+function withSnapshot(model: string, snapshot: string): string {
+  return model.includes('@') ? model : `${model}@${snapshot}`
+}
+
+/** The bare model id the backend actually serves (identity snapshot stripped). */
+function bareModel(model: string): string {
+  return model.split('@')[0] ?? model
+}
+
+function gitSha(): string {
+  try {
+    return execFileSync('git', ['rev-parse', 'HEAD'], { encoding: 'utf8' }).trim()
+  } catch {
+    return 'unknown'
+  }
+}
+
+function normalizeScore(s: number | LeaderboardScore): LeaderboardScore {
+  return typeof s === 'number' ? { composite: s } : s
+}
+
+export function defineLeaderboard<TCase>(spec: LeaderboardSpec<TCase>): DefinedLeaderboard<TCase> {
+  const caseId = (c: TCase): string => {
+    const id = spec.caseId ? spec.caseId(c) : (c as { id?: unknown }).id
+    if (typeof id !== 'string' || id.length === 0) {
+      throw new Error(
+        `defineLeaderboard(${spec.name}): every case needs a stable string id — ` +
+          'give cases an `id` property or supply spec.caseId',
+      )
+    }
+    return id
+  }
+
+  const selectCases = (ids?: readonly string[]): TCase[] => {
+    if (!ids) return spec.cases
+    const byId = new Map(spec.cases.map((c) => [caseId(c), c]))
+    return ids.map((id) => {
+      const c = byId.get(id)
+      if (c === undefined) {
+        throw new Error(
+          `defineLeaderboard(${spec.name}): unknown case "${id}" (have: ${[...byId.keys()].join(', ')})`,
+        )
+      }
+      return c
+    })
+  }
+
+  const scoreJudge: JudgeConfig<string, LeaderboardScenario<TCase>> = {
+    name: `${spec.name}-score`,
+    dimensions: [{ key: 'composite', description: `${spec.name} case score` }],
+    score({ artifact, scenario }) {
+      const s = normalizeScore(spec.score(artifact, scenario.case))
+      return {
+        composite: s.composite,
+        dimensions: { composite: s.composite, ...s.dimensions },
+        notes: s.notes ?? '',
+      }
+    },
+  }
+
+  async function run(
+    argv: string[] = process.argv.slice(2),
+  ): Promise<RunProfileMatrixResult<string, LeaderboardScenario<TCase>>> {
+    const args: Record<string, string | undefined> = {}
+    for (const name of [
+      'backend',
+      'harnesses',
+      'models',
+      'cases',
+      'shots',
+      'reps',
+      'model-snapshot',
+      'run-dir',
+      'export-dir',
+    ]) {
+      args[name] = argOf(argv, name)
+    }
+    for (const [name, flag] of Object.entries(spec.flags ?? {})) {
+      args[name] = argOf(argv, name) ?? flag.default
+    }
+
+    const backendName = args.backend ?? 'sandbox'
+    const shots = Number(args.shots ?? spec.shots ?? 1)
+    const reps = Number(args.reps ?? spec.reps ?? 1)
+    const snapshot = args['model-snapshot'] ?? 'leaderboard'
+    // FRESH run dir per invocation: runProfileMatrix caches cells by run dir,
+    // and a stable default resumes a prior FAILED zero-token cell without
+    // re-dispatching. Only an explicit --run-dir opts into resume.
+    const runDir =
+      args['run-dir'] ?? join(tmpdir(), `leaderboard-${spec.name}-${Date.now()}-${process.pid}`)
+    const exportDir = args['export-dir'] ?? join(runDir, 'export')
+    mkdirSync(runDir, { recursive: true })
+
+    const cases = selectCases(splitList(args.cases))
+    if (cases.length === 0) throw new Error(`defineLeaderboard(${spec.name}): no cases to run`)
+    const scenarios: LeaderboardScenario<TCase>[] = cases.map((c) => ({
+      id: caseId(c),
+      kind: spec.name,
+      case: c,
+    }))
+
+    const harnesses =
+      (splitList(args.harnesses) as HarnessType[] | undefined) ??
+      spec.axis?.harnesses ??
+      CODING_HARNESSES
+    const rawModels =
+      splitList(args.models) ??
+      spec.axis?.models ??
+      (spec.baseProfile?.model?.default !== undefined
+        ? [spec.baseProfile.model.default]
+        : undefined)
+    if (!rawModels || rawModels.length === 0) {
+      throw new Error(
+        `defineLeaderboard(${spec.name}): no models — pass --models, set spec.axis.models, ` +
+          'or give spec.baseProfile a model.default',
+      )
+    }
+    const models = rawModels.map((m) => withSnapshot(m, snapshot))
+
+    const base: AgentProfile =
+      spec.baseProfile ??
+      ({ name: spec.name, model: { default: bareModel(models[0] ?? '') } } as AgentProfile)
+    const profiles = expandProfileAxes({ base, harnesses, models })
+
+    const ctx: LeaderboardRunContext = {
+      name: spec.name,
+      backend: backendName,
+      runDir,
+      exportDir,
+      args,
+      harnesses,
+      models,
+      caseIds: scenarios.map((s) => s.id),
+      shots,
+      reps,
+    }
+
+    // Backend registry: defaults + spec overrides (spec wins). Factories are
+    // lazy so an unused backend never resolves credentials.
+    const backends: Record<string, (() => SandboxClient) | undefined> = {
+      sandbox: () => {
+        throw new Error(
+          `defineLeaderboard(${spec.name}): the 'sandbox' backend needs your product's real ` +
+            'SandboxClient — supply spec.backends.sandbox (e.g. () => new SandboxClient({ apiKey, baseUrl }))',
+        )
+      },
+      'cli-bridge': () => {
+        const bearer = process.env.BRIDGE_BEARER ?? process.env.CLI_BRIDGE_BEARER
+        if (!bearer) {
+          throw new Error(
+            `defineLeaderboard(${spec.name}): backend 'cli-bridge' needs BRIDGE_BEARER or CLI_BRIDGE_BEARER set`,
+          )
+        }
+        return resolveSandboxClient({
+          backend: 'bridge',
+          bridge: {
+            url: process.env.CLI_BRIDGE_URL,
+            bearer,
+            model: bareModel(models[0] ?? ''),
+            timeoutMs: 900_000,
+          },
+        })
+      },
+      ...spec.backends,
+    }
+    const makeClient = backends[backendName]
+    if (!makeClient) {
+      throw new Error(
+        `defineLeaderboard(${spec.name}): unknown backend "${backendName}" (have: ${Object.keys(backends).join(', ')})`,
+      )
+    }
+    const sandboxClient = makeClient()
+
+    // Prompts resolve ONCE per case, up front — spec.prompt may be async
+    // (shelling out to a reference implementation) but the loop kernel's
+    // taskToPrompt is sync.
+    const promptById = new Map<string, string>()
+    for (const s of scenarios) promptById.set(s.id, await spec.prompt(s.case))
+    const promptOf = (s: LeaderboardScenario<TCase>): string => {
+      const p = promptById.get(s.id)
+      if (p === undefined)
+        throw new Error(`defineLeaderboard(${spec.name}): no prompt for case "${s.id}"`)
+      return p
+    }
+
+    // Monotonic per-shot nonce appended to each shot's prompt — defeats router
+    // response-caching of byte-identical prompts across naive-retry shots.
+    let shotNonce = 0
+
+    const dispatch =
+      spec.dispatch ??
+      loopDispatch<
+        LeaderboardScenario<TCase>,
+        string,
+        SteeringDecision,
+        LeaderboardScenario<TCase>,
+        string
+      >({
+        sandboxClient,
+        toLoopOptions: (scenario, profile) => {
+          // The cell's harness + model come off the profile's axis stamp set
+          // by expandProfileAxes; the sandbox create override carries them to
+          // whichever backend client runs the cell.
+          const axis = harnessAxisOf(profile)
+          const modelId = bareModel(axis?.model ?? models[0] ?? '')
+          return {
+            // naiveDriver = the no-signal retry floor: re-run the same case as an
+            // independent attempt until one scores (>0) or the shot cap is hit.
+            driver: naiveDriver<LeaderboardScenario<TCase>, string>({
+              continuation: '',
+              applyContinuation: (task) => task,
+              maxIterations: shots,
+            }),
+            agentRun: {
+              profile,
+              taskToPrompt: (s) => `${promptOf(s)}\n\n<!-- independent-attempt:${shotNonce++} -->`,
+              ...(axis
+                ? {
+                    sandboxOverrides: {
+                      backend: {
+                        type: axis.harness,
+                        model: { ...spec.modelBackend, model: modelId },
+                      },
+                    } as never,
+                  }
+                : {}),
+            },
+            output: {
+              parse: (events) => {
+                spec.onCellEvents?.(events, scenario.case)
+                return spec.parseOutput
+                  ? spec.parseOutput(events, scenario.case)
+                  : (collectAgentResponseText(events) ?? '')
+              },
+            },
+            validator: {
+              validate: async (output: string) => {
+                const s = normalizeScore(spec.score(output, scenario.case))
+                return { valid: s.composite > 0, score: s.composite }
+              },
+            },
+            task: scenario,
+            maxIterations: shots,
+          }
+        },
+      })
+
+    await spec.setup?.(ctx)
+    try {
+      const result = await runProfileMatrix<LeaderboardScenario<TCase>, string>({
+        profiles,
+        scenarios,
+        dispatch,
+        judges: spec.judges ?? [scoreJudge],
+        runDir,
+        commitSha: gitSha(),
+        reps,
+        ...spec.matrix,
+      })
+
+      if (spec.export) {
+        await spec.export(result, ctx)
+      } else {
+        mkdirSync(exportDir, { recursive: true })
+        writeFileSync(join(runDir, 'matrix-result.json'), `${JSON.stringify(result, null, 2)}\n`)
+        const table = renderLeaderboardMarkdown(
+          leaderboard(result.records, { title: spec.name, meta: { backend: backendName } }),
+        )
+        writeFileSync(join(exportDir, 'leaderboard.md'), table)
+        console.log(table)
+      }
+      return result
+    } finally {
+      await spec.teardown?.(ctx)
+    }
+  }
+
+  function toBenchmarkAdapter(): LeaderboardBenchmarkAdapter {
+    return {
+      name: spec.name,
+      async preflight(): Promise<void> {
+        // Case-id integrity is the cheap, real check: duplicate or missing ids
+        // corrupt every downstream join.
+        const seen = new Set<string>()
+        for (const c of spec.cases) {
+          const id = caseId(c)
+          if (seen.has(id)) {
+            throw new Error(`defineLeaderboard(${spec.name}): duplicate case id "${id}"`)
+          }
+          seen.add(id)
+        }
+      },
+      async loadTasks(opts): Promise<LeaderboardBenchTask[]> {
+        const selected = selectCases(opts?.ids)
+        const limited = opts?.limit !== undefined ? selected.slice(0, opts.limit) : selected
+        return Promise.all(
+          limited.map(async (c) => ({
+            id: caseId(c),
+            prompt: await spec.prompt(c),
+            metadata: { case: c },
+          })),
+        )
+      },
+      async judge(task, artifact): Promise<LeaderboardBenchScore> {
+        const [c] = selectCases([task.id])
+        if (c === undefined)
+          throw new Error(`defineLeaderboard(${spec.name}): no case "${task.id}"`)
+        const s = normalizeScore(spec.score(artifact, c))
+        return { resolved: s.composite > 0, score: s.composite, detail: s.notes }
+      },
+      async goldArtifact(): Promise<string | undefined> {
+        return undefined
+      },
+    }
+  }
+
+  return { run, toBenchmarkAdapter }
+}
diff --git a/src/runtime/index.ts b/src/runtime/index.ts
index da18bc2..72c0739 100644
--- a/src/runtime/index.ts
+++ b/src/runtime/index.ts
@@ -87,6 +87,21 @@ export {
   sentinelCompletion,
   stopSentinel,
 } from './completion'
+// The declarative eval-leaderboard facade: cases + prompt + score → one
+// runProfileMatrix call (expandProfileAxes × loopDispatch × naiveDriver),
+// with a structural BenchmarkAdapter view via toBenchmarkAdapter().
+export {
+  type DefinedLeaderboard,
+  defineLeaderboard,
+  type LeaderboardBenchmarkAdapter,
+  type LeaderboardBenchScore,
+  type LeaderboardBenchTask,
+  type LeaderboardFlagSpec,
+  type LeaderboardRunContext,
+  type LeaderboardScenario,
+  type LeaderboardScore,
+  type LeaderboardSpec,
+} from './define-leaderboard'
 export {
   type AgentEnvironment,
   type AgentEnvironmentCapabilities,

From 76d764d078b69c3cef107fd06c88b96392fe9ce2 Mon Sep 17 00:00:00 2001
From: Drew Stone <drewstone329@gmail.com>
Date: Fri, 3 Jul 2026 00:47:08 -0600
Subject: [PATCH 2/2] docs(api): regenerate for defineLeaderboard

---
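Reviewer note (not part of the commit message): a standalone sketch of how
`run()` appears to turn `--models` / `--model-snapshot` into the snapshot-stamped
identity ids and the bare ids the backend serves. The four helpers are copied
from src/runtime/define-leaderboard.ts, where they are module-private. The
`resolveModels` wrapper and the model names are hypothetical, added only for
illustration; the real `run()` does this inline and also falls back to
`spec.axis.models` or `spec.baseProfile.model.default`.

```typescript
// Copied from src/runtime/define-leaderboard.ts (module-private there).
function argOf(argv: readonly string[], name: string): string | undefined {
  const i = argv.indexOf(`--${name}`)
  if (i >= 0 && i + 1 < argv.length) return argv[i + 1]
  return undefined
}

function splitList(v: string | undefined): string[] | undefined {
  if (v === undefined) return undefined
  const parts = v
    .split(',')
    .map((s) => s.trim())
    .filter(Boolean)
  return parts.length > 0 ? parts : undefined
}

function withSnapshot(model: string, snapshot: string): string {
  return model.includes('@') ? model : `${model}@${snapshot}`
}

function bareModel(model: string): string {
  return model.split('@')[0] ?? model
}

// HYPOTHETICAL wrapper for illustration only. Unlike run(), it returns empty
// lists instead of throwing when no models are given.
function resolveModels(argv: readonly string[]): { identity: string[]; served: string[] } {
  const snapshot = argOf(argv, 'model-snapshot') ?? 'leaderboard'
  const identity = (splitList(argOf(argv, 'models')) ?? []).map((m) => withSnapshot(m, snapshot))
  return { identity, served: identity.map(bareModel) }
}

// Placeholder model names, not real model ids.
const resolved = resolveModels(['--models', 'model-a, model-b@2026-06', '--model-snapshot', 'july'])
console.log(resolved.identity) // [ 'model-a@july', 'model-b@2026-06' ]
console.log(resolved.served) // [ 'model-a', 'model-b' ]
```

An id that already carries `@` keeps its own snapshot, so `--model-snapshot`
only fills in ids that were passed bare. With no `--model-snapshot` flag the
stamp defaults to `leaderboard`.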
 docs/api/runtime.md | 120 ++++++++++++++++++++++----------------------
 1 file changed, 60 insertions(+), 60 deletions(-)

diff --git a/docs/api/runtime.md b/docs/api/runtime.md
index 067d1e5..265a155 100644
--- a/docs/api/runtime.md
+++ b/docs/api/runtime.md
@@ -1310,7 +1310,7 @@ Minimum confidence a PROBABILISTIC verdict must clear to end. Default 0.8.
 
 ### LeaderboardScore
 
-Defined in: runtime/define-leaderboard.ts:60
+Defined in: [runtime/define-leaderboard.ts:60](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L60)
 
 Structured per-case verdict a `score` function may return (a bare number is
  shorthand for `{ composite }`). `composite` is the [0,1] leaderboard score;
@@ -1322,25 +1322,25 @@ Structured per-case verdict a `score` function may return (a bare number is
 
 > **composite**: `number`
 
-Defined in: runtime/define-leaderboard.ts:61
+Defined in: [runtime/define-leaderboard.ts:61](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L61)
 
 ##### dimensions?
 
 > `optional` **dimensions?**: `Record`\<`string`, `number`\>
 
-Defined in: runtime/define-leaderboard.ts:62
+Defined in: [runtime/define-leaderboard.ts:62](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L62)
 
 ##### notes?
 
 > `optional` **notes?**: `string`
 
-Defined in: runtime/define-leaderboard.ts:63
+Defined in: [runtime/define-leaderboard.ts:63](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L63)
 
 ***
 
 ### LeaderboardScenario
 
-Defined in: runtime/define-leaderboard.ts:68
+Defined in: [runtime/define-leaderboard.ts:68](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L68)
 
 The campaign scenario a case is wrapped into: the case rides along so
  judges and hooks can reach the full domain payload, not just its id.
@@ -1361,13 +1361,13 @@ The campaign scenario a case is wrapped into: the case rides along so
 
 > **case**: `TCase`
 
-Defined in: runtime/define-leaderboard.ts:69
+Defined in: [runtime/define-leaderboard.ts:69](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L69)
 
 ***
 
 ### LeaderboardFlagSpec
 
-Defined in: runtime/define-leaderboard.ts:74
+Defined in: [runtime/define-leaderboard.ts:74](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L74)
 
 One extra CLI flag a spec declares. Parsed by `run()` as `--<name> <value>`
  and surfaced to every hook via `ctx.args`.
@@ -1378,19 +1378,19 @@ One extra CLI flag a spec declares. Parsed by `run()` as `--<name> <value>`
 
 > `optional` **default?**: `string`
 
-Defined in: runtime/define-leaderboard.ts:75
+Defined in: [runtime/define-leaderboard.ts:75](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L75)
 
 ##### description
 
 > **description**: `string`
 
-Defined in: runtime/define-leaderboard.ts:76
+Defined in: [runtime/define-leaderboard.ts:76](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L76)
 
 ***
 
 ### LeaderboardRunContext
 
-Defined in: runtime/define-leaderboard.ts:80
+Defined in: [runtime/define-leaderboard.ts:80](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L80)
 
 Resolved run configuration handed to `setup` / `teardown` / `export`.
 
@@ -1400,13 +1400,13 @@ Resolved run configuration handed to `setup` / `teardown` / `export`.
 
 > **name**: `string`
 
-Defined in: runtime/define-leaderboard.ts:81
+Defined in: [runtime/define-leaderboard.ts:81](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L81)
 
 ##### backend
 
 > **backend**: `string`
 
-Defined in: runtime/define-leaderboard.ts:83
+Defined in: [runtime/define-leaderboard.ts:83](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L83)
 
 Execution backend name (`--backend`), a key of `backends`.
 
@@ -1414,19 +1414,19 @@ Execution backend name (`--backend`), a key of `backends`.
 
 > **runDir**: `string`
 
-Defined in: runtime/define-leaderboard.ts:84
+Defined in: [runtime/define-leaderboard.ts:84](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L84)
 
 ##### exportDir
 
 > **exportDir**: `string`
 
-Defined in: runtime/define-leaderboard.ts:85
+Defined in: [runtime/define-leaderboard.ts:85](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L85)
 
 ##### args
 
 > **args**: `Record`\<`string`, `string` \| `undefined`\>
 
-Defined in: runtime/define-leaderboard.ts:87
+Defined in: [runtime/define-leaderboard.ts:87](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L87)
 
 Every parsed flag (standard + `spec.flags`), by name without `--`.
 
@@ -1434,13 +1434,13 @@ Every parsed flag (standard + `spec.flags`), by name without `--`.
 
 > **harnesses**: readonly `HarnessType`[]
 
-Defined in: runtime/define-leaderboard.ts:88
+Defined in: [runtime/define-leaderboard.ts:88](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L88)
 
 ##### models
 
 > **models**: readonly `string`[]
 
-Defined in: runtime/define-leaderboard.ts:90
+Defined in: [runtime/define-leaderboard.ts:90](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L90)
 
 Snapshot-stamped model ids (`name@snapshot`) — the eval identity models.
 
@@ -1448,25 +1448,25 @@ Snapshot-stamped model ids (`name@snapshot`) — the eval identity models.
 
 > **caseIds**: readonly `string`[]
 
-Defined in: runtime/define-leaderboard.ts:91
+Defined in: [runtime/define-leaderboard.ts:91](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L91)
 
 ##### shots
 
 > **shots**: `number`
 
-Defined in: runtime/define-leaderboard.ts:92
+Defined in: [runtime/define-leaderboard.ts:92](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L92)
 
 ##### reps
 
 > **reps**: `number`
 
-Defined in: runtime/define-leaderboard.ts:93
+Defined in: [runtime/define-leaderboard.ts:93](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L93)
 
 ***
 
 ### LeaderboardBenchTask
 
-Defined in: runtime/define-leaderboard.ts:98
+Defined in: [runtime/define-leaderboard.ts:98](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L98)
 
 Structurally `BenchTask` (bench registry shape) — declared locally so this
  module adds no dependency on a benchmark package.
@@ -1477,31 +1477,31 @@ Structurally `BenchTask` (bench registry shape) — declared locally so this
 
 > **id**: `string`
 
-Defined in: runtime/define-leaderboard.ts:99
+Defined in: [runtime/define-leaderboard.ts:99](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L99)
 
 ##### prompt
 
 > **prompt**: `string`
 
-Defined in: runtime/define-leaderboard.ts:100
+Defined in: [runtime/define-leaderboard.ts:100](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L100)
 
 ##### split?
 
 > `optional` **split?**: `string`
 
-Defined in: runtime/define-leaderboard.ts:101
+Defined in: [runtime/define-leaderboard.ts:101](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L101)
 
 ##### metadata?
 
 > `optional` **metadata?**: `Record`\<`string`, `unknown`\>
 
-Defined in: runtime/define-leaderboard.ts:102
+Defined in: [runtime/define-leaderboard.ts:102](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L102)
 
 ***
 
 ### LeaderboardBenchScore
 
-Defined in: runtime/define-leaderboard.ts:106
+Defined in: [runtime/define-leaderboard.ts:106](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L106)
 
 Structurally `BenchScore` (bench registry shape).
 
@@ -1511,25 +1511,25 @@ Structurally `BenchScore` (bench registry shape).
 
 > **resolved**: `boolean`
 
-Defined in: runtime/define-leaderboard.ts:107
+Defined in: [runtime/define-leaderboard.ts:107](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L107)
 
 ##### score
 
 > **score**: `number`
 
-Defined in: runtime/define-leaderboard.ts:108
+Defined in: [runtime/define-leaderboard.ts:108](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L108)
 
 ##### detail?
 
 > `optional` **detail?**: `string`
 
-Defined in: runtime/define-leaderboard.ts:109
+Defined in: [runtime/define-leaderboard.ts:109](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L109)
 
 ***
 
 ### LeaderboardBenchmarkAdapter
 
-Defined in: runtime/define-leaderboard.ts:114
+Defined in: [runtime/define-leaderboard.ts:114](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L114)
 
 Structurally `BenchmarkAdapter` (bench registry shape): `name`,
  `preflight()`, `loadTasks()`, deterministic `judge()`, `goldArtifact()`.
@@ -1540,7 +1540,7 @@ Structurally `BenchmarkAdapter` (bench registry shape): `name`,
 
 > `readonly` **name**: `string`
 
-Defined in: runtime/define-leaderboard.ts:115
+Defined in: [runtime/define-leaderboard.ts:115](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L115)
 
 #### Methods
 
@@ -1548,7 +1548,7 @@ Defined in: runtime/define-leaderboard.ts:115
 
 > **preflight**(): `Promise`\<`void`\>
 
-Defined in: runtime/define-leaderboard.ts:116
+Defined in: [runtime/define-leaderboard.ts:116](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L116)
 
 ###### Returns
 
@@ -1558,7 +1558,7 @@ Defined in: runtime/define-leaderboard.ts:116
 
 > **loadTasks**(`opts?`): `Promise`\<[`LeaderboardBenchTask`](#leaderboardbenchtask)[]\>
 
-Defined in: runtime/define-leaderboard.ts:117
+Defined in: [runtime/define-leaderboard.ts:117](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L117)
 
 ###### Parameters
 
@@ -1584,7 +1584,7 @@ Defined in: runtime/define-leaderboard.ts:117
 
 > **judge**(`task`, `artifact`): `Promise`\<[`LeaderboardBenchScore`](#leaderboardbenchscore)\>
 
-Defined in: runtime/define-leaderboard.ts:122
+Defined in: [runtime/define-leaderboard.ts:122](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L122)
 
 ###### Parameters
 
@@ -1604,7 +1604,7 @@ Defined in: runtime/define-leaderboard.ts:122
 
 > **goldArtifact**(`task`): `Promise`\<`string` \| `undefined`\>
 
-Defined in: runtime/define-leaderboard.ts:123
+Defined in: [runtime/define-leaderboard.ts:123](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L123)
 
 ###### Parameters
 
@@ -1620,7 +1620,7 @@ Defined in: runtime/define-leaderboard.ts:123
 
 ### LeaderboardSpec
 
-Defined in: runtime/define-leaderboard.ts:126
+Defined in: [runtime/define-leaderboard.ts:126](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L126)
 
 #### Type Parameters
 
@@ -1634,7 +1634,7 @@ Defined in: runtime/define-leaderboard.ts:126
 
 > **name**: `string`
 
-Defined in: runtime/define-leaderboard.ts:128
+Defined in: [runtime/define-leaderboard.ts:128](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L128)
 
 Leaderboard name — the scenario `kind`, default profile name, and report title.
 
@@ -1642,7 +1642,7 @@ Leaderboard name — the scenario `kind`, default profile name, and report title
 
 > **cases**: `TCase`[]
 
-Defined in: runtime/define-leaderboard.ts:130
+Defined in: [runtime/define-leaderboard.ts:130](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L130)
 
 The case corpus. Every case needs a stable string id (see `caseId`).
 
@@ -1650,7 +1650,7 @@ The case corpus. Every case needs a stable string id (see `caseId`).
 
 > `optional` **caseId?**: (`c`) => `string`
 
-Defined in: runtime/define-leaderboard.ts:133
+Defined in: [runtime/define-leaderboard.ts:133](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L133)
 
 Stable id extractor. Default: the case's own `id` property (fail-loud
  when absent or not a string).
@@ -1669,7 +1669,7 @@ Stable id extractor. Default: the case's own `id` property (fail-loud
 
 > **prompt**: (`c`) => `string` \| `Promise`\<`string`\>
 
-Defined in: runtime/define-leaderboard.ts:136
+Defined in: [runtime/define-leaderboard.ts:136](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L136)
 
 The per-case task prompt. May be async (e.g. built by shelling out to a
  reference implementation); resolved ONCE per case before dispatch.
@@ -1688,7 +1688,7 @@ The per-case task prompt. May be async (e.g. built by shelling out to a
 
 > **score**: (`output`, `c`) => `number` \| [`LeaderboardScore`](#leaderboardscore)
 
-Defined in: runtime/define-leaderboard.ts:140
+Defined in: [runtime/define-leaderboard.ts:140](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L140)
 
 The domain grader: agent output text → score. Used BOTH as the per-shot
  validator (a shot with `composite > 0` stops the naive retry loop) and,
@@ -1712,7 +1712,7 @@ The domain grader: agent output text → score. Used BOTH as the per-shot
 
 > `optional` **axis?**: `object`
 
-Defined in: runtime/define-leaderboard.ts:144
+Defined in: [runtime/define-leaderboard.ts:144](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L144)
 
 Harness × model axes for `expandProfileAxes`. Defaults: the canonical
  `CODING_HARNESSES` × the base profile's `model.default`. `--harnesses` /
@@ -1730,7 +1730,7 @@ Harness × model axes for `expandProfileAxes`. Defaults: the canonical
 
 > `optional` **baseProfile?**: `AgentProfile`
 
-Defined in: runtime/define-leaderboard.ts:147
+Defined in: [runtime/define-leaderboard.ts:147](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L147)
 
 Base profile the axes expand over (prompt/tools/skills held fixed).
  Default: a minimal `{ name, model: { default: <first model> } }`.
@@ -1739,7 +1739,7 @@ Base profile the axes expand over (prompt/tools/skills held fixed).
 
 > `optional` **backends?**: `Record`\<`string`, (() => [`SandboxClient`](#sandboxclient-3)) \| `undefined`\>
 
-Defined in: runtime/define-leaderboard.ts:157
+Defined in: [runtime/define-leaderboard.ts:157](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L157)
 
 Execution-backend registry: `--backend <name>` picks the factory that
 yields the `SandboxClient` every cell runs on. Merged over the defaults:
@@ -1753,7 +1753,7 @@ yields the `SandboxClient` every cell runs on. Merged over the defaults:
 
 > `optional` **flags?**: `Record`\<`string`, [`LeaderboardFlagSpec`](#leaderboardflagspec)\>
 
-Defined in: runtime/define-leaderboard.ts:159
+Defined in: [runtime/define-leaderboard.ts:159](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L159)
 
 Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`.
 
@@ -1761,7 +1761,7 @@ Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`.
 
 > `optional` **modelBackend?**: `Record`\<`string`, `unknown`\>
 
-Defined in: runtime/define-leaderboard.ts:163
+Defined in: [runtime/define-leaderboard.ts:163](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L163)
 
 Extra fields merged into each cell's `backend.model` create override —
  e.g. `{ provider: 'openai-compat', apiKey, baseUrl }` for a router-backed
@@ -1771,7 +1771,7 @@ Extra fields merged into each cell's `backend.model` create override —
 
 > `optional` **setup?**: (`ctx`) => `void` \| `Promise`\<`void`\>
 
-Defined in: runtime/define-leaderboard.ts:165
+Defined in: [runtime/define-leaderboard.ts:165](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L165)
 
 Runs once before the matrix (fetch fixtures, warm caches).
 
@@ -1789,7 +1789,7 @@ Runs once before the matrix (fetch fixtures, warm caches).
 
 > `optional` **teardown?**: (`ctx`) => `void` \| `Promise`\<`void`\>
 
-Defined in: runtime/define-leaderboard.ts:167
+Defined in: [runtime/define-leaderboard.ts:167](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L167)
 
 Runs once after the matrix, even on failure (reap boxes, close handles).
 
@@ -1807,7 +1807,7 @@ Runs once after the matrix, even on failure (reap boxes, close handles).
 
 > `optional` **onCellEvents?**: (`events`, `c`) => `void`
 
-Defined in: runtime/define-leaderboard.ts:171
+Defined in: [runtime/define-leaderboard.ts:171](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L171)
 
 Per-cell event tap: the raw sandbox events of each parsed iteration,
  with the case — the seam for domain metric capture (search counts,
@@ -1831,7 +1831,7 @@ readonly `SandboxEvent`[]
 
 > `optional` **parseOutput?**: (`events`, `c`) => `string`
 
-Defined in: runtime/define-leaderboard.ts:175
+Defined in: [runtime/define-leaderboard.ts:175](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L175)
 
 Output decode override: raw events → the scored output text. Default:
  the sandbox SDK's `collectAgentResponseText` (final answer text; empty
@@ -1855,7 +1855,7 @@ readonly `SandboxEvent`[]
 
 > `optional` **export?**: (`result`, `ctx`) => `void` \| `Promise`\<`void`\>
 
-Defined in: runtime/define-leaderboard.ts:178
+Defined in: [runtime/define-leaderboard.ts:178](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L178)
 
 Result export. Default: write `matrix-result.json` under the run dir and
  print (+ write) the ranked leaderboard markdown under the export dir.
@@ -1878,7 +1878,7 @@ Result export. Default: write `matrix-result.json` under the run dir and
 
 > `optional` **dispatch?**: `ProfileDispatchFn`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\>
 
-Defined in: runtime/define-leaderboard.ts:184
+Defined in: [runtime/define-leaderboard.ts:184](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L184)
 
 LEVEL 2 — full dispatch replacement (in-process products bring their own).
  The default is `loopDispatch` + `naiveDriver` over the resolved backend.
@@ -1887,7 +1887,7 @@ LEVEL 2 — full dispatch replacement (in-process products bring their own).
 
 > `optional` **judges?**: `JudgeConfig`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>[]
 
-Defined in: runtime/define-leaderboard.ts:186
+Defined in: [runtime/define-leaderboard.ts:186](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L186)
 
 LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge.
 
@@ -1895,7 +1895,7 @@ LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge.
 
 > `optional` **shots?**: `number`
 
-Defined in: runtime/define-leaderboard.ts:188
+Defined in: [runtime/define-leaderboard.ts:188](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L188)
 
 Naive-retry shot cap per cell (`--shots`). Default 1.
 
@@ -1903,7 +1903,7 @@ Naive-retry shot cap per cell (`--shots`). Default 1.
 
 > `optional` **reps?**: `number`
 
-Defined in: runtime/define-leaderboard.ts:190
+Defined in: [runtime/define-leaderboard.ts:190](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L190)
 
 Replicates per cell (`--reps`). Default 1.
 
@@ -1911,7 +1911,7 @@ Replicates per cell (`--reps`). Default 1.
 
 > `optional` **matrix?**: `Partial`\<`RunProfileMatrixOptions`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\>\>
 
-Defined in: runtime/define-leaderboard.ts:194
+Defined in: [runtime/define-leaderboard.ts:194](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L194)
 
 Passthrough overrides spread onto the final `runProfileMatrix` call
  (e.g. `maxConcurrency`, `costCeiling`, `integrity`, `storage`) — spread
@@ -1921,7 +1921,7 @@ Passthrough overrides spread onto the final `runProfileMatrix` call
 
 ### DefinedLeaderboard
 
-Defined in: runtime/define-leaderboard.ts:197
+Defined in: [runtime/define-leaderboard.ts:197](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L197)
 
 #### Type Parameters
 
@@ -1935,7 +1935,7 @@ Defined in: runtime/define-leaderboard.ts:197
 
 > **run**(`argv?`): `Promise`\<`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>\>
 
-Defined in: runtime/define-leaderboard.ts:211
+Defined in: [runtime/define-leaderboard.ts:211](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L211)
 
 Parse flags, run the matrix, export, and return the raw result.
 
@@ -1963,7 +1963,7 @@ only an explicit `--run-dir` opts into that resume behavior.
 
 > **toBenchmarkAdapter**(): [`LeaderboardBenchmarkAdapter`](#leaderboardbenchmarkadapter)
 
-Defined in: runtime/define-leaderboard.ts:213
+Defined in: [runtime/define-leaderboard.ts:213](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L213)
 
 The same domain surface in the structural `BenchmarkAdapter` shape.
 
@@ -15319,7 +15319,7 @@ passes. Ground truth — the driver ends directly, no validation. The check read
 
 > **defineLeaderboard**\<`TCase`\>(`spec`): [`DefinedLeaderboard`](#definedleaderboard)\<`TCase`\>
 
-Defined in: runtime/define-leaderboard.ts:255
+Defined in: [runtime/define-leaderboard.ts:255](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L255)
 
 #### Type Parameters