From 0d1936d659134df326a4ab0d8c8070e8c0c3c159 Mon Sep 17 00:00:00 2001
From: Drew Stone
Date: Thu, 2 Jul 2026 22:57:50 -0600
Subject: [PATCH 1/2] =?UTF-8?q?feat(runtime):=20defineLeaderboard=20?=
 =?UTF-8?q?=E2=80=94=20declarative=20eval=20leaderboard=20facade=20over=20?=
 =?UTF-8?q?runProfileMatrix?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A product's harness×model leaderboard reduces to cases + prompt + score:
defineLeaderboard assembles expandProfileAxes × loopDispatch/naiveDriver
× the backend resolvers into ONE runProfileMatrix call, with
toBenchmarkAdapter() exposing the same domain surface in the structural
BenchmarkAdapter shape. No new execution/judging/metering code: every
default is overridable (backends, flags, parseOutput, onCellEvents,
export, dispatch, judges, matrix passthrough) and runProfileMatrix stays
public as the escape floor.

The default run dir is FRESH per invocation (timestamp+pid under tmpdir):
runProfileMatrix caches cells by run dir, and a stable default silently
resumes a prior failed zero-token cell; only an explicit --run-dir opts
in.

Requires agent-eval >=0.101.0 (expandProfileAxes / harnessAxisOf /
CODING_HARNESSES); peer floor raised accordingly. 10 offline tests via
inProcessSandboxClient. Minor bump to 0.84.0.
--- docs/api/primitive-catalog.md | 88 ++-- docs/api/runtime.md | 689 +++++++++++++++++++++++++ docs/canonical-api.md | 2 +- package.json | 6 +- pnpm-lock.yaml | 53 +- src/runtime/define-leaderboard.test.ts | 182 +++++++ src/runtime/define-leaderboard.ts | 551 ++++++++++++++++++++ src/runtime/index.ts | 15 + 8 files changed, 1503 insertions(+), 83 deletions(-) create mode 100644 src/runtime/define-leaderboard.test.ts create mode 100644 src/runtime/define-leaderboard.ts diff --git a/docs/api/primitive-catalog.md b/docs/api/primitive-catalog.md index 8320698..01ed065 100644 --- a/docs/api/primitive-catalog.md +++ b/docs/api/primitive-catalog.md @@ -7,7 +7,7 @@ # Primitive catalog — the never-stale anti-reinvention inventory -> **GENERATED** from `@tangle-network/agent-runtime@0.83.0` and `@tangle-network/agent-eval@0.100.0` by `scripts/gen-primitive-catalog.mjs`. Do NOT hand-edit — run `pnpm run docs:api`. This is the mechanical companion to the JUDGMENT in `canonical-api.md` (§2 decision table + §1.5 AgentProfile law): that doc says WHICH primitive to reach for and what NOT to build; this catalog proves WHAT exists. Per-symbol signatures + `file:line` live in the per-module pages under `docs/api/`. +> **GENERATED** from `@tangle-network/agent-runtime@0.84.0` and `@tangle-network/agent-eval@0.103.1` by `scripts/gen-primitive-catalog.mjs`. Do NOT hand-edit — run `pnpm run docs:api`. This is the mechanical companion to the JUDGMENT in `canonical-api.md` (§2 decision table + §1.5 AgentProfile law): that doc says WHICH primitive to reach for and what NOT to build; this catalog proves WHAT exists. Per-symbol signatures + `file:line` live in the per-module pages under `docs/api/`. ## 1. agent-runtime — own public surface @@ -246,7 +246,7 @@ Import from `@tangle-network/agent-runtime/intelligence` — 63 exports. ### Recursive atom + loop kernel (alias of ./runtime) -Import from `@tangle-network/agent-runtime/loops` — 409 exports. 
+Import from `@tangle-network/agent-runtime/loops` — 419 exports. | Symbol | Kind | Summary | |---|---|---| @@ -285,6 +285,7 @@ Import from `@tangle-network/agent-runtime/loops` — 409 exports. | `decodeToolPart` | function | Decode a part with a specific harness's adapter when known, else try every registered adapter | | `defaultSelectWinner` | function | The kernel's winner argmax — best-valid-score, ties broken by earliest index, | | `defaultToolDetectors` | function | The default online panel for a tool-call pipe: a worker repeating the same call, or hammering | +| `defineLeaderboard` | function | _(no summary — add a TSDoc line at the declaration)_ | | `definePersona` | function | Build a frozen `Persona`. Fails loud on the executors-supplied invariant: a persona with | | `defineStrategy` | function | Author a Strategy from the composable steps — the open, compact way. | | `delegate` | function | Delegate an INTENT to a default authoring supervisor and return its `SupervisedResult` unchanged. | @@ -438,7 +439,14 @@ Import from `@tangle-network/agent-runtime/loops` — 409 exports. | `InMemoryRunContextOptions` | interface | Options for the in-memory run context. | | `InProcessPromptCtx` | interface | Context handed to each `onPrompt` call. | | `Interval` | interface | A 95%-by-default confidence interval. | +| `LeaderboardBenchmarkAdapter` | interface | Structurally `BenchmarkAdapter` (bench registry shape): `name`, | +| `LeaderboardBenchScore` | interface | Structurally `BenchScore` (bench registry shape). | +| `LeaderboardBenchTask` | interface | Structurally `BenchTask` (bench registry shape) — declared locally so this | +| `LeaderboardFlagSpec` | interface | One extra CLI flag a spec declares. Parsed by `run()` as `-- ` | | `LeaderboardRow` | interface | One leaderboard row — a harness×model profile, every measured column. | +| `LeaderboardRunContext` | interface | Resolved run configuration handed to `setup` / `teardown` / `export`. 
| +| `LeaderboardScenario` | interface | The campaign scenario a case is wrapped into: the case rides along so | +| `LeaderboardScore` | interface | Structured per-case verdict a `score` function may return (a bare number is | | `LoopCampaignDispatchOptions` | interface | Options for adapting plain agent-eval campaign scenarios into runtime `runLoop` cells. | | `LoopIterationDispatchPayload` | interface | Where the iteration's worker was placed. `sibling` = a fresh sandbox the | | `LoopLineageOptions` | interface | Opt-in box-lineage controls for `runLoop`. Default OFF — with both flags | @@ -558,7 +566,7 @@ Import from `@tangle-network/agent-runtime/loops` — 409 exports. | `WinnerStrategy` | type | Built-in valid-only winner strategies for `selectValidWinner` (selector≠judge): best gated-valid | | `WorktreePatchArtifact` | type | Terminal artifact of one worktree-CLI run — the canonical worktree-harness result (the captured | -**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AgentEnvironment`, `AgentEnvironmentCapabilities`, `AgentEnvironmentEvent`, `AgentEnvironmentProvider`, `AgentEnvironmentQuery`, `AgentEnvironmentSummary`, `AgenticOptions`, `AgenticRunResult`, `AgenticTool`, `AgentSession`, `AgentSessionRef`, `AgentTurnInput`, `AgentTurnResult`, `AnalystRegistry`, `AnytimeReport`, `AnytimeStrategySummary`, `ArtifactHandle`, `AuditIntentOptions`, `AuthoredHarness`, `AuthoredStrategy`, `AuthorStrategyOptions`, `BenchmarkConfig`, `BenchmarkLift`, `BenchmarkStrategySummary`, `BenchmarkTaskRow`, `BudgetPool`, `BusStats`, `ChampionPick`, `CheckpointRef`, `CheckpointRequest`, `CreateAgentEnvironmentInput`, `Driver`, `EventBus`, `EvolutionArchiveNode`, `EvolutionBandInfo`, `EvolutionCandidate`, `EvolutionGeneration`, `EvolutionReport`, `ExecRequest`, `ExecResult`, `ForkRequest`, `GitWorkspaceOptions`, `HarvestFailure`, `HarvestReport`, `Inbox`, `InProcessSandboxClientOptions`, `IntentAudit`, `Iteration`, `Leaderboard`, 
`LeaderboardOptions`, `LoopDecisionPayload`, `LoopDispatchOptions`, `LoopEndedPayload`, `LoopIterationEndedPayload`, `LoopIterationStartedPayload`, `LoopPlanDescription`, `LoopResult`, `LoopSandboxPlacement`, `LoopStartedPayload`, `LoopTraceEmitter`, `LoopWinner`, `McpEnvironmentOptions`, `Observation`, `ObserveOptions`, `OpenSandboxRunOptions`, `PairwiseOptions`, `PatchDeliverableOptions`, `PlacementInfo`, `PromotionGateOptions`, `PromotionVerdict`, `PublishOptions`, `ResourceRequest`, `RouterChatResult`, `RouterChatToolsResult`, `RouterToolLoopResult`, `RunAgenticOptions`, `SandboxRun`, `ShotSpec`, `StrategyEvolutionConfig`, `StrategyResult`, `SuperviseOptions`, `SuperviseSurfaceOptions`, `SupervisorAgentDeps`, `SupervisorOpts`, `SurfaceScore`, `ToolSpec`, `TraceSource`, `ValidationCtx`, `Validator`, `WaterfallCollector`, `WaterfallReport`, `Workspace`, `WorkspaceRequest`, `WorkspaceRun`, `WorktreeCliExecutorOptions`, `WorktreeFanoutOptions`, `AgentEnvironmentStatus`, `AgentSessionStatus`, `ChampionPolicy`, `LoopTraceEvent`, `MakeWorkerAgent`, `WorkspaceCommit`. 
+**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AgentEnvironment`, `AgentEnvironmentCapabilities`, `AgentEnvironmentEvent`, `AgentEnvironmentProvider`, `AgentEnvironmentQuery`, `AgentEnvironmentSummary`, `AgenticOptions`, `AgenticRunResult`, `AgenticTool`, `AgentSession`, `AgentSessionRef`, `AgentTurnInput`, `AgentTurnResult`, `AnalystRegistry`, `AnytimeReport`, `AnytimeStrategySummary`, `ArtifactHandle`, `AuditIntentOptions`, `AuthoredHarness`, `AuthoredStrategy`, `AuthorStrategyOptions`, `BenchmarkConfig`, `BenchmarkLift`, `BenchmarkStrategySummary`, `BenchmarkTaskRow`, `BudgetPool`, `BusStats`, `ChampionPick`, `CheckpointRef`, `CheckpointRequest`, `CreateAgentEnvironmentInput`, `DefinedLeaderboard`, `Driver`, `EventBus`, `EvolutionArchiveNode`, `EvolutionBandInfo`, `EvolutionCandidate`, `EvolutionGeneration`, `EvolutionReport`, `ExecRequest`, `ExecResult`, `ForkRequest`, `GitWorkspaceOptions`, `HarvestFailure`, `HarvestReport`, `Inbox`, `InProcessSandboxClientOptions`, `IntentAudit`, `Iteration`, `Leaderboard`, `LeaderboardOptions`, `LeaderboardSpec`, `LoopDecisionPayload`, `LoopDispatchOptions`, `LoopEndedPayload`, `LoopIterationEndedPayload`, `LoopIterationStartedPayload`, `LoopPlanDescription`, `LoopResult`, `LoopSandboxPlacement`, `LoopStartedPayload`, `LoopTraceEmitter`, `LoopWinner`, `McpEnvironmentOptions`, `Observation`, `ObserveOptions`, `OpenSandboxRunOptions`, `PairwiseOptions`, `PatchDeliverableOptions`, `PlacementInfo`, `PromotionGateOptions`, `PromotionVerdict`, `PublishOptions`, `ResourceRequest`, `RouterChatResult`, `RouterChatToolsResult`, `RouterToolLoopResult`, `RunAgenticOptions`, `SandboxRun`, `ShotSpec`, `StrategyEvolutionConfig`, `StrategyResult`, `SuperviseOptions`, `SuperviseSurfaceOptions`, `SupervisorAgentDeps`, `SupervisorOpts`, `SurfaceScore`, `ToolSpec`, `TraceSource`, `ValidationCtx`, `Validator`, `WaterfallCollector`, `WaterfallReport`, `Workspace`, `WorkspaceRequest`, `WorkspaceRun`, 
`WorktreeCliExecutorOptions`, `WorktreeFanoutOptions`, `AgentEnvironmentStatus`, `AgentSessionStatus`, `ChampionPolicy`, `LoopTraceEvent`, `MakeWorkerAgent`, `WorkspaceCommit`. ### Environment provider adapters — generic sandbox/compute bridge @@ -824,8 +832,8 @@ Import from `@tangle-network/agent-eval` — 30 exports. |---|---|---| | `buildAgreementJudge` | function | Build a `JudgeConfig` that scores a produced student artifact against the | | `cachedJudge` | function | Wrap a `JudgeConfig` so repeat judgments of the same artifact are served | -| `calibrateJudge` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `compilerJudge` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `calibrateJudge` | function | Measure judge quality against human gold labels: computes Cohen's κ, Pearson correlation, and MAE over matched item ids. | +| `compilerJudge` | function | Build a `SandboxJudgeSpec` that scores whether the harness compiles without errors. | | `contractJudge` | function | Adapt trace contracts to a campaign `JudgeConfig`. One judge dimension per | | `createAntiSlopJudge` | function | Create a reusable Judge function from an anti-slop config. | | `createCustomJudge` | function | Create a custom judge with a fully custom prompt. | @@ -834,16 +842,16 @@ Import from `@tangle-network/agent-eval` — 30 exports. | `createSemanticConceptJudge` | function | Factory: pin LLM options once, return a closure that accepts inputs. | | `ensembleJudge` | function | Build a campaign-shaped `JudgeConfig` whose `score()` runs every panel | | `judgeFamily` | function | Classify a model id into its provider family. 
Strips a `@snapshot` suffix | -| `judgeReplayGate` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `judgeSpans` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `linterJudge` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `judgeReplayGate` | function | Confirm a candidate's win with a stronger judge: score baseline and candidate outputs independently, then bootstrap a CI to verify the lift generalises beyond the inner loop. | +| `judgeSpans` | function | Query judge-kind spans from the trace store, optionally scoped to a single run. | +| `linterJudge` | function | Build a `SandboxJudgeSpec` that scores the harness by linter rule violations. | | `llmJudge` | function | Build a campaign-shaped `JudgeConfig` whose `score()` makes ONE LLM call | | `replayTraceThroughJudge` | function | Apply a judge function to every LLM span in a run and record the | | `runIntentMatchJudge` | function | Run the intent-match judge. Soft-fails to available=false on error. | | `runKeywordCoverageJudge` | function | Score expected concepts against an already-fetched HTML payload + any | | `runSemanticConceptJudge` | function | Run the semantic concept judge. Soft-fails to available=false on | -| `securityJudge` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `testJudge` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `securityJudge` | function | Build a `SandboxJudgeSpec` that scores the harness output for security issues via a security scanner. | +| `testJudge` | function | Build a `SandboxJudgeSpec` that scores the harness by its test-suite pass rate. | | `traceJudge` | function | Wrap a single JudgeFn so its LLM call emits a traced span. | | `adversarialJudge` | const | Adversarial judge — red-teams agent responses. | | `codeExecutionJudge` | const | Code execution judge — evaluates whether code blocks are valid and runnable. 
| @@ -875,11 +883,11 @@ Import from `@tangle-network/agent-eval` — 10 exports. | Symbol | Kind | Summary | |---|---|---| | `gradeSemanticStatus` | function | Grade a semantic-concept-style judge result into a single layer status. | -| `verifyAgentProfileCell` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `verifyAgentProfileCell` | function | Verify an `AgentProfileCell`'s `cellId` matches the sha256 of its hash-material fields, confirming the record has not been tampered with. | | `verifyAttestation` | function | Verify a report against its attestation. Returns a typed outcome rather | | `verifyCompletion` | function | Verify whether a run completed the task. `checkCorrectness` is injected — | | `verifyManifest` | function | Verify that a signed manifest has not been tampered with. | -| `MultiLayerVerifier` | class | _(no summary — add a TSDoc line at the declaration)_ | +| `MultiLayerVerifier` | class | Ordered DAG of verification layers with dependency-based skipping, per-layer findings, soft-fail semantics, and a blended composite score across all passed layers. | | `VerificationReport` | interface | Extends the substrate verdict spine: `valid` = `allPass` and `score` = | | `LayerStatus` | type | Multi-layer verifier — ordered pipeline of verification layers. | @@ -930,70 +938,78 @@ Import from `@tangle-network/agent-eval` — 49 exports. ### CAMPAIGN — profile matrix, gates, improvement loop -Import from `@tangle-network/agent-eval/campaign` — 206 exports. +Import from `@tangle-network/agent-eval/campaign` — 226 exports. | Symbol | Kind | Summary | |---|---|---| -| `aceProposer` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `aceProposer` | function | Append-only context engineering proposer: grows a skill playbook by appending generation-tagged lessons without merging or overwriting prior entries. | | `applySkillPatch` | function | Apply a SkillOpt patch to a text surface. 
Ops apply in array order against | | `buildAnalystSurfaceDispatch` | function | Build the `dispatchWithSurface(surface, scenario, ctx)` the improvement loop | | `buildEvidenceVector` | function | The Evidence Bus. For each objective, pair candidate vs baseline by full | | `buildLoopProvenanceRecord` | function | Build the durable provenance record from a completed loop result. | | `campaignBreakdown` | function | Per-candidate evidence a reflective/patch proposer grounds its next proposal | | `campaignMeanComposite` | function | Mean composite across a campaign: per cell, the mean of its judges' | -| `compareProposers` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `compareProposers` | function | Run a head-to-head lift benchmark across surface proposers on a shared holdout, returning per-proposer lift CIs and pairwise "who wins" verdicts. | | `composeGate` | function | Compose gates — all must `ship` for the composite to `ship`. First | | `countSentenceEdits` | function | Sentence-level edit distance — count distinct add/remove ops between | -| `defaultProductionGate` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `defaultRenderDiff` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `defaultProductionGate` | function | Opinionated production gate composing held-out significance, red-team, reward-hacking, and canary checks into a single `Gate.decide` decision. | +| `defaultRenderDiff` | function | Default surface diff renderer: produces a unified baseline/winner text diff for prompt surfaces or a worktree-ref summary for code surfaces. | | `detectScale` | function | Detect the native scale of a set of scores: 0-100 when any magnitude clears | | `dimensionRegressions` | function | Per-critical-dimension regression guard. 
For each dimension, pair the | +| `discoverEvalFixtures` | function | _(no summary — add a TSDoc line at the declaration)_ | | `emitLoopProvenance` | function | Build the provenance record + OTel spans and persist them durably under the | -| `evolutionaryProposer` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `extractFapoAttributionSignals` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `evolutionaryProposer` | function | Wrap a stateless `Mutator` (GEPA, AxGEPA, reflective-mutation) as a `SurfaceProposer` that mutates the current best surface into N candidates each generation. | +| `extractFapoAttributionSignals` | function | Scan a findings array and extract FAPO attribution signals — per-level counts and failure clusters used to decide which optimization level to escalate to next. | | `extractH2Sections` | function | Extract H2 headings (`## Foo`) from a markdown surface. Exported for | | `failureModeRecallJudge` | function | Deterministic, ground-truth judge for analyst findings. Composite = | -| `fapoEscalationEntry` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `fapoEscalationEntry` | function | Build a `ProposerEntry` that runs the full FAPO escalation policy (prompt → parameter → structural) as a single comparable optimizer entry. | | `fapoProposer` | function | Build a FAPO policy proposer from level-specific candidate generators. | | `fsCampaignStorage` | function | Node-filesystem storage — the default. Lazily requires `node:fs` so the | | `gepaParetoEntry` | function | GEPA with the Pareto frontier + combine-complementary-lessons. | -| `gepaProposer` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `gepaProposer` | function | GEPA reflective proposer: each generation reflects on the weakest scenarios and dimensions to produce targeted prompt rewrites, optionally combining Pareto-frontier parents. 
| | `gepaReflectionEntry` | function | GEPA, reflection-only (single-parent, no Pareto combine). | -| `gitWorktreeAdapter` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `gitWorktreeAdapter` | function | Git-backed `WorktreeAdapter`: creates isolated worktrees on fresh branches, commits agent changes, and discards losers. | | `haloProposer` | function | Wrap the real halo-engine CLI as a SurfaceProposer (prompt-tier). | -| `heldOutGate` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `heldOutGate` | function | Composable held-out delta gate: ships only when the candidate's mean composite on `scenarios` beats the baseline by at least `deltaThreshold`. | | `heldoutSignificance` | function | Significance of the held-out composite lift: ship only when the paired | | `inMemoryCampaignStorage` | function | In-memory storage for filesystem-less runtimes. Artifacts + trace spans | | `isProposedCandidate` | function | Type guard: a proposal carrying its rationale vs a bare | | `labelTrustRank` | function | Ordinal rank for a label-trust tier; absent ⇒ `unverified` (rank 0). | | `llmJudge` | function | Build a campaign-shaped `JudgeConfig` whose `score()` makes ONE LLM call | +| `loadEvalFixture` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `loadEvalFixtureScenarios` | function | _(no summary — add a TSDoc line at the declaration)_ | | `loopProvenanceSpans` | function | Build the loop's OTLP-ingestable spans from a provenance record. One root | | `makePlaybackDispatch` | function | Adapt a `PlaybackDriver` into a `runProfileMatrix` dispatch. The artifact the | | `memoryCurationProposer` | function | Build the CURATOR proposer. | -| `openAutoPr` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `openAutoPr` | function | Open a GitHub PR for a gate-approved surface promotion, attaching the manifest hash, gate verdict, and diff as the PR body. 
| | `pairHoldout` | function | Pair candidate vs baseline holdout observations by FULL cellId. `select` | | `parameterSweepProposer` | function | Config/parameter-level proposer for FAPO's middle escalation level. | | `paretoSignificanceGate` | function | Wrap the bus + a policy as a `Gate`. Plugs into the existing | -| `parseSkillPatchResponse` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `parseSkillPatchResponse` | function | Parse a SkillOpt LLM response into validated `SkillPatch` objects, throwing `SkillPatchParseError` on malformed JSON and silently dropping ops that violate the edit budget. | | `patchEditCount` | function | Total ops in a patch — the edit-budget axis (SkillOpt's "textual learning | +| `planCampaignRun` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `planEvalFixtureRun` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `policyEditProposer` | function | _(no summary — add a TSDoc line at the declaration)_ | | `provenanceRecordPath` | function | Canonical durable paths under the run dir. | -| `provenanceSpansPath` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `provenanceSpansPath` | function | Canonical path for the durable OTLP spans JSONL file under a loop run directory. | | `renderScoreboardMarkdown` | function | Render the scoreboard as a launch-readiness Markdown document — the literal | +| `resolveRunDir` | function | Resolve a campaign `runDir`. 
An absolute path is honored as-is (the caller | | `resolveWorktreePath` | function | Resolve a `CodeSurface`'s worktreeRef to a directory the measurement can | -| `runCampaign` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `runEval` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `runImprovementLoop` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `runOptimization` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `runProfileMatrix` | function | _(no summary — add a TSDoc line at the declaration)_ | -| `runSkillOpt` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `runCampaign` | function | Core campaign orchestrator: fan scenarios through dispatch, score with judges, aggregate bootstrap CIs, and persist reproducible `CampaignResult` records. | +| `runEval` | function | Simplest evaluation preset: run scenarios through dispatch, score with judges, and return a `CampaignResult` — no optimizer, no gate, no PR. | +| `runImprovementLoop` | function | Gated-promotion shell over `runOptimization`: scores the winner against the baseline on a holdout set, runs the release gate, and optionally opens a PR. | +| `runOptimization` | function | Improvement loop body: N generations of propose → campaign → rank, maintaining a Pareto frontier and promoting the top-scoring candidates to the next generation. | +| `runProfileMatrix` | function | Profile × scenario matrix runner: fan N agent profiles across M scenarios, project each cell to a validated `RunRecord` with real token usage, and enforce the backend-integrity guard before returning. | +| `runSkillOpt` | function | SkillOpt sequential hill-climb: each epoch reflects on train-scenario weaknesses, proposes bounded patches, accepts the first patch that strictly improves the held-out composite, and anneals the edit | | `scoreboardSummary` | function | Roll the per-requirement rows up into the launch headline counts. 
| | `scoreUserStory` | function | Score one story's produced state against its requirements. Thin wrapper over | | `sequentialDecide` | function | `SurfaceProposer.decide` adapter — stops the optimization loop the moment | | `sequentialPairedGate` | function | Anytime-valid sequential paired gate. Conforms to the existing `Gate` | | `skillOptEntry` | function | SkillOpt patch-mode hill-climb. Runs findings-BLIND: `runSkillOpt` owns its | -| `skillOptProposer` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `skillOptProposer` | function | SkillOpt proposer: proposes bounded, anchored patch operations (add/delete/replace) on a skill document, conforming to both the patch-native `SkillOptProposer` and the generic `SurfaceProposer` interf | | `surfaceContentHash` | function | Stable sha256 (full hex) of a surface's effective text. Code surfaces hash | -| `surfaceHash` | function | _(no summary — add a TSDoc line at the declaration)_ | +| `surfaceHash` | function | Short (16-char) sha256 fingerprint of a `MutableSurface`: hashes text content for prompt surfaces, or the worktree + base ref pair for code surfaces. | +| `tangleTracesRoot` | function | The shared, out-of-repo root for campaign/benchmark run bundles. Keeping run | | `traceAnalystProposer` | function | Wrap agent-eval's trace-analyst registry as a SurfaceProposer (prompt-tier). | | `userStoryScoreboard` | function | Flatten story verdicts into the per-requirement scoreboard — the literal | | `paretoPolicy` | const | The default strategy: symmetric multi-objective Pareto significance. Ship iff | @@ -1019,7 +1035,6 @@ Import from `@tangle-network/agent-eval/campaign` — 206 exports. | `GenerationCandidate` | interface | One scored candidate surface in a generation. `dimensions` + `scenarios` | | `GepaProposerConstraints` | interface | `gepaProposer` — a reflective `SurfaceProposer` for prompt-tier surfaces. 
| | `HaloProposerOptions` | interface | `haloProposer` — wraps the REAL halo-engine (Inference.net's hierarchical | -| `HeldOutGateOptions` | interface | Thin Gate adapter — exposes delta-threshold-on-holdout as a composable | | `JudgeConfig` | interface | Pluggable dimensional scorer. `score` is the contract: | | `JudgeScore` | interface | The canonical judge verdict shape — one declaration, shared by campaign | | `LabeledScenarioWrite` | interface | Required-provenance write. The store rejects writes that | @@ -1033,6 +1048,7 @@ Import from `@tangle-network/agent-eval/campaign` — 206 exports. | `PlaybackContext` | interface | Dispatch context plus the profile under test (which cheap model, etc.). | | `PlaybackDriver` | interface | Drives the real product through a story and returns the runtime event stream | | `PlaybackStep` | interface | One step of a user story — what the user does. The driver interprets | +| `PolicyEditProposerOptions` | interface | `policyEditProposer` turns typed analyst policy edits into measured candidate | | `ProposeContext` | interface | Everything a proposer may read to plan the next | | `ProposedCandidate` | interface | A proposer output carrying the surface AND the WHY behind | | `ProposerEntry` | interface | What an optimizer produced: the surface it promoted + what it cost to get | @@ -1068,7 +1084,7 @@ Import from `@tangle-network/agent-eval/campaign` — 206 exports. | `SequentialDecision` | type | Anytime-valid sequential promotion gate — an e-process (betting | | `SkillPatchOp` | type | A single bounded edit against a skill surface. 
| -**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AcceptedEdit`, `ApplySkillPatchResult`, `AxisEvidence`, `BuildAnalystSurfaceDispatchOptions`, `BuildEvidenceVectorOptions`, `BuildLoopProvenanceArgs`, `CampaignAggregates`, `CampaignBreakdown`, `CampaignCellResult`, `CampaignResult`, `CompareProposersOptions`, `DimensionRegression`, `EmitLoopProvenanceArgs`, `EmitLoopProvenanceResult`, `EvidenceVector`, `FailureModeRecallJudgeOptions`, `FapoAttributionSignals`, `FapoFailureCluster`, `FapoProposerOptions`, `FapoReviewInput`, `FapoReviewIssue`, `FapoReviewResult`, `FapoScopeContract`, `GateContext`, `GateResult`, `GenerationRecord`, `GepaProposerOptions`, `GitWorktreeAdapterOptions`, `HeldoutSignificance`, `HeldoutSignificanceOptions`, `JudgeAggregate`, `JudgeDimension`, `LabeledScenarioRecord`, `LabeledScenarioSampleArgs`, `LabeledScenarioStore`, `LlmJudgeOptions`, `LoopProvenanceBackend`, `LoopProvenanceCandidate`, `OpenAutoPrResult`, `OptimizerConfig`, `ParameterCandidate`, `ParameterChange`, `ParameterSweepProposerOptions`, `ParetoSignificanceGateOptions`, `ProfileSummary`, `PromotionObjective`, `ProposePatchesArgs`, `ProposerComparison`, `ProposerPairwise`, `ProposerScore`, `RunImprovementLoopResult`, `RunOptimizationResult`, `RunProfileMatrixOptions`, `RunProfileMatrixResult`, `RunSkillOptResult`, `ScenarioAggregate`, `ScenarioRollup`, `ScoreboardRenderOptions`, `SequentialDecideFn`, `SequentialDecideOptions`, `SequentialObservation`, `SequentialPairedGate`, `SequentialPairedGateOptions`, `SkillOptEpochRecord`, `SkillOptProposer`, `SkillOptProposerOptions`, `SkillPatchRejection`, `TraceSpan`, `WorktreeAdapter`, `JsonPrimitive`, `JsonValue`, `RedactionStatus`, `RunOptimizationOptions`. 
+**Undocumented supporting types** (add a TSDoc line at the declaration to earn a table row): `AcceptedEdit`, `ApplySkillPatchResult`, `AxisEvidence`, `BuildAnalystSurfaceDispatchOptions`, `BuildEvidenceVectorOptions`, `BuildLoopProvenanceArgs`, `CampaignAggregates`, `CampaignBreakdown`, `CampaignCellResult`, `CampaignResult`, `CampaignRunPlan`, `CampaignRunPlanCell`, `CompareProposersOptions`, `DimensionRegression`, `EmitLoopProvenanceArgs`, `EmitLoopProvenanceResult`, `EvalFixture`, `EvalFixtureFile`, `EvalFixtureLoadOptions`, `EvalFixtureScenario`, `EvidenceVector`, `FailureModeRecallJudgeOptions`, `FapoAttributionSignals`, `FapoFailureCluster`, `FapoProposerOptions`, `FapoReviewInput`, `FapoReviewIssue`, `FapoReviewResult`, `FapoScopeContract`, `GateContext`, `GateResult`, `GenerationRecord`, `GepaProposerOptions`, `GitWorktreeAdapterOptions`, `HeldOutGateOptions`, `HeldoutSignificance`, `HeldoutSignificanceOptions`, `JudgeAggregate`, `JudgeDimension`, `LabeledScenarioRecord`, `LabeledScenarioSampleArgs`, `LabeledScenarioStore`, `LlmJudgeOptions`, `LoadEvalFixtureScenariosOptions`, `LoopProvenanceBackend`, `LoopProvenanceCandidate`, `OpenAutoPrResult`, `OptimizerConfig`, `ParameterCandidate`, `ParameterChange`, `ParameterSweepProposerOptions`, `ParetoSignificanceGateOptions`, `PlanCampaignRunOptions`, `PlanEvalFixtureRunOptions`, `ProfileSummary`, `PromotionObjective`, `ProposePatchesArgs`, `ProposerComparison`, `ProposerPairwise`, `ProposerScore`, `RunImprovementLoopResult`, `RunOptimizationResult`, `RunProfileMatrixOptions`, `RunProfileMatrixResult`, `RunSkillOptResult`, `ScenarioAggregate`, `ScenarioRollup`, `ScoreboardRenderOptions`, `SequentialDecideFn`, `SequentialDecideOptions`, `SequentialObservation`, `SequentialPairedGate`, `SequentialPairedGateOptions`, `SkillOptEpochRecord`, `SkillOptProposer`, `SkillOptProposerOptions`, `SkillPatchRejection`, `TraceSpan`, `WorktreeAdapter`, `EvalFixtureRunPlan`, `EvalFixtureValidationMode`, `JsonPrimitive`, 
`JsonValue`, `RedactionStatus`, `RunOptimizationOptions`. ### TOKEN / USAGE — usage extraction + run-record usage types diff --git a/docs/api/runtime.md b/docs/api/runtime.md index ac4fb41..067d1e5 100644 --- a/docs/api/runtime.md +++ b/docs/api/runtime.md @@ -1308,6 +1308,671 @@ Minimum confidence a PROBABILISTIC verdict must clear to end. Default 0.8. *** +### LeaderboardScore + +Defined in: runtime/define-leaderboard.ts:60 + +Structured per-case verdict a `score` function may return (a bare number is + shorthand for `{ composite }`). `composite` is the [0,1] leaderboard score; + `dimensions` are recorded as extra judge dimensions. + +#### Properties + +##### composite + +> **composite**: `number` + +Defined in: runtime/define-leaderboard.ts:61 + +##### dimensions? + +> `optional` **dimensions?**: `Record`\<`string`, `number`\> + +Defined in: runtime/define-leaderboard.ts:62 + +##### notes? + +> `optional` **notes?**: `string` + +Defined in: runtime/define-leaderboard.ts:63 + +*** + +### LeaderboardScenario + +Defined in: runtime/define-leaderboard.ts:68 + +The campaign scenario a case is wrapped into: the case rides along so + judges and hooks can reach the full domain payload, not just its id. + +#### Extends + +- `Scenario` + +#### Type Parameters + +##### TCase + +`TCase` + +#### Properties + +##### case + +> **case**: `TCase` + +Defined in: runtime/define-leaderboard.ts:69 + +*** + +### LeaderboardFlagSpec + +Defined in: runtime/define-leaderboard.ts:74 + +One extra CLI flag a spec declares. Parsed by `run()` as `-- ` + and surfaced to every hook via `ctx.args`. + +#### Properties + +##### default? + +> `optional` **default?**: `string` + +Defined in: runtime/define-leaderboard.ts:75 + +##### description + +> **description**: `string` + +Defined in: runtime/define-leaderboard.ts:76 + +*** + +### LeaderboardRunContext + +Defined in: runtime/define-leaderboard.ts:80 + +Resolved run configuration handed to `setup` / `teardown` / `export`. 
+ +#### Properties + +##### name + +> **name**: `string` + +Defined in: runtime/define-leaderboard.ts:81 + +##### backend + +> **backend**: `string` + +Defined in: runtime/define-leaderboard.ts:83 + +Execution backend name (`--backend`), a key of `backends`. + +##### runDir + +> **runDir**: `string` + +Defined in: runtime/define-leaderboard.ts:84 + +##### exportDir + +> **exportDir**: `string` + +Defined in: runtime/define-leaderboard.ts:85 + +##### args + +> **args**: `Record`\<`string`, `string` \| `undefined`\> + +Defined in: runtime/define-leaderboard.ts:87 + +Every parsed flag (standard + `spec.flags`), by name without `--`. + +##### harnesses + +> **harnesses**: readonly `HarnessType`[] + +Defined in: runtime/define-leaderboard.ts:88 + +##### models + +> **models**: readonly `string`[] + +Defined in: runtime/define-leaderboard.ts:90 + +Snapshot-stamped model ids (`name@snapshot`) — the eval identity models. + +##### caseIds + +> **caseIds**: readonly `string`[] + +Defined in: runtime/define-leaderboard.ts:91 + +##### shots + +> **shots**: `number` + +Defined in: runtime/define-leaderboard.ts:92 + +##### reps + +> **reps**: `number` + +Defined in: runtime/define-leaderboard.ts:93 + +*** + +### LeaderboardBenchTask + +Defined in: runtime/define-leaderboard.ts:98 + +Structurally `BenchTask` (bench registry shape) — declared locally so this + module adds no dependency on a benchmark package. + +#### Properties + +##### id + +> **id**: `string` + +Defined in: runtime/define-leaderboard.ts:99 + +##### prompt + +> **prompt**: `string` + +Defined in: runtime/define-leaderboard.ts:100 + +##### split? + +> `optional` **split?**: `string` + +Defined in: runtime/define-leaderboard.ts:101 + +##### metadata? + +> `optional` **metadata?**: `Record`\<`string`, `unknown`\> + +Defined in: runtime/define-leaderboard.ts:102 + +*** + +### LeaderboardBenchScore + +Defined in: runtime/define-leaderboard.ts:106 + +Structurally `BenchScore` (bench registry shape). 
+ +#### Properties + +##### resolved + +> **resolved**: `boolean` + +Defined in: runtime/define-leaderboard.ts:107 + +##### score + +> **score**: `number` + +Defined in: runtime/define-leaderboard.ts:108 + +##### detail? + +> `optional` **detail?**: `string` + +Defined in: runtime/define-leaderboard.ts:109 + +*** + +### LeaderboardBenchmarkAdapter + +Defined in: runtime/define-leaderboard.ts:114 + +Structurally `BenchmarkAdapter` (bench registry shape): `name`, + `preflight()`, `loadTasks()`, deterministic `judge()`, `goldArtifact()`. + +#### Properties + +##### name + +> `readonly` **name**: `string` + +Defined in: runtime/define-leaderboard.ts:115 + +#### Methods + +##### preflight() + +> **preflight**(): `Promise`\<`void`\> + +Defined in: runtime/define-leaderboard.ts:116 + +###### Returns + +`Promise`\<`void`\> + +##### loadTasks() + +> **loadTasks**(`opts?`): `Promise`\<[`LeaderboardBenchTask`](#leaderboardbenchtask)[]\> + +Defined in: runtime/define-leaderboard.ts:117 + +###### Parameters + +###### opts? + +###### limit? + +`number` + +###### split? + +`string` + +###### ids? 
+ +`string`[] + +###### Returns + +`Promise`\<[`LeaderboardBenchTask`](#leaderboardbenchtask)[]\> + +##### judge() + +> **judge**(`task`, `artifact`): `Promise`\<[`LeaderboardBenchScore`](#leaderboardbenchscore)\> + +Defined in: runtime/define-leaderboard.ts:122 + +###### Parameters + +###### task + +[`LeaderboardBenchTask`](#leaderboardbenchtask) + +###### artifact + +`string` + +###### Returns + +`Promise`\<[`LeaderboardBenchScore`](#leaderboardbenchscore)\> + +##### goldArtifact() + +> **goldArtifact**(`task`): `Promise`\<`string` \| `undefined`\> + +Defined in: runtime/define-leaderboard.ts:123 + +###### Parameters + +###### task + +[`LeaderboardBenchTask`](#leaderboardbenchtask) + +###### Returns + +`Promise`\<`string` \| `undefined`\> + +*** + +### LeaderboardSpec + +Defined in: runtime/define-leaderboard.ts:126 + +#### Type Parameters + +##### TCase + +`TCase` + +#### Properties + +##### name + +> **name**: `string` + +Defined in: runtime/define-leaderboard.ts:128 + +Leaderboard name — the scenario `kind`, default profile name, and report title. + +##### cases + +> **cases**: `TCase`[] + +Defined in: runtime/define-leaderboard.ts:130 + +The case corpus. Every case needs a stable string id (see `caseId`). + +##### caseId? + +> `optional` **caseId?**: (`c`) => `string` + +Defined in: runtime/define-leaderboard.ts:133 + +Stable id extractor. Default: the case's own `id` property (fail-loud + when absent or not a string). + +###### Parameters + +###### c + +`TCase` + +###### Returns + +`string` + +##### prompt + +> **prompt**: (`c`) => `string` \| `Promise`\<`string`\> + +Defined in: runtime/define-leaderboard.ts:136 + +The per-case task prompt. May be async (e.g. built by shelling out to a + reference implementation); resolved ONCE per case before dispatch. 
+ +###### Parameters + +###### c + +`TCase` + +###### Returns + +`string` \| `Promise`\<`string`\> + +##### score + +> **score**: (`output`, `c`) => `number` \| [`LeaderboardScore`](#leaderboardscore) + +Defined in: runtime/define-leaderboard.ts:140 + +The domain grader: agent output text → score. Used BOTH as the per-shot + validator (a shot with `composite > 0` stops the naive retry loop) and, + wrapped as a campaign judge, as the recorded leaderboard score. + +###### Parameters + +###### output + +`string` + +###### c + +`TCase` + +###### Returns + +`number` \| [`LeaderboardScore`](#leaderboardscore) + +##### axis? + +> `optional` **axis?**: `object` + +Defined in: runtime/define-leaderboard.ts:144 + +Harness × model axes for `expandProfileAxes`. Defaults: the canonical + `CODING_HARNESSES` × the base profile's `model.default`. `--harnesses` / + `--models` override per run. + +###### harnesses? + +> `optional` **harnesses?**: readonly `HarnessType`[] + +###### models? + +> `optional` **models?**: readonly `string`[] + +##### baseProfile? + +> `optional` **baseProfile?**: `AgentProfile` + +Defined in: runtime/define-leaderboard.ts:147 + +Base profile the axes expand over (prompt/tools/skills held fixed). + Default: a minimal `{ name, model: { default: } }`. + +##### backends? + +> `optional` **backends?**: `Record`\<`string`, (() => [`SandboxClient`](#sandboxclient-3)) \| `undefined`\> + +Defined in: runtime/define-leaderboard.ts:157 + +Execution-backend registry: `--backend ` picks the factory that +yields the `SandboxClient` every cell runs on. Merged over the defaults: + - `sandbox` — throws with guidance (a product must supply its real + Sandbox-backed client; the facade has no credentials). + - `cli-bridge` — `resolveSandboxClient({ backend: 'bridge' })` reading + `CLI_BRIDGE_URL` + `BRIDGE_BEARER`/`CLI_BRIDGE_BEARER`; the per-cell + harness/model ride in via `sandboxOverrides.backend`. + +##### flags? 
+ +> `optional` **flags?**: `Record`\<`string`, [`LeaderboardFlagSpec`](#leaderboardflagspec)\> + +Defined in: runtime/define-leaderboard.ts:159 + +Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`. + +##### modelBackend? + +> `optional` **modelBackend?**: `Record`\<`string`, `unknown`\> + +Defined in: runtime/define-leaderboard.ts:163 + +Extra fields merged into each cell's `backend.model` create override — + e.g. `{ provider: 'openai-compat', apiKey, baseUrl }` for a router-backed + sandbox. The cell's bare model id is set by the facade from the axis. + +##### setup? + +> `optional` **setup?**: (`ctx`) => `void` \| `Promise`\<`void`\> + +Defined in: runtime/define-leaderboard.ts:165 + +Runs once before the matrix (fetch fixtures, warm caches). + +###### Parameters + +###### ctx + +[`LeaderboardRunContext`](#leaderboardruncontext) + +###### Returns + +`void` \| `Promise`\<`void`\> + +##### teardown? + +> `optional` **teardown?**: (`ctx`) => `void` \| `Promise`\<`void`\> + +Defined in: runtime/define-leaderboard.ts:167 + +Runs once after the matrix, even on failure (reap boxes, close handles). + +###### Parameters + +###### ctx + +[`LeaderboardRunContext`](#leaderboardruncontext) + +###### Returns + +`void` \| `Promise`\<`void`\> + +##### onCellEvents? + +> `optional` **onCellEvents?**: (`events`, `c`) => `void` + +Defined in: runtime/define-leaderboard.ts:171 + +Per-cell event tap: the raw sandbox events of each parsed iteration, + with the case — the seam for domain metric capture (search counts, + citations) without a substrate change. + +###### Parameters + +###### events + +readonly `SandboxEvent`[] + +###### c + +`TCase` + +###### Returns + +`void` + +##### parseOutput? + +> `optional` **parseOutput?**: (`events`, `c`) => `string` + +Defined in: runtime/define-leaderboard.ts:175 + +Output decode override: raw events → the scored output text. 
Default: + the sandbox SDK's `collectAgentResponseText` (final answer text; empty + string when the stream carried none — which then scores 0). + +###### Parameters + +###### events + +readonly `SandboxEvent`[] + +###### c + +`TCase` + +###### Returns + +`string` + +##### export? + +> `optional` **export?**: (`result`, `ctx`) => `void` \| `Promise`\<`void`\> + +Defined in: runtime/define-leaderboard.ts:178 + +Result export. Default: write `matrix-result.json` under the run dir and + print (+ write) the ranked leaderboard markdown under the export dir. + +###### Parameters + +###### result + +`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\> + +###### ctx + +[`LeaderboardRunContext`](#leaderboardruncontext) + +###### Returns + +`void` \| `Promise`\<`void`\> + +##### dispatch? + +> `optional` **dispatch?**: `ProfileDispatchFn`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\> + +Defined in: runtime/define-leaderboard.ts:184 + +LEVEL 2 — full dispatch replacement (in-process products bring their own). + The default is `loopDispatch` + `naiveDriver` over the resolved backend. + +##### judges? + +> `optional` **judges?**: `JudgeConfig`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>[] + +Defined in: runtime/define-leaderboard.ts:186 + +LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge. + +##### shots? + +> `optional` **shots?**: `number` + +Defined in: runtime/define-leaderboard.ts:188 + +Naive-retry shot cap per cell (`--shots`). Default 1. + +##### reps? + +> `optional` **reps?**: `number` + +Defined in: runtime/define-leaderboard.ts:190 + +Replicates per cell (`--reps`). Default 1. + +##### matrix? + +> `optional` **matrix?**: `Partial`\<`RunProfileMatrixOptions`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\>\> + +Defined in: runtime/define-leaderboard.ts:194 + +Passthrough overrides spread onto the final `runProfileMatrix` call + (e.g. 
`maxConcurrency`, `costCeiling`, `integrity`, `storage`) — spread + LAST, so anything the facade wired can be overridden. + +*** + +### DefinedLeaderboard + +Defined in: runtime/define-leaderboard.ts:197 + +#### Type Parameters + +##### TCase + +`TCase` + +#### Methods + +##### run() + +> **run**(`argv?`): `Promise`\<`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>\> + +Defined in: runtime/define-leaderboard.ts:211 + +Parse flags, run the matrix, export, and return the raw result. + +Standard flags: `--backend ` (default `sandbox`), `--harnesses a,b`, +`--models m1,m2`, `--cases id1,id2`, `--shots N`, `--reps N`, +`--model-snapshot `, `--run-dir `, `--export-dir `, +plus every `spec.flags` entry. `argv` defaults to `process.argv.slice(2)`. + +The default run dir is FRESH per invocation (timestamp+pid under the OS +tmpdir). `runProfileMatrix` caches cells by run dir, and a stable default +would silently reuse a prior FAILED zero-token cell and skip dispatch — +only an explicit `--run-dir` opts into that resume behavior. + +###### Parameters + +###### argv? + +`string`[] + +###### Returns + +`Promise`\<`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>\> + +##### toBenchmarkAdapter() + +> **toBenchmarkAdapter**(): [`LeaderboardBenchmarkAdapter`](#leaderboardbenchmarkadapter) + +Defined in: runtime/define-leaderboard.ts:213 + +The same domain surface in the structural `BenchmarkAdapter` shape. + +###### Returns + +[`LeaderboardBenchmarkAdapter`](#leaderboardbenchmarkadapter) + +*** + ### HarvestCorpusOptions Defined in: [runtime/harvest-corpus.ts:28](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/harvest-corpus.ts#L28) @@ -14650,6 +15315,30 @@ passes. Ground truth — the driver ends directly, no validation. 
The check read *** +### defineLeaderboard() + +> **defineLeaderboard**\<`TCase`\>(`spec`): [`DefinedLeaderboard`](#definedleaderboard)\<`TCase`\> + +Defined in: runtime/define-leaderboard.ts:255 + +#### Type Parameters + +##### TCase + +`TCase` + +#### Parameters + +##### spec + +[`LeaderboardSpec`](#leaderboardspec)\<`TCase`\> + +#### Returns + +[`DefinedLeaderboard`](#definedleaderboard)\<`TCase`\> + +*** + ### harvestCorpus() > **harvestCorpus**(`opts`): `Promise`\<[`HarvestReport`](#harvestreport)\> diff --git a/docs/canonical-api.md b/docs/canonical-api.md index 260e1bb..9d3c72e 100644 --- a/docs/canonical-api.md +++ b/docs/canonical-api.md @@ -2,7 +2,7 @@ -> **Version 0.83.0.** The export inventory + per-symbol signatures live in the generated `docs/api/` reference: **`docs/api/primitive-catalog.md`** is the never-stale, grouped list of every primitive to reuse (own surface + the agent-eval judge / authenticity / verification / statistics / campaign / token-usage surfaces), with each one's import path and one-line summary read live from source; the per-module pages hold the full signatures. The pinned substrate is agent-eval `>=0.97.0 <1.0.0`; the sandbox substrate that materializes profiles into harness shapes is `@tangle-network/sandbox` (peer `>=0.8.0 <1.0.0`). The neutral contract types (`AgentProfile`, `AgentProfileMcpServer`, `HarnessType`, `ReasoningEffort`, `Part`/`ToolPart`/`ToolState`, plus environment-provider types) are owned by **`@tangle-network/agent-interface`** (peer `>=0.14.0 <1.0.0`) — the single source of truth. Substrate primitives are re-exported through `@tangle-network/agent-eval/contract` (or `/campaign`), not local to this package — the catalog's §2 shows exactly which subpath each lives under. 
+> **Version 0.84.0.** The export inventory + per-symbol signatures live in the generated `docs/api/` reference: **`docs/api/primitive-catalog.md`** is the never-stale, grouped list of every primitive to reuse (own surface + the agent-eval judge / authenticity / verification / statistics / campaign / token-usage surfaces), with each one's import path and one-line summary read live from source; the per-module pages hold the full signatures. The pinned substrate is agent-eval `>=0.101.0 <1.0.0`; the sandbox substrate that materializes profiles into harness shapes is `@tangle-network/sandbox` (peer `>=0.8.0 <1.0.0`). The neutral contract types (`AgentProfile`, `AgentProfileMcpServer`, `HarnessType`, `ReasoningEffort`, `Part`/`ToolPart`/`ToolState`, plus environment-provider types) are owned by **`@tangle-network/agent-interface`** (peer `>=0.14.0 <1.0.0`) — the single source of truth. Substrate primitives are re-exported through `@tangle-network/agent-eval/contract` (or `/campaign`), not local to this package — the catalog's §2 shows exactly which subpath each lives under. > > **`./loops` is the runtime barrel** — `package.json` maps it to `src/runtime/index.ts`. Everything below labelled `/loops` is the recursive-atom + loop-kernel surface. > diff --git a/package.json b/package.json index 3b07a91..9abcb1d 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "@tangle-network/agent-runtime", - "version": "0.83.0", + "version": "0.84.0", "description": "Shared task-lifecycle skeleton for agents: a recursive loop kernel for chat turns, one-shot tasks, and multi-attempt loops, with trace capture and eval-gated self-improvement. 
Domain behavior lives in adapters; scoring and ship-gates in @tangle-network/agent-eval.", "homepage": "https://github.com/tangle-network/agent-runtime#readme", "repository": { @@ -94,7 +94,7 @@ }, "devDependencies": { "@biomejs/biome": "^2.4.15", - "@tangle-network/agent-eval": ">=0.100.0 <1.0.0", + "@tangle-network/agent-eval": "^0.103.1", "@tangle-network/agent-interface": ">=0.14.0 <1.0.0", "@tangle-network/sandbox": ">=0.8.0 <1.0.0", "@types/node": "^25.9.3", @@ -123,7 +123,7 @@ "license": "MIT", "packageManager": "pnpm@10.28.0", "peerDependencies": { - "@tangle-network/agent-eval": ">=0.97.0 <1.0.0", + "@tangle-network/agent-eval": ">=0.101.0 <1.0.0", "@tangle-network/agent-interface": ">=0.14.0 <1.0.0", "@tangle-network/sandbox": ">=0.8.0 <1.0.0", "playwright": "^1.40.0" diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index b9f061f..cce5f93 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -12,8 +12,8 @@ importers: specifier: ^2.4.15 version: 2.4.15 '@tangle-network/agent-eval': - specifier: '>=0.100.0 <1.0.0' - version: 0.100.0(typescript@5.9.3) + specifier: ^0.103.1 + version: 0.103.1(typescript@5.9.3) '@tangle-network/agent-interface': specifier: '>=0.14.0 <1.0.0' version: 0.14.0 @@ -636,13 +636,8 @@ packages: '@tangle-network/agent-core@0.3.4': resolution: {integrity: sha512-Hvz3ABRouNtBmRvGqPxifAO2yuILneJMylWH5jW/jeS2F03RvqkGYuXyGXWWLqosYbb3hVAvSEe4Ykm2FMGEDQ==} - '@tangle-network/agent-eval@0.100.0': - resolution: {integrity: sha512-yBupVJJAqHozhe1BL5xBuDObjvNsoY+XmJo7qfpw/w7rehAXbKliBb4k3XS1G55+GaYPjFA+xwPzlEDQISpMRw==} - engines: {node: '>=20'} - hasBin: true - - '@tangle-network/agent-integrations@0.29.0': - resolution: {integrity: sha512-Avn4oBDTRP5v/3o1xq++uu/9+Rhl2hscIggeFPBGjtVYwhvbsSZL9pRrF3LfjqL9rjx9AocZOdsZC6MXrxKnkg==} + '@tangle-network/agent-eval@0.103.1': + resolution: {integrity: sha512-9V37IcaRixSfIUkZ50pgU8a5nSVrkVmq5BimNLwVzbi3USwOkkJ9RcecMScpLUnrYNeaoe5Sac8lS6kzL1uTDQ==} engines: {node: '>=20'} hasBin: true @@ -655,26 +650,6 @@ 
packages: '@tangle-network/agent-interface@0.14.0': resolution: {integrity: sha512-9CyGhIpl90E7v4MTm3b1ti3Bp7BfPigk2Nafgi21Lg0U+QxlNB656F2JmVpUuSbOo9aGZPtg5nXu5EBTlV5a1g==} - '@tangle-network/sandbox@0.3.0': - resolution: {integrity: sha512-KfgvKhsUaOpkJe3AD19w7s4hdQekBlXQGoNx0xS4u6vuQk5YnFzBgv+EQeHCkkgETpYOWS2AN+6u/JhSyWStMw==} - peerDependencies: - '@mastra/core': '*' - '@modelcontextprotocol/sdk': '*' - ai: '*' - openai: ^6.36.0 - viem: ^2.0.0 - peerDependenciesMeta: - '@mastra/core': - optional: true - '@modelcontextprotocol/sdk': - optional: true - ai: - optional: true - openai: - optional: true - viem: - optional: true - '@tangle-network/sandbox@0.9.5': resolution: {integrity: sha512-yvX2OX6uISBVnMQ+v6Upkesa3u8yj6BHxsfcS6p8Vze+M4WBpyhkwA+onzFHuo9rti557ItZn8yDu4a/klljvQ==} peerDependencies: @@ -699,8 +674,8 @@ packages: resolution: {integrity: sha512-+TAF9s5t1jOWGyGHvKhIWe2FYmG7puVaxmmg0Et67ylAjGa7GqUAvISXGjG/6dzld7A170V0kQHK0WVdh2Wh0Q==} engines: {node: '>=18'} - '@tangle-network/tcloud@0.4.12': - resolution: {integrity: sha512-3Qs90sV0P3LBtrTGC9HW2rwCMDjbScyhZIQU6H2/dVd84S5uKN+tCsURnXE6uu54U766Xa/V3Rcdqqjmgv7AXg==} + '@tangle-network/tcloud@0.4.14': + resolution: {integrity: sha512-jWYt//cGdLBDOv0luLH6xAGS4gbuOt8uHIkaCWwDDpQ1zp0FUPATHIrA3RMuF0qtQq9Vq00IhLrmCnHdHBP+dg==} engines: {node: '>=18'} hasBin: true @@ -1629,13 +1604,13 @@ snapshots: '@tangle-network/agent-interface': 0.14.0 zod: 4.4.3 - '@tangle-network/agent-eval@0.100.0(typescript@5.9.3)': + '@tangle-network/agent-eval@0.103.1(typescript@5.9.3)': dependencies: '@asteasolutions/zod-to-openapi': 8.5.0(zod@4.4.3) '@ax-llm/ax': 19.0.45(zod@4.4.3) '@hono/node-server': 2.0.4(hono@4.12.25) '@tangle-network/agent-interface': 0.10.0 - '@tangle-network/tcloud': 0.4.12(typescript@5.9.3)(zod@4.4.3) + '@tangle-network/tcloud': 0.4.14(typescript@5.9.3)(zod@4.4.3) hono: 4.12.25 zod: 4.4.3 transitivePeerDependencies: @@ -1647,8 +1622,6 @@ snapshots: - typescript - utf-8-validate - 
'@tangle-network/agent-integrations@0.29.0': {} - '@tangle-network/agent-interface@0.10.0': dependencies: zod: 4.4.3 @@ -1661,12 +1634,6 @@ snapshots: dependencies: zod: 4.4.3 - '@tangle-network/sandbox@0.3.0(viem@2.52.2(typescript@5.9.3)(zod@4.4.3))': - dependencies: - '@tangle-network/agent-integrations': 0.29.0 - optionalDependencies: - viem: 2.52.2(typescript@5.9.3)(zod@4.4.3) - '@tangle-network/sandbox@0.9.5(viem@2.52.2(typescript@5.9.3)(zod@4.4.3))': dependencies: '@tangle-network/agent-core': 0.3.4 @@ -1676,11 +1643,11 @@ snapshots: '@tangle-network/tcloud-attestation@0.1.1': {} - '@tangle-network/tcloud@0.4.12(typescript@5.9.3)(zod@4.4.3)': + '@tangle-network/tcloud@0.4.14(typescript@5.9.3)(zod@4.4.3)': dependencies: '@scure/bip32': 2.2.0 '@scure/bip39': 2.2.0 - '@tangle-network/sandbox': 0.3.0(viem@2.52.2(typescript@5.9.3)(zod@4.4.3)) + '@tangle-network/sandbox': 0.9.5(viem@2.52.2(typescript@5.9.3)(zod@4.4.3)) '@tangle-network/tcloud-attestation': 0.1.1 commander: 14.0.3 viem: 2.52.2(typescript@5.9.3)(zod@4.4.3) diff --git a/src/runtime/define-leaderboard.test.ts b/src/runtime/define-leaderboard.test.ts new file mode 100644 index 0000000..9f7a17b --- /dev/null +++ b/src/runtime/define-leaderboard.test.ts @@ -0,0 +1,182 @@ +import { mkdtempSync } from 'node:fs' +import { tmpdir } from 'node:os' +import { join } from 'node:path' +import type { SandboxEvent } from '@tangle-network/sandbox' +import { describe, expect, it } from 'vitest' +import { defineLeaderboard, type LeaderboardRunContext } from './define-leaderboard' +import { inProcessSandboxClient } from './in-process-sandbox-client' + +interface FakeCase { + id: string + answer: string +} + +const CASES: FakeCase[] = [ + { id: 'case-alpha', answer: 'ALPHA-42' }, + { id: 'case-beta', answer: 'BETA-7' }, +] + +/** Offline backend: echoes the prompt's embedded answer + meters an llm_call, + * so the matrix integrity guard sees a real (non-stub) backend. 
*/ +function fakeBackend() { + return inProcessSandboxClient({ + onPrompt: (prompt): SandboxEvent[] => { + const answer = /answer=(\S+)/.exec(prompt)?.[1] ?? 'missing' + return [ + { type: 'llm_call', data: { tokensIn: 12, tokensOut: 6, costUsd: 0.002 } }, + { type: 'result', data: { finalText: `final answer=${answer}` } }, + ] + }, + }) +} + +function board(overrides: Partial<Parameters<typeof defineLeaderboard<FakeCase>>[0]> = {}) { + return defineLeaderboard({ + name: 'fake-board', + cases: CASES, + prompt: async (c) => `solve the task. answer=${c.answer}`, + score: (output, c) => (output.includes(c.answer) ? 1 : 0), + backends: { inproc: fakeBackend }, + export: async () => {}, // silence the default table print in tests + ...overrides, + }) +} + +const AXIS = ['--backend', 'inproc', '--harnesses', 'opencode', '--models', 'test-model@2026-01-01'] + +describe('defineLeaderboard', () => { + it('runs the matrix end-to-end offline and scores every (profile, case) cell', async () => { + const result = await board().run([...AXIS]) + + expect(result.records).toHaveLength(2) + expect(Object.keys(result.byScenario).sort()).toEqual(['case-alpha', 'case-beta']) + const summaries = Object.values(result.byProfile) + expect(summaries).toHaveLength(1) + expect(summaries[0]?.meanComposite).toBe(1) + expect(summaries[0]?.model).toBe('test-model@2026-01-01') + // The fake backend's llm_call events were metered — the run is REAL, not a stub. 
+ expect(result.integrity.verdict).toBe('real') + for (const r of result.records) expect(r.tokenUsage.input).toBeGreaterThan(0) + }) + + it('defaults to a FRESH run dir per invocation (no stale cell-cache reuse)', async () => { + const dirs: string[] = [] + const b = board({ + export: async (_result, ctx: LeaderboardRunContext) => { + dirs.push(ctx.runDir) + }, + }) + await b.run([...AXIS, '--cases', 'case-alpha']) + await b.run([...AXIS, '--cases', 'case-alpha']) + expect(dirs).toHaveLength(2) + expect(dirs[0]).not.toBe(dirs[1]) + for (const d of dirs) expect(d.startsWith(tmpdir())).toBe(true) + }) + + it('honors an explicit --run-dir (the opt-in resume path)', async () => { + const runDir = mkdtempSync(join(tmpdir(), 'lb-explicit-')) + let seen: string | undefined + await board({ + export: async (_r, ctx) => { + seen = ctx.runDir + }, + }).run([...AXIS, '--cases', 'case-alpha', '--run-dir', runDir]) + expect(seen).toBe(runDir) + }) + + it('subsets cases via --cases and rejects unknown ids', async () => { + const result = await board().run([...AXIS, '--cases', 'case-beta']) + expect(result.records).toHaveLength(1) + expect(Object.keys(result.byScenario)).toEqual(['case-beta']) + + await expect(board().run([...AXIS, '--cases', 'nope'])).rejects.toThrow(/unknown case "nope"/) + }) + + it('stamps a snapshot onto bare model ids (RunRecord identity requirement)', async () => { + const result = await board().run([ + '--backend', + 'inproc', + '--harnesses', + 'opencode', + '--models', + 'test-model', + ]) + expect(result.records[0]?.model).toBe('test-model@leaderboard') + }) + + it('wraps score() as the campaign judge, carrying dimensions and notes', async () => { + const result = await board({ + score: (output, c) => ({ + composite: output.includes(c.answer) ? 
0.5 : 0, + dimensions: { exactness: 1 }, + notes: 'structured', + }), + }).run([...AXIS, '--cases', 'case-alpha']) + expect(Object.values(result.byProfile)[0]?.meanComposite).toBe(0.5) + const outcome = result.records[0]?.outcome as { searchScore?: number } | undefined + expect(outcome?.searchScore).toBe(0.5) + }) + + it('feeds each cell raw events + case through onCellEvents (the metric-capture seam)', async () => { + const seen: Array<{ id: string; types: string[] }> = [] + await board({ + onCellEvents: (events, c) => { + seen.push({ id: c.id, types: events.map((e) => (e as { type: string }).type) }) + }, + }).run([...AXIS]) + expect(seen.map((s) => s.id).sort()).toEqual(['case-alpha', 'case-beta']) + for (const s of seen) expect(s.types).toContain('llm_call') + }) + + it('parses spec.flags and surfaces every flag to the hooks via ctx.args', async () => { + let args: Record<string, string | undefined> = {} + await board({ + flags: { split: { default: 'dev', description: 'dataset split' } }, + setup: (ctx) => { + args = ctx.args + }, + }).run([...AXIS, '--cases', 'case-alpha', '--split', 'holdout']) + expect(args.split).toBe('holdout') + expect(args.backend).toBe('inproc') + expect(args.harnesses).toBe('opencode') + }) + + it("fails loud on the default 'sandbox' backend with guidance to supply a real client", async () => { + await expect( + defineLeaderboard({ + name: 'no-backend', + cases: CASES, + prompt: (c) => c.id, + score: () => 0, + }).run(['--models', 'm@1']), + ).rejects.toThrow(/backends\.sandbox/) + }) + + it('toBenchmarkAdapter(): loadTasks/judge round-trip in the structural BenchmarkAdapter shape', async () => { + const adapter = board().toBenchmarkAdapter() + expect(adapter.name).toBe('fake-board') + await adapter.preflight() + + const tasks = await adapter.loadTasks() + expect(tasks.map((t) => t.id)).toEqual(['case-alpha', 'case-beta']) + expect(tasks[0]?.prompt).toContain('answer=ALPHA-42') + + const pass = await adapter.judge(tasks[0] as { id: string; prompt: string }, 
'final ALPHA-42') + expect(pass).toMatchObject({ resolved: true, score: 1 }) + const fail = await adapter.judge(tasks[0] as { id: string; prompt: string }, 'wrong') + expect(fail).toMatchObject({ resolved: false, score: 0 }) + + const subset = await adapter.loadTasks({ ids: ['case-beta'] }) + expect(subset.map((t) => t.id)).toEqual(['case-beta']) + expect(await adapter.goldArtifact(tasks[0] as { id: string; prompt: string })).toBeUndefined() + + // preflight fails loud on duplicate ids — the cheap corpus-integrity check. + const dup = defineLeaderboard({ + name: 'dup', + cases: [CASES[0] as FakeCase, CASES[0] as FakeCase], + prompt: (c) => c.id, + score: () => 0, + }).toBenchmarkAdapter() + await expect(dup.preflight()).rejects.toThrow(/duplicate case id/) + }) +}) diff --git a/src/runtime/define-leaderboard.ts b/src/runtime/define-leaderboard.ts new file mode 100644 index 0000000..a31dffa --- /dev/null +++ b/src/runtime/define-leaderboard.ts @@ -0,0 +1,551 @@ +/** + * `defineLeaderboard` — the declarative eval-leaderboard facade. + * + * A product's harness×model leaderboard is always the same assembly: expand a + * base profile across the harness×model axes (`expandProfileAxes`), run every + * (profile, case) cell as a driven loop (`loopDispatch` + `naiveDriver`), score + * with the domain's grader, and emit ONE `runProfileMatrix` call. Each product + * hand-rolled that assembly (~650 lines each) and re-hit the same footguns: + * stale cell-cache reuse, zero-token stub cells, missing model snapshots. + * + * This facade IS that assembly, once, with the domain reduced to a declarative + * spec: `cases` + `prompt` + `score`. It contains NO execution, judging, or + * metering logic of its own — every moving part is an existing primitive, and + * every default is overridable: + * + * - LEVEL 0 (declarative): `cases` / `prompt` / `score` / `axis`. 
+ * - LEVEL 1 (seams): `backends`, `flags`, `parseOutput`, `onCellEvents`, + * `setup`/`teardown`, `export`, `modelBackend`, `matrix` passthrough. + * - LEVEL 2 (replacement): `dispatch` and `judges` swap out the whole + * loop wiring or scoring; `runProfileMatrix` itself stays public as the + * escape floor — a product overriding everything just writes what it has + * today, no capability removed. + * + * `toBenchmarkAdapter()` exposes the same domain surface in the structural + * `BenchmarkAdapter` shape (`name`/`preflight`/`loadTasks`/`judge`/ + * `goldArtifact`) so a product leaderboard can register into a benchmark + * registry without this module depending on one. + * + * @experimental + */ +import { execFileSync } from 'node:child_process' +import { mkdirSync, writeFileSync } from 'node:fs' +import { tmpdir } from 'node:os' +import { join } from 'node:path' +import { + type AgentProfile, + CODING_HARNESSES, + expandProfileAxes, + type HarnessType, + harnessAxisOf, +} from '@tangle-network/agent-eval' +import { + type JudgeConfig, + type ProfileDispatchFn, + type RunProfileMatrixOptions, + type RunProfileMatrixResult, + runProfileMatrix, + type Scenario, +} from '@tangle-network/agent-eval/campaign' +import { collectAgentResponseText, type SandboxEvent } from '@tangle-network/sandbox' +import { leaderboard, renderLeaderboardMarkdown } from './benchmark-report' +import { loopDispatch } from './loop-dispatch' +import { resolveSandboxClient } from './resolve-sandbox-client' +import { naiveDriver, type SteeringDecision } from './steering-drivers' +import type { SandboxClient } from './types' + +/** Structured per-case verdict a `score` function may return (a bare number is + * shorthand for `{ composite }`). `composite` is the [0,1] leaderboard score; + * `dimensions` are recorded as extra judge dimensions. 
*/ +export interface LeaderboardScore { + composite: number + dimensions?: Record + notes?: string +} + +/** The campaign scenario a case is wrapped into: the case rides along so + * judges and hooks can reach the full domain payload, not just its id. */ +export interface LeaderboardScenario extends Scenario { + case: TCase +} + +/** One extra CLI flag a spec declares. Parsed by `run()` as `-- ` + * and surfaced to every hook via `ctx.args`. */ +export interface LeaderboardFlagSpec { + default?: string + description: string +} + +/** Resolved run configuration handed to `setup` / `teardown` / `export`. */ +export interface LeaderboardRunContext { + name: string + /** Execution backend name (`--backend`), a key of `backends`. */ + backend: string + runDir: string + exportDir: string + /** Every parsed flag (standard + `spec.flags`), by name without `--`. */ + args: Record + harnesses: readonly HarnessType[] + /** Snapshot-stamped model ids (`name@snapshot`) — the eval identity models. */ + models: readonly string[] + caseIds: readonly string[] + shots: number + reps: number +} + +/** Structurally `BenchTask` (bench registry shape) — declared locally so this + * module adds no dependency on a benchmark package. */ +export interface LeaderboardBenchTask { + id: string + prompt: string + split?: string + metadata?: Record +} + +/** Structurally `BenchScore` (bench registry shape). */ +export interface LeaderboardBenchScore { + resolved: boolean + score: number + detail?: string +} + +/** Structurally `BenchmarkAdapter` (bench registry shape): `name`, + * `preflight()`, `loadTasks()`, deterministic `judge()`, `goldArtifact()`. 
*/ +export interface LeaderboardBenchmarkAdapter { + readonly name: string + preflight(): Promise + loadTasks(opts?: { + limit?: number + split?: string + ids?: string[] + }): Promise + judge(task: LeaderboardBenchTask, artifact: string): Promise + goldArtifact(task: LeaderboardBenchTask): Promise +} + +export interface LeaderboardSpec { + /** Leaderboard name — the scenario `kind`, default profile name, and report title. */ + name: string + /** The case corpus. Every case needs a stable string id (see `caseId`). */ + cases: TCase[] + /** Stable id extractor. Default: the case's own `id` property (fail-loud + * when absent or not a string). */ + caseId?: (c: TCase) => string + /** The per-case task prompt. May be async (e.g. built by shelling out to a + * reference implementation); resolved ONCE per case before dispatch. */ + prompt: (c: TCase) => string | Promise + /** The domain grader: agent output text → score. Used BOTH as the per-shot + * validator (a shot with `composite > 0` stops the naive retry loop) and, + * wrapped as a campaign judge, as the recorded leaderboard score. */ + score: (output: string, c: TCase) => number | LeaderboardScore + /** Harness × model axes for `expandProfileAxes`. Defaults: the canonical + * `CODING_HARNESSES` × the base profile's `model.default`. `--harnesses` / + * `--models` override per run. */ + axis?: { harnesses?: readonly HarnessType[]; models?: readonly string[] } + /** Base profile the axes expand over (prompt/tools/skills held fixed). + * Default: a minimal `{ name, model: { default: } }`. */ + baseProfile?: AgentProfile + /** + * Execution-backend registry: `--backend ` picks the factory that + * yields the `SandboxClient` every cell runs on. Merged over the defaults: + * - `sandbox` — throws with guidance (a product must supply its real + * Sandbox-backed client; the facade has no credentials). 
+ * - `cli-bridge` — `resolveSandboxClient({ backend: 'bridge' })` reading + * `CLI_BRIDGE_URL` + `BRIDGE_BEARER`/`CLI_BRIDGE_BEARER`; the per-cell + * harness/model ride in via `sandboxOverrides.backend`. + */ + backends?: Record SandboxClient) | undefined> + /** Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`. */ + flags?: Record + /** Extra fields merged into each cell's `backend.model` create override — + * e.g. `{ provider: 'openai-compat', apiKey, baseUrl }` for a router-backed + * sandbox. The cell's bare model id is set by the facade from the axis. */ + modelBackend?: Record + /** Runs once before the matrix (fetch fixtures, warm caches). */ + setup?: (ctx: LeaderboardRunContext) => Promise | void + /** Runs once after the matrix, even on failure (reap boxes, close handles). */ + teardown?: (ctx: LeaderboardRunContext) => Promise | void + /** Per-cell event tap: the raw sandbox events of each parsed iteration, + * with the case — the seam for domain metric capture (search counts, + * citations) without a substrate change. */ + onCellEvents?: (events: readonly SandboxEvent[], c: TCase) => void + /** Output decode override: raw events → the scored output text. Default: + * the sandbox SDK's `collectAgentResponseText` (final answer text; empty + * string when the stream carried none — which then scores 0). */ + parseOutput?: (events: readonly SandboxEvent[], c: TCase) => string + /** Result export. Default: write `matrix-result.json` under the run dir and + * print (+ write) the ranked leaderboard markdown under the export dir. */ + export?: ( + result: RunProfileMatrixResult>, + ctx: LeaderboardRunContext, + ) => Promise | void + /** LEVEL 2 — full dispatch replacement (in-process products bring their own). + * The default is `loopDispatch` + `naiveDriver` over the resolved backend. */ + dispatch?: ProfileDispatchFn, string> + /** LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge. 
*/ + judges?: JudgeConfig>[] + /** Naive-retry shot cap per cell (`--shots`). Default 1. */ + shots?: number + /** Replicates per cell (`--reps`). Default 1. */ + reps?: number + /** Passthrough overrides spread onto the final `runProfileMatrix` call + * (e.g. `maxConcurrency`, `costCeiling`, `integrity`, `storage`) — spread + * LAST, so anything the facade wired can be overridden. */ + matrix?: Partial, string>> +} + +export interface DefinedLeaderboard { + /** + * Parse flags, run the matrix, export, and return the raw result. + * + * Standard flags: `--backend ` (default `sandbox`), `--harnesses a,b`, + * `--models m1,m2`, `--cases id1,id2`, `--shots N`, `--reps N`, + * `--model-snapshot `, `--run-dir `, `--export-dir `, + * plus every `spec.flags` entry. `argv` defaults to `process.argv.slice(2)`. + * + * The default run dir is FRESH per invocation (timestamp+pid under the OS + * tmpdir). `runProfileMatrix` caches cells by run dir, and a stable default + * would silently reuse a prior FAILED zero-token cell and skip dispatch — + * only an explicit `--run-dir` opts into that resume behavior. + */ + run(argv?: string[]): Promise>> + /** The same domain surface in the structural `BenchmarkAdapter` shape. */ + toBenchmarkAdapter(): LeaderboardBenchmarkAdapter +} + +/** Read `--name ` from an argv array. */ +function argOf(argv: readonly string[], name: string): string | undefined { + const i = argv.indexOf(`--${name}`) + if (i >= 0 && i + 1 < argv.length) return argv[i + 1] + return undefined +} + +function splitList(v: string | undefined): string[] | undefined { + if (v === undefined) return undefined + const parts = v + .split(',') + .map((s) => s.trim()) + .filter(Boolean) + return parts.length > 0 ? parts : undefined +} + +/** RunRecords reject a bare model id — the eval IDENTITY model must carry a + * snapshot (`name@`). Unchanged when already stamped. */ +function withSnapshot(model: string, snapshot: string): string { + return model.includes('@') ? 
model : `${model}@${snapshot}` +} + +/** The bare model id the backend actually serves (identity snapshot stripped). */ +function bareModel(model: string): string { + return model.split('@')[0] ?? model +} + +function gitSha(): string { + try { + return execFileSync('git', ['rev-parse', 'HEAD'], { encoding: 'utf8' }).trim() + } catch { + return 'unknown' + } +} + +function normalizeScore(s: number | LeaderboardScore): LeaderboardScore { + return typeof s === 'number' ? { composite: s } : s +} + +export function defineLeaderboard(spec: LeaderboardSpec): DefinedLeaderboard { + const caseId = (c: TCase): string => { + const id = spec.caseId ? spec.caseId(c) : (c as { id?: unknown }).id + if (typeof id !== 'string' || id.length === 0) { + throw new Error( + `defineLeaderboard(${spec.name}): every case needs a stable string id — ` + + 'give cases an `id` property or supply spec.caseId', + ) + } + return id + } + + const selectCases = (ids?: readonly string[]): TCase[] => { + if (!ids) return spec.cases + const byId = new Map(spec.cases.map((c) => [caseId(c), c])) + return ids.map((id) => { + const c = byId.get(id) + if (c === undefined) { + throw new Error( + `defineLeaderboard(${spec.name}): unknown case "${id}" (have: ${[...byId.keys()].join(', ')})`, + ) + } + return c + }) + } + + const scoreJudge: JudgeConfig> = { + name: `${spec.name}-score`, + dimensions: [{ key: 'composite', description: `${spec.name} case score` }], + score({ artifact, scenario }) { + const s = normalizeScore(spec.score(artifact, scenario.case)) + return { + composite: s.composite, + dimensions: { composite: s.composite, ...s.dimensions }, + notes: s.notes ?? 
'', + } + }, + } + + async function run( + argv: string[] = process.argv.slice(2), + ): Promise>> { + const args: Record = {} + for (const name of [ + 'backend', + 'harnesses', + 'models', + 'cases', + 'shots', + 'reps', + 'model-snapshot', + 'run-dir', + 'export-dir', + ]) { + args[name] = argOf(argv, name) + } + for (const [name, flag] of Object.entries(spec.flags ?? {})) { + args[name] = argOf(argv, name) ?? flag.default + } + + const backendName = args.backend ?? 'sandbox' + const shots = Number(args.shots ?? spec.shots ?? 1) + const reps = Number(args.reps ?? spec.reps ?? 1) + const snapshot = args['model-snapshot'] ?? 'leaderboard' + // FRESH run dir per invocation: runProfileMatrix caches cells by run dir, + // and a stable default resumes a prior FAILED zero-token cell without + // re-dispatching. Only an explicit --run-dir opts into resume. + const runDir = + args['run-dir'] ?? join(tmpdir(), `leaderboard-${spec.name}-${Date.now()}-${process.pid}`) + const exportDir = args['export-dir'] ?? join(runDir, 'export') + mkdirSync(runDir, { recursive: true }) + + const cases = selectCases(splitList(args.cases)) + if (cases.length === 0) throw new Error(`defineLeaderboard(${spec.name}): no cases to run`) + const scenarios: LeaderboardScenario[] = cases.map((c) => ({ + id: caseId(c), + kind: spec.name, + case: c, + })) + + const harnesses = + (splitList(args.harnesses) as HarnessType[] | undefined) ?? + spec.axis?.harnesses ?? + CODING_HARNESSES + const rawModels = + splitList(args.models) ?? + spec.axis?.models ?? + (spec.baseProfile?.model?.default !== undefined + ? [spec.baseProfile.model.default] + : undefined) + if (!rawModels || rawModels.length === 0) { + throw new Error( + `defineLeaderboard(${spec.name}): no models — pass --models, set spec.axis.models, ` + + 'or give spec.baseProfile a model.default', + ) + } + const models = rawModels.map((m) => withSnapshot(m, snapshot)) + + const base: AgentProfile = + spec.baseProfile ?? 
+ ({ name: spec.name, model: { default: bareModel(models[0] ?? '') } } as AgentProfile) + const profiles = expandProfileAxes({ base, harnesses, models }) + + const ctx: LeaderboardRunContext = { + name: spec.name, + backend: backendName, + runDir, + exportDir, + args, + harnesses, + models, + caseIds: scenarios.map((s) => s.id), + shots, + reps, + } + + // Backend registry: defaults + spec overrides (spec wins). Factories are + // lazy so an unused backend never resolves credentials. + const backends: Record SandboxClient) | undefined> = { + sandbox: () => { + throw new Error( + `defineLeaderboard(${spec.name}): the 'sandbox' backend needs your product's real ` + + 'SandboxClient — supply spec.backends.sandbox (e.g. () => new SandboxClient({ apiKey, baseUrl }))', + ) + }, + 'cli-bridge': () => { + const bearer = process.env.BRIDGE_BEARER ?? process.env.CLI_BRIDGE_BEARER + if (!bearer) { + throw new Error( + `defineLeaderboard(${spec.name}): backend 'cli-bridge' needs BRIDGE_BEARER or CLI_BRIDGE_BEARER set`, + ) + } + return resolveSandboxClient({ + backend: 'bridge', + bridge: { + url: process.env.CLI_BRIDGE_URL, + bearer, + model: bareModel(models[0] ?? ''), + timeoutMs: 900_000, + }, + }) + }, + ...spec.backends, + } + const makeClient = backends[backendName] + if (!makeClient) { + throw new Error( + `defineLeaderboard(${spec.name}): unknown backend "${backendName}" (have: ${Object.keys(backends).join(', ')})`, + ) + } + const sandboxClient = makeClient() + + // Prompts resolve ONCE per case, up front — spec.prompt may be async + // (shelling out to a reference implementation) but the loop kernel's + // taskToPrompt is sync. 
+ const promptById = new Map() + for (const s of scenarios) promptById.set(s.id, await spec.prompt(s.case)) + const promptOf = (s: LeaderboardScenario): string => { + const p = promptById.get(s.id) + if (p === undefined) + throw new Error(`defineLeaderboard(${spec.name}): no prompt for case "${s.id}"`) + return p + } + + // Monotonic per-shot nonce appended to each shot's prompt — defeats router + // response-caching of byte-identical prompts across naive-retry shots. + let shotNonce = 0 + + const dispatch = + spec.dispatch ?? + loopDispatch< + LeaderboardScenario, + string, + SteeringDecision, + LeaderboardScenario, + string + >({ + sandboxClient, + toLoopOptions: (scenario, profile) => { + // The cell's harness + model come off the profile's axis stamp set + // by expandProfileAxes; the sandbox create override carries them to + // whichever backend client runs the cell. + const axis = harnessAxisOf(profile) + const modelId = bareModel(axis?.model ?? models[0] ?? '') + return { + // naiveDriver = the no-signal retry floor: re-run the same case as + // an independent attempt until one scores (>0) or the shot cap. + driver: naiveDriver, string>({ + continuation: '', + applyContinuation: (task) => task, + maxIterations: shots, + }), + agentRun: { + profile, + taskToPrompt: (s) => `${promptOf(s)}\n\n`, + ...(axis + ? { + sandboxOverrides: { + backend: { + type: axis.harness, + model: { ...spec.modelBackend, model: modelId }, + }, + } as never, + } + : {}), + }, + output: { + parse: (events) => { + spec.onCellEvents?.(events, scenario.case) + return spec.parseOutput + ? spec.parseOutput(events, scenario.case) + : (collectAgentResponseText(events) ?? 
'') + }, + }, + validator: { + validate: async (output: string) => { + const s = normalizeScore(spec.score(output, scenario.case)) + return { valid: s.composite > 0, score: s.composite } + }, + }, + task: scenario, + maxIterations: shots, + } + }, + }) + + await spec.setup?.(ctx) + try { + const result = await runProfileMatrix, string>({ + profiles, + scenarios, + dispatch, + judges: spec.judges ?? [scoreJudge], + runDir, + commitSha: gitSha(), + reps, + ...spec.matrix, + }) + + if (spec.export) { + await spec.export(result, ctx) + } else { + mkdirSync(exportDir, { recursive: true }) + writeFileSync(join(runDir, 'matrix-result.json'), `${JSON.stringify(result, null, 2)}\n`) + const table = renderLeaderboardMarkdown( + leaderboard(result.records, { title: spec.name, meta: { backend: backendName } }), + ) + writeFileSync(join(exportDir, 'leaderboard.md'), table) + console.log(table) + } + return result + } finally { + await spec.teardown?.(ctx) + } + } + + function toBenchmarkAdapter(): LeaderboardBenchmarkAdapter { + return { + name: spec.name, + async preflight(): Promise { + // Case-id integrity is the cheap, real check: duplicate or missing ids + // corrupt every downstream join. + const seen = new Set() + for (const c of spec.cases) { + const id = caseId(c) + if (seen.has(id)) { + throw new Error(`defineLeaderboard(${spec.name}): duplicate case id "${id}"`) + } + seen.add(id) + } + }, + async loadTasks(opts): Promise { + const selected = selectCases(opts?.ids) + const limited = opts?.limit !== undefined ? 
selected.slice(0, opts.limit) : selected + return Promise.all( + limited.map(async (c) => ({ + id: caseId(c), + prompt: await spec.prompt(c), + metadata: { case: c }, + })), + ) + }, + async judge(task, artifact): Promise { + const [c] = selectCases([task.id]) + if (c === undefined) + throw new Error(`defineLeaderboard(${spec.name}): no case "${task.id}"`) + const s = normalizeScore(spec.score(artifact, c)) + return { resolved: s.composite > 0, score: s.composite, detail: s.notes } + }, + async goldArtifact(): Promise { + return undefined + }, + } + } + + return { run, toBenchmarkAdapter } +} diff --git a/src/runtime/index.ts b/src/runtime/index.ts index da18bc2..72c0739 100644 --- a/src/runtime/index.ts +++ b/src/runtime/index.ts @@ -87,6 +87,21 @@ export { sentinelCompletion, stopSentinel, } from './completion' +// The declarative eval-leaderboard facade: cases + prompt + score → one +// runProfileMatrix call (expandProfileAxes × loopDispatch × naiveDriver), +// with a structural BenchmarkAdapter view via toBenchmarkAdapter(). +export { + type DefinedLeaderboard, + defineLeaderboard, + type LeaderboardBenchmarkAdapter, + type LeaderboardBenchScore, + type LeaderboardBenchTask, + type LeaderboardFlagSpec, + type LeaderboardRunContext, + type LeaderboardScenario, + type LeaderboardScore, + type LeaderboardSpec, +} from './define-leaderboard' export { type AgentEnvironment, type AgentEnvironmentCapabilities, From 76d764d078b69c3cef107fd06c88b96392fe9ce2 Mon Sep 17 00:00:00 2001 From: Drew Stone Date: Fri, 3 Jul 2026 00:47:08 -0600 Subject: [PATCH 2/2] docs(api): regenerate for defineLeaderboard --- docs/api/runtime.md | 120 ++++++++++++++++++++++---------------------- 1 file changed, 60 insertions(+), 60 deletions(-) diff --git a/docs/api/runtime.md b/docs/api/runtime.md index 067d1e5..265a155 100644 --- a/docs/api/runtime.md +++ b/docs/api/runtime.md @@ -1310,7 +1310,7 @@ Minimum confidence a PROBABILISTIC verdict must clear to end. Default 0.8. 
### LeaderboardScore -Defined in: runtime/define-leaderboard.ts:60 +Defined in: [runtime/define-leaderboard.ts:60](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L60) Structured per-case verdict a `score` function may return (a bare number is shorthand for `{ composite }`). `composite` is the [0,1] leaderboard score; @@ -1322,25 +1322,25 @@ Structured per-case verdict a `score` function may return (a bare number is > **composite**: `number` -Defined in: runtime/define-leaderboard.ts:61 +Defined in: [runtime/define-leaderboard.ts:61](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L61) ##### dimensions? > `optional` **dimensions?**: `Record`\<`string`, `number`\> -Defined in: runtime/define-leaderboard.ts:62 +Defined in: [runtime/define-leaderboard.ts:62](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L62) ##### notes? > `optional` **notes?**: `string` -Defined in: runtime/define-leaderboard.ts:63 +Defined in: [runtime/define-leaderboard.ts:63](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L63) *** ### LeaderboardScenario -Defined in: runtime/define-leaderboard.ts:68 +Defined in: [runtime/define-leaderboard.ts:68](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L68) The campaign scenario a case is wrapped into: the case rides along so judges and hooks can reach the full domain payload, not just its id. 
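The bare-number shorthand described for `LeaderboardScore` can be sketched as below. This is a minimal, standalone illustration that mirrors the `normalizeScore` helper in the patch; the interface is redeclared locally, and `exactMatch` is a hypothetical grader invented for the example, not part of the package.

```typescript
// Local mirror of the documented shape (redeclared so the sketch is self-contained).
interface LeaderboardScore {
  composite: number
  dimensions?: Record<string, number>
  notes?: string
}

// A bare number is shorthand for `{ composite }`; a structured score passes through.
function normalizeScore(s: number | LeaderboardScore): LeaderboardScore {
  return typeof s === 'number' ? { composite: s } : s
}

// Hypothetical grader: returns the shorthand on a match and a structured
// verdict (with an extra dimension and a note) on a mismatch.
function exactMatch(output: string, expected: string): number | LeaderboardScore {
  if (output.trim() === expected) return 1
  return { composite: 0, dimensions: { length: output.length }, notes: 'mismatch' }
}

const hit = normalizeScore(exactMatch(' ALPHA-42 ', 'ALPHA-42'))
const miss = normalizeScore(exactMatch('wrong', 'ALPHA-42'))
```

Both return shapes reach the judge in the same normalized form, so a grader can start with a plain number and add `dimensions` or `notes` later without changing its callers.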
@@ -1361,13 +1361,13 @@ The campaign scenario a case is wrapped into: the case rides along so > **case**: `TCase` -Defined in: runtime/define-leaderboard.ts:69 +Defined in: [runtime/define-leaderboard.ts:69](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L69) *** ### LeaderboardFlagSpec -Defined in: runtime/define-leaderboard.ts:74 +Defined in: [runtime/define-leaderboard.ts:74](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L74) One extra CLI flag a spec declares. Parsed by `run()` as `-- ` and surfaced to every hook via `ctx.args`. @@ -1378,19 +1378,19 @@ One extra CLI flag a spec declares. Parsed by `run()` as `-- ` > `optional` **default?**: `string` -Defined in: runtime/define-leaderboard.ts:75 +Defined in: [runtime/define-leaderboard.ts:75](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L75) ##### description > **description**: `string` -Defined in: runtime/define-leaderboard.ts:76 +Defined in: [runtime/define-leaderboard.ts:76](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L76) *** ### LeaderboardRunContext -Defined in: runtime/define-leaderboard.ts:80 +Defined in: [runtime/define-leaderboard.ts:80](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L80) Resolved run configuration handed to `setup` / `teardown` / `export`. @@ -1400,13 +1400,13 @@ Resolved run configuration handed to `setup` / `teardown` / `export`. 
> **name**: `string` -Defined in: runtime/define-leaderboard.ts:81 +Defined in: [runtime/define-leaderboard.ts:81](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L81) ##### backend > **backend**: `string` -Defined in: runtime/define-leaderboard.ts:83 +Defined in: [runtime/define-leaderboard.ts:83](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L83) Execution backend name (`--backend`), a key of `backends`. @@ -1414,19 +1414,19 @@ Execution backend name (`--backend`), a key of `backends`. > **runDir**: `string` -Defined in: runtime/define-leaderboard.ts:84 +Defined in: [runtime/define-leaderboard.ts:84](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L84) ##### exportDir > **exportDir**: `string` -Defined in: runtime/define-leaderboard.ts:85 +Defined in: [runtime/define-leaderboard.ts:85](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L85) ##### args > **args**: `Record`\<`string`, `string` \| `undefined`\> -Defined in: runtime/define-leaderboard.ts:87 +Defined in: [runtime/define-leaderboard.ts:87](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L87) Every parsed flag (standard + `spec.flags`), by name without `--`. @@ -1434,13 +1434,13 @@ Every parsed flag (standard + `spec.flags`), by name without `--`. > **harnesses**: readonly `HarnessType`[] -Defined in: runtime/define-leaderboard.ts:88 +Defined in: [runtime/define-leaderboard.ts:88](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L88) ##### models > **models**: readonly `string`[] -Defined in: runtime/define-leaderboard.ts:90 +Defined in: [runtime/define-leaderboard.ts:90](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L90) Snapshot-stamped model ids (`name@snapshot`) — the eval identity models. 
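The `name@snapshot` stamping mentioned for `models` can be illustrated with a short sketch. It reproduces the `withSnapshot` and `bareModel` helpers from the patch in standalone form; the model ids `model-a` and `model-b` are placeholders, and `leaderboard` is the default snapshot label the patch falls back to when `--model-snapshot` is absent.

```typescript
// Stamp a bare model id with a snapshot; ids that already carry `@` are left unchanged.
function withSnapshot(model: string, snapshot: string): string {
  return model.includes('@') ? model : `${model}@${snapshot}`
}

// Strip the identity snapshot to recover the id the backend actually serves.
function bareModel(model: string): string {
  return model.split('@')[0] ?? model
}

// Placeholder ids: one bare, one already stamped.
const rawModels = ['model-a', 'model-b@2026-01-01']
const stamped = rawModels.map((m) => withSnapshot(m, 'leaderboard'))
const served = stamped.map(bareModel)
```

The stamped form is what ends up in `ctx.models` and in the run records, while the bare form is what the sandbox create override carries to the backend.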
@@ -1448,25 +1448,25 @@ Snapshot-stamped model ids (`name@snapshot`) — the eval identity models. > **caseIds**: readonly `string`[] -Defined in: runtime/define-leaderboard.ts:91 +Defined in: [runtime/define-leaderboard.ts:91](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L91) ##### shots > **shots**: `number` -Defined in: runtime/define-leaderboard.ts:92 +Defined in: [runtime/define-leaderboard.ts:92](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L92) ##### reps > **reps**: `number` -Defined in: runtime/define-leaderboard.ts:93 +Defined in: [runtime/define-leaderboard.ts:93](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L93) *** ### LeaderboardBenchTask -Defined in: runtime/define-leaderboard.ts:98 +Defined in: [runtime/define-leaderboard.ts:98](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L98) Structurally `BenchTask` (bench registry shape) — declared locally so this module adds no dependency on a benchmark package. @@ -1477,31 +1477,31 @@ Structurally `BenchTask` (bench registry shape) — declared locally so this > **id**: `string` -Defined in: runtime/define-leaderboard.ts:99 +Defined in: [runtime/define-leaderboard.ts:99](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L99) ##### prompt > **prompt**: `string` -Defined in: runtime/define-leaderboard.ts:100 +Defined in: [runtime/define-leaderboard.ts:100](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L100) ##### split? > `optional` **split?**: `string` -Defined in: runtime/define-leaderboard.ts:101 +Defined in: [runtime/define-leaderboard.ts:101](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L101) ##### metadata? 
> `optional` **metadata?**: `Record`\<`string`, `unknown`\> -Defined in: runtime/define-leaderboard.ts:102 +Defined in: [runtime/define-leaderboard.ts:102](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L102) *** ### LeaderboardBenchScore -Defined in: runtime/define-leaderboard.ts:106 +Defined in: [runtime/define-leaderboard.ts:106](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L106) Structurally `BenchScore` (bench registry shape). @@ -1511,25 +1511,25 @@ Structurally `BenchScore` (bench registry shape). > **resolved**: `boolean` -Defined in: runtime/define-leaderboard.ts:107 +Defined in: [runtime/define-leaderboard.ts:107](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L107) ##### score > **score**: `number` -Defined in: runtime/define-leaderboard.ts:108 +Defined in: [runtime/define-leaderboard.ts:108](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L108) ##### detail? > `optional` **detail?**: `string` -Defined in: runtime/define-leaderboard.ts:109 +Defined in: [runtime/define-leaderboard.ts:109](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L109) *** ### LeaderboardBenchmarkAdapter -Defined in: runtime/define-leaderboard.ts:114 +Defined in: [runtime/define-leaderboard.ts:114](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L114) Structurally `BenchmarkAdapter` (bench registry shape): `name`, `preflight()`, `loadTasks()`, deterministic `judge()`, `goldArtifact()`. 
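Two adapter behaviors from the patch are easy to show in isolation: `preflight()` rejects duplicate case ids, and `judge()` maps a composite score to `resolved: composite > 0`. The sketch below restates both as standalone functions; the names `assertUniqueIds` and `toBenchScore` are hypothetical and exist only for this illustration.

```typescript
// Local mirror of the documented `LeaderboardBenchScore` shape.
interface BenchScore {
  resolved: boolean
  score: number
  detail?: string
}

// Same check as the adapter's preflight: fail loudly on the first duplicate id.
function assertUniqueIds(ids: readonly string[]): void {
  const seen = new Set<string>()
  for (const id of ids) {
    if (seen.has(id)) throw new Error(`duplicate case id "${id}"`)
    seen.add(id)
  }
}

// Same mapping as the adapter's judge: any positive composite counts as resolved.
function toBenchScore(composite: number, notes?: string): BenchScore {
  return { resolved: composite > 0, score: composite, detail: notes }
}

// Collect the outcome instead of letting the error escape the example.
let duplicateError = ''
try {
  assertUniqueIds(['case-alpha', 'case-alpha'])
} catch (err) {
  duplicateError = (err as Error).message
}
const pass = toBenchScore(1)
const fail = toBenchScore(0, 'wrong answer')
```

Note that partial credit, for example a composite of 0.25, also counts as `resolved` under this mapping; a registry that wants a stricter threshold would need to apply it on its own side.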
@@ -1540,7 +1540,7 @@ Structurally `BenchmarkAdapter` (bench registry shape): `name`, > `readonly` **name**: `string` -Defined in: runtime/define-leaderboard.ts:115 +Defined in: [runtime/define-leaderboard.ts:115](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L115) #### Methods @@ -1548,7 +1548,7 @@ Defined in: runtime/define-leaderboard.ts:115 > **preflight**(): `Promise`\<`void`\> -Defined in: runtime/define-leaderboard.ts:116 +Defined in: [runtime/define-leaderboard.ts:116](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L116) ###### Returns @@ -1558,7 +1558,7 @@ Defined in: runtime/define-leaderboard.ts:116 > **loadTasks**(`opts?`): `Promise`\<[`LeaderboardBenchTask`](#leaderboardbenchtask)[]\> -Defined in: runtime/define-leaderboard.ts:117 +Defined in: [runtime/define-leaderboard.ts:117](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L117) ###### Parameters @@ -1584,7 +1584,7 @@ Defined in: runtime/define-leaderboard.ts:117 > **judge**(`task`, `artifact`): `Promise`\<[`LeaderboardBenchScore`](#leaderboardbenchscore)\> -Defined in: runtime/define-leaderboard.ts:122 +Defined in: [runtime/define-leaderboard.ts:122](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L122) ###### Parameters @@ -1604,7 +1604,7 @@ Defined in: runtime/define-leaderboard.ts:122 > **goldArtifact**(`task`): `Promise`\<`string` \| `undefined`\> -Defined in: runtime/define-leaderboard.ts:123 +Defined in: [runtime/define-leaderboard.ts:123](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L123) ###### Parameters @@ -1620,7 +1620,7 @@ Defined in: runtime/define-leaderboard.ts:123 ### LeaderboardSpec -Defined in: runtime/define-leaderboard.ts:126 +Defined in: 
[runtime/define-leaderboard.ts:126](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L126) #### Type Parameters @@ -1634,7 +1634,7 @@ Defined in: runtime/define-leaderboard.ts:126 > **name**: `string` -Defined in: runtime/define-leaderboard.ts:128 +Defined in: [runtime/define-leaderboard.ts:128](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L128) Leaderboard name — the scenario `kind`, default profile name, and report title. @@ -1642,7 +1642,7 @@ Leaderboard name — the scenario `kind`, default profile name, and report title > **cases**: `TCase`[] -Defined in: runtime/define-leaderboard.ts:130 +Defined in: [runtime/define-leaderboard.ts:130](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L130) The case corpus. Every case needs a stable string id (see `caseId`). @@ -1650,7 +1650,7 @@ The case corpus. Every case needs a stable string id (see `caseId`). > `optional` **caseId?**: (`c`) => `string` -Defined in: runtime/define-leaderboard.ts:133 +Defined in: [runtime/define-leaderboard.ts:133](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L133) Stable id extractor. Default: the case's own `id` property (fail-loud when absent or not a string). @@ -1669,7 +1669,7 @@ Stable id extractor. Default: the case's own `id` property (fail-loud > **prompt**: (`c`) => `string` \| `Promise`\<`string`\> -Defined in: runtime/define-leaderboard.ts:136 +Defined in: [runtime/define-leaderboard.ts:136](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L136) The per-case task prompt. May be async (e.g. built by shelling out to a reference implementation); resolved ONCE per case before dispatch. @@ -1688,7 +1688,7 @@ The per-case task prompt. May be async (e.g. 
built by shelling out to a > **score**: (`output`, `c`) => `number` \| [`LeaderboardScore`](#leaderboardscore) -Defined in: runtime/define-leaderboard.ts:140 +Defined in: [runtime/define-leaderboard.ts:140](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L140) The domain grader: agent output text → score. Used BOTH as the per-shot validator (a shot with `composite > 0` stops the naive retry loop) and, @@ -1712,7 +1712,7 @@ The domain grader: agent output text → score. Used BOTH as the per-shot > `optional` **axis?**: `object` -Defined in: runtime/define-leaderboard.ts:144 +Defined in: [runtime/define-leaderboard.ts:144](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L144) Harness × model axes for `expandProfileAxes`. Defaults: the canonical `CODING_HARNESSES` × the base profile's `model.default`. `--harnesses` / @@ -1730,7 +1730,7 @@ Harness × model axes for `expandProfileAxes`. Defaults: the canonical > `optional` **baseProfile?**: `AgentProfile` -Defined in: runtime/define-leaderboard.ts:147 +Defined in: [runtime/define-leaderboard.ts:147](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L147) Base profile the axes expand over (prompt/tools/skills held fixed). Default: a minimal `{ name, model: { default: } }`. @@ -1739,7 +1739,7 @@ Base profile the axes expand over (prompt/tools/skills held fixed). > `optional` **backends?**: `Record`\<`string`, (() => [`SandboxClient`](#sandboxclient-3)) \| `undefined`\> -Defined in: runtime/define-leaderboard.ts:157 +Defined in: [runtime/define-leaderboard.ts:157](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L157) Execution-backend registry: `--backend ` picks the factory that yields the `SandboxClient` every cell runs on. Merged over the defaults: @@ -1753,7 +1753,7 @@ yields the `SandboxClient` every cell runs on. 
Merged over the defaults: > `optional` **flags?**: `Record`\<`string`, [`LeaderboardFlagSpec`](#leaderboardflagspec)\> -Defined in: runtime/define-leaderboard.ts:159 +Defined in: [runtime/define-leaderboard.ts:159](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L159) Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`. @@ -1761,7 +1761,7 @@ Extra `--flag value` CLI args `run()` parses and surfaces via `ctx.args`. > `optional` **modelBackend?**: `Record`\<`string`, `unknown`\> -Defined in: runtime/define-leaderboard.ts:163 +Defined in: [runtime/define-leaderboard.ts:163](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L163) Extra fields merged into each cell's `backend.model` create override — e.g. `{ provider: 'openai-compat', apiKey, baseUrl }` for a router-backed @@ -1771,7 +1771,7 @@ Extra fields merged into each cell's `backend.model` create override — > `optional` **setup?**: (`ctx`) => `void` \| `Promise`\<`void`\> -Defined in: runtime/define-leaderboard.ts:165 +Defined in: [runtime/define-leaderboard.ts:165](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L165) Runs once before the matrix (fetch fixtures, warm caches). @@ -1789,7 +1789,7 @@ Runs once before the matrix (fetch fixtures, warm caches). > `optional` **teardown?**: (`ctx`) => `void` \| `Promise`\<`void`\> -Defined in: runtime/define-leaderboard.ts:167 +Defined in: [runtime/define-leaderboard.ts:167](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L167) Runs once after the matrix, even on failure (reap boxes, close handles). @@ -1807,7 +1807,7 @@ Runs once after the matrix, even on failure (reap boxes, close handles). 
> `optional` **onCellEvents?**: (`events`, `c`) => `void` -Defined in: runtime/define-leaderboard.ts:171 +Defined in: [runtime/define-leaderboard.ts:171](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L171) Per-cell event tap: the raw sandbox events of each parsed iteration, with the case — the seam for domain metric capture (search counts, @@ -1831,7 +1831,7 @@ readonly `SandboxEvent`[] > `optional` **parseOutput?**: (`events`, `c`) => `string` -Defined in: runtime/define-leaderboard.ts:175 +Defined in: [runtime/define-leaderboard.ts:175](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L175) Output decode override: raw events → the scored output text. Default: the sandbox SDK's `collectAgentResponseText` (final answer text; empty @@ -1855,7 +1855,7 @@ readonly `SandboxEvent`[] > `optional` **export?**: (`result`, `ctx`) => `void` \| `Promise`\<`void`\> -Defined in: runtime/define-leaderboard.ts:178 +Defined in: [runtime/define-leaderboard.ts:178](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L178) Result export. Default: write `matrix-result.json` under the run dir and print (+ write) the ranked leaderboard markdown under the export dir. @@ -1878,7 +1878,7 @@ Result export. Default: write `matrix-result.json` under the run dir and > `optional` **dispatch?**: `ProfileDispatchFn`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\> -Defined in: runtime/define-leaderboard.ts:184 +Defined in: [runtime/define-leaderboard.ts:184](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L184) LEVEL 2 — full dispatch replacement (in-process products bring their own). The default is `loopDispatch` + `naiveDriver` over the resolved backend. @@ -1887,7 +1887,7 @@ LEVEL 2 — full dispatch replacement (in-process products bring their own). 
> `optional` **judges?**: `JudgeConfig`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>[] -Defined in: runtime/define-leaderboard.ts:186 +Defined in: [runtime/define-leaderboard.ts:186](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L186) LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge. @@ -1895,7 +1895,7 @@ LEVEL 2 — full judge replacement. Default: `score` wrapped as one judge. > `optional` **shots?**: `number` -Defined in: runtime/define-leaderboard.ts:188 +Defined in: [runtime/define-leaderboard.ts:188](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L188) Naive-retry shot cap per cell (`--shots`). Default 1. @@ -1903,7 +1903,7 @@ Naive-retry shot cap per cell (`--shots`). Default 1. > `optional` **reps?**: `number` -Defined in: runtime/define-leaderboard.ts:190 +Defined in: [runtime/define-leaderboard.ts:190](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L190) Replicates per cell (`--reps`). Default 1. @@ -1911,7 +1911,7 @@ Replicates per cell (`--reps`). Default 1. > `optional` **matrix?**: `Partial`\<`RunProfileMatrixOptions`\<[`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>, `string`\>\> -Defined in: runtime/define-leaderboard.ts:194 +Defined in: [runtime/define-leaderboard.ts:194](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L194) Passthrough overrides spread onto the final `runProfileMatrix` call (e.g. 
`maxConcurrency`, `costCeiling`, `integrity`, `storage`) — spread @@ -1921,7 +1921,7 @@ Passthrough overrides spread onto the final `runProfileMatrix` call ### DefinedLeaderboard -Defined in: runtime/define-leaderboard.ts:197 +Defined in: [runtime/define-leaderboard.ts:197](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L197) #### Type Parameters @@ -1935,7 +1935,7 @@ Defined in: runtime/define-leaderboard.ts:197 > **run**(`argv?`): `Promise`\<`RunProfileMatrixResult`\<`string`, [`LeaderboardScenario`](#leaderboardscenario)\<`TCase`\>\>\> -Defined in: runtime/define-leaderboard.ts:211 +Defined in: [runtime/define-leaderboard.ts:211](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L211) Parse flags, run the matrix, export, and return the raw result. @@ -1963,7 +1963,7 @@ only an explicit `--run-dir` opts into that resume behavior. > **toBenchmarkAdapter**(): [`LeaderboardBenchmarkAdapter`](#leaderboardbenchmarkadapter) -Defined in: runtime/define-leaderboard.ts:213 +Defined in: [runtime/define-leaderboard.ts:213](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L213) The same domain surface in the structural `BenchmarkAdapter` shape. @@ -15319,7 +15319,7 @@ passes. Ground truth — the driver ends directly, no validation. The check read > **defineLeaderboard**\<`TCase`\>(`spec`): [`DefinedLeaderboard`](#definedleaderboard)\<`TCase`\> -Defined in: runtime/define-leaderboard.ts:255 +Defined in: [runtime/define-leaderboard.ts:255](https://github.com/tangle-network/agent-runtime/blob/main/src/runtime/define-leaderboard.ts#L255) #### Type Parameters