diff --git a/README.md b/README.md index a9ffd45e..76fe63e3 100644 --- a/README.md +++ b/README.md @@ -96,5 +96,9 @@ All 28 live in [`examples/`](./examples). - New here? [`docs/concepts.md`](./docs/concepts.md) — the mental model in plain terms. - [`docs/canonical-api.md`](./docs/canonical-api.md) — find the primitive: "I want to ___ → use ___". +- [`docs/api/primitive-catalog.md`](./docs/api/primitive-catalog.md) — every export in one generated, never-stale list with its import path. Check it before building anything new. +- Import subpaths: the root export is the product surface (`handleChatTurn`, `improve`); deeper capabilities ship as subpaths — `/loops` (multi-agent + the loop kernel), `/mcp` (tool servers), `/intelligence` (observability drop-in), `/lifecycle`, `/agent`, `/profiles`, `/platform`, `/analyst-loop`, `/environment-provider`. - [`docs/architecture.md`](./docs/architecture.md) — the design, end to end. - [`bench/HARNESS.md`](./bench/HARNESS.md) — the experiment harness and how to run a benchmark. + +**Contributing:** `pnpm i && pnpm test` gets you running; the full local gate is the [`package.json`](./package.json) scripts (`lint`, `typecheck`, `docs:check`). 
diff --git a/bench/HARNESS.md b/bench/HARNESS.md index 9ec40d3d..60e37170 100644 --- a/bench/HARNESS.md +++ b/bench/HARNESS.md @@ -124,6 +124,7 @@ the gate + measurement tools: corpus-replay.mts --selector: selector@k vs random@k vs oracle@k over a corpus (THE offline gate) corpus-report.mts paired-bootstrap CI + Benjamini-Hochberg over corpora gate-cli.mts the recursive diverse-vs-blind gate through `runGate` (Supervisor) + run-benchmarks-cli.mts runBenchmarks: any subset of the ADAPTERS registry × model/harness cells, one combined ranked report (#420) commit0-env-run.mts the HARD domain through `runBenchmark` (the optimization suite) terminal-compare.ts Terminal-Bench compare (own main) unit tests (the only fully-green, cred-free runnable surface besides offline replay): diff --git a/docs/README.md b/docs/README.md index adc9a60a..f7a28673 100644 --- a/docs/README.md +++ b/docs/README.md @@ -33,7 +33,7 @@ The map of every doc. **Start here** if you're new; the deeper tracks follow. | [execution-model.md](./execution-model.md) | the picture | The unified `Executor` port (router/bridge/cli/sandbox/BYO) + two engines, driver vs worker, spawn mechanics. | | [agent-bus-protocol.md](./agent-bus-protocol.md) | normative protocol | The multi-agent call bus — depth limits, headers, refusal contract. | | [durability-adapters.md](./durability-adapters.md) | subsystem | Journal + durability for resumable conversations + supervisor trees. | -| [intelligence-sdk.md](./intelligence-sdk.md) | subsystem | The product intelligence drop-in (`withTangleIntelligence`). | +| [intelligence-sdk.md](./intelligence-sdk.md) | product SDK | The Intelligence SDK reference — the `/intelligence` subpath (observe, effort tiers, certified delivery). | | [BUILDING.md](./BUILDING.md) | process | Building discipline: goal first, cheapest decisive proof, verification rules. | | [ANTI_PATTERNS.md](./ANTI_PATTERNS.md) | process | Named failure modes. 
| | [MAINTAINING.md](./MAINTAINING.md) | process | How the generated API reference + the docs-freshness gate stay honest. | @@ -44,7 +44,7 @@ The map of every doc. **Start here** if you're new; the deeper tracks follow. |---|---|---| | [simplification-plan.md](./research/simplification-plan.md) | **live tracker** | The in-flight simplification/rearchitecture: the converged design, the scratch list, the doc/module inventory, the workstreams + completion criteria. | | [research/README.md](./research/README.md) | research index | Forward-looking design threads + decision log. Not the canonical spine. | -| [archive/](./archive/) | retired notes | Superseded/niche docs kept for history (delivery manifest, conversation economics, artifact-lifecycle, go-live, results). | +| [archive/](./archive/) | retired notes | Superseded/niche docs kept for history (delivery manifest, conversation economics, artifact-lifecycle, go-live, results, benchmark-matrix consolidation). | ## Conventions diff --git a/docs/benchmark-matrix-consolidation.md b/docs/archive/benchmark-matrix-consolidation.md similarity index 91% rename from docs/benchmark-matrix-consolidation.md rename to docs/archive/benchmark-matrix-consolidation.md index bc98a399..a7f8a112 100644 --- a/docs/benchmark-matrix-consolidation.md +++ b/docs/archive/benchmark-matrix-consolidation.md @@ -1,3 +1,5 @@ +> Track: Archive · Status: the agent-runtime portion SHIPPED — external benchmark adapters (`5e2e81a0`) + DABStep (`5d610e78`) are on main, and `runBenchmarks` (the registry subset sweep, plan step 2) landed as #420 (`bench/src/run-benchmarks.ts`, `run-benchmarks-cli.mts`, `run-benchmarks-report.ts`). Plan steps 3–5 (agent-lab external-bench/product adapters; tax/legal/gtm folds) are cross-repo work owned by those repos — the agent-lab branches named below no longer exist. The living map of this repo's bench surface is `bench/HARNESS.md`. 
+ # Benchmark matrix consolidation How to run any subset of `{harnesses × models × personas × scenarios × external-benchmarks}` and rank the cells, using the existing library primitives — and the plan to fold the per-product matrices onto them. diff --git a/docs/canonical-api.md b/docs/canonical-api.md index fc93d7c4..3576673d 100644 --- a/docs/canonical-api.md +++ b/docs/canonical-api.md @@ -39,7 +39,7 @@ This table is judgment-only: it maps an intent to the ONE primitive to reach for | I want to… | Use (import) | Do NOT build | |---|---|---| -| **Just run a supervisor to a goal (one call, scaffolding defaulted)** — START HERE | `supervise(profile, task, { budget, backend? })` — `/loops` | hand-wiring `createSupervisor().run` + `blobs`/`perWorker`/`journal`/`executors`; reaching for the lower-level run-verbs below before you need a specific counterparty | +| **Just run a supervisor to a goal (one call, scaffolding defaulted)** — START HERE (running agents) | `supervise(profile, task, { budget, backend? 
})` — `/loops` | hand-wiring `createSupervisor().run` + `blobs`/`perWorker`/`journal`/`executors`; reaching for the lower-level run-verbs below before you need a specific counterparty | | **Supervise agents to solve a graded `AgenticSurface` task** (workers `runAgentic` the surface, settle on its own check, driver self-improves from the failing tests) | `superviseSurface(profile, task, { surface, worker })` — `/loops` | a worker-seam + a "self-improving supervisor" wrapper around `supervise()`; passing a custom `makeWorkerAgent` that runs `runAgentic` | | Run a genome through a topology shape over the keystone Supervisor, end-to-end | `runPersonified({ persona, shape, task, budget })` — `/loops` | a hand-rolled `createSupervisor().run` + seam-wiring helper | | Loop a worker over one evolving artifact, K rounds, stop-when-good | `loopUntil(seed, spec)` as the `shape` — `/loops` | a `while(!done){runWorker();decide()}` hand-loop or "multi-attempt refine driver" | @@ -65,7 +65,7 @@ This table is judgment-only: it maps an intent to the ONE primitive to reach for | Run + **resume** ONE persistent box across turns | `openSandboxRun(client, opts, deliverable)` — `/loops` | a per-domain `new Sandbox`+`box.fs.read`+delete copy | | Pick / register a leaf backend, or bring your own agent | `createExecutor({ backend })` / `createExecutorRegistry()` / implement `Executor` — `/loops` | a per-vendor adapter or closed `inline\|sandbox\|cli` switch (won't report through the `UsageEvent` channel) | | Evolve a **prompt/string** surface | `gepaProposer({ llm, model, target })` (default inside `selfImprove`; the skill-surface twin is `skillOptProposer`, same source) — `agent-eval/campaign` | a hand-rolled prompt-mutation reflection loop with its own Pareto bookkeeping | -| Self-improve a profile (one pluggable verb) — START HERE | `improve(profile, findings, { surface, gate })` — root `.` (the RSI verb; defaults the generator from `surface`, wraps `selfImprove`) | a bespoke optimize 
loop, or calling `selfImprove`/a skill-optimizer directly for the common case | +| Self-improve a profile (one pluggable verb) — START HERE (self-improvement) | `improve(profile, findings, { surface, gate })` — root `.` (the RSI verb; defaults the generator from `surface`, wraps `selfImprove`) | a bespoke optimize loop, or calling `selfImprove`/a skill-optimizer directly for the common case | | Measure **one profile artifact's marginal lift** (with-vs-without, score+cost) / catalog artifacts | `measureMarginalLift(...)` / `ArtifactRegistry` (`applyArtifact` is the one `ArtifactKind`→`AgentProfile`-field bridge) — `/lifecycle` | a hand-rolled with/without ablation loop, or a per-kind `if kind==='skill'…` profile-field switch | | Run the **whole artifact lifecycle** — generate→measure→promote→store→compose, then drift-watch/dedupe the live set — over ANY profile surface (skill/prompt/tool/MCP) | `runLifecycle({ baseline, generators, evalRunner, gate })` then `composeProfile(registry, base, query)`; maintain with `driftWatch(...)` / `dedupeArtifacts(...)` — `/lifecycle` | a per-surface improve loop, a hand-rolled promote→compose step, or re-running `measureMarginalLift` without the registry/gate spine. The ONLY per-surface code is a thin `CandidateGenerator` (`skillGenerator` distills, `promptGenerator`/`buildableGenerator` for the rest) | | Run the self-improvement loop with full substrate control | `selfImprove({ agent, scenarios, judge, baselineSurface })` — `agent-eval/contract` | a bespoke optimize loop or a parallel skill-optimizer | diff --git a/examples/README.md b/examples/README.md index 8ac2630d..2a90d906 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,26 +1,23 @@ # agent-runtime examples -A learning path. Read the examples in order — each one adds a single concept on top of the last. -The fastest way to feel the package is to read **ONE** example: [`driver-loop/`](./driver-loop/) -(below), which shows the move every supervisor is built on. 
+Start by reading ONE example: [`driver-loop/`](./driver-loop/) — the move every supervisor is built on. +The catalog below is a learning path for when you want more: each example adds a single concept on top of the last. Every example imports from `@tangle-network/agent-runtime` (the surface consumers use), not from relative paths, and they are typechecked by `pnpm run typecheck:examples` — except `researcher-loop`, which needs the optional `@tangle-network/agent-knowledge` peer that agent-runtime doesn't depend on and CI doesn't install, so it is excluded from that typecheck (run it with `agent-knowledge` installed). -## Quickstart — run these three (≈5 min, two run offline) - -Get the feel before reading the full map. In order: +## Quickstart — the golden path (≈5 min; the first two are $0, offline) ```bash -pnpm tsx examples/driver-loop/driver-loop.ts # SEE THE FOLD — offline, no creds -TANGLE_API_KEY=... pnpm tsx examples/supervise/supervise.ts # one-call supervisor over real workers -pnpm tsx examples/improve/improve.ts # the gated self-improvement verb — offline +pnpm tsx examples/driver-loop/driver-loop.ts # 1. SEE THE FOLD — offline, no creds +pnpm tsx examples/improve/improve.ts # 2. the gated self-improvement verb — offline +TANGLE_API_KEY=... pnpm tsx examples/supervise/supervise.ts # 3. one-call supervisor over real workers (router key) ``` -`driver-loop` is the one move everything else is built on; `supervise` is the one-call product entry; -`improve` is the one self-improvement verb. The full learning path is below. +`driver-loop` is the one move everything else is built on; `improve` is the one self-improvement +verb; `supervise` is the one-call product entry. The full learning path is below. 
## Vocabulary @@ -94,6 +91,12 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha | 22 | [`product-eval/`](./product-eval/) | You want user-sim product evals: a persona over a multi-round conversation via `runPersonaConversation`, then score the transcript (`maxTurns` is a ceiling, not a target). Needs `TANGLE_API_KEY` (the engine takes a `backendFor` override, but this example wires the live router). | | 23 | [`agentic-data-creation/`](./agentic-data-creation/) | You want the **Autodata inner loop**: an agent manufactures HARD training examples from a doc and keeps only the ones that DISCRIMINATE a strong solver from a weak one. Composes the fold (`runLoop`+refine driver), N× sampling (`runLoop`+fanout driver), `llmJudge`, `CostLedger`, and `Corpus`; the one new piece is `discriminativeAcceptRule`. Shows the calibration (plain gap ≈ 0.02 vs agentic ≈ 0.31). Offline. | +## Research harnesses (not on the learning path) + +| Example | Use this when… | +|---|---| +| [`ablation-suite/`](./ablation-suite/) | You want the coordination-vs-raw-compute ablation (continuous vs ralph vs supervisor, cost-aware, paired-bootstrap Δ) — the suite behind the supervisor +20.8pp result. Needs `TANGLE_API_KEY`; run `ARMS=cal` first (its README explains why). | + ## Conventions - Examples are synthetic unless noted. 
`strategy-evolution`, `product-eval`, `supervise`, and diff --git a/examples/coding-benchmark/README.md b/examples/coding-benchmark/README.md index e6304a65..570cc2a8 100644 --- a/examples/coding-benchmark/README.md +++ b/examples/coding-benchmark/README.md @@ -1,6 +1,6 @@ # coding-benchmark -**Run the same coding task across coding agents — fairly, honestly, with real statistics — as thin composition over `agent-runtime` / `agent-eval` primitives.** The anti-cheat is **held-out test execution** (SWE-bench / HumanEval style): the agent develops against a few visible example tests, then is graded on a **hidden test suite it never saw and cannot hardcode**. A real solution passes; a cheat (memorize the visible examples, fake the hard part) fails. The verifier, the stats, and the judges are all substrate calls, not reimplemented. The glue this example owns is small and named (an in-process offline box, the per-round refine loop, the leaderboard render); the load-bearing scoring and statistics are not hand-rolled. +**Run the same coding task across coding agents — fairly, honestly, with real statistics — as thin composition over `agent-runtime` / `agent-eval` primitives.** The anti-cheat is **held-out test execution**: the agent develops against a few visible example tests, then is graded on a **hidden suite it never saw and cannot hardcode**. The scoring, firewall, and statistics are substrate calls, not reimplemented. ```bash # offline — no creds, no network. Runs the whole pipeline against an in-process box @@ -42,7 +42,7 @@ Pairwise (paired delta + bootstrap CI; paired-test p, BH-corrected): demonstrate the wiring). Use 20-50 tasks for a real harness comparison. ``` -> **Offline, every harness writes the same scripted solution and is scored by the same deterministic mock judge, so all deltas are 0.000** — the honest no-variance result, not a bug. 
The whole pipeline (matrix, verifier, held-out test execution, judge wiring, stats, firewall) runs for real; only the agent and the judge model are stubbed offline. **Offline the `--ensemble` panel is degenerate too: all three cross-family models share the one mock transport and return the identical verdict — cross-family independence is a live-only property.** `--live` swaps in real harness boxes, a real judge model, and (with `--ensemble`) three genuinely independent models, and the harnesses separate. +> **Offline, every harness writes the same scripted solution and is scored by the same deterministic mock judge, so all deltas are 0.000** — the honest no-variance result, not a bug. Offline `--ensemble` is degenerate too (all three models share the one mock transport); cross-family independence is a live-only property — `--live` swaps in real boxes, a real judge, and the harnesses separate. ### The offline "agent" is a scripted stand-in @@ -69,13 +69,13 @@ On the CLI it's `--tools none|web|search-mcp`. **Honesty caveat:** a preset only - At grading (after the refine loop), the harness copies the held-out suite into the box and runs it (`node --experimental-transform-types --test`). The **held-out pass rate is the PRIMARY, ungameable correctness score.** - A solution that hardcoded the visible examples' exact values passes the visible test but **fails the held-out inputs** (e.g. the rate-limiter held-out uses capacities `7/6/5/2`, not the visible `10/3/10`). A solution that faked the hard part fails them too. Only real behavior passes both. -The firewall is **`agent-eval`'s hidden-criteria port, not a comment.** `scenarios.ts` declares each field's destination (`prompt` → `agent-visible`, `visibleTest` → `develop-against`, `heldoutTest` → `grading-only`, `rubricNote` → `judge-only`) and `routeCodingFields` turns that into the substrate's `RoutedField[]`. 
`dispatch.ts` (`THE FIREWALL LIVES HERE`) calls **`assertNoHiddenLeak`** each round — a `grading-only`/`judge-only` value found in the agent context **throws** — and grades through **`gradeOnHidden`**, which *re-asserts* the firewall against the exact context the run used before seeding + running the held-out suite. The only thing the agent's context gets is `scenario.prompt`; the held-out suite and rubric note never enter the box. (A regression test asserts the firewall throws when the held-out content leaks.) +The firewall is **`agent-eval`'s hidden-criteria port, not a comment**: `scenarios.ts` declares each field's destination as data (`fieldRouting` → `routeCodingFields` → the substrate's `RoutedField[]`), `dispatch.ts` calls **`assertNoHiddenLeak`** each round (a `grading-only`/`judge-only` value found in the agent context **throws**), and **`gradeOnHidden`** re-asserts the firewall against the exact context the run used before seeding + running the held-out suite. A regression test asserts the firewall throws when held-out content leaks. ## How it scores (held-out correctness first, judge second) Scoring runs in strict order, cheapest and most objective first — an `agent-eval` primitive at each layer: -1. **Dev checks (first, in the box, ~$0, advisory for the grade).** An ordered **`MultiLayerVerifier`** pipeline: `typecheck → test(visible) → lint`, with dependency-based skip (test never runs on a type error) and a blended score. These pass/fail booleans are the only thing that steers the next refine round — they tell the agent it's on track, but passing the visible examples does **not** prove correctness. The test layer runs `node --experimental-transform-types --test`, not plain `node --test`: the test imports the solution as a `.ts` file, and Node's default type-*stripping* throws on constructor parameter properties (`constructor(private x: number)`) — the exact style the canonical impl uses — so a correct solution would otherwise score as a test failure. 
(`eval.ts` · `runChecks`) +1. **Dev checks (first, in the box, ~$0, advisory for the grade).** An ordered **`MultiLayerVerifier`** pipeline: `typecheck → test(visible) → lint`, with dependency-based skip (test never runs on a type error) and a blended score. These pass/fail booleans are the only thing that steers the next refine round — they tell the agent it's on track, but passing the visible examples does **not** prove correctness. (`--experimental-transform-types`: plain `node --test` throws on the canonical impl's constructor parameter properties.) (`eval.ts` · `runChecks`) 2. **Held-out test execution (the PRIMARY anti-cheat).** After the loop, the hidden suite is seeded and run in the same box via `agent-eval`'s **`gradeOnHidden`**, with the coding **`nodeTestGrader`** (a `node --test` executor) plugged in as the domain `HiddenCriteriaGrader` — the substrate bakes in no node/test/regex; the coding execution *is* the grader. The **held-out pass rate** is the primary correctness number. It is ungameable: the agent never saw these inputs, so a hardcode-the-visible cheat or a faked impl fails. (`eval.ts` · `nodeTestGrader` / `gradeOnHiddenCriteria`, `scenarios.ts` · `heldoutTest`) 3. **LLM judge (last, SECONDARY quality signal).** A 4-dimension weighted rubric — correctness 0.40 · completeness 0.25 · code_quality 0.20 · robustness 0.15. The rubric text + anchors live **with the judge**, never in the workdir. The judge scores code *quality*; it does not decide correctness. 
(`eval.ts`) @@ -92,10 +92,10 @@ Every number comes from the shared `leaderboard()` + `pairwiseSignificance` engi - per-harness **mean composite + bootstrap CI** (`confidenceInterval`) - per-harness **pass-rate + Wilson binomial CI** (`wilson`) — the correct interval for a proportion - every harness **pair** compared on **matched scenarios** with a **real paired test** (`pairedTTest`) for the p-value, and a **paired bootstrap** (`pairedBootstrap`) for the effect size + CI, then **BH-corrected** across all pairs (`benjaminiHochberg`) so running many comparisons doesn't manufacture a false winner. -- **Reps don't fake independent n — anywhere.** The paired unit is the *scenario*, and **the leaderboard uses the same unit**: with `--reps > 1`, a harness produces several records per scenario, so BOTH the leaderboard CI/Wilson AND the pairing collapse reps to **one mean per (harness, scenario)** before computing anything. Reps tighten the per-cell *estimate*; they are not independent samples, so they never narrow the interval out of zero new information. The reported `n` is the number of distinct scenarios, not the record count. (A regression test asserts identical reps leave the CI unchanged.) +- **Reps don't fake independent n.** Reps collapse to **one mean per (harness, scenario)** everywhere — they tighten the estimate, never fake independent samples; the reported `n` is the number of distinct scenarios. (A regression test asserts identical reps leave the CI unchanged.) - A record missing its `scenarioId` is a **loud throw**, not a silent merge — averaging distinct scenarios into one `''` bucket would corrupt the pairing, so it fails fast instead. -> **Power caveat.** The example corpus is **3 tasks** — far below what these tests need to separate harnesses. The paired t-test has ~1 degree of freedom on a few scenarios. 
Below the power floor (`n < 6`) `renderStats` **suppresses the `SIGNIFICANT` tag entirely** (a near-constant gap on a few scenarios can return `p<0.05` and still mean nothing — the small-n mirage), and a zero-variance pair (a collapsed bootstrap CI) never reads as a real effect either. At this corpus size the example demonstrates the *wiring*, not a defensible claim. A real harness comparison wants **20-50 tasks**. +> **Power caveat.** The example corpus is **3 tasks** — below the power floor (`n < 6`) the `SIGNIFICANT` tag is suppressed entirely (the small-n mirage). The example demonstrates the *wiring*; a real harness comparison wants **20-50 tasks**. The leaderboard labels are the readable harness names, not the matrix's internal profile hashes. @@ -112,22 +112,9 @@ The leaderboard labels are the readable harness names, not the matrix's internal | `benchmark.ts` | the entrypoint: build the axes, hand the matrix the dispatch + judges, run, then print `leaderboard()` + `pairwiseSignificance` (the shared stats engine — CIs, Wilson, paired-bootstrap, BH, with the small-n SIGNIFICANT-suppression guard) | | `coding-benchmark.test.ts` | offline smoke — the matrix produces `harnesses × scenarios × reps` records; a hardcode-the-visible cheat fails the held-out suite while the real solution passes (by execution); the held-out test is never seeded during the turn (firewall); reps don't narrow the CI | -## Primitives composed +Every load-bearing piece is a substrate call — `runProfileMatrix`, `openSandboxRun`, `MultiLayerVerifier`, `gradeOnHidden`/`blendHeldout`, `llmJudge`/`ensembleJudge`, `pairedBootstrap` + BH; find each in [`docs/canonical-api.md`](../../docs/canonical-api.md). 
-- **matrix:** `runProfileMatrix({ profiles, scenarios, dispatch, judges, reps, integrity, costCeiling })` (`@tangle-network/agent-eval/campaign`) with a `ProfileDispatchFn` rendering each cell -- **box + multi-round:** `openSandboxRun(client, opts, deliverable)` → `.start()` / `.resume()` over one persistent, resumable session (`@tangle-network/agent-runtime/loops`) -- **dev layer:** `MultiLayerVerifier` — ordered `typecheck → test → lint` with dependency-based skip and a blended score (`@tangle-network/agent-eval`) -- **hidden-criteria firewall + grade:** `routeFields` / `assertNoHiddenLeak` / `gradeOnHidden` / `hiddenGrade` / `blendHeldout` / `withHeldoutBlend` — the domain-agnostic grading port; this bench supplies the coding `HiddenCriteriaGrader` (`nodeTestGrader`, a `node --experimental-transform-types --test` executor) and gets the firewall enforcement + held-out-weighted composite for free (`@tangle-network/agent-eval`) -- **token metering:** `extractLlmCallEvent` (`@tangle-network/agent-runtime/loops`) — reads usage off **every** backend event shape (`done` / `result` / `llm_call` / `usage`) so the integrity guard sees a real run -- **judges:** `llmJudge` (single model call → canonical `JudgeConfig`, imported from `@tangle-network/agent-eval/campaign` so it resolves across the whole peer range) and `ensembleJudge` for the cross-family panel (`@tangle-network/agent-eval`); the judge transport is a `ChatClient` (`createChatClient` — a `mock` handler offline, the `router` live) -- **integrity:** `integrity: 'assert'` on the matrix proves a real backend ran (no stubbed cell) — `'off'` only for the offline mock -- **stats:** `pairedTTest`, `pairedBootstrap`, `benjaminiHochberg`, `confidenceInterval`, `wilson` - -## Reusing the firewall across domains (and why there's no `defineBenchmark` yet) - -The hidden-criteria firewall is now **domain-free substrate** (`@tangle-network/agent-eval`): a legal, tax, research, or content benchmark declares the same 
`fieldRouting` and supplies its own `HiddenCriteriaGrader` (`gradeOnHidden` runs it behind `assertNoHiddenLeak`; `blendHeldout` composes the score). That is the part worth lifting, and it is lifted — the coding bench is now just *one consumer* of the port. - -A thin `defineBenchmark` skeleton (one call that wraps `runProfileMatrix` + a dispatch + judges) was considered and **deliberately not built.** The only other benchmark today is `examples/product-eval`, and it shares **nothing** with this one beyond `runProfileMatrix` itself — which is already a one-line substrate call. product-eval has no held-out split, no firewall, and no blend (it scores a transcript with a trivial `artifactOf` count), so a shared skeleton would have **zero common logic to dedup** and would force the two unrelated shapes (a multi-round coding box vs a persona conversation) under one premature abstraction. The right home for such a helper, *if it earns its place*, is `@tangle-network/agent-runtime` (next to the loop seams the dispatch already uses), **not** an example. The **trigger to revisit:** a *second* benchmark that genuinely shares the firewall + held-out + blend wiring with this one (e.g. a research or legal held-out bench). At that point the duplicated dispatch+judge plumbing — not `runProfileMatrix` — is what a skeleton would dedup. Until then it would be abstraction ahead of evidence. +A `defineBenchmark` wrapper was considered and deliberately not built: no second benchmark shares the firewall + held-out + blend wiring (`product-eval` shares only `runProfileMatrix`); revisit when one does. ## Going live diff --git a/examples/improve/improve.ts b/examples/improve/improve.ts index 59a905fe..5f9d8f17 100644 --- a/examples/improve/improve.ts +++ b/examples/improve/improve.ts @@ -95,6 +95,9 @@ async function main(): Promise<void> { // A perfect +1.0 lift at this n/reps clears the default held-out gate. 
budget: { generations: 1, populationSize: 2, reps: 3, holdoutFraction: 0.5 }, }) + console.log( + 'improve() — proposed a new "prompt" surface from the analyst findings, measured it on held-out scenarios, and shipped only because the gate cleared (offline: scripted proposer, deterministic judge):', + ) console.log(`shipped: ${out.shipped} lift: ${out.lift.toFixed(3)} gate: ${out.gateDecision}`) console.log(`prompt after: ${out.profile.prompt?.systemPrompt}`) } diff --git a/examples/sanitized-telemetry-streaming/sanitized-telemetry-streaming.ts b/examples/sanitized-telemetry-streaming/sanitized-telemetry-streaming.ts index 6d083c31..63dfc7d6 100644 --- a/examples/sanitized-telemetry-streaming/sanitized-telemetry-streaming.ts +++ b/examples/sanitized-telemetry-streaming/sanitized-telemetry-streaming.ts @@ -8,7 +8,7 @@ * uris are dropped by default * - shows the opt-in path for privileged diagnostics via * `RuntimeTelemetryOptions` flags - * - prints the streaming summary at the end + * - prints each run's summary first, then its sanitized event wall * * Note on `task.intent`: this is fixed metadata that flows through * sanitized telemetry by default. NEVER set it to user input — use a @@ -103,7 +103,9 @@ async function drain(label: string, collector: RuntimeStreamEventCollector): Pro })) { collector.onEvent(event as RuntimeStreamEvent) } - console.log(`--- ${label} stream events ---`) + console.log(`--- ${label}: summary ---`) + console.log(collector.summary()) + console.log(`--- ${label}: sanitized events ---`) for (const e of collector.events) console.log(JSON.stringify(e)) } @@ -112,8 +114,6 @@ async function main() { // uris are all stripped — the safe-by-default state any telemetry sink gets. const safe = createRuntimeStreamEventCollector() await drain('safe (default redaction)', safe) - console.log('\n--- safe summary ---') - console.log(safe.summary()) // ── 2. 
Opt-in verbose: a privileged operator triaging an incident turns on the // exact fields they need; everything they did NOT opt into stays redacted. diff --git a/examples/stream-backends/stream-backends.ts b/examples/stream-backends/stream-backends.ts index 811b9bb1..ca1623d0 100644 --- a/examples/stream-backends/stream-backends.ts +++ b/examples/stream-backends/stream-backends.ts @@ -104,6 +104,14 @@ async function drainToSse( } async function main() { + console.log( + 'stream-backends — three transports, one drain: every backend below lands on the same', + ) + console.log( + 'RuntimeStreamEvent -> SSE serialization. Sections: readiness SSE, iterable (offline),', + ) + console.log('sandbox (offline), openai-compatible (skipped unless OPENAI_API_KEY is set).\n') + // Readiness SSE — the one-off event a route writes when a task is gated on // missing knowledge (see examples/knowledge-gating for the gate itself). const requirements: KnowledgeRequirement[] = [ diff --git a/examples/supervise/README.md b/examples/supervise/README.md index ff718c74..93fda286 100644 --- a/examples/supervise/README.md +++ b/examples/supervise/README.md @@ -19,5 +19,8 @@ const result = await supervise( Run: `TANGLE_API_KEY=... pnpm tsx examples/supervise/supervise.ts` -For the lower-level seams (`supervisorAgent` + `createSupervisor().run`, or a different worker backend -per spawn), see `examples/supervisor-loop/`. +This is the smallest call — router brain, router-tools workers, everything defaulted. When your +workers need a real backend (a sandbox box, a local harness CLI, or the coordination MCP), graduate +to [`../supervisor-loop/`](../supervisor-loop/) — the same `supervise()` call with the worker +backend as the only knob. For the lower-level seams (`supervisorAgent` + `createSupervisor().run`) +see its README.
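
Reviewer note on the `stream-backends` banner in this change: it describes several transports landing on one `RuntimeStreamEvent -> SSE` drain. The body of the real `drainToSse` is not part of this diff, so the following is only a minimal sketch of the idea under stated assumptions. `SketchEvent`, `toSseFrame`, `drainToFrames`, and both stand-in backends are hypothetical names for illustration, not the package's actual types or API.

```typescript
// Hypothetical event shape for illustration only; the real RuntimeStreamEvent
// type lives in @tangle-network/agent-runtime and is not reproduced here.
type SketchEvent = { type: string; [key: string]: unknown }

// Serialize one event as a Server-Sent Events frame: an `event:` line,
// a `data:` line carrying JSON, and a blank line that terminates the frame.
// JSON.stringify escapes newlines inside strings, so `data:` stays on one line.
function toSseFrame(event: SketchEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`
}

// One drain for any transport: anything that exposes an async iterable of
// events is turned into the same sequence of SSE frames.
async function drainToFrames(source: AsyncIterable<SketchEvent>): Promise<string[]> {
  const frames: string[] = []
  for await (const event of source) frames.push(toSseFrame(event))
  return frames
}

// Two stand-in "backends" (assumptions, not real transports): different
// internals, same output contract.
async function* iterableBackend(): AsyncGenerator<SketchEvent> {
  yield { type: 'text_delta', text: 'hello' }
  yield { type: 'done' }
}

async function* deferredBackend(): AsyncGenerator<SketchEvent> {
  await Promise.resolve() // simulate a transport that yields control before emitting
  yield { type: 'text_delta', text: 'hello' }
  yield { type: 'done' }
}
```

Under these assumptions, `drainToFrames(iterableBackend())` and `drainToFrames(deferredBackend())` resolve to identical frame arrays, which is the property the banner claims for the real backends: the route code downstream of the drain does not need to know which transport produced the events.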