Merged
4 changes: 4 additions & 0 deletions README.md
@@ -96,5 +96,9 @@ All 28 live in [`examples/`](./examples).

- New here? [`docs/concepts.md`](./docs/concepts.md) — the mental model in plain terms.
- [`docs/canonical-api.md`](./docs/canonical-api.md) — find the primitive: "I want to ___ → use ___".
- [`docs/api/primitive-catalog.md`](./docs/api/primitive-catalog.md) — every export in one generated, never-stale list with its import path. Check it before building anything new.
- Import subpaths: the root export is the product surface (`handleChatTurn`, `improve`); deeper capabilities ship as subpaths — `/loops` (multi-agent + the loop kernel), `/mcp` (tool servers), `/intelligence` (observability drop-in), `/lifecycle`, `/agent`, `/profiles`, `/platform`, `/analyst-loop`, `/environment-provider`.
- [`docs/architecture.md`](./docs/architecture.md) — the design, end to end.
- [`bench/HARNESS.md`](./bench/HARNESS.md) — the experiment harness and how to run a benchmark.

**Contributing:** `pnpm i && pnpm test` gets you running; the full local gate is the [`package.json`](./package.json) scripts (`lint`, `typecheck`, `docs:check`).
1 change: 1 addition & 0 deletions bench/HARNESS.md
@@ -124,6 +124,7 @@ the gate + measurement tools:
corpus-replay.mts --selector: selector@k vs random@k vs oracle@k over a corpus (THE offline gate)
corpus-report.mts paired-bootstrap CI + Benjamini-Hochberg over corpora
gate-cli.mts the recursive diverse-vs-blind gate through `runGate` (Supervisor)
run-benchmarks-cli.mts runBenchmarks: any subset of the ADAPTERS registry × model/harness cells, one combined ranked report (#420)
commit0-env-run.mts the HARD domain through `runBenchmark` (the optimization suite)
terminal-compare.ts Terminal-Bench compare (own main)
unit tests (the only fully-green, cred-free runnable surface besides offline replay):
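
The `corpus-report.mts` entry in the listing above names Benjamini-Hochberg as its multiple-comparison correction. For readers unfamiliar with the procedure, here is a minimal generic sketch of the step-up rule. It is not the code in `corpus-report.mts`; the function name `benjaminiHochberg`, its signature, and the sample p-values are assumptions made for illustration only.

```typescript
// Generic Benjamini-Hochberg step-up procedure (illustrative sketch only;
// `benjaminiHochberg` is a hypothetical name, not an export of this repo).
// Returns, for each input p-value, whether its hypothesis is rejected
// while controlling the false discovery rate at `alpha`.
function benjaminiHochberg(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length;

  // Sort by ascending p-value, remembering each original position.
  const order = pValues
    .map((p, i) => ({ p, i }))
    .sort((a, b) => a.p - b.p);

  // Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
  let cutoff = 0;
  order.forEach(({ p }, idx) => {
    const rank = idx + 1;
    if (p <= (rank / m) * alpha) cutoff = rank;
  });

  // Step-up: reject every hypothesis ranked at or below the cutoff,
  // even if an individual p-value missed its own threshold.
  const rejected = new Array<boolean>(m).fill(false);
  for (let r = 0; r < cutoff; r++) {
    rejected[order[r].i] = true;
  }
  return rejected;
}

// Made-up p-values, e.g. one per compared corpus.
const samplePValues = [0.011, 0.039, 0.025, 0.2, 0.015];
console.log(benjaminiHochberg(samplePValues, 0.05));
```

In the sample, the smallest p-value (0.011) misses its own rank-1 threshold of 0.01 but is still rejected, because larger-ranked p-values pass theirs. That is the step-up behavior that distinguishes this procedure from a per-test Bonferroni cut.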
4 changes: 2 additions & 2 deletions docs/README.md
@@ -33,7 +33,7 @@ The map of every doc. **Start here** if you're new; the deeper tracks follow.
| [execution-model.md](./execution-model.md) | the picture | The unified `Executor` port (router/bridge/cli/sandbox/BYO) + two engines, driver vs worker, spawn mechanics. |
| [agent-bus-protocol.md](./agent-bus-protocol.md) | normative protocol | The multi-agent call bus — depth limits, headers, refusal contract. |
| [durability-adapters.md](./durability-adapters.md) | subsystem | Journal + durability for resumable conversations + supervisor trees. |
-| [intelligence-sdk.md](./intelligence-sdk.md) | subsystem | The product intelligence drop-in (`withTangleIntelligence`). |
+| [intelligence-sdk.md](./intelligence-sdk.md) | product SDK | The Intelligence SDK reference — the `/intelligence` subpath (observe, effort tiers, certified delivery). |
| [BUILDING.md](./BUILDING.md) | process | Building discipline: goal first, cheapest decisive proof, verification rules. |
| [ANTI_PATTERNS.md](./ANTI_PATTERNS.md) | process | Named failure modes. |
| [MAINTAINING.md](./MAINTAINING.md) | process | How the generated API reference + the docs-freshness gate stay honest. |
@@ -44,7 +44,7 @@ The map of every doc. **Start here** if you're new; the deeper tracks follow.
|---|---|---|
| [simplification-plan.md](./research/simplification-plan.md) | **live tracker** | The in-flight simplification/rearchitecture: the converged design, the scratch list, the doc/module inventory, the workstreams + completion criteria. |
| [research/README.md](./research/README.md) | research index | Forward-looking design threads + decision log. Not the canonical spine. |
-| [archive/](./archive/) | retired notes | Superseded/niche docs kept for history (delivery manifest, conversation economics, artifact-lifecycle, go-live, results). |
+| [archive/](./archive/) | retired notes | Superseded/niche docs kept for history (delivery manifest, conversation economics, artifact-lifecycle, go-live, results, benchmark-matrix consolidation). |

## Conventions

@@ -1,3 +1,5 @@
> Track: Archive · Status: the agent-runtime portion SHIPPED — external benchmark adapters (`5e2e81a0`) + DABStep (`5d610e78`) are on main, and `runBenchmarks` (the registry subset sweep, plan step 2) landed as #420 (`bench/src/run-benchmarks.ts`, `run-benchmarks-cli.mts`, `run-benchmarks-report.ts`). Plan steps 3–5 (agent-lab external-bench/product adapters; tax/legal/gtm folds) are cross-repo work owned by those repos — the agent-lab branches named below no longer exist. The living map of this repo's bench surface is `bench/HARNESS.md`.

# Benchmark matrix consolidation

How to run any subset of `{harnesses × models × personas × scenarios × external-benchmarks}` and rank the cells, using the existing library primitives — and the plan to fold the per-product matrices onto them.
4 changes: 2 additions & 2 deletions docs/canonical-api.md
@@ -39,7 +39,7 @@ This table is judgment-only: it maps an intent to the ONE primitive to reach for

| I want to… | Use (import) | Do NOT build |
|---|---|---|
-| **Just run a supervisor to a goal (one call, scaffolding defaulted)** — START HERE | `supervise(profile, task, { budget, backend? })` — `/loops` | hand-wiring `createSupervisor().run` + `blobs`/`perWorker`/`journal`/`executors`; reaching for the lower-level run-verbs below before you need a specific counterparty |
+| **Just run a supervisor to a goal (one call, scaffolding defaulted)** — START HERE (running agents) | `supervise(profile, task, { budget, backend? })` — `/loops` | hand-wiring `createSupervisor().run` + `blobs`/`perWorker`/`journal`/`executors`; reaching for the lower-level run-verbs below before you need a specific counterparty |
| **Supervise agents to solve a graded `AgenticSurface` task** (workers `runAgentic` the surface, settle on its own check, driver self-improves from the failing tests) | `superviseSurface(profile, task, { surface, worker })` — `/loops` | a worker-seam + a "self-improving supervisor" wrapper around `supervise()`; passing a custom `makeWorkerAgent` that runs `runAgentic` |
| Run a genome through a topology shape over the keystone Supervisor, end-to-end | `runPersonified({ persona, shape, task, budget })` — `/loops` | a hand-rolled `createSupervisor().run` + seam-wiring helper |
| Loop a worker over one evolving artifact, K rounds, stop-when-good | `loopUntil(seed, spec)` as the `shape` — `/loops` | a `while(!done){runWorker();decide()}` hand-loop or "multi-attempt refine driver" |
@@ -65,7 +65,7 @@ This table is judgment-only: it maps an intent to the ONE primitive to reach for
| Run + **resume** ONE persistent box across turns | `openSandboxRun(client, opts, deliverable)` — `/loops` | a per-domain `new Sandbox`+`box.fs.read`+delete copy |
| Pick / register a leaf backend, or bring your own agent | `createExecutor({ backend })` / `createExecutorRegistry()` / implement `Executor` — `/loops` | a per-vendor adapter or closed `inline\|sandbox\|cli` switch (won't report through the `UsageEvent` channel) |
| Evolve a **prompt/string** surface | `gepaProposer({ llm, model, target })` (default inside `selfImprove`; the skill-surface twin is `skillOptProposer`, same source) — `agent-eval/campaign` | a hand-rolled prompt-mutation reflection loop with its own Pareto bookkeeping |
-| Self-improve a profile (one pluggable verb) — START HERE | `improve(profile, findings, { surface, gate })` — root `.` (the RSI verb; defaults the generator from `surface`, wraps `selfImprove`) | a bespoke optimize loop, or calling `selfImprove`/a skill-optimizer directly for the common case |
+| Self-improve a profile (one pluggable verb) — START HERE (self-improvement) | `improve(profile, findings, { surface, gate })` — root `.` (the RSI verb; defaults the generator from `surface`, wraps `selfImprove`) | a bespoke optimize loop, or calling `selfImprove`/a skill-optimizer directly for the common case |
| Measure **one profile artifact's marginal lift** (with-vs-without, score+cost) / catalog artifacts | `measureMarginalLift(...)` / `ArtifactRegistry` (`applyArtifact` is the one `ArtifactKind`→`AgentProfile`-field bridge) — `/lifecycle` | a hand-rolled with/without ablation loop, or a per-kind `if kind==='skill'…` profile-field switch |
| Run the **whole artifact lifecycle** — generate→measure→promote→store→compose, then drift-watch/dedupe the live set — over ANY profile surface (skill/prompt/tool/MCP) | `runLifecycle({ baseline, generators, evalRunner, gate })` then `composeProfile(registry, base, query)`; maintain with `driftWatch(...)` / `dedupeArtifacts(...)` — `/lifecycle` | a per-surface improve loop, a hand-rolled promote→compose step, or re-running `measureMarginalLift` without the registry/gate spine. The ONLY per-surface code is a thin `CandidateGenerator` (`skillGenerator` distills, `promptGenerator`/`buildableGenerator` for the rest) |
| Run the self-improvement loop with full substrate control | `selfImprove({ agent, scenarios, judge, baselineSurface })` — `agent-eval/contract` | a bespoke optimize loop or a parallel skill-optimizer |
25 changes: 14 additions & 11 deletions examples/README.md
@@ -1,26 +1,23 @@
# agent-runtime examples

-A learning path. Read the examples in order — each one adds a single concept on top of the last.
-The fastest way to feel the package is to read **ONE** example: [`driver-loop/`](./driver-loop/)
-(below), which shows the move every supervisor is built on.
+Start by reading ONE example: [`driver-loop/`](./driver-loop/) — the move every supervisor is built on.
+The catalog below is a learning path for when you want more: each example adds a single concept on top of the last.

Every example imports from `@tangle-network/agent-runtime` (the surface consumers use), not from
relative paths, and they are typechecked by `pnpm run typecheck:examples` — except `researcher-loop`,
which needs the optional `@tangle-network/agent-knowledge` peer that agent-runtime doesn't depend on
and CI doesn't install, so it is excluded from that typecheck (run it with `agent-knowledge` installed).

-## Quickstart — run these three (≈5 min, two run offline)
-
-Get the feel before reading the full map. In order:
+## Quickstart — the golden path (≈5 min; the first two are $0, offline)

```bash
-pnpm tsx examples/driver-loop/driver-loop.ts # SEE THE FOLD — offline, no creds
-TANGLE_API_KEY=... pnpm tsx examples/supervise/supervise.ts # one-call supervisor over real workers
-pnpm tsx examples/improve/improve.ts # the gated self-improvement verb — offline
+pnpm tsx examples/driver-loop/driver-loop.ts # 1. SEE THE FOLD — offline, no creds
+pnpm tsx examples/improve/improve.ts # 2. the gated self-improvement verb — offline
+TANGLE_API_KEY=... pnpm tsx examples/supervise/supervise.ts # 3. one-call supervisor over real workers (router key)
```

-`driver-loop` is the one move everything else is built on; `supervise` is the one-call product entry;
-`improve` is the one self-improvement verb. The full learning path is below.
+`driver-loop` is the one move everything else is built on; `improve` is the one self-improvement
+verb; `supervise` is the one-call product entry. The full learning path is below.

## Vocabulary

@@ -94,6 +91,12 @@ purpose — read [`driver-loop/`](./driver-loop/) for the contrast (a driver tha
| 22 | [`product-eval/`](./product-eval/) | You want user-sim product evals: a persona over a multi-round conversation via `runPersonaConversation`, then score the transcript (`maxTurns` is a ceiling, not a target). Needs `TANGLE_API_KEY` (the engine takes a `backendFor` override, but this example wires the live router). |
| 23 | [`agentic-data-creation/`](./agentic-data-creation/) | You want the **Autodata inner loop**: an agent manufactures HARD training examples from a doc and keeps only the ones that DISCRIMINATE a strong solver from a weak one. Composes the fold (`runLoop`+refine driver), N× sampling (`runLoop`+fanout driver), `llmJudge`, `CostLedger`, and `Corpus`; the one new piece is `discriminativeAcceptRule`. Shows the calibration (plain gap ≈ 0.02 vs agentic ≈ 0.31). Offline. |

## Research harnesses (not on the learning path)

| Example | Use this when… |
|---|---|
| [`ablation-suite/`](./ablation-suite/) | You want the coordination-vs-raw-compute ablation (continuous vs ralph vs supervisor, cost-aware, paired-bootstrap Δ) — the suite behind the supervisor +20.8pp result. Needs `TANGLE_API_KEY`; run `ARMS=cal` first (its README explains why). |

## Conventions

- Examples are synthetic unless noted. `strategy-evolution`, `product-eval`, `supervise`, and