chore(deps): agent-eval 0.103.2 — catalog fully documented, ratchet 34→0 #461
The 0.103.1/0.103.2 substrate TSDoc heals every remaining blank callable row in docs/api/primitive-catalog.md (34 → 0). maxUndocumentedCallables drops to 0: any future undocumented public function/class/const export fails docs:check.
…ry (ratchet-0 catch)
tangletools
left a comment
✅ Auto-approved drewstone PR — fde1d360
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-03T18:34:32Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 81.4s (2 bridge agents) |
| Total | 81.4s |
💰 Value — sound
Bumps agent-eval to 0.103.2, drops the undocumented-callables ratchet from 34→0, and adds the one TSDoc the ratchet caught — closes the catalog-blanks arc permanently.
- What it does: Bumps the @tangle-network/agent-eval dev dependency from ^0.103.1 to ^0.103.2, which ships TSDoc backfills for the substrate primitive surfaces. Lowers the maxUndocumentedCallables ratchet in scripts/gen-primitive-catalog.mjs from 34 to 0, so any future public function/class/const export without a TSDoc summary fails docs:check. Adds a TSDoc summary to the defineLeaderboard declaration in src/runt
- Goals it achieves: Closes the catalog-blanks arc: every public callable in the primitive catalog now has a TSDoc summary visible in the generated catalog table. The ratchet at 0 functions as a permanent CI gate — any new undocumented export blocks the build, preventing the catalog from ever regressing into blanks again. The catalog is now a live, machine-enforced anti-reinvention inventory: agents can see what exist
- Assessment: A minimal, coherent, three-piece change that completes a multi-commit arc (#448 → #450 → #461). The ratchet mechanism in gen-primitive-catalog.mjs:271-274 was explicitly designed for exactly this workflow (comment at line 271: 'when a backfill lowers the real number, lower the constant to match'). The dependency bump is the correct way to consume substrate TSDoc backfills — those summaries live in
- Better / existing approach: none — this is the right approach. The ratchet mechanism already exists in scripts/gen-primitive-catalog.mjs:268-339 and was designed for this exact lifecycle: backfill summaries, lower the ceiling, regenerate. No alternative mechanism in the codebase does this. The dependency bump + TSDoc fix + ratchet drop are the intended, documented, single-path workflow. Searched: gen-primitive-catalog.mjs fo
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content
🎯 Usefulness — sound
Bumps agent-eval to the fully-documented release, drops the undocumented-callables ratchet to zero (CI-enforced), and adds the one missing TSDoc that the ratchet caught — a clean close to the catalog-blanks arc.
- Integration: Ratchet enforcement is hard-wired into CI at `.github/workflows/ci.yml:41` via `docs:check` → `docs:api` → `scripts/gen-primitive-catalog.mjs:331-338`, which exits 1 if undocumented callables exceed the ceiling (now 0). `defineLeaderboard` is publicly exported from `src/runtime/index.ts:95` and has tests at `src/runtime/define-leaderboard.test.ts`. No external callers yet, but it landed in the prior
- Fit with existing patterns: The ratchet mechanism (`maxUndocumentedCallables`) already existed and was progressively lowered across PRs #448 and #450; this is the natural end state (zero tolerance). The TSDoc follows the same pattern as other exports in the codebase (bare-function summaries before `@tag` blocks). The agent-eval version bump is a routine devDependency patch update following the established substrate dependen
- Real-world viability: The ratchet is a static-analysis CI gate, so there are no concurrency or edge-input concerns. It runs via the TypeScript compiler extracting export symbols from the live codebase (`scripts/gen-primitive-catalog.mjs:262-263`), so it always reflects ground truth, not a hand-maintained list. Any future PR adding an undocumented export will fail CI with a specific error naming the offending symbol.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
No concerns — sound change, no better or existing approach found. ✅
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
| State | Detail |
|---|---|
| Interrupted | max run seconds |
No review verdict was produced for this run. Trigger a fresh review on the current PR head if the PR is still open.
tangletools · #461 · model: kimi-for-coding · updated 2026-07-03T20:39:20Z
tangletools
left a comment
✅ Auto-approved drewstone PR — fde1d360
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-03T20:47:15Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 87.7s (2 bridge agents) |
| Total | 87.7s |
💰 Value — sound
Bumps agent-eval to 0.103.2 (substrate TSDoc backfills) and ratchets the catalog's undocumented-callable ceiling 34→0, closing the docs arc with the right mechanism already in place.
- What it does: Three coordinated edits: (1) `package.json:97` floor `@tangle-network/agent-eval` ^0.103.1 → ^0.103.2, picking up substrate TSDoc backfills (agent-eval#305/#306) that fill the last blank callable rows; (2) `scripts/gen-primitive-catalog.mjs:274` `maxUndocumentedCallables` 34 → 0, so any future public function/class/const export lacking a TSDoc summary exits 1 at `docs:check` (the failure path at l
- Goals it achieves: Finish the catalog-blanks arc (629→108 via #448 → 34 via #450 → 0) and lock it closed: the ratchet now enforces zero undocumented public callables going forward, so the 'anti-reinvention inventory' (`gen-primitive-catalog.mjs:349`) can never silently drift back to blank rows. Better outcome = a new primitive added without a one-line summary is a red build, not a silent regression.
- Assessment: Good change, in the grain of the codebase. The ratchet pattern was already established (commit history confirms #448 introduced it at 108, #450 tightened to 34); 0 is its natural terminal value, not a new mechanism. The defineLeaderboard summary is accurate and earns its row (describes the spec→runnable + adapter shape, names `toBenchmarkAdapter()`). The dep bump is additive (patch-level TSDoc bac
- Better / existing approach: none; this is the right approach. The ratchet lives in the existing catalog generator (`scripts/gen-primitive-catalog.mjs`), which is already the canonical anti-reinvention inventory the repo routes through (`canonical-api.md` §2 is its judgmental companion per lines 355-357). There is no second docs-freshness gate to consolidate with; `check-docs-freshness.mjs` (referenced at line 346) handles
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
A small, correctly-wired dependency bump that takes the undocumented-callable ratchet to its terminal value of 0, with the regenerated catalog and a defineLeaderboard summary proving the gate bites.
- Integration: Fully wired and reachable. The ratchet (maxUndocumentedCallables=0, gen-primitive-catalog.mjs:274) is enforced by an exit(1) at lines 331-339, runs via `docs:api` (package.json:91), and is re-run by check-docs-freshness.mjs:594 under `docs:check`. defineLeaderboard, the callable whose new summary satisfies the gate, is a real exported, tested facade (src/runtime/index.ts:95, define-leaderboard.t
- Fit with existing patterns: Follows the established pattern exactly. This is the final step of an existing iterative ratchet (629→108→34→0) on the existing anti-reinvention inventory; it competes with nothing. The defineLeaderboard TSDoc is placed correctly: summary line before the @experimental tag, matching the extractor's position rule at gen-primitive-catalog.mjs:42-45.
- Real-world viability: Holds up. Setting the ceiling to 0 is the intended terminal state: any new public function/class/const export (own surface or the 6 curated agent-eval surfaces) without a TSDoc summary now fails docs:check. The PR body documents the ratchet already caught defineLeaderboard during the main merge, so the strictness is proven, not speculative. The one coupling — a future agent-eval release adding an
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
No concerns — sound change, no better or existing approach found. ✅
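The position rule referenced above (summary line before any `@tag`, otherwise the row reads as blank) can be illustrated with a minimal sketch. `extractSummary` is a hypothetical helper written for this note and is not the extractor in `gen-primitive-catalog.mjs`; the doc-comment text in the usage lines is invented for illustration.

```typescript
// Hypothetical helper (assumption, not the repo's extractor): return the text
// that precedes the first block tag in a TSDoc comment, or "" when the comment
// opens with a tag.
function extractSummary(docComment: string): string {
  const lines = docComment
    .replace(/^\/\*\*/, "") // drop the opening delimiter
    .replace(/\*\/$/, "") // drop the closing delimiter
    .split("\n")
    .map((line) => line.replace(/^\s*\*?\s?/, "").trim()); // drop the leading " * "

  const summary: string[] = [];
  for (const line of lines) {
    if (line.startsWith("@")) break; // the first block tag ends the summary
    if (line.length > 0) summary.push(line);
  }
  return summary.join(" ");
}

// Summary first: the catalog row would get text.
const summaryFirst = extractSummary("/**\n * Illustrative summary line.\n * @experimental\n */");

// Tag first: nothing precedes the tag, so the row would read as blank.
const tagFirst = extractSummary("/**\n * @experimental\n * Illustrative summary line.\n */");
```

Under these assumptions `summaryFirst` holds the summary sentence and `tagFirst` is the empty string, which is the case the ratchet at 0 would reject.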
✅ No Blockers —
| | glm | deepseek | aggregate |
|---|---|---|---|
| Readiness | 86 | 89 | 86 |
| Confidence | 85 | 85 | 85 |
| Correctness | 86 | 89 | 86 |
| Security | 86 | 89 | 86 |
| Testing | 86 | 89 | 86 |
| Architecture | 86 | 89 | 86 |
Reviewer score is advisory once the run is complete and the verdict has no blockers.
Full multi-shot audit completed 5/5 planned shots over 6 changed files. Global verifier still owns final merge decision.
🟡 LOW Transitive viem minor bump (2.52.2 → 2.54.2) pulled by lockfile drift, not declared — pnpm-lock.yaml
viem moves a minor version via sandbox/tcloud optional deps (spec `>=0.8.0 <1.0.0` is wide). Lockfile resolves it consistently across sandbox, tcloud, and isows/ws. No correctness issue in this file, but the sandbox/tcloud consumers of viem are not re-tested in this shot; that coverage belongs to the package.json + src shots. Flag as a nit, not a blocker: if any agent-runtime code touches viem APIs directly, the behavior diff between 2.52 and 2.54 should be checked there.
🟡 LOW Ratchet comment doesn't note 0 is the terminal floor — scripts/gen-primitive-catalog.mjs
The comment block at lines 268-272 (and the header at 38-40) describes the ceiling as 'the exact current count' and instructs to 'lower the constant to match' after a backfill. At 0 the ratchet has reached its floor — it cannot be lowered further and any new callable MUST ship with a TSDoc summary or the build breaks. A one-line note ('0 = fully documented; this is the floor, not a step toward a lower ceiling') would make the terminal intent explicit and prevent a future reader from thinking 0 is temporary. Pure documentation nit, non-blocking.
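To make the "0 is the floor" point concrete, here is a minimal sketch of a ratchet gate of this kind: count rows with a blank summary, compare against the ceiling, and signal failure when the count exceeds it. Only the constant name `maxUndocumentedCallables` comes from the PR; `CallableRow`, `checkRatchet`, and the sample rows are assumptions made for illustration and do not reproduce `gen-primitive-catalog.mjs`.

```typescript
// Hypothetical row shape: one entry per public callable in the catalog.
interface CallableRow {
  name: string;
  summary: string;
}

// 0 = fully documented. This is the floor, not a step toward a lower ceiling.
const maxUndocumentedCallables = 0;

// Returns the exit code a generator script could pass to process.exit,
// plus the offending names so the error message can list them.
function checkRatchet(
  rows: CallableRow[],
  ceiling: number,
): { exitCode: number; offenders: string[] } {
  const offenders = rows
    .filter((row) => row.summary.trim() === "")
    .map((row) => row.name);
  return { exitCode: offenders.length > ceiling ? 1 : 0, offenders };
}

// Illustrative rows: one documented export, one blank.
const sampleRows: CallableRow[] = [
  { name: "documentedExport", summary: "Illustrative summary." },
  { name: "blankExport", summary: "" },
];
const result = checkRatchet(sampleRows, maxUndocumentedCallables);
```

With a ceiling of 0 a single blank row is enough to fail, whereas the earlier ceiling of 34 would have let the same input pass.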
🟡 LOW Doc comment partially duplicates module-level header — src/runtime/define-leaderboard.ts
The new JSDoc at lines 255-259 restates 'cases + prompt + score' which the module header (lines 12-24) already documents. Harmless and arguably good for hover/IDE discovery, but consider a single canonical phrasing. No behavior impact.
🟡 LOW Unvalidated numeric CLI args produce NaN silently — src/runtime/define-leaderboard.ts
`Number(args.shots ?? ...)` and `Number(args.reps ?? ...)` coerce non-numeric CLI values like `--shots foo` to `NaN`. The NaN then flows into `maxIterations: NaN` in the naiveDriver and `reps: NaN` in runProfileMatrix, potentially causing zero iterations with no error. Example: `--shots abc` should throw `not a valid number` but instead silently produces NaN. Fix: validate with `isNaN(parsed)` after conversion and throw a clear error.
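A minimal sketch of the suggested validation follows. `parsePositiveInt` is a hypothetical helper, and it assumes the raw flag value arrives as an optional string; it is also stricter than the `isNaN` check proposed above, since it rejects zero, negatives, and fractions, which may or may not match what `defineLeaderboard` should accept.

```typescript
// Hypothetical helper (assumption): parse a numeric CLI flag, falling back to a
// default when the flag is absent and throwing on anything that is not a
// positive integer.
function parsePositiveInt(
  flag: string,
  raw: string | undefined,
  fallback: number,
): number {
  if (raw === undefined) return fallback;
  const parsed = Number(raw);
  // Number("abc") is NaN and Number("") is 0; both are rejected here.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`--${flag}: "${raw}" is not a valid positive integer`);
  }
  return parsed;
}

const shots = parsePositiveInt("shots", "3", 1); // explicit value
const reps = parsePositiveInt("reps", undefined, 2); // flag absent, default used
```

Failing at parse time keeps a bad value from reaching `maxIterations` or `reps`, where it would otherwise surface only as a run that does nothing.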
🟡 LOW as never type assertion on sandboxOverrides bypasses type safety — src/runtime/define-leaderboard.ts
The `sandboxOverrides` object is cast with `as never` to satisfy the `AgentRunSpec.sandboxOverrides` type. If the sandbox SDK's `CreateSandboxOptions` shape changes (e.g., `backend.model` requires new fields), this cast silently accepts the mismatch at compile time and only fails at runtime inside the loop kernel. The inline object `{ backend: { type: axis.harness, model: { ...spec.modelBackend, model: modelId } } }` is deliberately looser because `spec.modelBackend` is `Record<string, unknown>`, but a narrower intermediate type assertion would catch structural drift earlier.
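One way to get the earlier structural check the finding asks for is to build the object through a function with an explicit return type instead of casting. The sketch below uses stand-in shapes: `BackendModel`, `SandboxOverrides`, and `buildOverrides` are assumptions and do not mirror the real `CreateSandboxOptions` type, which may have constraints that make the cast necessary.

```typescript
// Stand-in shapes (assumptions, not the sandbox SDK's types).
interface BackendModel {
  model: string;
  [key: string]: unknown; // tolerate the loose Record<string, unknown> backend
}
interface SandboxOverrides {
  backend: { type: string; model: BackendModel };
}

// The return type makes the compiler check the literal's structure. If a
// required field were added to SandboxOverrides, this function would stop
// compiling, whereas an `as never` cast would keep accepting it.
function buildOverrides(
  harness: string,
  modelBackend: Record<string, unknown>,
  modelId: string,
): SandboxOverrides {
  return {
    backend: { type: harness, model: { ...modelBackend, model: modelId } },
  };
}

// Illustrative values only.
const overrides = buildOverrides(
  "example-harness",
  { provider: "example", model: "placeholder" },
  "example-model",
);
```

The spread keeps every key from the loose backend record while the explicit `model: modelId` wins over any `model` key already present.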
tangletools · 2026-07-03T20:52:53Z · trace
tangletools
left a comment
✅ Approved — 5 non-blocking findings — fde1d360
Full multi-shot audit completed 5/5 planned shots over 6 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-07-03T20:52:53Z · immutable trace
…bstrate # Conflicts: # docs/api/primitive-catalog.md # docs/api/runtime.md # src/runtime/define-leaderboard.ts
tangletools
left a comment
✅ Auto-approved drewstone PR — 050bb0ae
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-04T01:55:06Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 101.8s (2 bridge agents) |
| Total | 101.8s |
💰 Value — sound
Bumps agent-eval to 0.103.2 (whose TSDoc backfills healed the last substrate blanks), drops the undocumented-callable ratchet from 34 to its terminal 0, and adds the two remaining own-code summaries — a clean, mechanism-consistent close to the catalog-blanks arc.
- What it does: Three coupled changes: (1) raises the @tangle-network/agent-eval devDep floor from ^0.103.1 to ^0.103.2 (lockfile resolves to 0.103.2, confirmed in pnpm-lock.yaml); (2) sets maxUndocumentedCallables from 34 to 0 in scripts/gen-primitive-catalog.mjs:274, so any future public function/class/const export without a TSDoc summary line fails docs:check; (3) adds TSDoc summary lines to the two own-code c
- Goals it achieves: Make 'undocumented public callable' a hard build failure rather than tolerated shame. The primitive catalog is the anti-reinvention inventory (scripts/gen-primitive-catalog.mjs:2-10) — a blank row there is a reach-for primitive with no one-line description, which is exactly the thing the catalog exists to prevent. With agent-eval 0.103.1+0.103.2 backfilling every remaining substrate TSDoc, the rat
- Assessment: Good change, built in the grain of the codebase. The ratchet mechanism was introduced in #448 and lowered in #450 with the explicit contract (scripts/gen-primitive-catalog.mjs:268-272): 'the ceiling is the exact current count; when a backfill lowers the real number, lower the constant to match.' This PR does exactly that — the substrate backfill in 0.103.2 dropped the real count to 0, so the const
- Better / existing approach: none — this is the right approach. The ratchet is the codebase's own one-way gate for this exact problem; dropping it to 0 is its terminal state. Checked for alternatives: removing the ratchet in favor of manual review is strictly worse (the whole point is mechanical enforcement); lowering incrementally (34→12→0) would be cosmetic since the 0.103.2 substrate backfill healed all remaining blanks in
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
Tightens the catalog-blank ratchet 34→0 on the back of a substrate TSDoc backfill, documenting the two own-repo offenders it caught — all wired through a CI-enforced gate and landing on live code.
- Integration: Fully reachable and enforced. The ratchet runs via `docs:check` (package.json:93 → gen-primitive-catalog.mjs) which is in CI (.github/workflows/ci.yml:41) and exits 1 on violation (scripts/gen-primitive-catalog.mjs:338). Both documented functions are real public API: `createSandboxToolPartState` is called on the LIVE streaming path at `src/runtime/stream-agent-turn.ts:387` every turn when
- Fit with existing patterns: Fits the established grain precisely. The ratchet is an existing, documented mechanism (`scripts/gen-primitive-catalog.mjs:267-275`) whose stated contract is 'lower the constant to match when a backfill lowers the real number', and this PR does exactly that. The 629→108→34→0 arc shows an iterative tightening along one coherent axis, not a new competing pattern. Documenting `defineLeaderboard` was its
- Real-world viability: Holds up. The ratchet logic is simple (count blank-summary callable rows, compare to ceiling, exit 1) with no concurrency or edge-input surface. The TSDoc additions are pure documentation with no runtime effect. The one robustness-relevant detail, 'summary BEFORE any @tag, a tag-first block reads as blank' (line 335), is already surfaced in the error message, so future offenders get actionable g
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
No concerns — sound change, no better or existing approach found. ✅
✅ No Blockers —
| | glm | deepseek | aggregate |
|---|---|---|---|
| Readiness | 92 | 89 | 89 |
| Confidence | 85 | 85 | 85 |
| Correctness | 92 | 89 | 89 |
| Security | 92 | 89 | 89 |
| Testing | 92 | 89 | 89 |
| Architecture | 92 | 89 | 89 |
Reviewer score is advisory once the run is complete and the verdict has no blockers.
Full multi-shot audit completed 5/5 planned shots over 7 changed files. Global verifier still owns final merge decision.
🟡 LOW Catalog summaries truncated mid-sentence in table cells — docs/api/primitive-catalog.md
The catalog table cell for AgentEvalError reads 'Base class for every contract error this package throws — carries the stable' — cut off mid-sentence by the summary column's width cap. Other entries (createSandboxToolPartState, loadEvalFixture, planCampaignRun, etc.) are similarly truncated. Impact is cosmetic: the full TSDoc text IS present in runtime.md (verified for defineLeaderboard and createSandboxToolPartState). No behavioral risk. If the generator has a 'trailing dash' or 'first sentence only' heuristic it should be applied consistently; otherwise consider raising the column cap or terminating at sentence boundaries. Not blocking — generated artifact, fix at generator level.
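The "terminate at sentence boundaries" option could look roughly like the sketch below. `summarizeForCell` and its width parameter are hypothetical; the generator's real cap and truncation logic are not shown in this thread, and the sample strings are invented.

```typescript
// Hypothetical helper (assumption): prefer the first full sentence when it fits
// the cell, otherwise cut at a word boundary and mark the cut with an ellipsis.
function summarizeForCell(text: string, maxLength: number): string {
  const flat = text.replace(/\s+/g, " ").trim();

  // Index of the first sentence terminator followed by a space or end of text.
  const sentenceEnd = flat.search(/[.!?](\s|$)/);
  if (sentenceEnd !== -1 && sentenceEnd + 1 <= maxLength) {
    return flat.slice(0, sentenceEnd + 1);
  }
  if (flat.length <= maxLength) return flat;

  // Leave one character of room for the ellipsis.
  const cut = flat.lastIndexOf(" ", maxLength - 1);
  return flat.slice(0, cut > 0 ? cut : maxLength - 1) + "…";
}

// Illustrative inputs only.
const firstSentence = summarizeForCell("Short first sentence. A longer second one follows.", 40);
const wordBoundary = summarizeForCell("alpha beta gamma delta", 12);
```

A plain terminator search like this would cut early on abbreviations such as "e.g.", so a production version would likely need a more careful sentence splitter.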
🟡 LOW Pre-existing: as never cast masks type drift on sandbox overrides — src/runtime/define-leaderboard.ts
The `sandboxOverrides` spread on the per-cell agent run spec uses `as never` to satisfy TypeScript, bypassing all structural checking on the backend model shape. If the sandbox SDK's `CreateSandboxOptions.backend.model` adds required fields or renames keys, the compiler will not flag the mismatch here. Not introduced by this PR (JSDoc-only change), but the cast has existed since the facade was authored.
🟡 LOW Pre-existing: no unit tests for tool-part projection functions — src/runtime/sandbox-events.ts
`mapSandboxToolEvent`, `projectToolPart`, and `createSandboxToolPartState` (lines 160-288) have zero direct unit tests. The existing sandbox-events test suite covers `sumSandboxUsage`, `mapSandboxEvent`, and `extractLlmCallEvent` only. The tool-part state machine (pending→running→completed, terminal-failure result shape, call-id dedup) is exercised indirectly through `define-leaderboard`'s integration test, but has no isolated coverage. Not introduced by this PR.
tangletools · 2026-07-04T02:07:11Z · trace
tangletools
left a comment
✅ Approved — 3 non-blocking findings — 050bb0ae
Full multi-shot audit completed 5/5 planned shots over 7 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-07-04T02:07:11Z · immutable trace
Bumps the agent-eval floor to ^0.103.2. The substrate TSDoc backfills released in 0.103.1 + 0.103.2 (agent-eval#305, agent-eval#306) heal every remaining blank callable row in the primitive catalog:
- `maxUndocumentedCallables` 34 → 0 in `scripts/gen-primitive-catalog.mjs`: any future undocumented public `function`/`class`/`const` export now fails `docs:check`.
- `defineLeaderboard` (new on main, module-header-only doc): summary added at the declaration.

Closes the catalog-blanks arc: 629/1366 (46%) → 108 (#448) → 34 (#450) → 0.
Gates: build, lint, typecheck, docs:check (clean-tree), 1221 tests pass.