chore(deps): agent-eval 0.103.2 — catalog fully documented, ratchet 34→0 #461

Merged: drewstone merged 3 commits into main from chore/catalog-heal-substrate on Jul 4, 2026

@drewstone

Contributor

Bumps the agent-eval floor to ^0.103.2. The substrate TSDoc backfills released in 0.103.1 + 0.103.2 (agent-eval#305, agent-eval#306) heal every remaining blank callable row in the primitive catalog:

  • Blank public callables: 34 → 0 (catalog regenerated; 1405 symbols across 10 own subpaths + 6 substrate surfaces).
  • Ratchet maxUndocumentedCallables 34 → 0 in scripts/gen-primitive-catalog.mjs — any future undocumented public function/class/const export now fails docs:check.
  • The ratchet caught its first offender during the main merge: defineLeaderboard (new on main, module-header-only doc) — summary added at the declaration.

Closes the catalog-blanks arc: 629/1366 (46%) → 108 (#448) → 34 (#450) → 0.

Gates: build, lint, typecheck, docs:check (clean-tree), 1221 tests pass.
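
The ratchet described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/gen-primitive-catalog.mjs: the `CatalogRow` shape and the helper names `findUndocumentedCallables` and `checkRatchet` are assumptions made for the example; only the constant name `maxUndocumentedCallables` and the function/class/const scope come from this PR.

```typescript
// Hypothetical row shape; the real generator's data model is not shown in this PR.
interface CatalogRow {
  name: string;
  kind: "function" | "class" | "const" | "type" | "interface";
  summary: string;
}

// Ceiling on blank callable rows; 0 means every public callable needs a summary.
const maxUndocumentedCallables = 0;

// The PR scopes the gate to function/class/const exports.
const CALLABLE_KINDS = new Set<CatalogRow["kind"]>(["function", "class", "const"]);

/** Returns the names of callable rows whose summary is blank. */
function findUndocumentedCallables(rows: CatalogRow[]): string[] {
  return rows
    .filter((row) => CALLABLE_KINDS.has(row.kind) && row.summary.trim() === "")
    .map((row) => row.name);
}

/** Returns an error message when the ceiling is exceeded, otherwise null. */
function checkRatchet(
  rows: CatalogRow[],
  ceiling: number = maxUndocumentedCallables,
): string | null {
  const offenders = findUndocumentedCallables(rows);
  if (offenders.length <= ceiling) return null;
  return `${offenders.length} undocumented callables (ceiling ${ceiling}): ${offenders.join(", ")}`;
}
```

A script built on this sketch would print the message and exit non-zero whenever `checkRatchet` returns a string, which is what makes the ceiling a one-way gate.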

drewstone added 2 commits July 3, 2026 12:29
The 0.103.1/0.103.2 substrate TSDoc heals every remaining blank callable row
in docs/api/primitive-catalog.md (34 → 0). maxUndocumentedCallables drops to
0: any future undocumented public function/class/const export fails docs:check.

@tangletools tangletools left a comment

Contributor

✅ Auto-approved drewstone PR — fde1d360

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-03T18:34:32Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict: sound
Concerns: 0 (none)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 81.4s (2 bridge agents)
Total: 81.4s

💰 Value — sound

Bumps agent-eval to 0.103.2, drops the undocumented-callables ratchet from 34→0, and adds the one TSDoc the ratchet caught — closes the catalog-blanks arc permanently.

  • What it does: Bumps the @tangle-network/agent-eval dev dependency from ^0.103.1 to ^0.103.2, which ships TSDoc backfills for the substrate primitive surfaces. Lowers the maxUndocumentedCallables ratchet in scripts/gen-primitive-catalog.mjs from 34 to 0, so any future public function/class/const export without a TSDoc summary fails docs:check. Adds a TSDoc summary to the defineLeaderboard declaration in src/runt
  • Goals it achieves: Closes the catalog-blanks arc: every public callable in the primitive catalog now has a TSDoc summary visible in the generated catalog table. The ratchet at 0 functions as a permanent CI gate — any new undocumented export blocks the build, preventing the catalog from ever regressing into blanks again. The catalog is now a live, machine-enforced anti-reinvention inventory: agents can see what exist
  • Assessment: A minimal, coherent, three-piece change that completes a multi-PR arc (#448 → #450 → #461). The ratchet mechanism in gen-primitive-catalog.mjs:271-274 was explicitly designed for exactly this workflow (comment at line 271: 'when a backfill lowers the real number, lower the constant to match'). The dependency bump is the correct way to consume substrate TSDoc backfills; those summaries live in
  • Better / existing approach: none — this is the right approach. The ratchet mechanism already exists in scripts/gen-primitive-catalog.mjs:268-339 and was designed for this exact lifecycle: backfill summaries, lower the ceiling, regenerate. No alternative mechanism in the codebase does this. The dependency bump + TSDoc fix + ratchet drop are the intended, documented, single-path workflow. Searched: gen-primitive-catalog.mjs fo
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content

🎯 Usefulness — sound

Bumps agent-eval to the fully-documented release, drops the undocumented-callables ratchet to zero (CI-enforced), and adds the one missing TSDoc that the ratchet caught — a clean close to the catalog-blanks arc.

  • Integration: Ratchet enforcement is hard-wired into CI at .github/workflows/ci.yml:41 via docs:check → docs:api → scripts/gen-primitive-catalog.mjs:331-338, which exits 1 if undocumented callables exceed the ceiling (now 0). defineLeaderboard is publicly exported from src/runtime/index.ts:95 and has tests at src/runtime/define-leaderboard.test.ts. No external callers yet, but it landed in the prior
  • Fit with existing patterns: The ratchet mechanism (maxUndocumentedCallables) already existed and was progressively lowered across PRs #448 and #450 — this is the natural end state (zero tolerance). The TSDoc follows the same pattern as other exports in the codebase (bare-function summaries before @tag blocks). The agent-eval version bump is a routine devDependency patch update following the established substrate dependen
  • Real-world viability: The ratchet is a static-analysis CI gate — no concurrency or edge-input concerns. It runs via the TypeScript compiler extracting export symbols from the live codebase (scripts/gen-primitive-catalog.mjs:262-263), so it always reflects ground truth, not a hand-maintained list. Any future PR adding an undocumented export will fail CI with a specific error naming the offending symbol.
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

No concerns — sound change, no better or existing approach found. ✅


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Passes and what each asks:
  • Heuristic: Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
  • Duplication: Do added function/class names already exist elsewhere in the repo?
  • Value Audit: What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
  • Usefulness Audit: Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260703T184038Z

@tangletools

⚠️ Review Interrupted — fde1d360

The review runner stopped before publishing a final verdict: max_run_seconds.

State Detail
Interrupted max run seconds

No review verdict was produced for this run. Trigger a fresh review on the current PR head if the PR is still open.

tangletools · #461 · model: kimi-for-coding · updated 2026-07-03T20:39:20Z

@tangletools tangletools left a comment

✅ Auto-approved drewstone PR — fde1d360

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-03T20:47:15Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict: sound
Concerns: 0 (none)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 87.7s (2 bridge agents)
Total: 87.7s

💰 Value — sound

Bumps agent-eval to 0.103.2 (substrate TSDoc backfills) and ratchets the catalog's undocumented-callable ceiling 34→0, closing the docs arc with the right mechanism already in place.

  • What it does: Three coordinated edits: (1) package.json:97 floor @tangle-network/agent-eval ^0.103.1 → ^0.103.2, picking up substrate TSDoc backfills (agent-eval#305/#306) that fill the last blank callable rows; (2) scripts/gen-primitive-catalog.mjs:274 maxUndocumentedCallables 34 → 0, so any future public function/class/const export lacking a TSDoc summary exits 1 at docs:check (the failure path at l
  • Goals it achieves: Finish the catalog-blanks arc (629→108 via #448 → 34 via #450 → 0) and lock it closed: the ratchet now enforces zero undocumented public callables going forward, so the 'anti-reinvention inventory' (gen-primitive-catalog.mjs:349) can never silently drift back to blank rows. Better outcome = a new primitive added without a one-line summary is a red build, not a silent regression.
  • Assessment: Good change, in the grain of the codebase. The ratchet pattern was already established (commit history confirms #448 introduced it at 108, #450 tightened to 34); 0 is its natural terminal value, not a new mechanism. The defineLeaderboard summary is accurate and earns its row (describes the spec→runnable + adapter shape, names toBenchmarkAdapter()). The dep bump is additive (patch-level TSDoc bac
  • Better / existing approach: none — this is the right approach. The ratchet lives in the existing catalog generator (scripts/gen-primitive-catalog.mjs), which is already the canonical anti-reinvention inventory the repo routes through (canonical-api.md §2 is its judgmental companion per lines 355-357). There is no second docs-freshness gate to consolidate with — check-docs-freshness.mjs (referenced at line 346) handles
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

A small, correctly-wired dependency bump that takes the undocumented-callable ratchet to its terminal value of 0, with the regenerated catalog and a defineLeaderboard summary proving the gate bites.

  • Integration: Fully wired and reachable. The ratchet (maxUndocumentedCallables=0, gen-primitive-catalog.mjs:274) is enforced by an exit(1) at lines 331-339, runs via docs:api (package.json:91), and is re-run by check-docs-freshness.mjs:594 under docs:check. defineLeaderboard — the callable whose new summary satisfies the gate — is a real exported, tested facade (src/runtime/index.ts:95, define-leaderboard.t
  • Fit with existing patterns: Follows the established pattern exactly. This is the final step of an existing iterative ratchet (629→108→34→0) on the existing anti-reinvention inventory; it competes with nothing. The defineLeaderboard TSDoc is placed correctly — summary line before the @experimental tag, matching the extractor's position rule at gen-primitive-catalog.mjs:42-45.
  • Real-world viability: Holds up. Setting the ceiling to 0 is the intended terminal state: any new public function/class/const export (own surface or the 6 curated agent-eval surfaces) without a TSDoc summary now fails docs:check. The PR body documents the ratchet already caught defineLeaderboard during the main merge, so the strictness is proven, not speculative. The one coupling — a future agent-eval release adding an
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

No concerns — sound change, no better or existing approach found. ✅
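
The position rule mentioned under "Fit with existing patterns" (summary line before any @tag, otherwise the row reads as blank) can be illustrated with a small sketch. This is not the extractor at gen-primitive-catalog.mjs:42-45, which is not shown in this thread; `extractSummary` is a hypothetical helper that assumes the summary is whatever text precedes the first line starting with a block tag.

```typescript
/**
 * Returns the summary of a doc comment: the text before the first block tag.
 * A comment that opens with a tag yields an empty summary.
 */
function extractSummary(docComment: string): string {
  const lines = docComment
    .replace(/^\/\*\*/, "") // drop the opening delimiter
    .replace(/\*\/$/, "") // drop the closing delimiter
    .split("\n")
    .map((line) => line.replace(/^\s*\*?\s?/, "").trimEnd()); // strip the leading " * "

  const summary: string[] = [];
  for (const line of lines) {
    if (line.startsWith("@")) break; // first block tag ends the summary
    summary.push(line);
  }
  return summary.join(" ").replace(/\s+/g, " ").trim();
}
```

Under this assumption, a block that puts `@experimental` first produces an empty summary and would count against the ceiling, while the same text with the sentence first would not.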


value-audit · 20260703T204913Z

@tangletools

✅ No Blockers — fde1d360

Review health 100/100 · Reviewer score 86/100 · Confidence 85/100 · 5 findings (5 low)

Metric        glm  deepseek  aggregate
Readiness      86        89         86
Confidence     85        85         85
Correctness    86        89         86
Security       86        89         86
Testing        86        89         86
Architecture   86        89         86

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 5/5 planned shots over 6 changed files. Global verifier still owns final merge decision.

🟡 LOW Transitive viem minor bump (2.52.2 → 2.54.2) pulled by lockfile drift, not declared — pnpm-lock.yaml

viem moves a minor version via sandbox/tcloud optional deps (spec >=0.8.0 <1.0.0 is wide). Lockfile resolves it consistently across sandbox, tcloud, and isows/ws. No correctness issue in this file, but the sandbox/tcloud consumers of viem are not re-tested in this shot — that coverage belongs to the package.json + src shots. Flag as a nit, not a blocker: if any agent-runtime code touches viem APIs directly, the behavior diff between 2.52 and 2.54 should be checked there.

🟡 LOW Ratchet comment doesn't note 0 is the terminal floor — scripts/gen-primitive-catalog.mjs

The comment block at lines 268-272 (and the header at 38-40) describes the ceiling as 'the exact current count' and instructs to 'lower the constant to match' after a backfill. At 0 the ratchet has reached its floor — it cannot be lowered further and any new callable MUST ship with a TSDoc summary or the build breaks. A one-line note ('0 = fully documented; this is the floor, not a step toward a lower ceiling') would make the terminal intent explicit and prevent a future reader from thinking 0 is temporary. Pure documentation nit, non-blocking.

🟡 LOW Doc comment partially duplicates module-level header — src/runtime/define-leaderboard.ts

The new JSDoc at lines 255-259 restates 'cases + prompt + score' which the module header (lines 12-24) already documents. Harmless and arguably good for hover/IDE discovery, but consider a single canonical phrasing. No behavior impact.

🟡 LOW Unvalidated numeric CLI args produce NaN silently — src/runtime/define-leaderboard.ts

Number(args.shots ?? ...) and Number(args.reps ?? ...) coerce non-numeric CLI values such as --shots foo to NaN. The NaN then flows into maxIterations: NaN in the naiveDriver and reps: NaN in runProfileMatrix, potentially causing zero iterations with no error. Example: --shots abc should throw a "not a valid number" error but instead silently produces NaN. Fix: check the parsed value with Number.isNaN(parsed) after conversion and throw a clear error.
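
One way the suggested validation could look is sketched below. `parsePositiveInt` is a hypothetical helper, not something in src/runtime/define-leaderboard.ts, and the requirement that the value be a positive integer is an assumption that goes slightly beyond the NaN check named in the finding.

```typescript
/**
 * Parses a numeric CLI value, falling back to a default when the flag is absent.
 * Throws instead of letting NaN (or a non-positive value) flow downstream.
 */
function parsePositiveInt(flag: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const parsed = Number(raw);
  // Number.isInteger is false for NaN, Infinity, and fractional values.
  if (!Number.isInteger(parsed) || parsed < 1) {
    throw new Error(`--${flag}: "${raw}" is not a valid positive integer`);
  }
  return parsed;
}
```

With a helper like this, `--shots abc` fails at argument parsing rather than reaching the driver as NaN.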

🟡 LOW as never type assertion on sandboxOverrides bypasses type safety — src/runtime/define-leaderboard.ts

The sandboxOverrides object is cast with as never to satisfy the AgentRunSpec.sandboxOverrides type. If the sandbox SDK's CreateSandboxOptions shape changes (e.g., backend.model requires new fields), this cast silently accepts the mismatch at compile time and only fails at runtime inside the loop kernel. The inline object { backend: { type: axis.harness, model: { ...spec.modelBackend, model: modelId } } } is deliberately looser because spec.modelBackend is Record<string, unknown>, but a narrower intermediate type assertion would catch structural drift earlier.
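
A narrower intermediate type along the lines suggested here might look like the following sketch. `BackendModel`, `SandboxOverrides`, and `buildSandboxOverrides` are hypothetical names invented for illustration; the real CreateSandboxOptions type lives in the sandbox SDK and is not reproduced in this thread.

```typescript
// Hypothetical local shapes standing in for the SDK's real option types.
interface BackendModel {
  model: string;
  [key: string]: unknown; // extra backend settings pass through untyped
}

interface SandboxOverrides {
  backend: { type: string; model: BackendModel };
}

/**
 * Builds the per-cell override object. The explicit return type makes the
 * compiler check the literal's structure, so a renamed or missing key is a
 * compile error here instead of a runtime failure later.
 */
function buildSandboxOverrides(
  harness: string,
  modelBackend: Record<string, unknown>,
  modelId: string,
): SandboxOverrides {
  return {
    backend: { type: harness, model: { ...modelBackend, model: modelId } },
  };
}
```

Any remaining cast to the SDK type would then sit at a single call site, with the structural check done before it.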


tangletools · 2026-07-03T20:52:53Z · trace

tangletools
tangletools previously approved these changes Jul 3, 2026

@tangletools tangletools left a comment

✅ Approved — 5 non-blocking findings — fde1d360

Full multi-shot audit completed 5/5 planned shots over 6 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-07-03T20:52:53Z · immutable trace

…bstrate

# Conflicts:
#	docs/api/primitive-catalog.md
#	docs/api/runtime.md
#	src/runtime/define-leaderboard.ts

@tangletools tangletools left a comment

✅ Auto-approved drewstone PR — 050bb0ae

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-04T01:55:06Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict: sound
Concerns: 0 (none)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 101.8s (2 bridge agents)
Total: 101.8s

💰 Value — sound

Bumps agent-eval to 0.103.2 (whose TSDoc backfills healed the last substrate blanks), drops the undocumented-callable ratchet from 34 to its terminal 0, and adds the two remaining own-code summaries — a clean, mechanism-consistent close to the catalog-blanks arc.

  • What it does: Three coupled changes: (1) raises the @tangle-network/agent-eval devDep floor from ^0.103.1 to ^0.103.2 (lockfile resolves to 0.103.2, confirmed in pnpm-lock.yaml); (2) sets maxUndocumentedCallables from 34 to 0 in scripts/gen-primitive-catalog.mjs:274, so any future public function/class/const export without a TSDoc summary line fails docs:check; (3) adds TSDoc summary lines to the two own-code c
  • Goals it achieves: Make 'undocumented public callable' a hard build failure rather than tolerated shame. The primitive catalog is the anti-reinvention inventory (scripts/gen-primitive-catalog.mjs:2-10) — a blank row there is a reach-for primitive with no one-line description, which is exactly the thing the catalog exists to prevent. With agent-eval 0.103.1+0.103.2 backfilling every remaining substrate TSDoc, the rat
  • Assessment: Good change, built in the grain of the codebase. The ratchet mechanism was introduced in #448 and lowered in #450 with the explicit contract (scripts/gen-primitive-catalog.mjs:268-272): 'the ceiling is the exact current count; when a backfill lowers the real number, lower the constant to match.' This PR does exactly that — the substrate backfill in 0.103.2 dropped the real count to 0, so the const
  • Better / existing approach: none — this is the right approach. The ratchet is the codebase's own one-way gate for this exact problem; dropping it to 0 is its terminal state. Checked for alternatives: removing the ratchet in favor of manual review is strictly worse (the whole point is mechanical enforcement); lowering incrementally (34→12→0) would be cosmetic since the 0.103.2 substrate backfill healed all remaining blanks in
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

Tightens the catalog-blank ratchet 34→0 on the back of a substrate TSDoc backfill, documenting the two own-repo offenders it caught — all wired through a CI-enforced gate and landing on live code.

  • Integration: Fully reachable and enforced. The ratchet runs via docs:check (package.json:93 → gen-primitive-catalog.mjs), which is in CI (.github/workflows/ci.yml:41) and exits 1 on violation (scripts/gen-primitive-catalog.mjs:338). Both documented functions are real public API: createSandboxToolPartState is called on the LIVE streaming path at src/runtime/stream-agent-turn.ts:387 every turn when
  • Fit with existing patterns: Fits the established grain precisely. The ratchet is an existing, documented mechanism (scripts/gen-primitive-catalog.mjs:267-275) whose stated contract is 'lower the constant to match when a backfill lowers the real number' — this PR does exactly that. The 629→108→34→0 arc shows an iterative tightening along one coherent axis, not a new competing pattern. Documenting defineLeaderboard was its
  • Real-world viability: Holds up. The ratchet logic is simple (count blank-summary callable rows, compare to ceiling, exit 1) with no concurrency or edge-input surface. The TSDoc additions are pure documentation with no runtime effect. The one robustness-relevant detail — 'summary BEFORE any @tag, a tag-first block reads as blank' (line 335) — is already surfaced in the error message, so future offenders get actionable g
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

No concerns — sound change, no better or existing approach found. ✅


value-audit · 20260704T015946Z

@tangletools

✅ No Blockers — 050bb0ae

Review health 100/100 · Reviewer score 89/100 · Confidence 85/100 · 3 findings (3 low)

Metric        glm  deepseek  aggregate
Readiness      92        89         89
Confidence     85        85         85
Correctness    92        89         89
Security       92        89         89
Testing        92        89         89
Architecture   92        89         89

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 5/5 planned shots over 7 changed files. Global verifier still owns final merge decision.

🟡 LOW Catalog summaries truncated mid-sentence in table cells — docs/api/primitive-catalog.md

The catalog table cell for AgentEvalError reads 'Base class for every contract error this package throws — carries the stable' — cut off mid-sentence by the summary column's width cap. Other entries (createSandboxToolPartState, loadEvalFixture, planCampaignRun, etc.) are similarly truncated. Impact is cosmetic: the full TSDoc text IS present in runtime.md (verified for defineLeaderboard and createSandboxToolPartState). No behavioral risk. If the generator has a 'trailing dash' or 'first sentence only' heuristic it should be applied consistently; otherwise consider raising the column cap or terminating at sentence boundaries. Not blocking — generated artifact, fix at generator level.
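
Terminating at sentence boundaries, as suggested above, could be done with something like this sketch. `truncateSummary` is a hypothetical helper; how the generator actually caps the summary column is not shown in this thread, so the cap parameter and the ellipsis fallback are assumptions.

```typescript
/**
 * Shortens a summary to at most `cap` characters. Prefers to end at a sentence
 * boundary; otherwise cuts at a word boundary and appends an ellipsis.
 */
function truncateSummary(text: string, cap: number): string {
  if (text.length <= cap) return text;

  // Look one character past the cap so a terminator at position cap - 1 still counts.
  const probe = text.slice(0, cap + 1);
  const sentenceEnd = Math.max(
    probe.lastIndexOf(". "),
    probe.lastIndexOf("! "),
    probe.lastIndexOf("? "),
  );
  if (sentenceEnd > 0) return text.slice(0, sentenceEnd + 1);

  // No full sentence fits: cut at the last space and mark the cut explicitly.
  const head = text.slice(0, cap);
  const wordEnd = head.lastIndexOf(" ");
  const kept = wordEnd > 0 ? head.slice(0, wordEnd) : head.slice(0, cap - 1);
  return `${kept.trimEnd()}…`;
}
```

The explicit ellipsis at least signals that a cell was cut, which the current mid-sentence truncation does not.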

🟡 LOW Pre-existing: as never cast masks type drift on sandbox overrides — src/runtime/define-leaderboard.ts

The sandboxOverrides spread on the per-cell agent run spec uses as never to satisfy TypeScript, bypassing all structural checking on the backend model shape. If the sandbox SDK's CreateSandboxOptions.backend.model adds required fields or renames keys, the compiler will not flag the mismatch here. Not introduced by this PR (JSDoc-only change), but the cast has existed since the facade was authored.

🟡 LOW Pre-existing: no unit tests for tool-part projection functions — src/runtime/sandbox-events.ts

mapSandboxToolEvent, projectToolPart, and createSandboxToolPartState (lines 160-288) have zero direct unit tests. The existing sandbox-events test suite covers sumSandboxUsage, mapSandboxEvent, and extractLlmCallEvent only. The tool-part state machine (pending→running→completed, terminal-failure result shape, call-id dedup) is exercised indirectly through define-leaderboard's integration test, but has no isolated coverage. Not introduced by this PR.


tangletools · 2026-07-04T02:07:11Z · trace

@tangletools tangletools left a comment

✅ Approved — 3 non-blocking findings — 050bb0ae

Full multi-shot audit completed 5/5 planned shots over 7 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-07-04T02:07:11Z · immutable trace

@drewstone drewstone merged commit 738a9f7 into main Jul 4, 2026
1 check passed
@drewstone drewstone deleted the chore/catalog-heal-substrate branch July 4, 2026 02:18