feat(product-benchmark): export helpers for product-owned benchmark bundles by drewstone · Pull Request #302 · tangle-network/agent-eval

drewstone · 2026-07-02T02:27:22Z

Problem

research-package.ts is triplicated across tax-agent (687 lines), legal-agent (666), and creative-agent (536) — and has already drifted 359 lines between tax and legal. agent-eval main already owns the product-benchmark schema + validators (ProductBenchmarkRecord, ProductBenchmarkManifest, validateProductBenchmarkRun, integrity failures); what the products still hand-roll is the EXPORT side: converting RunRecords, building manifests, and writing the bundle.

Change

Additive to the existing @tangle-network/agent-eval/product-benchmark subpath — no consumer updates required until products migrate:

runRecordToProductBenchmarkRecord — RunRecord + artifact pointers → validated benchmark record
buildProductBenchmarkManifest / productBenchmarkRepoIdentity — manifest with repo/commit/package provenance
exportProductBenchmark / exportProductBenchmarkRuns — run-dir(s) → manifest + records JSONL + artifact pointers, validated on write

Product-specific code (scenario lists, runner commands, scoring, profile builders) stays in products; the next step is migrating tax/legal/creative research-package.ts onto these helpers after release.

Verification

npx vitest run src/product-benchmark/ → 12 pass / 0 fail
Full suite → 2663 pass / 0 fail
npx tsc --noEmit → clean

🤖 Generated with Claude Code

…est build, run-dir export

tangletools

✅ Auto-approved drewstone PR — `45a3c090`

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: drewstone_author · 2026-07-02T02:27:31Z}

tangletools

⚪ Value Audit — audit-incomplete


Verdict	audit-incomplete
Concerns	0 (none)
Heuristic	0.0s
Duplication	0.1s
Interrogation	14.8s (2 bridge agents)
Total	14.9s

💰 Value — error

value agent produced no parseable value-audit JSON.

Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 3
Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 3
Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

No PR concerns were produced because the value/usefulness agent pass did not complete. Treat this audit as incomplete, not as approval.

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260702T025445Z}

tangletools · 2026-07-02T03:19:08Z

✅ No Blockers — `45a3c090`

Review health 100/100 · Reviewer score 66/100 · Confidence 70/100 · 10 findings (3 medium, 7 low)

	opencode-kimi	glm	deepseek	aggregate
Readiness	66	76	92	66
Confidence	70	70	70	70
Correctness	66	76	92	66
Security	66	76	92	66
Testing	66	76	92	66
Architecture	66	76	92	66

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Manifest arms/scenarios deduplicated by id only, dropping divergent attributes — src/product-benchmark/export.ts

Arms and scenarios are built with new Map(records.map((record) => [record.armId, record])).values() and the same pattern for scenarioId, keeping only the first record per id. If records share an armId but differ in profileId, model, backend, harness, or reasoningEffort, the manifest silently keeps one set of policy axes and profileId. The validator (index.ts:468-472) only checks that the arm/scenario id exists, not that attributes match the records. This can produce a manifest that misrepresents the actual arms/scenarios in the records. Fix: group by a composite key (armId + profileId + model + backend + harness + reasoningEffort) for arms and (scenarioId + split) for scenarios, or validate that all records with the same armId/scenarioId agree on the manifest attributes and fail loud on

🟠 MEDIUM Substrate version defaults silently fall back to 'unknown' — src/product-benchmark/export.ts

packageVersion returns 'unknown' when a package is not declared in the cwd package.json, and both buildProductBenchmarkManifest (lines 496-501) and resolveOptions (line 618) use this default. The manifest validator only checks that substrate fields are non-empty strings, so 'unknown' passes validation silently. This violates the repo's 'No fallbacks. Fail loud.' doctrine (CLAUDE.md) and can produce manifests with meaningless substrate versions when export is run from a repo that does not directly declare @tangle-network/a

🟠 MEDIUM Substrate 'unknown' versions pass silently; no substrateFailures analog to repoFailures — src/product-benchmark/index.ts

The PR adds repoFailures that flags manifest.repo.* === 'unknown' (index.ts:474-476) and assertProductBenchmarkRun (index.ts:547-558) throws on it. But manifest.substrate.{agentEval,agentRuntime,agentInterface,sandbox} are validated only as non-empty strings (validateProductBenchmarkManifest:308-311), and packageVersion (export.ts) returns 'unknown' whenever the dep is absent from cwd package.json. A bundle whose entire substrate block is 'unknown' — un-reproducible — passes assertProductBenchmarkRun cleanly. Reproducibility of a benchmark depends on substrate identity at least as much as on repo branch. Add a substrateFailures list (same shape as repoFailures) flagging any substrate key === 'unknown', and include it in assertProductBenchmarkRun's failure union at index.ts:553.

🟡 LOW Test coverage gaps: option overrides and individually-exported helpers untested — src/product-benchmark/export.test.ts

The four product-drift tests + merge/single-run/error tests cover the core matrix well, but several ProductBenchmarkExportOptions fields are never asserted: passThreshold (only the default 0.7 is exercised), backendVersion override, mutableSurfaces override (only default asserted at line ~117), scenarioTagPrefix override (only the projectId→strip heuristic). The individually-exported helpers runRecordToProductBenchmarkRecord and buildProductBenchmarkManifest are exercised only indirectly through exportProductBenchmarkRuns; productBenchmarkRepoIdentity is never called directly. Add one test per override to lock the option contract.

🟡 LOW Circular module dependency between index.ts and export.ts — src/product-benchmark/export.ts

export.ts imports validateProductBenchmarkManifest and validateProductBenchmarkRecord from ./index, while index.ts re-exports types and functions from ./export (index.ts:560-572). Current load order happens to work because the validators are defined before the re-export in index.ts, but the coupling is fragile and can break with future reordering or bundler changes. Fix: move the shared validators and types that both modules need into a separate contract.ts or validate.ts file that both import, eliminating the cycle.

🟡 LOW Repo branch fallback 'detached:unknown' evades repo-failure detection — src/product-benchmark/export.ts

When git cannot resolve a branch or commit, repoBranch returns detached:unknown. The repoFailures check in validateProductBenchmarkRun (index.ts:474-476) only flags values that are empty or exactly equal to 'unknown', so detached:unknown passes as a valid branch. This contradicts the stated intent that 'unknown' values are flagged. Fix: make the repo-failure check detect sentinels containing 'unknown' (e.g., manifest.repo[key].includes('unknown')) or avoid embedding 'unknown' in the fallback string.

🟡 LOW Safety split detection only matches numeric 1, not boolean true — src/product-benchmark/export.ts

(record.outcome.raw as Record<string, number>).safety === 1 is false when a harness emits safety: true, because true === 1 is false in JavaScript. Although RunOutcome.raw is typed as Record<string, number>, the export ingests raw JSONL and casts via expectRunRecordShape, so a boolean can reach this check at runtime. A safety row would then be misclassified as practice/dev/holdout based on splitTag. Fix: accept both 1 and true (e.g., raw.safety === 1 || raw.safety === true).

🟡 LOW packageVersion records dep semver range, not installed version — src/product-benchmark/export.ts

packageVersion reads cwd package.json and returns pkg.dependencies?.[name] ?? pkg.devDependencies?.[name] ?? 'unknown' — i.e. a range like '^0.101.0', not the resolved/installed version from node_modules//package.json. The field is recorded under manifest.substrate.* and backend.version, both used for reproducibility. A range understates precision (many satisfying versions exist). Consistent with prior product exporters per the doc comment, so not a regression — but reading node_modules//package.json (or accepting an explicit resolvedVersion option) would make the substrate block actually pin a run. Worth a follow-up.

🟡 LOW sourceProfileHash fallback mixes cellId with source hash semantics — src/product-benchmark/export.ts

The fallback chain at line 308–309 reads record.agentProfile?.sourceProfile?.hash ?? record.agentProfile?.cellId. But cellId is "agent-profile-cell:sha256:<full cell hash>" — the identity of the entire agent profile cell (profileId + sourceProfile + harness + model + dimensions + promptHash) — not just the source profile. Using it where a source-profile hash is expected produces a materialized profile hash that incorporates cell metadata rather than just {sourceHash, model}. The result is deterministic and self-consistent within the bundle (same computation used in manifest and records), but a consumer trying to independently verify the profile has

🟡 LOW splitTag==='search' silently collapses to 'practice' (undocumented mapping) — src/product-benchmark/export.ts

splitOf: custom → safety(raw.safety===1) → 'holdout' → 'dev' → else 'practice'. RunRecord.splitTag is 'search'|'dev'|'holdout' (run-record.ts:33); product-benchmark splits add 'safety'/'sentinel' but NOT 'search'. A row with splitTag='search' (the optimization pool per run-record.ts:32) falls through to 'practice' with no comment. This is a defensible default (search ≈ practice) but is invisible to callers and not tested. Either document the mapping in the splitOf docstring or add an explicit if (record.splitTag === 'search') return 'practice' line with a one-line comment so the intent is visible at the decision site.

_{tangletools · 2026-07-02T03:19:05Z · trace}

tangletools

✅ Approved — 10 non-blocking findings — `45a3c090`

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-07-02T03:19:05Z · immutable trace}

…venance - arms/scenarios dedup fails loud when records sharing an id disagree on a policy axis or split (id-only dedup silently misrepresented them) - substrateFailures mirrors repoFailures: an 'unknown' substrate version fails assertProductBenchmarkRun (un-reproducible bundle) - packageVersion falls back to the INSTALLED node_modules version (transitive deps record real provenance), plus an explicit substrate override option for environments where a package is absent

drewstone · 2026-07-02T05:06:17Z

All three MEDIUMs fixed, latest push:

Id-only dedup drops divergent attributes: uniqueArmRecords / uniqueScenarioRecords now fail loud when records sharing an arm id disagree on any manifest-carried axis (profileId/model/backend/harness/reasoningEffort) or a scenario id disagrees on split — tests reproduce both.
'unknown' substrate passes silently (both findings): substrateFailures mirrors repoFailures — any substrate version that is empty or 'unknown' fails assertProductBenchmarkRun (test mutates the manifest and asserts the throw). Root-cause side: packageVersion now falls back to the INSTALLED node_modules version (a product getting sandbox transitively records real provenance), and a new explicit substrate override option covers environments where a package genuinely isn't installed — the fixtures exercise it, since agent-eval's own repo (correctly, per layering) has no agent-runtime/sandbox.

Full suite 2671 passed / 2 skipped; lint + typecheck clean; branch merged with main.

tangletools

✅ Auto-approved drewstone PR — `282ba8bf`

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: drewstone_author · 2026-07-02T05:06:27Z}

feat(product-benchmark): export helpers — RunRecord conversion, manif…

45a3c09

…est build, run-dir export

tangletools approved these changes Jul 2, 2026

View reviewed changes

tangletools reviewed Jul 2, 2026

View reviewed changes

tangletools previously approved these changes Jul 2, 2026

View reviewed changes

drewstone added 2 commits July 1, 2026 23:05

Merge remote-tracking branch 'origin/main' into feat/product-benchmark

fa6b9b7

drewstone dismissed tangletools’s stale review via 282ba8b July 2, 2026 05:06

tangletools approved these changes Jul 2, 2026

View reviewed changes

drewstone merged commit 1c08a44 into main Jul 2, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(product-benchmark): export helpers for product-owned benchmark bundles#302

feat(product-benchmark): export helpers for product-owned benchmark bundles#302
drewstone merged 3 commits into
mainfrom
feat/product-benchmark

drewstone commented Jul 2, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools commented Jul 2, 2026

Uh oh!

tangletools left a comment

Uh oh!

drewstone commented Jul 2, 2026

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

drewstone commented Jul 2, 2026

Problem

Change

Verification

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved drewstone PR — 45a3c090

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

⚪ Value Audit — audit-incomplete

💰 Value — error

🎯 Usefulness — error

Uh oh!

tangletools commented Jul 2, 2026

✅ No Blockers — 45a3c090

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Approved — 10 non-blocking findings — 45a3c090

Uh oh!

drewstone commented Jul 2, 2026

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved drewstone PR — 282ba8bf

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved drewstone PR — `45a3c090`

✅ No Blockers — `45a3c090`

✅ Approved — 10 non-blocking findings — `45a3c090`

✅ Auto-approved drewstone PR — `282ba8bf`