Skip to content

feat(product-benchmark): export helpers for product-owned benchmark bundles#302

Merged
drewstone merged 3 commits into
mainfrom
feat/product-benchmark
Jul 2, 2026
Merged

feat(product-benchmark): export helpers for product-owned benchmark bundles#302
drewstone merged 3 commits into
mainfrom
feat/product-benchmark

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Problem

research-package.ts is triplicated across tax-agent (687 lines), legal-agent (666), and creative-agent (536) — and has already drifted 359 lines between tax and legal. agent-eval main already owns the product-benchmark schema + validators (ProductBenchmarkRecord, ProductBenchmarkManifest, validateProductBenchmarkRun, integrity failures); what the products still hand-roll is the EXPORT side: converting RunRecords, building manifests, and writing the bundle.

Change

Additive to the existing @tangle-network/agent-eval/product-benchmark subpath — no consumer updates required until products migrate:

  • runRecordToProductBenchmarkRecord — RunRecord + artifact pointers → validated benchmark record
  • buildProductBenchmarkManifest / productBenchmarkRepoIdentity — manifest with repo/commit/package provenance
  • exportProductBenchmark / exportProductBenchmarkRuns — run-dir(s) → manifest + records JSONL + artifact pointers, validated on write

Product-specific code (scenario lists, runner commands, scoring, profile builders) stays in products; the next step is migrating tax/legal/creative research-package.ts onto these helpers after release.

Verification

  • npx vitest run src/product-benchmark/ → 12 pass / 0 fail
  • Full suite → 2663 pass / 0 fail
  • npx tsc --noEmit → clean

🤖 Generated with Claude Code

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved drewstone PR — 45a3c090

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-02T02:27:31Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚪ Value Audit — audit-incomplete

Verdict audit-incomplete
Concerns 0 (none)
Heuristic 0.0s
Duplication 0.1s
Interrogation 14.8s (2 bridge agents)
Total 14.9s

💰 Value — error

value agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

No PR concerns were produced because the value/usefulness agent pass did not complete. Treat this audit as incomplete, not as approval.


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260702T025445Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 45a3c090

Review health 100/100 · Reviewer score 66/100 · Confidence 70/100 · 10 findings (3 medium, 7 low)

opencode-kimi glm deepseek aggregate
Readiness 66 76 92 66
Confidence 70 70 70 70
Correctness 66 76 92 66
Security 66 76 92 66
Testing 66 76 92 66
Architecture 66 76 92 66

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Manifest arms/scenarios deduplicated by id only, dropping divergent attributes — src/product-benchmark/export.ts

Arms and scenarios are built with new Map(records.map((record) => [record.armId, record])).values() and the same pattern for scenarioId, keeping only the first record per id. If records share an armId but differ in profileId, model, backend, harness, or reasoningEffort, the manifest silently keeps one set of policy axes and profileId. The validator (index.ts:468-472) only checks that the arm/scenario id exists, not that attributes match the records. This can produce a manifest that misrepresents the actual arms/scenarios in the records. Fix: group by a composite key (armId + profileId + model + backend + harness + reasoningEffort) for arms and (scenarioId + split) for scenarios, or validate that all records with the same armId/scenarioId agree on the manifest attributes and fail loud on

🟠 MEDIUM Substrate version defaults silently fall back to 'unknown' — src/product-benchmark/export.ts

packageVersion returns 'unknown' when a package is not declared in the cwd package.json, and both buildProductBenchmarkManifest (lines 496-501) and resolveOptions (line 618) use this default. The manifest validator only checks that substrate fields are non-empty strings, so 'unknown' passes validation silently. This violates the repo's 'No fallbacks. Fail loud.' doctrine (CLAUDE.md) and can produce manifests with meaningless substrate versions when export is run from a repo that does not directly declare @tangle-network/a

🟠 MEDIUM Substrate 'unknown' versions pass silently; no substrateFailures analog to repoFailures — src/product-benchmark/index.ts

The PR adds repoFailures that flags manifest.repo.* === 'unknown' (index.ts:474-476) and assertProductBenchmarkRun (index.ts:547-558) throws on it. But manifest.substrate.{agentEval,agentRuntime,agentInterface,sandbox} are validated only as non-empty strings (validateProductBenchmarkManifest:308-311), and packageVersion (export.ts) returns 'unknown' whenever the dep is absent from cwd package.json. A bundle whose entire substrate block is 'unknown' — un-reproducible — passes assertProductBenchmarkRun cleanly. Reproducibility of a benchmark depends on substrate identity at least as much as on repo branch. Add a substrateFailures list (same shape as repoFailures) flagging any substrate key === 'unknown', and include it in assertProductBenchmarkRun's failure union at index.ts:553.

🟡 LOW Test coverage gaps: option overrides and individually-exported helpers untested — src/product-benchmark/export.test.ts

The four product-drift tests + merge/single-run/error tests cover the core matrix well, but several ProductBenchmarkExportOptions fields are never asserted: passThreshold (only the default 0.7 is exercised), backendVersion override, mutableSurfaces override (only default asserted at line ~117), scenarioTagPrefix override (only the projectId→strip heuristic). The individually-exported helpers runRecordToProductBenchmarkRecord and buildProductBenchmarkManifest are exercised only indirectly through exportProductBenchmarkRuns; productBenchmarkRepoIdentity is never called directly. Add one test per override to lock the option contract.

🟡 LOW Circular module dependency between index.ts and export.ts — src/product-benchmark/export.ts

export.ts imports validateProductBenchmarkManifest and validateProductBenchmarkRecord from ./index, while index.ts re-exports types and functions from ./export (index.ts:560-572). Current load order happens to work because the validators are defined before the re-export in index.ts, but the coupling is fragile and can break with future reordering or bundler changes. Fix: move the shared validators and types that both modules need into a separate contract.ts or validate.ts file that both import, eliminating the cycle.

🟡 LOW Repo branch fallback 'detached:unknown' evades repo-failure detection — src/product-benchmark/export.ts

When git cannot resolve a branch or commit, repoBranch returns detached:unknown. The repoFailures check in validateProductBenchmarkRun (index.ts:474-476) only flags values that are empty or exactly equal to 'unknown', so detached:unknown passes as a valid branch. This contradicts the stated intent that 'unknown' values are flagged. Fix: make the repo-failure check detect sentinels containing 'unknown' (e.g., manifest.repo[key].includes('unknown')) or avoid embedding 'unknown' in the fallback string.

🟡 LOW Safety split detection only matches numeric 1, not boolean true — src/product-benchmark/export.ts

(record.outcome.raw as Record<string, number>).safety === 1 is false when a harness emits safety: true, because true === 1 is false in JavaScript. Although RunOutcome.raw is typed as Record<string, number>, the export ingests raw JSONL and casts via expectRunRecordShape, so a boolean can reach this check at runtime. A safety row would then be misclassified as practice/dev/holdout based on splitTag. Fix: accept both 1 and true (e.g., raw.safety === 1 || raw.safety === true).

🟡 LOW packageVersion records dep semver range, not installed version — src/product-benchmark/export.ts

packageVersion reads cwd package.json and returns pkg.dependencies?.[name] ?? pkg.devDependencies?.[name] ?? 'unknown' — i.e. a range like '^0.101.0', not the resolved/installed version from node_modules//package.json. The field is recorded under manifest.substrate.* and backend.version, both used for reproducibility. A range understates precision (many satisfying versions exist). Consistent with prior product exporters per the doc comment, so not a regression — but reading node_modules//package.json (or accepting an explicit resolvedVersion option) would make the substrate block actually pin a run. Worth a follow-up.

🟡 LOW sourceProfileHash fallback mixes cellId with source hash semantics — src/product-benchmark/export.ts

The fallback chain at line 308–309 reads record.agentProfile?.sourceProfile?.hash ?? record.agentProfile?.cellId. But cellId is "agent-profile-cell:sha256:<full cell hash>" — the identity of the entire agent profile cell (profileId + sourceProfile + harness + model + dimensions + promptHash) — not just the source profile. Using it where a source-profile hash is expected produces a materialized profile hash that incorporates cell metadata rather than just {sourceHash, model}. The result is deterministic and self-consistent within the bundle (same computation used in manifest and records), but a consumer trying to independently verify the profile has

🟡 LOW splitTag==='search' silently collapses to 'practice' (undocumented mapping) — src/product-benchmark/export.ts

splitOf: custom → safety(raw.safety===1) → 'holdout' → 'dev' → else 'practice'. RunRecord.splitTag is 'search'|'dev'|'holdout' (run-record.ts:33); product-benchmark splits add 'safety'/'sentinel' but NOT 'search'. A row with splitTag='search' (the optimization pool per run-record.ts:32) falls through to 'practice' with no comment. This is a defensible default (search ≈ practice) but is invisible to callers and not tested. Either document the mapping in the splitOf docstring or add an explicit if (record.splitTag === 'search') return 'practice' line with a one-line comment so the intent is visible at the decision site.


tangletools · 2026-07-02T03:19:05Z · trace

tangletools
tangletools previously approved these changes Jul 2, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 10 non-blocking findings — 45a3c090

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-07-02T03:19:05Z · immutable trace

drewstone added 2 commits July 1, 2026 23:05
…venance

- arms/scenarios dedup fails loud when records sharing an id disagree on
  a policy axis or split (id-only dedup silently misrepresented them)
- substrateFailures mirrors repoFailures: an 'unknown' substrate version
  fails assertProductBenchmarkRun (un-reproducible bundle)
- packageVersion falls back to the INSTALLED node_modules version
  (transitive deps record real provenance), plus an explicit substrate
  override option for environments where a package is absent
@drewstone

Copy link
Copy Markdown
Contributor Author

All three MEDIUMs fixed, latest push:

  • Id-only dedup drops divergent attributes: uniqueArmRecords / uniqueScenarioRecords now fail loud when records sharing an arm id disagree on any manifest-carried axis (profileId/model/backend/harness/reasoningEffort) or a scenario id disagrees on split — tests reproduce both.
  • 'unknown' substrate passes silently (both findings): substrateFailures mirrors repoFailures — any substrate version that is empty or 'unknown' fails assertProductBenchmarkRun (test mutates the manifest and asserts the throw). Root-cause side: packageVersion now falls back to the INSTALLED node_modules version (a product getting sandbox transitively records real provenance), and a new explicit substrate override option covers environments where a package genuinely isn't installed — the fixtures exercise it, since agent-eval's own repo (correctly, per layering) has no agent-runtime/sandbox.

Full suite 2671 passed / 2 skipped; lint + typecheck clean; branch merged with main.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved drewstone PR — 282ba8bf

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-07-02T05:06:27Z

@drewstone drewstone merged commit 1c08a44 into main Jul 2, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants