feat(bench-matrix): H6-H8 + H19-H21 evaluators embed comparator evidence #140
Merged
Extend H6/H7/H8 (interaction) and H19/H20/H21 (cell-renderer) evaluators to include comparator evidence in their evidence arrays, mirroring H1's pattern. Status logic stays pretable-only; data is informational. Retires the aggregator-script pattern over time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Twelve-task plan: shared helper for comparator-evidence lookup, six evaluator extensions (H6, H7, H8, H19, H20, H21), test coverage, matrix re-run, repo-memory entry, PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
evaluateInteractionHypothesis now embeds every measured comparator adapter in the evidence array (was: best-by-interaction-latency only). Pretable-only status verdicts unchanged. Adds findComparatorEvidence helper used by all six target evaluators (H6/H7/H8 + H19/H20/H21). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
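As a rough illustration of the pattern described in this commit, the sketch below appends every measured comparator adapter to the evidence array while keeping the verdict pretable-only. Only the `findComparatorEvidence(runs, { scenarioId, scriptName })` signature and the `satisfied` status come from this PR; the run-summary fields (`adapter`, `p95Ms`), the `sort` script name, the threshold, the other status names, and all numbers are assumptions made for the example, not the actual code in `scripts/bench-matrix.mjs`.

```javascript
// Sketch only: field names, the "sort" script, the 16 ms threshold, the
// "unsatisfied"/"no-data" statuses and the sample numbers are hypothetical.
const PRIMARY_ADAPTER = "pretable";

function matchesSlice(run, { scenarioId, scriptName }) {
  return run.scenarioId === scenarioId && run.scriptName === scriptName;
}

// Collect one evidence entry per measured non-pretable adapter in the slice.
function findComparatorEvidence(runs, slice) {
  return runs
    .filter(
      (run) =>
        run.adapter !== PRIMARY_ADAPTER &&
        matchesSlice(run, slice) &&
        Number.isFinite(run.p95Ms) // a flaked run has no usable measurement
    )
    .map((run) => ({ adapter: run.adapter, p95Ms: run.p95Ms, role: "comparator" }));
}

// The status depends on pretable's absolute threshold only; comparator
// entries are appended to the evidence array in every return branch.
function evaluateInteractionSketch(runs, slice, thresholdMs) {
  const comparatorEvidence = findComparatorEvidence(runs, slice);
  const pretable = runs.find(
    (run) => run.adapter === PRIMARY_ADAPTER && matchesSlice(run, slice)
  );
  if (!pretable) {
    return { status: "no-data", evidence: [...comparatorEvidence] };
  }
  const status = pretable.p95Ms <= thresholdMs ? "satisfied" : "unsatisfied";
  const pretableEvidence = { adapter: PRIMARY_ADAPTER, p95Ms: pretable.p95Ms, role: "primary" };
  return { status, evidence: [pretableEvidence, ...comparatorEvidence] };
}

const slice = { scenarioId: "S2", scriptName: "sort" };
const runs = [
  { adapter: "pretable", ...slice, p95Ms: 12 },
  { adapter: "ag-grid", ...slice, p95Ms: 30 },
  { adapter: "tanstack", ...slice, p95Ms: 25 },
  { adapter: "mui", ...slice, p95Ms: 40 },
];
const result = evaluateInteractionSketch(runs, slice, 16);
console.log(result.status, result.evidence.map((entry) => entry.adapter));
```

Because the comparator entries never feed the threshold check, a faster or slower comparator cannot change the verdict in this sketch.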
H19's evidence array now embeds each comparator's scroll-with-format summary alongside pretable's format/baseline delta. Comparator entries are absolute format p95 (not deltas) — per-adapter format-vs-baseline deltas are a future enhancement. Status verdict unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
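The H19 evidence layout described here might look roughly like the following sketch: pretable contributes a format entry (carrying the format-vs-baseline delta) plus a baseline entry, and each comparator contributes an absolute `scroll-with-format` p95 with no delta. The `scroll-baseline` script name, the field names, and the sample numbers are hypothetical; only `scroll-with-format` and the adapter names appear in this PR.

```javascript
// Sketch only: "scroll-baseline", the field names and all numbers are
// assumptions made for illustration.
function p95For(runs, adapter, scriptName) {
  const run = runs.find((r) => r.adapter === adapter && r.scriptName === scriptName);
  return run ? run.p95Ms : undefined;
}

function buildH19Evidence(runs) {
  const formatP95 = p95For(runs, "pretable", "scroll-with-format");
  const baselineP95 = p95For(runs, "pretable", "scroll-baseline");
  const evidence = [
    // Pretable's format entry carries the delta against its own baseline.
    { adapter: "pretable", kind: "format", p95Ms: formatP95, deltaMs: formatP95 - baselineP95 },
    { adapter: "pretable", kind: "baseline", p95Ms: baselineP95 },
  ];
  for (const run of runs) {
    // Comparators report the absolute format p95 only, without a delta.
    if (run.adapter !== "pretable" && run.scriptName === "scroll-with-format") {
      evidence.push({ adapter: run.adapter, kind: "format", p95Ms: run.p95Ms });
    }
  }
  return evidence;
}

const h19Runs = [
  { adapter: "pretable", scriptName: "scroll-with-format", p95Ms: 14 },
  { adapter: "pretable", scriptName: "scroll-baseline", p95Ms: 10 },
  { adapter: "ag-grid", scriptName: "scroll-with-format", p95Ms: 22 },
  { adapter: "tanstack", scriptName: "scroll-with-format", p95Ms: 19 },
  { adapter: "mui", scriptName: "scroll-with-format", p95Ms: 31 },
];
const h19Evidence = buildH19Evidence(h19Runs);
console.log(h19Evidence.length);
```

With three comparators measured, the array holds five entries, which matches the count this PR states for H19.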
H20's evidence array now embeds each comparator's scroll-with-render summary alongside pretable. Status verdict unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
H21's evidence array now embeds each comparator's scroll-with-heavy-render summary alongside pretable. Status verdict unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…20/H21 Pins the contract that each of the six evaluators surfaces every measured comparator adapter in its evidence array (4 entries for H6/H7/H8/H20/H21; 5 for H19 which also carries the pretable scroll baseline). Status verdicts remain pretable-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Synthesized hypotheses report from per-run summaries (matrix runner's end-of-run report-writer flaked repeatedly in this worktree; the per-run summaries are valid). Verification of evaluator extensions:

| H# | Status | Evidence adapters in array |
| --- | --- | --- |
| H1 | satisfied | pretable, ag-grid, tanstack |
| H6 | satisfied | pretable, ag-grid, tanstack |
| H7 | satisfied | pretable, ag-grid, tanstack |
| H8 | satisfied | pretable, ag-grid, tanstack |
| H19 | satisfied | pretable (format), pretable (baseline), ag-grid, tanstack |
| H20 | satisfied | pretable, ag-grid, tanstack |
| H21 | satisfied | pretable, ag-grid, tanstack |

All seven hypotheses retained their existing `satisfied` status (no threshold changes; evaluator-extension was data-only). Comparator evidence is now embedded inline in each hypothesis's evidence array, which is the architectural goal of the PR.

MUI runs flaked in this matrix attempt and are absent from the evidence; that's a matrix-runner reliability issue, not an evaluator issue. The evaluator correctly handles whatever comparator data is present per-slice. Investigating the matrix-runner's tanstack/mui flake pattern is a separate follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Architecture change documenting the H6/H7/H8/H19/H20/H21 evaluator extensions, the matrix-runner flake workaround, and the deferred follow-ups. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
H6/H7/H8/H19/H20/H21 now embed comparator evidence in their evidence arrays (4 entries each; 5 for H19 which also carries pretable's scroll baseline). Pretable-only status verdicts unchanged — all six remain satisfied. Aggregated from today's S2 hypothesis-scale runs across pretable/ag-grid/tanstack/mui after the matrix runner hit two mid-run e2e flakes (preview server / locator timing); on-disk summaries combined into a single report. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
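The PR does not show how the on-disk summaries were combined, so the following is only one plausible approach: merge the summaries from several matrix attempts, drop runs without a usable measurement, and keep the most recent run per adapter/scenario/script slice. The summary fields (`adapter`, `scenarioId`, `scriptName`, `p95Ms`, `finishedAt`), the `sort` script name, and the sample data are all hypothetical.

```javascript
// Sketch only: the summary shape, the "finishedAt" timestamp and the sample
// data are assumptions, not the format the matrix runner writes.
function combineRunSummaries(attempts) {
  const latestBySlice = new Map();
  for (const summary of attempts.flat()) {
    if (!Number.isFinite(summary.p95Ms)) continue; // skip flaked runs
    const key = [summary.adapter, summary.scenarioId, summary.scriptName].join("|");
    const previous = latestBySlice.get(key);
    // Keep the most recent successful run for each slice.
    if (!previous || summary.finishedAt > previous.finishedAt) {
      latestBySlice.set(key, summary);
    }
  }
  return [...latestBySlice.values()];
}

const firstAttempt = [
  { adapter: "pretable", scenarioId: "S2", scriptName: "sort", p95Ms: 12, finishedAt: 100 },
  { adapter: "mui", scenarioId: "S2", scriptName: "sort", p95Ms: null, finishedAt: 110 },
];
const secondAttempt = [
  { adapter: "pretable", scenarioId: "S2", scriptName: "sort", p95Ms: 11, finishedAt: 200 },
  { adapter: "mui", scenarioId: "S2", scriptName: "sort", p95Ms: 40, finishedAt: 210 },
];
const combined = combineRunSummaries([firstAttempt, secondAttempt]);
console.log(combined.map((summary) => `${summary.adapter}:${summary.p95Ms}`));
```

Keying on the slice rather than on the attempt means a flake in one attempt does not hide a valid measurement from another.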
All four adapters (pretable + ag-grid + tanstack + mui) are present in every comparator-aware evaluator's evidence array. Updates the 2026-05-12 entry to reflect the recovered matrix outcome. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary

Extends six pretable-only evaluators in `scripts/bench-matrix.mjs` (H6/H7/H8 interaction + H19/H20/H21 cell-renderer) to embed comparator-adapter evidence in their `evidence` arrays. Mirrors `evaluateH1`'s pre-existing pattern. Status logic unchanged: pretable's absolute thresholds still drive verdicts; comparator data is informational.

Goal: `hypotheses.json` becomes a single source of truth for cross-adapter perf data, retiring (over time) the per-PR aggregator scripts from PRs #130/#131/#132.

What changed

- `scripts/bench-matrix.mjs`: new `findComparatorEvidence(runs, { scenarioId, scriptName })` helper; six evaluators extended to append `...comparatorEvidence` to their `evidence` arrays in every return branch. H6/H7/H8 changes are centralized in the shared `evaluateInteractionHypothesis` helper.
- `scripts/__tests__/bench-matrix.test.mjs`: six new tests asserting evidence-array contents when comparator runs are present. All existing status-verdict tests untouched.
- `status/milestones/2026-05-12-comparator-aware-evaluators.hypotheses.json`: milestone synthesized from per-run summaries after recovering from two matrix-runner flakes (one tanstack/filter-metadata locator-timing flake, one preview-server `ECONNREFUSED`).

Verification

All seven hypotheses retained `satisfied` status. Evidence arrays:

What's NOT in this PR

- `/bench` page swap to read from `hypotheses.json` directly. Aggregator scripts (PRs #130/#131/#132 + `scripts/extract-interaction-summary.mjs`) still feed the page. Editorial-only follow-up.
- `scroll-with-format` p95 only.

Test plan

- `pnpm -w typecheck` passes
- `pnpm -w test` passes (191 tests: existing matrix-runner tests + 6 new evidence-array assertions)
- `pnpm -w lint` 0 errors
- `pnpm format` clean

🤖 Generated with Claude Code