fix(judge): record unmeasured instead of fabricated zeros on checker failure#301
Conversation
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved drewstone PR — 5d129947
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-02T02:09:38Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 1 (1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.1s |
| Interrogation | 262.5s (2 bridge agents) |
| Total | 262.6s |
💰 Value — sound-with-nits
Stops the correctness checker from fabricating zeros on truncation/infra failure by recovering truncated JSON and modeling checker failures as unmeasured; a correct, in-grain fix with one minor dedup opportunity in the hand-rolled retry/rawSink loop.
- What it does: Three coupled changes. (1) New shared
src/json-recovery.tslifts truncation-tolerant JSON recovery out ofreflective-mutation.ts(clean extraction, now importsautoCloseTruncatedJson) and extends it withrecoverTruncatedJson(auto-close unclosed strings/objects, drop dangling members, iterate at comma boundaries). (2)parseCorrectnessResponsestrict-parses first, then on failure recovers - Goals it achieves: Eliminate fabricated zeros — a checker LLM reply truncated by the 200-token cap (e.g.
{"correct": false, ") used to throw, aborting the completion check and recording 0 across all judge dimensions for a row whose deterministic score was 0.96. This directly violates the package's ownJudgeParseErrorcontract (judges.ts:6-11: 'a synthetic zero is indistinguishable from a real low score'). Second - Assessment: Sound and well-scoped. The
unmeasuredmodeling is the correct third state for an eval oracle — distinct from both 'passed' and 'failed' — and its propagation is coherent: denominator exclusion incompletionVerdict(completion-verifier.ts:126-136),fullyCompletesuppression (line 139), and the throw-when-all-failed guard (line 128-134) all read as deliberate, matching the repo's fail-loud doc - Better / existing approach: Looked for existing machinery the checker should reuse instead of hand-rolling. Two candidates exist but neither is a drop-in: (a)
recordRaw+cryptoEventIdin llm-client.ts:653-678 do best-effort rawSink recording + event-id generation identical in spirit to the PR's inlinerecordhelper (completion-verifier.ts:494-501), but they are private and fetch-coupled — the checker uses theTCloud/` - Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
Fixes fabricated-zero scoring on checker truncation/failure via a shared JSON-recovery module, the established JudgeParseError type, an additive forensic sink opt, and a new unmeasured state that correctly excludes failed checks from the completion denominator — all wired through the existing verify
- Integration: Fully reachable. verifyCompletion (completion-verifier.ts:340) is the completion oracle called by the agent tool eval-tools.ts:163, the playback scoreUserStory (playback.ts:97), and exported from index.ts:488-492. createLlmCorrectnessChecker (completion-verifier.ts:486) is the production CorrectnessChecker injected into it, so the new retry/recovery/rawSink behavior is on every real scoring path.
- Fit with existing patterns: Fits the codebase's grain. json-recovery.ts is a clean extraction of the autoCloseTruncatedJson logic reflective-mutation.ts already needed (now imports it at reflective-mutation.ts:20, deduplicating the inline copy). The checker now throws JudgeParseError (judges.ts:12) — the exact type parseJudgeResponse (judges.ts:294) uses for the same 'never fabricate a zero' doctrine. The rawSink opt reuses
- Real-world viability: Holds up off the happy path. parseCorrectnessResponse (completion-verifier.ts:453) orders strict-parse → truncation-recovery → throw, and readVerdict requires a boolean correct, so {"correct": false, "} recovers (real measurement) while {"correct": } throws (no fabrication) — the core correctness property. recoverTruncatedJson is bounded (64 iterations) and returns null when unrecoverable. The ret
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
💰 Value Audit
🟡 Hand-rolled retry + rawSink loop parallels llm-client.ts internals [duplication] ``
The
recordhelper (completion-verifier.ts:494-501) and the per-attempt RawProviderEvent construction (lines 524-569) re-implementrecordRaw(llm-client.ts:653-665, same best-effort-swallow semantics) andcryptoEventId(llm-client.ts:675-678, nowrandomUUID()inline). Neither is exported, so reuse wasn't available, but extracting a sharedrecordRaw/event-id helper for the TCloud path would prevent a second copy from drifting. Also the retry loop has no backoff (immediate re-call) — fine
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| opencode-kimi | glm | deepseek | aggregate | |
|---|---|---|---|---|
| Readiness | 47 | 59 | 61 | 47 |
| Confidence | 85 | 85 | 85 | 85 |
| Correctness | 47 | 59 | 61 | 47 |
| Security | 47 | 59 | 61 | 47 |
| Testing | 47 | 59 | 61 | 47 |
| Architecture | 47 | 59 | 61 | 47 |
Reviewer score is advisory once the run is complete and the verdict has no blockers.
Full multi-shot audit completed 5/5 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 5 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Stale file-level doc comment contradicts new completionRate denominator — src/completion-verifier.ts
The header comment still says '
completionRateis satisfied / total', but completionVerdict() now divides by measurable requirements only (line 136) and the CompletionVerdict interface doc at line 102 says 'satisfied / MEASURABLE requirements'. This misleads consumers about how unmeasured rows affect the score. Fix: update line 18 to 'satisfied / measurable requirements'.
🟠 MEDIUM Earliest {/[ heuristic mis-slices prose that contains a bracket before the JSON — src/json-recovery.ts
Line 118-122 pick
Math.min(objStart, arrStart), so any[before the actual JSON object wins. For input[note] {"score":3}the candidate becomes[note] {"score":3}, the strict parse fails, auto-close fails, and the function returns null even though a complete object is present. Reproduction: recoverTruncatedJson('[note] {"score":3}') -> null. This breaks the documented tolerance for prose/markdown and can miss recoverable judge outputs. Fix: prefer the first{; only fall back to[when no object is recoverable, or validate that the chosen slice can be made to parse.
🟠 MEDIUM Incomplete backslash escape at truncation boundary produces unparseable JSON — src/json-recovery.ts
When a truncated string value ends on a backslash (escaped=true at end-of-loop), the suffix builder appends '"' (line 59) which JSON.parse consumes as the escape target rather than the string terminator. Result: {"a": "val"} still has an open string and fails JSON.parse. Verified: autoCloseTruncatedJson('{"a": "val\') → '{"a": "val\"}' → JSON.parse throws 'Unterminated string in JSON at position 13'. Fix: before appending '"' on line 59, check escaped and prepend 'n' to produce \n (valid escape completion):
if (escaped) suffix += 'n'then `suffix +=
🟠 MEDIUM Trailing unescaped backslash in open string breaks recovery — src/json-recovery.ts
When a truncated string ends with a bare backslash, e.g. raw input
{"a":"x\\, the loop exits withescaped=trueandinString=true. The suffix then appends"}directly after the backslash, yielding{"a":"x\"}where the added quote is escaped by the trailing backslash. JSON.parse fails, so recoverTruncatedJson returns null. Reproduction: autoCloseTruncatedJson('{"a":"x\') -> '{"a":"x"}', JSON.parse throws 'Unterminated string'. This leaves path/regex-like cap-hit output unrecovered and still produces the fabricated-zero outcome the module exists to prevent. Fix: detect and neutralize a trailing unescaped backslash before closing the string (drop it or add an extra escape so the closing quote is literal).
🟡 LOW Best-effort sink failure path is untested — src/completion-verifier.test.ts
The implementation swallows sink.record errors silently (impl lines 494-501: try/catch with empty catch). No test verifies that a throwing sink does not corrupt the verdict. If someone removes the try/catch, a flaky sink would break every eval run and no test would catch the regression. Low severity — this is implementation resilience, not test correctness — but worth a one-line test: sink.record = async () => { throw new Error('disk full') } and assert the checker still returns a verdict.
🟡 LOW Greedy assignment ordering assumption in unmeasured test — src/completion-verifier.test.ts
The test's flaky checker throws on its FIRST call (r1) and succeeds on the second (r2), but the assignment order depends on greedy sort order (r2 scores 0.80 vs r1's 0.67). If token recall scores ever change (e.g., if REQUIREMENT_FORM_STOPWORDS or tokenization changes), the assignment order could swap, causing r2 to be the unmeasured row. The test's
firstflag assumption (r1 fails, r2 succeeds) is implicitly coupled to implementation details ofartifactCandidates. A more robust test would set up requirements with predictable, non-interchangeable scores, or would inspect the unmeasured row's reqId.
🟡 LOW Minor coverage gap: non-default maxAttempts not tested — src/completion-verifier.test.ts
Tests assume the default maxAttempts=2 but do not exercise a custom maxAttempts option. Since the option exists in LlmCorrectnessCheckerOpts, a test for maxAttempts=1 or maxAttempts=3 would harden the contract.
🟡 LOW Minor coverage gap: raw sink parse-error path not exercised — src/completion-verifier.test.ts
The raw-sink test records an 'error' event only when tc.chat() throws. It does not verify that a parse error from parseCorrectnessResponse is also recorded as an error event with the correct errorMessage. The implementation does record it (lines 555-570), so this is a coverage nit, not a bug.
🟡 LOW No test coverage for custom maxAttempts option — src/completion-verifier.test.ts
The new
maxAttemptsoption defaults to 2 but is configurable. Tests only exercise the default path. A test withmaxAttempts: 3would catch off-by-one errors in the retry loop (e.g.,attempt < maxAttemptsvsattempt <= maxAttempts). Not a correctness bug on the default path, but uncovered configuration surface.
🟡 LOW No test for custom maxAttempts value (only default of 2 exercised) — src/completion-verifier.test.ts
The retry tests (lines 479-512) use the default maxAttempts=2. The LlmCorrectnessCheckerOpts.maxAttempts field (impl line 437) is never tested with a non-default value. If someone changes the default or breaks the maxAttempts override, no test catches it. Minor coverage gap — the production path (default) is well covered. Fix: add one test passing { maxAttempts: 3 } and assert calls === 3 on all-fail.
🟡 LOW Missing baseUrl in forensic provider events — src/completion-verifier.ts
All RawProviderEvents record
baseUrl: ''(hardcoded at lines 529, 545, 561). The TCloud instance knows the provider endpoint but its baseUrl is not threaded through to the sink. This means forensic replay of checker calls can't confirm which provider/endpoint served each response. The RawProviderEvent interface documents baseUrl as 'Base URL used for the call' — an empty string loses this signal. Not a correctness bug; just incomplete forensic capture.
🟡 LOW No backoff between checker retry attempts — src/completion-verifier.ts
The retry loop at line 522 sends attempts back-to-back with no delay. For transport errors (503, ECONNRESET), both attempts fire within milliseconds, making the retry unlikely to help with rate-limiting or server-overload scenarios. For parse failures (JudgeParseError), temperature 0 means the model produces the same output on retry, so backoff wouldn't help there. Impact is low because maxAttempts defaults to 2 and parse-failure retries are the dominant failure mode. Add a small exponential backoff (e.g., 100ms * 2^attempt) to give transport errors a chance to resolve.
🟡 LOW Recorded rawSink events carry inaccurate baseUrl/endpoint forensic metadata — src/completion-verifier.ts
Every recorded RawProviderEvent hardcodes
endpoint: '/chat'andbaseUrl: ''. The RawProviderSink contract (src/trace/raw-provider-sink.ts:46-49) defines endpoint as the real path (e.g. '/v1/chat/completions') and baseUrl as the normalized call URL with no trailing slash. Empty baseUrl violates the contract; '/chat' is a guess. Forensic-only so not load-bearing, but degrades audit value — a reviewer reconstructing a failed call cannot tell which provider/URL actually answered. Either omit the fields (sink treats undefined as unknown) or pull them from the TCloud client when exposed. Same shape repeats on lines 545, 562 for response/error events.
🟡 LOW Retrying on JudgeParseError with temperature=0 is wasted compute — src/completion-verifier.ts
The retry loop treats parse failures and network errors identically — both consume an attempt. With
temperature: 0(line 519) the model is deterministic, so a parse failure on attempt 0 will almost certainly reproduce on attempt 1 because the request is byte-identical. Retries remain useful for transient network/5xx errors. The docstring at line 432-436 acknowledges both consume attempts but doesn't justify why parse-failure retry is worthwhile at temp 0. Either (a) short-circuit parse failures (don't retry), or (b) document the reason
🟡 LOW completionRate can report 1.0 during a partial checker outage — src/completion-verifier.ts
When some requirements are unmeasured and all measurable ones pass,
completionRate = satisfiedCount / measurable.length= 1.0 whilefullyComplete = falseandvalid = false. This is the documented design choice (see comments at lines 102-107, 137-138) and the alternative (zero-folding) is exactly the bug class being fixed. Flag for DOWNSTREAM awareness: any consumer that aggregatescompletionRate(mean across runs, trend lines) rather than gating onfullyComplete/validwill see inflated numbers during partial outages. The unmeasuredCount field is exported precisely to detect this; recommend the README/concepts doc call out the inflation-when-aggregat
🟡 LOW maxAttempts <= 0 throws 'undefined' instead of a clear misuse error — src/completion-verifier.ts
If a caller passes
{ maxAttempts: 0 }(or negative), the for-loop never executes,lastErrstays undefined, and the function throwsnew Error(String(undefined))=Error('undefined'). The default is 2 so this only triggers under misuse, but a one-line guard (if (maxAttempts < 1) throw new Error('maxAttempts must be >= 1')) at construction time would surface the misconfiguration loudly instead of as a cryptic runtime error.
🟡 LOW maxAttempts=0 produces confusing Error('undefined') — src/completion-verifier.ts
The retry loop runs
for (let attempt = 0; attempt < maxAttempts; attempt++)with no validation. If a caller passes maxAttempts=0, the loop never executes, lastErr stays undefined, and the function throwsnew Error(String(undefined))i.e.Error('undefined'). Default is 2, so this is only a misconfiguration edge case. Fix: either guardmaxAttempts > 0or assert a clear message.
🟡 LOW rawSink failures are silently swallowed — src/completion-verifier.ts
The record() helper catches and swallows all sink errors. The inline comment acknowledges this is intentional ('Forensic capture is best-effort'), but it means a misconfigured sink (bad path, permissions, disk full) silently loses audit data with no log or metric. Consider emitting a warning or surfacing a non-fatal diagnostic so operators notice capture is broken.
🟡 LOW Missing coverage for arrays, nested truncation, and markdown fences — src/json-recovery.test.ts
The tests cover object truncation well but do not exercise array truncation (e.g.
[1, 2,), nested object/array cut-offs, or JSON inside markdown fences. These shapes are reachable from completion-verifier.ts and reflective-mutation.ts callers; adding them would guard against regressions in the recovery loop.
🟡 LOW No test coverage for incomplete-escape truncation pattern — src/json-recovery.test.ts
The escape-awareness test at line 24 covers escape-awareness inside strings but only tests a completed escape sequence ("hi). No test exercises the incomplete-escape case where the input ends on a bare backslash. Adding a test like
autoCloseTruncatedJson('{"a": "val\\')would catch the bug described above and serve as a regression gate.
🟡 LOW No test for array-rooted recovery in recoverTruncatedJson — src/json-recovery.test.ts
recoverTruncatedJson searches for both '{' and '[' openers (json-recovery.ts:119) and can recover truncated arrays like [1, 2, 3, but no test exercises this path. The current callers only pass object-rooted input (completion-verifier.ts:472 slices from '{'), so this is an untested public API surface.
🟡 LOW No test for autoCloseTruncatedJson on top-level array input — src/json-recovery.test.ts
The function handles both '{' and '[' structurally (json-recovery.ts:50), but no test verifies closing an unclosed top-level array like '[1, 2' → '[1, 2]'. All existing tests use object-rooted input only.
🟡 LOW No test for literal braces inside strings treated as non-structural — src/json-recovery.test.ts
When input contains literal braces inside string values (e.g. '{"a": "{"}'), the state machine must treat them as non-structural content (inString=true skips the brace push). No test verifies this: a balanced-but-brace-containing input like '{"a": "{"}' would be returned unchanged, but a truncated version like '{"a": "{' could theoretically confuse the depth counter if the implementation had a bug. Current implementation is correct — this is a coverage gap, not a bug.
🟡 LOW No test pins LIFO closure at nesting depth > 2 — src/json-recovery.test.ts
The only nesting test is {"a": [1, 2 → {"a": [1, 2]} (depth 2). Real judge payloads nest deeper (objects-in-arrays-in-objects). The LIFO suffix builder (while stack.length > 0: pop + append closer) is exercised only at depth 2, so a regression that closes in FIFO order or caps at one level would slip. Add a depth-3+ case like {"a": [{"b": [1, "x' → {"a": [{"b": [1, "x"]}]}].
🟡 LOW No test pins array-as-top-level recovery in recoverTruncatedJson — src/json-recovery.test.ts
Every recoverTruncatedJson case uses '{' as the opener. The implementation explicitly handles arrays (text.indexOf('['), Math.min(objStart, arrStart) slicing) but no test exercises e.g. recoverTruncatedJson('[1, 2, {"a": 3') → [1, 2, {a: 3}]. An array-truncation regression (e.g. dropping the array branch in the slice) would pass silently. Add one array-top-level recovery case to lock the branch.
🟡 LOW No test pins behavior when last trailing comma sits inside an array context — src/json-recovery.test.ts
lastCommaOutsideString does not track array-vs-object context — it finds the last comma anywhere outside a string. For input like {"a": [1, 2' the last comma is the one inside the array (after 1), so cutting there yields {"a": [1] which silently drops the 2. This is conservative-recovery behavior worth pinning either way (current behavior: keep prefix, drop partial member) so a future refactor that adds context-awareness doesn't change recovery shape without a test catching it.
🟡 LOW Overly permissive assertion masks exact recovery behavior — src/json-recovery.test.ts
The assertion
r === null || !('correct' in r)would still pass if the function returned an empty object{}instead ofnullfor input'{"correct": '. Since the input contains no complete member to recover, the precise contract isnull; tighten toexpect(r).toBeNull()or use a separate case where a previous complete member exists and assert the dangling key is dropped.
🟡 LOW String/escape walker duplicated between two functions — src/json-recovery.ts
autoCloseTruncatedJson (lines 30-56) and lastCommaOutsideString (lines 84-98) reimplement the same in-string / escaped / quote state-machine. They must stay byte-for-byte consistent or the comma-cut and the auto-close will disagree on what is 'inside a string' -- a subtle drift risk if either is edited alone. Consider extracting a small
forEachNonStringChar(s, visit)helper that both consume. Pure maintainability nit; no behavior defect today.
🟡 LOW Truncation on a dangling backslash inside an open string is not recoverable — src/json-recovery.ts
When the cap-hit lands immediately after a single backslash inside an open string (e.g. {"path":"C:\ -- a Windows path, regex, or LaTeX fragment cut mid-escape), the loop exits with escaped=true and inString=true, so line 59 appends """ to close the string. But that appended quote is consumed as part of the dangling escape (" = escaped quote), so JSON.parse still sees the string as open and rejects the candidate. Confirmed empirically with the exact char sequence: autoClose('{"path":"C:') returns '{"path":"C:"}' which JSON.parse rejects, and recoverTruncatedJson returns null because lastCommaOutsideString finds no comma outside a string to cut at. Blast radius is bound
🟡 LOW parseReflectionResponse lacks a dedicated unit test — src/reflective-mutation.ts
Pre-existing gap (not introduced by this PR): no
reflective-mutation.test.tsexists, so the truncation-tolerant fallback path at lines 267-279 — the only in-file consumer ofautoCloseTruncatedJson— has no direct coverage of its integration with the slice/candidate selection logic above it. The moved helper itself IS now well-tested in json-recovery.test.ts. Suggest adding a parseReflectionResponse test that feeds a cap-hit-truncated{"proposals":[{...},payload to lock the end-to-end recovery path. No merge blocker.
tangletools · 2026-07-02T03:22:05Z · trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 30 non-blocking findings — 5d129947
Full multi-shot audit completed 5/5 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 5 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-07-02T03:22:05Z · immutable trace
…start slice fallback, doc truth - autoCloseTruncatedJson completes an incomplete trailing escape to a literal backslash before closing the string, so the appended quote is a terminator, not an escape target - recoverTruncatedJson tries each opener position (earliest first) and falls back to the next when a prose bracket poisons the slice - completionRate header comment states the measurable-only denominator - coverage: prose-bracket + trailing-backslash reproductions, custom maxAttempts, throwing raw sink; two pre-existing useLiteralKeys lints
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved drewstone PR — 3713ffaa
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-02T04:58:42Z
|
All four MEDIUMs fixed (plus the cheap LOWs), latest push:
Full suite 2682 passed / 2 skipped; rtk lint + typecheck clean; branch merged with main (was 2 behind). |
Problem
A correctness-checker failure fabricated zeros.
parseCorrectnessResponserequired a closing}; a checker LLM reply truncated by the 200-token cap (e.g.{"correct": false, ") threw, the throw aborted the entire completion check, and downstream judge dimensions were recorded as 0 — violating the package's ownJudgeParseErrorcontract ("a synthetic zero is indistinguishable from a real low score"). Verified real-world harm: a legal-agent Kimi row with deterministic 0.96 and 26 valid proposals recorded 0 across all 7 judge dimensions because the checker's OWN reply truncated at 20 chars. Checker LLM calls were also captured nowhere, making these failures unauditable.Change
src/json-recovery.ts(new): truncation-tolerant JSON recovery (auto-close unclosed strings/objects/arrays, drop dangling members), lifted from the inline logic inreflective-mutation.tswhich now imports it.parseCorrectnessResponse: strict parse first; on failure, recover the truncated prefix — a recovered boolean verdict is a real measurement (the Kimi-row reply{"correct": false, "now parses instead of zeroing the run). No recoverable boolean →JudgeParseError, never a guess.createLlmCorrectnessChecker: one retry on parse/call failure (maxAttempts, default 2); optionalrawSinkrecords every checker request/response/error (provider: 'correctness-checker') for forensics.verifyCompletion/completionVerdict: a checker failure marks the requirementunmeasured+unmeasuredReason—correctstays null, the row leavescompletionRate's denominator,fullyCompletecannot be claimed, andunmeasuredCountis reported. All checks failing → throw (infrastructure failure, not a scored run).Verification
npx vitest run src/completion-verifier.test.ts src/json-recovery.test.ts src/reflective-mutation.test.ts→ 51 pass / 0 fail (incl. Kimi-shaped truncation recovery, cut-boolean non-fabrication, retry, raw-capture ordering, unmeasured denominator, all-failed throw)pnpm test→ 2671 pass / 2 skipped / 0 failnpx tsc --noEmit→ clean🤖 Generated with Claude Code