feat(claude): Claude Code reflection plugin E2E tests and CI (#137) by dzianisv · Pull Request #146 · dzianisv/opencode-plugins

dzianisv · 2026-06-20T09:25:42Z

Summary

Add real E2E tests replacing stubbed unit tests for claude/ plugin
Wire claude/ plugin tests into CI
Update install + testing docs for v2.1.150 reality
Address PR review findings

Test Plan

npm test — claude/ E2E suite passes
CI green on push

Group A of plan v2 for issue #137. Lays the foundation for the Claude Code reflection plugin without enabling it end-to-end yet:

- claude/.claude-plugin/plugin.json + hooks/hooks.json: Stop hook wiring.
- claude/bin/reflect.mjs: entry skeleton with loop-guard, attempt counter, transcript tail-read, debug logging, and fail-safe error handling. Strips tool_use/tool_result from the stop context per spec (only user msgs + final assistant text reach the judge).
- claude/README.md, claude/package.json: install + author docs.
- evals/scripts/mine-cc-stops.mjs: scans ~/.claude/projects/**/*.jsonl, extracts Stop boundaries, and emits candidate JSONL with metadata (tools_available_inferred, user_messages, final_assistant_text).
- .gitignore: excludes raw cc-stop-*.jsonl datasets (they contain user data); allows committing the redacted gold set.

No classifier yet. No inject yet. Plugin loads but exits 0 on every Stop.

Next: run miner, filter, classify with Claude Code haiku.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
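To illustrate the "only user msgs + final assistant text reach the judge" rule, here is a minimal sketch. It is not the code in claude/bin/reflect.mjs: the entry shape (`{ type, message: { content } }` with `text` / `tool_use` / `tool_result` blocks) and the helper `textOf` are assumptions made for the example; only the output field names come from the commit message.

```javascript
// Sketch only. The transcript entry shape below is an assumption,
// not copied from claude/bin/reflect.mjs.
function textOf(content) {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return "";
  return content
    .filter((block) => block && block.type === "text" && typeof block.text === "string")
    .map((block) => block.text)
    .join("\n");
}

function buildStopContext(entries) {
  const userMessages = [];
  let finalAssistantText = "";
  for (const entry of entries) {
    const text = textOf(entry?.message?.content).trim();
    if (!text) continue; // entries holding only tool_use / tool_result contribute nothing
    if (entry.type === "user") userMessages.push(text);
    if (entry.type === "assistant") finalAssistantText = text; // the last one wins
  }
  return { user_messages: userMessages, final_assistant_text: finalAssistantText };
}

// Hypothetical four-entry transcript: request, tool call, tool result, final answer.
const transcript = [
  { type: "user", message: { content: "Fix the failing test" } },
  { type: "assistant", message: { content: [{ type: "tool_use", name: "Bash", input: {} }] } },
  { type: "user", message: { content: [{ type: "tool_result", content: "1 failed" }] } },
  { type: "assistant", message: { content: [{ type: "text", text: "Done, the test passes now." }] } },
];
const ctx = buildStopContext(transcript);
console.log(JSON.stringify(ctx));
```

The tool call and its result drop out because neither entry carries a `text` block, so the judge would see one user message and the closing assistant text.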

Group B/C of plan v2.

- filter-cc-stops.mjs: heuristic pass over miner output. Tags each candidate with hint:summary_drift / hint:punt / hint:stuck / hint:question. Drops candidates with no hints (cheap "complete" answers).
- classify-cc-stops.mjs: calls the Anthropic API directly with the OAuth Bearer token from ~/.claude/.credentials.json (avoids the ~100K context bloat that `claude -p` loads from CLAUDE.md / skills / plugins). Same model (claude-haiku-4-5), same user auth, just routed direct. Concurrency 4, retry-on-429, resume-safe (skips records already in output).

Output JSONL stays gitignored (evals/datasets/cc-stop-*.jsonl) because it is real user session data. Only the redacted gold subset is committed downstream.

Smoke run: 10 samples classified in ~9s, 1294 input tokens/sample avg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
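The "concurrency 4, retry-on-429, resume-safe" combination can be sketched as follows. This is an illustration under assumptions, not classify-cc-stops.mjs itself: `withRetryOn429`, `classifyAll`, the `err.status` convention, the `id` field, and the backoff values are all hypothetical, and `fakeClassify` stands in for the real API call.

```javascript
// Sketch only: names, the err.status convention and the delays are assumptions.
async function withRetryOn429(fn, { retries = 3, baseDelayMs = 10 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Retry only rate-limit errors, and only up to `retries` times.
      if (err.status !== 429 || attempt >= retries) throw err;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}

async function classifyAll(records, alreadyDone, classifyOne, concurrency = 4) {
  // Resume-safe: skip records whose id is already present in the output.
  const pending = records.filter((record) => !alreadyDone.has(record.id));
  const results = [];
  let next = 0;
  async function worker() {
    while (next < pending.length) {
      const record = pending[next++];
      const label = await withRetryOn429(() => classifyOne(record));
      results.push({ id: record.id, label });
    }
  }
  const workers = Array.from({ length: Math.min(concurrency, pending.length) }, () => worker());
  await Promise.all(workers);
  return results;
}

// Demo with a fake classifier that is rate-limited once for record "b".
const failedOnce = new Set();
async function fakeClassify(record) {
  if (record.id === "b" && !failedOnce.has("b")) {
    failedOnce.add("b");
    const err = new Error("rate limited");
    err.status = 429;
    throw err;
  }
  return "complete";
}
const demoRun = classifyAll([{ id: "a" }, { id: "b" }, { id: "c" }], new Set(["a"]), fakeClassify);
demoRun.then((out) => console.log(out.map((r) => r.id).sort().join(",")));
```

Record "a" is skipped because it is already in the done set, and "b" succeeds on its second attempt.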

End-to-end pipeline now works:

- claude/lib/judge.mjs: classifies a stop context into one of 6 categories via Haiku 4.5 over the Anthropic API (OAuth Bearer from ~/.claude/.credentials.json, same path as the eval classifier). 15s hard timeout via AbortController. TIMEOUT/PARSE_ERROR returns are treated as "no inject" by the caller (fail-safe).
- claude/lib/feedback.mjs: per-category templates with escalating tone across attempts 1/2/3. Injects on summary_drift_stop, tool_available_punt, genuinely_stuck. Skips on complete, waiting_for_user_legitimate, working, and any error category.
- claude/bin/reflect.mjs: replaced the task-11/13 TODO blocks. Now reads stdin, applies loop-guard + attempt-cap, calls judge, writes verdict file, and (if injectable) emits the {decision:"block", additionalContext} JSON on stdout per Claude Code Stop hook spec.

Smoke-tested with a real transcript file. Verified:

- happy path produces a valid block payload with additionalContext
- stop_hook_active=true: exits 0, no stdout, logs loop_guard_triggered
- attempt counter at MAX: exits 0, no stdout, logs attempt_cap_reached

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
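A rough sketch of the hard-timeout and fail-safe behavior described for judge.mjs. The function names, the `{ category }` return shape, and `slowJudge` are assumptions for illustration; only the 15s default, the TIMEOUT label, and the three injectable categories come from the commit message.

```javascript
// Sketch only. callJudge stands in for the real request and must honor the signal
// (fetch does this when given { signal }).
async function classifyWithTimeout(callJudge, timeoutMs = 15_000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await callJudge(controller.signal);
  } catch (err) {
    // An abort we triggered ourselves is reported as TIMEOUT instead of thrown.
    if (controller.signal.aborted) return { category: "TIMEOUT" };
    throw err;
  } finally {
    clearTimeout(timer);
  }
}

// Categories that lead to an inject, per the commit message. Everything else,
// including TIMEOUT and PARSE_ERROR, means "no inject".
const INJECTABLE = new Set(["summary_drift_stop", "tool_available_punt", "genuinely_stuck"]);
const shouldInject = (verdict) => INJECTABLE.has(verdict.category);

// Demo: a judge that never answers and only rejects once it is aborted.
const slowJudge = (signal) =>
  new Promise((_, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")), { once: true });
  });

const timedOut = classifyWithTimeout(slowJudge, 20);
timedOut.then((verdict) => console.log(verdict.category, shouldInject(verdict)));
```

Treating the timeout as just another non-injectable category keeps the caller's decision to a single set lookup.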

#137)

- claude/test/reflect.test.mjs: 35 Node native-test cases covering feedback templates per category/attempt, reflect.mjs exports (loopGuard, attempt counter round-trip, transcript tail, stop context build), judge.mjs (stubbed fetch with zero real API calls, code-fence parsing, 429 retry, AbortController timeout, missing credentials path), and an in-process integration test (classify → buildFeedback → block output JSON). All 35 pass in ~300ms with --test-force-exit.
- claude/package.json: test script uses --test-force-exit + explicit glob (test discovery without glob silently mis-resolved on Node 22).
- evals/scripts/audit-cc-classifications.mjs: stratified sample (per-cat) + redaction (emails, tokens, /home paths, github refs, long secrets).
- evals/datasets/cc-stop-labeled-gold-redacted.jsonl: 30 records, stratified 6 per category across the 5 categories that appeared in the 907-record baseline. Supervisor-audited gold_label per record (v1 mostly accepts haiku, with one correction class: "complete" + ends-with-"Which?" → waiting_for_user_legitimate).
- evals/datasets/README.md: dataset provenance, redaction rules, baseline distribution, known prompt issues (link to follow-up #138).

Follow-up tracked in #138: refine classifier prompt (working over-assigned 374×, tool_available_punt under-assigned 0×). Acceptance: F1 ≥ 0.75 on the two high-value categories with an expanded gold set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
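The "stratified 6 per category" selection can be sketched like this. It is a guess at the idea, not the audit script: `stratifiedSample` and the `label` field are hypothetical, the sketch deterministically keeps the first N records per category (the real script may sample randomly), and the five category names in the demo are inferred from the note that tool_available_punt was assigned 0 times.

```javascript
// Sketch only: keeps the first `perCategory` records seen for each label.
function stratifiedSample(records, perCategory) {
  const byCategory = new Map();
  for (const record of records) {
    const bucket = byCategory.get(record.label) ?? [];
    if (bucket.length < perCategory) bucket.push(record);
    byCategory.set(record.label, bucket);
  }
  return [...byCategory.values()].flat();
}

// Synthetic baseline: 8 records for each of 5 categories (made-up data).
const categories = [
  "complete",
  "waiting_for_user_legitimate",
  "working",
  "summary_drift_stop",
  "genuinely_stuck",
];
const baseline = categories.flatMap((label) =>
  Array.from({ length: 8 }, (_, i) => ({ id: `${label}-${i}`, label }))
);
const gold = stratifiedSample(baseline, 6);
console.log(gold.length); // 5 categories x 6 records = 30
```

A category with fewer than six records would simply contribute all of them, so the total can fall below 30 on a smaller baseline.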

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer raised 6 real issues, all fixed:

1. claude/bin/reflect.mjs:23: removed unused createRequire import.
2. claude/bin/reflect.mjs:100-109: added sanitizeCwd() helper. Rejects non-absolute or non-normalized cwd from the Stop hook payload (defends against payloads like cwd:"../etc"). On throw, the existing uncaughtException handler exits 0 (fail-safe).
3. claude/bin/reflect.mjs:165-186: writeAttemptCounter is now atomic (tmp + POSIX rename) AND concurrency-safe: it only writes if the new count exceeds the existing on-disk count. Prevents two racing Stop hooks for the same session from clobbering each other and bypassing the 3-inject cap.
4. claude/bin/reflect.mjs:148-154: readAttempts handles a corrupt / partially-written counter file by returning 0 and logging "attempts_file_corrupt".
5. claude/lib/judge.mjs:43-62, 285+: added sanitizeError() helper. Strips Bearer/authorization/x-api-key from API error texts before they reach debug logs. Prevents the OAuth token from leaking if the Anthropic API echoes auth headers on a 401.
6. evals/scripts/audit-cc-classifications.mjs:34-40: strengthened redaction patterns: fixed "Accept-Bearer" → case-insensitive "Authorization: Bearer", added x-api-key, Stripe (sk/pk/rk_test/live), AWS access keys (AKIA...), and JWT-shaped tokens (a.b.c). JWT pattern placed before the long-secret regex because dots break \b boundaries.

Existing 35 unit tests still pass (npm test, 291ms).

Smoke verified:

- valid absolute cwd → emits decision:block as before
- cwd:"/tmp/../etc" → sanitizeCwd throws → uncaughtException → exit 0, no stdout, no fs writes outside the project tree
- cwd:"./relative" → same fail-safe behavior

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
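As an illustration of the sanitizeError() idea, the sketch below strips credential-looking values from an error string before it is logged. The regular expressions and the `[REDACTED]` placeholder are assumptions chosen for the example; the actual patterns in claude/lib/judge.mjs may differ.

```javascript
// Sketch only: the patterns are illustrative, not the ones in judge.mjs.
function sanitizeError(text) {
  return String(text)
    // Any "Bearer <token>" pair, regardless of case.
    .replace(/Bearer\s+[A-Za-z0-9._~+\/=-]+/gi, "Bearer [REDACTED]")
    // authorization / x-api-key values in header or JSON form. The lookahead
    // leaves an already-redacted "Bearer [REDACTED]" value untouched.
    .replace(
      /(authorization|x-api-key)(["']?\s*[:=]\s*["']?)(?!Bearer\b)[^\s"',}]+/gi,
      "$1$2[REDACTED]"
    );
}

// Made-up error body echoing both header styles.
const rawError = '401 {"authorization":"Bearer abc.DEF-123","x-api-key":"sk-ant-xyz"}';
const safeError = sanitizeError(rawError);
console.log(safeError);
```

Running the Bearer pattern first means the second pattern only has to handle values that are not in `Bearer <token>` form.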

Phase 7 reviewer flagged that the 35-test suite in claude/test/ was not run by CI — only the root Jest suite (test/*.ts) was. Adds a post-step that runs node --test --test-force-exit test/*.mjs in ./claude so future regressions land in CI, not on the dev box. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per user feedback: stubbed-fetch unit tests can't prove the Stop hook actually fires inside Claude Code or that injects reach the agent. Real E2E with `claude -p` + real Anthropic API is the only meaningful gate.

Changes:

1. Deleted claude/test/reflect.test.mjs (35 unit tests, all stubbed).
2. Removed the corresponding CI step in .github/workflows/test.yml.
3. Added claude/test/e2e-cc.mjs: real E2E runner with 4 scenarios:
   - explicit_wait_negative: user says "wait" -> plugin must not inject.
   - complete_negative: trivial Q&A -> plugin must not inject.
   - attempt_cap_respected: multi-file task -> no false-positive injects, attempt cap honored.
   - direct_pipe_summary_drift: synthetic drift transcript piped directly to reflect.mjs -> verifies the full inject path: real classifier call, correct CC Stop hook schema in stdout, no hookSpecificOutput.

Run: node claude/test/e2e-cc.mjs (or per scenario: --scenario N). Cost ~$0.05-0.20/scenario via Haiku 4.5 OAuth. Out of CI (auth + cost).

Bug fixes uncovered by E2E:

1. claude/bin/reflect.mjs: hook fires BEFORE transcript flush in -p mode. Added poll loop (100ms x 10) that re-reads transcript until the final assistant text appears. If still empty after polling, exit 0 (fail-safe -- better to skip than false-positive inject).
2. claude/bin/reflect.mjs: Stop hook JSON schema fix. CC v2.1.150 rejects { decision, reason, hookSpecificOutput: {...} } as "Invalid input" -- that shape is for PreToolUse / PostToolUse. The correct Stop hook shape per hookify/core/rule_engine.py and empirical test is { decision: "block", reason }. CC injects reason as the agent's next-turn instruction; the longer feedback message now goes in reason. Verified by hook_blocking_error attachment + isMeta user message "Stop hook feedback: <reason>" in the transcript.

E2E results (2026-05-26):

- 4/4 PASS
- s1 (explicit_wait_negative): 0 injects (correct)
- s2 (complete_negative): 0 injects (correct)
- s3 (attempt_cap_respected): 0 injects (Haiku didn't drift on this task)
- s4 (direct_pipe_summary_drift): 1 inject with schema-valid stdout

Known test-methodology limitation (follow-up): Haiku 4.5 rarely drifts on small E2E prompts so scenario 3 is vacuously satisfied. The architecture is proven; pattern provocation needs Sonnet or longer-horizon tasks.

Install for sessions (workaround for --plugin-dir not enabling Stop hooks in -p mode, CC v2.1.150): merge hooks/hooks.json into your ~/.claude/settings.json under the "hooks" key, with command path pointing at this plugin's bin/reflect.mjs absolute path. Plugin packaging remains for future marketplace publication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
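The two E2E-driven fixes in this commit, the poll loop and the corrected Stop hook output, can be sketched together. This is an approximation rather than the code in reflect.mjs: `pollForAssistantText`, `buildStopOutput`, and `fakeRead` are hypothetical names, and the feedback string is invented. The 10 tries at 100ms and the `{ decision: "block", reason }` shape are taken from the commit message.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch only. readFinalText stands in for re-reading the transcript tail.
async function pollForAssistantText(readFinalText, { tries = 10, intervalMs = 100 } = {}) {
  for (let i = 0; i < tries; i++) {
    const text = await readFinalText();
    if (text) return text;
    if (i < tries - 1) await sleep(intervalMs);
  }
  return ""; // still empty: the caller should exit 0 without injecting
}

// Stop hook output per the commit: only decision and reason, no hookSpecificOutput.
function buildStopOutput(feedback) {
  return JSON.stringify({ decision: "block", reason: feedback });
}

// Demo: the fake transcript only becomes readable on the third read.
let reads = 0;
const fakeRead = async () => (++reads >= 3 ? "All done." : "");
const polled = pollForAssistantText(fakeRead, { intervalMs: 5 }).then((text) => {
  const out = text ? buildStopOutput("Re-check the original request before stopping.") : "";
  console.log(out);
  return { text, out };
});
```

Returning an empty string instead of throwing keeps the "skip rather than false-positive inject" behavior in one place.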

…137)

- Install: settings.json hook is the authoritative path; --plugin-dir doesn't activate Stop hooks in headless -p mode on CC v2.1.150. Document the marketplace path as future work.
- Failure categories: corrected to the 6 the classifier actually uses (matched judge.mjs/feedback.mjs). Removed the older speculative context_exhaustion/decision_paralysis/false_completion entries that never landed in the prompt.
- Testing: documented the new E2E runner (node claude/test/e2e-cc.mjs) with scenario descriptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rray

In buildStopContext's reverse-walk to find the last assistant text block, a `break` on non-array content aborts the entire search. If the most recent assistant entry has null/string content (older transcript format), no prior entries are checked, so final_assistant_text stays empty and the hook bails out via the no_assistant_text_after_poll path, missing legitimate injects. The fix replaces the `break` with `continue` so the walk keeps checking earlier entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JKGMRpgihA4io2frodqLjt
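A small sketch of the reverse-walk with `continue` in place of `break`. The function name `findLastAssistantText` and the entry shape are assumptions for illustration; the real logic lives inside buildStopContext.

```javascript
// Sketch only: the entry shape is assumed, not copied from reflect.mjs.
function findLastAssistantText(entries) {
  for (let i = entries.length - 1; i >= 0; i--) {
    const entry = entries[i];
    if (entry?.type !== "assistant") continue;
    const content = entry.message?.content;
    // With `break` here, one null/string entry would end the whole search.
    if (!Array.isArray(content)) continue;
    const text = content
      .filter((block) => block?.type === "text" && typeof block.text === "string")
      .map((block) => block.text)
      .join("\n")
      .trim();
    if (text) return text;
  }
  return "";
}

// The newest assistant entry uses the older string format; the walk must skip it.
const mixedEntries = [
  { type: "assistant", message: { content: [{ type: "text", text: "Refactor finished." }] } },
  { type: "user", message: { content: "thanks" } },
  { type: "assistant", message: { content: "legacy string content" } },
];
const lastText = findLastAssistantText(mixedEntries);
console.log(lastText);
```

In this demo the third entry is skipped and the walk reaches the first entry's text block, which is the behavior the fix restores.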

Take main's evolved README.md and .gitignore; keep PR's claude/ plugin files (reflect.mjs with continue-not-break fix, judge.mjs, feedback.mjs, e2e tests). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01JKGMRpgihA4io2frodqLjt

dzianisv and others added 11 commits May 25, 2026 22:15

docs(readme): add claude/ plugin entry under existing plugin list (#137)

21f8e21

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dzianisv merged commit 25f3495 into main Jun 20, 2026

dzianisv deleted the own/137-cc-reflection branch June 20, 2026 10:35

dzianisv merged 11 commits into main from own/137-cc-reflection
