feat(claude): Claude Code reflection plugin E2E tests and CI (#137) #146

Merged
dzianisv merged 11 commits into main from own/137-cc-reflection on Jun 20, 2026
@dzianisv

Owner

Summary

  • Add real E2E tests replacing stubbed unit tests for claude/ plugin
  • Wire claude/ plugin tests into CI
  • Update install and testing docs to match Claude Code v2.1.150 behavior
  • Address PR review findings

Test Plan

  • npm test — claude/ E2E suite passes
  • CI green on push

dzianisv and others added 11 commits May 25, 2026 22:15
Group A of plan v2 for issue #137. Lays the foundation for the Claude Code
reflection plugin without enabling it end-to-end yet:

- claude/.claude-plugin/plugin.json + hooks/hooks.json — Stop hook wiring
- claude/bin/reflect.mjs — entry skeleton with loop-guard, attempt counter,
  transcript tail-read, debug logging, fail-safe error handling. Strips
  tool_use/tool_result from the stop context per spec (only user msgs +
  final assistant text reach the judge).
- claude/README.md, claude/package.json — install + author docs
- evals/scripts/mine-cc-stops.mjs — scans ~/.claude/projects/**/*.jsonl,
  extracts Stop boundaries, emits candidate JSONL with metadata
  (tools_available_inferred, user_messages, final_assistant_text)
- .gitignore — exclude raw cc-stop-*.jsonl datasets (contain user data);
  allow committing redacted gold set

No classifier yet. No inject yet. Plugin loads but exits 0 on every Stop.
Next: run miner, filter, classify with Claude Code haiku.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
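
A minimal sketch of the stripping rule described in the commit above (only user messages and the final assistant text reach the judge). The transcript entry shape and the names `textOnly` and `buildStopContextSketch` are assumptions for illustration, not the actual code in claude/bin/reflect.mjs.

```javascript
// Assumed entry shape (hypothetical, not the real transcript schema):
//   { role: "user" | "assistant", content: string | Array<{ type, text? }> }

// Keep only text blocks; tool_use / tool_result blocks are dropped.
function textOnly(content) {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return "";
  return content
    .filter((block) => block && block.type === "text" && typeof block.text === "string")
    .map((block) => block.text)
    .join("\n");
}

// Reduce a transcript to what the judge is allowed to see.
function buildStopContextSketch(entries) {
  const userMessages = [];
  let finalAssistantText = "";
  for (const entry of entries) {
    const text = textOnly(entry.content).trim();
    if (!text) continue; // tool-only turns contribute nothing
    if (entry.role === "user") userMessages.push(text);
    if (entry.role === "assistant") finalAssistantText = text; // last one wins
  }
  return { user_messages: userMessages, final_assistant_text: finalAssistantText };
}
```

Dropping tool blocks before classification keeps the judge prompt small and avoids sending tool output to the API.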
Group B/C of plan v2.

- filter-cc-stops.mjs: heuristic pass over miner output. Tags each candidate
  with hint:summary_drift / hint:punt / hint:stuck / hint:question. Drops
  candidates with no hints (cheap "complete" answers).
- classify-cc-stops.mjs: calls Anthropic API directly with the OAuth Bearer
  token from ~/.claude/.credentials.json (avoids the ~100K context bloat
  that `claude -p` loads from CLAUDE.md / skills / plugins). Same model
  (claude-haiku-4-5), same user auth — just routed direct. Concurrency 4,
  retry-on-429, resume-safe (skips records already in output).

Output JSONL stays gitignored (evals/datasets/cc-stop-*.jsonl) — real user
session data. Only the redacted gold subset is committed downstream.

Smoke run: 10 samples classified in ~9s, 1294 input tokens/sample avg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
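
The heuristic pass described above can be sketched as follows. The hint names come from the commit message; every regular expression here is a hypothetical stand-in, since the real patterns in filter-cc-stops.mjs are not shown in this PR.

```javascript
// Hypothetical patterns for illustration only; not the real heuristics.
const HINT_PATTERNS = {
  "hint:summary_drift": /\b(in summary|to summarize)\b/i,
  "hint:punt": /\byou (can|could|should|will need to) (run|check|try)\b/i,
  "hint:stuck": /\b(i('m| am) (stuck|unable)|cannot proceed)\b/i,
  "hint:question": /\?\s*$/,
};

// Attach every matching hint to a candidate record.
function tagCandidate(candidate) {
  const text = candidate.final_assistant_text || "";
  const hints = Object.entries(HINT_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([hint]) => hint);
  return { ...candidate, hints };
}

// Candidates with no hints are treated as cheap "complete" answers and dropped.
function filterCandidates(candidates) {
  return candidates.map(tagCandidate).filter((c) => c.hints.length > 0);
}
```

Filtering before classification means the paid classifier call only sees candidates that are plausibly interesting.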
End-to-end pipeline now works:

- claude/lib/judge.mjs: classifies a stop context into one of 6 categories
  via Haiku 4.5 over the Anthropic API (OAuth Bearer from
  ~/.claude/.credentials.json, same path as the eval classifier). 15s
  hard timeout via AbortController. TIMEOUT/PARSE_ERROR returns are
  treated as "no inject" by the caller — fail-safe.
- claude/lib/feedback.mjs: per-category templates with escalating tone
  across attempts 1/2/3. Injects on summary_drift_stop, tool_available_punt,
  genuinely_stuck. Skips on complete, waiting_for_user_legitimate, working,
  and any error category.
- claude/bin/reflect.mjs: replaced the task-11/13 TODO blocks. Now reads
  stdin, applies loop-guard + attempt-cap, calls judge, writes verdict
  file, and (if injectable) emits the {decision:"block", additionalContext}
  JSON on stdout per Claude Code Stop hook spec.

Smoke-tested with a real transcript file. Verified:
- happy path produces a valid block payload with additionalContext
- stop_hook_active=true: exits 0, no stdout, logs loop_guard_triggered
- attempt counter at MAX: exits 0, no stdout, logs attempt_cap_reached

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
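
The guard ordering and the inject/skip split described above can be sketched as a pure function. The category names and the two log strings come from the commit message. The function name, the return shape, the `not_injectable` log string, and the cap value of 3 (inferred from "attempts 1/2/3") are assumptions. In the real hook the judge is an async API call that runs only after both guards pass, whereas here the category is passed in to keep the sketch synchronous.

```javascript
const INJECTABLE = new Set(["summary_drift_stop", "tool_available_punt", "genuinely_stuck"]);
const MAX_ATTEMPTS = 3; // assumed cap, inferred from "attempts 1/2/3"

// Decide whether this Stop event should produce an inject.
function decideInject(payload, attemptsSoFar, category) {
  // Loop guard: the stop was itself caused by a previous hook block.
  if (payload.stop_hook_active === true) {
    return { inject: false, log: "loop_guard_triggered" };
  }
  // Attempt cap: stop nagging after MAX_ATTEMPTS injects per session.
  if (attemptsSoFar >= MAX_ATTEMPTS) {
    return { inject: false, log: "attempt_cap_reached" };
  }
  // complete, waiting_for_user_legitimate, working, and error categories
  // (e.g. TIMEOUT, PARSE_ERROR) all fall through to "no inject".
  if (!INJECTABLE.has(category)) {
    return { inject: false, log: "not_injectable" }; // hypothetical log name
  }
  return { inject: true, attempt: attemptsSoFar + 1 };
}
```

Treating every unknown or error category as "no inject" is what makes the hook fail-safe: a broken judge can only cause a missed inject, never a spurious one.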
#137)

- claude/test/reflect.test.mjs: 35 Node native-test cases covering
  feedback templates per category/attempt, reflect.mjs exports
  (loopGuard, attempt counter round-trip, transcript tail, stop context
  build), judge.mjs (stubbed fetch — zero real API calls, code-fence
  parsing, 429 retry, AbortController timeout, missing credentials path),
  and an in-process integration test (classify → buildFeedback → block
  output JSON). All 35 pass in ~300ms with --test-force-exit.
- claude/package.json: test script uses --test-force-exit + explicit glob
  (test discovery without glob silently mis-resolved on Node 22).
- evals/scripts/audit-cc-classifications.mjs: stratified sample (per-cat)
  + redaction (emails, tokens, /home paths, github refs, long secrets).
- evals/datasets/cc-stop-labeled-gold-redacted.jsonl: 30 records, stratified
  6 per category across the 5 categories that appeared in the 907-record
  baseline. supervisor-audited gold_label per record (v1 mostly accepts
  haiku, with one correction class: "complete" + ends-with-"Which?" →
  waiting_for_user_legitimate).
- evals/datasets/README.md: dataset provenance, redaction rules, baseline
  distribution, known prompt issues (link to follow-up #138).

Follow-up tracked in #138: refine classifier prompt (working over-assigned
374×, tool_available_punt under-assigned 0×). Acceptance: F1 ≥ 0.75 on the
two high-value categories with an expanded gold set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
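
For illustration, the "6 per category" stratification could look roughly like this. It is a simplified, deterministic sketch that takes the first N records per category. The real audit-cc-classifications.mjs may sample randomly, and the `label` field name is an assumption.

```javascript
// Take at most `perCategory` records from each category, in input order.
function stratifiedSample(records, perCategory, keyOf = (r) => r.label) {
  const buckets = new Map();
  for (const record of records) {
    const key = keyOf(record);
    if (!buckets.has(key)) buckets.set(key, []);
    const bucket = buckets.get(key);
    if (bucket.length < perCategory) bucket.push(record);
  }
  // Flatten the buckets back into one list, grouped by category.
  return [...buckets.values()].flat();
}
```

With 5 observed categories and `perCategory = 6`, this yields the 30-record size mentioned above, provided each category has at least 6 records.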
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer raised 6 real issues, all fixed:

1. claude/bin/reflect.mjs:23 — removed unused createRequire import.
2. claude/bin/reflect.mjs:100-109 — added sanitizeCwd() helper. Rejects
   non-absolute or non-normalized cwd from the Stop hook payload (defends
   against payloads like cwd:"../etc"). On throw, the existing
   uncaughtException handler exits 0 — fail-safe.
3. claude/bin/reflect.mjs:165-186 — writeAttemptCounter is now
   atomic (tmp + POSIX rename) AND concurrency-safe: only writes if the
   new count exceeds the existing on-disk count. Prevents two racing Stop
   hooks for the same session from clobbering each other and bypassing
   the 3-inject cap.
4. claude/bin/reflect.mjs:148-154 — readAttempts handles a corrupt /
   partially-written counter file by returning 0 and logging
   "attempts_file_corrupt".
5. claude/lib/judge.mjs:43-62, 285+ — added sanitizeError() helper.
   Strips Bearer/authorization/x-api-key from API error texts before
   they reach debug logs. Prevents the OAuth token from leaking if the
   Anthropic API echoes auth headers on a 401.
6. evals/scripts/audit-cc-classifications.mjs:34-40 — strengthened
   redaction patterns: fixed "Accept-Bearer" → case-insensitive
   "Authorization: Bearer", added x-api-key, Stripe (sk/pk/rk_test/live),
   AWS access keys (AKIA...), and JWT-shaped tokens (a.b.c). JWT pattern
   placed before the long-secret regex because dots break \b boundaries.
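
A sketch of the ordering point in item 6: the JWT-shaped rule runs before the generic long-secret rule, so a dotted token is replaced as one unit instead of segment by segment. All patterns and replacement strings below are assumptions covering only a subset of the rules listed (Stripe keys are omitted). They are not the actual regexes from audit-cc-classifications.mjs.

```javascript
// Ordered [pattern, replacement] pairs; order matters.
const REDACTIONS = [
  [/authorization:\s*bearer\s+\S+/gi, "Authorization: Bearer [REDACTED]"],
  [/x-api-key:\s*\S+/gi, "x-api-key: [REDACTED]"],
  [/\bAKIA[0-9A-Z]{16}\b/g, "[REDACTED_AWS_KEY]"],
  // JWT-shaped a.b.c must come before the long-secret rule: the dots are
  // word boundaries, so the generic rule would only match single segments.
  [/\b[A-Za-z0-9_-]{8,}\.[A-Za-z0-9_-]{8,}\.[A-Za-z0-9_-]{8,}\b/g, "[REDACTED_JWT]"],
  [/\b[A-Za-z0-9_-]{32,}\b/g, "[REDACTED_SECRET]"],
];

// Apply every rule in order to the input text.
function redact(text) {
  return REDACTIONS.reduce(
    (out, [pattern, replacement]) => out.replace(pattern, replacement),
    text
  );
}
```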

Existing 35 unit tests still pass (npm test, 291ms).

Smoke verified:
- valid absolute cwd → emits decision:block as before
- cwd:"/tmp/../etc" → sanitizeCwd throws → uncaughtException → exit 0,
  no stdout, no fs writes outside the project tree
- cwd:"./relative" → same fail-safe behavior

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
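
The cwd check exercised by the smoke cases above can be sketched without any imports. This is a POSIX-only approximation under stated assumptions: the name `sanitizeCwdSketch` and the error messages are hypothetical, and the real helper may instead rely on Node's `path.isAbsolute` and `path.normalize`. This version is slightly stricter in that it also rejects a trailing slash.

```javascript
// Accept only absolute, already-normalized POSIX paths; throw otherwise.
function sanitizeCwdSketch(cwd) {
  if (typeof cwd !== "string" || !cwd.startsWith("/")) {
    throw new Error("cwd must be an absolute path");
  }
  if (cwd === "/") return cwd;
  // Any empty, "." or ".." segment means the path is not normalized.
  const segments = cwd.split("/").slice(1);
  for (const segment of segments) {
    if (segment === "" || segment === "." || segment === "..") {
      throw new Error("cwd must be normalized");
    }
  }
  return cwd;
}
```

Because the function throws rather than repairing the path, a hostile payload ends in the hook's fail-safe exit instead of a write to an unexpected directory.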
Phase 7 reviewer flagged that the 35-test suite in claude/test/ was not
run by CI — only the root Jest suite (test/*.ts) was. Adds a post-step
that runs node --test --test-force-exit test/*.mjs in ./claude so
future regressions land in CI, not on the dev box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user feedback: stubbed-fetch unit tests can't prove the Stop hook
actually fires inside Claude Code or that injects reach the agent. Real
E2E with `claude -p` + real Anthropic API is the only meaningful gate.

Changes:

1. Deleted claude/test/reflect.test.mjs (35 unit tests, all stubbed).
2. Removed the corresponding CI step in .github/workflows/test.yml.
3. Added claude/test/e2e-cc.mjs: real E2E runner with 4 scenarios:
   - explicit_wait_negative: user says "wait" -> plugin must not inject.
   - complete_negative: trivial Q&A -> plugin must not inject.
   - attempt_cap_respected: multi-file task -> no false-positive injects,
     attempt cap honored.
   - direct_pipe_summary_drift: synthetic drift transcript piped directly
     to reflect.mjs -> verifies the full inject path: real classifier
     call, correct CC Stop hook schema in stdout, no hookSpecificOutput.

Run: node claude/test/e2e-cc.mjs (or per scenario: --scenario N).
Cost ~$0.05-0.20/scenario via Haiku 4.5 OAuth. Out of CI (auth + cost).

Bug fixes uncovered by E2E:

1. claude/bin/reflect.mjs: hook fires BEFORE transcript flush in -p
   mode. Added poll loop (100ms x 10) that re-reads transcript until the
   final assistant text appears. If still empty after polling, exit 0
   (fail-safe -- better to skip than false-positive inject).

2. claude/bin/reflect.mjs: Stop hook JSON schema fix. CC v2.1.150
   rejects { decision, reason, hookSpecificOutput: {...} } as "Invalid
   input" -- that shape is for PreToolUse / PostToolUse. The correct
   Stop hook shape per hookify/core/rule_engine.py and empirical test
   is { decision: "block", reason }. CC injects reason as the agent's
   next-turn instruction; the longer feedback message now goes in
   reason. Verified by hook_blocking_error attachment + isMeta user
   message "Stop hook feedback: <reason>" in the transcript.
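
A minimal sketch of the corrected stdout payload from fix 2. The `{ decision: "block", reason }` shape is taken from the text above. The function names are hypothetical, and the second helper is only a local sanity check, not Claude Code's own validation.

```javascript
// Serialize the Stop hook response: the feedback text travels in `reason`.
function buildStopOutput(feedbackMessage) {
  return JSON.stringify({ decision: "block", reason: feedbackMessage });
}

// Local check mirroring the rule above: no hookSpecificOutput on Stop.
function looksLikeValidStopOutput(stdoutText) {
  let parsed;
  try {
    parsed = JSON.parse(stdoutText);
  } catch (err) {
    return false;
  }
  if (parsed === null || typeof parsed !== "object") return false;
  return (
    parsed.decision === "block" &&
    typeof parsed.reason === "string" &&
    !("hookSpecificOutput" in parsed)
  );
}
```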

E2E results (2026-05-26):
- 4/4 PASS
- s1 (explicit_wait_negative): 0 injects (correct)
- s2 (complete_negative): 0 injects (correct)
- s3 (attempt_cap_respected): 0 injects (Haiku didn't drift on this task)
- s4 (direct_pipe_summary_drift): 1 inject with schema-valid stdout

Known test-methodology limitation (follow-up): Haiku 4.5 rarely drifts
on small E2E prompts so scenario 3 is vacuously satisfied. The architecture
is proven; pattern provocation needs Sonnet or longer-horizon tasks.

Install for sessions (workaround for --plugin-dir not enabling Stop
hooks in -p mode, CC v2.1.150): merge hooks/hooks.json into your
~/.claude/settings.json under the "hooks" key, with command path
pointing at this plugin's bin/reflect.mjs absolute path. Plugin packaging
remains for future marketplace publication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
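
The settings.json workaround above amounts to merging one Stop entry under the "hooks" key. The sketch below assumes the matcher-group layout (`hooks.Stop` as an array of groups, each holding a `hooks` array of `{ type: "command", command }` entries). Verify that layout against the plugin's own hooks/hooks.json before relying on it. `mergeStopHook` and the example command are hypothetical.

```javascript
// Return a copy of `settings` with a Stop command hook added (idempotent).
function mergeStopHook(settings, command) {
  const hooks = { ...(settings.hooks || {}) };
  const stopGroups = Array.isArray(hooks.Stop) ? [...hooks.Stop] : [];
  const alreadyPresent = stopGroups.some((group) =>
    (group.hooks || []).some((h) => h.type === "command" && h.command === command)
  );
  if (!alreadyPresent) {
    stopGroups.push({ hooks: [{ type: "command", command }] });
  }
  return { ...settings, hooks: { ...hooks, Stop: stopGroups } };
}

// Example (path is a placeholder; use the plugin's real absolute path):
// mergeStopHook(existingSettings, "node /abs/path/to/claude/bin/reflect.mjs")
```

Returning a new object instead of mutating the input makes it easy to diff the result against the current settings before writing the file back.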
…137)

- Install: settings.json hook is the authoritative path; --plugin-dir
  doesn't activate Stop hooks in headless -p mode on CC v2.1.150. Document
  the marketplace path as future work.
- Failure categories: corrected to the 6 the classifier actually uses
  (matched judge.mjs/feedback.mjs). Removed the older speculative
  context_exhaustion/decision_paralysis/false_completion entries that
  never landed in the prompt.
- Testing: documented the new E2E runner (node claude/test/e2e-cc.mjs)
  with scenario descriptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rray

In buildStopContext's reverse-walk to find the last assistant text block,
a `break` on non-array content aborts the entire search. If the most-recent
assistant entry has null/string content (older transcript format), no prior
entries are checked, so final_assistant_text stays empty and the hook bails
out via the no_assistant_text_after_poll path — missing legitimate injects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JKGMRpgihA4io2frodqLjt
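
The break-versus-continue difference described above can be sketched as follows. The entry shape and the function name are assumptions, and the real logic lives in buildStopContext in claude/bin/reflect.mjs.

```javascript
// Walk the transcript backwards and return the last assistant text block.
function findFinalAssistantText(entries) {
  for (let i = entries.length - 1; i >= 0; i--) {
    const entry = entries[i];
    if (!entry || entry.role !== "assistant") continue;
    // Older transcript formats may carry null or string content here.
    // A `break` at this point would abandon the search entirely;
    // `continue` skips the entry and keeps looking at earlier ones.
    if (!Array.isArray(entry.content)) continue;
    const texts = entry.content.filter(
      (block) => block && block.type === "text" && typeof block.text === "string" && block.text.trim()
    );
    if (texts.length > 0) return texts[texts.length - 1].text;
  }
  return ""; // caller treats empty text as "no inject"
}
```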
Take main's evolved README.md and .gitignore; keep PR's claude/ plugin files
(reflect.mjs with continue-not-break fix, judge.mjs, feedback.mjs, e2e tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JKGMRpgihA4io2frodqLjt
@dzianisv dzianisv merged commit 25f3495 into main Jun 20, 2026
@dzianisv dzianisv deleted the own/137-cc-reflection branch June 20, 2026 10:35