feat(sanitize): expand injection patterns 10→56, strip zero-width chars before match #74

Merged: tzone85 merged 1 commit into main from feat/sanitize-patterns on Jun 12, 2026

tzone85 (Owner) commented Jun 12, 2026

Summary

Closes the last open follow-up from the bulletproof certification pass: `internal/sanitize.DetectPromptInjection` was previously a 10-entry substring blocklist easily bypassed with zero-width characters spliced into the payload.

| | Before | After |
|---|---|---|
| Substring patterns | 10 | 56 |
| Attack families | 4 | 9 |
| Zero-width normalisation | none | strips U+00AD, U+200B-U+200F, U+202A-U+202E, U+2060-U+206F, U+FEFF before matching |
| Pattern-match attribution | none | new `MatchInjectionPattern` returns the canonical pattern that fired |

What changed

`internal/sanitize/sanitize.go`:

  • 56 patterns grouped by attack family with rationale comments: override/disregard, role/identity coercion, authority spoofing, output coercion, memory poisoning, action coercion, exfiltration, jailbreak labels (DAN / developer / jailbreak mode), and chat-template tags (`<|system|>`, `<|im_start/end|>`, `<|user|>`, `<|assistant|>`, `[INST]`/`[/INST]`, `<<SYS>>`/`<</SYS>>`).
  • `zeroWidthRe` strips zero-width and bidi-override characters before substring matching. Defeats `ig<ZWSP>nore previous instructions` style bypasses, which the audit flagged as a known weakness. Built with regex `\x{...}` escapes so the source file stays pure ASCII (embedded invisibles silently break diffs, and Go rejects U+FEFF in source).
  • `normaliseForInjectionMatch`: lowercases, strips invisibles, collapses whitespace runs. The result is fed only to the matcher and is never used for content storage.
  • `MatchInjectionPattern` returns the canonical pattern that fired (or `""` on no match) so callers can log which family triggered for post-mortems.
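
The pipeline described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in `internal/sanitize/sanitize.go`: the function signatures, the `whitespaceRe` helper, and the three-entry pattern list are assumptions made for the example (the real list has 56 entries). The character class uses the code-point ranges from the summary table.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// zeroWidthRe matches the invisible code points listed in the summary table.
// The \x{...} escapes keep this source file pure ASCII.
var zeroWidthRe = regexp.MustCompile(
	`[\x{00AD}\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2060}-\x{206F}\x{FEFF}]`)

// whitespaceRe is a hypothetical helper that matches runs of ASCII whitespace.
var whitespaceRe = regexp.MustCompile(`\s+`)

// injectionPatterns is an illustrative subset, stored lower-case because the
// input is lower-cased before matching.
var injectionPatterns = []string{
	"ignore previous instructions",
	"new instructions",
	"<|im_start|>",
}

// normaliseForInjectionMatch lower-cases, strips invisibles, and collapses
// whitespace runs. The result is only for matching, never for storage.
func normaliseForInjectionMatch(s string) string {
	s = strings.ToLower(s)
	s = zeroWidthRe.ReplaceAllString(s, "")
	s = whitespaceRe.ReplaceAllString(s, " ")
	return strings.TrimSpace(s)
}

// MatchInjectionPattern returns the canonical pattern that fired, or "".
func MatchInjectionPattern(s string) string {
	n := normaliseForInjectionMatch(s)
	for _, p := range injectionPatterns {
		if strings.Contains(n, p) {
			return p
		}
	}
	return ""
}

// DetectPromptInjection reports whether any pattern matched.
func DetectPromptInjection(s string) bool {
	return MatchInjectionPattern(s) != ""
}

func main() {
	// Zero-width space inside "ignore", plus a tab and a newline between words.
	fmt.Printf("%q\n", MatchInjectionPattern("Ig\u200bnore\tprevious\n instructions"))
	// Benign text: "new" alone is fine, "new instructions" is not.
	fmt.Printf("%q\n", MatchInjectionPattern("add a new endpoint"))
}
```

Returning the canonical pattern rather than the matched input means a log line stays readable ASCII even when the payload carried invisible characters.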

`internal/sanitize/sanitize_test.go`:

  • 26 positive cases driving every attack family (one `t.Run` per case).
  • 6 negative cases (benign developer text + the "new endpoint" trap).
  • 6 zero-width-bypass cases pinning the Unicode normalisation guard.
  • 3 whitespace-collapse cases (tabs, newlines, doubled spaces).
  • 3 `MatchInjectionPattern` tests including `HonoursUnicodeNormalisation` verifying the returned pattern is the canonical ASCII form, not the munged input.

Why this is still defence-in-depth, not the primary defence

The audit verdict is unchanged: a sufficiently motivated attacker can bypass any substring blocklist via base64 directives, lookalike Unicode (Cyrillic 'е' for Latin 'e'), or non-English variants. The durable defence is the structural `<untrusted_content>` framing applied by `analyzer.Triage` and `implementer.Implement`. A positive hit here remains a strong signal worth aborting on; an absent hit must NOT be read as "safe".

Test plan

  • `go build ./...` clean
  • `go vet ./...` clean
  • `go test ./... -count=1` — all 30 packages pass
  • `golangci-lint run --timeout=5m ./...` — 0 issues
  • CLAUDE.md item 55 added with the full pattern-family list
  • CLAUDE.md "Still open" section trimmed

…rs before match

Background

The bulletproof certification audit treated the heuristic substring
blocklist in DetectPromptInjection as a 1% defence — the durable
mitigation is the structural <untrusted_content> framing applied by
analyzer.Triage and implementer.Implement. The list was nevertheless
tracked as "pattern expansion deprioritized but worth doing". This
PR closes that follow-up.

What changed

internal/sanitize/sanitize.go:

- 10 → 56 patterns grouped by attack family with rationale comments:
  override/disregard, role/identity coercion, authority spoofing,
  output coercion, memory poisoning, action coercion, exfiltration,
  jailbreak labels (DAN/developer/jailbreak mode), and chat-template
  tags (<|system|>, <|im_start/end|>, <|user|>, <|assistant|>,
  [INST]/[/INST], <<SYS>>/<</SYS>>).

- New zeroWidthRe strips ZWSP, ZWNJ, ZWJ, LRM/RLM, LRE/RLE/PDF/LRO/RLO,
  the word joiner range, soft hyphen, and BOM before substring matching.
  Defeats the "ig<ZWSP>nore previous instructions" bypass that the
  audit flagged as a known weakness. Built with regex \x{...} escapes
  so the source file stays pure ASCII — embedded invisibles silently
  break diffs and Go rejects U+FEFF in source.

- New normaliseForInjectionMatch helper: lowers, strips invisibles,
  collapses whitespace runs. Result is fed only to the matcher, never
  used for content storage.

- New MatchInjectionPattern returns the canonical pattern that
  fired (or "" on no match) so callers can log which family triggered
  for post-mortems.

internal/sanitize/sanitize_test.go:

- 26 positive cases driving every attack family (one t.Run per case
  so failures point at the specific phrase).
- 6 negative cases covering benign developer text (refactor, bug fix,
  README, dep bump) and the "new endpoint" trap (the word "new" is
  fine; "new instructions" is the bad phrase).
- 6 zero-width-bypass cases pinning the Unicode normalisation guard.
- 3 whitespace-collapse cases (tabs, newlines, doubled spaces).
- 3 MatchInjectionPattern tests including the
  HonoursUnicodeNormalisation case verifying the returned pattern is
  the canonical ASCII form, not whatever munged form the input carried.

Verified

go build ./..., go vet ./..., go test ./... -count=1 — all 30
packages pass. golangci-lint run --timeout=5m ./... — 0 issues.
@tzone85 tzone85 merged commit a2faa26 into main Jun 12, 2026
5 checks passed