feat(sanitize): expand injection patterns 10→56, strip zero-width chars before match#74
Merged
Merged
Conversation
…rs before match
Background
The bulletproof certification audit treated the heuristic substring
blocklist in DetectPromptInjection as a 1% defence — the durable
mitigation is the structural <untrusted_content> framing applied by
analyzer.Triage and implementer.Implement. The list was nevertheless
tracked as "pattern expansion deprioritized but worth doing". This
PR closes that follow-up.
What changed
internal/sanitize/sanitize.go:
- 10 → 56 patterns grouped by attack family with rationale comments:
override/disregard, role/identity coercion, authority spoofing,
output coercion, memory poisoning, action coercion, exfiltration,
jailbreak labels (DAN/developer/jailbreak mode), and chat-template
tags (<|system|>, <|im_start/end|>, <|user|>, <|assistant|>,
[INST]/[/INST], <<SYS>>/<</SYS>>).
- New zeroWidthRe strips ZWSP, ZWNJ, ZWJ, LRM/RLM, LRE/RLE/PDF/LRO/RLO,
the word joiner range, soft hyphen, and BOM before substring matching.
Defeats the "ig<ZWSP>nore previous instructions" bypass that the
audit flagged as a known weakness. Built with regex \x{...} escapes
so the source file stays pure ASCII — embedded invisibles silently
break diffs and Go rejects U+FEFF in source.
- New normaliseForInjectionMatch helper: lowers, strips invisibles,
collapses whitespace runs. Result is fed only to the matcher, never
used for content storage.
- New MatchInjectionPattern returns the canonical pattern that
fired (or "" on no match) so callers can log which family triggered
for post-mortems.
internal/sanitize/sanitize_test.go:
- 26 positive cases driving every attack family (one t.Run per case
so failures point at the specific phrase).
- 6 negative cases covering benign developer text (refactor, bug fix,
README, dep bump) and the "new endpoint" trap (the word "new" is
fine; "new instructions" is the bad phrase).
- 6 zero-width-bypass cases pinning the Unicode normalisation guard.
- 3 whitespace-collapse cases (tabs, newlines, doubled spaces).
- 3 MatchInjectionPattern tests including the
HonoursUnicodeNormalisation case verifying the returned pattern is
the canonical ASCII form, not whatever munged form the input carried.
Verified
go build ./..., go vet ./..., go test ./... -count=1 — all 30
packages pass. golangci-lint run --timeout=5m ./... — 0 issues.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the last open follow-up from the bulletproof certification pass: `internal/sanitize.DetectPromptInjection` was previously a 10-entry substring blocklist easily bypassed with zero-width characters spliced into the payload.
What changed
`internal/sanitize/sanitize.go`:
`internal/sanitize/sanitize_test.go`:
Why this is still defence-in-depth, not the primary defence
The audit verdict is unchanged: a sufficiently motivated attacker can bypass any substring blocklist via base64 directives, lookalike Unicode (Cyrillic 'е' for Latin 'e'), or non-English variants. The durable defence is the structural `<untrusted_content>` framing applied by `analyzer.Triage` and `implementer.Implement`. A positive hit here remains a strong signal worth aborting on; an absent hit must NOT be read as "safe".
Test plan