
Go: idna-ip-literal-smuggle (UTS-46 NFKC digit-fold SSRF) #3841

Open
astrogilda wants to merge 2 commits into semgrep:main from
astrogilda:idna-ip-literal-smuggle-go


@astrogilda astrogilda commented May 4, 2026

Go: idna-ip-literal-smuggle (UTS-46 NFKC digit-fold SSRF)

Summary

  • New Go security rule at go/lang/security/audit/idna-ip-literal-smuggle.yaml detecting an SSRF anti-pattern where an untrusted hostname is mapped through golang.org/x/net/idna UTS-46 mapping ((*Profile).ToASCII or (*Profile).ToUnicode on a digit-folding profile) and then reaches a network sink with no post-mapping IP-literal recheck.
  • Taint mode, two labels (PRE_IDNA, POST_IDNA), label transformation at the IDNA call site, intra-procedural. Sanitizer requires a trailing-dot trim plus netip.ParseAddr or net.ParseIP recheck, in that order.
  • Ships with a single Go fixture (idna-ip-literal-smuggle.go) carrying 23 // ruleid: markers covering all seven in-scope Unicode-block fold ranges plus the ToUnicode entry-point fold, plus 6 // ok: negatives covering both compliant trim variants, the Punycode out-of-scope case, and a Devanagari Valid-but-not-mapped negative control.

Motivation

The rule complements an inter-procedural CodeQL companion query in the CodeQL community pack (PR https://github.com/github/codeql/pull/21784). The split is deliberate: Semgrep OSS provides a high-precision direct-call sweep that runs on free-tier CI, while CodeQL provides the inter-procedural recall vehicle for shapes where the IDNA call and the sink are wrapped behind an intermediate function (the canonical real-world canonicalAddr-style wrapper). The two engines hit different points on the precision/recall frontier; both are needed for a complete caller-side audit. This design is documented in docs/research/v0.1-detection-strategy.md of the upstream rule repository.

Threat model

idna.Lookup, idna.Display, idna.Registration, and any idna.New(idna.MapForLookup(), ...) profile apply the UTS-46 mapping step, which includes NFKC compatibility normalization, before producing ASCII output. The fold runs in both (*Profile).ToASCII and (*Profile).ToUnicode because the library executes validateAndMap before the encode-vs-decode branch. Enumerating the Unicode 16 table shipped with golang.org/x/text v0.21.0 yields 100 codepoints, partitioned into seven Unicode-block ranges, that fold to ASCII digits 0-9: Latin-1 superscripts (3), mathematical superscripts (7), mathematical subscripts (10), circled digits (10), fullwidth digits (10), the Mathematical Alphanumeric Symbols block (50), and segmented digits (10). Devanagari digits (U+0966..U+096F) are not in scope: empirically verified against golang.org/x/net/idna v0.53.0, they are Punycode-encoded rather than folded to ASCII. An attacker-controlled hostname containing one of the in-scope codepoints passes a pre-IDNA net.ParseIP check (net.ParseIP cannot parse the non-ASCII digit, so the input is not recognized as an IP literal), maps to an ASCII IP literal, and reaches the sink as if it were a DNS name. The result is SSRF against loopback, RFC 1918, link-local, or cloud metadata ranges.
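The pre-check bypass can be illustrated with the standard library alone. The sketch below is not the library's mapping: foldDigits is a hypothetical, hand-rolled stand-in that covers only two of the seven ranges (the three Latin-1 superscripts and the fullwidth digits), and isIPLiteral is an illustrative helper name. It only shows why a check placed before the mapping sees a non-IP string while the mapped output is an IP literal.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// foldDigits is a hypothetical stand-in for the digit-folding effect of the
// UTS-46 mapping. It handles only the Latin-1 superscripts and the fullwidth
// digits; the real mapping in golang.org/x/net/idna covers far more.
func foldDigits(host string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == '\u00B9': // SUPERSCRIPT ONE
			return '1'
		case r == '\u00B2': // SUPERSCRIPT TWO
			return '2'
		case r == '\u00B3': // SUPERSCRIPT THREE
			return '3'
		case r >= '\uFF10' && r <= '\uFF19': // FULLWIDTH DIGIT ZERO..NINE
			return '0' + (r - '\uFF10')
		}
		return r
	}, host)
}

// isIPLiteral reports whether net.ParseIP accepts s.
func isIPLiteral(s string) bool {
	return net.ParseIP(s) != nil
}

func main() {
	raw := "0.\u00B9.0.0" // "0.<superscript-1>.0.0"

	// Pre-mapping check: the superscript is not an ASCII digit, so this
	// prints false and the input is treated as a DNS name.
	fmt.Println(isIPLiteral(raw))

	// Post-mapping view: the folded string is an IPv4 literal.
	folded := foldDigits(raw)
	fmt.Println(folded, isIPLiteral(folded))
}
```

In real code the fold is performed by the IDNA profile itself; the point of the sketch is only that the IP-literal question has to be asked again on the mapped string.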

The trailing-dot trim is required for the post-IDNA recheck to work. idna.Lookup.ToASCII("0.<superscript-1>.0.0.") returns "0.1.0.0.", which net.ParseIP fails to parse because of the trailing dot. A recheck without the trim therefore sees no IP literal and silently passes, and the smuggle survives. The sanitizer pattern requires both predicates, in order: trim first, then parse.
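A minimal sketch of the trim-then-recheck order, using only the standard library. The names recheckMapped, safeHost, and errIPLiteral are illustrative, not taken from the rule or the fixture, and the mapping call is passed in as a function value so the block needs no external module. strings.TrimRight is one possible trim; whether the rule's sanitizer pattern recognizes this exact call is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"strings"
)

// errIPLiteral is an illustrative sentinel error.
var errIPLiteral = errors.New("mapped hostname is an IP literal")

// recheckMapped applies the two predicates, in order, to a hostname that has
// already been through UTS-46 mapping.
func recheckMapped(mapped string) (string, error) {
	// 1. Strip every trailing dot so "0.1.0.0." and "0.1.0.0.." both
	//    reduce to "0.1.0.0".
	trimmed := strings.TrimRight(mapped, ".")

	// 2. Recheck for an IP literal on the trimmed string.
	if _, err := netip.ParseAddr(trimmed); err == nil {
		return "", errIPLiteral
	}
	return trimmed, nil
}

// safeHost maps raw with toASCII and then rechecks the result. A caller
// using golang.org/x/net/idna could pass idna.Lookup.ToASCII here.
func safeHost(raw string, toASCII func(string) (string, error)) (string, error) {
	mapped, err := toASCII(raw)
	if err != nil {
		return "", err
	}
	return recheckMapped(mapped)
}

func main() {
	// stub stands in for the IDNA mapping and returns the folded form
	// described above, trailing dot included.
	stub := func(string) (string, error) { return "0.1.0.0.", nil }

	_, err := safeHost("ignored by the stub", stub)
	fmt.Println(err)
}
```

Swapping the two steps, or dropping the trim, lets the trailing-dot form through, because the parse fails and the hostname is then treated as a DNS name.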

Verification

  • semgrep --validate --config go/lang/security/audit/idna-ip-literal-smuggle.yaml on Semgrep 1.161.0: clean.
  • semgrep --config <rule> idna-ip-literal-smuggle.go: 23 findings, exactly matching the 23 // ruleid: markers.
  • Sweep across golang/go, kubernetes/kubernetes, and prometheus/prometheus (660 MB of Go source, three projects with a high incidence of host-string handling): zero findings outside the fixture. The canonical real-world hit shape, where the IDNA call and the sink are split across a wrapper function, is intentionally out of scope for the OSS tier and is covered by the companion CodeQL query.
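The split shape mentioned in the last item can be sketched as follows. All names here (mapper, canonicalHost, dialTarget) are hypothetical. In real code the helper would call a profile method such as idna.Lookup.ToASCII directly; the function-typed parameter only keeps the sketch free of external modules. What matters is the structure: the mapping happens in one function and the sink in another, so a taint analysis confined to a single function body never sees both.

```go
package main

import (
	"fmt"
	"net"
)

// mapper stands in for a UTS-46 mapping call such as idna.Lookup.ToASCII.
type mapper func(string) (string, error)

// canonicalHost is a hypothetical one-deep helper: the mapping call lives
// here, and no sink is visible in this function.
func canonicalHost(raw string, toASCII mapper) string {
	mapped, err := toASCII(raw)
	if err != nil {
		return raw
	}
	return mapped
}

// dialTarget is a hypothetical caller: the sink lives here, and no mapping
// call is visible in this function. There is no post-mapping recheck.
func dialTarget(raw string, toASCII mapper, dial func(addr string) error) error {
	host := canonicalHost(raw, toASCII)
	return dial(net.JoinHostPort(host, "80"))
}

func main() {
	// Stub mapping and stub sink, for illustration only.
	stubMap := func(string) (string, error) { return "0.1.0.0", nil }
	stubDial := func(addr string) error {
		fmt.Println("dialing", addr)
		return nil
	}
	_ = dialTarget("untrusted input", stubMap, stubDial)
}
```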

Cross-language CVE precedent

  • CVE-2021-29923 (Go octal IPv4 literal SSRF). Established the post-parse-recheck contract for a different normalization vector.
  • CVE-2024-12224 (Servo idna crate): the same UTS-46 NFKC digit-fold surface, demonstrated cross-runtime.
  • CVE-2024-3651 (kjd/idna Python): related Unicode normalization preprocessing surface in a sibling IDNA implementation.

The pattern is not Go-specific; the rule is. Each ecosystem needs its own static-analysis vehicle.

Upstream library disposition

The Go security team reviewed an advisory for this anti-pattern and declined to treat it as a library bug. The position is internally consistent: idna.Lookup.ToASCII is documented to implement UTS-46 mapping, and the post-mapping IP-literal recheck is a caller responsibility that the spec does not require the library to perform. Caller-side static analysis is the right vehicle for the gap, which is what this rule provides.

Companion PR

The CodeQL community-pack PR carries the inter-procedural-recall companion query: https://github.com/github/codeql/pull/21784. Reviewers may find that pack useful for understanding the full coverage envelope.

Pro and experimental variants

The OSS rule submitted here ships with intra-procedural taint only. An interfile (interfile: true) variant and an experimental field-name regex source-set widening (covering Endpoint, Server, Address, Addr, Target, Upstream, Origin field names in addition to Host/Hostname) live in the upstream rule repository at https://github.com/astrogilda/idna-ip-literal-smuggle-rules. They are not bundled into this PR because the registry convention in go/lang/security/ is one rule per topic; operators who need cross-file recall can pull the variant directly from the upstream repository.
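To make the widened source set concrete, the field names listed above can be expressed as a single anchored regular expression. This is a hypothetical reconstruction for illustration; the actual pattern in the upstream repository may differ, and hostFieldName and isHostLikeField are invented names.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostFieldName is a hypothetical regex covering Host/Hostname plus the
// seven additional field names named in the PR description. It is anchored
// so that names such as "Hostile" or "Addresses" do not match.
var hostFieldName = regexp.MustCompile(
	`^(Host|Hostname|Endpoint|Server|Address|Addr|Target|Upstream|Origin)$`,
)

// isHostLikeField reports whether a struct field name falls in the widened
// source set.
func isHostLikeField(name string) bool {
	return hostFieldName.MatchString(name)
}

func main() {
	for _, name := range []string{"Host", "Endpoint", "Upstream", "Port", "Hostile"} {
		fmt.Printf("%-8s %v\n", name, isHostLikeField(name))
	}
}
```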

Severity calibration

WARNING, not ERROR. A v0.1.x sweep across golang/go, kubernetes/kubernetes, and prometheus/prometheus produced zero alerts at the OSS tier, because production Go callers wrap idna.Lookup.ToASCII in a one-deep helper and intra-procedural taint cannot step through. The OSS tier is therefore a high-precision direct-call sweep with very low alert volume in practice; the recall vehicle is the inter-procedural CodeQL companion. Promoting to ERROR would not change the alert volume on real codebases but would change the failure mode on edge cases. confidence: MEDIUM, likelihood: MEDIUM, impact: HIGH reflect the specific attack: the precondition (caller passes user input into a UTS-46 profile) is not universal but is well-defined, and the consequence (SSRF against metadata or internal services) is severe.

Out of scope (documented in the rule message)

  • Bare idna.ToASCII(x) package-level helper. Punycode profile, nil mapping, no fold surface.
  • WHATWG-integrated URL parsers (url.Parse only, no direct idna.*.ToASCII call). Already validated post-decode.
  • IPv4-mapped IPv6 macro smuggles. Different normalization class, separate rule.

CLA

Will sign on first PR comment via cla-assistant.


Canonical source for these artefacts: https://github.com/astrogilda/idna-ip-literal-smuggle-rules

Detect untrusted hostnames that flow through golang.org/x/net/idna
UTS-46 ToASCII mapping and reach a network sink without a post-mapping
IP-literal recheck. UTS-46 NFKC mapping folds 100 non-ASCII codepoints
across 8 classes to ASCII digits 0-9, allowing inputs like
"0.<superscript-1>.0.0" to pass a pre-IDNA net.ParseIP check, map to
"0.1.0.0", and be dialed as if they were DNS names.

Taint mode with two labels (PRE_IDNA, POST_IDNA). Sanitizer requires a
trailing-dot trim followed by netip.ParseAddr or net.ParseIP, in that
order; without the trim the recheck silently passes the multi-trailing-
dot variant.

Ships with one Go fixture: 21 ruleid markers covering all 8 fold
classes plus 6 ok markers covering both compliant trim variants, the
Punycode out-of-scope case, and a Devanagari Valid-but-not-mapped
negative-control. Validates on Semgrep 1.161.0; finding count on the
fixture matches the marker count exactly.

Signed-off-by: Sankalp Gilda <sankalp.gilda@gmail.com>
@CLAassistant

CLAassistant commented May 4, 2026

CLA assistant check
All committers have signed the CLA.

The library runs UTS-46 mapping in validateAndMap before the
encode-vs-decode branch, so (*Profile).ToUnicode produces the same
digit-folded ASCII output as (*Profile).ToASCII for in-scope
codepoints. Empirically verified against golang.org/x/net/idna v0.53.0
on all three digit-folding profiles. Without the ToUnicode patterns
the rule would miss any caller that uses ToUnicode for hostname
canonicalisation and then passes the result to a network sink.

Rule changes:
- Add ToUnicode patterns alongside ToASCII for each named profile
  (idna.Lookup, idna.Display, idna.Registration, idna.New(...))
  and for the var-bound *idna.Profile receiver fallback.
- Drop the redundant "Pattern: ..." paragraph that merely restated the
  opening sentence.
- Update the family-count claim ("8 classes") to the seven
  Unicode-block ranges that account for the 100 fold codepoints, and
  add an explicit note that Devanagari digits (U+0966-096F) are not
  in scope: empirical testing confirms they pass through Punycode
  rather than fold to ASCII.

Test fixture: add Class 8 (P8a, P8b) covering Lookup.ToUnicode and
Display.ToUnicode. Verified against semgrep 1.161.0: 23 findings on
the rule, up from 21 before this commit.
@astrogilda
Author

Pushed a follow-up extending the rule scope and tightening the message (93f728d0).

The library runs UTS-46 mapping in validateAndMap before the encode-vs-decode branch, so (*Profile).ToUnicode produces the same digit-folded ASCII output as (*Profile).ToASCII for in-scope codepoints. Empirically verified against golang.org/x/net/idna v0.53.0: Lookup.ToUnicode("0.¹.0.0") returns "0.1.0.0", and Display.ToUnicode behaves identically. Without the ToUnicode patterns the rule would miss any caller that uses ToUnicode for hostname canonicalisation and then passes the result to a network sink.

What changed:

  • Rule: added ToUnicode patterns alongside ToASCII for each named profile (idna.Lookup, idna.Display, idna.Registration, idna.New(...)) and for the var-bound *idna.Profile receiver fallback.
  • Test fixture: added two ToUnicode positives (Lookup.ToUnicode -> http.Get, Display.ToUnicode -> net.LookupHost). Expected positive count moves from 21 to 23. Verified against semgrep 1.161.0.
  • Rule message: dropped a redundant "Pattern: ..." paragraph that merely restated the opening sentence.
  • Rule message: replaced the "8 classes" claim with the seven Unicode-block ranges that account for the 100 fold codepoints, and added a note that Devanagari digits (U+0966-096F) are not in scope. The negative-control N6 already pinned the Devanagari exclusion in the fixture.

PR description updated to reflect the new scope.
