Go: idna-ip-literal-smuggle (UTS-46 NFKC digit-fold SSRF) by astrogilda · Pull Request #3841 · semgrep/semgrep-rules

astrogilda · 2026-05-04T06:44:09Z

Go: idna-ip-literal-smuggle (UTS-46 NFKC digit-fold SSRF)

Summary

New Go security rule at go/lang/security/audit/idna-ip-literal-smuggle.yaml detecting an SSRF anti-pattern where an untrusted hostname is mapped through golang.org/x/net/idna UTS-46 mapping ((*Profile).ToASCII or (*Profile).ToUnicode on a digit-folding profile) and then reaches a network sink with no post-mapping IP-literal recheck.
Taint mode, two labels (PRE_IDNA, POST_IDNA), label transformation at the IDNA call site, intra-procedural. Sanitizer requires a trailing-dot trim plus netip.ParseAddr or net.ParseIP recheck, in that order.
Ships with a single Go fixture (idna-ip-literal-smuggle.go) carrying 23 // ruleid: markers covering all seven in-scope Unicode-block fold ranges plus the ToUnicode entry-point fold, plus 6 // ok: negatives covering both compliant trim variants, the Punycode out-of-scope case, and a Devanagari Valid-but-not-mapped negative control.

Motivation

The rule complements an inter-procedural CodeQL companion query in the CodeQL community pack (PR https://github.com/github/codeql/pull/21784). The split is deliberate: Semgrep OSS provides a high-precision direct-call sweep that runs on free-tier CI, while CodeQL provides the inter-procedural recall vehicle for shapes where the IDNA call and the sink are wrapped behind an intermediate function (the canonical real-world canonicalAddr-style wrapper). The two engines hit different points on the precision/recall frontier; both are needed for a complete caller-side audit. This design is documented in docs/research/v0.1-detection-strategy.md of the upstream rule repository.

Threat model

idna.Lookup, idna.Display, idna.Registration, and any idna.New(idna.MapForLookup(), ...) profile apply the UTS-46 NFKC compatibility mapping before producing ASCII output. The fold runs in both (*Profile).ToASCII and (*Profile).ToUnicode because the library executes validateAndMap before the encode-vs-decode branch. Enumerating the Unicode 16 table shipped with golang.org/x/text v0.21.0 yields 100 codepoints, partitioned into seven Unicode-block ranges, that fold to ASCII digits 0-9: Latin-1 superscripts (3), mathematical superscripts (7), mathematical subscripts (10), circled digits (10), fullwidth digits (10), the Mathematical Alphanumeric Symbols block (50), and segmented digits (10). Devanagari digits (U+0966..U+096F) are not in scope: empirically verified against golang.org/x/net/idna v0.53.0, they are Punycode-encoded rather than folded to ASCII. An attacker-controlled hostname containing one of the in-scope codepoints passes a pre-IDNA net.ParseIP check (the raw string is not an ASCII IP literal, so net.ParseIP returns nil), maps to an ASCII IP literal, and reaches the sink as if it were a DNS name. The result is SSRF against loopback, RFC 1918, link-local, or cloud metadata ranges.

The trailing-dot trim is required for the post-IDNA recheck to work. idna.Lookup.ToASCII("0.<superscript-1>.0.0.") returns "0.1.0.0.", and net.ParseIP returns nil for that string because of the trailing dot. A recheck without the trim therefore concludes that the host is not an IP literal, lets it through silently, and the smuggle survives. The sanitizer pattern requires both predicates, in order.

Verification

semgrep --validate --config go/lang/security/audit/idna-ip-literal-smuggle.yaml on Semgrep 1.161.0: clean.
semgrep --config <rule> idna-ip-literal-smuggle.go: 23 findings, exactly matching the 23 // ruleid: markers.
Sweep across golang/go, kubernetes/kubernetes, and prometheus/prometheus (660 MB of Go source, three projects with a high incidence of host-string handling): zero findings outside the fixture. The canonical real-world hit shape, where the IDNA call and the sink are split across a wrapper function, is intentionally out of scope for the OSS tier and is covered by the companion CodeQL query.

Cross-language CVE precedent

CVE-2021-29923 (Go octal IPv4 literal SSRF): established the post-parse-recheck contract for a different normalization vector.
CVE-2024-12224 (Servo idna crate): the same UTS-46 NFKC digit-fold surface, demonstrated cross-runtime.
CVE-2024-3651 (kjd/idna Python): related Unicode normalization preprocessing surface in a sibling IDNA implementation.

The pattern is not Go-specific; the rule is. Each ecosystem needs its own static-analysis vehicle.

Upstream library disposition

The Go security team reviewed an advisory for this anti-pattern and declined to treat it as a library bug. The position is internally consistent: idna.Lookup.ToASCII is documented to implement UTS-46 mapping, and the post-mapping IP-literal recheck is a caller responsibility that the spec does not require the library to perform. Caller-side static analysis is the right vehicle for the gap, which is what this rule provides.

Companion PR

The CodeQL community-pack PR carries the inter-procedural-recall companion query: https://github.com/github/codeql/pull/21784. Reviewers may find that pack useful for understanding the full coverage envelope.

Pro and experimental variants

The OSS rule submitted here ships with intra-procedural taint only. Two further variants live in the upstream rule repository at https://github.com/astrogilda/idna-ip-literal-smuggle-rules: an interfile (interfile: true) variant, and an experimental variant that widens the source set with a field-name regex (covering Endpoint, Server, Address, Addr, Target, Upstream, and Origin field names in addition to Host/Hostname). They are not bundled into this PR because the registry convention in go/lang/security/ is one rule per topic; operators who need cross-file recall can pull the variant directly from the upstream repository.

Severity calibration

WARNING, not ERROR. A v0.1.x sweep across golang/go, kubernetes/kubernetes, and prometheus/prometheus produced zero alerts at the OSS tier, because production Go callers wrap idna.Lookup.ToASCII in a one-deep helper and intra-procedural taint cannot step through. The OSS tier is therefore a high-precision direct-call sweep with very low alert volume in practice; the recall vehicle is the inter-procedural CodeQL companion. Promoting to ERROR would not change the alert volume on real codebases but would change the failure mode on edge cases. confidence: MEDIUM, likelihood: MEDIUM, impact: HIGH reflect the specific attack: the precondition (caller passes user input into a UTS-46 profile) is not universal but is well-defined, and the consequence (SSRF against metadata or internal services) is severe.

Out of scope (documented in the rule message)

Bare idna.ToASCII(x) package-level helper. Punycode profile, nil mapping, no fold surface.
WHATWG-integrated URL parsers (url.Parse only, no direct idna.*.ToASCII call). Already validated post-decode.
IPv4-mapped IPv6 macro smuggles. Different normalization class, separate rule.

CLA

Will sign on first PR comment via cla-assistant.

Canonical source for these artefacts: https://github.com/astrogilda/idna-ip-literal-smuggle-rules

Detect untrusted hostnames that flow through golang.org/x/net/idna UTS-46 ToASCII mapping and reach a network sink without a post-mapping IP-literal recheck. UTS-46 NFKC mapping folds 100 non-ASCII codepoints across 8 classes to ASCII digits 0-9, allowing inputs like "0.<superscript-1>.0.0" to pass a pre-IDNA net.ParseIP check, map to "0.1.0.0", and be dialed as if they were DNS names.

Taint mode with two labels (PRE_IDNA, POST_IDNA). Sanitizer requires a trailing-dot trim followed by netip.ParseAddr or net.ParseIP, in that order; without the trim the recheck silently passes the multi-trailing-dot variant.

Ships with one Go fixture: 21 ruleid markers covering all 8 fold classes plus 6 ok markers covering both compliant trim variants, the Punycode out-of-scope case, and a Devanagari Valid-but-not-mapped negative-control. Validates on Semgrep 1.161.0; finding count on the fixture matches the marker count exactly.

Signed-off-by: Sankalp Gilda <sankalp.gilda@gmail.com>

CLAassistant · 2026-05-04T06:44:18Z

All committers have signed the CLA.

The library runs UTS-46 mapping in validateAndMap before the encode-vs-decode branch, so (*Profile).ToUnicode produces the same digit-folded ASCII output as (*Profile).ToASCII for in-scope codepoints. Empirically verified against golang.org/x/net/idna v0.53.0 on all three digit-folding profiles. Without the ToUnicode patterns the rule would miss any caller that uses ToUnicode for hostname canonicalisation and then passes the result to a network sink.

Rule changes:
- Add ToUnicode patterns alongside ToASCII for each named profile (idna.Lookup, idna.Display, idna.Registration, idna.New(...)) and for the var-bound *idna.Profile receiver fallback.
- Drop the redundant "Pattern: ..." paragraph that paraphrased the opening sentence verbatim.
- Update the family-count claim ("8 classes") to the seven Unicode-block ranges that account for the 100 fold codepoints, and add an explicit note that Devanagari digits (U+0966-096F) are not in scope: empirical testing confirms they pass through Punycode rather than fold to ASCII.

Test fixture: add Class 8 (P8a, P8b) covering Lookup.ToUnicode and Display.ToUnicode.

Verified against semgrep 1.161.0: 23 findings on the rule, up from 21 before this commit.

astrogilda · 2026-05-04T14:40:17Z

Pushed a follow-up extending the rule scope and tightening the message (93f728d0).

The library runs UTS-46 mapping in validateAndMap before the encode-vs-decode branch, so (*Profile).ToUnicode produces the same digit-folded ASCII output as (*Profile).ToASCII for in-scope codepoints. Empirically verified against golang.org/x/net/idna v0.53.0: Lookup.ToUnicode("0.¹.0.0") returns "0.1.0.0", and Display.ToUnicode behaves identically. Without the ToUnicode patterns the rule would miss any caller that uses ToUnicode for hostname canonicalisation and then passes the result to a network sink.

What changed:

Rule: added ToUnicode patterns alongside ToASCII for each named profile (idna.Lookup, idna.Display, idna.Registration, idna.New(...)) and for the var-bound *idna.Profile receiver fallback.
Test fixture: added two ToUnicode positives (Lookup.ToUnicode -> http.Get, Display.ToUnicode -> net.LookupHost). Expected positive count moves from 21 to 23. Verified against semgrep 1.161.0.
Rule message: dropped a redundant "Pattern: ..." paragraph that paraphrased the opening sentence verbatim.
Rule message: replaced the "8 classes" claim with the seven Unicode-block ranges that account for the 100 fold codepoints, and added a note that Devanagari digits (U+0966-096F) are not in scope. The negative-control N6 already pinned the Devanagari exclusion in the fixture.

PR description updated to reflect the new scope.

astrogilda wants to merge 2 commits into semgrep:main from astrogilda:idna-ip-literal-smuggle-go
