Skip to content

Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718

Closed
iliaal wants to merge 1 commit intophp:masterfrom
iliaal:fix/gh12703-parse-url-colon-path
Closed

Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718
iliaal wants to merge 1 commit intophp:masterfrom
iliaal:fix/gh12703-parse-url-colon-path

Conversation

@iliaal
Copy link
Copy Markdown
Contributor

@iliaal iliaal commented Apr 10, 2026

Fixes #12703

Two changes in the same commit, both in ext/standard/url.c.

Bug fix: GH-12703

parse_url('/page:1') returns false instead of ['path' => '/page:1'], and parse_url('//www.example.com/foo:65535/') reports a spurious port=65535 that should never exist (the 65535 is part of the path, not an authority port).

Both bugs stem from the scheme-walk error branch routing /-prefixed inputs into the parse_port fallback. A leading / can't start a scheme, so the input is either a path or a //host authority, never a host-less :port form.

The guard *s != '/' on the parse_port branch sends /page:1 to just_path and //www.example.com/foo:65535/ to the //host branch that follows it, both of which produce the correct output.

This matches vdauchy's abandoned PR #12704 (closed without technical objections, author cited the URL parsing API RFC). The RFC doesn't deprecate parse_url() in 8.5, and parse_url() remains in use by FTP stream wrappers, SoapClient, and session URL rewriting, so the fix is worth landing on master.

Uri\Rfc3986\Uri::parse() from the 8.5 URI extension already handles both cases correctly today. After this fix, parse_url() agrees with the RFC 3986 parser on 7 of 9 cases I tested (the remaining two, page:1 and ./page:1, are long-standing legacy quirks out of scope).

Optimization: php_replace_controlchars

Each parse_url() call invokes php_replace_controlchars once per parsed component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes through iscntrl(). iscntrl() goes through __ctype_b_loc() for a locale-aware lookup on every byte. Callgrind showed this accounted for ~14% of total instructions on a realistic URL workload.

Replace the call with an inline *s < 0x20 || *s == 0x7f compare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner already uses the same pattern (yych <= 0x1F), and ext/standard/quot_print.c uses iscntrl(c) || c == 0x7f for the same reason (the author didn't trust iscntrl to include DEL across locales).

Benchmark

Release build, 1M iterations per case, best-of-5, 13 realistic URL shapes:

Case len before after speedup
short_nopath (http://example.com) 18 76.7 ns 65.0 ns 15.3%
typical_web (https://www.example.com/path/to/page) 36 100.2 ns 78.3 ns 21.9%
with_query (https://api.example.com/v1/users?id=42&format=json) 50 117.4 ns 94.6 ns 19.4%
full_featured (user:pass@host:port/path?q#f) 83 181.9 ns 147.5 ns 18.9%
path_with_colons 44 102.6 ns 84.9 ns 17.3%
relative_scheme (//www.example.com/path/to/page) 30 80.8 ns 67.9 ns 16.0%
absolute_path (/page:1?foo=bar) 15 67.7 ns 57.6 ns 14.9%
short_absolute_path (/foo) 4 46.7 ns 40.3 ns 13.7%
ipv6 (http://[2001:db8::1]:8080/path) 30 113.6 ns 96.2 ns 15.3%
mailto (mailto:user@example.com) 23 68.6 ns 60.2 ns 12.2%
file_url (file:///home/user/file.txt) 26 75.2 ns 63.5 ns 15.6%
long_query (96-byte search URL) 96 141.8 ns 123.0 ns 13.3%
long_path (81-byte asset URL) 81 123.0 ns 99.0 ns 19.5%

Average: 16% faster. No behavior change in the C/POSIX locale.

Tests

New ext/standard/tests/url/gh12703.phpt covers both bugs with correct expected values (not the buggy port expectations locked into PR #12704's original test). Includes edge cases /:, /:80, //host/a:1/b:2/, IPv6-like inputs, and a full scheme://user:pass@host:port/path?query#fragment regression guard.

  • Test verified to fail on unfixed master and pass on the fixed build via git stash round-trip
  • ext/standard/tests/url/ — 32/32 pass, 0 regressions
  • ext/uri/tests/ — 309/309 pass (the 8.5 URI extension wraps parse_url, so wrapping regression check)

Two changes to ext/standard/url.c.

1. Fix phpGH-12703: parse_url mishandles colon inside path.

   parse_url('/page:1') returns false instead of the path, and
   parse_url('//www.example.com/foo:65535/') reports a spurious
   port of 65535. Both stem from the scheme walk's error branch
   routing /-prefixed inputs into the `parse_port` fallback.

   Guard the `parse_port` branch with `*s != '/'`. A leading slash
   can't start a scheme, so the input must be a path or a //host
   authority. The existing branch ordering then lets the //host
   check catch `//...` inputs and `just_path` catch the rest.

2. Speed up php_replace_controlchars on typical URLs.

   Each parse calls php_replace_controlchars once per component
   (scheme, host, user, pass, path, query, fragment), and each
   call walked the bytes through iscntrl(). iscntrl() hits
   __ctype_b_loc() for a locale-aware lookup, and callgrind showed
   it at ~14% of total instructions on a realistic URL workload.

   Replace the call with an inline `*s < 0x20 || *s == 0x7f`
   compare. URL components are bytes, not locale-dependent text,
   so C-locale semantics are what we want regardless of the
   process locale. The Zend language scanner uses the same pattern
   (`yych <= 0x1F`).

   Benchmark across 13 realistic URL shapes: 12-22% faster, 16%
   average. No behavior change in the C/POSIX locale.

Closes phpGH-12703
@iliaal iliaal force-pushed the fix/gh12703-parse-url-colon-path branch from 62b73ec to 854136f Compare April 11, 2026 13:22
@TimWolla
Copy link
Copy Markdown
Member

This PR is making two unrelated changes, it should be split up. But in any case, we shouldn't make any further functional changes to parse_url(), since at this point the URLs expected to be parsed by parse_url() are the URLs parsed by parse_url().

Users should migrate to the new ext/uri classes (Uri\* namespace) instead. See also php/doc-en#5477.

@TimWolla TimWolla closed this Apr 11, 2026
@iliaal
Copy link
Copy Markdown
Contributor Author

iliaal commented Apr 11, 2026

@TimWolla I can split performance related changes, if it'll be considered? Seems like 10%+ speed increase for a commonly used function is still of value, and https://www.php.net/manual/en/function.parse-url.php doesn't imply function is being deprecated or at risk of being so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Colon in url path is not parsed by parse_url

2 participants