Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718
Closed
iliaal wants to merge 1 commit intophp:masterfrom
Closed
Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718iliaal wants to merge 1 commit intophp:masterfrom
iliaal wants to merge 1 commit intophp:masterfrom
Conversation
Two changes to ext/standard/url.c. 1. Fix phpGH-12703: parse_url mishandles colon inside path. parse_url('/page:1') returns false instead of the path, and parse_url('//www.example.com/foo:65535/') reports a spurious port of 65535. Both stem from the scheme walk's error branch routing /-prefixed inputs into the `parse_port` fallback. Guard the `parse_port` branch with `*s != '/'`. A leading slash can't start a scheme, so the input must be a path or a //host authority. The existing branch ordering then lets the //host check catch `//...` inputs and `just_path` catch the rest. 2. Speed up php_replace_controlchars on typical URLs. Each parse calls php_replace_controlchars once per component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes through iscntrl(). iscntrl() hits __ctype_b_loc() for a locale-aware lookup, and callgrind showed it at ~14% of total instructions on a realistic URL workload. Replace the call with an inline `*s < 0x20 || *s == 0x7f` compare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner uses the same pattern (`yych <= 0x1F`). Benchmark across 13 realistic URL shapes: 12-22% faster, 16% average. No behavior change in the C/POSIX locale. Closes phpGH-12703
62b73ec to
854136f
Compare
Member
|
This PR is making two unrelated changes, it should be split up. But in any case, we shouldn't make any further functional changes to Users should migrate to the new ext/uri classes ( |
Contributor
Author
|
@TimWolla I can split performance related changes, if it'll be considered? Seems like 10%+ speed increase for a commonly used function is still of value, and https://www.php.net/manual/en/function.parse-url.php doesn't imply function is being deprecated or at risk of being so. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #12703
Two changes in the same commit, both in
ext/standard/url.c.Bug fix: GH-12703
parse_url('/page:1')returnsfalseinstead of['path' => '/page:1'], andparse_url('//www.example.com/foo:65535/')reports a spuriousport=65535that should never exist (the65535is part of the path, not an authority port).Both bugs stem from the scheme-walk error branch routing
/-prefixed inputs into theparse_portfallback. A leading/can't start a scheme, so the input is either a path or a//hostauthority, never a host-less:portform.The guard
*s != '/'on theparse_portbranch sends/page:1tojust_pathand//www.example.com/foo:65535/to the//hostbranch that follows it, both of which produce the correct output.This matches vdauchy's abandoned PR #12704 (closed without technical objections, author cited the URL parsing API RFC). The RFC doesn't deprecate
parse_url()in 8.5, andparse_url()remains in use by FTP stream wrappers, SoapClient, and session URL rewriting, so the fix is worth landing on master.Uri\Rfc3986\Uri::parse()from the 8.5 URI extension already handles both cases correctly today. After this fix,parse_url()agrees with the RFC 3986 parser on 7 of 9 cases I tested (the remaining two,page:1and./page:1, are long-standing legacy quirks out of scope).Optimization: php_replace_controlchars
Each
parse_url()call invokesphp_replace_controlcharsonce per parsed component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes throughiscntrl().iscntrl()goes through__ctype_b_loc()for a locale-aware lookup on every byte. Callgrind showed this accounted for ~14% of total instructions on a realistic URL workload.Replace the call with an inline
*s < 0x20 || *s == 0x7fcompare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner already uses the same pattern (yych <= 0x1F), andext/standard/quot_print.cusesiscntrl(c) || c == 0x7ffor the same reason (the author didn't trustiscntrlto include DEL across locales).Benchmark
Release build, 1M iterations per case, best-of-5, 13 realistic URL shapes:
http://example.com)https://www.example.com/path/to/page)https://api.example.com/v1/users?id=42&format=json)//www.example.com/path/to/page)/page:1?foo=bar)/foo)http://[2001:db8::1]:8080/path)mailto:user@example.com)file:///home/user/file.txt)Average: 16% faster. No behavior change in the C/POSIX locale.
Tests
New
ext/standard/tests/url/gh12703.phptcovers both bugs with correct expected values (not the buggy port expectations locked into PR #12704's original test). Includes edge cases/:,/:80,//host/a:1/b:2/, IPv6-like inputs, and a full scheme://user:pass@host:port/path?query#fragment regression guard.git stashround-tripext/standard/tests/url/— 32/32 pass, 0 regressionsext/uri/tests/— 309/309 pass (the 8.5 URI extension wraps parse_url, so wrapping regression check)