Make WordPress Core

Opened 7 weeks ago

Last modified 6 weeks ago

#64457 new enhancement

Early filter invalid hosts in wp_http_validate_url

Reported by: SirLouen
Owned by:
Milestone: Future Release
Priority: normal
Severity: normal
Version:
Component: HTTP API
Keywords: needs-patch
Focuses: performance
Cc:

Description (last modified by SirLouen)

A small performance improvement in wp_http_validate_url: return early when the host contains invalid values.

Theoretically, a hostname check for a TLD with underscores will never succeed when calling gethostbyname. Given how expensive that function is, returning early when $host contains any invalid value may save the call for certain malformed URLs.

Attachments (1)

64457.diff (1.2 KB) - added by manhphucofficial 7 weeks ago.


Change History (23)

#1 @westonruter
7 weeks ago

Why underscores alone? Shouldn't it short-circuit for any character which isn't allowed in a host name?

#2 @manhphucofficial
7 weeks ago

That’s a fair question, and I agree with the underlying point.

The reason the ticket originally mentions underscores specifically is mostly practical: _ is a very common mistake in hostnames, and it’s a case where we can be confident that gethostbyname() will never succeed. So the idea was to avoid that call early in an obviously invalid case, purely as a small performance win.

But you’re right that, once we start thinking about this more generally, singling out underscores feels a bit arbitrary. If we’re going to short-circuit at this stage, it probably makes more sense to do a broader “is this a valid hostname at all?” check and bail early for any disallowed characters, rather than hard-coding one specific case.

I’m open to adjusting the scope of the ticket in that direction if that’s the better approach.

#3 @SirLouen
7 weeks ago

  • Keywords needs-patch added; 2nd-opinion removed
  • Milestone changed from Awaiting Review to Future Release

@westonruter true, I was triaging a GB report which happened to not only enable underscores but also test truthy for underscore domains and I was a little TF?

In fact, I was thinking that we should be filtering input with FILTER_VALIDATE_DOMAIN

@manhphucofficial if you want, take this.

Last edited 7 weeks ago by SirLouen (previous) (diff)

#4 @SirLouen
7 weeks ago

  • Description modified (diff)
  • Summary changed from Avoid underscores for hosts in wp_http_validate_url to Early filter invalid hosts in wp_http_validate_url

#5 @manhphucofficial
7 weeks ago

Thanks for updating the summary and description — that aligns well with the direction discussed.

I’ll work on a small patch using early hostname validation (e.g. FILTER_VALIDATE_DOMAIN) and follow up with tests.

#7 in reply to: ↑ 6 @SirLouen
7 weeks ago

Replying to westonruter:

Unless the situation has changed, I don't believe we can use filter_var(). See https://github.com/WordPress/wordpress-develop/blob/1bd29b14806f471f3ba1df0dc0e86b6aaae27b1e/src/wp-includes/functions.php#L7326-L7327

Weird

But like iconv, filter is recommended but not forced
https://make.wordpress.org/hosting/handbook/server-environment/

In this case, as we can see in other places, what I would do is a conditional function_exists check: those who have the extension get a small performance gain; those who don't will have to rely on gethostbyname.

Meanwhile, I'm going to ask in meta for the % of installations with filter in place. I can't believe it is anything below 99% nowadays (iconv is 99%, but I never disputed it, because nowadays mbstring replaces almost 100% of the iconv usage).

Last edited 7 weeks ago by SirLouen (previous) (diff)

#8 @westonruter
7 weeks ago

I would support the function_exists() check and make sure the Filter extension is included among the suggested extensions, like cURL is. I see Filter is included in the Hosting handbook (as you noted already): https://make.wordpress.org/hosting/handbook/server-environment/#php-extensions

So yes, I think we should be safe to use Filter, if we add safeguards for it not being enabled.
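
The safeguard discussed above could look roughly like the following. This is a minimal sketch, not the code from the attached patch: the helper name sketch_host_is_definitely_invalid() is hypothetical, and it assumes that a missing Filter extension should mean "cannot tell", so the caller keeps its existing gethostbyname() path.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress core or the attached patch).
 *
 * Returns true only when the Filter extension is available AND it rejects
 * the host. Without the extension the answer is "unknown", so the caller
 * keeps its existing gethostbyname() behavior.
 */
function sketch_host_is_definitely_invalid( string $host ): bool {
	if ( ! function_exists( 'filter_var' ) ) {
		return false; // Filter not available: no early exit.
	}

	// FILTER_FLAG_HOSTNAME restricts labels to letters, digits and hyphens.
	return false === filter_var( $host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME );
}

$host = 'bad_host.example'; // Illustrative value only.
if ( sketch_host_is_definitely_invalid( $host ) ) {
	echo "skip DNS lookup\n";
} else {
	echo "fall back to gethostbyname()\n";
}
```

Note that a check this strict treats every underscore as invalid, wherever it appears in the host.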

Last edited 7 weeks ago by westonruter (previous) (diff)

#9 @manhphucofficial
7 weeks ago

Patch attached.
Uses early hostname validation via Filter when available, with a fallback to the existing behavior otherwise. Includes a test for underscore hostnames.

#10 @westonruter
7 weeks ago

  • Priority changed from low to normal

@manhphucofficial please put that patch in a pull request so we can better review and see the tests passing on all environments.

This ticket was mentioned in PR #10669 on WordPress/wordpress-develop by @manhphucofficial.


7 weeks ago
#11

  • Keywords has-patch has-unit-tests added; needs-patch removed

Fixes #64457.

Adds early hostname validation using the Filter extension when available, while falling back to the existing behavior when it’s not. Includes a test case for underscore hostnames.

#12 @manhphucofficial
7 weeks ago

  • Keywords needs-patch added; has-patch has-unit-tests removed

Sure — I’ve opened a PR so the patch can be reviewed with CI:
https://github.com/WordPress/wordpress-develop/pull/10669

@manhphucofficial commented on PR #10669:


7 weeks ago
#13

Thanks for the review!

I’ve updated the patch to address all the points raised:

  • hostname validation now only applies when the host is not an IPv4 address
  • removed the FILTER_VALIDATE_IP check and related constant assumptions
  • added test coverage for underscores in hostnames

Please let me know if anything should be adjusted further. Appreciate you taking a look!

@manhphucofficial commented on PR #10669:


7 weeks ago
#14

Thanks for the feedback!

I’ve updated the patch to address all the points:

  • switched the check to extension_loaded( 'filter' ) as suggested
  • kept IPv4 handling separate to avoid affecting IP-based hosts
  • added a test case for a valid IP host (https://1.1.1.1/)

Happy to adjust further if there’s anything else you’d like me to refine.

@westonruter commented on PR #10669:


7 weeks ago
#15

@SirLouen what do you think?

#16 @peterwilsoncc
6 weeks ago

Story time...

In the lead-up to and during the early days of the Iraq war in the early 2000s, Salam Pax blogged anonymously about the goings-on in Iraq. Selected entries from his blog were subsequently released in a book titled The Baghdad Blog.

The blog, which is still available online, was hosted on Blogger at dear_raed.blogspot.com. The domain resolved at the time and continues to resolve now.

Arguably, Blogger should never have allowed sub-domains with underscores, but they did. I assume the same is true for other services as well.

My point is that the practical often differs from the theoretical, as is the case for DNS resolution. In its current form wp_http_validate_url() handles sub-domains with underscores and that will need to be the case in the future.

Testing the current pull request with the domain used by Salam Pax shows a change in behaviour that will need to be accounted for:

vagrant@wp-dev:/vagrant/wordpress-develop$ git checkout trunk 
Switched to branch 'trunk'
Your branch is up to date with 'origin/trunk'.
vagrant@wp-dev:/vagrant/wordpress-develop$ wp eval "var_dump( wp_http_validate_url( 'https://dear_raed.blogspot.com/' ) );"
eval()'d code:1:
string(31) "https://dear_raed.blogspot.com/"
vagrant@wp-dev:/vagrant/wordpress-develop$ git checkout 64457-early-filter-invalid-hosts 
Switched to branch '64457-early-filter-invalid-hosts'
vagrant@wp-dev:/vagrant/wordpress-develop$ wp eval "var_dump( wp_http_validate_url( 'https://dear_raed.blogspot.com/' ) );"
eval()'d code:1:
bool(false)

@westonruter commented on PR #10669:


6 weeks ago
#17

I asked Gemini to review the changes and it had some helpful feedback:


I have completed the review of the changes.

The changes introduce stricter validation for hostnames in wp_http_validate_url() using filter_var() with FILTER_VALIDATE_DOMAIN, which is a good improvement. However, I identified a critical regression regarding IPv6 support and a minor edge case with numeric hostnames.

### Review Findings

  1. Critical Issue: IPv6 Support Regression
    • Observation: The new validation block runs when $is_ipv4 is false. If the URL contains an IPv6 literal (e.g., http://[::1]/), $is_ipv4 will be false. The filter_var( '[::1]', FILTER_VALIDATE_DOMAIN, ... ) call returns false for bracketed IPv6 addresses, causing wp_http_validate_url to return false immediately.
    • Impact: Valid IPv6 URLs will be rejected.
    • Recommendation: The check should be skipped if the host appears to be an IPv6 literal. Since parse_url preserves brackets for IPv6 hosts, checking if $host starts with [ would be sufficient (e.g., && strpos( $host, '[' ) === false).
  2. Minor Issue: Handling of "0" Hostname
    • Observation: filter_var( '0', FILTER_VALIDATE_DOMAIN, ... ) returns the string "0". In PHP, ! "0" evaluates to true. This causes the check ! filter_var(...) to pass (evaluating as "invalid") for the hostname "0", returning false.
    • Recommendation: Use strict comparison === false to ensure only actual validation failures trigger the early return.
    • Code: && false === filter_var( ... )
  3. Tests
    • The added test case underscore_in_hostname correctly asserts that underscores are now invalid in domain names, which aligns with the FILTER_FLAG_HOSTNAME behavior.
    • Suggestion: It would be beneficial to add a test case for an IPv6 literal (e.g., http://[::1]/) to ensure this functionality is preserved and to prevent future regressions.
  4. Code Style & Compatibility
    • The code adheres to WordPress coding standards (indentation, spacing).
    • PHP 7.2 compatibility is maintained (FILTER_VALIDATE_DOMAIN is available since 7.0).

### Summary
The logic improvement is sound but needs to account for IPv6 literals to avoid breaking support for them. I recommend adjusting the condition to exclude IPv6 hosts and using strict comparison for the filter_var result.

I will not modify the code myself but I present these findings for the user to act upon.
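
The two recommendations in the review (skip IP literals, compare strictly against false) can be sketched as follows. This is an illustration under stated assumptions, not the code in PR #10669: the function name sketch_should_reject_early() and the simple dotted-quad pattern are hypothetical stand-ins for whatever the real function uses.

```php
<?php
/**
 * Hypothetical guard (illustrative only, not the code in PR #10669).
 * Returns true when the host should be rejected before any DNS lookup.
 */
function sketch_should_reject_early( string $host ): bool {
	// Empty host or missing extension: make no decision here.
	if ( '' === $host || ! extension_loaded( 'filter' ) ) {
		return false;
	}

	// Bracketed IPv6 literal such as [::1]: not a domain name, skip the check.
	if ( '[' === $host[0] ) {
		return false;
	}

	// Dotted-quad IPv4 (simplified pattern): handled elsewhere, skip as well.
	if ( 1 === preg_match( '/^\d{1,3}(\.\d{1,3}){3}$/', $host ) ) {
		return false;
	}

	// Strict comparison: only a real validation failure (false) rejects.
	return false === filter_var( $host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME );
}

foreach ( array( '[::1]', '1.1.1.1', 'example.org', 'dear_raed.blogspot.com' ) as $host ) {
	printf( "%s => %s\n", $host, var_export( sketch_should_reject_early( $host ), true ) );
}
```

With FILTER_FLAG_HOSTNAME the last host is still rejected because of its underscore, so this guard addresses the IPv6 and strict-comparison findings only.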

#18 @SirLouen
6 weeks ago

@peterwilsoncc wrote: "My point is that the practical often differs from the theoretical, as is the case for DNS resolution. In its current form wp_http_validate_url() handles sub-domains with underscores and that will need to be the case in the future."

Trying to create a blogspot account now with an underscore.

https://i.imgur.com/eJYlGo3.png

It appears that Google has evolved.

Luckily, @peterwilsoncc has a vast memory to recall one of those in ten million cases.

Still, if we want to play by Jurassic Park rules and keep the T-Rex from escaping the enclosure, we could validate only the scheme, domain and TLD parts, because in reality those are the only parts that stick to the actual RFC rules (beyond that, any subdomain level could technically be the jungle).

So (take notes for unit tests):

  1. h_ttp://example.org should be invalid
  2. hey_ho_lets_go._example.org should be invalid
  3. omg.c_om should be invalid
  4. peter_is_amazing.example.org should be VALID

Last edited 6 weeks ago by SirLouen (previous) (diff)
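
The host cases 2 to 4 listed in comment #18 can be expressed as a small check. This is a rough sketch, not the implementation in the pull request: the helper name sketch_is_host_shape_acceptable() is hypothetical, and it assumes the last two labels stand in for "domain + TLD". Suffixes such as co.uk would need a public suffix list, and case 1 (the scheme) is outside its scope.

```php
<?php
/**
 * Hypothetical helper, not the implementation in PR #10669.
 * Assumption: the last two labels are treated as "domain + TLD" and must
 * follow strict hostname rules; earlier labels may also contain underscores.
 */
function sketch_is_host_shape_acceptable( string $host ): bool {
	$labels = explode( '.', rtrim( $host, '.' ) );

	$strict = '/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/i';    // Letters, digits, hyphens.
	$loose  = '/^[a-z0-9_](?:[a-z0-9_-]{0,61}[a-z0-9_])?$/i'; // Same, plus underscores.

	$last_index = count( $labels ) - 1;
	foreach ( $labels as $i => $label ) {
		// The final two labels use the strict pattern, the rest the loose one.
		$pattern = ( $i >= $last_index - 1 ) ? $strict : $loose;
		if ( 1 !== preg_match( $pattern, $label ) ) {
			return false;
		}
	}

	return true;
}

$hosts = array(
	'hey_ho_lets_go._example.org',  // Underscore in the domain label.
	'omg.c_om',                     // Underscore in the TLD.
	'peter_is_amazing.example.org', // Underscore only in the subdomain.
);
foreach ( $hosts as $host ) {
	printf( "%s => %s\n", $host, var_export( sketch_is_host_shape_acceptable( $host ), true ) );
}
```

Under these assumptions the first two hosts come out as unacceptable and the third as acceptable, matching the expectations in the list.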

@SirLouen commented on PR #10669:


6 weeks ago
#19

@manhphuc check some additional suggestions in the Core Trac thread.

@SirLouen commented on PR #10669:


6 weeks ago
#20

@westonruter I suspected that Filter adoption is 100% by now, and that has been confirmed.

We can still be conservative, but I believe it's time to update the Core docs and simply add Filter to the set of mandatory extensions (and still no one will notice anything).

@SirLouen commented on PR #10669:


6 weeks ago
#21

@manhphuc check the core ticket, especially because the particular case you used for the unit test turned out to be contentious. You can add a bunch of extra unit tests, as I commented in the reply. I think we can move this forward.

@manhphucofficial commented on PR #10669:


6 weeks ago
#22

Thanks everyone for the detailed feedback and edge-case examples.

I’ve updated the hostname validation logic to avoid regressing legacy hosts that include underscores in subdomains (e.g. Blogspot), while still rejecting underscores in the registrable domain / TLD.

The implementation now:

  • Skips FILTER_VALIDATE_DOMAIN for IPv6 literals
  • Allows underscores in subdomains, but not in the registrable domain portion
  • Preserves existing behavior for valid legacy hosts

I’ve also added unit tests covering the cases discussed in the Trac thread.

All HTTP-related PHPUnit tests are passing locally.
