Make WordPress Core

Opened 3 hours ago

#64151 new enhancement

Improve maintainability and robustness of sanitize_title_with_dashes()

Reported by: westonruter Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 1.2
Component: Formatting Keywords: needs-patch
Focuses: Cc:

Description

This is a follow-up to #64089.

As discussed in PR #10204, the sanitize_title_with_dashes() function is difficult to maintain because it contains many URL-encoded characters and numeric HTML entities.

I tried to improve the maintainability in ff2d2a7, but as feedback from @dmsnell pointed out, my approach was not as robust as it could have been:

I strongly discourage replacements that attempt to match normative character references, or which mix UTF-8 characters and HTML character references. these lead to strange edge cases and can easily lead to situations where we cannot accomplish what should be allowable.

to that end if we want to make these replacements I would encourage backing up to the top of this function and replacing strip_tags() with a run through the HTML API to extract the title as decoded plaintext. once that’s done we can examine raw UTF-8 replacements and not have to concern ourselves if someone wrote &nbsp; or &nbsp or &#xA0; or &#0000000160 — all of these decode into the same U+00A0 code point.

If not wanting to reconsider this function more holistically, this can still be decoded as WP_HTML_Decoder::decode_text_node( $title ) before making these replacements. They can be done rather swiftly with strtr(). Further, since we are creating a static replacements array, we don’t have to use a potentially-missing runtime function to generate them: we can use Unicode string literals like \u{2011} for the patterns/matches.

Also a quick side note: HTML’s named character references are case-sensitive, so while I am guessing the use of str_ireplace() is to catch variations like &NBSP;, if it actually does that it will transform _plaintext_ content and not the placeholder for a no-break space.
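
The fallback suggested in the quoted feedback (decode character references first, then replace raw UTF-8 with strtr()) could look roughly like the sketch below. The function name example_decode_then_replace() and the particular code points in the map are hypothetical and chosen only for illustration. Outside WordPress, where WP_HTML_Decoder is not loaded, the sketch falls back to html_entity_decode() as an approximate stand-in, which is an assumption and does not decode every reference the HTML API would (for example, "&nbsp" without a semicolon).

```php
<?php
/**
 * Hypothetical helper, not part of WordPress core: decode character
 * references first, then replace raw UTF-8 code points in one pass.
 */
function example_decode_then_replace( string $title ): string {
	if ( class_exists( 'WP_HTML_Decoder' ) ) {
		// Inside WordPress: the decoder named in the feedback above.
		$title = WP_HTML_Decoder::decode_text_node( $title );
	} else {
		// Assumption: rough stand-in so the sketch runs outside WordPress.
		$title = html_entity_decode( $title, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	}

	// Static map built from Unicode string literals; no runtime helper needed.
	// The choice of code points here is illustrative only.
	static $replacements = array(
		"\u{00A0}" => '-', // NO-BREAK SPACE
		"\u{2011}" => '-', // NON-BREAKING HYPHEN
		"\u{2013}" => '-', // EN DASH
		"\u{2014}" => '-', // EM DASH
	);

	// strtr() with an array replaces all keys in a single pass.
	return strtr( $title, $replacements );
}
```

Because decoding happens before the replacements, the map only needs one entry per code point rather than one per spelling of each character reference.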

See that entire comment thread as well as his review:

we recently had similar work in PR #9103 (#62995).

[…] we could consider an approach similar to that taken over there, which is to rely on a Unicode-supported PCRE to replace everything with the `Dash_Punctuation` character property, and also the Space_Separator.

if ( _wp_can_use_pcre_u() ) {
	$title = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );
}

Over time I think it’s okay to be more and more restrictive on these, but I hope we push more in the direction of finding ways to ensure the titles and filenames more closely match the content they are associated with.
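
To illustrate the PCRE approach from the review, here is a minimal sketch that wraps the quoted snippet so it can run outside WordPress. The function name example_dashes_and_spaces_to_hyphen() is hypothetical, and the preg_match() probe is only an assumed stand-in for _wp_can_use_pcre_u() when that function is not defined; it is not how core performs the check.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress core: replace every
 * Dash_Punctuation (Pd) and Space_Separator (Zs) character with "-".
 */
function example_dashes_and_spaces_to_hyphen( string $title ): string {
	// Use the core check when available; otherwise probe for the "u" flag.
	// The probe is an assumption made so the sketch runs standalone.
	$can_use_pcre_u = function_exists( '_wp_can_use_pcre_u' )
		? _wp_can_use_pcre_u()
		: ( 1 === @preg_match( '/^./u', 'a' ) );

	if ( ! $can_use_pcre_u ) {
		// Without Unicode-aware PCRE, leave the title unchanged.
		return $title;
	}

	$replaced = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );

	// preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
	return null === $replaced ? $title : $replaced;
}
```

Note that \p{Zs} also matches the ordinary ASCII space and \p{Pd} also matches the ASCII hyphen-minus, so this pattern covers more than the non-ASCII characters discussed above.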
