Make WordPress Core

Opened 2 years ago

Last modified 2 years ago

#56531 new defect (bug)

Aiming to “kill” entities, `sanitize_title_with_dashes()` happens to eat content

Reported by: anrghg's profile anrghg Owned by:
Milestone: Awaiting Review Priority: normal
Severity: major Version:
Component: Formatting Keywords:
Focuses: Cc:

Description

$title = preg_replace( '/&.+?;/', '', $title );

This regex deletes the part of the title between an ampersand and a semicolon. I ran into this issue when testing ASCII symbols and punctuation but it may affect titles using a semicolon instead of an inner period, and happen to use an ampersand before. Semicolon is less common, even less in titles, but WordPress should not make assumptions nor impose limitations.

In the process, sanitize_title_with_dashes() does again (cf. #56530) part of the job of remove_accents() but badly, deleting é instead of replacing it with e, and so on.

I’d suggest decoding all HTML entities, then reencoding <, >, &:

<?php
$title = html_entity_decode( $title );
$title = preg_replace( array( '/</', '/>/', '/&/' ), array( '&lt;', '&gt;', '&amp;' ), $title );

Change History (5)

#1 @anrghg
2 years ago

Note

The “kill” in the issue title is no kidding, it’s a quote from wp-includes/formatting.php:

<?php
// Kill entities.
$title = preg_replace( '/&.+?;/', '', $title );

#2 @anrghg
2 years ago

Please disregard the suggested preg_replace() and read this instead:

<?php
$title = htmlspecialchars( $title, ENT_NOQUOTES );

#3 follow-up: @costdev
2 years ago

  • Keywords changes-requested removed

changes-requested refers to: "Feedback has been provided, and the attached patch needs to be updated." Reference.

I'm not sure if such a change would need a dev note, so I'll leave this to others to give their thoughts.

#4 in reply to: ↑ 3 @anrghg
2 years ago

  • Keywords needs-dev-note needs-patch removed

Replying to costdev:

changes-requested refers to: "Feedback has been provided, and the attached patch needs to be updated." Reference.

I'm not sure if such a change would need a dev note, so I'll leave this to others to give their thoughts.

Sorry, indeed fixing a bug does not require any dev note, and as I’m to provide a patch, I’m removing all other keywords as well.

Context

It seems important to note that sanitize_title_with_dashes() is prone to delete a chunk of the title instead of just the ampersand — because the title input field seems to be plain text without automatic HTML encoding like in the article body. None of the editors converts even <, resulting in <some words> to become an invalid tag with an invalid attribute. That’s another issue I ran into while testing all ASCII in titles.

Users seem to be expected to use HTML entities in the title. But sanitize_title_with_dashes() handles only a few of these, deletes the rest.

#5 @anrghg
2 years ago

As I don’t have the resources to submit patches: The suggested rewrite could result in the following that also includes converting apostrophe to hyphen, and okina, letter apostrophe to underscore, as an important enhancement that would require a separate ticket and a dev note:

<?php
function sanitize_title_with_dashes( $title, $raw_title = '', $context = 'display' ) {
        $title = strip_tags( $title );
        // Maintains plus sign before calling `urldecode()`.
        $title = str_replace( '%2B', '+', $title );
        // URL-decodes to avoid screwing up percent sign removal.
        $title = urldecode( $title );
        // Removes percent signs.
        $title = str_replace( '%', '', $title );
        // Decodes HTML entities.
        $title = html_entity_decode( $title );
        // Reencodes <, >, &.
        $title = htmlspecialchars( $title, ENT_NOQUOTES );
        // Converts to lowercase.
        if ( seems_utf8( $title ) && function_exists( 'mb_strtolower' ) ) {
                $title = mb_strtolower( $title, 'UTF-8' );
        }
        $title = strtolower( $title );

        if ( 'save' === $context ) {
                
                // Converts okina, letter apostrophe to underscore.
                $title = str_replace( array( 'ʻ', 'ʼ' ), '_', $title );
                // Converts punctuation apostrophe to hyphen-minus.
                $title = str_replace( array( '’', '\'' ), '-', $title );
                // Converts spaces and dashes to hyphen-minus.
                $title = preg_replace( '/[\p{Zs}\p{Zl}\p{Zp}\x{2010}-\x{2015}\x{2212}]/u', '-', $title );
                // Converts &, <, >, @, /, * and dots to hyphen-minus.
                $title = str_replace( array( '&amp;', '&lt;', '&gt;', '@', '/', '*', '·', '‧' ), '-', $title );
                // Converts times to 'x'.
                $title = str_replace( '×', 'x', $title );
                // Removes entirely format controls, punctuation, symbols, modifier letters.
                $p_s_text = preg_replace( '/[\p{Cf}\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{Sk}\p{So}\p{Lm}]/u', '', $p_s_text );
        
        }
        
        // Converts period to hyphen-minus.
        $title = str_replace( '.', '-', $title );
        
        // Collapses and trims hyphen-minus.
        $title = preg_replace( '/-+/', '-', $title );
        $title = trim( $title, '-' );
        
        // Percent-encodes non-ASCII.
        if ( seems_utf8( $title ) ) {
                $title = utf8_uri_encode( $title, 200 );
        }

        // Deletes unsafe ASCII. (No more space.)
        $title = preg_replace( '/[^%a-z0-9_-]/', '', $title );

        return $title;
}
Last edited 2 years ago by anrghg (previous) (diff)
Note: See TracTickets for help on using tickets.