Make WordPress Core

Opened 5 weeks ago

Closed 11 days ago

#64607 closed defect (bug) (fixed)

HTML API: Normalize may remove significant text from PRE, TEXTAREA content

Reported by: jonsurrell Owned by: jonsurrell
Milestone: 7.0 Priority: normal
Severity: minor Version: 6.7
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

WP_HTML_Processor::normalize() removes a single leading newline (U+000A) from TEXTAREA, PRE, and LISTING content. It is correct to recognize that a leading newline is ignored inside these elements. However, an HTML-to-HTML normalization function must preserve the semantic content, and there are cases where stripping a leading newline produces HTML that parses differently.

Consider:

<!-- original -->
<textarea>
</textarea>
<!-- Normalized -->
<textarea></textarea>

In this case, stripping the leading newline is harmless because the newline is ignored by any spec-compliant HTML parser.

However, if there are multiple leading newlines, the result is different:

<!-- original -->
<textarea>

</textarea>
<!-- Normalized -->
<textarea>
</textarea>

Here, the original contains one newline when parsed. When the normalized version is parsed, that newline is stripped and the element is empty: it contains no newlines. Normalization has failed because the result is not semantically equivalent to the input.

This can be observed by repeatedly calling ::normalize(). On each call, a leading newline is stripped until none remain. Ideally, ::normalize() would be idempotent and calling ::normalize() on already-normalized HTML would result in no changes.

<?php
$html = <<<HTML
<textarea>

</textarea>
HTML;

echo "\n{$html}\n";
$html = WP_HTML_Processor::normalize( $html );
echo "\n{$html}\n";
$html = WP_HTML_Processor::normalize( $html );
echo "\n{$html}\n";
$html = WP_HTML_Processor::normalize( $html );
echo "\n{$html}\n";

Prints the following. Notice that normalization consumes newlines until they're exhausted:

<textarea>

</textarea>

<textarea>
</textarea>

<textarea></textarea>

<textarea></textarea>

This behavior only affects the leading newline inside elements with special leading newline behavior: TEXTAREA, PRE, and LISTING.

This note from the HTML standard is interesting:

For historical reasons, this algorithm does not round-trip an initial U+000A (LF) character in pre, textarea, or listing elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A (LF) character.

This is one case where HTML-to-HTML tooling like the HTML API can do better than the standard by ensuring that HTML content is preserved.
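
The loss described above can be modeled without WordPress. The following is a minimal sketch, not the HTML API's implementation: `drop_leading_newline()` and `serialize_without_guard()` are hypothetical helpers that stand in for the parser rule and for a serializer that writes no extra newline. It assumes the content is plain text with no characters that need escaping.

```php
<?php
/**
 * Hypothetical model of the parser rule: inside PRE, TEXTAREA, and LISTING,
 * one newline immediately after the start tag is ignored.
 *
 * @param string $raw_inner Markup between the start and end tags.
 * @return string Text content the parser would produce.
 */
function drop_leading_newline( string $raw_inner ): string {
	if ( '' !== $raw_inner && "\n" === $raw_inner[0] ) {
		// Cast guards against substr() returning false on older PHP versions.
		return (string) substr( $raw_inner, 1 );
	}
	return $raw_inner;
}

/**
 * Hypothetical model of serialization without an extra newline:
 * the text content is written back unchanged.
 */
function serialize_without_guard( string $text ): string {
	return $text;
}

// Raw content of "<textarea>\n\n</textarea>".
$raw    = "\n\n";
$passes = array();

for ( $i = 0; $i < 3; $i++ ) {
	// One parse-then-serialize round trip per pass.
	$raw      = serialize_without_guard( drop_leading_newline( $raw ) );
	$passes[] = $raw;
}

// Each pass loses one newline until none remain.
echo json_encode( $passes ), "\n";
// Prints: ["\n","",""]
```

The three passes correspond to the three ::normalize() calls in the example above: two newlines become one, then none, and only then does the output stop changing.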

Change History (3)

This ticket was mentioned in PR #10871 on WordPress/wordpress-develop by @jonsurrell.


5 weeks ago
#1

  • Keywords has-patch has-unit-tests added

Ensure HTML ::normalize() does not alter HTML by stripping newlines from PRE, LISTING, or TEXTAREA elements.

These elements have special rules that strip a single leading newline from their contents if present.

This approach works by injecting a newline after the tag opener when serializing the open tag tokens.

This works in all cases resulting in a stable normalization:

Start                 Text content   Normalized            Text content after
<pre>TEXT</pre>       "TEXT"         <pre>\nTEXT</pre>     "TEXT"
<pre>\nTEXT</pre>     "TEXT"         <pre>\nTEXT</pre>     "TEXT"
<pre>\n\nTEXT</pre>   "\nTEXT"       <pre>\n\nTEXT</pre>   "\nTEXT"
<pre></pre>           (none)         <pre>\n</pre>         ""
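
The rows above can be checked with a small standalone model. This is a sketch under stated assumptions rather than the code from the PR: `parse_special_element_text()` and `serialize_with_guard()` are hypothetical helpers that model the parser rule and the "inject a newline after the tag opener" approach for plain text content that needs no escaping.

```php
<?php
/**
 * Hypothetical model of the parser rule for PRE, TEXTAREA, and LISTING:
 * one newline immediately after the start tag is ignored.
 */
function parse_special_element_text( string $raw_inner ): string {
	if ( '' !== $raw_inner && "\n" === $raw_inner[0] ) {
		// Cast guards against substr() returning false on older PHP versions.
		return (string) substr( $raw_inner, 1 );
	}
	return $raw_inner;
}

/**
 * Hypothetical model of the fix: always emit one newline after the tag
 * opener so the parser consumes it instead of a significant newline.
 */
function serialize_with_guard( string $text ): string {
	return "\n" . $text;
}

// Raw inner content for each "Start" row of the table.
$starts  = array( 'TEXT', "\nTEXT", "\n\nTEXT", '' );
$results = array();

foreach ( $starts as $raw ) {
	$text_before = parse_special_element_text( $raw );
	$normalized  = serialize_with_guard( $text_before );
	$text_after  = parse_special_element_text( $normalized );

	$results[] = array(
		'text_before' => $text_before,
		'normalized'  => $normalized,
		'text_after'  => $text_after,
		// Normalizing the normalized output again must change nothing.
		'stable'      => serialize_with_guard( $text_after ) === $normalized,
	);
}

foreach ( $results as $row ) {
	echo json_encode( $row ), "\n";
}
```

In this model the text content is identical before and after normalization for every row, and a second normalization pass returns the same markup, which is the stability the table describes.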

Trac ticket: https://core.trac.wordpress.org/ticket/64607

#2 @jonsurrell
4 weeks ago

  • Milestone changed from Awaiting Review to 7.0
  • Owner set to jonsurrell
  • Severity changed from normal to minor
  • Status changed from new to accepted

#3 @jonsurrell
11 days ago

  • Resolution set to fixed
  • Status changed from accepted to closed

In 61747:

HTML API: Preserve newlines when normalizing special elements.

Ensures normalization preserves content in PRE, LISTING, and TEXTAREA elements. These elements ignore a single leading newline during parsing. Normalization now injects a newline after the tag opener to trigger this behavior, preventing significant newlines from being incorrectly stripped.

Developed in https://github.com/WordPress/wordpress-develop/pull/10871.

Props jonsurrell, dmsnell, mukesh27.
Fixes #64607.
