Opened 5 weeks ago
Closed 11 days ago
#64607 closed defect (bug) (fixed)
HTML API: Normalize may remove significant text from PRE, TEXTAREA content
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | 7.0 | Priority: | normal |
| Severity: | minor | Version: | 6.7 |
| Component: | HTML API | Keywords: | has-patch has-unit-tests |
| Focuses: | | Cc: | |
Description
WP_HTML_Processor::normalize() removes a single leading newline (U+000A) from TEXTAREA, PRE, and LISTING content. It is correct to recognize that a leading newline is ignored inside these elements. However, an HTML-to-HTML normalization function must preserve the semantic content, and there are cases where stripping a leading newline produces different HTML.
Consider:
    <!-- Original -->
    <textarea>
    </textarea>

    <!-- Normalized -->
    <textarea></textarea>
In this case, stripping the leading newline is harmless because the newline is ignored by any spec-compliant HTML parser.
However, if there are multiple leading newlines, the result is different:
    <!-- Original -->
    <textarea>

    </textarea>

    <!-- Normalized -->
    <textarea>
    </textarea>

In this case, the original contains a newline when parsed. When the normalized version is parsed, however, that newline is stripped and the element is empty: it contains no newlines. Normalization has failed because the result is not semantically equivalent to the input.
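The difference can be modelled without WordPress. What follows is a minimal sketch, assuming a hypothetical helper `parsed_content()` that mimics only the parser's leading-newline rule for these elements. It is not part of the HTML API and ignores everything else a parser does.

```php
<?php
// Hypothetical helper (not part of the HTML API): models only the parser rule
// that drops a single leading U+000A inside TEXTAREA, PRE, and LISTING.
function parsed_content( string $raw_inner ): string {
	if ( '' !== $raw_inner && "\n" === $raw_inner[0] ) {
		return substr( $raw_inner, 1 );
	}
	return $raw_inner;
}

$original_inner   = "\n\n"; // Raw content of <textarea>\n\n</textarea>.
$normalized_inner = "\n";   // Raw content after one leading newline was stripped.

// json_encode() is used only to make the newlines visible in the output.
echo json_encode( parsed_content( $original_inner ) ), "\n";   // "\n"
echo json_encode( parsed_content( $normalized_inner ) ), "\n"; // ""
```

The two parsed values differ, which is the loss of content described above.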
This can be observed by repeatedly calling ::normalize(). On each call, a leading newline is stripped until none remain. Ideally, ::normalize() would be idempotent and calling ::normalize() on already-normalized HTML would result in no changes.
    <?php
    // The TEXTAREA starts with two leading newlines.
    $html = <<<HTML
    <textarea>

    </textarea>
    HTML;

    echo "\n{$html}\n";
    $html = WP_HTML_Processor::normalize( $html );
    echo "\n{$html}\n";
    $html = WP_HTML_Processor::normalize( $html );
    echo "\n{$html}\n";
    $html = WP_HTML_Processor::normalize( $html );
    echo "\n{$html}\n";
Prints the following. Notice that normalization consumes newlines until they're exhausted:
    <textarea>

    </textarea>

    <textarea>
    </textarea>

    <textarea></textarea>

    <textarea></textarea>

This behavior only affects the leading newline inside elements with special leading-newline behavior: TEXTAREA, PRE, and LISTING.
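An idempotence check like this can be automated. The sketch below is an illustration only: `toy_normalize()` is a hypothetical stand-in that imitates the pre-fix behavior for TEXTAREA, and `passes_until_stable()` is a hypothetical helper that counts how many passes a normalizer needs before its output stops changing. An idempotent normalizer should never need more than one.

```php
<?php
// Hypothetical stand-in for the pre-fix behavior, limited to TEXTAREA: one
// leading newline is dropped on every pass and never added back.
function toy_normalize( string $html ): string {
	return preg_replace( '~(<textarea>)\n~', '$1', $html );
}

// Hypothetical helper: number of passes that changed the HTML before the
// output became stable (capped at $limit).
function passes_until_stable( callable $normalize, string $html, int $limit = 10 ): int {
	for ( $passes = 0; $passes < $limit; $passes++ ) {
		$next = $normalize( $html );
		if ( $next === $html ) {
			return $passes;
		}
		$html = $next;
	}
	return $limit;
}

// Two leading newlines take two changing passes, so this prints 2.
echo passes_until_stable( 'toy_normalize', "<textarea>\n\n</textarea>" ), "\n";
```

In a WordPress environment the same helper could presumably be given `'WP_HTML_Processor::normalize'` as the callable, keeping in mind that it may return null for input it does not support.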
This note from the HTML standard is interesting:
For historical reasons, this algorithm does not round-trip an initial U+000A (LF) character in pre, textarea, or listing elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A (LF) character.
This is one case where HTML-to-HTML tooling like the HTML API can do better than the standard by ensuring that HTML content is preserved.
Change History (3)
This ticket was mentioned in PR #10871 on WordPress/wordpress-develop by @jonsurrell.
5 weeks ago
#1
- Keywords has-patch has-unit-tests added
Ensure HTML ::normalize() does not alter HTML by stripping newlines from PRE, LISTING, or TEXTAREA elements.

These elements have special rules that strip a single leading newline from their contents if present.
This approach works by injecting a newline after the tag opener when serializing the open tag tokens.
This works in all cases resulting in a stable normalization:
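| Input HTML | Text content | Normalized HTML | Text content after normalization |
|---|---|---|---|
| <pre>TEXT</pre> | "TEXT" | <pre>\nTEXT</pre> | "TEXT" |
| <pre>\nTEXT</pre> | "TEXT" | <pre>\nTEXT</pre> | "TEXT" |
| <pre>\n\nTEXT</pre> | "\nTEXT" | <pre>\n\nTEXT</pre> | "\nTEXT" |
| <pre></pre> | "" | <pre>\n</pre> | "" |

Trac ticket: https://core.trac.wordpress.org/ticket/64607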
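The four PRE cases listed above can be checked with a small model of the approach. This is a sketch under stated assumptions, not the code from the PR: `pre_text_content()` and `serialize_pre()` are hypothetical helpers, they handle only a bare PRE element with no attributes, and they ignore character-reference escaping.

```php
<?php
// Parser side (hypothetical model): one leading newline inside PRE is dropped.
function pre_text_content( string $html ): string {
	preg_match( '~^<pre>(.*)</pre>$~s', $html, $m );
	$inner = $m[1];
	return ( '' !== $inner && "\n" === $inner[0] ) ? substr( $inner, 1 ) : $inner;
}

// Serializer side, following the description above: always emit a newline
// right after the tag opener, then the text content unchanged.
function serialize_pre( string $text_content ): string {
	return "<pre>\n" . $text_content . '</pre>';
}

$inputs = array( '<pre>TEXT</pre>', "<pre>\nTEXT</pre>", "<pre>\n\nTEXT</pre>", '<pre></pre>' );

foreach ( $inputs as $input ) {
	$once  = serialize_pre( pre_text_content( $input ) );
	$twice = serialize_pre( pre_text_content( $once ) );

	// Content is preserved and a second pass changes nothing.
	var_dump( pre_text_content( $input ) === pre_text_content( $once ) && $once === $twice );
}
```

Because the injected newline is exactly the one the parser will drop, the text content survives and the serialized form is already stable after the first pass.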