Opened 8 months ago
Closed 6 months ago
#61009 closed defect (bug) (fixed)
HTML API: Fix some existing bugs in `kses` comment detection, enable Bits storage.
Reported by: | dmsnell | Owned by: | |
---|---|---|---|
Milestone: | 6.6 | Priority: | normal |
Severity: | normal | Version: | 6.6 |
Component: | HTML API | Keywords: | has-patch 2nd-opinion has-unit-tests |
Focuses: | Cc: |
Description (last modified by )
When wp_kses_split
processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others.
In HTML there are many kinds of invalid markup which, according to the specification, are to be interpreted as an HTML comment. These include, but are not limited to:
- HTML comments with invalid syntax,
<!-->
,<!-- --!>
, etc… - HTML closing tags whose tag name is invalid
</3>
,</%happy>
, etc… - Things that look like XML CDATA sections,
<![CDATA[…]]>
- Things that look like XML Processor Instruction nodes,
<?include "blarg">
This patch makes a minor adjustment to the algorithm in wp_kses_split
to allow two additional kinds of HTML comments:
- HTML comments with the incorrect closer
--!>
, because this one was a simple and easy change. - Closing tags with an invalid tag name, e.g.
</%dolly>
j, because these are required to open up explorations in Gutenberg on Bits, a new iteration of dynamic tokens for externally-sourced data, or "Shortcodes 2.0"
These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax because as long as Core removes them, there's no way to load content from the database and experiment on the full life cycle of potential Bits systems.
On its own, however, this represents a kind of bug fix for Core, making the implementation of wp_kses_split()
more closely align with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments (because that is a much deeper issue and involves many more questions about existing expectations) but it does propose a couple of hopefully and expectedly minor fixes that hopefully won't break any existing code or projects.
Change History (18)
This ticket was mentioned in PR #6395 on WordPress/wordpress-develop by @dmsnell.
8 months ago
#1
@zieladam commented on PR #6395:
8 months ago
#2
The test failure is related to IE conditional comments. However, dealing with those shouldn't matter anymore:
Conditional comments are conditional statements interpreted by Microsoft Internet Explorer versions 5 through 9 in HTML source code
https://www.sitepoint.com/microsoft-drop-ie10-conditional-comments/
@zieladam commented on PR #6395:
8 months ago
#3
Other than my preg_match
question, this LGTM. I'd just remove the failing test case, conditional comments were only relevant up to IE9 which WordPress doesn't support anymore. And even if it did, the new output doesn't seem like an XSS vector.
@westonruter commented on PR #6395:
8 months ago
#4
8 months ago
#5
How common are these other invalid comments?
I'm guessing rare, but it might be a fun experiment to scan some sites and count them. This patch _only_ addresses a single kind of invalid comment leaving most other constructions alone. This is because I'm trying to be tactical here and leave a small surface area of change.
I presume you'll be adding tests specifically for the changes here?
Yes absolutely. Before I do that though I want to make sure folks are on board with the change. If we use these "funky comments" for Bits, template placeholders, translation wrappers, they could become quite popular.
In a local branch I was playing with replacing wp_kses_split2
using the HTML API and it provides some great affordances, but as usual, far more questions to resolve.
For instance, WordPress currently strips away <:::>
, interpreting it as an invalid tag. It's not though, it's legitimate plaintext _despite being invalid markup_. So now what do we do?
- Do we attempt to maintain this legacy behavior that corrupts documents that otherwise are fine?
- Do we preserve this legitimate HTML and allow its _surprising_ form to thread through a stack of code expecting more familiar HTML?
- Do we eagerly replace confusing HTML like this before passing it along into WordPress, taking a performance hit to provide regularity and mitigate further misinterpretation by later code on the render chain?
There are further nuances. For now I would like to preserve just this one case so that we can get testing and exploring the UX of potential "Bits" systems. In my opinion this code is no worse off than the existing code, so the fact that it doesn't solve more syntax irregularities seems like something that shouldn't be required.
@westonruter commented on PR #6395:
8 months ago
#6
This patch _only_ addresses a single kind of invalid comment leaving most other constructions alone.
Is there a list of all the possible invalid comment syntaxes?
8 months ago
#7
Is there a list of all the possible invalid comment syntaxes?
They are not listed separated like this but you can find them in the parsing errors in the specification.
The Tag Processor draws its own classification which is retrievable with `get_comment_type()`, whereas it distinguishes:
- Comments that are basically well-formed.
- Comments that were abruptly closed.
- Comments that appear as though someone wanted to write a CDATA section.
- Comments that appear as though someone wanted to write an SGML processing instruction.
- Comments that form from a variety of other invalid HTML constructions.
It separates out comments that appear as the result of being a closing tag with an invalid tag name. These it calls funky comments and these are the proposed syntax for Bits and other atomic shortcode like placeholders.
8 months ago
#8
How common are these other invalid comments?
@westonruter I took a list of the top-ranked domains and grabbed the first 296,000 pages at /
for the domain.
This file has ANSI terminal codes for clearer output. You can cat
it and see colors…
This ticket was mentioned in Slack in #core by joemcgill. View the logs.
7 months ago
#10
@
7 months ago
- Description modified (diff)
- Summary changed from HTML API: Preserve some additional invalid HTML comment syntaxes. to HTML API: Fix some existing bugs in `kses` comment detection, enable Bits storage.
@ellatrix commented on PR #6395:
6 months ago
#11
Thanks for omitting the others improvements, it really helps to understand that they are not related to allowing the some additional invalid comment syntaxes and are completely separate fixes. It also reduces the risk of a revert in the unlikely that case there's a problem with the additional fixes. Let's do those in a separate PR 👍
Trac ticket: Core-61009
When
wp_kses_split
processes a document it attempts to leave HTML comments relatively alone. It makes minor adjustments, but leaves the comments in the document in its output.Unfortunately it only recognizes one kind of HTML comment and rejects many other kinds which appear as the result of various invalid HTML markup.
This patch makes a minor adjustment to the algorithm in
wp_kses_split
to allow two additional kinds of HTML comments:--!>
.</%dolly>
.In an HTML parser these all become comments, and so leaving them in the document should be a benign operation, improving the reliability of detecting comments in Core. These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax.