Make WordPress Core

Opened 8 months ago

Closed 6 months ago

#61009 closed defect (bug) (fixed)

HTML API: Fix some existing bugs in `kses` comment detection, enable Bits storage.

Reported by: dmsnell's profile dmsnell Owned by:
Milestone: 6.6 Priority: normal
Severity: normal Version: 6.6
Component: HTML API Keywords: has-patch 2nd-opinion has-unit-tests
Focuses: Cc:

Description (last modified by dmsnell)

When wp_kses_split processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others.

In HTML there are many kinds of invalid markup which, according to the specification, are to be interpreted as an HTML comment. These include, but are not limited to:

  • HTML comments with invalid syntax, <!-->, <!-- --!>, etc…
  • HTML closing tags whose tag name is invalid </3>, </%happy>, etc…
  • Things that look like XML CDATA sections, <![CDATA[…]]>
  • Things that look like XML Processor Instruction nodes, <?include "blarg">

This patch makes a minor adjustment to the algorithm in wp_kses_split to allow two additional kinds of HTML comments:

  • HTML comments with the incorrect closer --!>, because this one was a simple and easy change.
  • Closing tags with an invalid tag name, e.g. </%dolly>j, because these are required to open up explorations in Gutenberg on Bits, a new iteration of dynamic tokens for externally-sourced data, or "Shortcodes 2.0"

These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax because as long as Core removes them, there's no way to load content from the database and experiment on the full life cycle of potential Bits systems.

On its own, however, this represents a kind of bug fix for Core, making the implementation of wp_kses_split() more closely align with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments (because that is a much deeper issue and involves many more questions about existing expectations) but it does propose a couple of hopefully and expectedly minor fixes that hopefully won't break any existing code or projects.

Change History (18)

This ticket was mentioned in PR #6395 on WordPress/wordpress-develop by @dmsnell.


8 months ago
#1

Trac ticket: Core-61009

When wp_kses_split processes a document it attempts to leave HTML comments relatively alone. It makes minor adjustments, but leaves the comments in the document in its output.

Unfortunately it only recognizes one kind of HTML comment and rejects many other kinds which appear as the result of various invalid HTML markup.

This patch makes a minor adjustment to the algorithm in wp_kses_split to allow two additional kinds of HTML comments:

  • HTML comments with the incorrect closer --!>.
  • Closing tags with an invalid tag name, e.g. </%dolly>.

In an HTML parser these all become comments, and so leaving them in the document should be a benign operation, improving the reliability of detecting comments in Core. These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax.

@zieladam commented on PR #6395:


8 months ago
#2

The test failure is related to IE conditional comments. However, dealing with those shouldn't matter anymore:

Conditional comments are conditional statements interpreted by Microsoft Internet Explorer versions 5 through 9 in HTML source code

https://www.sitepoint.com/microsoft-drop-ie10-conditional-comments/

@zieladam commented on PR #6395:


8 months ago
#3

Other than my preg_match question, this LGTM. I'd just remove the failing test case, conditional comments were only relevant up to IE9 which WordPress doesn't support anymore. And even if it did, the new output doesn't seem like an XSS vector.

@westonruter commented on PR #6395:


8 months ago
#4

Confirmed that Chrome parses these as comments:

foo <!-- comment 2 --!> bar baz

https://github.com/WordPress/wordpress-develop/assets/134745/747fca69-5326-4e47-b7c9-93b955829da9

How common are these other invalid comments?

I presume you'll be adding tests specifically for the changes here?

@dmsnell commented on PR #6395:


8 months ago
#5

How common are these other invalid comments?

I'm guessing rare, but it might be a fun experiment to scan some sites and count them. This patch _only_ addresses a single kind of invalid comment leaving most other constructions alone. This is because I'm trying to be tactical here and leave a small surface area of change.

I presume you'll be adding tests specifically for the changes here?

Yes absolutely. Before I do that though I want to make sure folks are on board with the change. If we use these "funky comments" for Bits, template placeholders, translation wrappers, they could become quite popular.

In a local branch I was playing with replacing wp_kses_split2 using the HTML API and it provides some great affordances, but as usual, far more questions to resolve.

For instance, WordPress currently strips away <:::>, interpreting it as an invalid tag. It's not though, it's legitimate plaintext _despite being invalid markup_. So now what do we do?

  • Do we attempt to maintain this legacy behavior that corrupts documents that otherwise are fine?
  • Do we preserve this legitimate HTML and allow its _surprising_ form to thread through a stack of code expecting more familiar HTML?
  • Do we eagerly replace confusing HTML like this before passing it along into WordPress, taking a performance hit to provide regularity and mitigate further misinterpretation by later code on the render chain?

There are further nuances. For now I would like to preserve just this one case so that we can get testing and exploring the UX of potential "Bits" systems. In my opinion this code is no worse off than the existing code, so the fact that it doesn't solve more syntax irregularities seems like something that shouldn't be required.

@westonruter commented on PR #6395:


8 months ago
#6

This patch _only_ addresses a single kind of invalid comment leaving most other constructions alone.

Is there a list of all the possible invalid comment syntaxes?

@dmsnell commented on PR #6395:


8 months ago
#7

Is there a list of all the possible invalid comment syntaxes?

They are not listed separated like this but you can find them in the parsing errors in the specification.

The Tag Processor draws its own classification which is retrievable with `get_comment_type()`, whereas it distinguishes:

  • Comments that are basically well-formed.
  • Comments that were abruptly closed.
  • Comments that appear as though someone wanted to write a CDATA section.
  • Comments that appear as though someone wanted to write an SGML processing instruction.
  • Comments that form from a variety of other invalid HTML constructions.

It separates out comments that appear as the result of being a closing tag with an invalid tag name. These it calls funky comments and these are the proposed syntax for Bits and other atomic shortcode like placeholders.

@dmsnell commented on PR #6395:


8 months ago
#8

How common are these other invalid comments?

@westonruter I took a list of the top-ranked domains and grabbed the first 296,000 pages at / for the domain.

invalid-comments.txt

This file has ANSI terminal codes for clearer output. You can cat it and see colors…

https://github.com/WordPress/wordpress-develop/assets/5431237/4a17c472-3da9-4d9a-bed7-aef671693d83

This ticket was mentioned in Slack in #core by joemcgill. View the logs.


7 months ago

#10 @dmsnell
7 months ago

  • Description modified (diff)
  • Summary changed from HTML API: Preserve some additional invalid HTML comment syntaxes. to HTML API: Fix some existing bugs in `kses` comment detection, enable Bits storage.

@ellatrix commented on PR #6395:


6 months ago
#11

Thanks for omitting the others improvements, it really helps to understand that they are not related to allowing the some additional invalid comment syntaxes and are completely separate fixes. It also reduces the risk of a revert in the unlikely that case there's a problem with the additional fixes. Let's do those in a separate PR 👍

@dmsnell commented on PR #6395:


6 months ago
#12

Found an issue with encoding HTML comments, I'll examine.

#13 @dmsnell
6 months ago

In 58418:

KSES: Preserve some additional invalid HTML comment syntaxes.

When wp_kses_split processes a document it attempts to leave HTML comments
alone. It makes minor adjustments, but leaves the comments in the document in
its output. Unfortunately it only recognizes one kind of HTML comment and
rejects many others.

This patch makes a minor adjustment to the algorithm in wp_kses_split to
recognize and preserve an additional kind of HTML comment: closing tags with
an invalid tag name, e.g. </%dolly>.

These invalid closing tags must be interpreted as comments by a browser.
This bug fix aligns the implementation of wp_kses_split() more closely
with its stated goal of leaving HTML comments as comments.

It doesn't attempt to fully fix the mis-parsed comments, but it does propose a
minor fix that hopefully won't break any existing code or projects.

Developed in https://github.com/WordPress/wordpress-develop/pull/6395
Discussed in https://core.trac.wordpress.org/ticket/61009

Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam.
See #61009.

#15 @dmsnell
6 months ago

  • Keywords has-unit-tests added
  • Milestone changed from Awaiting Review to 6.6

#16 @dmsnell
6 months ago

In 58424:

KSES: Fix tests and detection of HTML Bogus Comment spans.

In [58418] a test was added without the test_ prefix in its function
name, and because of that, it wasn't run in the test suite.
The prefix has been added to ensure that it runs.

In the original patch, due to a logical bug, a recursive loop to
transform the inside contents of the bogus comments was never run
more than once. This has been fixed.

This patch also includes one more case where kses wasn't
properly detecting the bogus comment state, and adds a test case
to cover this. It limits itself to some but not all constructions
of invalid markup declaration so that it doesn't conflict with
existing behaviors around those and other kinds of invalid comments.

Props ellatrix, dmsnell.
See #61009.
Follow-up to [58418].

This ticket was mentioned in Slack in #core by nhrrob. View the logs.


6 months ago

#18 @dmsnell
6 months ago

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.