WordPress.org

Make WordPress Core

Opened 11 months ago

Last modified 9 months ago

#45387 new defect (bug)

Valid HTML gets mangled on the frontend

Reported by: youknowriad
Owned by:
Milestone: Awaiting Review
Priority: normal
Severity: normal
Version:
Component: Editor
Keywords:
Focuses:
Cc:
PR Number:

Description

Open the HTML editor and paste this HTML code:

<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p>

Then preview the post on the frontend.

The HTML is a bit mangled in the output, and the link doesn't show up properly.

I suspect that it's related to one of the_content filters.

Reproduced in 4.9.8

Change History (8)

#1 @pbiron
11 months ago

First, the HTML you've got is invalid. It should be:

<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is &gt; an aria label">http://google.com</a></p>

[Note that > has been replaced by &gt; in the aria-label attribute]

Is the (classic) editor supposed to try to "fix" bad HTML? If so, then there is a bug somewhere. If not, then I'd say this is "operator error" and this ticket should be closed as invalid.

Last edited 11 months ago by pbiron

#2 follow-up: @youknowriad
11 months ago

This HTML is valid. You can check by pasting it into the official W3C validator: https://validator.w3.org/nu/#textarea

<!DOCTYPE html>
<html lang="fr">
<head>
<title>title test</title>
</head>
<body>
<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p>
</body>
</html>

#3 @chrisvanpatten
11 months ago

There's also some additional context on why > is valid inside HTML attributes in this Gutenberg ticket.

#4 in reply to: ↑ 2 @pbiron
11 months ago

Replying to youknowriad:

This HTML is valid. You can check by pasting it into the official W3C validator: https://validator.w3.org/nu/#textarea

You're right, my bad. Only < and & must be replaced by character entities or references. But using &gt; does allow the HTML to be rendered correctly on the front end, so the problem is probably in wptexturize() (which is hooked to the_content).

Let me see if I can find why it is confused by > in an attribute value.
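As a quick illustration of the &gt; workaround mentioned above, here is a minimal Python sketch. It is not WordPress code; it only assumes the standard library's html.escape to show how the attribute value from the description would be written with the > escaped.

```python
import html

# Attribute value from the ticket description, containing a raw ">".
label = "This is > an aria label"

# html.escape replaces &, < and > (and quotes when quote=True) with entities.
escaped_label = html.escape(label, quote=True)

# Build the anchor with the escaped value; the link markup itself is unchanged.
anchor = '<a href="http://google.com/" aria-label="{}">http://google.com</a>'.format(escaped_label)
print(anchor)
```

With the escaped value, the only > characters left in the markup are the ones that actually close tags.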

#5 @pbiron
11 months ago

The problem is in _get_wptexturize_split_regex() (which is called by wptexturize()) at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L713.

The regex interprets the > in the attribute value as the end of the start tag (which is why using the character entity does not cause a problem), which results in the HTML getting "mangled" on the front end.

Unfortunately, I don't know how to fix it...I've never been good with regex assertions/lookahead/etc.

The same regex problem also exists in get_html_split_regex() at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L673.
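To illustrate the failure mode described above, here is a minimal Python sketch. It does not use the actual WordPress regex: NAIVE_TAG and split_naive are hypothetical names for a deliberately simplified pattern that, like the behavior reported here, treats the first > as the end of the tag.

```python
import re

# Hypothetical, simplified tag pattern: "<", then anything up to the first ">".
# This is NOT the regex from formatting.php, only a stand-in for the failure mode.
NAIVE_TAG = re.compile(r"(<[^>]*>)")

def split_naive(html):
    """Split html into tag and text pieces, dropping empty strings."""
    return [piece for piece in NAIVE_TAG.split(html) if piece]

raw = '<a href="http://google.com/" aria-label="This is > an aria label">http://google.com</a>'
escaped = raw.replace("This is >", "This is &gt;")

pieces_raw = split_naive(raw)          # tag is cut short at the ">" inside aria-label
pieces_escaped = split_naive(escaped)  # tag, link text, closing tag

for piece in pieces_raw:
    print(repr(piece))
```

In the unescaped case the rest of the attribute ends up in a text piece, so later text processing sees part of the tag as ordinary content.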

#6 follow-up: @aduth
11 months ago

Maybe worth pointing out that the function is documented to have been slated for removal by v4.5.0:

https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?rev=43571#L687

It's unclear from the history whether it's safe to be dropped without any additional changes.

Related:

I'm not certain whether this is on any person's radar for being addressed; my suspicion is it's not.

For the regular expression itself, I sense it may not be possible in a single expression to match the end of an opening tag while exempting ">" characters that occur within an attribute value.

It's unfortunate that this forces us into a position of needlessly escaping content, as it feels like compounding technical debt. The issue will remain for non-Gutenberg HTML entry.
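On the single-expression question, a rough Python sketch suggests that for well-formed, quoted attribute values one pattern can skip over ">" inside quotes. QUOTE_AWARE_TAG and split_quote_aware are hypothetical names, not WordPress code, and the pattern ignores comments, CDATA, shortcodes, and unbalanced quotes, all of which the real split regexes would also need to handle.

```python
import re

# Hypothetical pattern: inside a tag, consume either a single character that is
# not ">" or a quote, or a whole double-quoted or single-quoted string.
# A ">" inside a quoted attribute value is therefore never seen as the tag end.
QUOTE_AWARE_TAG = re.compile(r"""(<(?:[^>"']|"[^"]*"|'[^']*')*>)""")

def split_quote_aware(html):
    """Split html into tag and text pieces, dropping empty strings."""
    return [piece for piece in QUOTE_AWARE_TAG.split(html) if piece]

link = '<a href="http://google.com/" aria-label="This is > an aria label">http://google.com</a>'
pieces = split_quote_aware(link)

for piece in pieces:
    print(repr(piece))
```

The three alternatives start with different characters, so the pattern should not backtrack badly on this input. Whether something along these lines is safe for everything wptexturize() has to process is a separate question.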

#7 in reply to: ↑ 6 @pbiron
11 months ago

Replying to aduth:

Maybe worth pointing out that the function is documented to have been slated for removal by v4.5.0:

I haven't looked through the history but I interpreted that comment as calling for replacing that regex with the one in get_html_split_regex()...which has the same problem.

#8 @afercia
9 months ago

#46114 was marked as a duplicate.
