Opened 5 years ago
Last modified 7 weeks ago
#45387 new defect (bug)
Valid HTML gets mangled on the frontend
Reported by: | | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | |
Component: | Editor | Keywords: | |
Focuses: | | Cc: | |
Description
Open the HTML editor and paste this HTML code:
<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p>
Then preview the post on the frontend.
The HTML is a bit mangled in the output, and the link doesn't show up properly.
I suspect that it's related to one of the the_content filters.
Reproduced in 4.9.8
Change History (12)
#2
follow-up:
↓ 4
@
5 years ago
This HTML is valid. You can check by pasting it into the official W3C validator: https://validator.w3.org/nu/#textarea
<!DOCTYPE html> <html lang="fr"> <head> <title>title test</title> </head> <body> <p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p> </body> </html>
#3
@
5 years ago
There's also some additional context on why > is valid inside HTML attributes in this Gutenberg ticket.
#4
in reply to:
↑ 2
@
5 years ago
Replying to youknowriad:
This HTML is valid. You can check by pasting it into the official W3C validator: https://validator.w3.org/nu/#textarea
You're right, my bad. Only < and & must be replaced by character entities or references. But using &gt; does allow the HTML to be rendered correctly on the front end, so the problem is probably in wptexturize() (which is hooked to the_content).
Let me see if I can find why it is confused by > in an attribute value.
#5
@
5 years ago
The problem is in _get_wptexturize_split_regex() (which is called by wptexturize()) at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L713.
The regex interprets the > in the attribute value as the end of the start tag (which is why using the character entity does not cause a problem), which results in the HTML getting "mangled" on the front end.
Unfortunately, I don't know how to fix it... I've never been good with regex assertions/lookahead/etc.
The same regex problem also exists in get_html_split_regex() at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L673.
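To illustrate the failure mode described above, here is a minimal Python sketch. The pattern `<[^>]*>` is a deliberately simplified stand-in and an assumption for illustration; it is not the actual expression in _get_wptexturize_split_regex() or get_html_split_regex(). It only reproduces the idea of treating the first > as the end of the start tag.

```python
import re

# Simplified stand-in for a tag-matching pattern (an assumption, not the
# real WordPress regex): "<", then anything that is not ">", then ">".
NAIVE_TAG = re.compile(r"<[^>]*>")

raw = '<a href="http://google.com/" aria-label="This is > an aria label">'
escaped = raw.replace("This is >", "This is &gt;")


def first_tag(html):
    """Return the first substring the naive pattern treats as a tag."""
    match = NAIVE_TAG.search(html)
    return match.group(0) if match else None


# With a raw ">" the match stops inside the attribute value.
truncated = first_tag(raw)
# With "&gt;" the whole start tag is matched.
complete = first_tag(escaped)

print(truncated)
print(complete)
```

With the raw > the pattern cuts the tag off in the middle of the aria-label value, and everything after that point is treated as text. This would be consistent with the mangled output, though the real regex is considerably more involved.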
#6
follow-up:
↓ 7
@
5 years ago
Maybe worth pointing out that the function is documented as having been slated for removal by v4.5.0:
https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?rev=43571#L687
It's unclear from the history whether it's safe to be dropped without any additional changes.
Related:
I'm not certain whether this is on anyone's radar for being addressed; my suspicion is that it's not.
For the regular expression itself, I sense it may not be possible in a single expression to match the end of an opening tag while exempting any ">" that occurs within an attribute value.
It's unfortunate that this forces us into a position of needlessly escaping content, as it feels like compounding technical debt. The issue will remain for non-Gutenberg HTML entry.
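As a rough sketch of an alternative to a single expression, a small scanner can track whether it is inside a quoted attribute value and skip any ">" it finds there. The helper name find_start_tag_end is hypothetical, and this is not code from WordPress; it assumes well-formed quoted attributes and ignores comments and the contents of script or style elements.

```python
def find_start_tag_end(html, start):
    """Return the index of the ">" closing the tag that opens at `start`.

    Hypothetical helper: a ">" between matching quotes is skipped.
    Returns -1 if no closing ">" is found.
    """
    if html[start] != "<":
        raise ValueError("start must point at '<'")
    quote = None  # the quote character currently open, if any
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote ends the attribute value
        elif ch in ('"', "'"):
            quote = ch  # opening quote starts an attribute value
        elif ch == ">":
            return i
    return -1


html = ('<a href="http://google.com/" '
        'aria-label="This is > an aria label">http://google.com</a>')
end = find_start_tag_end(html, 0)
start_tag = html[: end + 1]
print(start_tag)
```

Here the scan runs past the ">" inside aria-label and stops at the real end of the start tag. Whether something like this could replace the existing split regexes without side effects is a separate question.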
#7
in reply to:
↑ 6
@
5 years ago
Replying to aduth:
Maybe worth pointing out that the function is documented as having been slated for removal by v4.5.0:
I haven't looked through the history but I interpreted that comment as calling for replacing that regex with the one in get_html_split_regex()... which has the same problem.
#9
@
2 years ago
This issue has come up again in a different form in the following Gutenberg bug report:
https://github.com/WordPress/gutenberg/issues/11789#issuecomment-847242464
(In a gist, <script>alert('rock & roll')</script> works as expected, but <script>if (3 < 4) alert('rock & roll')</script> yields an alert message with an escaped &.)
In this case, we can't really work around the issue by escaping content (as described in #6), nor by escaping HTML attributes (as described in #1): we are dealing with a <script> tag and so special characters like < and & should somehow be preserved.
#10
@
15 months ago
I hope this will get more attention soon. The issue @mcsf mentioned is really a pain for JS conditional logic. The following code throws an error ("Uncaught SyntaxError: '#' not followed by identifier") when entered into a Gutenberg HTML block directly, or when output by a shortcode entered in Gutenberg, because & is converted to an HTML entity.
<script>
x = 5;
y = 10;
z = 10;
if (x < y) {
  console.log("X is Less than Y.");
} else if (x !== y && x !== z) {
  console.log("X is More than Y, and Unique From Z.");
}
</script>
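One way to picture the behavior being asked for is to encode bare ampersands only outside of script elements. The Python sketch below is purely illustrative: the function name and both patterns are assumptions, it is not how WordPress processes content, and the script-matching pattern is itself simplified (it would be confused by a ">" inside a script tag attribute or by "</script>" inside a JavaScript string).

```python
import re

# Matches a whole <script>...</script> element, case-insensitively and
# across newlines. Simplified on purpose; see the caveats in the lead-in.
SCRIPT_ELEMENT = re.compile(
    r"(<script\b[^>]*>.*?</script\s*>)", re.IGNORECASE | re.DOTALL
)

# An "&" that does not already start a character reference.
BARE_AMPERSAND = re.compile(r"&(?!#?\w+;)")


def encode_ampersands_outside_scripts(html):
    """Encode bare "&" as "&#038;" everywhere except inside <script>."""
    parts = SCRIPT_ELEMENT.split(html)
    # With one capturing group, odd-indexed parts are the script elements,
    # so only the even-indexed parts are rewritten.
    for i in range(0, len(parts), 2):
        parts[i] = BARE_AMPERSAND.sub("&#038;", parts[i])
    return "".join(parts)


sample = "<p>rock & roll</p><script>if (x !== y && x !== z) { run(); }</script>"
result = encode_ampersands_outside_scripts(sample)
print(result)
```

The ampersand in the paragraph is encoded while the && inside the script element is left alone, which is the outcome the JS snippet above needs in order to parse.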
#11
@
2 months ago
We have recently been made aware of an issue that sounds like this one, found while using the twentytwentythree theme.
It occurs, for example, when you use the twentytwentythree theme with WooCommerce and add some custom inline HTML and JS code via the woocommerce_after_add_to_cart_quantity hook.
We added a select field with some options. The values of these options are URLs, and selecting one redirects the user to that URL. The URLs are concatenated strings built from some PHP conditions, which add query args like &something=1 to the string that is then set as the value of the select options. The code that builds the URLs and performs the redirect upon selection is in an inline <script> tag. The resulting URLs don't keep &something=1; it gets converted to #038;something=1, which means we cannot use $_GET['something'].
If we switch from twentytwentythree to a classic theme, the conversion of & to #038; doesn't occur.
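A small Python sketch of why the query arg goes missing, assuming the converted URL ends up containing &#038; in place of &. The host example.com and the color parameter are made up for illustration. Once a # appears in the URL, everything after it is the fragment rather than part of the query string.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URLs; example.com and the "color" parameter are made up.
intended = "https://example.com/product?color=red&something=1"
converted = "https://example.com/product?color=red&#038;something=1"


def query_params(url):
    """Return the parameters found in the query string of `url`."""
    return parse_qs(urlsplit(url).query)


intended_params = query_params(intended)
converted_params = query_params(converted)
# Everything after "#" is the fragment, not part of the query string.
converted_fragment = urlsplit(converted).fragment

print(intended_params)
print(converted_params)
print(converted_fragment)
```

In the converted URL, something=1 lands in the fragment, so it never shows up among the query parameters. That would match $_GET['something'] being unavailable.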
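First, the HTML you've got is invalid. It should be:
<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is &gt; an aria label">http://google.com</a></p>
[Note that > has been replaced by &gt; in the aria-label attribute]
Is the (classic) editor supposed to try to "fix" bad HTML? If so, then there is a bug somewhere. If not, then I'd say this is "operator error" and this ticket should be closed as invalid.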