Opened 6 years ago
Last modified 8 months ago
#45387 new defect (bug)
Valid HTML get mangled on the frontend
Reported by: | youknowriad | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | |
Component: | Editor | Keywords: | has-patch has-unit-tests |
Focuses: | Cc: |
Description
Open the HTML editor, paste this HTML code
<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p>
And preview the post in frontend.
The HTML is a big mangled in the output, the link don't show up properly.
I suspect that it's related to one of the_content
filters.
Reproduced in 4.9.8
Change History (22)
#2
follow-up:
↓ 4
@
6 years ago
This html is valid. You can try by pasting in the w3c official validator https://validator.w3.org/nu/#textarea
<!DOCTYPE html> <html lang="fr"> <head> <title>title test</title> </head> <body> <p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p> </body> </html>
#3
@
6 years ago
There's also some additional context on why >
is valid inside HTML attributes in this Gutenberg ticket.
#4
in reply to:
↑ 2
@
6 years ago
Replying to youknowriad:
This html is valid. You can try by pasting in the w3c official validator https://validator.w3.org/nu/#textarea
You're right, my bad. It is just <
and &
that must be replaced by character entities or references. But using >
does allow the HTML to rendered correctly on the front-end, so the problem is probably in wptexturize() (which is hooked to the_content
).
Let me see if I can find why it is confused by >
in an attribute value.
#5
@
6 years ago
The problem is in _get_wptexturize_split_regex()
(which is called by wptexturize()
) at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L713.
The regex interprets the >
in the attribute value as the end of start tag (which is why using the character entity does not cause a problem), which results in the HTML getting "mangled" on the front end.
Unfortunately, I don't know how to fix it...I've never been good with regex assertions/lookahead/etc.
The same regex problem also exists in get_html_split_regex() at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L673.
#6
follow-up:
↓ 7
@
6 years ago
Maybe worth pointing out that the function is documented to have been slated for removal by v4.5.0 :
https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?rev=43571#L687
It's unclear from the history whether it's safe to be dropped without any additional changes.
Related:
I'm not certain whether this is on any person's radar for being addressed; my suspicion is it's not.
For the regular expression itself, I sense it may not be possible in a single expression to match the end of an opening tag while exempting ">" which occur within an attribute value.
It's unfortunate that this forces us into a position of needlessly escaping content, as it feels like compounding technical debt. The issue will remain for non-Gutenberg HTML entry.
#7
in reply to:
↑ 6
@
6 years ago
Replying to aduth:
Maybe worth pointing out that the function is documented to have been slated for removal by v4.5.0 :
I haven't looked through the history but I interpreted that comment as calling for replacing that regex with the one in get_html_split_regex()
...which has the same problem.
#9
@
4 years ago
This issue has come up again in a different form in the following Gutenberg bug report:
https://github.com/WordPress/gutenberg/issues/11789#issuecomment-847242464
(In a gist, <script>alert('rock & roll')</script>
works as expected, but <script>if (3 < 4) alert('rock & roll')</script>
yields an alert message with an escaped &
.)
In this case, we can't really work around the issue by escaping content (as described in #6), nor by escaping HTML attributes (as described in #1): we are dealing with a <script>
tag and so special characters like <
and &
should somehow be preserved.
#10
@
3 years ago
I hope this will get more attention soon. The issue @mcsf mentioned is really a pain for JS conditional logic. Consider that the following code will throw an error ("Uncaught SyntaxError: '#' not followed by identifier") when entered into a Gutenberg HTML block directly or as the output of a short code entered in Gutenberg, because & is converted to an HTML entity.
<script> x = 5; y = 10; z = 10 if (x < y) { console.log("X is Greater than Y."); } else if (x !== y && x !== z) { console.log("X is More than Y, and Unique From Z."); } </script>
#11
@
22 months ago
We have recently been made aware of an issue which sounds like this one found by use of the twentytwentythree theme.
An example is if, for example, you are using the twentytwentythree theme with WooCommerce, if you add some custom inline HTML and JS code via the woocommerce_after_add_to_cart_quantity
hook.
We added a select field with some options, the values of these options are URLs and when selected it redirects the user to the URL in the value, these URLs are concatenated strings based on some PHP conditions and add query args like &something=1
to the string which is then set as the value of the select options, the build of the URL and the functionality which does the redirect upon selection are in an inline <script>
tag, the resulting URLs don't remain &something=1
and get converted to #038;something=1
, this then means we cannot use the $_GET['something']
.
If we switch from twentytwentythree to a classic theme, the conversion of &
to #038;
doesn't occur.
This ticket was mentioned in Slack in #core-editor by ndiego. View the logs.
22 months ago
#13
@
20 months ago
With projects like Tailwind CSS and Alpine.js gaining a lot of popularity over the last few years, i've seen increasing numbers of developers troubleshooting why features using greater/smaller than characters break the markup in blocks. A resolution to this would save a lot of devs a lot of headache.
#14
@
17 months ago
I've recently experienced this too. Trying to paste an Active Campaign form embed into the HTML block. The embed is one part style, one part html markup and the last part is a script tag where all validation logic etc is happening.
This code:
if (elem.type != 'radio' && elem.type != 'checkbox')
Becomes:
if (elem.type != 'radio' && elem.type != 'checkbox')
which breaks the script. Sounds like what @ninetyninew reported although I'm using a custom classic theme.
#15
@
15 months ago
Still happening in WordPress 2023. I inverted my logic
if(a && b)
became
if(!(a || b))
since only & is affected 🤷♂️
This ticket was mentioned in PR #5697 on WordPress/wordpress-develop by co6x0.
14 months ago
#17
- Keywords has-patch has-unit-tests added
Ensures valid HTML is worked correctly by wptexturize(), wp_html_split(), etc.
I started working on this PR when I noticed that using TailwindCSS child selectors would break the layout of block theme (also reported in Trac ticket: 57381
).
I have identified a problem with the regular expression defined in _get_wptexturize_split_regex()
used in wptexturize()
.
This problem seemed to be affecting get_the_block_template_html() and causing the block theme layout collapse described above.
Changing this regex fixes the layout issue.
Also, wp_html_split()
uses almost the same regex.
Other trac tickets caused by this function will also be fixed by updating to a similar regex.
According to the HTML reference at html.spec.whatwg.org, attribute values can contain a variety of characters.
With this in mind, I have modified the regex to exclude matching characters within quotation marks.
This fixes the misplacement of GREATER-THAN SIGN(>)
and prevents other valid HTML structures from being mishandled.
I've included tests to cover these changes in tests/phpunit/tests/formatting/wpTexturize.php
and tests/phpunit/tests/formatting/wpHtmlSplit.php
. If there's anything I've missed, please let me know.
Trac ticket: https://core.trac.wordpress.org/ticket/45387
Trac ticket: https://core.trac.wordpress.org/ticket/43457
Trac ticket: https://core.trac.wordpress.org/ticket/57381
14 months ago
#18
Added commit.
Removed tranformation of &
to &
in HTML attribute values modified by <https://core.trac.wordpress.org/ticket/35008>.
This ticket seems to have been created because the W3C HTML Validator found it to be invalid HTML, but as of now, the &
in the URL is valid.
8 months ago
#20
howdy! just wanted to stop by and mention that I've been exploring updating these same functions using the HTML API, which provides a full spec-compliant parse of the HTML stream.
You can find some rough notes on the broader roadmap
#21
@
8 months ago
This Gutenberg issue seems to be caused by wptexturize
mangling otherwise valid HTML: https://github.com/WordPress/gutenberg/issues/56597
First, the HTML you've got is invalid. It should be:
[Note that
>
has been replaced by>
in thearia-label
attribute]Is the (classic) editor supposed to try to "fix" bad HTML? If so, then there is a bug somewhere. If not, then I'd say this "operator error" and this ticket should be closed as
invalid
.