Opened 10 years ago
Last modified 5 years ago
#29913 new defect (bug)
wptexturize should handle broken HTML consistently
Reported by: | kitchin | Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | minor | Version: | 1.5 |
Component: | Formatting | Keywords: | wptexturize needs-unit-tests needs-patch |
Focuses: | Cc: |
Description
Spunoff from ticket:29557#comment:93 because it's not that important.
When encountering broken HTML, wptexturize()
should match web browser behavior, without getting too complicated. This bug is for:
- unclosed comments:
<!-- foo ... texturize those dots?
- unclosed tags:
<a ... texturize those dots?
- valid terrible HTML that WP probably does not need to support:
<div data-cycle-slides="> .slides">
. It's encouraged by the popular jQuery Cycle2 plugin, and its examples.
Currently:
- unclosed comments are handled like a browser: WP thinks it is a comment, so does not texturize the dots (turn them into
…
).
- unclosed tags are not considered a tag, so the dots are texturized. A browser treats it as a tag, by contrast, and we don't normally texturize inside tags. The browser also hides it since it's a tag, so whatever we do is hidden. It's not that important, though it could have side effects on other parsing, plugins, etc.
- valid terrible HTML is not parsed as a tag by WP. Fully parsing valid HTML would slow down WP.
Some of this is already covered by unit tests or may be soon.
Fixing (2) to be consistent with (1) would be easy.
Change History (7)
#3
@
10 years ago
More info about the history of these bugs:
Before 4.0, unclosed HTML tags were sometimes not texturized depending on their position relative to other tags. Given that this bug has never come up before now, it seem like there would be no harm in making those tags always not texturized.
The example code, <div data-cycle-slides="> .slides">
is not allowed by kses as far as I know. I would be uncomfortable allowing this in one place and not the other. And to be honest, I don't think it's a good idea to allow it.
Unclosed comments should be taken care of by #29557.
#5
follow-up:
↓ 6
@
10 years ago
As html5 in Gecko, Webkit and Trident it renders the same as
a <hr> World --> <br>
Pathological cases like that are work for a full html parser. Whereas I'm guessing any simplified parser would strip comments first, giving
a <hr>
#6
in reply to:
↑ 5
@
10 years ago
Replying to kitchin:
Pathological cases like that are work for a full html parser. Whereas I'm guessing any simplified parser would strip comments first, giving
a <hr>
I agree. The new comment parser works great though. Starting with the <hr, everything up to the first > is matched as an HTML element and everything up to the <br is treated as plain text. I couldn't find any new bugs with this test case. :)
Patch could be as simple as adding a
?
to the regexp.The main thing we need to discuss is, what are we going to break with this patch?
Can there be any downside to not texturizing the guts of broken HTML?
Is not texturizing the correct behavior? Should we continue to texturize and rather than ignore the unclosed
<
convert it to a<
instead?