Make WordPress Core

Opened 10 years ago

Last modified 5 years ago

#29913 new defect (bug)

wptexturize should handle broken HTML consistently

Reported by: kitchin's profile kitchin Owned by:
Milestone: Priority: normal
Severity: minor Version: 1.5
Component: Formatting Keywords: wptexturize needs-unit-tests needs-patch
Focuses: Cc:

Description

Spunoff from ticket:29557#comment:93 because it's not that important.

When encountering broken HTML, wptexturize() should match web browser behavior, without getting too complicated. This bug is for:

  1. unclosed comments: <!-- foo ... texturize those dots?
  1. unclosed tags: <a ... texturize those dots?
  1. valid terrible HTML that WP probably does not need to support: <div data-cycle-slides="> .slides">. It's encouraged by the popular jQuery Cycle2 plugin, and its examples.

Currently:

  1. unclosed comments are handled like a browser: WP thinks it is a comment, so does not texturize the dots (turn them into &#8230;).
  1. unclosed tags are not considered a tag, so the dots are texturized. A browser treats it as a tag, by contrast, and we don't normally texturize inside tags. The browser also hides it since it's a tag, so whatever we do is hidden. It's not that important, though it could have side effects on other parsing, plugins, etc.
  1. valid terrible HTML is not parsed as a tag by WP. Fully parsing valid HTML would slow down WP.

Some of this is already covered by unit tests or may be soon.

Fixing (2) to be consistent with (1) would be easy.

Change History (7)

#1 @kitchin
10 years ago

  • Severity changed from normal to minor

#2 @miqrogroove
10 years ago

  • Keywords wptexturize needs-unit-tests added

Patch could be as simple as adding a ? to the regexp.

The main thing we need to discuss is, what are we going to break with this patch?

Can there be any downside to not texturizing the guts of broken HTML?

Is not texturizing the correct behavior? Should we continue to texturize and rather than ignore the unclosed < convert it to a &lt; instead?

Last edited 10 years ago by miqrogroove (previous) (diff)

#3 @miqrogroove
10 years ago

More info about the history of these bugs:

Before 4.0, unclosed HTML tags were sometimes not texturized depending on their position relative to other tags. Given that this bug has never come up before now, it seem like there would be no harm in making those tags always not texturized.

The example code, <div data-cycle-slides="> .slides"> is not allowed by kses as far as I know. I would be uncomfortable allowing this in one place and not the other. And to be honest, I don't think it's a good idea to allow it.

Unclosed comments should be taken care of by #29557.

#4 @miqrogroove
10 years ago

Interesting test:

a <hr / <!-- Hello > World --> <br />

#5 follow-up: @kitchin
10 years ago

As html5 in Gecko, Webkit and Trident it renders the same as
a <hr> World --&gt; <br>
Pathological cases like that are work for a full html parser. Whereas I'm guessing any simplified parser would strip comments first, giving
a <hr>

#6 in reply to: ↑ 5 @miqrogroove
10 years ago

Replying to kitchin:

Pathological cases like that are work for a full html parser. Whereas I'm guessing any simplified parser would strip comments first, giving
a <hr>

I agree. The new comment parser works great though. Starting with the <hr, everything up to the first > is matched as an HTML element and everything up to the <br is treated as plain text. I couldn't find any new bugs with this test case. :)

#7 @chriscct7
9 years ago

  • Keywords needs-patch added
Note: See TracTickets for help on using tickets.