Make WordPress Core

Opened 10 years ago

Closed 9 years ago

#30986 closed enhancement (wontfix)

Consider using an abstract syntax tree for wpautop

Reported by: ericlewis's profile ericlewis Owned by:
Milestone: Priority: normal
Severity: normal Version: 0.71
Component: Formatting Keywords:
Focuses: Cc:

Description

wpautop()has a bunch of problems.

wpautop() uses basic string-manipulation to figure out where to put paragraph tags. I think, although this approach has gotten us pretty far, it may be the wrong approach in general. HTML is a loosely defined language, so a more robust tool might be a better fit.

Browsers typically parse HTML into an abstract syntax tree via lexical analysis. Here's more than you ever wanted to know on that topic by Tali Garsiel and Paul Irish.

I wonder if this would be a more sturdy approach to auto-adding paragraph tags, and if there's existing WordPress-compatible software we could use.

Attachments (6)

class.wpautop.php (4.7 KB) - added by ericlewis 10 years ago.
30986.diff (9.8 KB) - added by ericlewis 10 years ago.
30986.2.diff (9.8 KB) - added by ericlewis 10 years ago.
30986.3.diff (541 bytes) - added by ericlewis 10 years ago.
30986.4.diff (41.1 KB) - added by ericlewis 10 years ago.
30986.5.diff (11.8 KB) - added by ericlewis 10 years ago.

Download all attachments as: .zip

Change History (13)

#1 @ericlewis
10 years ago

Started fooling around with a lexical analysis class for wpautop - attachment:class.wpautop.php, which uses PHP's DOMDocument class to parse HTML.

It's a rough draft, but curious if folks dig this approach.

Last edited 10 years ago by ericlewis (previous) (diff)

#2 @dd32
10 years ago

I think this is an interesting approach to look into, the main issues I see:

  1. Speed; wpautop() is called on the front end, needs to be a decent speed
  2. Security; DOMDocument needs to have it's external entities loading disabled at a minimum, any unknown quirks in it's HTML parsing could bypass kses potentially (I'm thinking worst-case scenario here)
  3. Availability; DOMDocument can be disabled in PHP, as can other XML parsing tools, and we've got conditionals on it's use elsewhere within core already I believe.

@ericlewis
10 years ago

@ericlewis
10 years ago

#3 @ericlewis
10 years ago

attachment:30986.2.diff is a patch against core which includes my second iteration.

The internal logic is easier to read. It now adds paragraph tags to roughly the proper locations.

Unit tests fail for two reasons:

  • wpautop() performs string trimming and \n stripping that the class would need to match.
  • DOMDocument replaces characters with HTML entities, which creates valid HTML but we might consider reversing for back-compat.

attachment:30986.2.diff also includes benchmarking versus wpautop().

Replying to dd32:

  1. Speed; wpautop() is called on the front end, needs to be a decent speed

Here's benchmark results from my basic test:

Seconds for 10000 wpautop() invocations: 0.020344972610474
Seconds for 10000 WP_AutoP class invocations: 0.34989809989929

So it looks like the class is 17x slower than wpautop() in its current form.

Replying to dd32:

Availability; DOMDocument can be disabled in PHP, as can other XML parsing tools, and we've got conditionals on it's use elsewhere within core already I believe.

That is a bummer. I wonder if this is a dealbreaker.

Last edited 10 years ago by ericlewis (previous) (diff)

#4 @ericlewis
10 years ago

Last edited 10 years ago by ericlewis (previous) (diff)

@ericlewis
10 years ago

@ericlewis
10 years ago

#5 @ericlewis
10 years ago

Replying to ericlewis:

Replying to dd32:

  1. Speed; wpautop() is called on the front end, needs to be a decent speed

Here's benchmark results from my basic test:

Seconds for 10000 wpautop() invocations: 0.020344972610474
Seconds for 10000 WP_AutoP class invocations: 0.34989809989929

So it looks like the class is 17x slower than wpautop() in its current form.

Scratch that, I bungled the test case here. In attachment:30986.4.diff I fixed the benchmarks, which produces these results:

Seconds for 1000 wpautop() invocations: 3.852117061615
Seconds for 1000 WP_AutoP class invocations: 8.0365650653839

So the class is ~1x slower than wpautop() in its current form.

@ericlewis
10 years ago

#6 @ericlewis
10 years ago

In attachment:30986.5.diff, remove the testing and short-circuit wpautop() with an invocation of the class.

Add better support for inline elements and the $br parameter.

Interestingly, DOMDocument seems to sanitize the href attribute in <a> tags, so I added placeholder support so $ and %3 passes through.

Will continue debugging unit tests until there's functional parity.

#7 @ericlewis
9 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to wontfix
  • Status changed from new to closed

This was fun to riff on but as Dion mentioned DOMDocument can be disabled. Closing out.

Note: See TracTickets for help on using tickets.