Opened 10 years ago
Closed 9 years ago
#30986 closed enhancement (wontfix)
Consider using an abstract syntax tree for wpautop
Reported by: | ericlewis | Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 0.71 |
Component: | Formatting | Keywords: | |
Focuses: | Cc: |
Description
wpautop()
has a bunch of problems.
wpautop()
uses basic string-manipulation to figure out where to put paragraph tags. I think, although this approach has gotten us pretty far, it may be the wrong approach in general. HTML is a loosely defined language, so a more robust tool might be a better fit.
Browsers typically parse HTML into an abstract syntax tree via lexical analysis. Here's more than you ever wanted to know on that topic by Tali Garsiel and Paul Irish.
I wonder if this would be a more sturdy approach to auto-adding paragraph tags, and if there's existing WordPress-compatible software we could use.
Attachments (6)
Change History (13)
#2
@
10 years ago
I think this is an interesting approach to look into, the main issues I see:
- Speed;
wpautop()
is called on the front end, needs to be a decent speed - Security;
DOMDocument
needs to have it's external entities loading disabled at a minimum, any unknown quirks in it's HTML parsing could bypasskses
potentially (I'm thinking worst-case scenario here) - Availability;
DOMDocument
can be disabled in PHP, as can other XML parsing tools, and we've got conditionals on it's use elsewhere within core already I believe.
#3
@
10 years ago
attachment:30986.2.diff is a patch against core which includes my second iteration.
The internal logic is easier to read. It now adds paragraph tags to roughly the proper locations.
Unit tests fail for two reasons:
wpautop()
performs string trimming and\n
stripping that the class would need to match.- DOMDocument replaces characters with HTML entities, which creates valid HTML but we might consider reversing for back-compat.
attachment:30986.2.diff also includes benchmarking versus wpautop().
Replying to dd32:
- Speed;
wpautop()
is called on the front end, needs to be a decent speed
Here's benchmark results from my basic test:
Seconds for 10000 wpautop() invocations: 0.020344972610474 Seconds for 10000 WP_AutoP class invocations: 0.34989809989929
So it looks like the class is 17x slower than wpautop()
in its current form.
Replying to dd32:
Availability; DOMDocument can be disabled in PHP, as can other XML parsing tools, and we've got conditionals on it's use elsewhere within core already I believe.
That is a bummer. I wonder if this is a dealbreaker.
#4
@
10 years ago
4 years ago mrclay did something similar.
#5
@
10 years ago
Replying to ericlewis:
Replying to dd32:
- Speed;
wpautop()
is called on the front end, needs to be a decent speedHere's benchmark results from my basic test:
Seconds for 10000 wpautop() invocations: 0.020344972610474 Seconds for 10000 WP_AutoP class invocations: 0.34989809989929So it looks like the class is 17x slower than
wpautop()
in its current form.
Scratch that, I bungled the test case here. In attachment:30986.4.diff I fixed the benchmarks, which produces these results:
Seconds for 1000 wpautop() invocations: 3.852117061615 Seconds for 1000 WP_AutoP class invocations: 8.0365650653839
So the class is ~1x slower than wpautop()
in its current form.
#6
@
10 years ago
In attachment:30986.5.diff, remove the testing and short-circuit wpautop()
with an invocation of the class.
Add better support for inline elements and the $br
parameter.
Interestingly, DOMDocument seems to sanitize the href
attribute in <a>
tags, so I added placeholder support so $
and %3
passes through.
Will continue debugging unit tests until there's functional parity.
#7
@
9 years ago
- Milestone Awaiting Review deleted
- Resolution set to wontfix
- Status changed from new to closed
This was fun to riff on but as Dion mentioned DOMDocument can be disabled. Closing out.
Started fooling around with a lexical analysis class for wpautop - attachment:class.wpautop.php, which uses PHP's
DOMDocument
class to parse HTML.It's a rough draft, but curious if folks dig this approach.