Opened 11 years ago
Last modified 9 months ago
#25856 new enhancement
debug of wpautop function with Unit Tests
Reported by: | mdbitz | Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 3.8 |
Component: | Formatting | Keywords: | wpautop needs-patch |
Focuses: | performance | Cc: |
Description (last modified by )
Cleanup / Refactor of wpautop function
Upon review of the WordPress Trac needs-unit-test filter I saw that the wpautop function was needing both cleanup and unit tests. I have created a comprehensive test suite for the function that covers in my opinion the full breadth of use cases for the function. During this process I have been working on modifying the wpautop function to resolve out open defects and remove fix code that was placed to resolve an issue without resolving the underlying cause of the error.
I've attached the following files for support of the changes.
autop-20622.out --> OUTPUT from running the unit tests against current code base (revision 20622)
autop-diff.out --> OUTPUT from running the unit test against modififed wpautop function as given in the diff.
Autop.diff --> diff of changes made to the Autop test class :: Unit Tests
formatting.diff --> diff of changes made to the formatting.php function file :: wpautop changes
As a note I've added a trim call to the end of wpautop as it was always appending a new line, which I did not think was needed. This causes a good amount of test cases to fail as the expected output is not containing the additional new line given by current implementation.
Enhancements / Fixes to wpautop function
- Corrected core logic to identify when double new lines wrap open or close block element tags due to nested block level elements. Original Logic adds \n\n before open tags and \n\n after close tags which allows for output of <p><div> TEXT</p> that is not correct, and had several patches implemented to attempt to fix.
*Fixes :: Formatting of Lists, Nested Lists, orphaned <p> tags
- Added logic to not add <br/>tags after comments
- Added logic to not convert \n for svg, math, style, select or script codes
- Added logic to not format audio, video or object elements. This is implemented as the elements are in-line but can contain block level content for displaying in browsers that don't support the HTML5 elements.
- case insensitive recognition of block tags
- inclusion of variable re-naming as outlined in #25516
Overview of Tickets reviewed and covered by the Unit Tests
Ticket : Current Status : Summary
#6877 : Closed : large posts causes empty return
#3476 : Closed : Improper formatting of Object elements, additional <br/> and no ending </p>
#3669 : Closed : Don't insert <p> tags inside of block elements that only contain inline elements/text
#6809 : Closed : wpautop correctly handles auto <p> of basic text
#11024 : Closed : Incorrect handling of div tags with nested inline tags
#1305 : Closed : Handling of <code> blocks>
#1706 : Closed : <p> tags with attributes introduce additional <br/>
#2285 : Closed : Strip excessive <br /> in content
#2813 : Closed : <br> elements added instead of <br />
#3007 : Closed : Spacing of text in block elments reveals improper handling of nested elements
#3035 : Closed : Extra <br/> tags introduced in Object tags.
#3238 : Closed : Dangling <p> tags, no closing
#3621 : Closed : Don't format new lines in Script and Object tags
#3854 : Closed : Don't format new lines in Script and Style tags
#3935 : Closed (wf) : Erroneous </p> tags added to the content has fix
#3952 : Closed : Don't autop hr tags
#5250 : OPEN : incorrect handling of lists + nested Lists has fix
#7937 : Closed : incorrect handling of inline tags
#7988 : Closed (wf) : incorrect handling of nested tags has fix
#8644 : Closed : HTML5 block elements
#10033 : OPEN : autop does not handle Comments has fix *also contains wptexturize details that I am not sure are fully closed out.
#11257 : Closed : autop errors out on large content
#11678 : OPEN : autop does not recognize upper tags has fix
#13340 : OPEN : nested MATH tag handling has fix
#2833 : OPEN : No formating Style and Script tags has fix
#4298 : Accepted : handling of tags with attributes on new lines fix already existed
#3833 : OPEN : incorrect </p> tags in block elements (explicitly blockquote) has fix
#6041 : Closed : incorrect handling of DL lists and DT,DD elements
#9437 : OPEN : dont modify SVG content has fix
#10247 : Closed : no new lines in Video, Audio, Source
#4857 : OPEN : incorrect autop logic has fix
#15918 : OPEN : incorrect handling of nested tags has fix
#16456 : Closed : input is not block level
#3054 : Closed : input is not block level
#16790 : Reviewing : functional specifications can close
#18534 : Closed : noscript are block level
#20444 : OPEN : Single line handling (nested block level issue) has fix
#18136 : OPEN : extra </p> tags being added due to white space has fix
#23135 : OPEN : block element filter close as won't fix
#18807 : Closed : samp is not block level
#25516 : OPEN : variable name changes changes included
#9305 : Closed : handling of empty content
#6218 : Closed : handling of empty content
#11947 : Closed : no formating of select options has fix
#12335 : Closed : HTML5 Block Elements
Tickets concerning ShortCode handling
Ticket : Current Status
#12061 : OPEN
#6984 : OPEN
#21689 : OPEN
#24085 : OPEN
#24846 : OPEN
Currently the wpautop function does not concern itself with shortcodes. I suggest that 12061 remain open as a future enhancement and that other tickets can be closed referending 12061 unless there is issues with shortcode_unatop or other functions used.
Tickets Reviewed but Ignored/ Not Tested/ Duplicate
Ticket : Current Status : Summary
#11249 : Closed : standalone function
#10490 : Closed : Duplicate of #10082
#9218 : Closed : Duplicate of #4298
#19934 : Closed : Duplicate of #16456
#18330 : Rejected : Enhancement
#18514 : Rejected : Enhancement
#23858 : Closed : Duplicate of #18807
#23375 : Closed : Duplicate of #18807
Attachments (10)
Change History (35)
#3
@
11 years ago
@mdbitz: I haven't reviewed this yet, but gold star + like button + 10 upvotes for your work here.
#4
@
11 years ago
Seriously, great effort.
Related note - parts of formatting.diff would resolve #25615.
#5
@
11 years ago
- Milestone changed from Awaiting Review to 3.8
25856.diff is a combined diff with some code style changes. @mdbitz: the @dataProvider nestedTagsToIgnoreProvider
is missing in Tests_Formatting_Autop::test_no_formatting_of_predefined_tags()
.
That is the only test that fails out of the 225 (Two Hundred And Twenty Five) tests and 295 (Two Hundred And Ninety Five) assertions provided. This guy is slowly becoming my hero.
#6
@
11 years ago
Sorry I decided the name of that test case wasn't accurate and never updated the data provider to match. Please see 25856_fix.diff it contains the correctly named dataProvider. This diff only contains the Unit Tests in it so might need to be combined.
#7
@
11 years ago
This is fantastic work.
Since it is so massive, let's split the final test's data into two separate text files — tests/phpunit/data/formatting/wpautop-raw.txt and wpautop-expected.txt (or something). Ideally, test_verify_large_content() is actually split into a separate test for each block of text it is dealing with. I'm sure I'll have more feedback on how to best segment the tests as I continue to review this. Great work.
Functions like wpautop() definitely can only be improved upon with adequate unit test coverage. But there is also the need to balance that with performance. We need to make sure the patch to wpautop() keeps that function about as performant as it is now. We could benefit from some proper before-and-after profiling/benchmarking here for sure.
#8
@
11 years ago
Some profiling/benchmarking would be a good idea. I haven't done very much work in that space before but would be interested in giving it a crack. Are there any profiling standards or sample scenarios that are being done for other sections of WordPress. Also is performance testing something that should potentially be integrated into the test suite? It would be potentially subjective to the environment but might make sure that critical path of WordPress doesn't degrade as new releases/functionality is added.
I'll take a pass at separating the large content out to supporting files. The idea behind the test came from a previous defect where wpautop would return empty string on large content, so splitting the data would make the test obsolete.
#11
follow-up:
↓ 14
@
11 years ago
Yes, this needs more work still, but should have also already been in beta for more adequate testing to begin with. Not to say that this doesn't have very thorough unit test coverage, that's looking great.
As mentioned earlier, performance is a big concern here too, so we really also need time to put together some profiling data on this one.
I'd say punt for now.
Looking back at the unit tests though, I am concerned about where all this test content is coming from:
- Certainly
test_no_format_style_tags()
andtest_no_format_script_tags()
don't actually need full code samples to test that they work properly, maybe only a couple lines of JS/CSS, and preferably something copied from WP itself ideally, rather than some random Google Analytics code snippet. - Why do we need all these references to some arbitrary (but actually real) blog: "blogyul.miqrogroove.com"? Do we have appropriate copyright permissions on the content copied from there (arbitrary content published on a WP blog isn't automatically GPL)?
- What is this Romanian text used? It looks like legal code, so I'm guessing it might actually be public domain, but not all countries actually have public domain, nor does public domain mean the same thing in all countries. No-one has even asked or said what it is though and why it's safe. I'm certain we can find better test content for that as well.
#12
@
11 years ago
- Milestone changed from 3.8 to Future Release
This should not have been slated for 3.8. It's great, but a number of comments here highlight the need for further testing and also analyzing its performance.
#13
@
11 years ago
Hi, I did an initial xdebug profile of the tests and the cache grind output for current wpautop function and the patced version can be found at following location. (output to large to attach)
[current wpautop function|http://mdbitz.com/profile/callgrind.out.orig]
[patched wpautop function|http://mdbitz.com/profile/callgrind.out.changes]
I'm not seeing any performance hits based on the unit tests coverage. Let me know what benchmarking everyone feels is appropriate so we can setup some additional scenarios to reference.
#14
in reply to:
↑ 11
@
11 years ago
Replying to bpetty:
- Certainly
test_no_format_style_tags()
andtest_no_format_script_tags()
don't actually need full code samples to test that they work properly, maybe only a couple lines of JS/CSS, and preferably something copied from WP itself ideally, rather than some random Google Analytics code snippet.
We can switch these to any JS/CSS lines as long as they contain comments and multiple new lines it will allow us to verify the logic.
- Why do we need all these references to some arbitrary (but actually real) blog: "blogyul.miqrogroove.com"? Do we have appropriate copyright permissions on the content copied from there (arbitrary content published on a WP blog isn't automatically GPL)?
This content came from examples linked to previous tickets as such was used as a valid test scenario. We could replace if desired, would just need to create a large content section with elements.
- What is this Romanian text used? It looks like legal code, so I'm guessing it might actually be public domain, but not all countries actually have public domain, nor does public domain mean the same thing in all countries. No-one has even asked or said what it is though and why it's safe. I'm certain we can find better test content for that as well.
Same as above this came from a previuos ticket. Could swap for lorem ipsum or public domain text of same/greater length and new lines.
#15
@
11 years ago
Maybe I'm missing something, but your cachegrind profiles don't actually seem to contain any data on calls to wpautop:
Here's your original: wpautop-orig.png
Here's your patched profile: wpautop-25856_fix.diff.png
The unit tests frequently spawn separate processes for tests though, so maybe you just pulled the wrong profile files?
#16
@
11 years ago
Nevermind, I found it... was just restricting to 90% of call time.
Original:
Function: wpautop
Invocation Count: 229
Total Self Cost: 1.43
Total Inclusive Cost: 1.55
Patched:
Function: wpautop
Invocation Count: 298
Total Self Cost: 4.95
Total Inclusive Cost: 5.35
It does look slower in your profiles with the patched version, but also note that it was called a different number of times too. Did you run these with the same exact set of tests?
#17
@
11 years ago
Here's the call and called from breakdown from both of those:
There's definitely a different set of tests being run, it looks like one was run with these new tests while the original was tested with the current tests, but to get an accurate comparison, it would be helpful to run the patched version with the original tests still too.
Overall, it looks like it takes approximately 150% to 500% more time to process, but there are a few small test cases where it was faster.
#18
@
11 years ago
I'm also very happy to see these patches, really good work! It is about time to make wpautop() fully HTML 5.0 compliant.
Some "first-look" notes:
- There are no tags like
<header />
,<footer />
,<menu />
, etc. in HTML 5.0. All of these are completely invalid as they *require* closing tags, and the self-closing ending is optional. The tests should use valid tags. - I'm alot unhappy about removing the humorous var names. Why?? :)
- The purpose of this function is to replace line breaks with
<p>
s. Don't think it should be creating<p>
s from thin air when there are no line breaks, or prettifying the html. It is quite expensive at the moment, making it even more expensive is unwise imho. - There is a typo in
$blocklist // List of Blocking elements (minus p)
, it still contains|p|
. - The part
preg_replace('!(<' . $allblocks . '[^>]*\/>)!i', "$1\n\n"...
is not needed. As far as I know there aren't any block elements in HTML 5.0 that don't require a closing tag. - Not sure which tags
preg_match( '/<p[^>]*>.*?<\/p[^>]*>/si'...
is supposed to match. Currently it matches<p>|<pre>|<param>|<progress>
. Doesn't look like this was intended. If we need only<p>
, the start of the regex would be/<p(?: [^>]+)?>...
, not sure why the closing tag is expanded. - On the same line: the
preg_match( '/<' . $allblocks .'[^>]*\/>/si'...
doesn't make sense. See the first point. - Don't see a point in preserving formatting/line breaks in
<audio>
,<video>
and<object>
. The only HTML tag where this matters is<pre>
. All line breaks in the other tags should probably be stripped, saving some processing time. Further, this will fail for<object>
as it can be nested and often is.
There are no "self-closing" tags in HTML 5.0. Instead some tags don't have a closing tag and some closing tags can be omitted. Among them is <p>
. Been thinking whether it is practical to stop outputting </p>
s from wpautop(). This has been supported in all browsers since the early days (mid 90's) and works really well. It was "invalid" in the XHTML 1.0 era but that time is long gone.
Another thing I've been thinking about is to return early from wpautop() when there are no line breaks in the source. This will not only make it faster for one-paragraph posts, but would allow for storing "full-html" content that would bypass autop automatically.
PHPUnit Test Cases of wpautop function