Make WordPress Core

Opened 11 years ago

Last modified 9 months ago

#25856 new enhancement

debug of wpautop function with Unit Tests

Reported by: mdbitz's profile mdbitz Owned by:
Milestone: Priority: normal
Severity: normal Version: 3.8
Component: Formatting Keywords: wpautop needs-patch
Focuses: performance Cc:

Description (last modified by SergeyBiryukov)

Cleanup / Refactor of wpautop function

Upon review of the WordPress Trac needs-unit-test filter I saw that the wpautop function was needing both cleanup and unit tests. I have created a comprehensive test suite for the function that covers in my opinion the full breadth of use cases for the function. During this process I have been working on modifying the wpautop function to resolve out open defects and remove fix code that was placed to resolve an issue without resolving the underlying cause of the error.

I've attached the following files for support of the changes.

autop-20622.out --> OUTPUT from running the unit tests against current code base (revision 20622)
autop-diff.out --> OUTPUT from running the unit test against modififed wpautop function as given in the diff.
Autop.diff --> diff of changes made to the Autop test class :: Unit Tests
formatting.diff --> diff of changes made to the formatting.php function file :: wpautop changes

As a note I've added a trim call to the end of wpautop as it was always appending a new line, which I did not think was needed. This causes a good amount of test cases to fail as the expected output is not containing the additional new line given by current implementation.


Enhancements / Fixes to wpautop function

  1. Corrected core logic to identify when double new lines wrap open or close block element tags due to nested block level elements. Original Logic adds \n\n before open tags and \n\n after close tags which allows for output of <p><div> TEXT</p> that is not correct, and had several patches implemented to attempt to fix.

*Fixes :: Formatting of Lists, Nested Lists, orphaned <p> tags

  1. Added logic to not add <br/>tags after comments
  1. Added logic to not convert \n for svg, math, style, select or script codes
  1. Added logic to not format audio, video or object elements. This is implemented as the elements are in-line but can contain block level content for displaying in browsers that don't support the HTML5 elements.
  1. case insensitive recognition of block tags
  1. inclusion of variable re-naming as outlined in #25516

Overview of Tickets reviewed and covered by the Unit Tests

Ticket : Current Status : Summary
#6877 : Closed : large posts causes empty return
#3476 : Closed : Improper formatting of Object elements, additional <br/> and no ending </p>
#3669 : Closed : Don't insert <p> tags inside of block elements that only contain inline elements/text
#6809 : Closed : wpautop correctly handles auto <p> of basic text
#11024 : Closed : Incorrect handling of div tags with nested inline tags
#1305 : Closed : Handling of <code> blocks>
#1706 : Closed : <p> tags with attributes introduce additional <br/>
#2285 : Closed : Strip excessive <br /> in content
#2813 : Closed : <br> elements added instead of <br />
#3007 : Closed : Spacing of text in block elments reveals improper handling of nested elements
#3035 : Closed : Extra <br/> tags introduced in Object tags.
#3238 : Closed : Dangling <p> tags, no closing
#3621 : Closed : Don't format new lines in Script and Object tags
#3854 : Closed : Don't format new lines in Script and Style tags
#3935 : Closed (wf) : Erroneous </p> tags added to the content has fix
#3952 : Closed : Don't autop hr tags
#5250 : OPEN : incorrect handling of lists + nested Lists has fix
#7937 : Closed : incorrect handling of inline tags
#7988 : Closed (wf) : incorrect handling of nested tags has fix
#8644 : Closed : HTML5 block elements
#10033 : OPEN : autop does not handle Comments has fix *also contains wptexturize details that I am not sure are fully closed out.
#11257 : Closed : autop errors out on large content
#11678 : OPEN : autop does not recognize upper tags has fix
#13340 : OPEN : nested MATH tag handling has fix
#2833 : OPEN : No formating Style and Script tags has fix
#4298 : Accepted : handling of tags with attributes on new lines fix already existed
#3833 : OPEN : incorrect </p> tags in block elements (explicitly blockquote) has fix
#6041 : Closed : incorrect handling of DL lists and DT,DD elements
#9437 : OPEN : dont modify SVG content has fix
#10247 : Closed : no new lines in Video, Audio, Source
#4857 : OPEN : incorrect autop logic has fix
#15918 : OPEN : incorrect handling of nested tags has fix
#16456 : Closed : input is not block level
#3054 : Closed : input is not block level
#16790 : Reviewing : functional specifications can close
#18534 : Closed : noscript are block level
#20444 : OPEN : Single line handling (nested block level issue) has fix
#18136 : OPEN : extra </p> tags being added due to white space has fix
#23135 : OPEN : block element filter close as won't fix
#18807 : Closed : samp is not block level
#25516 : OPEN : variable name changes changes included
#9305 : Closed : handling of empty content
#6218 : Closed : handling of empty content
#11947 : Closed : no formating of select options has fix
#12335 : Closed : HTML5 Block Elements

Tickets concerning ShortCode handling

Ticket : Current Status
#12061 : OPEN
#6984 : OPEN
#21689 : OPEN
#24085 : OPEN
#24846 : OPEN

Currently the wpautop function does not concern itself with shortcodes. I suggest that 12061 remain open as a future enhancement and that other tickets can be closed referending 12061 unless there is issues with shortcode_unatop or other functions used.

Tickets Reviewed but Ignored/ Not Tested/ Duplicate


Ticket : Current Status : Summary
#11249 : Closed : standalone function
#10490 : Closed : Duplicate of #10082
#9218 : Closed : Duplicate of #4298
#19934 : Closed : Duplicate of #16456
#18330 : Rejected : Enhancement
#18514 : Rejected : Enhancement
#23858 : Closed : Duplicate of #18807
#23375 : Closed : Duplicate of #18807

Attachments (10)

Autop.diff (179.9 KB) - added by mdbitz 11 years ago.
PHPUnit Test Cases of wpautop function
formatting.diff (9.7 KB) - added by mdbitz 11 years ago.
diff of formatting.php function file with wpautop enhancements
autop-26022.out (97.0 KB) - added by mdbitz 11 years ago.
PHPUnit Test output against trunk (version 26022)
autop-diff.out (613 bytes) - added by mdbitz 11 years ago.
PHPUnit Test output against updated function
25856.diff (190.6 KB) - added by wonderboymusic 11 years ago.
25856_fix.diff (180.0 KB) - added by mdbitz 11 years ago.
Correction to Unit Test data Provider naming
wpautop-orig.png (56.0 KB) - added by bpetty 11 years ago.
wpautop-25856_fix.diff.png (86.7 KB) - added by bpetty 11 years ago.
wpautop-orig.pdf (174.8 KB) - added by bpetty 11 years ago.
wpautop-25856_fix.diff.pdf (214.1 KB) - added by bpetty 11 years ago.

Download all attachments as: .zip

Change History (35)

@mdbitz
11 years ago

PHPUnit Test Cases of wpautop function

@mdbitz
11 years ago

diff of formatting.php function file with wpautop enhancements

@mdbitz
11 years ago

PHPUnit Test output against trunk (version 26022)

@mdbitz
11 years ago

PHPUnit Test output against updated function

#1 @mdbitz
11 years ago

  • Cc matt@… added

#2 @SergeyBiryukov
11 years ago

  • Description modified (diff)

#3 @wonderboymusic
11 years ago

@mdbitz: I haven't reviewed this yet, but gold star + like button + 10 upvotes for your work here.

#4 @jeremyfelt
11 years ago

Seriously, great effort.

Related note - parts of formatting.diff would resolve #25615.

#5 @wonderboymusic
11 years ago

  • Milestone changed from Awaiting Review to 3.8

25856.diff​ is a combined diff with some code style changes. @mdbitz: the @dataProvider nestedTagsToIgnoreProvider is missing in Tests_Formatting_Autop::test_no_formatting_of_predefined_tags().

That is the only test that fails out of the 225 (Two Hundred And Twenty Five) tests and 295 (Two Hundred And Ninety Five) assertions provided. This guy is slowly becoming my hero.

#6 @mdbitz
11 years ago

Sorry I decided the name of that test case wasn't accurate and never updated the data provider to match. Please see 25856_fix.diff it contains the correctly named dataProvider. This diff only contains the Unit Tests in it so might need to be combined.

@mdbitz
11 years ago

Correction to Unit Test data Provider naming

#7 @nacin
11 years ago

This is fantastic work.

Since it is so massive, let's split the final test's data into two separate text files — tests/phpunit/data/formatting/wpautop-raw.txt and wpautop-expected.txt (or something). Ideally, test_verify_large_content() is actually split into a separate test for each block of text it is dealing with. I'm sure I'll have more feedback on how to best segment the tests as I continue to review this. Great work.

Functions like wpautop() definitely can only be improved upon with adequate unit test coverage. But there is also the need to balance that with performance. We need to make sure the patch to wpautop() keeps that function about as performant as it is now. We could benefit from some proper before-and-after profiling/benchmarking here for sure.

#8 @mdbitz
11 years ago

Some profiling/benchmarking would be a good idea. I haven't done very much work in that space before but would be interested in giving it a crack. Are there any profiling standards or sample scenarios that are being done for other sections of WordPress. Also is performance testing something that should potentially be integrated into the test suite? It would be potentially subjective to the environment but might make sure that critical path of WordPress doesn't degrade as new releases/functionality is added.

I'll take a pass at separating the large content out to supporting files. The idea behind the test came from a previous defect where wpautop would return empty string on large content, so splitting the data would make the test obsolete.

#9 @kovshenin
11 years ago

  • Cc kovshenin added

#10 @matt
11 years ago

Seems big, and getting later for 3.8. What do you folks think?

#11 follow-up: @bpetty
11 years ago

Yes, this needs more work still, but should have also already been in beta for more adequate testing to begin with. Not to say that this doesn't have very thorough unit test coverage, that's looking great.

As mentioned earlier, performance is a big concern here too, so we really also need time to put together some profiling data on this one.

I'd say punt for now.

Looking back at the unit tests though, I am concerned about where all this test content is coming from:

  • Certainly test_no_format_style_tags() and test_no_format_script_tags() don't actually need full code samples to test that they work properly, maybe only a couple lines of JS/CSS, and preferably something copied from WP itself ideally, rather than some random Google Analytics code snippet.
  • Why do we need all these references to some arbitrary (but actually real) blog: "blogyul.miqrogroove.com"? Do we have appropriate copyright permissions on the content copied from there (arbitrary content published on a WP blog isn't automatically GPL)?
  • What is this Romanian text used? It looks like legal code, so I'm guessing it might actually be public domain, but not all countries actually have public domain, nor does public domain mean the same thing in all countries. No-one has even asked or said what it is though and why it's safe. I'm certain we can find better test content for that as well.

#12 @nacin
11 years ago

  • Milestone changed from 3.8 to Future Release

This should not have been slated for 3.8. It's great, but a number of comments here highlight the need for further testing and also analyzing its performance.

#13 @mdbitz
11 years ago

Hi, I did an initial xdebug profile of the tests and the cache grind output for current wpautop function and the patced version can be found at following location. (output to large to attach)

[current wpautop function|http://mdbitz.com/profile/callgrind.out.orig]
[patched wpautop function|http://mdbitz.com/profile/callgrind.out.changes]

I'm not seeing any performance hits based on the unit tests coverage. Let me know what benchmarking everyone feels is appropriate so we can setup some additional scenarios to reference.

#14 in reply to: ↑ 11 @mdbitz
11 years ago

Replying to bpetty:

  • Certainly test_no_format_style_tags() and test_no_format_script_tags() don't actually need full code samples to test that they work properly, maybe only a couple lines of JS/CSS, and preferably something copied from WP itself ideally, rather than some random Google Analytics code snippet.

We can switch these to any JS/CSS lines as long as they contain comments and multiple new lines it will allow us to verify the logic.

  • Why do we need all these references to some arbitrary (but actually real) blog: "blogyul.miqrogroove.com"? Do we have appropriate copyright permissions on the content copied from there (arbitrary content published on a WP blog isn't automatically GPL)?

This content came from examples linked to previous tickets as such was used as a valid test scenario. We could replace if desired, would just need to create a large content section with elements.

  • What is this Romanian text used? It looks like legal code, so I'm guessing it might actually be public domain, but not all countries actually have public domain, nor does public domain mean the same thing in all countries. No-one has even asked or said what it is though and why it's safe. I'm certain we can find better test content for that as well.

Same as above this came from a previuos ticket. Could swap for lorem ipsum or public domain text of same/greater length and new lines.

@bpetty
11 years ago

#15 @bpetty
11 years ago

Maybe I'm missing something, but your cachegrind profiles don't actually seem to contain any data on calls to wpautop:

Here's your original: wpautop-orig.png

Here's your patched profile: wpautop-25856_fix.diff.png

The unit tests frequently spawn separate processes for tests though, so maybe you just pulled the wrong profile files?

#16 @bpetty
11 years ago

Nevermind, I found it... was just restricting to 90% of call time.

Original:

Function: wpautop
Invocation Count: 229
Total Self Cost: 1.43
Total Inclusive Cost: 1.55

Patched:

Function: wpautop
Invocation Count: 298
Total Self Cost: 4.95
Total Inclusive Cost: 5.35

It does look slower in your profiles with the patched version, but also note that it was called a different number of times too. Did you run these with the same exact set of tests?

@bpetty
11 years ago

#17 @bpetty
11 years ago

Here's the call and called from breakdown from both of those:

There's definitely a different set of tests being run, it looks like one was run with these new tests while the original was tested with the current tests, but to get an accurate comparison, it would be helpful to run the patched version with the original tests still too.

Overall, it looks like it takes approximately 150% to 500% more time to process, but there are a few small test cases where it was faster.

#18 @azaozz
11 years ago

I'm also very happy to see these patches, really good work! It is about time to make wpautop() fully HTML 5.0 compliant.

Some "first-look" notes:

  • There are no tags like <header />, <footer />, <menu />, etc. in HTML 5.0. All of these are completely invalid as they *require* closing tags, and the self-closing ending is optional. The tests should use valid tags.
  • I'm alot unhappy about removing the humorous var names. Why?? :)
  • The purpose of this function is to replace line breaks with <p>s. Don't think it should be creating <p>s from thin air when there are no line breaks, or prettifying the html. It is quite expensive at the moment, making it even more expensive is unwise imho.
  • There is a typo in $blocklist // List of Blocking elements (minus p), it still contains |p|.
  • The part preg_replace('!(<' . $allblocks . '[^>]*\/>)!i', "$1\n\n"... is not needed. As far as I know there aren't any block elements in HTML 5.0 that don't require a closing tag.
  • Not sure which tags preg_match( '/<p[^>]*>.*?<\/p[^>]*>/si'... is supposed to match. Currently it matches <p>|<pre>|<param>|<progress>. Doesn't look like this was intended. If we need only <p>, the start of the regex would be /<p(?: [^>]+)?>..., not sure why the closing tag is expanded.
  • On the same line: the preg_match( '/<' . $allblocks .'[^>]*\/>/si'... doesn't make sense. See the first point.
  • Don't see a point in preserving formatting/line breaks in <audio>, <video> and <object>. The only HTML tag where this matters is <pre>. All line breaks in the other tags should probably be stripped, saving some processing time. Further, this will fail for <object> as it can be nested and often is.

There are no "self-closing" tags in HTML 5.0. Instead some tags don't have a closing tag and some closing tags can be omitted. Among them is <p>. Been thinking whether it is practical to stop outputting </p>s from wpautop(). This has been supported in all browsers since the early days (mid 90's) and works really well. It was "invalid" in the XHTML 1.0 era but that time is long gone.

Another thing I've been thinking about is to return early from wpautop() when there are no line breaks in the source. This will not only make it faster for one-paragraph posts, but would allow for storing "full-html" content that would bypass autop automatically.

Last edited 11 years ago by azaozz (previous) (diff)

#19 @nacin
11 years ago

  • Keywords wpautop added

This ticket was mentioned in IRC in #wordpress-dev by SergeyBiryukov. View the logs.


11 years ago

This ticket was mentioned in IRC in #wordpress-dev by mdbitz. View the logs.


11 years ago

This ticket was mentioned in IRC in #wordpress-dev by azaozz. View the logs.


11 years ago

#23 @wonderboymusic
10 years ago

In 28853:

Fix wpautop() unit tests. See #25646, #22230, #25856.

Props rachelbaker.
Fixes #28638.

#24 @chriscct7
9 years ago

  • Focuses performance added
  • Keywords needs-patch added; has-patch removed

#25 @pbearne
9 months ago

Is this going anywhere?
Is it still needed?

Note: See TracTickets for help on using tickets.