WordPress.org

Make WordPress Core

Opened 17 months ago

Closed 3 weeks ago

Last modified 3 weeks ago

#22692 closed defect (bug) (fixed)

Quotes Are Messing Up

Reported by: miqrogroove Owned by: nacin
Milestone: 3.9 Priority: normal
Severity: normal Version: 1.2
Component: Formatting Keywords: has-patch commit 3.9-early wptexturize
Focuses: Cc:

Description (last modified by kpdesign)

Type this into the body of a new post:

A sentence.  "A quote after 2 spaces."

Preview that, and be amazed. The first quote is backwards. :(

Due to font differences, this is easiest to see on the Twenty Ten theme.

Attachments (24)

miqro-22692.patch (787 bytes) - added by miqrogroove 13 months ago.
miqro-22692-tests.diff (1.2 KB) - added by miqrogroove 6 months ago.
miqro-22692.2.patch (1.0 KB) - added by miqrogroove 5 months ago.
Patch improved based on ticket feedback.
miqro-22692.3.patch (3.2 KB) - added by miqrogroove 5 months ago.
A more robust solution conceived in IRC. Eliminates the \s sequence.
miqro-22692-tests.2.diff (1.7 KB) - added by miqrogroove 5 months ago.
miqro-22692.4.patch (3.3 KB) - added by miqrogroove 6 weeks ago.
Expand regexp to include future use of HTML output.
miqro-22692-tests.patch (2.0 KB) - added by miqrogroove 6 weeks ago.
miqro-22692.5.patch (4.8 KB) - added by miqrogroove 6 weeks ago.
Move regexp filter to a new function for reuse.
miqro-22692-tests.3.diff (3.7 KB) - added by miqrogroove 6 weeks ago.
Covers work on smilies for 26842
miqro-22692.6.patch (5.6 KB) - added by miqrogroove 6 weeks ago.
Touch unautop for spaces around shortcodes.
miqro-22692-tests.4.diff (4.5 KB) - added by miqrogroove 6 weeks ago.
miqro-22692.7.patch (5.7 KB) - added by miqrogroove 5 weeks ago.
Correction: Preserve behavior of periods after single quotes.
miqro-22692.8.patch (53.8 KB) - added by miqrogroove 5 weeks ago.
Merge src and tests.
miqro-22692.9.patch (10.3 KB) - added by miqrogroove 5 weeks ago.
Src and tests merge, take 2.
WPTexturize.php.diff (10.4 KB) - added by miqrogroove 5 weeks ago.
First pass at unit tests to cover all of wptexturize.
WPTexturize.php.2.diff (14.5 KB) - added by miqrogroove 5 weeks ago.
Second pass at unit tests to cover all of wptexturize.
WPTexturize.php.3.diff (18.9 KB) - added by miqrogroove 5 weeks ago.
miqro-22692.10.patch (27.2 KB) - added by miqrogroove 5 weeks ago.
Patch and tests rollup.
miqro-22692.11.patch (32.1 KB) - added by miqrogroove 5 weeks ago.
Divide unit tests into smaller units.
miqro-22692.12.patch (32.1 KB) - added by miqrogroove 5 weeks ago.
Fixed syntax in unit tests.
miqro-22692.13.patch (33.2 KB) - added by miqrogroove 5 weeks ago.
Avoid running the 9x9 regexp, which is inherently slow.
miqro-22692.14.patch (33.0 KB) - added by miqrogroove 5 weeks ago.
miqro-22692.15.patch (24.2 KB) - added by miqrogroove 3 weeks ago.
Patch refreshed without code for smilies, no tests for other bugs.
miqro-22692.16.patch (22.6 KB) - added by miqrogroove 3 weeks ago.
Patch refreshed without fix for shortcodes.

Download all attachments as: .zip

Change History (115)

comment:1 jond3r17 months ago

  • Cc jond3r@… added

Isn't this caused by the wptexturize filter?
Try to put

remove_filter('the_content', 'wptexturize');

in the theme's functions.php file.

comment:2 dd3217 months ago

Isn't this caused by the wptexturize filter?

Yes, It's a bug in the texturize filter, this ticket is to fix it.

comment:3 pavelevap17 months ago

Related: #16957 ?

comment:4 SergeyBiryukov17 months ago

  • Component changed from General to Formatting

comment:5 follow-up: nacin17 months ago

Cannot reproduce in the editor.

wptexturize( 'A sentence. "A quote after 2 spaces."' ) returns proper quotes.

comment:6 in reply to: ↑ 5 miqrogroove17 months ago

Replying to nacin:

Cannot reproduce in the editor.

Sometimes pasting doesn't preserve both spaces. Try removing the space and hitting spacebar twice from Visual editor. That'll do it.

comment:7 follow-up: miqrogroove16 months ago

Here's another bug:

"<code>blah</code>"

The trailing quote is backwards when output. 3.4.2 and 3.5 both affected.

comment:8 in reply to: ↑ 7 SergeyBiryukov13 months ago

Replying to miqrogroove:

"<code>blah</code>"

The trailing quote is backwards when output.

Related: #18549

comment:9 miqrogroove13 months ago

Any chance of getting a milestone for the 2-spaces bug? I'm in the habit of typing with 2 spaces so this bothers me a bit.

comment:10 follow-up: SergeyBiryukov13 months ago

Could not reproduce either:

  1. Pasted the example into editor (tried both Text and Visual).
  2. Made sure there are two spaces before the quote.
  3. The post contains proper quotes:
    A sentence.  &#8220;A quote after 2 spaces.&#8221;
    

comment:11 miqrogroove13 months ago

Tested 3.5.1:

<p>A sentence.  &#8221;A quote after 2 spaces.&#8221;</p>

comment:12 helen13 months ago

Please always test trunk.

comment:13 SergeyBiryukov13 months ago

  • Keywords reporter-feedback added

I can reproduce in 3.5.1 with WP Editor plugin activated, but not on a clean install. Could you try on a clean install?

comment:14 in reply to: ↑ 10 miqrogroove13 months ago

Replying to SergeyBiryukov:

Could not reproduce either:

  1. Pasted the example into editor (tried both Text and Visual).

OK couple things to clarify here. I can't reproduce this bug by pasting or by using the Text tab. Go to the Visual tab, type in the spaces, and then see what happens in preview.

comment:15 follow-up: miqrogroove13 months ago

  • Keywords needs-patch added; reporter-feedback removed

I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?

comment:16 in reply to: ↑ 15 nacin13 months ago

  • Keywords needs-unit-tests added

Replying to miqrogroove:

I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?

Would need a patch and unit tests for that.

comment:17 miqrogroove13 months ago

OK, I'm in an optimistic mood today :) The problem is likely in a filter somewhere and I'll track it down.

comment:18 miqrogroove13 months ago

*sigh*

Not a filter issue. I'm getting different results with identical inputs on two servers now. Starting to suspect an mb string issue. I'll boil this down to a test case for you to look at.

comment:19 miqrogroove13 months ago

$text = "A sentence. \xC2\xA0\"A quote after 2 spaces.\"";

$opening_quote = '&#8220;';
$closing_quote = '&#8221;';

$dynamic = array();
$dynamic[ '/(\s|\A|[([{<])"(?!\s)/'    ] = '$1' . $opening_quote . '$2'; 
$dynamic[ '/"(\s|\S|\Z)/'              ] = $closing_quote . '$1'; 

$dynamic_characters = array_keys( $dynamic );
$dynamic_replacements = array_values( $dynamic );

echo preg_replace($dynamic_characters, $dynamic_replacements, $text);

The above test case produces bad output on my BlueHost account, but works perfectly on my home server. Configurations are quite different; unsure the culprit. Suspect mb string.

comment:20 miqrogroove13 months ago

Oops. Fixed it. The PCRE_UTF8 modifier forces identical output on both servers.

$dynamic[ '/(\s|\A|[([{<])"(?!\s)/u'   ] = '$1' . $opening_quote . '$2'; 
$dynamic[ '/"(\s|\S|\Z)/u'             ] = $closing_quote . '$1'; 

I would think that's safe to do as long as everyone is using UTF-8.

miqrogroove13 months ago

comment:21 miqrogroove13 months ago

  • Keywords needs-patch removed

Patch added.

comment:22 miqrogroove13 months ago

Btw, I don't do unit testing. What can we do to get this on a milestone now?

comment:23 SergeyBiryukov13 months ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.6

comment:24 miqrogroove13 months ago

Thank you Sergey

comment:25 miqrogroove12 months ago

Anyone have a few minutes to test my patch?

comment:26 markjaquith10 months ago

  • Milestone changed from 3.6 to Future Release

Punting this as we're focusing on 3.6 blockers now.

comment:27 follow-up: miqrogroove9 months ago

3.7 early? Please?

comment:28 in reply to: ↑ 27 nacin7 months ago

Replying to miqrogroove:

3.7 early? Please?

Needs unit tests. http://make.wordpress.org/core/handbook/automated-testing/

comment:29 miqrogroove7 months ago

I've got the test case in comment 19 above. The output of preg_replace in that case or of wptexturize itself needs to be compared to the wanted and unwanted outputs (which I could provide).

I don't know how to make unit tests. Could you convert it for me?

Last edited 7 months ago by miqrogroove (previous) (diff)

comment:30 miqrogroove6 months ago

3.8 early? Please?
Begging!

comment:31 markoheijnen6 months ago

The point nacin mentioned is still not resolved. If you want to have it on the 3.8 milestone then you should look into create unit tests. It's not that hard. So creating small functions that make calls to wptexturize() and compare the output with what you expect.

Look at http://core.trac.wordpress.org/browser/trunk/tests/phpunit/tests/formatting/WPTexturize.php.

comment:32 miqrogroove6 months ago

miqro-22692-tests.diff attached - My first attempt ever making a unit test. Thanks tierra and azaozz in IRC for pointing me in a direction. This file needs testing.

comment:33 miqrogroove6 months ago

Possible concern - Do the unit tests have to be translatable? The opening and closing quotes in wptexturize() also run through _x() which means the output could be considerably different in other languages.

comment:34 follow-up: miqrogroove6 months ago

also,

/"(\s|\S|\Z)/

Does that regexp make sense to anyone? Isn't it equivalent to /"/ ??

comment:35 azaozz6 months ago

The tests look good, but... There is something going on here. Without the u modifier, in some cases \s matches \xA0 which happens to be the second byte from U+00A0 and all works as expected. The tests pass.

Adding this to any testcase:

/**
 * @ticket 999999
 */
function test_regex_xa0() {
	$this->assertTrue( 1 === preg_match( '/^\s$/', "\xA0" ) );
	$this->assertRegExp( '/^\s$/', "\xA0" );
}

Then run it with phpunit --group 999999. Both pass on my test install. However this:

add_action( 'shutdown', 'my_test_s' );
function my_test_s() {
	var_dump( preg_match( '/^\s$/', "\xA0" ) );
}

outputs int(0)... So which one is right?

comment:36 miqrogroove6 months ago

On my home server, var_dump( preg_match( '/^\s$/', "\xA0" ) ); throws int(1). But on BlueHost I'm having bad output. I can't explain the difference, but the configurations are quite diverse.

comment:37 miqrogroove6 months ago

BlueHost has mbstring installed and throws int(0), whereas my home server does not have mbstring and throws int(1). Does that help in any way?

comment:38 miqrogroove6 months ago

Another good guess would be that my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid). Neither has a default charset configured in PHP, and I wonder if it falls back to an OS setting.

comment:39 azaozz6 months ago

...my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid).

Yes, that's the most likely reason. Still couldn't confirm/reproduce it though. There are more inconsistencies. looking in the PCRE documentation:

The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).
If "use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.

And later on:

By default, in a UTF mode, characters  with  values  greater  than  128
never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
sequences retain their original meanings from before  UTF  support  was
available,  mainly for efficiency reasons. However, if PCRE is compiled
with Unicode property support, and the PCRE_UCP option is set, the  be-
haviour  is  changed  so  that Unicode properties are used to determine
character types, as follows:

 \d  any character that \p{Nd} matches (decimal digit)
 \s  any character that \p{Z} matches, plus HT, LF, FF, CR
 \w  any character that \p{L} or \p{N} matches, plus underscore

So, \s will not match U+00A0 if PCRE was compiled without Unicode property support. In addition the PCRE_UCP option is set as of PHP 5.3.4, so the u modifier doesn't change \s in earlier versions: https://bugs.php.net/bug.php?id=52971.

In that terms, to fix this particular case we should use [\s\xA0] to match white space. However to make it fully compatible, this should include all white space chars from \p{Z} plus \r, \n, \t and \f.

In addition:

  • In PHP 5.3.4+, when using preg_* functions in UTF mode and PCRE has been compiled with Unicode property support, the behavior of \d, \D, \s, \S, \w, and \W changes. They match the Unicode property of each char, including a lot more characters and making the regex quite slower.
  • Currently this cannot be used for two reasons:
    • The lowest required PHP version is 5.2.4.
    • Some hosts don't have Unicode property support in PCRE.
Last edited 6 months ago by azaozz (previous) (diff)

comment:40 azaozz6 months ago

Actually the proper regex is \s|\xC2\xA0 as we are not in UTF mode and \xA0 can match part of an UTF-8 char.

comment:41 miqrogroove6 months ago

So then matching both bytes works better with UTF-8, but matching only the Latin byte might work better with other types of encoding. I agree both bytes is the way to go if UTF-8 is the expected input. It also wont break other multi-byte types the way that /u would.

comment:42 in reply to: ↑ 34 miqrogroove6 months ago

Replying to miqrogroove:

also,

/"(\s|\S|\Z)/

Does that regexp make sense to anyone? Isn't it equivalent to /"/ ??

Clarification.. this is replacement mode, so what's the difference between these, if any:

$dynamic[ '/"(\s|\S|\Z)/' ] = $closing_quote . '$1';

or

$dynamic[ '/"(.?)/' ] = $closing_quote . '$1';

or

$dynamic[ '/"/' ] = $closing_quote;

Wouldn't those all do the same thing?

Last edited 6 months ago by miqrogroove (previous) (diff)

comment:43 miqrogroove5 months ago

I'm seeing other problems that need to be resolved with this patch.

(?! is a non-capturing lookahead assertion, which means the $2 replacement token does nothing.

miqrogroove5 months ago

Patch improved based on ticket feedback.

comment:44 miqrogroove5 months ago

azaozz, miqro-22692.2.patch​ is based on your feedback in comment 40, as well as fixes for comments 42 and 43.

comment:45 miqrogroove5 months ago

  • Keywords needs-unit-tests removed

comment:46 miqrogroove5 months ago

Another note on miqro-22692.2.patch :

Changing the lookahead from negative (?!\s) to positive (?=\S) which I think is more readable, also has the effect of changing the expected output for a lone double quote, as in wptexturize('"').

I made that change on the assumption it would be trivial.

comment:47 SergeyBiryukov5 months ago

  • Milestone changed from Future Release to 3.8

comment:48 SergeyBiryukov5 months ago

  • Version changed from 3.4.2 to 1.2

The original regex appears to be introduced in [985].

comment:49 miqrogroove5 months ago

The "original" bug probably looked more like #16957 because the post editor was inserting the HTML character entity reference &nbsp; instead of its UTF-8 encoding.

miqrogroove5 months ago

A more robust solution conceived in IRC. Eliminates the \s sequence.

comment:50 miqrogroove5 months ago

Added a third patch option. If we can't figure out exactly why \s is causing semi-random behavior, then this patch should be used to eliminate \s from wptexturize().

comment:51 miqrogroove5 months ago

I got in touch with one of the PCRE devs who says, "Clearly, the different PHP implementations you are using must be calling pcre_compile() with different options." http://bugs.exim.org/show_bug.cgi?id=1412

comment:52 azaozz5 months ago

Very nice, super fast turnaround from PCRE :)

As far as I see the problem comes from here:

/\s/W
\x{a0}
 0: \xa0

(non-UTF mode, but with Unicode properties).

If PCRE was compiled with Unicode properties support *and* ? PCRE_UCP is passed at compile time, \s will match \xa0 even when not in UTF mode. However in PHP PCRE_UCP is only passed when the u modifier is set: http://svn.php.net/viewvc/php/php-src/trunk/ext/pcre/php_pcre.c?r1=303963&r2=303962&pathrev=303963 since PHP 5.3.4. And that hasn't changed in newer versions: http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l366.

... Still can't pinpoint the inconsistencies with \s.

comment:53 miqrogroove5 months ago

I'm not familiar with that PCRE Unicode mode, but it sounds similar to UTF-16. In Windows, \xA0 is sometimes encoded \x00\xA0 and that's when data typing becomes so important.

I put in a ticket at php.net today. They tend to be slow.

comment:54 miqrogroove5 months ago

azaozz, there's a request from PHP to post as much phpinfo() as possible from systems returning int(1) for the test code. Could you drop a phpinfo() call into your phpunit and post the results? Maybe edit out usernames and addresses for safety. See https://bugs.php.net/bug.php?id=66068

comment:55 miqrogroove5 months ago

And in case it's relevant, let's see if your command line environment shows a different result for

var_dump( setlocale( LC_ALL, 0 ) );

comment:56 miqrogroove5 months ago

On BlueHost, where we get the "expected" behavior of not matching \xA0, the setlocale() result is

string(1) "C"

On my home server, I get

string(85) "LC_COLLATE=C;LC_CTYPE=English_United States.1252;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C"

comment:57 miqrogroove5 months ago

setlocale( LC_CTYPE, 'C' );
var_dump( preg_match( '/^\s$/', "\xA0" ) );

int(0)

comment:58 azaozz5 months ago

Yeah, now we are getting somewhere...

PHP 5.4.16 on Windows 7:

var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

var_dump( setlocale( LC_ALL, '' ) ); // On Windows this sets it to the system default
// string(19) "English_Canada.1252"

var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false) (as \xA0 is not full UTF char)

setlocale( LC_ALL, 'C' );
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false)

PHP 5.3.1 on Mac OSX (note: 5.3.1 doesn't set PCRE_UCP with the u modifier)

var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

setlocale( LC_ALL, 'en_CA' ); // Also with 'en_CA.UTF-8', 'en_CA.ISO8859-1', 'en_CA.ISO8859-15', etc.
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)

setlocale( LC_ALL, 'C' ); // Also with 'en_CA.US-ASCII'
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)

So when the locale is anything other than C or US-ASCII equivalent, \s matches \xA0. Also on multithreaded servers like Apache on Windows, setlocale() is "sticky", more info.

From the PCRE manual:

PCRE  handles  caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables,  indexed
by  character  value.  When running in UTF-8 mode, this applies only to
characters with codes less than 128.

It is not mentioned there but seems \s is also affected by these tables.

The internal tables can always be overridden by tables supplied by  the
application that calls PCRE... External  tables  are  built by calling
pcre_maketables()

PHP uses pcre_maketables() only if the locale is 'C': http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l392

So it seems that the unexpected behavior of \s is caused by PCRE when a locale other than C is set.

comment:59 miqrogroove5 months ago

I'm just amazed that none of this is documented anywhere as far as I can find.

comment:60 azaozz5 months ago

Yes, thinking we can ask that the PCRE (and PHP) documentation is updated to reflect this.

comment:61 miqrogroove5 months ago

Based on what we know now, I think the most recent patch and tests are appropriate. I wouldn't want to involve locale information in the fix for this bug.

comment:62 nacin5 months ago

  • Keywords commit 3.9-early added
  • Milestone changed from 3.8 to Future Release

See also, #22363, apparently.

Let's try this out early 3.9.

comment:63 miqrogroove5 months ago

nacin, the current patch here does not involve PCRE unicode modes. We decided to go strictly ASCII. Other than that, the filenames ticket doesn't seem related.

comment:64 miqrogroove4 months ago

Could someone edit the ticket description to say "Type" instead of "Paste" ? A while back we figured out there was a big difference.

comment:65 kpdesign4 months ago

  • Description modified (diff)

@miqrogroove: I've made the change in the ticket description.

comment:66 miqrogroove4 months ago

miqro-22692.3.patch is ready to commit.

miqro-22692-tests.2.diff let me know if I need to refresh that one.

Last edited 3 months ago by miqrogroove (previous) (diff)

comment:67 SergeyBiryukov4 months ago

  • Milestone changed from Future Release to 3.9

comment:68 nacin3 months ago

  • Keywords wptexturize added

miqrogroove6 weeks ago

Expand regexp to include future use of HTML output.

miqrogroove6 weeks ago

Move regexp filter to a new function for reuse.

miqrogroove6 weeks ago

Covers work on smilies for 26842

comment:69 miqrogroove6 weeks ago

Patches and unit tests expanded for compatibility with changes proposed in #26842

comment:70 miqrogroove6 weeks ago

I decided this needs to be expanded into wpautop(). I reviewed the impact earlier and determined my patches would not break anything there. However, it would be silly to leave pre-existing \s bugs in wpautop() when we are already tackling them in wptexturize(). Check for new attachments in a couple hours.

miqrogroove6 weeks ago

Touch unautop for spaces around shortcodes.

comment:71 miqrogroove5 weeks ago

I'm triple checking everything before the IRC meeting. Just noticed a stray wildcard in the original regexp for closing single quotes: '/\'([\s.]|\Z)/' That lone period may represent any character, similar to the \s|\S that was used in closing double quotes. I will tweak that now if needed.

comment:72 miqrogroove5 weeks ago

Testing shows lone periods inside character classes are not treated as wildcards. So [\s.] is matching both spaces and periods. This is still a significant difference between double quotes and single quotes, and for the sake of this patch I want that original behavior unchanged. Patch revision coming momentarily.

Last edited 5 weeks ago by miqrogroove (previous) (diff)

miqrogroove5 weeks ago

Correction: Preserve behavior of periods after single quotes.

miqrogroove5 weeks ago

Merge src and tests.

miqrogroove5 weeks ago

Src and tests merge, take 2.

comment:73 ircbot5 weeks ago

This ticket was mentioned in IRC in #wordpress-dev by azaozz. View the logs.

miqrogroove5 weeks ago

First pass at unit tests to cover all of wptexturize.

comment:74 miqrogroove5 weeks ago

In IRC, we decided to add a very large number of unit tests for wptexturize. These tests will:

  1. Fail upon any change of behavior from v3.8.
  2. Fail for all known bugs.
  3. Eventually, not fail. We can set aside some tests later as needed.

A first set of tests is attached. More to follow as time permits.

miqrogroove5 weeks ago

Second pass at unit tests to cover all of wptexturize.

comment:75 miqrogroove5 weeks ago

Uploaded more tests. This covers everything in wptexturize v3.8 except for the $no_texturize_shortcodes feature. It looks like many of the known bugs already have tests in the trunk version, but I plan to check if any need to be added.

comment:76 ircbot5 weeks ago

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.

miqrogroove5 weeks ago

Patch and tests rollup.

miqrogroove5 weeks ago

Divide unit tests into smaller units.

miqrogroove5 weeks ago

Fixed syntax in unit tests.

comment:77 miqrogroove5 weeks ago

The unit tests are working now. In IRC we decided the next step is to benchmark the two different regexp styles, lookbehinds vs. capturing backreferences. There are any number of reasons one might be faster than the other, so it will be good to find out.

comment:78 azaozz5 weeks ago

miqro-22692.12.patch looks good. The only tests that fail after applying the patch are in test_more_bugs() in WPTexturize.php.

comment:79 miqrogroove5 weeks ago

Initial testing shows a massive performance improvement after the patch, as-is!

Using, for example:

$start = microtime(TRUE);
for ($i = 0; $i<5; $i++) {
  $curl = preg_replace($dynamic_characters, $dynamic_replacements, $post);
}
$end = microtime(TRUE);

Here were the results based on the old and new regexp strings, tested against a 9,911 byte post.

Old logic, 1 iteration = 0.0128 sec
Old logic, 5 iteration = 0.0658 sec
Old logic, 10 iteration = 0.1308 sec
Old logic, 25 iteration = 0.3159 sec

New logic, 1 iteration = 0.0020 sec
New logic, 5 iteration = 0.0094 sec
New logic, 10 iteration = 0.0188 sec
New logic, 25 iteration = 0.0468 sec

I will take a closer look now to see if any tuning would be worthwhile.

miqrogroove5 weeks ago

Avoid running the 9x9 regexp, which is inherently slow.

miqrogroove5 weeks ago

comment:80 miqrogroove5 weeks ago

Patch 14 gives another 3x performance boost by skipping the 9x9 replacement when possible. The extra check is negligible but saves a lot of time:

Patch 14, 9x9 off, 1 iteration = 0.0006 sec
Patch 14, 9x9 off, 5 iteration = 0.0027 sec
Patch 14, 9x9 off, 10 iteration = 0.0053 sec
Patch 14, 9x9 off, 25 iteration = 0.0131 sec

Patch 14, 9x9 on, 1 iteration = 0.0020 sec
Patch 14, 9x9 on, 5 iteration = 0.0095 sec
Patch 14, 9x9 on, 10 iteration = 0.0188 sec
Patch 14, 9x9 on, 25 iteration = 0.0467 sec

comment:81 ircbot5 weeks ago

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.

comment:82 miqrogroove4 weeks ago

Might want to expand test_ampersand() and test_more_bugs() with malformed entity references like &&amp; and &!amp;

wptexturize in 3.8 will predictably fail to convert the first character because the second character in the pattern is only required to not be a #.

comment:83 ircbot3 weeks ago

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.

comment:84 miqrogroove3 weeks ago

Meeting minutes:

  • Is it worth committing this 3 weeks before release?
    • Patch would be reverted at the first sign of trouble.
    • Could alter the perception of other bugs.
    • Paves the way for other bugs to be fixed.
    • Improves performance.
    • A decision will be made by the Beta 3 deadline.
  • What changes are made other than quote replacement patterns?
    • wptexturize() will treat any 0xC2A0 or &nbsp; the same as a regular space.
    • Bugs outside the scope of this patch are preserved (not fixed).
    • Smilies and shortcodes are fixed in the current patch. Should this be separate?
    • Extra spaces removed from translate_smiley() because it would lead to nbsp being replaced by normal space and other strange problems.
    • Were people counting on multiple spaces around smilies img? - Unlikely
  • What is the overlap with #26842?
    • That ticket would change how the editor outputs NBSP characters.
    • Smilies and shortcodes must be patched before that change.
    • It remains a separate issue.
  • miqrogroove will be available to address any fallout from this patch.
Last edited 3 weeks ago by miqrogroove (previous) (diff)

comment:85 miqrogroove3 weeks ago

Executive summary of miqro-22692.14.patch:

  • new function wp_spaces_regexp()
    • Provides the chars and strings that can represent a space in regexp.
    • Adds a run-once filter for any customization needs.
  • wptexturize
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now quotes can be neighbors with nbsp.
    • Removed duplicate patterns and junk like \s|\S for performance.
    • Replaced back-reference patterns with look-behind patterns for performance.
    • Moved the expensive 9x9 pattern so it can be skipped when not needed.
  • shortcode_unautop
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now shortcodes can be neighors with nbsp.
  • translate_smiley
    • Removed extra spaces that were not be needed after fixing smilies_init().
  • smilies_init
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now smilies can be neighbors with nbsp.
    • Added a look-behind to prevent replacement of leading spaces.
    • Removed extra capturing groups for performance.

miqrogroove3 weeks ago

Patch refreshed without code for smilies, no tests for other bugs.

comment:86 miqrogroove3 weeks ago

Based on feedback last night, the unit tests for smilies need a lot of work.

miqro-22692.15.patch adjusted as follows:

Dropped all changes to smilies_init() and translate_smiley().

Dropped all new unit tests for bugs not fixed by this patch.

miqrogroove3 weeks ago

Patch refreshed without fix for shortcodes.

comment:87 miqrogroove3 weeks ago

miqro-22692.16.patch is limited in scope to wptexturize() only. I added the two assertions from comment 82 under test_ampersand().

comment:88 nacin3 weeks ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 27839:

Texturize: Massive performance improvements (~600% faster); better handling of nbsp, double, and weird spaces; 136 new unit tests.

big props miqrogroove.
fixes #22692.

comment:89 nacin3 weeks ago

Bravo, miqrogroove.

Let's spin out the smiley and shortcode stuff into new tickets, and hit the ground running with them in 4.0.

If there are any issues in [27839] we're likely to do a full or partial revert (for example, I'd want to leave in the 9x9 speed-up), given how little time there is remaining in the 3.9 development cycle, and we'd try it again in 4.0. But this was too good to pass up.

As I noticed in IRC at one point, I did some profiling and benchmarking using the text of Alice in Wonderland (nearly 25,000 words). I noticed that with the patch, the page loaded about 25% faster.

comment:90 nacin3 weeks ago

In 27844:

Add braces to wptexturize() after [27839]. see #22692.

comment:91 miqrogroove3 weeks ago

For future releases:

#26842 - Editor
#27426 - i18n
#27587 - Smilies
#27588 - Shortcodes

Note: See TracTickets for help on using tickets.