Make WordPress Core

Opened 11 years ago

Closed 10 years ago

Last modified 10 years ago

#22692 closed defect (bug) (fixed)

Quotes Are Messing Up

Reported by: miqrogroove's profile miqrogroove Owned by: nacin's profile nacin
Milestone: 3.9 Priority: normal
Severity: normal Version: 1.2
Component: Formatting Keywords: has-patch commit 3.9-early wptexturize
Focuses: Cc:

Description (last modified by kpdesign)

Type this into the body of a new post:

A sentence.  "A quote after 2 spaces."

Preview that, and be amazed. The first quote is backwards. :(

Due to font differences, this is easiest to see on the Twenty Ten theme.

Attachments (24)

miqro-22692.patch (787 bytes) - added by miqrogroove 11 years ago.
miqro-22692-tests.diff (1.2 KB) - added by miqrogroove 11 years ago.
miqro-22692.2.patch (1.0 KB) - added by miqrogroove 11 years ago.
Patch improved based on ticket feedback.
miqro-22692.3.patch (3.2 KB) - added by miqrogroove 11 years ago.
A more robust solution conceived in IRC. Eliminates the \s sequence.
miqro-22692-tests.2.diff (1.7 KB) - added by miqrogroove 11 years ago.
miqro-22692.4.patch (3.3 KB) - added by miqrogroove 10 years ago.
Expand regexp to include future use of HTML output.
miqro-22692-tests.patch (2.0 KB) - added by miqrogroove 10 years ago.
miqro-22692.5.patch (4.8 KB) - added by miqrogroove 10 years ago.
Move regexp filter to a new function for reuse.
miqro-22692-tests.3.diff (3.7 KB) - added by miqrogroove 10 years ago.
Covers work on smilies for 26842
miqro-22692.6.patch (5.6 KB) - added by miqrogroove 10 years ago.
Touch unautop for spaces around shortcodes.
miqro-22692-tests.4.diff (4.5 KB) - added by miqrogroove 10 years ago.
miqro-22692.7.patch (5.7 KB) - added by miqrogroove 10 years ago.
Correction: Preserve behavior of periods after single quotes.
miqro-22692.8.patch (53.8 KB) - added by miqrogroove 10 years ago.
Merge src and tests.
miqro-22692.9.patch (10.3 KB) - added by miqrogroove 10 years ago.
Src and tests merge, take 2.
WPTexturize.php.diff (10.4 KB) - added by miqrogroove 10 years ago.
First pass at unit tests to cover all of wptexturize.
WPTexturize.php.2.diff (14.5 KB) - added by miqrogroove 10 years ago.
Second pass at unit tests to cover all of wptexturize.
WPTexturize.php.3.diff (18.9 KB) - added by miqrogroove 10 years ago.
miqro-22692.10.patch (27.2 KB) - added by miqrogroove 10 years ago.
Patch and tests rollup.
miqro-22692.11.patch (32.1 KB) - added by miqrogroove 10 years ago.
Divide unit tests into smaller units.
miqro-22692.12.patch (32.1 KB) - added by miqrogroove 10 years ago.
Fixed syntax in unit tests.
miqro-22692.13.patch (33.2 KB) - added by miqrogroove 10 years ago.
Avoid running the 9x9 regexp, which is inherently slow.
miqro-22692.14.patch (33.0 KB) - added by miqrogroove 10 years ago.
miqro-22692.15.patch (24.2 KB) - added by miqrogroove 10 years ago.
Patch refreshed without code for smilies, no tests for other bugs.
miqro-22692.16.patch (22.6 KB) - added by miqrogroove 10 years ago.
Patch refreshed without fix for shortcodes.

Download all attachments as: .zip

Change History (115)

#1 @jond3r
11 years ago

  • Cc jond3r@… added

Isn't this caused by the wptexturize filter?
Try to put

remove_filter('the_content', 'wptexturize');

in the theme's functions.php file.

#2 @dd32
11 years ago

Isn't this caused by the wptexturize filter?

Yes, It's a bug in the texturize filter, this ticket is to fix it.

#3 @pavelevap
11 years ago

Related: #16957 ?

#4 @SergeyBiryukov
11 years ago

  • Component changed from General to Formatting

#5 follow-up: @nacin
11 years ago

Cannot reproduce in the editor.

wptexturize( 'A sentence. "A quote after 2 spaces."' ) returns proper quotes.

#6 in reply to: ↑ 5 @miqrogroove
11 years ago

Replying to nacin:

Cannot reproduce in the editor.

Sometimes pasting doesn't preserve both spaces. Try removing the space and hitting spacebar twice from Visual editor. That'll do it.

#7 follow-up: @miqrogroove
11 years ago

Here's another bug:

"<code>blah</code>"

The trailing quote is backwards when output. 3.4.2 and 3.5 both affected.

#8 in reply to: ↑ 7 @SergeyBiryukov
11 years ago

Replying to miqrogroove:

"<code>blah</code>"

The trailing quote is backwards when output.

Related: #18549

#9 @miqrogroove
11 years ago

Any chance of getting a milestone for the 2-spaces bug? I'm in the habit of typing with 2 spaces so this bothers me a bit.

#10 follow-up: @SergeyBiryukov
11 years ago

Could not reproduce either:

  1. Pasted the example into editor (tried both Text and Visual).
  2. Made sure there are two spaces before the quote.
  3. The post contains proper quotes:
    A sentence.  &#8220;A quote after 2 spaces.&#8221;
    

#11 @miqrogroove
11 years ago

Tested 3.5.1:

<p>A sentence.  &#8221;A quote after 2 spaces.&#8221;</p>

#12 @helen
11 years ago

Please always test trunk.

#13 @SergeyBiryukov
11 years ago

  • Keywords reporter-feedback added

I can reproduce in 3.5.1 with WP Editor plugin activated, but not on a clean install. Could you try on a clean install?

#14 in reply to: ↑ 10 @miqrogroove
11 years ago

Replying to SergeyBiryukov:

Could not reproduce either:

  1. Pasted the example into editor (tried both Text and Visual).

OK couple things to clarify here. I can't reproduce this bug by pasting or by using the Text tab. Go to the Visual tab, type in the spaces, and then see what happens in preview.

#15 follow-up: @miqrogroove
11 years ago

  • Keywords needs-patch added; reporter-feedback removed

I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?

#16 in reply to: ↑ 15 @nacin
11 years ago

  • Keywords needs-unit-tests added

Replying to miqrogroove:

I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?

Would need a patch and unit tests for that.

#17 @miqrogroove
11 years ago

OK, I'm in an optimistic mood today :) The problem is likely in a filter somewhere and I'll track it down.

#18 @miqrogroove
11 years ago

*sigh*

Not a filter issue. I'm getting different results with identical inputs on two servers now. Starting to suspect an mb string issue. I'll boil this down to a test case for you to look at.

#19 @miqrogroove
11 years ago

$text = "A sentence. \xC2\xA0\"A quote after 2 spaces.\"";

$opening_quote = '&#8220;';
$closing_quote = '&#8221;';

$dynamic = array();
$dynamic[ '/(\s|\A|[([{<])"(?!\s)/'    ] = '$1' . $opening_quote . '$2'; 
$dynamic[ '/"(\s|\S|\Z)/'              ] = $closing_quote . '$1'; 

$dynamic_characters = array_keys( $dynamic );
$dynamic_replacements = array_values( $dynamic );

echo preg_replace($dynamic_characters, $dynamic_replacements, $text);

The above test case produces bad output on my BlueHost account, but works perfectly on my home server. Configurations are quite different; unsure the culprit. Suspect mb string.

#20 @miqrogroove
11 years ago

Oops. Fixed it. The PCRE_UTF8 modifier forces identical output on both servers.

$dynamic[ '/(\s|\A|[([{<])"(?!\s)/u'   ] = '$1' . $opening_quote . '$2'; 
$dynamic[ '/"(\s|\S|\Z)/u'             ] = $closing_quote . '$1'; 

I would think that's safe to do as long as everyone is using UTF-8.

#21 @miqrogroove
11 years ago

  • Keywords needs-patch removed

Patch added.

#22 @miqrogroove
11 years ago

Btw, I don't do unit testing. What can we do to get this on a milestone now?

#23 @SergeyBiryukov
11 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.6

#24 @miqrogroove
11 years ago

Thank you Sergey

#25 @miqrogroove
11 years ago

Anyone have a few minutes to test my patch?

#26 @markjaquith
11 years ago

  • Milestone changed from 3.6 to Future Release

Punting this as we're focusing on 3.6 blockers now.

#27 follow-up: @miqrogroove
11 years ago

3.7 early? Please?

#28 in reply to: ↑ 27 @nacin
11 years ago

Replying to miqrogroove:

3.7 early? Please?

Needs unit tests. http://make.wordpress.org/core/handbook/automated-testing/

#29 @miqrogroove
11 years ago

I've got the test case in comment 19 above. The output of preg_replace in that case or of wptexturize itself needs to be compared to the wanted and unwanted outputs (which I could provide).

I don't know how to make unit tests. Could you convert it for me?

Last edited 11 years ago by miqrogroove (previous) (diff)

#30 @miqrogroove
11 years ago

3.8 early? Please?
Begging!

#31 @markoheijnen
11 years ago

The point nacin mentioned is still not resolved. If you want to have it on the 3.8 milestone then you should look into create unit tests. It's not that hard. So creating small functions that make calls to wptexturize() and compare the output with what you expect.

Look at http://core.trac.wordpress.org/browser/trunk/tests/phpunit/tests/formatting/WPTexturize.php.

#32 @miqrogroove
11 years ago

miqro-22692-tests.diff attached - My first attempt ever making a unit test. Thanks tierra and azaozz in IRC for pointing me in a direction. This file needs testing.

#33 @miqrogroove
11 years ago

Possible concern - Do the unit tests have to be translatable? The opening and closing quotes in wptexturize() also run through _x() which means the output could be considerably different in other languages.

#34 follow-up: @miqrogroove
11 years ago

also,

/"(\s|\S|\Z)/

Does that regexp make sense to anyone? Isn't it equivalent to /"/ ??

#35 @azaozz
11 years ago

The tests look good, but... There is something going on here. Without the u modifier, in some cases \s matches \xA0 which happens to be the second byte from U+00A0 and all works as expected. The tests pass.

Adding this to any testcase:

/**
 * @ticket 999999
 */
function test_regex_xa0() {
	$this->assertTrue( 1 === preg_match( '/^\s$/', "\xA0" ) );
	$this->assertRegExp( '/^\s$/', "\xA0" );
}

Then run it with phpunit --group 999999. Both pass on my test install. However this:

add_action( 'shutdown', 'my_test_s' );
function my_test_s() {
	var_dump( preg_match( '/^\s$/', "\xA0" ) );
}

outputs int(0)... So which one is right?

#36 @miqrogroove
11 years ago

On my home server, var_dump( preg_match( '/^\s$/', "\xA0" ) ); throws int(1). But on BlueHost I'm having bad output. I can't explain the difference, but the configurations are quite diverse.

#37 @miqrogroove
11 years ago

BlueHost has mbstring installed and throws int(0), whereas my home server does not have mbstring and throws int(1). Does that help in any way?

#38 @miqrogroove
11 years ago

Another good guess would be that my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid). Neither has a default charset configured in PHP, and I wonder if it falls back to an OS setting.

#39 @azaozz
11 years ago

...my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid).

Yes, that's the most likely reason. Still couldn't confirm/reproduce it though. There are more inconsistencies. looking in the PCRE documentation:

The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).
If "use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.

And later on:

By default, in a UTF mode, characters  with  values  greater  than  128
never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
sequences retain their original meanings from before  UTF  support  was
available,  mainly for efficiency reasons. However, if PCRE is compiled
with Unicode property support, and the PCRE_UCP option is set, the  be-
haviour  is  changed  so  that Unicode properties are used to determine
character types, as follows:

 \d  any character that \p{Nd} matches (decimal digit)
 \s  any character that \p{Z} matches, plus HT, LF, FF, CR
 \w  any character that \p{L} or \p{N} matches, plus underscore

So, \s will not match U+00A0 if PCRE was compiled without Unicode property support. In addition the PCRE_UCP option is set as of PHP 5.3.4, so the u modifier doesn't change \s in earlier versions: https://bugs.php.net/bug.php?id=52971.

In that terms, to fix this particular case we should use [\s\xA0] to match white space. However to make it fully compatible, this should include all white space chars from \p{Z} plus \r, \n, \t and \f.

In addition:

  • In PHP 5.3.4+, when using preg_* functions in UTF mode and PCRE has been compiled with Unicode property support, the behavior of \d, \D, \s, \S, \w, and \W changes. They match the Unicode property of each char, including a lot more characters and making the regex quite slower.
  • Currently this cannot be used for two reasons:
    • The lowest required PHP version is 5.2.4.
    • Some hosts don't have Unicode property support in PCRE.
Last edited 11 years ago by azaozz (previous) (diff)

#40 @azaozz
11 years ago

Actually the proper regex is \s|\xC2\xA0 as we are not in UTF mode and \xA0 can match part of an UTF-8 char.

#41 @miqrogroove
11 years ago

So then matching both bytes works better with UTF-8, but matching only the Latin byte might work better with other types of encoding. I agree both bytes is the way to go if UTF-8 is the expected input. It also wont break other multi-byte types the way that /u would.

#42 in reply to: ↑ 34 @miqrogroove
11 years ago

Replying to miqrogroove:

also,

/"(\s|\S|\Z)/

Does that regexp make sense to anyone? Isn't it equivalent to /"/ ??

Clarification.. this is replacement mode, so what's the difference between these, if any:

$dynamic[ '/"(\s|\S|\Z)/' ] = $closing_quote . '$1';

or

$dynamic[ '/"(.?)/' ] = $closing_quote . '$1';

or

$dynamic[ '/"/' ] = $closing_quote;

Wouldn't those all do the same thing?

Last edited 11 years ago by miqrogroove (previous) (diff)

#43 @miqrogroove
11 years ago

I'm seeing other problems that need to be resolved with this patch.

(?! is a non-capturing lookahead assertion, which means the $2 replacement token does nothing.

@miqrogroove
11 years ago

Patch improved based on ticket feedback.

#44 @miqrogroove
11 years ago

azaozz, miqro-22692.2.patch​ is based on your feedback in comment 40, as well as fixes for comments 42 and 43.

#45 @miqrogroove
11 years ago

  • Keywords needs-unit-tests removed

#46 @miqrogroove
11 years ago

Another note on miqro-22692.2.patch :

Changing the lookahead from negative (?!\s) to positive (?=\S) which I think is more readable, also has the effect of changing the expected output for a lone double quote, as in wptexturize('"').

I made that change on the assumption it would be trivial.

#47 @SergeyBiryukov
11 years ago

  • Milestone changed from Future Release to 3.8

#48 @SergeyBiryukov
11 years ago

  • Version changed from 3.4.2 to 1.2

The original regex appears to be introduced in [985].

#49 @miqrogroove
11 years ago

The "original" bug probably looked more like #16957 because the post editor was inserting the HTML character entity reference &nbsp; instead of its UTF-8 encoding.

@miqrogroove
11 years ago

A more robust solution conceived in IRC. Eliminates the \s sequence.

#50 @miqrogroove
11 years ago

Added a third patch option. If we can't figure out exactly why \s is causing semi-random behavior, then this patch should be used to eliminate \s from wptexturize().

#51 @miqrogroove
10 years ago

I got in touch with one of the PCRE devs who says, "Clearly, the different PHP implementations you are using must be calling pcre_compile() with different options." http://bugs.exim.org/show_bug.cgi?id=1412

#52 @azaozz
10 years ago

Very nice, super fast turnaround from PCRE :)

As far as I see the problem comes from here:

/\s/W
\x{a0}
 0: \xa0

(non-UTF mode, but with Unicode properties).

If PCRE was compiled with Unicode properties support *and* ? PCRE_UCP is passed at compile time, \s will match \xa0 even when not in UTF mode. However in PHP PCRE_UCP is only passed when the u modifier is set: http://svn.php.net/viewvc/php/php-src/trunk/ext/pcre/php_pcre.c?r1=303963&r2=303962&pathrev=303963 since PHP 5.3.4. And that hasn't changed in newer versions: http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l366.

... Still can't pinpoint the inconsistencies with \s.

#53 @miqrogroove
10 years ago

I'm not familiar with that PCRE Unicode mode, but it sounds similar to UTF-16. In Windows, \xA0 is sometimes encoded \x00\xA0 and that's when data typing becomes so important.

I put in a ticket at php.net today. They tend to be slow.

#54 @miqrogroove
10 years ago

azaozz, there's a request from PHP to post as much phpinfo() as possible from systems returning int(1) for the test code. Could you drop a phpinfo() call into your phpunit and post the results? Maybe edit out usernames and addresses for safety. See https://bugs.php.net/bug.php?id=66068

#55 @miqrogroove
10 years ago

And in case it's relevant, let's see if your command line environment shows a different result for

var_dump( setlocale( LC_ALL, 0 ) );

#56 @miqrogroove
10 years ago

On BlueHost, where we get the "expected" behavior of not matching \xA0, the setlocale() result is

string(1) "C"

On my home server, I get

string(85) "LC_COLLATE=C;LC_CTYPE=English_United States.1252;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C"

#57 @miqrogroove
10 years ago

setlocale( LC_CTYPE, 'C' );
var_dump( preg_match( '/^\s$/', "\xA0" ) );

int(0)

#58 @azaozz
10 years ago

Yeah, now we are getting somewhere...

PHP 5.4.16 on Windows 7:

var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

var_dump( setlocale( LC_ALL, '' ) ); // On Windows this sets it to the system default
// string(19) "English_Canada.1252"

var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false) (as \xA0 is not full UTF char)

setlocale( LC_ALL, 'C' );
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false)

PHP 5.3.1 on Mac OSX (note: 5.3.1 doesn't set PCRE_UCP with the u modifier)

var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

setlocale( LC_ALL, 'en_CA' ); // Also with 'en_CA.UTF-8', 'en_CA.ISO8859-1', 'en_CA.ISO8859-15', etc.
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)

setlocale( LC_ALL, 'C' ); // Also with 'en_CA.US-ASCII'
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)

So when the locale is anything other than C or US-ASCII equivalent, \s matches \xA0. Also on multithreaded servers like Apache on Windows, setlocale() is "sticky", more info.

From the PCRE manual:

PCRE  handles  caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables,  indexed
by  character  value.  When running in UTF-8 mode, this applies only to
characters with codes less than 128.

It is not mentioned there but seems \s is also affected by these tables.

The internal tables can always be overridden by tables supplied by  the
application that calls PCRE... External  tables  are  built by calling
pcre_maketables()

PHP uses pcre_maketables() only if the locale is 'C': http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l392

So it seems that the unexpected behavior of \s is caused by PCRE when a locale other than C is set.

#59 @miqrogroove
10 years ago

I'm just amazed that none of this is documented anywhere as far as I can find.

#60 @azaozz
10 years ago

Yes, thinking we can ask that the PCRE (and PHP) documentation is updated to reflect this.

#61 @miqrogroove
10 years ago

Based on what we know now, I think the most recent patch and tests are appropriate. I wouldn't want to involve locale information in the fix for this bug.

#62 @nacin
10 years ago

  • Keywords commit 3.9-early added
  • Milestone changed from 3.8 to Future Release

See also, #22363, apparently.

Let's try this out early 3.9.

#63 @miqrogroove
10 years ago

nacin, the current patch here does not involve PCRE unicode modes. We decided to go strictly ASCII. Other than that, the filenames ticket doesn't seem related.

#64 @miqrogroove
10 years ago

Could someone edit the ticket description to say "Type" instead of "Paste" ? A while back we figured out there was a big difference.

#65 @kpdesign
10 years ago

  • Description modified (diff)

@miqrogroove: I've made the change in the ticket description.

#66 @miqrogroove
10 years ago

miqro-22692.3.patch is ready to commit.

miqro-22692-tests.2.diff let me know if I need to refresh that one.

Last edited 10 years ago by miqrogroove (previous) (diff)

#67 @SergeyBiryukov
10 years ago

  • Milestone changed from Future Release to 3.9

#68 @nacin
10 years ago

  • Keywords wptexturize added

@miqrogroove
10 years ago

Expand regexp to include future use of HTML output.

@miqrogroove
10 years ago

Move regexp filter to a new function for reuse.

@miqrogroove
10 years ago

Covers work on smilies for 26842

#69 @miqrogroove
10 years ago

Patches and unit tests expanded for compatibility with changes proposed in #26842

#70 @miqrogroove
10 years ago

I decided this needs to be expanded into wpautop(). I reviewed the impact earlier and determined my patches would not break anything there. However, it would be silly to leave pre-existing \s bugs in wpautop() when we are already tackling them in wptexturize(). Check for new attachments in a couple hours.

@miqrogroove
10 years ago

Touch unautop for spaces around shortcodes.

#71 @miqrogroove
10 years ago

I'm triple checking everything before the IRC meeting. Just noticed a stray wildcard in the original regexp for closing single quotes: '/\'([\s.]|\Z)/' That lone period may represent any character, similar to the \s|\S that was used in closing double quotes. I will tweak that now if needed.

#72 @miqrogroove
10 years ago

Testing shows lone periods inside character classes are not treated as wildcards. So [\s.] is match both spaces and periods. This is still a significant difference between double quotes and single quotes, and for the sake of this patch I want that original behavior unchanged. Patch revision coming momentarily.

Version 0, edited 10 years ago by miqrogroove (next)

@miqrogroove
10 years ago

Correction: Preserve behavior of periods after single quotes.

@miqrogroove
10 years ago

Merge src and tests.

@miqrogroove
10 years ago

Src and tests merge, take 2.

This ticket was mentioned in IRC in #wordpress-dev by azaozz. View the logs.


10 years ago

@miqrogroove
10 years ago

First pass at unit tests to cover all of wptexturize.

#74 @miqrogroove
10 years ago

In IRC, we decided to add a very large number of unit tests for wptexturize. These tests will:

  1. Fail upon any change of behavior from v3.8.
  2. Fail for all known bugs.
  3. Eventually, not fail. We can set aside some tests later as needed.

A first set of tests is attached. More to follow as time permits.

@miqrogroove
10 years ago

Second pass at unit tests to cover all of wptexturize.

#75 @miqrogroove
10 years ago

Uploaded more tests. This covers everything in wptexturize v3.8 except for the $no_texturize_shortcodes feature. It looks like many of the known bugs already have tests in the trunk version, but I plan to check if any need to be added.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

@miqrogroove
10 years ago

Patch and tests rollup.

@miqrogroove
10 years ago

Divide unit tests into smaller units.

@miqrogroove
10 years ago

Fixed syntax in unit tests.

#77 @miqrogroove
10 years ago

The unit tests are working now. In IRC we decided the next step is to benchmark the two different regexp styles, lookbehinds vs. capturing backreferences. There are any number of reasons one might be faster than the other, so it will be good to find out.

#78 @azaozz
10 years ago

miqro-22692.12.patch looks good. The only tests that fail after applying the patch are in test_more_bugs() in WPTexturize.php.

#79 @miqrogroove
10 years ago

Initial testing shows a massive performance improvement after the patch, as-is!

Using, for example:

$start = microtime(TRUE);
for ($i = 0; $i<5; $i++) {
  $curl = preg_replace($dynamic_characters, $dynamic_replacements, $post);
}
$end = microtime(TRUE);

Here were the results based on the old and new regexp strings, tested against a 9,911 byte post.

Old logic, 1 iteration = 0.0128 sec
Old logic, 5 iteration = 0.0658 sec
Old logic, 10 iteration = 0.1308 sec
Old logic, 25 iteration = 0.3159 sec

New logic, 1 iteration = 0.0020 sec
New logic, 5 iteration = 0.0094 sec
New logic, 10 iteration = 0.0188 sec
New logic, 25 iteration = 0.0468 sec

I will take a closer look now to see if any tuning would be worthwhile.

@miqrogroove
10 years ago

Avoid running the 9x9 regexp, which is inherently slow.

#80 @miqrogroove
10 years ago

Patch 14 gives another 3x performance boost by skipping the 9x9 replacement when possible. The extra check is negligible but saves a lot of time:

Patch 14, 9x9 off, 1 iteration = 0.0006 sec
Patch 14, 9x9 off, 5 iteration = 0.0027 sec
Patch 14, 9x9 off, 10 iteration = 0.0053 sec
Patch 14, 9x9 off, 25 iteration = 0.0131 sec

Patch 14, 9x9 on, 1 iteration = 0.0020 sec
Patch 14, 9x9 on, 5 iteration = 0.0095 sec
Patch 14, 9x9 on, 10 iteration = 0.0188 sec
Patch 14, 9x9 on, 25 iteration = 0.0467 sec

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#82 @miqrogroove
10 years ago

Might want to expand test_ampersand() and test_more_bugs() with malformed entity references like &&amp; and &!amp;

wptexturize in 3.8 will predictably fail to convert the first character because the second character in the pattern is only required to not be a #.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#84 @miqrogroove
10 years ago

Meeting minutes:

  • Is it worth committing this 3 weeks before release?
    • Patch would be reverted at the first sign of trouble.
    • Could alter the perception of other bugs.
    • Paves the way for other bugs to be fixed.
    • Improves performance.
    • A decision will be made by the Beta 3 deadline.
  • What changes are made other than quote replacement patterns?
    • wptexturize() will treat any 0xC2A0 or &nbsp; the same as a regular space.
    • Bugs outside the scope of this patch are preserved (not fixed).
    • Smilies and shortcodes are fixed in the current patch. Should this be separate?
    • Extra spaces removed from translate_smiley() because it would lead to nbsp being replaced by normal space and other strange problems.
    • Were people counting on multiple spaces around smilies img? - Unlikely
  • What is the overlap with #26842?
    • That ticket would change how the editor outputs NBSP characters.
    • Smilies and shortcodes must be patched before that change.
    • It remains a separate issue.
  • miqrogroove will be available to address any fallout from this patch.
Last edited 10 years ago by miqrogroove (previous) (diff)

#85 @miqrogroove
10 years ago

Executive summary of miqro-22692.14.patch:

  • new function wp_spaces_regexp()
    • Provides the chars and strings that can represent a space in regexp.
    • Adds a run-once filter for any customization needs.
  • wptexturize
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now quotes can be neighbors with nbsp.
    • Removed duplicate patterns and junk like \s|\S for performance.
    • Replaced back-reference patterns with look-behind patterns for performance.
    • Moved the expensive 9x9 pattern so it can be skipped when not needed.
  • shortcode_unautop
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now shortcodes can be neighors with nbsp.
  • translate_smiley
    • Removed extra spaces that were not be needed after fixing smilies_init().
  • smilies_init
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now smilies can be neighbors with nbsp.
    • Added a look-behind to prevent replacement of leading spaces.
    • Removed extra capturing groups for performance.

@miqrogroove
10 years ago

Patch refreshed without code for smilies, no tests for other bugs.

#86 @miqrogroove
10 years ago

Based on feedback last night, the unit tests for smilies need a lot of work.

miqro-22692.15.patch adjusted as follows:

Dropped all changes to smilies_init() and translate_smiley().

Dropped all new unit tests for bugs not fixed by this patch.

@miqrogroove
10 years ago

Patch refreshed without fix for shortcodes.

#87 @miqrogroove
10 years ago

miqro-22692.16.patch is limited in scope to wptexturize() only. I added the two assertions from comment 82 under test_ampersand().

#88 @nacin
10 years ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 27839:

Texturize: Massive performance improvements (~600% faster); better handling of nbsp, double, and weird spaces; 136 new unit tests.

big props miqrogroove.
fixes #22692.

#89 @nacin
10 years ago

Bravo, miqrogroove.

Let's spin out the smiley and shortcode stuff into new tickets, and hit the ground running with them in 4.0.

If there are any issues in [27839] we're likely to do a full or partial revert (for example, I'd want to leave in the 9x9 speed-up), given how little time there is remaining in the 3.9 development cycle, and we'd try it again in 4.0. But this was too good to pass up.

As I noticed in IRC at one point, I did some profiling and benchmarking using the text of Alice in Wonderland (nearly 25,000 words). I noticed that with the patch, the page loaded about 25% faster.

#90 @nacin
10 years ago

In 27844:

Add braces to wptexturize() after [27839]. see #22692.

#91 @miqrogroove
10 years ago

For future releases:

#26842 - Editor
#27426 - i18n
#27587 - Smilies
#27588 - Shortcodes

Note: See TracTickets for help on using tickets.