WordPress.org

Make WordPress Core

Opened 7 years ago

Closed 5 years ago

Last modified 5 years ago

#22692 closed defect (bug) (fixed)

Quotes Are Messing Up

Reported by: miqrogroove Owned by: nacin
Milestone: 3.9 Priority: normal
Severity: normal Version: 1.2
Component: Formatting Keywords: has-patch commit 3.9-early wptexturize
Focuses: Cc:

Description (last modified by kpdesign)

Type this into the body of a new post:

A sentence.  "A quote after 2 spaces."

Preview that, and be amazed. The first quote is backwards. :(

Due to font differences, this is easiest to see on the Twenty Ten theme.

Attachments (24)

miqro-22692.patch (787 bytes) - added by miqrogroove 6 years ago.
miqro-22692-tests.diff (1.2 KB) - added by miqrogroove 6 years ago.
miqro-22692.2.patch (1.0 KB) - added by miqrogroove 6 years ago.
Patch improved based on ticket feedback.
miqro-22692.3.patch (3.2 KB) - added by miqrogroove 6 years ago.
A more robust solution conceived in IRC. Eliminates the \s sequence.
miqro-22692-tests.2.diff (1.7 KB) - added by miqrogroove 6 years ago.
miqro-22692.4.patch (3.3 KB) - added by miqrogroove 6 years ago.
Expand regexp to include future use of HTML output.
miqro-22692-tests.patch (2.0 KB) - added by miqrogroove 6 years ago.
miqro-22692.5.patch (4.8 KB) - added by miqrogroove 6 years ago.
Move regexp filter to a new function for reuse.
miqro-22692-tests.3.diff (3.7 KB) - added by miqrogroove 6 years ago.
Covers work on smilies for 26842
miqro-22692.6.patch (5.6 KB) - added by miqrogroove 6 years ago.
Touch unautop for spaces around shortcodes.
miqro-22692-tests.4.diff (4.5 KB) - added by miqrogroove 6 years ago.
miqro-22692.7.patch (5.7 KB) - added by miqrogroove 6 years ago.
Correction: Preserve behavior of periods after single quotes.
miqro-22692.8.patch (53.8 KB) - added by miqrogroove 6 years ago.
Merge src and tests.
miqro-22692.9.patch (10.3 KB) - added by miqrogroove 6 years ago.
Src and tests merge, take 2.
WPTexturize.php.diff (10.4 KB) - added by miqrogroove 6 years ago.
First pass at unit tests to cover all of wptexturize.
WPTexturize.php.2.diff (14.5 KB) - added by miqrogroove 6 years ago.
Second pass at unit tests to cover all of wptexturize.
WPTexturize.php.3.diff (18.9 KB) - added by miqrogroove 6 years ago.
miqro-22692.10.patch (27.2 KB) - added by miqrogroove 6 years ago.
Patch and tests rollup.
miqro-22692.11.patch (32.1 KB) - added by miqrogroove 6 years ago.
Divide unit tests into smaller units.
miqro-22692.12.patch (32.1 KB) - added by miqrogroove 6 years ago.
Fixed syntax in unit tests.
miqro-22692.13.patch (33.2 KB) - added by miqrogroove 6 years ago.
Avoid running the 9x9 regexp, which is inherently slow.
miqro-22692.14.patch (33.0 KB) - added by miqrogroove 6 years ago.
miqro-22692.15.patch (24.2 KB) - added by miqrogroove 5 years ago.
Patch refreshed without code for smilies, no tests for other bugs.
miqro-22692.16.patch (22.6 KB) - added by miqrogroove 5 years ago.
Patch refreshed without fix for shortcodes.

Download all attachments as: .zip

Change History (115)

#1 @jond3r
7 years ago

  • Cc jond3r@… added

Isn't this caused by the wptexturize filter?
Try to put

remove_filter('the_content', 'wptexturize');

in the theme's functions.php file.

#2 @dd32
7 years ago

Isn't this caused by the wptexturize filter?

Yes, It's a bug in the texturize filter, this ticket is to fix it.

#3 @pavelevap
7 years ago

Related: #16957 ?

#4 @SergeyBiryukov
7 years ago

  • Component changed from General to Formatting

#5 follow-up: @nacin
7 years ago

Cannot reproduce in the editor.

wptexturize( 'A sentence. "A quote after 2 spaces."' ) returns proper quotes.

#6 in reply to: ↑ 5 @miqrogroove
7 years ago

Replying to nacin:

Cannot reproduce in the editor.

Sometimes pasting doesn't preserve both spaces. Try removing the space and hitting spacebar twice from Visual editor. That'll do it.

#7 follow-up: @miqrogroove
7 years ago

Here's another bug:

"<code>blah</code>"

The trailing quote is backwards when output. 3.4.2 and 3.5 both affected.

#8 in reply to: ↑ 7 @SergeyBiryukov
6 years ago

Replying to miqrogroove:

"<code>blah</code>"

The trailing quote is backwards when output.

Related: #18549

#9 @miqrogroove
6 years ago

Any chance of getting a milestone for the 2-spaces bug? I'm in the habit of typing with 2 spaces so this bothers me a bit.

#10 follow-up: @SergeyBiryukov
6 years ago

Could not reproduce either:

  1. Pasted the example into editor (tried both Text and Visual).
  2. Made sure there are two spaces before the quote.
  3. The post contains proper quotes:
    A sentence.  &#8220;A quote after 2 spaces.&#8221;
    

#11 @miqrogroove
6 years ago

Tested 3.5.1:

<p>A sentence.  &#8221;A quote after 2 spaces.&#8221;</p>

#12 @helen
6 years ago

Please always test trunk.

#13 @SergeyBiryukov
6 years ago

  • Keywords reporter-feedback added

I can reproduce in 3.5.1 with WP Editor plugin activated, but not on a clean install. Could you try on a clean install?

#14 in reply to: ↑ 10 @miqrogroove
6 years ago

Replying to SergeyBiryukov:

Could not reproduce either:

  1. Pasted the example into editor (tried both Text and Visual).

OK couple things to clarify here. I can't reproduce this bug by pasting or by using the Text tab. Go to the Visual tab, type in the spaces, and then see what happens in preview.

#15 follow-up: @miqrogroove
6 years ago

  • Keywords needs-patch added; reporter-feedback removed

I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?

#16 in reply to: ↑ 15 @nacin
6 years ago

  • Keywords needs-unit-tests added

Replying to miqrogroove:

I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?

Would need a patch and unit tests for that.

#17 @miqrogroove
6 years ago

OK, I'm in an optimistic mood today :) The problem is likely in a filter somewhere and I'll track it down.

#18 @miqrogroove
6 years ago

*sigh*

Not a filter issue. I'm getting different results with identical inputs on two servers now. Starting to suspect an mb string issue. I'll boil this down to a test case for you to look at.

#19 @miqrogroove
6 years ago

$text = "A sentence. \xC2\xA0\"A quote after 2 spaces.\"";

$opening_quote = '&#8220;';
$closing_quote = '&#8221;';

$dynamic = array();
$dynamic[ '/(\s|\A|[([{<])"(?!\s)/'    ] = '$1' . $opening_quote . '$2'; 
$dynamic[ '/"(\s|\S|\Z)/'              ] = $closing_quote . '$1'; 

$dynamic_characters = array_keys( $dynamic );
$dynamic_replacements = array_values( $dynamic );

echo preg_replace($dynamic_characters, $dynamic_replacements, $text);

The above test case produces bad output on my BlueHost account, but works perfectly on my home server. Configurations are quite different; unsure the culprit. Suspect mb string.

#20 @miqrogroove
6 years ago

Oops. Fixed it. The PCRE_UTF8 modifier forces identical output on both servers.

$dynamic[ '/(\s|\A|[([{<])"(?!\s)/u'   ] = '$1' . $opening_quote . '$2'; 
$dynamic[ '/"(\s|\S|\Z)/u'             ] = $closing_quote . '$1'; 

I would think that's safe to do as long as everyone is using UTF-8.

#21 @miqrogroove
6 years ago

  • Keywords needs-patch removed

Patch added.

#22 @miqrogroove
6 years ago

Btw, I don't do unit testing. What can we do to get this on a milestone now?

#23 @SergeyBiryukov
6 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.6

#24 @miqrogroove
6 years ago

Thank you Sergey

#25 @miqrogroove
6 years ago

Anyone have a few minutes to test my patch?

#26 @markjaquith
6 years ago

  • Milestone changed from 3.6 to Future Release

Punting this as we're focusing on 3.6 blockers now.

#27 follow-up: @miqrogroove
6 years ago

3.7 early? Please?

#28 in reply to: ↑ 27 @nacin
6 years ago

Replying to miqrogroove:

3.7 early? Please?

Needs unit tests. http://make.wordpress.org/core/handbook/automated-testing/

#29 @miqrogroove
6 years ago

I've got the test case in comment 19 above. The output of preg_replace in that case or of wptexturize itself needs to be compared to the wanted and unwanted outputs (which I could provide).

I don't know how to make unit tests. Could you convert it for me?

Last edited 6 years ago by miqrogroove (previous) (diff)

#30 @miqrogroove
6 years ago

3.8 early? Please?
Begging!

#31 @markoheijnen
6 years ago

The point nacin mentioned is still not resolved. If you want to have it on the 3.8 milestone then you should look into create unit tests. It's not that hard. So creating small functions that make calls to wptexturize() and compare the output with what you expect.

Look at http://core.trac.wordpress.org/browser/trunk/tests/phpunit/tests/formatting/WPTexturize.php.

#32 @miqrogroove
6 years ago

miqro-22692-tests.diff attached - My first attempt ever making a unit test. Thanks tierra and azaozz in IRC for pointing me in a direction. This file needs testing.

#33 @miqrogroove
6 years ago

Possible concern - Do the unit tests have to be translatable? The opening and closing quotes in wptexturize() also run through _x() which means the output could be considerably different in other languages.

#34 follow-up: @miqrogroove
6 years ago

also,

/"(\s|\S|\Z)/

Does that regexp make sense to anyone? Isn't it equivalent to /"/ ??

#35 @azaozz
6 years ago

The tests look good, but... There is something going on here. Without the u modifier, in some cases \s matches \xA0 which happens to be the second byte from U+00A0 and all works as expected. The tests pass.

Adding this to any testcase:

/**
 * @ticket 999999
 */
function test_regex_xa0() {
	$this->assertTrue( 1 === preg_match( '/^\s$/', "\xA0" ) );
	$this->assertRegExp( '/^\s$/', "\xA0" );
}

Then run it with phpunit --group 999999. Both pass on my test install. However this:

add_action( 'shutdown', 'my_test_s' );
function my_test_s() {
	var_dump( preg_match( '/^\s$/', "\xA0" ) );
}

outputs int(0)... So which one is right?

#36 @miqrogroove
6 years ago

On my home server, var_dump( preg_match( '/^\s$/', "\xA0" ) ); throws int(1). But on BlueHost I'm having bad output. I can't explain the difference, but the configurations are quite diverse.

#37 @miqrogroove
6 years ago

BlueHost has mbstring installed and throws int(0), whereas my home server does not have mbstring and throws int(1). Does that help in any way?

#38 @miqrogroove
6 years ago

Another good guess would be that my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid). Neither has a default charset configured in PHP, and I wonder if it falls back to an OS setting.

#39 @azaozz
6 years ago

...my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid).

Yes, that's the most likely reason. Still couldn't confirm/reproduce it though. There are more inconsistencies. looking in the PCRE documentation:

The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).
If "use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.

And later on:

By default, in a UTF mode, characters  with  values  greater  than  128
never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
sequences retain their original meanings from before  UTF  support  was
available,  mainly for efficiency reasons. However, if PCRE is compiled
with Unicode property support, and the PCRE_UCP option is set, the  be-
haviour  is  changed  so  that Unicode properties are used to determine
character types, as follows:

 \d  any character that \p{Nd} matches (decimal digit)
 \s  any character that \p{Z} matches, plus HT, LF, FF, CR
 \w  any character that \p{L} or \p{N} matches, plus underscore

So, \s will not match U+00A0 if PCRE was compiled without Unicode property support. In addition the PCRE_UCP option is set as of PHP 5.3.4, so the u modifier doesn't change \s in earlier versions: https://bugs.php.net/bug.php?id=52971.

In that terms, to fix this particular case we should use [\s\xA0] to match white space. However to make it fully compatible, this should include all white space chars from \p{Z} plus \r, \n, \t and \f.

In addition:

  • In PHP 5.3.4+, when using preg_* functions in UTF mode and PCRE has been compiled with Unicode property support, the behavior of \d, \D, \s, \S, \w, and \W changes. They match the Unicode property of each char, including a lot more characters and making the regex quite slower.
  • Currently this cannot be used for two reasons:
    • The lowest required PHP version is 5.2.4.
    • Some hosts don't have Unicode property support in PCRE.
Last edited 6 years ago by azaozz (previous) (diff)

#40 @azaozz
6 years ago

Actually the proper regex is \s|\xC2\xA0 as we are not in UTF mode and \xA0 can match part of an UTF-8 char.

#41 @miqrogroove
6 years ago

So then matching both bytes works better with UTF-8, but matching only the Latin byte might work better with other types of encoding. I agree both bytes is the way to go if UTF-8 is the expected input. It also wont break other multi-byte types the way that /u would.

#42 in reply to: ↑ 34 @miqrogroove
6 years ago

Replying to miqrogroove:

also,

/"(\s|\S|\Z)/

Does that regexp make sense to anyone? Isn't it equivalent to /"/ ??

Clarification.. this is replacement mode, so what's the difference between these, if any:

$dynamic[ '/"(\s|\S|\Z)/' ] = $closing_quote . '$1';

or

$dynamic[ '/"(.?)/' ] = $closing_quote . '$1';

or

$dynamic[ '/"/' ] = $closing_quote;

Wouldn't those all do the same thing?

Last edited 6 years ago by miqrogroove (previous) (diff)

#43 @miqrogroove
6 years ago

I'm seeing other problems that need to be resolved with this patch.

(?! is a non-capturing lookahead assertion, which means the $2 replacement token does nothing.

@miqrogroove
6 years ago

Patch improved based on ticket feedback.

#44 @miqrogroove
6 years ago

azaozz, miqro-22692.2.patch​ is based on your feedback in comment 40, as well as fixes for comments 42 and 43.

#45 @miqrogroove
6 years ago

  • Keywords needs-unit-tests removed

#46 @miqrogroove
6 years ago

Another note on miqro-22692.2.patch :

Changing the lookahead from negative (?!\s) to positive (?=\S) which I think is more readable, also has the effect of changing the expected output for a lone double quote, as in wptexturize('"').

I made that change on the assumption it would be trivial.

#47 @SergeyBiryukov
6 years ago

  • Milestone changed from Future Release to 3.8

#48 @SergeyBiryukov
6 years ago

  • Version changed from 3.4.2 to 1.2

The original regex appears to be introduced in [985].

#49 @miqrogroove
6 years ago

The "original" bug probably looked more like #16957 because the post editor was inserting the HTML character entity reference &nbsp; instead of its UTF-8 encoding.

@miqrogroove
6 years ago

A more robust solution conceived in IRC. Eliminates the \s sequence.

#50 @miqrogroove
6 years ago

Added a third patch option. If we can't figure out exactly why \s is causing semi-random behavior, then this patch should be used to eliminate \s from wptexturize().

#51 @miqrogroove
6 years ago

I got in touch with one of the PCRE devs who says, "Clearly, the different PHP implementations you are using must be calling pcre_compile() with different options." http://bugs.exim.org/show_bug.cgi?id=1412

#52 @azaozz
6 years ago

Very nice, super fast turnaround from PCRE :)

As far as I see the problem comes from here:

/\s/W
\x{a0}
 0: \xa0

(non-UTF mode, but with Unicode properties).

If PCRE was compiled with Unicode properties support *and* ? PCRE_UCP is passed at compile time, \s will match \xa0 even when not in UTF mode. However in PHP PCRE_UCP is only passed when the u modifier is set: http://svn.php.net/viewvc/php/php-src/trunk/ext/pcre/php_pcre.c?r1=303963&r2=303962&pathrev=303963 since PHP 5.3.4. And that hasn't changed in newer versions: http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l366.

... Still can't pinpoint the inconsistencies with \s.

#53 @miqrogroove
6 years ago

I'm not familiar with that PCRE Unicode mode, but it sounds similar to UTF-16. In Windows, \xA0 is sometimes encoded \x00\xA0 and that's when data typing becomes so important.

I put in a ticket at php.net today. They tend to be slow.

#54 @miqrogroove
6 years ago

azaozz, there's a request from PHP to post as much phpinfo() as possible from systems returning int(1) for the test code. Could you drop a phpinfo() call into your phpunit and post the results? Maybe edit out usernames and addresses for safety. See https://bugs.php.net/bug.php?id=66068

#55 @miqrogroove
6 years ago

And in case it's relevant, let's see if your command line environment shows a different result for

var_dump( setlocale( LC_ALL, 0 ) );

#56 @miqrogroove
6 years ago

On BlueHost, where we get the "expected" behavior of not matching \xA0, the setlocale() result is

string(1) "C"

On my home server, I get

string(85) "LC_COLLATE=C;LC_CTYPE=English_United States.1252;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C"

#57 @miqrogroove
6 years ago

setlocale( LC_CTYPE, 'C' );
var_dump( preg_match( '/^\s$/', "\xA0" ) );

int(0)

#58 @azaozz
6 years ago

Yeah, now we are getting somewhere...

PHP 5.4.16 on Windows 7:

var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

var_dump( setlocale( LC_ALL, '' ) ); // On Windows this sets it to the system default
// string(19) "English_Canada.1252"

var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false) (as \xA0 is not full UTF char)

setlocale( LC_ALL, 'C' );
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false)

PHP 5.3.1 on Mac OSX (note: 5.3.1 doesn't set PCRE_UCP with the u modifier)

var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C"
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)

setlocale( LC_ALL, 'en_CA' ); // Also with 'en_CA.UTF-8', 'en_CA.ISO8859-1', 'en_CA.ISO8859-15', etc.
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)

setlocale( LC_ALL, 'C' ); // Also with 'en_CA.US-ASCII'
var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0)
var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)

So when the locale is anything other than C or US-ASCII equivalent, \s matches \xA0. Also on multithreaded servers like Apache on Windows, setlocale() is "sticky", more info.

From the PCRE manual:

PCRE  handles  caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables,  indexed
by  character  value.  When running in UTF-8 mode, this applies only to
characters with codes less than 128.

It is not mentioned there but seems \s is also affected by these tables.

The internal tables can always be overridden by tables supplied by  the
application that calls PCRE... External  tables  are  built by calling
pcre_maketables()

PHP uses pcre_maketables() only if the locale is 'C': http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l392

So it seems that the unexpected behavior of \s is caused by PCRE when a locale other than C is set.

#59 @miqrogroove
6 years ago

I'm just amazed that none of this is documented anywhere as far as I can find.

#60 @azaozz
6 years ago

Yes, thinking we can ask that the PCRE (and PHP) documentation is updated to reflect this.

#61 @miqrogroove
6 years ago

Based on what we know now, I think the most recent patch and tests are appropriate. I wouldn't want to involve locale information in the fix for this bug.

#62 @nacin
6 years ago

  • Keywords commit 3.9-early added
  • Milestone changed from 3.8 to Future Release

See also, #22363, apparently.

Let's try this out early 3.9.

#63 @miqrogroove
6 years ago

nacin, the current patch here does not involve PCRE unicode modes. We decided to go strictly ASCII. Other than that, the filenames ticket doesn't seem related.

#64 @miqrogroove
6 years ago

Could someone edit the ticket description to say "Type" instead of "Paste" ? A while back we figured out there was a big difference.

#65 @kpdesign
6 years ago

  • Description modified (diff)

@miqrogroove: I've made the change in the ticket description.

#66 @miqrogroove
6 years ago

miqro-22692.3.patch is ready to commit.

miqro-22692-tests.2.diff let me know if I need to refresh that one.

Last edited 6 years ago by miqrogroove (previous) (diff)

#67 @SergeyBiryukov
6 years ago

  • Milestone changed from Future Release to 3.9

#68 @nacin
6 years ago

  • Keywords wptexturize added

@miqrogroove
6 years ago

Expand regexp to include future use of HTML output.

@miqrogroove
6 years ago

Move regexp filter to a new function for reuse.

@miqrogroove
6 years ago

Covers work on smilies for 26842

#69 @miqrogroove
6 years ago

Patches and unit tests expanded for compatibility with changes proposed in #26842

#70 @miqrogroove
6 years ago

I decided this needs to be expanded into wpautop(). I reviewed the impact earlier and determined my patches would not break anything there. However, it would be silly to leave pre-existing \s bugs in wpautop() when we are already tackling them in wptexturize(). Check for new attachments in a couple hours.

@miqrogroove
6 years ago

Touch unautop for spaces around shortcodes.

#71 @miqrogroove
6 years ago

I'm triple checking everything before the IRC meeting. Just noticed a stray wildcard in the original regexp for closing single quotes: '/\'([\s.]|\Z)/' That lone period may represent any character, similar to the \s|\S that was used in closing double quotes. I will tweak that now if needed.

#72 @miqrogroove
6 years ago

Testing shows lone periods inside character classes are not treated as wildcards. So [\s.] is matching both spaces and periods. This is still a significant difference between double quotes and single quotes, and for the sake of this patch I want that original behavior unchanged. Patch revision coming momentarily.

Last edited 6 years ago by miqrogroove (previous) (diff)

@miqrogroove
6 years ago

Correction: Preserve behavior of periods after single quotes.

@miqrogroove
6 years ago

Merge src and tests.

@miqrogroove
6 years ago

Src and tests merge, take 2.

This ticket was mentioned in IRC in #wordpress-dev by azaozz. View the logs.


6 years ago

@miqrogroove
6 years ago

First pass at unit tests to cover all of wptexturize.

#74 @miqrogroove
6 years ago

In IRC, we decided to add a very large number of unit tests for wptexturize. These tests will:

  1. Fail upon any change of behavior from v3.8.
  2. Fail for all known bugs.
  3. Eventually, not fail. We can set aside some tests later as needed.

A first set of tests is attached. More to follow as time permits.

@miqrogroove
6 years ago

Second pass at unit tests to cover all of wptexturize.

#75 @miqrogroove
6 years ago

Uploaded more tests. This covers everything in wptexturize v3.8 except for the $no_texturize_shortcodes feature. It looks like many of the known bugs already have tests in the trunk version, but I plan to check if any need to be added.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


6 years ago

@miqrogroove
6 years ago

Patch and tests rollup.

@miqrogroove
6 years ago

Divide unit tests into smaller units.

@miqrogroove
6 years ago

Fixed syntax in unit tests.

#77 @miqrogroove
6 years ago

The unit tests are working now. In IRC we decided the next step is to benchmark the two different regexp styles, lookbehinds vs. capturing backreferences. There are any number of reasons one might be faster than the other, so it will be good to find out.

#78 @azaozz
6 years ago

miqro-22692.12.patch looks good. The only tests that fail after applying the patch are in test_more_bugs() in WPTexturize.php.

#79 @miqrogroove
6 years ago

Initial testing shows a massive performance improvement after the patch, as-is!

Using, for example:

$start = microtime(TRUE);
for ($i = 0; $i<5; $i++) {
  $curl = preg_replace($dynamic_characters, $dynamic_replacements, $post);
}
$end = microtime(TRUE);

Here were the results based on the old and new regexp strings, tested against a 9,911 byte post.

Old logic, 1 iteration = 0.0128 sec
Old logic, 5 iteration = 0.0658 sec
Old logic, 10 iteration = 0.1308 sec
Old logic, 25 iteration = 0.3159 sec

New logic, 1 iteration = 0.0020 sec
New logic, 5 iteration = 0.0094 sec
New logic, 10 iteration = 0.0188 sec
New logic, 25 iteration = 0.0468 sec

I will take a closer look now to see if any tuning would be worthwhile.

@miqrogroove
6 years ago

Avoid running the 9x9 regexp, which is inherently slow.

#80 @miqrogroove
6 years ago

Patch 14 gives another 3x performance boost by skipping the 9x9 replacement when possible. The extra check is negligible but saves a lot of time:

Patch 14, 9x9 off, 1 iteration = 0.0006 sec
Patch 14, 9x9 off, 5 iteration = 0.0027 sec
Patch 14, 9x9 off, 10 iteration = 0.0053 sec
Patch 14, 9x9 off, 25 iteration = 0.0131 sec

Patch 14, 9x9 on, 1 iteration = 0.0020 sec
Patch 14, 9x9 on, 5 iteration = 0.0095 sec
Patch 14, 9x9 on, 10 iteration = 0.0188 sec
Patch 14, 9x9 on, 25 iteration = 0.0467 sec

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


6 years ago

#82 @miqrogroove
5 years ago

Might want to expand test_ampersand() and test_more_bugs() with malformed entity references like &&amp; and &!amp;

wptexturize in 3.8 will predictably fail to convert the first character because the second character in the pattern is only required to not be a #.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


5 years ago

#84 @miqrogroove
5 years ago

Meeting minutes:

  • Is it worth committing this 3 weeks before release?
    • Patch would be reverted at the first sign of trouble.
    • Could alter the perception of other bugs.
    • Paves the way for other bugs to be fixed.
    • Improves performance.
    • A decision will be made by the Beta 3 deadline.
  • What changes are made other than quote replacement patterns?
    • wptexturize() will treat any 0xC2A0 or &nbsp; the same as a regular space.
    • Bugs outside the scope of this patch are preserved (not fixed).
    • Smilies and shortcodes are fixed in the current patch. Should this be separate?
    • Extra spaces removed from translate_smiley() because it would lead to nbsp being replaced by normal space and other strange problems.
    • Were people counting on multiple spaces around smilies img? - Unlikely
  • What is the overlap with #26842?
    • That ticket would change how the editor outputs NBSP characters.
    • Smilies and shortcodes must be patched before that change.
    • It remains a separate issue.
  • miqrogroove will be available to address any fallout from this patch.
Last edited 5 years ago by miqrogroove (previous) (diff)

#85 @miqrogroove
5 years ago

Executive summary of miqro-22692.14.patch:

  • new function wp_spaces_regexp()
    • Provides the chars and strings that can represent a space in regexp.
    • Adds a run-once filter for any customization needs.
  • wptexturize
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now quotes can be neighbors with nbsp.
    • Removed duplicate patterns and junk like \s|\S for performance.
    • Replaced back-reference patterns with look-behind patterns for performance.
    • Moved the expensive 9x9 pattern so it can be skipped when not needed.
  • shortcode_unautop
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now shortcodes can be neighors with nbsp.
  • translate_smiley
    • Removed extra spaces that were not be needed after fixing smilies_init().
  • smilies_init
    • Replaced \s sequence by wp_spaces_regexp() in regex.
    • Now smilies can be neighbors with nbsp.
    • Added a look-behind to prevent replacement of leading spaces.
    • Removed extra capturing groups for performance.

@miqrogroove
5 years ago

Patch refreshed without code for smilies, no tests for other bugs.

#86 @miqrogroove
5 years ago

Based on feedback last night, the unit tests for smilies need a lot of work.

miqro-22692.15.patch adjusted as follows:

Dropped all changes to smilies_init() and translate_smiley().

Dropped all new unit tests for bugs not fixed by this patch.

@miqrogroove
5 years ago

Patch refreshed without fix for shortcodes.

#87 @miqrogroove
5 years ago

miqro-22692.16.patch is limited in scope to wptexturize() only. I added the two assertions from comment 82 under test_ampersand().

#88 @nacin
5 years ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 27839:

Texturize: Massive performance improvements (~600% faster); better handling of nbsp, double, and weird spaces; 136 new unit tests.

big props miqrogroove.
fixes #22692.

#89 @nacin
5 years ago

Bravo, miqrogroove.

Let's spin out the smiley and shortcode stuff into new tickets, and hit the ground running with them in 4.0.

If there are any issues in [27839] we're likely to do a full or partial revert (for example, I'd want to leave in the 9x9 speed-up), given how little time there is remaining in the 3.9 development cycle, and we'd try it again in 4.0. But this was too good to pass up.

As I noticed in IRC at one point, I did some profiling and benchmarking using the text of Alice in Wonderland (nearly 25,000 words). I noticed that with the patch, the page loaded about 25% faster.

#90 @nacin
5 years ago

In 27844:

Add braces to wptexturize() after [27839]. see #22692.

#91 @miqrogroove
5 years ago

For future releases:

#26842 - Editor
#27426 - i18n
#27587 - Smilies
#27588 - Shortcodes

Note: See TracTickets for help on using tickets.