#22692 closed defect (bug) (fixed)
Quotes Are Messing Up
Reported by: | miqrogroove | Owned by: | nacin |
---|---|---|---|
Milestone: | 3.9 | Priority: | normal |
Severity: | normal | Version: | 1.2 |
Component: | Formatting | Keywords: | has-patch commit 3.9-early wptexturize |
Focuses: | Cc: |
Description (last modified by )
Type this into the body of a new post:
A sentence. "A quote after 2 spaces."
Preview that, and be amazed. The first quote is backwards. :(
Due to font differences, this is easiest to see on the Twenty Ten theme.
Attachments (24)
Change History (115)
#2
@
12 years ago
Isn't this caused by the wptexturize filter?
Yes, It's a bug in the texturize filter, this ticket is to fix it.
#5
follow-up:
↓ 6
@
12 years ago
Cannot reproduce in the editor.
wptexturize( 'A sentence. "A quote after 2 spaces."' )
returns proper quotes.
#6
in reply to:
↑ 5
@
12 years ago
Replying to nacin:
Cannot reproduce in the editor.
Sometimes pasting doesn't preserve both spaces. Try removing the space and hitting spacebar twice from Visual editor. That'll do it.
#7
follow-up:
↓ 8
@
12 years ago
Here's another bug:
"<code>blah</code>"
The trailing quote is backwards when output. 3.4.2 and 3.5 both affected.
#8
in reply to:
↑ 7
@
12 years ago
Replying to miqrogroove:
"<code>blah</code>"The trailing quote is backwards when output.
Related: #18549
#9
@
12 years ago
Any chance of getting a milestone for the 2-spaces bug? I'm in the habit of typing with 2 spaces so this bothers me a bit.
#10
follow-up:
↓ 14
@
12 years ago
Could not reproduce either:
- Pasted the example into editor (tried both Text and Visual).
- Made sure there are two spaces before the quote.
- The post contains proper quotes:
A sentence. “A quote after 2 spaces.”
#13
@
12 years ago
- Keywords reporter-feedback added
I can reproduce in 3.5.1 with WP Editor plugin activated, but not on a clean install. Could you try on a clean install?
#14
in reply to:
↑ 10
@
12 years ago
Replying to SergeyBiryukov:
Could not reproduce either:
- Pasted the example into editor (tried both Text and Visual).
OK couple things to clarify here. I can't reproduce this bug by pasting or by using the Text tab. Go to the Visual tab, type in the spaces, and then see what happens in preview.
#15
follow-up:
↓ 16
@
12 years ago
- Keywords needs-patch added; reporter-feedback removed
I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?
#16
in reply to:
↑ 15
@
12 years ago
- Keywords needs-unit-tests added
Replying to miqrogroove:
I've also confirmed this is reproducible in 3.6-beta1. Any chance this can get on a milestone now?
Would need a patch and unit tests for that.
#17
@
12 years ago
OK, I'm in an optimistic mood today :) The problem is likely in a filter somewhere and I'll track it down.
#18
@
12 years ago
*sigh*
Not a filter issue. I'm getting different results with identical inputs on two servers now. Starting to suspect an mb string issue. I'll boil this down to a test case for you to look at.
#19
@
12 years ago
$text = "A sentence. \xC2\xA0\"A quote after 2 spaces.\""; $opening_quote = '“'; $closing_quote = '”'; $dynamic = array(); $dynamic[ '/(\s|\A|[([{<])"(?!\s)/' ] = '$1' . $opening_quote . '$2'; $dynamic[ '/"(\s|\S|\Z)/' ] = $closing_quote . '$1'; $dynamic_characters = array_keys( $dynamic ); $dynamic_replacements = array_values( $dynamic ); echo preg_replace($dynamic_characters, $dynamic_replacements, $text);
The above test case produces bad output on my BlueHost account, but works perfectly on my home server. Configurations are quite different; unsure the culprit. Suspect mb string.
#20
@
12 years ago
Oops. Fixed it. The PCRE_UTF8 modifier forces identical output on both servers.
$dynamic[ '/(\s|\A|[([{<])"(?!\s)/u' ] = '$1' . $opening_quote . '$2'; $dynamic[ '/"(\s|\S|\Z)/u' ] = $closing_quote . '$1';
I would think that's safe to do as long as everyone is using UTF-8.
#26
@
11 years ago
- Milestone changed from 3.6 to Future Release
Punting this as we're focusing on 3.6 blockers now.
#28
in reply to:
↑ 27
@
11 years ago
Replying to miqrogroove:
3.7 early? Please?
Needs unit tests. http://make.wordpress.org/core/handbook/automated-testing/
#29
@
11 years ago
I've got the test case in comment 19 above. The output of preg_replace in that case or of wptexturize itself needs to be compared to the wanted and unwanted outputs (which I could provide).
I don't know how to make unit tests. Could you convert it for me?
#31
@
11 years ago
The point nacin mentioned is still not resolved. If you want to have it on the 3.8 milestone then you should look into create unit tests. It's not that hard. So creating small functions that make calls to wptexturize() and compare the output with what you expect.
Look at http://core.trac.wordpress.org/browser/trunk/tests/phpunit/tests/formatting/WPTexturize.php.
#32
@
11 years ago
miqro-22692-tests.diff attached - My first attempt ever making a unit test. Thanks tierra and azaozz in IRC for pointing me in a direction. This file needs testing.
#33
@
11 years ago
Possible concern - Do the unit tests have to be translatable? The opening and closing quotes in wptexturize() also run through _x() which means the output could be considerably different in other languages.
#34
follow-up:
↓ 42
@
11 years ago
also,
/"(\s|\S|\Z)/
Does that regexp make sense to anyone? Isn't it equivalent to /"/
??
#35
@
11 years ago
The tests look good, but... There is something going on here. Without the u
modifier, in some cases \s
matches \xA0
which happens to be the second byte from U+00A0 and all works as expected. The tests pass.
Adding this to any testcase:
/** * @ticket 999999 */ function test_regex_xa0() { $this->assertTrue( 1 === preg_match( '/^\s$/', "\xA0" ) ); $this->assertRegExp( '/^\s$/', "\xA0" ); }
Then run it with phpunit --group 999999
. Both pass on my test install. However this:
add_action( 'shutdown', 'my_test_s' ); function my_test_s() { var_dump( preg_match( '/^\s$/', "\xA0" ) ); }
outputs int(0)
... So which one is right?
#36
@
11 years ago
On my home server, var_dump( preg_match( '/^\s$/', "\xA0" ) );
throws int(1). But on BlueHost I'm having bad output. I can't explain the difference, but the configurations are quite diverse.
#37
@
11 years ago
BlueHost has mbstring installed and throws int(0), whereas my home server does not have mbstring and throws int(1). Does that help in any way?
#38
@
11 years ago
Another good guess would be that my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid). Neither has a default charset configured in PHP, and I wonder if it falls back to an OS setting.
#39
@
11 years ago
...my home server is properly referencing the ISO-8859-1 code page (\xA0 is a space), whereas BlueHost could be using ASCII (\xA0 is invalid).
Yes, that's the most likely reason. Still couldn't confirm/reproduce it though. There are more inconsistencies. looking in the PCRE documentation:
The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is included in a Perl script, \s may match the VT charac- ter. In PCRE, it never does.
And later on:
By default, in a UTF mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. These sequences retain their original meanings from before UTF support was available, mainly for efficiency reasons. However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the be- haviour is changed so that Unicode properties are used to determine character types, as follows: \d any character that \p{Nd} matches (decimal digit) \s any character that \p{Z} matches, plus HT, LF, FF, CR \w any character that \p{L} or \p{N} matches, plus underscore
So, \s
will not match U+00A0 if PCRE was compiled without Unicode property support. In addition the PCRE_UCP option is set as of PHP 5.3.4, so the u
modifier doesn't change \s
in earlier versions: https://bugs.php.net/bug.php?id=52971.
In that terms, to fix this particular case we should use [\s\xA0]
to match white space. However to make it fully compatible, this should include all white space chars from \p{Z}
plus \r, \n, \t and \f.
In addition:
- In PHP 5.3.4+, when using preg_* functions in UTF mode and PCRE has been compiled with Unicode property support, the behavior of
\d
,\D
,\s
,\S
,\w
, and\W
changes. They match the Unicode property of each char, including a lot more characters and making the regex quite slower. - Currently this cannot be used for two reasons:
- The lowest required PHP version is 5.2.4.
- Some hosts don't have Unicode property support in PCRE.
#40
@
11 years ago
Actually the proper regex is \s|\xC2\xA0
as we are not in UTF mode and \xA0
can match part of an UTF-8 char.
#41
@
11 years ago
So then matching both bytes works better with UTF-8, but matching only the Latin byte might work better with other types of encoding. I agree both bytes is the way to go if UTF-8 is the expected input. It also wont break other multi-byte types the way that /u would.
#42
in reply to:
↑ 34
@
11 years ago
Replying to miqrogroove:
also,
/"(\s|\S|\Z)/
Does that regexp make sense to anyone? Isn't it equivalent to
/"/
??
Clarification.. this is replacement mode, so what's the difference between these, if any:
$dynamic[ '/"(\s|\S|\Z)/' ] = $closing_quote . '$1';
or
$dynamic[ '/"(.?)/' ] = $closing_quote . '$1';
or
$dynamic[ '/"/' ] = $closing_quote;
Wouldn't those all do the same thing?
#43
@
11 years ago
I'm seeing other problems that need to be resolved with this patch.
(?!
is a non-capturing lookahead assertion, which means the $2
replacement token does nothing.
#44
@
11 years ago
azaozz, miqro-22692.2.patch is based on your feedback in comment 40, as well as fixes for comments 42 and 43.
#46
@
11 years ago
Another note on miqro-22692.2.patch :
Changing the lookahead from negative (?!\s)
to positive (?=\S)
which I think is more readable, also has the effect of changing the expected output for a lone double quote, as in wptexturize('"')
.
I made that change on the assumption it would be trivial.
#48
@
11 years ago
- Version changed from 3.4.2 to 1.2
The original regex appears to be introduced in [985].
#49
@
11 years ago
The "original" bug probably looked more like #16957 because the post editor was inserting the HTML character entity reference instead of its UTF-8 encoding.
#50
@
11 years ago
Added a third patch option. If we can't figure out exactly why \s is causing semi-random behavior, then this patch should be used to eliminate \s from wptexturize().
#51
@
11 years ago
I got in touch with one of the PCRE devs who says, "Clearly, the different PHP implementations you are using must be calling pcre_compile() with different options." http://bugs.exim.org/show_bug.cgi?id=1412
#52
@
11 years ago
Very nice, super fast turnaround from PCRE :)
As far as I see the problem comes from here:
/\s/W \x{a0} 0: \xa0
(non-UTF mode, but with Unicode properties).
If PCRE was compiled with Unicode properties support *and* ? PCRE_UCP is passed at compile time, \s
will match \xa0
even when not in UTF mode. However in PHP PCRE_UCP
is only passed when the u
modifier is set: http://svn.php.net/viewvc/php/php-src/trunk/ext/pcre/php_pcre.c?r1=303963&r2=303962&pathrev=303963 since PHP 5.3.4. And that hasn't changed in newer versions: http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l366.
... Still can't pinpoint the inconsistencies with \s
.
#53
@
11 years ago
I'm not familiar with that PCRE Unicode mode, but it sounds similar to UTF-16. In Windows, \xA0 is sometimes encoded \x00\xA0 and that's when data typing becomes so important.
I put in a ticket at php.net today. They tend to be slow.
#54
@
11 years ago
azaozz, there's a request from PHP to post as much phpinfo() as possible from systems returning int(1) for the test code. Could you drop a phpinfo() call into your phpunit and post the results? Maybe edit out usernames and addresses for safety. See https://bugs.php.net/bug.php?id=66068
#55
@
11 years ago
And in case it's relevant, let's see if your command line environment shows a different result for
var_dump( setlocale( LC_ALL, 0 ) );
#56
@
11 years ago
On BlueHost, where we get the "expected" behavior of not matching \xA0, the setlocale() result is
string(1) "C"
On my home server, I get
string(85) "LC_COLLATE=C;LC_CTYPE=English_United States.1252;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C"
#58
@
11 years ago
Yeah, now we are getting somewhere...
PHP 5.4.16 on Windows 7:
var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C" var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0) var_dump( setlocale( LC_ALL, '' ) ); // On Windows this sets it to the system default // string(19) "English_Canada.1252" var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1) var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false) (as \xA0 is not full UTF char) setlocale( LC_ALL, 'C' ); var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0) var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // bool(false)
PHP 5.3.1 on Mac OSX (note: 5.3.1 doesn't set PCRE_UCP with the u
modifier)
var_dump( setlocale( LC_ALL, 0 ) ); // string(1) "C" var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0) setlocale( LC_ALL, 'en_CA' ); // Also with 'en_CA.UTF-8', 'en_CA.ISO8859-1', 'en_CA.ISO8859-15', etc. var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(1) var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0) setlocale( LC_ALL, 'C' ); // Also with 'en_CA.US-ASCII' var_dump( preg_match( '/^\s$/', "\xA0" ) ); // int(0) var_dump( preg_match( '/^\s$/u', "\xA0" ) ); // int(0)
So when the locale is anything other than C
or US-ASCII
equivalent, \s
matches \xA0
. Also on multithreaded servers like Apache on Windows, setlocale() is "sticky", more info.
From the PCRE manual:
PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character value. When running in UTF-8 mode, this applies only to characters with codes less than 128.
It is not mentioned there but seems \s
is also affected by these tables.
The internal tables can always be overridden by tables supplied by the application that calls PCRE... External tables are built by calling pcre_maketables()
PHP uses pcre_maketables() only if the locale is 'C': http://git.php.net/?p=php-src.git;a=blob;f=ext/pcre/php_pcre.c;h=7d34d9feb15a81b5e80973cf1aaa1c4936543173;hb=refs/heads/master#l392
So it seems that the unexpected behavior of \s
is caused by PCRE when a locale other than C
is set.
#60
@
11 years ago
Yes, thinking we can ask that the PCRE (and PHP) documentation is updated to reflect this.
#61
@
11 years ago
Based on what we know now, I think the most recent patch and tests are appropriate. I wouldn't want to involve locale information in the fix for this bug.
#62
@
11 years ago
- Keywords commit 3.9-early added
- Milestone changed from 3.8 to Future Release
See also, #22363, apparently.
Let's try this out early 3.9.
#63
@
11 years ago
nacin, the current patch here does not involve PCRE unicode modes. We decided to go strictly ASCII. Other than that, the filenames ticket doesn't seem related.
#64
@
11 years ago
Could someone edit the ticket description to say "Type" instead of "Paste" ? A while back we figured out there was a big difference.
#65
@
11 years ago
- Description modified (diff)
@miqrogroove: I've made the change in the ticket description.
#66
@
11 years ago
miqro-22692.3.patch is ready to commit.
miqro-22692-tests.2.diff let me know if I need to refresh that one.
#69
@
11 years ago
Patches and unit tests expanded for compatibility with changes proposed in #26842
#70
@
11 years ago
I decided this needs to be expanded into wpautop(). I reviewed the impact earlier and determined my patches would not break anything there. However, it would be silly to leave pre-existing \s bugs in wpautop() when we are already tackling them in wptexturize(). Check for new attachments in a couple hours.
#71
@
11 years ago
I'm triple checking everything before the IRC meeting. Just noticed a stray wildcard in the original regexp for closing single quotes: '/\'([\s.]|\Z)/'
That lone period may represent any character, similar to the \s|\S
that was used in closing double quotes. I will tweak that now if needed.
#72
@
11 years ago
Testing shows lone periods inside character classes are not treated as wildcards. So [\s.]
is matching both spaces and periods. This is still a significant difference between double quotes and single quotes, and for the sake of this patch I want that original behavior unchanged. Patch revision coming momentarily.
This ticket was mentioned in IRC in #wordpress-dev by azaozz. View the logs.
11 years ago
#74
@
11 years ago
In IRC, we decided to add a very large number of unit tests for wptexturize. These tests will:
- Fail upon any change of behavior from v3.8.
- Fail for all known bugs.
- Eventually, not fail. We can set aside some tests later as needed.
A first set of tests is attached. More to follow as time permits.
#75
@
11 years ago
Uploaded more tests. This covers everything in wptexturize v3.8 except for the $no_texturize_shortcodes feature. It looks like many of the known bugs already have tests in the trunk version, but I plan to check if any need to be added.
This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.
11 years ago
#77
@
11 years ago
The unit tests are working now. In IRC we decided the next step is to benchmark the two different regexp styles, lookbehinds vs. capturing backreferences. There are any number of reasons one might be faster than the other, so it will be good to find out.
#78
@
11 years ago
miqro-22692.12.patch looks good. The only tests that fail after applying the patch are in test_more_bugs() in WPTexturize.php.
#79
@
11 years ago
Initial testing shows a massive performance improvement after the patch, as-is!
Using, for example:
$start = microtime(TRUE); for ($i = 0; $i<5; $i++) { $curl = preg_replace($dynamic_characters, $dynamic_replacements, $post); } $end = microtime(TRUE);
Here were the results based on the old and new regexp strings, tested against a 9,911 byte post.
Old logic, 1 iteration = 0.0128 sec
Old logic, 5 iteration = 0.0658 sec
Old logic, 10 iteration = 0.1308 sec
Old logic, 25 iteration = 0.3159 sec
New logic, 1 iteration = 0.0020 sec
New logic, 5 iteration = 0.0094 sec
New logic, 10 iteration = 0.0188 sec
New logic, 25 iteration = 0.0468 sec
I will take a closer look now to see if any tuning would be worthwhile.
#80
@
11 years ago
Patch 14 gives another 3x performance boost by skipping the 9x9 replacement when possible. The extra check is negligible but saves a lot of time:
Patch 14, 9x9 off, 1 iteration = 0.0006 sec
Patch 14, 9x9 off, 5 iteration = 0.0027 sec
Patch 14, 9x9 off, 10 iteration = 0.0053 sec
Patch 14, 9x9 off, 25 iteration = 0.0131 sec
Patch 14, 9x9 on, 1 iteration = 0.0020 sec
Patch 14, 9x9 on, 5 iteration = 0.0095 sec
Patch 14, 9x9 on, 10 iteration = 0.0188 sec
Patch 14, 9x9 on, 25 iteration = 0.0467 sec
This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.
11 years ago
#82
@
11 years ago
Might want to expand test_ampersand() and test_more_bugs() with malformed entity references like &&
and &!amp;
wptexturize in 3.8 will predictably fail to convert the first character because the second character in the pattern is only required to not be a #.
This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.
11 years ago
#84
@
11 years ago
Meeting minutes:
- Is it worth committing this 3 weeks before release?
- Patch would be reverted at the first sign of trouble.
- Could alter the perception of other bugs.
- Paves the way for other bugs to be fixed.
- Improves performance.
- A decision will be made by the Beta 3 deadline.
- What changes are made other than quote replacement patterns?
- wptexturize() will treat any 0xC2A0 or the same as a regular space.
- Bugs outside the scope of this patch are preserved (not fixed).
- Smilies and shortcodes are fixed in the current patch. Should this be separate?
- Extra spaces removed from translate_smiley() because it would lead to nbsp being replaced by normal space and other strange problems.
- Were people counting on multiple spaces around smilies img? - Unlikely
- What is the overlap with #26842?
- That ticket would change how the editor outputs NBSP characters.
- Smilies and shortcodes must be patched before that change.
- It remains a separate issue.
- miqrogroove will be available to address any fallout from this patch.
#85
@
11 years ago
Executive summary of miqro-22692.14.patch:
- new function wp_spaces_regexp()
- Provides the chars and strings that can represent a space in regexp.
- Adds a run-once filter for any customization needs.
- wptexturize
- Replaced \s sequence by wp_spaces_regexp() in regex.
- Now quotes can be neighbors with nbsp.
- Removed duplicate patterns and junk like
\s|\S
for performance. - Replaced back-reference patterns with look-behind patterns for performance.
- Moved the expensive 9x9 pattern so it can be skipped when not needed.
- shortcode_unautop
- Replaced \s sequence by wp_spaces_regexp() in regex.
- Now shortcodes can be neighors with nbsp.
- translate_smiley
- Removed extra spaces that were not be needed after fixing smilies_init().
- smilies_init
- Replaced \s sequence by wp_spaces_regexp() in regex.
- Now smilies can be neighbors with nbsp.
- Added a look-behind to prevent replacement of leading spaces.
- Removed extra capturing groups for performance.
#86
@
11 years ago
Based on feedback last night, the unit tests for smilies need a lot of work.
miqro-22692.15.patch adjusted as follows:
Dropped all changes to smilies_init() and translate_smiley().
Dropped all new unit tests for bugs not fixed by this patch.
#87
@
11 years ago
miqro-22692.16.patch is limited in scope to wptexturize() only. I added the two assertions from comment 82 under test_ampersand().
#88
@
11 years ago
- Owner set to nacin
- Resolution set to fixed
- Status changed from new to closed
In 27839:
#89
@
11 years ago
Bravo, miqrogroove.
Let's spin out the smiley and shortcode stuff into new tickets, and hit the ground running with them in 4.0.
If there are any issues in [27839] we're likely to do a full or partial revert (for example, I'd want to leave in the 9x9 speed-up), given how little time there is remaining in the 3.9 development cycle, and we'd try it again in 4.0. But this was too good to pass up.
As I noticed in IRC at one point, I did some profiling and benchmarking using the text of Alice in Wonderland (nearly 25,000 words). I noticed that with the patch, the page loaded about 25% faster.
Isn't this caused by the wptexturize filter?
Try to put
in the theme's functions.php file.