Make WordPress Core

Opened 11 years ago

Closed 10 years ago

Last modified 10 years ago

#29256 closed defect (bug) (fixed)

Strings ending with a number and quotation mark get wrong smart quotes

Reported by: coreygilmore's profile coreygilmore Owned by: johnbillion's profile johnbillion
Milestone: 4.3 Priority: normal
Severity: normal Version: 2.8
Component: Formatting Keywords: wptexturize has-patch
Focuses: Cc:

Description

When a string ends in a number followed by a single or double quote, such as "Lorem ipsum dolor sit amet 1234" or 'Etiam eu egestas dui 1234' wptexturize incorrectly converts the trailing single or double quote into a single or double prime, respectively.

Related to #8775, but the scope of that ticket seems to be limited to numbers inside of quotes, not a number preceded by other text (crudely described as\d+ vs .*\d+).

This is still an issue in trunk on .org and on wordpress.com (Example: http://coreygilmore.wordpress.com/2014/08/18/expendables-3-illegally-downloaded-5-million-times-but-still-isnt-top-hit-for-pirates/)

Attachments (6)

quote-prime.png (62.1 KB) - added by coreygilmore 11 years ago.
miqro-29256.patch (13.2 KB) - added by miqrogroove 11 years ago.
miqro-29256.2.patch (13.2 KB) - added by miqrogroove 11 years ago.
Tweaks for readability.
miqro-29256.3.patch (13.3 KB) - added by miqrogroove 10 years ago.
Refreshed
miqro-29256.4.patch (18.3 KB) - added by miqrogroove 10 years ago.
Handles situations where open and close quotes are identical after translation.
miqro-29256.5.patch (18.0 KB) - added by miqrogroove 10 years ago.
Refresh previous patch.

Download all attachments as: .zip

Change History (33)

#1 @miqrogroove
11 years ago

  • Keywords needs-patch needs-unit-tests wptexturize added

Ignoring #18549 it may be possible to fix this for plain text only. The logic is not "regular" and will require potentially slow code.

#2 @miqrogroove
11 years ago

Algorithm brainstorm:

  1. Split the subject string at every opening quote.
  2. If each value contains only one closing quote, then no primes exist.
  3. If a value contains multiple closing quote candidates then any candidate that is not preceded by a digit should be replaced with a closing quote before replacing the primes.
  4. Make sure primes are followed by a space, whereas trailing punctuation indicates a quote.
  5. If all closing quotes are preceded by a digit and not followed by punctuation then the last quote is assumed to be the only closing quote character.
  6. Implode the subject string and resume the inner loop of wptexturize.

#3 @miqrogroove
11 years ago

On further reflection, primes do not have to be followed by whitespace. For example, the length of the table is 4'.

#4 @miqrogroove
11 years ago

  • Keywords has-patch added; needs-patch needs-unit-tests removed

A complete solution with unit tests is now attached.

Some benchmark testing wouldn't hurt, just to find out the impact.

I will advocate for the 4.1 milestone when it becomes available.

@miqrogroove
11 years ago

Tweaks for readability.

#5 @miqrogroove
11 years ago

Note the patch and tests do not check for mix-and-match usage of UTF-8 quote chars. If that's desired, we would have to normalize those chars as entity references at the top of the new function.

#6 @miqrogroove
10 years ago

  • Keywords 4.1-early added

#7 @SergeyBiryukov
10 years ago

  • Milestone changed from Awaiting Review to 4.1

#8 @miqrogroove
10 years ago

A few more thoughts:

Patch will need to be refreshed after 4.0.1.

Patch did not add any new tests for cases like Fred answered the problem with '99' and then turned in the test. It would be helpful to look into whether the complexity would be increased or decreased by bringing in logic for quoted 2-digit numbers. This is the situation where we have to distinguish between an abbreviated year at the end of a quotation vs. a standalone quoted 2-digit number. If we can remove those extra regex patterns from the main function and incorporate them here, it might give better and faster results.

@miqrogroove
10 years ago

Refreshed

#9 @miqrogroove
10 years ago

Refreshed patch and looked at remaining issues. I think this patch will stand on its own because it solves a specific problem without being too complex.

The remaining problems we still have with a phrase like '99' is that it could have three different meanings depending on its context. Examples:

Then he said, '99' of lumber'.

Then he said, "we need '99' feet of lumber".

Then he said, 'that was the class of '99'.

We could theoretically determine that context, but it would require a huge amount of new code that seems to be beyond the scope of this ticket.

#10 @miqrogroove
10 years ago

#29912 was marked as a duplicate.

@miqrogroove
10 years ago

Handles situations where open and close quotes are identical after translation.

#11 @miqrogroove
10 years ago

Expanded the patch so that if opening and closing quotes are both the same character after translation, then the new function won't get confused or need to bail.

#12 follow-up: @paulschreiber
10 years ago

Does this patch handle the case of "30 for 30" ?

With 4.0, the second " becomes an inch mark.

#13 in reply to: ↑ 12 @miqrogroove
10 years ago

Replying to paulschreiber:

Does this patch handle the case of "30 for 30" ?

Yes.

#14 @johnbillion
10 years ago

  • Keywords 4.2-early added; 4.1-early removed
  • Milestone changed from 4.1 to Future Release

This could do with some performance figures. How does this impact page generation time?

#15 @paulschreiber
10 years ago

We'd really like to see this fixed in 4.1 — it is causing problems for our editorial staff.

I noticed this has been patched for two months. Please don't kick it to the next release and suddenly worry about page generation time. (I've never seen a wptexturize fix require "performance figures.") If this is a genuine concern, give @miqrogroove some time to do so before deferring it.

#16 @johnbillion
10 years ago

This is a large change to texturize and we're on the verge of a RC. Sorry it wasn't addressed sooner.

#17 @miqrogroove
10 years ago

Future Release is correct. I have been focused on critical bugs in 4.1 and expect that work to continue for a week or two.

#18 @iseulde
10 years ago

  • Milestone changed from Future Release to 4.2

has-patch 4.2-early so moving to 4.2.

#19 @DrewAPicture
10 years ago

@miqrogroove: What's the status here? Did you still want to try to tackle this for 4.2?

Patch still applies.

#20 @DrewAPicture
10 years ago

  • Keywords 4.3-early added; 4.2-early removed
  • Milestone changed from 4.2 to Future Release

No meaningful activity in several months, and as cited by @johnbillion previously, this needs time to soak. Maybe 4.3-early is the ticket.

#21 @miqrogroove
10 years ago

Needs performance testing at the least. Didn't have time to work on it this cycle. :)

#22 @obenland
10 years ago

  • Owner set to johnbillion
  • Status changed from new to assigned

#23 @obenland
10 years ago

  • Keywords 4.3-early removed
  • Milestone changed from Future Release to 4.3

#24 @miqrogroove
10 years ago

  • Focuses performance added
  • Keywords changed from wptexturize, has-patch to wptexturize has-patch

Initial performance test showed wptexturize() runs 20% faster after the patch. :o Wow. It may be ugly, but it's fast! I can only guess the improvement is tied to the overall reduction of PREG usage. I will test more to confirm this.

#25 @miqrogroove
10 years ago

  • Focuses performance removed

Performance difference was < 1% after correcting a script error. Still, I was expecting awful results so this is promising.

@miqrogroove
10 years ago

Refresh previous patch.

#26 @wonderboymusic
10 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 32863:

wptexturize() improvements:

  • Make sure that strings ending with a number and quotation mark get the proper smart quotes
  • Introduce wptexturize_primes(), a logic tree to determine whether or not "7'." represents seven feet, then converts the special char into either a prime char or a closing quote char.

Adds unit tests.

Props miqrogroove.
Fixes #29256.

#27 @xibe
10 years ago

Great to see this finally land in Core! It solves part of the unending issue between wptexturize and smart quotes :)

Note: See TracTickets for help on using tickets.