WordPress.org

Make WordPress Core

Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#29256 closed defect (bug) (fixed)

Strings ending with a number and quotation mark get wrong smart quotes

Reported by: coreygilmore Owned by: johnbillion
Milestone: 4.3 Priority: normal
Severity: normal Version: 2.8
Component: Formatting Keywords: wptexturize has-patch
Focuses: Cc:

Description

When a string ends in a number followed by a single or double quote, such as "Lorem ipsum dolor sit amet 1234" or 'Etiam eu egestas dui 1234' wptexturize incorrectly converts the trailing single or double quote into a single or double prime, respectively.

Related to #8775, but the scope of that ticket seems to be limited to numbers inside of quotes, not a number preceded by other text (crudely described as\d+ vs .*\d+).

This is still an issue in trunk on .org and on wordpress.com (Example: http://coreygilmore.wordpress.com/2014/08/18/expendables-3-illegally-downloaded-5-million-times-but-still-isnt-top-hit-for-pirates/)

Attachments (6)

quote-prime.png (62.1 KB) - added by coreygilmore 5 years ago.
miqro-29256.patch (13.2 KB) - added by miqrogroove 5 years ago.
miqro-29256.2.patch (13.2 KB) - added by miqrogroove 5 years ago.
Tweaks for readability.
miqro-29256.3.patch (13.3 KB) - added by miqrogroove 5 years ago.
Refreshed
miqro-29256.4.patch (18.3 KB) - added by miqrogroove 5 years ago.
Handles situations where open and close quotes are identical after translation.
miqro-29256.5.patch (18.0 KB) - added by miqrogroove 4 years ago.
Refresh previous patch.

Download all attachments as: .zip

Change History (33)

#1 @miqrogroove
5 years ago

  • Keywords needs-patch needs-unit-tests wptexturize added

Ignoring #18549 it may be possible to fix this for plain text only. The logic is not "regular" and will require potentially slow code.

#2 @miqrogroove
5 years ago

Algorithm brainstorm:

  1. Split the subject string at every opening quote.
  2. If each value contains only one closing quote, then no primes exist.
  3. If a value contains multiple closing quote candidates then any candidate that is not preceded by a digit should be replaced with a closing quote before replacing the primes.
  4. Make sure primes are followed by a space, whereas trailing punctuation indicates a quote.
  5. If all closing quotes are preceded by a digit and not followed by punctuation then the last quote is assumed to be the only closing quote character.
  6. Implode the subject string and resume the inner loop of wptexturize.

#3 @miqrogroove
5 years ago

On further reflection, primes do not have to be followed by whitespace. For example, the length of the table is 4'.

#4 @miqrogroove
5 years ago

  • Keywords has-patch added; needs-patch needs-unit-tests removed

A complete solution with unit tests is now attached.

Some benchmark testing wouldn't hurt, just to find out the impact.

I will advocate for the 4.1 milestone when it becomes available.

@miqrogroove
5 years ago

Tweaks for readability.

#5 @miqrogroove
5 years ago

Note the patch and tests do not check for mix-and-match usage of UTF-8 quote chars. If that's desired, we would have to normalize those chars as entity references at the top of the new function.

#6 @miqrogroove
5 years ago

  • Keywords 4.1-early added

#7 @SergeyBiryukov
5 years ago

  • Milestone changed from Awaiting Review to 4.1

#8 @miqrogroove
5 years ago

A few more thoughts:

Patch will need to be refreshed after 4.0.1.

Patch did not add any new tests for cases like Fred answered the problem with '99' and then turned in the test. It would be helpful to look into whether the complexity would be increased or decreased by bringing in logic for quoted 2-digit numbers. This is the situation where we have to distinguish between an abbreviated year at the end of a quotation vs. a standalone quoted 2-digit number. If we can remove those extra regex patterns from the main function and incorporate them here, it might give better and faster results.

@miqrogroove
5 years ago

Refreshed

#9 @miqrogroove
5 years ago

Refreshed patch and looked at remaining issues. I think this patch will stand on its own because it solves a specific problem without being too complex.

The remaining problems we still have with a phrase like '99' is that it could have three different meanings depending on its context. Examples:

Then he said, '99' of lumber'.

Then he said, "we need '99' feet of lumber".

Then he said, 'that was the class of '99'.

We could theoretically determine that context, but it would require a huge amount of new code that seems to be beyond the scope of this ticket.

#10 @miqrogroove
5 years ago

#29912 was marked as a duplicate.

@miqrogroove
5 years ago

Handles situations where open and close quotes are identical after translation.

#11 @miqrogroove
5 years ago

Expanded the patch so that if opening and closing quotes are both the same character after translation, then the new function won't get confused or need to bail.

#12 follow-up: @paulschreiber
5 years ago

Does this patch handle the case of "30 for 30" ?

With 4.0, the second " becomes an inch mark.

#13 in reply to: ↑ 12 @miqrogroove
5 years ago

Replying to paulschreiber:

Does this patch handle the case of "30 for 30" ?

Yes.

#14 @johnbillion
5 years ago

  • Keywords 4.2-early added; 4.1-early removed
  • Milestone changed from 4.1 to Future Release

This could do with some performance figures. How does this impact page generation time?

#15 @paulschreiber
5 years ago

We'd really like to see this fixed in 4.1 — it is causing problems for our editorial staff.

I noticed this has been patched for two months. Please don't kick it to the next release and suddenly worry about page generation time. (I've never seen a wptexturize fix require "performance figures.") If this is a genuine concern, give @miqrogroove some time to do so before deferring it.

#16 @johnbillion
5 years ago

This is a large change to texturize and we're on the verge of a RC. Sorry it wasn't addressed sooner.

#17 @miqrogroove
5 years ago

Future Release is correct. I have been focused on critical bugs in 4.1 and expect that work to continue for a week or two.

#18 @iseulde
5 years ago

  • Milestone changed from Future Release to 4.2

has-patch 4.2-early so moving to 4.2.

#19 @DrewAPicture
4 years ago

@miqrogroove: What's the status here? Did you still want to try to tackle this for 4.2?

Patch still applies.

#20 @DrewAPicture
4 years ago

  • Keywords 4.3-early added; 4.2-early removed
  • Milestone changed from 4.2 to Future Release

No meaningful activity in several months, and as cited by @johnbillion previously, this needs time to soak. Maybe 4.3-early is the ticket.

#21 @miqrogroove
4 years ago

Needs performance testing at the least. Didn't have time to work on it this cycle. :)

#22 @obenland
4 years ago

  • Owner set to johnbillion
  • Status changed from new to assigned

#23 @obenland
4 years ago

  • Keywords 4.3-early removed
  • Milestone changed from Future Release to 4.3

#24 @miqrogroove
4 years ago

  • Focuses performance added
  • Keywords changed from wptexturize, has-patch to wptexturize has-patch

Initial performance test showed wptexturize() runs 20% faster after the patch. :o Wow. It may be ugly, but it's fast! I can only guess the improvement is tied to the overall reduction of PREG usage. I will test more to confirm this.

#25 @miqrogroove
4 years ago

  • Focuses performance removed

Performance difference was < 1% after correcting a script error. Still, I was expecting awful results so this is promising.

@miqrogroove
4 years ago

Refresh previous patch.

#26 @wonderboymusic
4 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 32863:

wptexturize() improvements:

  • Make sure that strings ending with a number and quotation mark get the proper smart quotes
  • Introduce wptexturize_primes(), a logic tree to determine whether or not "7'." represents seven feet, then converts the special char into either a prime char or a closing quote char.

Adds unit tests.

Props miqrogroove.
Fixes #29256.

#27 @xibe
4 years ago

Great to see this finally land in Core! It solves part of the unending issue between wptexturize and smart quotes :)

Note: See TracTickets for help on using tickets.