Make WordPress Core

Opened 13 years ago

Last modified 18 months ago

#18549 assigned defect (bug)

wp_texturize incorrectly curls closing quotes after inline HTML end tags

Reported by: justincwatt's profile justincwatt Owned by: miqrogroove's profile miqrogroove
Milestone: Priority: normal
Severity: normal Version: 3.2.1
Component: Formatting Keywords: wptexturize has-patch
Focuses: Cc:

Description

The following source HTML:

The word is "<a href="http://example.com/">quoted</a>".
The word is '<a href="http://example.com/">quoted</a>'
The word is '<a href="http://example.com/">quoted.</a>'
The word is '<a href="http://example.com/">quoted</a>'.
The word is '<a href="http://example.com/">quot</a>'d

Gets incorrectly transformed by wp_texturize() as:

The word is &#8220;<a href="http://example.com/">quoted</a>&#8220;.
The word is &#8216;<a href="http://example.com/">quoted</a>&#8216;
The word is &#8216;<a href="http://example.com/">quoted.</a>&#8216;
The word is &#8216;<a href="http://example.com/">quoted</a>&#8216;.
The word is &#8216;<a href="http://example.com/">quot</a>&#8216;d

Note: all the double/single quotes in the above examples that should be closing are instead opening)

This renders in the browser like this:

The word is “quoted“.
The word is ‘quoted‘
The word is ‘quoted.‘
The word is ‘quoted‘.
The word is ‘quot‘d

The problem here is that wp_texturize splits the text on all start/end tags, which makes sense for block-level tags, but not inline-tags like <em> and <a href="">.

formatting.php line 67:

$textarr = preg_split('/(<.*>|\[.*\])/Us', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

However if you change it to only split the content on block-level tags, you'll need something more sophisticated/complex than a regular expression to avoid curling quotes within html.

Attachments (8)

18549.diff (6.3 KB) - added by wonderboymusic 11 years ago.
18549.2.diff (7.7 KB) - added by miqrogroove 11 years ago.
Tweak the unit tests.
18549.3.diff (6.6 KB) - added by miqrogroove 10 years ago.
Restore tests that fail.
18549_wptexturize.diff (23.2 KB) - added by gitlost 9 years ago.
An implementation of miqrogroove's algorithm.
18549_tests.diff (10.7 KB) - added by gitlost 9 years ago.
Unittests.
18549_wptexturize.2.diff (22.7 KB) - added by gitlost 9 years ago.
Individual statics instead of arrays.
18549_wptexturize.3.diff (26.0 KB) - added by gitlost 9 years ago.
Some performance tweaks.
18549_wptexturize.4.diff (34.1 KB) - added by gitlost 8 years ago.
Refresh for 4.7. Includes unit tests.

Download all attachments as: .zip

Change History (44)

#2 @Ammaletu
13 years ago

  • Cc Ammaletu added
  • Version changed from 3.2.1 to 3.3.1

I can confirm this issue. Haven't had time yet to write a test case or patch, but I took a look at the plugin "InTypo" (which I have been using for years to change the English quote characters to German ones). This plugin completely replaces wptexturize and does not show this bug except in one minor case. What InTypo does: It first replaces the closing quotes and then the opening ones. This way, when you get to replacing the opening quotes, the closing quotes are already safely replaced.

The code looks like this:

$dynamic_character = '/"([\.,;\!\?\s\)\]])/'
$dynamic_replacement = $closing_quote . '$1'

Basically, it replaces every quote character with a closing quote that is followed by a space, a closing bracket or a number of punctuation marks. That is not perfect yet, for example wptexturize handles { and < for opening quotes, so these should be handled here as well. Also probably any punctuation marks from different languages (are there any missing?). And finally this fails if the closing quote is the very last character of the post, a case that could be handled by adding \Z.

Could this be the basis for a patch?

#3 @SergeyBiryukov
13 years ago

  • Version changed from 3.3.1 to 3.2.1

Version number indicates when the bug was initially introduced/reported.

#4 @SergeyBiryukov
12 years ago

#23827 was marked as a duplicate.

#5 @nacin
11 years ago

  • Keywords wptexturize added

#6 @nacin
11 years ago

  • Keywords needs-patch needs-unit-tests added
  • Milestone changed from Awaiting Review to Future Release

Unit tests in particular would move this along.

#7 @miqrogroove
11 years ago

All of this seems to fall within the scope of #4539, which already has unit tests. Can we close this one as a duplicate?

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


11 years ago

@miqrogroove
11 years ago

Tweak the unit tests.

#9 @wonderboymusic
11 years ago

  • Owner set to wonderboymusic
  • Resolution set to fixed
  • Status changed from new to closed

In 28764:

In wptexturize(), adjust for the treatment of abbreviated years at the end of quotations.
Silence some unit tests that have never passed and may no longer be applicable.

Props miqrogroove.
Fixes #18549.

#10 @miqrogroove
11 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Ticket still valid. Just didn't need the broken tests.

#11 @miqrogroove
10 years ago

#29287 was marked as a duplicate.

#12 @nacin
10 years ago

Were those tests actually broken? Because they seemed to be testing still-outstanding texturize issues.

#13 @miqrogroove
10 years ago

wonderboymusic wanted all the tests to pass, so we disabled the tests for this ticket and others that are still open.

#14 @nacin
10 years ago

Okay. In the future, failing tests should be broken out into individual tests and marked with @ticket for the open ticket.

@miqrogroove
10 years ago

Restore tests that fail.

#15 @miqrogroove
10 years ago

Added 18549.3.diff to update failing tests. The following tests are expected to fail:

test_texturize_around_html()
test_quotes_after_numbers()
test_entity_quote_cuddling()

#16 @miqrogroove
10 years ago

#30881 was marked as a duplicate.

#17 @miqrogroove
10 years ago

#31160 was marked as a duplicate.

#18 @miqrogroove
10 years ago

#31931 was marked as a duplicate.

#19 @miqrogroove
10 years ago

Algorithm brainstorm:

  1. Make a string that is stripped of all elements and shortcodes.
  2. Make an array of the elements and shortcodes, indexed by string position.
  3. Apply all formatting changes to the text-only string.
  4. Compare the formatted string to the unformatted string.
  5. Copy each formatting change to the original input OR insert the elements and shortcodes back into the formatted string.

This is based on the assumptions that we can't simply parse around elements, we can't use inline placeholders for the elements if we want to keep the existing regular expressions, and elements can be positioned anywhere in the text.

#20 @miqrogroove
10 years ago

Missing from previous comment:

  • It will be necessary to retain certain elements such as blocks which are effectively separate sentences even when they are not separated by whitespace.

#22 @paulschreiber
9 years ago

We just hit this twice today in 4.3. Example:
<strong>Read more: </strong>"<a href="http://blah.com/test">Something (else)</a>"</p>

Can someone try and fix this for 4.4?

@gitlost
9 years ago

An implementation of miqrogroove's algorithm.

@gitlost
9 years ago

Unittests.

#23 @gitlost
9 years ago

This patch is an implementation of miqrogroove's algorithm to remove inline HTML and shortcodes while processing quotes, and then to reinstate them, using routines which wrap str_replace/preg_replace to keep track of offset changes.

It seems to run around 30-100% slower than the current version, though it can actually run faster occasionally depending on the data.

One incompatibility is thrown up by the unittests, in the "crazy" shortcode test.

#24 @miqrogroove
9 years ago

That's a great start. Some of the new functions seem a bit long and that may be where we can do some optimizing. Also, if renaming the variables is really important, I think we should eventually break that out to a separate patch for clarity. I will try to help out before the betas. Going to be tied up with non-wp stuff for a few weeks.

#25 @gitlost
9 years ago

Thanks very much for responding. If by renaming the variables you mean the $regexs array then no, it isn't really important(!) and didn't make sense anyway (no reason to deref an array needlessly) - was just getting embarrassed by the number of static variables being declared - I'll upload a new diff. The patch is just a starting point as you say and I look forward to your help when you get the chance...

@gitlost
9 years ago

Individual statics instead of arrays.

#26 @miqrogroove
9 years ago

Next thought: Adding this much code to wptexturize doesn't make sense in the big picture. Let's think about how to make the algorithm reusable. There is a good chance of reusing this pattern when normalizing whitespace in wpautop and similar situations.

#27 @gitlost
9 years ago

Unfortunately I don't know if there is much chance of reusing this algorithm in other situations, as the much simpler and faster solution of using numbered placeholders (as for <pre> tags in wpautop) is usually good enough - indeed it would kinda more or less work for wptexturize() as well (eg substituting a placeholder plus a space for opening inline tags).

The amount of code is very off-putting I agree, but in its defence it's very mechanical, repetitive code, and unlikely to have unforeseen consequences.

Anyway I'll upload the latest version, which has some speed tweaks - one being to add back the standard replace stuff if there's no wptexturize_replace_init() match (more code bloat!), another is adding the study option to wptexturize_replace_init()'s regex, another is moving the "atomic" adjustments to wptexture_replace_final(). It also takes the liberty of rejigging _wptexturize_pushpop_element() which was a bit slow.

It seems now to run < 20% slower on most inputs (except very small ones with inline tags/shortcodes, where it can rise to 50%).

@gitlost
9 years ago

Some performance tweaks.

#28 @gitlost
9 years ago

  • Keywords has-patch added; needs-patch needs-unit-tests removed

#29 @johnbillion
9 years ago

#33446 was marked as a duplicate.

#30 @wonderboymusic
9 years ago

  • Owner changed from wonderboymusic to miqrogroove
  • Status changed from reopened to assigned

@gitlost
8 years ago

Refresh for 4.7. Includes unit tests.

#31 @gitlost
8 years ago

Refresh for WP 4.7. Includes the unit tests as well. The patch is also available as a plugin at pap-texturize, the download link is pap-texturize-1.0.0.zip. The plugin version also includes a fix for #29882 (quotes inside quotes curling incorrectly).

#32 @pento
6 years ago

#47724 was marked as a duplicate.

#33 @slamorte
5 years ago

For those that can't patch core or don't want to, I have a simple content filter that solves a common issue: an italicized name followed by a possessive, e.g. "As reported in CNN's latest article."

$content = preg_replace('#>&\#8216;s\b#', '>&#8217;s', $content);

This ticket was mentioned in Slack in #core-editor by lkraav. View the logs.


5 years ago

#35 @ellatrix
18 months ago

#48602 was marked as a duplicate.

#36 @ellatrix
18 months ago

#40148 was marked as a duplicate.

Note: See TracTickets for help on using tickets.