WordPress.org

Make WordPress Core

Opened 4 years ago

Last modified 4 days ago

#11311 new defect (bug)

kses converts ampersands to & in post titles, post content, and more

Reported by: Viper007Bond Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 2.9
Component: Posts, Post Types Keywords: needs-patch
Focuses: administration Cc:

Description

Make a test user that has the "author" role (i.e. no unfiltered_html) and write a post with a title that has & in it. After saving, it will becomes & due to wp_filter_kses(). It gets saved in the database this way too.

It's confusing to the user.

Attachments (1)

11311.patch (806 bytes) - added by Viper007Bond 4 years ago.
Would something like this work?

Download all attachments as: .zip

Change History (19)

comment:1 scribu4 years ago

  • Milestone changed from Unassigned to Future Release

comment:2 Viper007Bond4 years ago

wp_kses_normalize_entities() (called by wp_kses() which is called by wp_filter_kses()) causes this.

Viper007Bond4 years ago

Would something like this work?

comment:3 Viper007Bond4 years ago

Nevermind, it fails this test case:

This is a test & hi » dfg  :  AT&T  &#XYZZY;

comment:4 westi4 years ago

  • Cc westi added

When implementing this is would be great to have the tests cases formalised in a set of tests for the Automated Test suite

comment:5 wojtek.szkutnik4 years ago

  • Keywords gsoc added

comment:6 Viper007Bond4 years ago

  • Summary changed from Low access users get their ampersands escaped in post titles to kses converts ampersands to & in post titles, post content, and more

This applies to more than just post titles.

Write a post as an author or some other low access user. Here's some sample content to use when doing so:

Here's a foo & apple test:

http://www.youtube.com/watch?v=nTDNLUzjkpg&hd=1

After saving the post, you'll end up with this:

Here's a foo & apple test:

http://www.youtube.com/watch?v=nTDNLUzjkpg&hd=1

Why don't we do this on display instead of save? It currently results in stuff like the &'ed URL being sent to oEmbed providers:

http://www.youtube.com/oembed?maxwidth=640&maxheight=600&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DnTDNLUzjkpg%26amp%3Bhd%3D1&format=json

Thankfully YouTube seems to be smart, but we shouldn't rely on that.

comment:7 nacin3 years ago

(In [16728]) Unescape ampersands before making an oEmbed request. props Viper007Bond, fixes #14514, see #11311.

comment:8 underground-stockholm3 years ago

"C.12. Using Ampersands in Attribute Values (and Elsewhere)

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user."

from http://www.w3.org/TR/xhtml1/#C_12

Looks like this isn't a real bug.

-- Pete | Rock Madrid | http://rock-madrid.com/

comment:9 Viper007Bond3 years ago

Thanks, but my point is that the ampersand escaping should be done on display like it is done for administrators instead of by kses on post sanitization and save.

When you're editing a post, you want to see "&" in your content and not "&". ;)

comment:10 johnbillion23 months ago

  • Keywords gsoc removed

There are actually quite a few places where WordPress is storing data with entities encoded. For example, a term named "This & That" will have entities in its name encoded, resulting in a term name of "This & That". Not good for when you're trying to do things with data and you don't want it encoded (for example, putting the term name in a 'title' attribute).

The following fields are stored with encoded entities regardless of your user role as they all go through wp_kses():

  • Term name and description
  • User first name, last name, display name, nickname and description
  • Comment author name
  • Link name, description, image, rel and notes

Link target, comment author email, comment author URL, user email, user URL and link URL are also stored with encoded entities, although these fields typically don't contain entities.

The best solution would be to switch to storing this data in unencoded form and run an upgrade routine to decode existing data when the change happens, but I realise that this is potentially an expensive upgrade. I'm not sure how to address that problem.

comment:11 mintindeed20 months ago

  • Cc gabriel.koen@… added

comment:12 batmoo13 months ago

  • Cc batmoo@… added

comment:13 johnbillion11 months ago

#8002 was marked as a duplicate.

comment:14 follow-up: boonebgorges8 months ago

  • Cc boonebgorges@… added

I just ran into this issue with tax terms. I had created a term 'Foo & Bar' using wp_insert_term(). I then later wanted to fetch it using get_term_by( 'name' ). But this failed, because the ampersand was not being translated in get_term_by(). I was able to work around the issue by doing (roughly)

$sanitized_name = sanitize_term_field( 'name', $value, 0, $tax, 'db );
$term = get_term_by( 'name', $sanitized_name );

I'm surprised this isn't coming up elsewhere in WP - I guess similar sanitization must already be in place - but it does seem as if it should be happening in get_term_by(), as the SQL query is built (assuming the underlying encoding issue in the OP can't be fixed for backward compatibility reasons).

comment:15 in reply to: ↑ 14 SergeyBiryukov8 months ago

Replying to boonebgorges:

I just ran into this issue with tax terms.

Related: #24354

comment:16 nacin3 months ago

  • Component changed from Administration to Posts, Post Types
  • Focuses administration added

comment:17 thiagof7 weeks ago

Seems like the encode-to-store and decode-to-display is the only solution. Maybe in the querys core..!

But how to overcome people hacks to the escapes?

//get_term_by - and others
if (!has_escape($term))
    $term = html_escape($term)
return query($term)
Last edited 7 weeks ago by thiagof (previous) (diff)

comment:18 anmari4 days ago

Just flagging that this is happens not just for posts and post types but for user meta data too.

A custom add or update_user_meta does not result in an encoded ampersand.

However due to default filters added (in default-filters.php line 16) , user data like last name DOES have any ampersands escaped. Admittedly not a common occurrence (an ampersand in last name I mean - except for a Ke$sha type person perhaps?). The behaviour was flushed out when doing some debug on generic user meta queries and searches, when I'd dumped some test data into last name.

Note: See TracTickets for help on using tickets.