Make WordPress Core

Opened 15 years ago

Last modified 2 years ago

#11311 new defect (bug)

kses converts ampersands to & in post titles, post content, and more

Reported by: viper007bond's profile Viper007Bond Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 2.9
Component: Posts, Post Types Keywords: needs-patch
Focuses: administration Cc:

Description

Make a test user that has the "author" role (i.e. no unfiltered_html) and write a post with a title that has & in it. After saving, it will becomes & due to wp_filter_kses(). It gets saved in the database this way too.

It's confusing to the user.

Attachments (2)

11311.patch (806 bytes) - added by Viper007Bond 15 years ago.
Would something like this work?
11311-2.patch (809 bytes) - added by jacklenox 8 years ago.
I feel a bit hesitant sharing this, but here's a slight twist on Alex's original patch that seems to work for every case I can think of. It allows through ampersands that are surrounded by spaces. This seems to fit most use cases when people would want to use an ampersand in a title or body content. Any thoughts?

Download all attachments as: .zip

Change History (28)

#1 @scribu
15 years ago

  • Milestone changed from Unassigned to Future Release

#2 @Viper007Bond
15 years ago

wp_kses_normalize_entities() (called by wp_kses() which is called by wp_filter_kses()) causes this.

@Viper007Bond
15 years ago

Would something like this work?

#3 @Viper007Bond
15 years ago

Nevermind, it fails this test case:

This is a test & hi » dfg  :  AT&T  &#XYZZY;

#4 @westi
14 years ago

  • Cc westi added

When implementing this is would be great to have the tests cases formalised in a set of tests for the Automated Test suite

#5 @wojtek.szkutnik
14 years ago

  • Keywords gsoc added

#6 @Viper007Bond
14 years ago

  • Summary changed from Low access users get their ampersands escaped in post titles to kses converts ampersands to & in post titles, post content, and more

This applies to more than just post titles.

Write a post as an author or some other low access user. Here's some sample content to use when doing so:

Here's a foo & apple test:

http://www.youtube.com/watch?v=nTDNLUzjkpg&hd=1

After saving the post, you'll end up with this:

Here's a foo & apple test:

http://www.youtube.com/watch?v=nTDNLUzjkpg&hd=1

Why don't we do this on display instead of save? It currently results in stuff like the &'ed URL being sent to oEmbed providers:

http://www.youtube.com/oembed?maxwidth=640&maxheight=600&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DnTDNLUzjkpg%26amp%3Bhd%3D1&format=json

Thankfully YouTube seems to be smart, but we shouldn't rely on that.

#7 @nacin
14 years ago

(In [16728]) Unescape ampersands before making an oEmbed request. props Viper007Bond, fixes #14514, see #11311.

#8 @underground-stockholm
14 years ago

"C.12. Using Ampersands in Attribute Values (and Elsewhere)

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user."

from http://www.w3.org/TR/xhtml1/#C_12

Looks like this isn't a real bug.

-- Pete | Rock Madrid | http://rock-madrid.com/

#9 @Viper007Bond
14 years ago

Thanks, but my point is that the ampersand escaping should be done on display like it is done for administrators instead of by kses on post sanitization and save.

When you're editing a post, you want to see "&" in your content and not "&". ;)

#10 @johnbillion
12 years ago

  • Keywords gsoc removed

There are actually quite a few places where WordPress is storing data with entities encoded. For example, a term named "This & That" will have entities in its name encoded, resulting in a term name of "This & That". Not good for when you're trying to do things with data and you don't want it encoded (for example, putting the term name in a 'title' attribute).

The following fields are stored with encoded entities regardless of your user role as they all go through wp_kses():

  • Term name and description
  • User first name, last name, display name, nickname and description
  • Comment author name
  • Link name, description, image, rel and notes

Link target, comment author email, comment author URL, user email, user URL and link URL are also stored with encoded entities, although these fields typically don't contain entities.

The best solution would be to switch to storing this data in unencoded form and run an upgrade routine to decode existing data when the change happens, but I realise that this is potentially an expensive upgrade. I'm not sure how to address that problem.

#11 @mintindeed
12 years ago

  • Cc gabriel.koen@… added

#12 @batmoo
12 years ago

  • Cc batmoo@… added

#13 @johnbillion
11 years ago

#8002 was marked as a duplicate.

#14 follow-up: @boonebgorges
11 years ago

  • Cc boonebgorges@… added

I just ran into this issue with tax terms. I had created a term 'Foo & Bar' using wp_insert_term(). I then later wanted to fetch it using get_term_by( 'name' ). But this failed, because the ampersand was not being translated in get_term_by(). I was able to work around the issue by doing (roughly)

$sanitized_name = sanitize_term_field( 'name', $value, 0, $tax, 'db );
$term = get_term_by( 'name', $sanitized_name );

I'm surprised this isn't coming up elsewhere in WP - I guess similar sanitization must already be in place - but it does seem as if it should be happening in get_term_by(), as the SQL query is built (assuming the underlying encoding issue in the OP can't be fixed for backward compatibility reasons).

#15 in reply to: ↑ 14 @SergeyBiryukov
11 years ago

Replying to boonebgorges:

I just ran into this issue with tax terms.

Related: #24354

#16 @nacin
11 years ago

  • Component changed from Administration to Posts, Post Types
  • Focuses administration added

#17 @thiagof
11 years ago

Wouldn't be a lot easier to just encode the query too? Now with all the hacks people done to overcome the issue, may be a bit more complex situation..

Version 0, edited 11 years ago by thiagof (next)

#18 @anmari
10 years ago

Just flagging that this is happens not just for posts and post types but for user meta data too.

A custom add or update_user_meta does not result in an encoded ampersand.

However due to default filters added (in default-filters.php line 16) , user data like last name DOES have any ampersands escaped. Admittedly not a common occurrence (an ampersand in last name I mean - except for a Ke$sha type person perhaps?). The behaviour was flushed out when doing some debug on generic user meta queries and searches, when I'd dumped some test data into last name.

#19 @boonebgorges
9 years ago

The best solution would be to switch to storing this data in unencoded form and run an upgrade routine to decode existing data when the change happens, but I realise that this is potentially an expensive upgrade. I'm not sure how to address that problem.

I agree that the best system would be to store the data as unencoded, and then encode as necessary for display. However:

  1. The upgrade routine is going to be expensive (as you note)
  2. The upgrade routine is going to be plagued with false positives
  3. It's going to break queries for anyone who is currently working around this bug by encoding their query terms
  4. In what contexts would we need to encode for display? Currently, all content is being encoded on the way out (by virtue of its being encoded in the database). Presumably, you'd want to return unencoded content from low-level functions like get_post_meta() and get_term_by(). But this is going to cause compatibility problems with people who are expecting encoded data. And maintaining a whitelist of places where output should be encoded is going to be most unfun.

I'm all for fixing this stuff over the long run, but in the short to medium term, we should face facts: we store encoded data, so our query functions ought to match encoded data too. In the appropriate low-level functions (get_metadata(), get_term_by(), and so forth), we should encode using the same filters used to encode on the way in. For example, for terms, we'd use sanitize_term_field() (see #24354). I don't think that this will cause backward compatibility problems: for all the affected areas, there's no way to get an unencoded & into the database, so existing queries for Foo & Bar are failing anyway.

What do johnbillion and others think?

This ticket was mentioned in Slack in #core by boone. View the logs.


9 years ago

#21 @boonebgorges
9 years ago

#36387 was marked as a duplicate.

#22 @nicoblog
8 years ago

Still happening.

@jacklenox
8 years ago

I feel a bit hesitant sharing this, but here's a slight twist on Alex's original patch that seems to work for every case I can think of. It allows through ampersands that are surrounded by spaces. This seems to fit most use cases when people would want to use an ampersand in a title or body content. Any thoughts?

#23 @jacklenox
8 years ago

Hmm, on further inspection it looks as though this ticket has been fixed anyway, though I'm not sure when!

Update
Nope, still broken. Ignore me.

Last edited 8 years ago by jacklenox (previous) (diff)

This ticket was mentioned in Slack in #cli by johnbillion. View the logs.


7 years ago

#25 @tillkruess
5 years ago

This is still a thing.

#26 @zzxiang
5 years ago

Ten years later, it is still a problem. Ampersands are converted also in post GUIDs, causing my plugin (https://wordpress.org/plugins/external-media-without-import/) failing to deal with URLs containing &.

Note: See TracTickets for help on using tickets.