Opened 15 years ago
Last modified 2 years ago
#11311 new defect (bug)
kses converts ampersands to & in post titles, post content, and more
Reported by: | Viper007Bond | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | 2.9 |
Component: | Posts, Post Types | Keywords: | needs-patch |
Focuses: | administration | Cc: |
Description
Make a test user that has the "author" role (i.e. no unfiltered_html
) and write a post with a title that has &
in it. After saving, it will becomes &
due to wp_filter_kses()
. It gets saved in the database this way too.
It's confusing to the user.
Attachments (2)
Change History (28)
#3
@
15 years ago
Nevermind, it fails this test case:
This is a test & hi » dfg : AT&T &#XYZZY;
#4
@
14 years ago
- Cc westi added
When implementing this is would be great to have the tests cases formalised in a set of tests for the Automated Test suite
#6
@
14 years ago
- Summary changed from Low access users get their ampersands escaped in post titles to kses converts ampersands to & in post titles, post content, and more
This applies to more than just post titles.
Write a post as an author or some other low access user. Here's some sample content to use when doing so:
Here's a foo & apple test: http://www.youtube.com/watch?v=nTDNLUzjkpg&hd=1
After saving the post, you'll end up with this:
Here's a foo & apple test: http://www.youtube.com/watch?v=nTDNLUzjkpg&hd=1
Why don't we do this on display instead of save? It currently results in stuff like the &
'ed URL being sent to oEmbed providers:
http://www.youtube.com/oembed?maxwidth=640&maxheight=600&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DnTDNLUzjkpg%26amp%3Bhd%3D1&format=json
Thankfully YouTube seems to be smart, but we shouldn't rely on that.
#8
@
14 years ago
"C.12. Using Ampersands in Attribute Values (and Elsewhere)
In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user."
from http://www.w3.org/TR/xhtml1/#C_12
Looks like this isn't a real bug.
-- Pete | Rock Madrid | http://rock-madrid.com/
#9
@
14 years ago
Thanks, but my point is that the ampersand escaping should be done on display like it is done for administrators instead of by kses
on post sanitization and save.
When you're editing a post, you want to see "&
" in your content and not "&
". ;)
#10
@
12 years ago
- Keywords gsoc removed
There are actually quite a few places where WordPress is storing data with entities encoded. For example, a term named "This & That" will have entities in its name encoded, resulting in a term name of "This & That". Not good for when you're trying to do things with data and you don't want it encoded (for example, putting the term name in a 'title' attribute).
The following fields are stored with encoded entities regardless of your user role as they all go through wp_kses()
:
- Term name and description
- User first name, last name, display name, nickname and description
- Comment author name
- Link name, description, image, rel and notes
Link target, comment author email, comment author URL, user email, user URL and link URL are also stored with encoded entities, although these fields typically don't contain entities.
The best solution would be to switch to storing this data in unencoded form and run an upgrade routine to decode existing data when the change happens, but I realise that this is potentially an expensive upgrade. I'm not sure how to address that problem.
#14
follow-up:
↓ 15
@
11 years ago
- Cc boonebgorges@… added
I just ran into this issue with tax terms. I had created a term 'Foo & Bar' using wp_insert_term()
. I then later wanted to fetch it using get_term_by( 'name' )
. But this failed, because the ampersand was not being translated in get_term_by()
. I was able to work around the issue by doing (roughly)
$sanitized_name = sanitize_term_field( 'name', $value, 0, $tax, 'db ); $term = get_term_by( 'name', $sanitized_name );
I'm surprised this isn't coming up elsewhere in WP - I guess similar sanitization must already be in place - but it does seem as if it should be happening in get_term_by()
, as the SQL query is built (assuming the underlying encoding issue in the OP can't be fixed for backward compatibility reasons).
#16
@
11 years ago
- Component changed from Administration to Posts, Post Types
- Focuses administration added
#17
@
11 years ago
Seems like the encode-to-store and decode-to-display is the only solution. Maybe in the querys core..!
But how to overcome people hacks to the escapes?
//get_term_by - and others if (!has_escape($term)) $term = html_escape($term) return query($term)
#18
@
10 years ago
Just flagging that this is happens not just for posts and post types but for user meta data too.
A custom add or update_user_meta does not result in an encoded ampersand.
However due to default filters added (in default-filters.php line 16) , user data like last name DOES have any ampersands escaped. Admittedly not a common occurrence (an ampersand in last name I mean - except for a Ke$sha type person perhaps?). The behaviour was flushed out when doing some debug on generic user meta queries and searches, when I'd dumped some test data into last name.
#19
@
9 years ago
The best solution would be to switch to storing this data in unencoded form and run an upgrade routine to decode existing data when the change happens, but I realise that this is potentially an expensive upgrade. I'm not sure how to address that problem.
I agree that the best system would be to store the data as unencoded, and then encode as necessary for display. However:
- The upgrade routine is going to be expensive (as you note)
- The upgrade routine is going to be plagued with false positives
- It's going to break queries for anyone who is currently working around this bug by encoding their query terms
- In what contexts would we need to encode for display? Currently, all content is being encoded on the way out (by virtue of its being encoded in the database). Presumably, you'd want to return unencoded content from low-level functions like
get_post_meta()
andget_term_by()
. But this is going to cause compatibility problems with people who are expecting encoded data. And maintaining a whitelist of places where output should be encoded is going to be most unfun.
I'm all for fixing this stuff over the long run, but in the short to medium term, we should face facts: we store encoded data, so our query functions ought to match encoded data too. In the appropriate low-level functions (get_metadata()
, get_term_by()
, and so forth), we should encode using the same filters used to encode on the way in. For example, for terms, we'd use sanitize_term_field()
(see #24354). I don't think that this will cause backward compatibility problems: for all the affected areas, there's no way to get an unencoded &
into the database, so existing queries for Foo & Bar
are failing anyway.
What do johnbillion and others think?
This ticket was mentioned in Slack in #core by boone. View the logs.
9 years ago
@
8 years ago
I feel a bit hesitant sharing this, but here's a slight twist on Alex's original patch that seems to work for every case I can think of. It allows through ampersands that are surrounded by spaces. This seems to fit most use cases when people would want to use an ampersand in a title or body content. Any thoughts?
#23
@
8 years ago
Hmm, on further inspection it looks as though this ticket has been fixed anyway, though I'm not sure when!
Update
Nope, still broken. Ignore me.
This ticket was mentioned in Slack in #cli by johnbillion. View the logs.
6 years ago
#26
@
5 years ago
Ten years later, it is still a problem. Ampersands are converted also in post GUIDs, causing my plugin (https://wordpress.org/plugins/external-media-without-import/) failing to deal with URLs containing &
.
wp_kses_normalize_entities()
(called bywp_kses()
which is called bywp_filter_kses()
) causes this.