Ticket #9934 (closed defect (bug): fixed)

Opened 3 years ago

Last modified 3 years ago

Apostrophe in comment author causes comment to be spammed - esc_html

Reported by: tellyworth Owned by: markjaquith
Priority: high Milestone: 2.8
Component: Comments Version:
Severity: blocker Keywords:
Cc:

Description

Since [11380] - which added esc_html filtering to many items - comments containing an apostrophe (and possibly other characters) in the author name field are flagged as spam by Wordpress.

The root cause is that esc_html() uses decimal entity encoding, so O'Connor becomes O'Connor. But wp_blacklist_check() regards any comment containing a decimal entity as spam (and worse, does so silently and without any way for the blog administrator to stop it).

Possible solutions:

  1. esc_html() should use hex entity encoding, not decimal
  1. comment_author_name shouldn't use esc_html()
  1. wp_blacklist_check() shouldn't spam comments containing decimal entities

All three are trivial fixes so I haven't included a patch. I'd favour (1) if only because it eliminates the regression and reverts to the old behaviour.

Attachments

9934.diff Download (655 bytes) - added by ryan 3 years ago.

Change History

comment:1   ryan3 years ago

  • Owner set to markjaquith
  • Status changed from new to assigned

Actually there's a fourth option, and I think this ought to be the long-term fix:

Spam filtering really needs to happen on raw POST data, before plugins and sanitizers have the opportunity to screw with it. esc_html()'s behaviour would be fine if it occurred only at display time. But the data passed to spam filters (and, importantly, the data stored in the wp_comments table - which is subsequently used when reporting false positives and missed spam to Akismet and other spam filtering services) need to be as close as possible to the original.

comment:3   ryan3 years ago

  • Milestone changed from Unassigned to 2.8

comment:4 in reply to: ↑ description   filosofo3 years ago

Replying to tellyworth:

  1. wp_blacklist_check() shouldn't spam comments containing decimal entities

This needs to be fixed. Aside from being an un-filterable hack, it's probably needlessly blacklisting trackbacks with such entities.

I'd encourage Tellyworth's 4th option, storing escaped/converted data in the database is almost always a bad idea. It's causes problems in other areas where WordPress does that, and it's going to be a real pain to undo.

comment:7   ryan3 years ago

  • Severity changed from major to blocker

comment:8   ryan3 years ago

(In [11460]) Don't use esc_html() for DB bound data. see #9934

comment:9   ryan3 years ago

That good enough for now?

Let's leave the ticket open until we can do something about wp_blacklist_check()

Remove entity encoding check? Having it in a blacklist check is unexpected.

  • Status changed from assigned to closed
  • Resolution set to fixed

See #9965 for the blacklist issue. The regression should be fixed so I'll close this ticket.

  • Status changed from closed to reopened
  • Resolution fixed deleted

I just re-tested with r11480 and the problem is still present. Investigating.

Confirmed, the same problem is still present even after [11460].

wp_specialchars is used on comment_author prior to comment spam filtering. wp_specialchars() calls _wp_specialchars(), which encodes an apostrophe to its decimal numeric entity (formatting.php around line 273).

Removing the blacklist entity check as per #9965 will fix it but that's just covering up the symptom. The real issue is that WP is futzing with comment data before passing it to spam filters, which hampers their ability to produce accurate results.

ryan3 years ago

That moves wp_allow_comment() before wp_filter_comment(). I don't know if that will bust any plugins though.

wp_specialchars(), when passed only one argument, calls esc_html(). esc_html() defaults to ENT_QUOTES. wp_specialchars() used to default to ENT_NOQUOTES.

Do we need esc_html_db() for these instances. (Yes, I know we should escape as little as possible when sending to the db, but I'm going for the minimal fix for 2.8.)

esc_html_raw() (consistent with esc_url_raw()) might be a better name?

(In [11488]) Use _wp_specialchars to get NOQUOTES. see #9934

_wp_specialchars() defaults to ENT_NOQUOTES, which should restore previous behavior. The broader issue of entity encoding on the way to the DB can be dealt with in 2.9.

fixed, right?

  • Status changed from reopened to closed
  • Resolution set to fixed
Note: See TracTickets for help on using tickets.