Make WordPress Core

Opened 11 years ago

Closed 10 years ago

#28506 closed defect (bug) (fixed)

Control (non-printing) characters not being stripped out.

Reported by: mauteri's profile mauteri Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: 4.0 Priority: normal
Severity: normal Version: 3.8
Component: Formatting Keywords: needs-testing kses has-patch
Focuses: Cc:

Description

I found non-printing characters were not being stripped out after copy/paste into the WordPress editor. I fixed this in our theme, and thought I'd try and make it my first contribution to the general WordPress project. See included diff as well as a sample file with bad character for testing. These characters break RSS feeds, which makes them so problematic.

Attachments (6)

control_char_filter.diff (603 bytes) - added by mauteri 11 years ago.
Diff
bad-post-1.txt (1.1 KB) - added by mauteri 11 years ago.
Example of a post with a problematic control character. Publish post and look at feed, you'll see the issue right away.
28506.diff (436 bytes) - added by mauteri 10 years ago.
New patch based on feedback. Moved to kses.
28506_1.diff (453 bytes) - added by mauteri 10 years ago.
New patch based on recent feedback
control-characters.txt (155 bytes) - added by mauteri 10 years ago.
Test file runs through all 32 ascii control characters. Everything marked keep does not break RSS feed, so it is not included in regex.
28506.2.diff (2.2 KB) - added by miqrogroove 10 years ago.
Add unit tests.

Download all attachments as: .zip

Change History (23)

@mauteri
11 years ago

Example of a post with a problematic control character. Publish post and look at feed, you'll see the issue right away.

This ticket was mentioned in IRC in #wordpress-dev by johnbillion. View the logs.


11 years ago

#2 @wonderboymusic
11 years ago

  • Component changed from Posts, Post Types to Formatting
  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 4.0

Thanks for the report - an azaozz or miqrogroove type person should be able to weigh in here

#3 @miqrogroove
11 years ago

  • Keywords needs-testing needs-unit-tests added
  • Version changed from trunk to 3.8

We should look at whether this is more of a kses issue, since we normally allow admins to break things in this way. If this hasn't been tested in a non-admin account, it might not even be a bug.

#4 @mauteri
11 years ago

Good call.

Just tested as an editor and broke the feed like it did with an admin account.

#5 @SergeyBiryukov
11 years ago

Editors have unfiltered_html capability too. Would be helpful to test with an author or contributor account.

#6 @mauteri
11 years ago

Same result as an author role, broken RSS feed.

#7 @miqrogroove
11 years ago

  • Keywords kses added; has-patch removed

This probably affects other areas like comment feeds then too. I've brought up similar issues in the past and they were generally not treated as security bugs.

Let's make this a kses issue as it's closer to validation and filtering than the cosmetic features of texturize. Probably goes back to v1 which is why I removed the trunk designation.

@mauteri
10 years ago

New patch based on feedback. Moved to kses.

#8 @mauteri
10 years ago

Thanks for the feedback.

I've attached a new patch. I felt wp_kses_no_null() was an appropriate function for this. Let me know if you agree or have a better place in mind. Thanks!

#9 @miqrogroove
10 years ago

In wp_kses_no_null(), I would put the new filter in place of the first one, because they will do the same thing.

Also, make sure tab characters (\t, \x09, or \11) are not affected.

@mauteri
10 years ago

New patch based on recent feedback

@mauteri
10 years ago

Test file runs through all 32 ascii control characters. Everything marked keep does not break RSS feed, so it is not included in regex.

#10 @mauteri
10 years ago

Attached new patch (28506_1.diff) which is more exclusive than the previous diff I submitted for review. I also included my test file (control_characters.txt).

The test file includes all 32 ascii control characters that I tested one by one to see if it breaks the RSS feed. The ones that didn't break the feed were not included in my regex.

#11 @mauteri
10 years ago

Quick chart of first 32 ascii characters:

       0  00/00   00  00  NUL  (Ctrl-@)  NULL
       1  00/01   01  01  SOH  (Ctrl-A)  START OF HEADING
       2  00/02   02  02  STX  (Ctrl-B)  START OF TEXT
       3  00/03   03  03  ETX  (Ctrl-C)  END OF TEXT
       4  00/04   04  04  EOT  (Ctrl-D)  END OF TRANSMISSION
       5  00/05   05  05  ENQ  (Ctrl-E)  ENQUIRY
       6  00/06   06  06  ACK  (Ctrl-F)  ACKNOWLEDGE
       7  00/07   07  07  BEL  (Ctrl-G)  BELL (Beep)
       8  00/08   10  08  BS   (Ctrl-H)  BACKSPACE
       9  00/09   11  09  HT   (Ctrl-I)  HORIZONTAL TAB
      10  00/10   12  0A  LF   (Ctrl-J)  LINE FEED
      11  00/11   13  0B  VT   (Ctrl-K)  VERTICAL TAB
      12  00/12   14  0C  FF   (Ctrl-L)  FORM FEED
      13  00/13   15  0D  CR   (Ctrl-M)  CARRIAGE RETURN
      14  00/14   16  0E  SO   (Ctrl-N)  SHIFT OUT
      15  00/15   17  0F  SI   (Ctrl-O)  SHIFT IN
      16  01/00   20  10  DLE  (Ctrl-P)  DATA LINK ESCAPE
      17  01/01   21  11  DC1  (Ctrl-Q)  DEVICE CONTROL 1 (XON)
      18  01/02   22  12  DC2  (Ctrl-R)  DEVICE CONTROL 2
      19  01/03   23  13  DC3  (Ctrl-S)  DEVICE CONTROL 3 (XOFF)
      20  01/04   24  14  DC4  (Ctrl-T)  DEVICE CONTROL 4
      21  01/05   25  15  NAK  (Ctrl-U)  NEGATIVE ACKNOWLEDGE
      22  01/06   26  16  SYN  (Ctrl-V)  SYNCHRONOUS IDLE
      23  01/07   27  17  ETB  (Ctrl-W)  END OF TRANSMISSION BLOCK
      24  01/08   30  18  CAN  (Ctrl-X)  CANCEL
      25  01/09   31  19  EM   (Ctrl-Y)  END OF MEDIUM
      26  01/10   32  1A  SUB  (Ctrl-Z)  SUBSTITUTE
      27  01/11   33  1B  ESC  (Ctrl-[)  ESCAPE
      28  01/12   34  1C  FS   (Ctrl-\)  FILE SEPARATOR
      29  01/13   35  1D  GS   (Ctrl-])  GROUP SEPARATOR
      30  01/14   36  1E  RS   (Ctrl-^)  RECORD SEPARATOR
      31  01/15   37  1F  US   (Ctrl-_)  UNIT SEPARATOR

http://ftp.vim.org/networking/kermit/public_html/ascii.html

Last edited 10 years ago by SergeyBiryukov (previous) (diff)

@miqrogroove
10 years ago

Add unit tests.

#12 @miqrogroove
10 years ago

  • Keywords has-patch added; needs-unit-tests removed

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#14 @miqrogroove
10 years ago

Also tested the regex with and without a quantifier after the character class. Adding a + allows consecutive chars to be replaced more quickly, but at the expense of more time spent searching in subject strings that don't have them. Since posts should not have these control chars at all, we do not need to optimize for replacement performance.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#16 @SergeyBiryukov
10 years ago

Also removes any instance of the '\0' string.

Related: #28699

#17 @SergeyBiryukov
10 years ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 28942:

Make wp_kses_no_null() remove any invalid control characters in a string.

props mauteri, miqrogroove.
fixes #28506.

Note: See TracTickets for help on using tickets.