WordPress.org

Make WordPress Core

Opened 5 years ago

Closed 5 years ago

#28506 closed defect (bug) (fixed)

Control (non-printing) characters not being stripped out.

Reported by: mauteri Owned by: SergeyBiryukov
Milestone: 4.0 Priority: normal
Severity: normal Version: 3.8
Component: Formatting Keywords: needs-testing kses has-patch
Focuses: Cc:
PR Number:

Description

I found non-printing characters were not being stripped out after copy/paste into the WordPress editor. I fixed this in our theme, and thought I'd try and make it my first contribution to the general WordPress project. See included diff as well as a sample file with bad character for testing. These characters break RSS feeds, which makes them so problematic.

Attachments (6)

control_char_filter.diff (603 bytes) - added by mauteri 5 years ago.
Diff
bad-post-1.txt (1.1 KB) - added by mauteri 5 years ago.
Example of a post with a problematic control character. Publish post and look at feed, you'll see the issue right away.
28506.diff (436 bytes) - added by mauteri 5 years ago.
New patch based on feedback. Moved to kses.
28506_1.diff (453 bytes) - added by mauteri 5 years ago.
New patch based on recent feedback
control-characters.txt (155 bytes) - added by mauteri 5 years ago.
Test file runs through all 32 ascii control characters. Everything marked keep does not break RSS feed, so it is not included in regex.
28506.2.diff (2.2 KB) - added by miqrogroove 5 years ago.
Add unit tests.

Download all attachments as: .zip

Change History (23)

@mauteri
5 years ago

Diff

@mauteri
5 years ago

Example of a post with a problematic control character. Publish post and look at feed, you'll see the issue right away.

This ticket was mentioned in IRC in #wordpress-dev by johnbillion. View the logs.


5 years ago

#2 @wonderboymusic
5 years ago

  • Component changed from Posts, Post Types to Formatting
  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 4.0

Thanks for the report - an azaozz or miqrogroove type person should be able to weigh in here

#3 @miqrogroove
5 years ago

  • Keywords needs-testing needs-unit-tests added
  • Version changed from trunk to 3.8

We should look at whether this is more of a kses issue, since we normally allow admins to break things in this way. If this hasn't been tested in a non-admin account, it might not even be a bug.

#4 @mauteri
5 years ago

Good call.

Just tested as an editor and broke the feed like it did with an admin account.

#5 @SergeyBiryukov
5 years ago

Editors have unfiltered_html capability too. Would be helpful to test with an author or contributor account.

#6 @mauteri
5 years ago

Same result as an author role, broken RSS feed.

#7 @miqrogroove
5 years ago

  • Keywords kses added; has-patch removed

This probably affects other areas like comment feeds then too. I've brought up similar issues in the past and they were generally not treated as security bugs.

Let's make this a kses issue as it's closer to validation and filtering than the cosmetic features of texturize. Probably goes back to v1 which is why I removed the trunk designation.

@mauteri
5 years ago

New patch based on feedback. Moved to kses.

#8 @mauteri
5 years ago

Thanks for the feedback.

I've attached a new patch. I felt wp_kses_no_null() was an appropriate function for this. Let me know if you agree or have a better place in mind. Thanks!

#9 @miqrogroove
5 years ago

In wp_kses_no_null(), I would put the new filter in place of the first one, because they will do the same thing.

Also, make sure tab characters (\t, \x09, or \11) are not affected.

@mauteri
5 years ago

New patch based on recent feedback

@mauteri
5 years ago

Test file runs through all 32 ascii control characters. Everything marked keep does not break RSS feed, so it is not included in regex.

#10 @mauteri
5 years ago

Attached new patch (28506_1.diff) which is more exclusive than the previous diff I submitted for review. I also included my test file (control_characters.txt).

The test file includes all 32 ascii control characters that I tested one by one to see if it breaks the RSS feed. The ones that didn't break the feed were not included in my regex.

#11 @mauteri
5 years ago

Quick chart of first 32 ascii characters:

0 00/00 00 00 NUL (Ctrl-@) NULL
1 00/01 01 01 SOH (Ctrl-A) START OF HEADING
2 00/02 02 02 STX (Ctrl-B) START OF TEXT
3 00/03 03 03 ETX (Ctrl-C) END OF TEXT
4 00/04 04 04 EOT (Ctrl-D) END OF TRANSMISSION
5 00/05 05 05 ENQ (Ctrl-E) ENQUIRY
6 00/06 06 06 ACK (Ctrl-F) ACKNOWLEDGE
7 00/07 07 07 BEL (Ctrl-G) BELL (Beep)
8 00/08 10 08 BS (Ctrl-H) BACKSPACE
9 00/09 11 09 HT (Ctrl-I) HORIZONTAL TAB

10 00/10 12 0A LF (Ctrl-J) LINE FEED
11 00/11 13 0B VT (Ctrl-K) VERTICAL TAB
12 00/12 14 0C FF (Ctrl-L) FORM FEED
13 00/13 15 0D CR (Ctrl-M) CARRIAGE RETURN
14 00/14 16 0E SO (Ctrl-N) SHIFT OUT
15 00/15 17 0F SI (Ctrl-O) SHIFT IN
16 01/00 20 10 DLE (Ctrl-P) DATA LINK ESCAPE
17 01/01 21 11 DC1 (Ctrl-Q) DEVICE CONTROL 1 (XON)
18 01/02 22 12 DC2 (Ctrl-R) DEVICE CONTROL 2
19 01/03 23 13 DC3 (Ctrl-S) DEVICE CONTROL 3 (XOFF)
20 01/04 24 14 DC4 (Ctrl-T) DEVICE CONTROL 4
21 01/05 25 15 NAK (Ctrl-U) NEGATIVE ACKNOWLEDGE
22 01/06 26 16 SYN (Ctrl-V) SYNCHRONOUS IDLE
23 01/07 27 17 ETB (Ctrl-W) END OF TRANSMISSION BLOCK
24 01/08 30 18 CAN (Ctrl-X) CANCEL
25 01/09 31 19 EM (Ctrl-Y) END OF MEDIUM
26 01/10 32 1A SUB (Ctrl-Z) SUBSTITUTE
27 01/11 33 1B ESC (Ctrl-[) ESCAPE
28 01/12 34 1C FS (Ctrl-\) FILE SEPARATOR
29 01/13 35 1D GS (Ctrl-]) GROUP SEPARATOR
30 01/14 36 1E RS (Ctrl-) RECORD SEPARATOR
31 01/15 37 1F US (Ctrl-_) UNIT SEPARATOR

http://ftp.vim.org/networking/kermit/public_html/ascii.html

Version 1, edited 5 years ago by mauteri (previous) (next) (diff)

@miqrogroove
5 years ago

Add unit tests.

#12 @miqrogroove
5 years ago

  • Keywords has-patch added; needs-unit-tests removed

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


5 years ago

#14 @miqrogroove
5 years ago

Also tested the regex with and without a quantifier after the character class. Adding a + allows consecutive chars to be replaced more quickly, but at the expense of more time spent searching in subject strings that don't have them. Since posts should not have these control chars at all, we do not need to optimize for replacement performance.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


5 years ago

#16 @SergeyBiryukov
5 years ago

Also removes any instance of the '\0' string.

Related: #28699

#17 @SergeyBiryukov
5 years ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 28942:

Make wp_kses_no_null() remove any invalid control characters in a string.

props mauteri, miqrogroove.
fixes #28506.

Note: See TracTickets for help on using tickets.