Opened 11 years ago
Closed 10 years ago
#28506 closed defect (bug) (fixed)
Control (non-printing) characters not being stripped out.
Reported by: | mauteri | Owned by: | SergeyBiryukov |
---|---|---|---|
Milestone: | 4.0 | Priority: | normal |
Severity: | normal | Version: | 3.8 |
Component: | Formatting | Keywords: | needs-testing kses has-patch |
Focuses: | Cc: |
Description
I found non-printing characters were not being stripped out after copy/paste into the WordPress editor. I fixed this in our theme, and thought I'd try and make it my first contribution to the general WordPress project. See included diff as well as a sample file with bad character for testing. These characters break RSS feeds, which makes them so problematic.
Attachments (6)
Change History (23)
@
11 years ago
Example of a post with a problematic control character. Publish post and look at feed, you'll see the issue right away.
This ticket was mentioned in IRC in #wordpress-dev by johnbillion. View the logs.
11 years ago
#2
@
11 years ago
- Component changed from Posts, Post Types to Formatting
- Keywords has-patch added
- Milestone changed from Awaiting Review to 4.0
Thanks for the report - an azaozz or miqrogroove type person should be able to weigh in here
#3
@
11 years ago
- Keywords needs-testing needs-unit-tests added
- Version changed from trunk to 3.8
We should look at whether this is more of a kses issue, since we normally allow admins to break things in this way. If this hasn't been tested in a non-admin account, it might not even be a bug.
#4
@
11 years ago
Good call.
Just tested as an editor and broke the feed like it did with an admin account.
#5
@
11 years ago
Editors have unfiltered_html
capability too. Would be helpful to test with an author or contributor account.
#7
@
11 years ago
- Keywords kses added; has-patch removed
This probably affects other areas like comment feeds then too. I've brought up similar issues in the past and they were generally not treated as security bugs.
Let's make this a kses issue as it's closer to validation and filtering than the cosmetic features of texturize. Probably goes back to v1 which is why I removed the trunk designation.
#8
@
10 years ago
Thanks for the feedback.
I've attached a new patch. I felt wp_kses_no_null() was an appropriate function for this. Let me know if you agree or have a better place in mind. Thanks!
#9
@
10 years ago
In wp_kses_no_null(), I would put the new filter in place of the first one, because they will do the same thing.
Also, make sure tab characters (\t, \x09, or \11) are not affected.
@
10 years ago
Test file runs through all 32 ascii control characters. Everything marked keep does not break RSS feed, so it is not included in regex.
#10
@
10 years ago
Attached new patch (28506_1.diff) which is more exclusive than the previous diff I submitted for review. I also included my test file (control_characters.txt).
The test file includes all 32 ascii control characters that I tested one by one to see if it breaks the RSS feed. The ones that didn't break the feed were not included in my regex.
#11
@
10 years ago
Quick chart of first 32 ascii characters:
0 00/00 00 00 NUL (Ctrl-@) NULL 1 00/01 01 01 SOH (Ctrl-A) START OF HEADING 2 00/02 02 02 STX (Ctrl-B) START OF TEXT 3 00/03 03 03 ETX (Ctrl-C) END OF TEXT 4 00/04 04 04 EOT (Ctrl-D) END OF TRANSMISSION 5 00/05 05 05 ENQ (Ctrl-E) ENQUIRY 6 00/06 06 06 ACK (Ctrl-F) ACKNOWLEDGE 7 00/07 07 07 BEL (Ctrl-G) BELL (Beep) 8 00/08 10 08 BS (Ctrl-H) BACKSPACE 9 00/09 11 09 HT (Ctrl-I) HORIZONTAL TAB 10 00/10 12 0A LF (Ctrl-J) LINE FEED 11 00/11 13 0B VT (Ctrl-K) VERTICAL TAB 12 00/12 14 0C FF (Ctrl-L) FORM FEED 13 00/13 15 0D CR (Ctrl-M) CARRIAGE RETURN 14 00/14 16 0E SO (Ctrl-N) SHIFT OUT 15 00/15 17 0F SI (Ctrl-O) SHIFT IN 16 01/00 20 10 DLE (Ctrl-P) DATA LINK ESCAPE 17 01/01 21 11 DC1 (Ctrl-Q) DEVICE CONTROL 1 (XON) 18 01/02 22 12 DC2 (Ctrl-R) DEVICE CONTROL 2 19 01/03 23 13 DC3 (Ctrl-S) DEVICE CONTROL 3 (XOFF) 20 01/04 24 14 DC4 (Ctrl-T) DEVICE CONTROL 4 21 01/05 25 15 NAK (Ctrl-U) NEGATIVE ACKNOWLEDGE 22 01/06 26 16 SYN (Ctrl-V) SYNCHRONOUS IDLE 23 01/07 27 17 ETB (Ctrl-W) END OF TRANSMISSION BLOCK 24 01/08 30 18 CAN (Ctrl-X) CANCEL 25 01/09 31 19 EM (Ctrl-Y) END OF MEDIUM 26 01/10 32 1A SUB (Ctrl-Z) SUBSTITUTE 27 01/11 33 1B ESC (Ctrl-[) ESCAPE 28 01/12 34 1C FS (Ctrl-\) FILE SEPARATOR 29 01/13 35 1D GS (Ctrl-]) GROUP SEPARATOR 30 01/14 36 1E RS (Ctrl-^) RECORD SEPARATOR 31 01/15 37 1F US (Ctrl-_) UNIT SEPARATOR
http://ftp.vim.org/networking/kermit/public_html/ascii.html
This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.
10 years ago
#14
@
10 years ago
Also tested the regex with and without a quantifier after the character class. Adding a + allows consecutive chars to be replaced more quickly, but at the expense of more time spent searching in subject strings that don't have them. Since posts should not have these control chars at all, we do not need to optimize for replacement performance.
Diff