Opened 11 years ago
Last modified 6 years ago
#29187 new defect (bug)
.notdef glyph (when copying text from a PDF in the excerpt) breaks the /feed
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | | Priority: | normal |
| Severity: | normal | Version: | 1.0 |
| Component: | Feeds | Keywords: | needs-patch |
| Focuses: | | Cc: | |
Description
I created a post whose excerpt was copied and pasted from a PDF document.
When pasting the text, the "fi" glyph disappears (e.g. "specification" is pasted as "specication"). This is a common problem; see for instance: http://superuser.com/questions/375449/why-does-the-text-fi-get-cut-when-i-copy-from-a-pdf-or-print-a-document.
To be more precise, the "fi" glyph is replaced with the .notdef glyph. The .notdef glyph is visible neither in the Edit Post screen nor when viewing the post, but it is stored in the database, where it is rendered as a white square (the most common representation for this glyph).
The problem is that, while the glyph is properly filtered when viewing the post, it is not filtered when the RSS feed is generated, which breaks the feed.
For instance, when trying to access the feed with Google Chrome I get:
This page contains the following errors:
error on line 29 at column 25: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0C 0x66 0x69 0x63
I've been able to reproduce the problem on several sites.
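For illustration only: the first byte in the error above, 0x0C, is the ASCII form feed control character, and XML 1.0 does not allow it in a document, which would explain why a strict parser rejects the feed. Assuming that this is the character stored in the excerpt, a filter along the following lines would drop it before the feed is written. This is a minimal Python sketch of the idea, not WordPress code; the function name `strip_xml_invalid_chars` is hypothetical.

```python
import re

# Everything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_INVALID = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_invalid_chars(text):
    """Remove characters that may not appear in an XML 1.0 document.

    Hypothetical helper for illustration; it only deletes the offending
    characters and does not try to restore the lost "fi".
    """
    return _XML_INVALID.sub("", text)


# Example: an excerpt where "fi" was pasted as a form feed (U+000C).
excerpt = "Our speci\x0ccation"
print(repr(strip_xml_invalid_chars(excerpt)))
```

Note that stripping only makes the feed well-formed again; the word still reads "specication", just as it does when viewing the post.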
Attachments (2)
Change History (8)
#1
@
10 years ago
- Keywords reporter-feedback added
I attached a PDF file that I think contains the fi glyph - can you confirm?
I wasn't able to reproduce this at all, but I think that's because OS X Preview is smart enough to convert the glyph back to regular text when copying to the clipboard.
#2
@
10 years ago
I was able to copy the fi glyph from the attached PDF file using Foxit Reader on Windows (I made sure it's a single glyph and not two separate characters).
However, I could not reproduce the issues described here. It's displayed correctly for me in both the post content and the RSS feed, just like any other UTF-8 character.
#3
@
10 years ago
Your attached PDF worked fine for me as well.
Can I ask you to try with this PDF (which is the one that caused the issue in my case): http://ecmfa2014.lcc.uma.es/resources/marsha_chechik.pdf
More specifically, try copying the third paragraph, which starts with "Our specification". If, when pasting the paragraph into your post, you get "specification" and not "specication", then yes, your OS/PDF reader combination is probably fixing the issue (I'm on Windows 7 with Adobe Reader).
If you do get "specication", then you should still be able to display the post but get an error when accessing yourblog.com/RSS
NOTES:
- The PDF I linked was created from a LaTeX file (as are most of the PDFs I use). I wonder whether the problem only occurs with PDFs generated from LaTeX.
- I tested with different browsers: Chrome and IE both report the error, whereas Firefox doesn't complain.
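To tell the two cases in this thread apart (a real "fi" ligature that is valid UTF-8 versus a stray control character), it may help to dump the suspicious code points of the pasted excerpt. The following is a rough Python sketch; the helper name `describe_suspicious_chars` is made up for this example, and the sample strings are assumptions about what each PDF reader puts on the clipboard.

```python
import unicodedata


def describe_suspicious_chars(text):
    """List (index, code point, name) for control characters and Latin ligatures.

    Hypothetical diagnostic helper: tab, newline and carriage return are
    ignored because they are ordinary whitespace in a post.
    """
    findings = []
    for i, ch in enumerate(text):
        is_control = unicodedata.category(ch) == "Cc" and ch not in "\t\n\r"
        # U+FB00..U+FB06 are the Latin ligatures (ff, fi, fl, ffi, ffl, ...).
        is_ligature = "\ufb00" <= ch <= "\ufb06"
        if is_control or is_ligature:
            # Control characters have no Unicode name, so supply a default.
            name = unicodedata.name(ch, "<control character>")
            findings.append((i, f"U+{ord(ch):04X}", name))
    return findings


# Assumed clipboard contents for the two situations described above.
print(describe_suspicious_chars("speci\ufb01cation"))  # single ligature glyph
print(describe_suspicious_chars("speci\x0ccation"))    # stray control character
```

If the output shows U+FB01, the text is an ordinary UTF-8 character and should behave like any other; if it shows a control character such as U+000C, that would match the bytes in the browser error.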
PDF to reproduce(?)