WordPress.org

Make WordPress Core

Opened 5 years ago

Last modified 4 months ago

#29187 new defect (bug)

.notdef glyph (when copying text from a PDF in the excerpt) breaks the /feed

Reported by: softmodeling
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 1.0
Component: Feeds
Keywords: needs-patch
Focuses:
Cc:
PR Number:

Description

I created a post whose excerpt was copied and pasted from a PDF document.

When pasting the text, the "fi" glyph disappears (e.g. "specification" is copied over as "specication"). This is a common problem; see for instance: http://superuser.com/questions/375449/why-does-the-text-fi-get-cut-when-i-copy-from-a-pdf-or-print-a-document

To be more precise, the "fi" glyph is replaced with the .notdef glyph. The .notdef glyph is not visible in the Edit Post screen nor when viewing the post but it is stored in the database (rendered as a white square, the most common representation for this glyph).

The problem is that, while the glyph is properly filtered when viewing the post, it is not filtered when generating the RSS feed, which breaks the feed.

For instance, when trying to access the feed with Google Chrome, I get:

This page contains the following errors:

error on line 29 at column 25: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0C 0x66 0x69 0x63
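
For reference, the first reported byte, 0x0C, is the ASCII form feed control character. It is valid UTF-8, but it is not an allowed character in XML 1.0, which would explain why strict feed parsers reject the document. Below is a minimal Python sketch of what filtering such characters could look like. It is not WordPress code: the function name `strip_invalid_xml_chars` is hypothetical, and the pattern covers only a subset of the characters XML 1.0 disallows.

```python
import re

# Subset of characters outside the XML 1.0 "Char" production:
# C0 controls other than tab (0x09), LF (0x0A) and CR (0x0D),
# plus the noncharacters U+FFFE and U+FFFF.
_XML10_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that a strict XML 1.0 parser would reject (hypothetical helper)."""
    return _XML10_INVALID.sub("", text)


# The four bytes quoted in the Chrome error message.
reported = bytes([0x0C, 0x66, 0x69, 0x63])
decoded = reported.decode("utf-8")  # decodes without error: form feed is plain ASCII

print(repr(decoded))
print(repr(strip_invalid_xml_chars(decoded)))
```

Note that stripping only makes the feed well-formed; it does not restore the lost "fi" text.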

I've been able to reproduce the problem on several sites.

Attachments (2)

29187-notdef-glyph.pdf (23.9 KB) - added by tellyworth 5 years ago.
PDF to reproduce(?)
marsha_chechik.pdf (110.4 KB) - added by SergeyBiryukov 5 years ago.


Change History (8)

#1 @tellyworth
5 years ago

  • Keywords reporter-feedback added

I attached a PDF file that I think contains the fi glyph - can you confirm?

I wasn't able to reproduce this at all, but I think that's because OS X Preview is smart enough to convert the glyph back to regular text when copying to the clipboard.

#2 @SergeyBiryukov
5 years ago

I was able to copy the fi glyph from the attached PDF file using Foxit Reader on Windows (I made sure it's a single glyph and not two separate characters).

However, I could not reproduce the issues described here. It's displayed correctly for me in both the post content and the RSS feed, just like any other UTF-8 character.

#3 @softmodeling
5 years ago

Your attached PDF worked fine for me as well.

Can I ask you to try with this PDF (which is the one that caused the issue in my case): http://ecmfa2014.lcc.uma.es/resources/marsha_chechik.pdf

More specifically, try copying the third paragraph, which starts with "Our specification". If, when pasting the paragraph into your post, you get "specification" and not "specication", then yes, your OS/PDF reader combination is probably fixing the issue (I'm on Windows 7 with Adobe Reader).

If you do get "specication", then you should still be able to display the post but get an error when accessing yourblog.com/RSS

NOTES:

  • The PDF I linked was created from a LaTeX file (as are most of the PDFs I use). I wonder whether the problem only occurs with PDFs created from LaTeX files.
  • I tested with different browsers: both Chrome and IE report the error, but Firefox doesn't complain.
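
The reproduction check above can also be approximated without a browser. The sketch below runs Python's standard `xml.etree.ElementTree` parser on a hypothetical feed fragment (not the actual feed from the affected site), assuming the pasted character is the form feed (0x0C) shown in the error output:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical feed fragment: the "fi" in "specification" has been
# replaced by a form feed control character (0x0C).
broken = (
    "<rss><channel><item>"
    "<description>Our speci\x0ccation</description>"
    "</item></channel></rss>"
)
# Same fragment with the control character replaced by the intended text.
fixed = broken.replace("\x0c", "fi")

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

A strict parser rejects the first fragment and accepts the second, which matches the Chrome and IE behaviour described above; a more lenient feed reader may not report the problem.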

#4 @softmodeling
5 years ago

  • Keywords reporter-feedback removed

#5 @SergeyBiryukov
5 years ago

  • Version changed from trunk to 1.0

Reproduced the issue with the latest file (attached to the ticket for reference).

See also #19998.

#6 @chriscct7
4 years ago

  • Keywords needs-patch added