WordPress.org

Make WordPress Core

Opened 8 months ago

Closed 7 months ago

Last modified 7 months ago

#38522 closed defect (bug) (fixed)

HTTP Errors on Upload with Certain PDFs

Reported by: mikeschroder Owned by: mikeschroder
Milestone: 4.7 Priority: normal
Severity: normal Version: 4.7
Component: Media Keywords: needs-testing
Focuses: Cc:

Description

As of #31050 and [38949], there is support in WordPress for generating JPEG renders of the first page of a PDF, and using that within the admin.

However, this means that time is spent creating this render, which increases the chance that a user will see an HTTP Error. WordPress automatically falls back to old behavior (the PDF icon) when a PDF thumbnail can't be created, which means that the HTTP error for those users means "Your upload happened successfully, but the PDF thumbnails didn't get created."

Let's use this ticket to discuss/add workarounds for this issue, and better represent to users what happens in this scenario.

Please attach any problem PDFs you encounter.

Other information that is useful includes:

  • Hosting you were using when you encountered the issue
  • Specific HTTP Error code and output from debug tools, if available
  • Details from @joemcgill's debug media plugin.

Attachments (2)

38522.diff (1.1 KB) - added by mikeschroder 8 months ago.
Only read first page.
PDF Uploads no duplicates.png (91.3 KB) - added by lukecavanagh 8 months ago.
PDF uploads in media library no duplicates created

Download all attachments as: .zip

Change History (38)

#1 @mikeschroder
8 months ago

  • Owner set to mikeschroder
  • Status changed from new to assigned

This ticket was mentioned in Slack in #core-media by mike. View the logs.


8 months ago

#3 @lukecavanagh
8 months ago

@mikeschroder

Tested on a site on Bluehost Cloud sites running the current nightly version of WP 4.7.

Uploaded a PDF into the media library and it created the JPG thumbnail without any issues.

Also to test renamed a PHP and a JS both as PDF files and tried to upload them into the media library, both returned the An error occurred in the upload. Please try again later, error message.

#4 follow-up: @dglingren
8 months ago

I have done some preliminary testing with the 4.7 Beta 1 release. PDF thumbnail generation generally works great (congratulations!). However, with very large PDF documents I found two issues:

  1. WP 4.7 thumbnail generation for a 32,953KB document takes 4 minutes 37 seconds (4:37) on my system. With WordPress 4.6.1, uploading takes 6.35 seconds, and using the thumbnail generator in Media Library Assistant (MLA) takes just 5.37 seconds (0:06) . I suspect the difference is because the generation technique in WP 4.7 processes the entire PDF document while MLA uses an explicit GhostScript invocation to process just the first page (I haven't looked at the code yet).
  1. I used the drag-and-drop uploader (in the Media/Add New screen) to process four large PDF documents (about 23MB each) with similar names. The WP 4.7 process sometimes failed with an HTTP Error, but it also generated multiple copies of some files and thumbnail images for each copy.

Let me know what else I can do to help investigate and resolve these issues. Thanks for all your hard work on this new feature!

#5 in reply to: ↑ 4 ; follow-up: @mikeschroder
8 months ago

Replying to dglingren:

I have done some preliminary testing with the 4.7 Beta 1 release. PDF thumbnail generation generally works great (congratulations!).

Thanks for testing!

However, with very large PDF documents I found two issues:

  1. WP 4.7 thumbnail generation for a 32,953KB document takes 4 minutes 37 seconds (4:37) on my system. With WordPress 4.6.1, uploading takes 6.35 seconds, and using the thumbnail generator in Media Library Assistant (MLA) takes just 5.37 seconds (0:06) . I suspect the difference is because the generation technique in WP 4.7 processes the entire PDF document while MLA uses an explicit GhostScript invocation to process just the first page (I haven't looked at the code yet).

It may be partially directly calling GhostScript, but I suspect the largest time difference is that WordPress is rendering at 128dpi, while MLA renders at 72dpi by default. 72dpi is enough for a small thumbnail, but wasn't high enough resolution to generate clear previews for Attachment Details.

You can test to see if this is the case or not by either changing the line from 128, 128 to 72, 72, or commenting out the if statement/setResolution in entirely, since 72dpi is Imagick's default.

  1. I used the drag-and-drop uploader (in the Media/Add New screen) to process four large PDF documents (about 23MB each) with similar names. The WP 4.7 process sometimes failed with an HTTP Error, but it also generated multiple copies of some files and thumbnail images for each copy.

Did the ones that fail have a large amount of pages? Any that you can share for testing would be appreciated.

I think in the end, we're going to need to provide a better error message (as mentioned in the title), since some PDFs are going to fail thumbnailing due to time or RAM no matter what platform they're on. The trick is figuring out how to best detect or prevent the case, while still showing a true error if the upload failed entirely.

#6 @joemcgill
8 months ago

Ideally, I think we want to find a way to split the task of generating the PDF thumbnail into a distinct task that happens apart from the initial upload, such that the attachment post is created when the attachment is uploaded, and then we kick off the thumbnail generation process and update the backbone view in the media library once the PDF images have been created.

#7 in reply to: ↑ 5 @dglingren
8 months ago

Thanks, @mikeschroder for the kind words and the additional questions/suggestions. I re-tested the MLA thumbnail generation, increasing the resolution to 128dpi and the image size to 300x388 pixels. The time increased from 5.37 seconds to 9.61 seconds; significant but not dramatic. I increased the image size to 1088x1408 at 128dpi and the time was actually a bit faster.

The document for the timing test has 392 pages, and the file is 32,953KB (32MB+). The page count seems more significant than the file size. I have lots of documents with large page counts if you want them.

The four-document set for the "multiple copies" issue has 126 pages. I can do more testing to give you more information if that's helpful.

I would be happy to share the documents for your testing. Can you give me specific instructions on the best way to do that?

For @joemcgill - Note that I am using the "old" Media/Add New (Upload New Media) admin submenu, not the Media/Library submenu or the Media Manager Modal Window. MLA adds features to that screen for assigning taxonomy terms, etc. to items as they are uploaded, so I always test with that screen.

This ticket was mentioned in Slack in #core by helen. View the logs.


8 months ago

#9 @dglingren
8 months ago

I have uploaded a new MLA Development Version (https://wordpress.org/plugins/media-library-assistant/developers/) that generates WP 4.7-style thumbnails for items already in the Media Library. The default settings match those for the code in trunk, but you can adjust most of the parameters through the UI to do experiments. As a precaution, MLA won't generate new thumbnails for items that already have a "native" thumbnail.

The code now in trunk uses Imagick to process the PDF document. This was my first approach as well, and I abandoned it after I realized that Imagick processed the whole document before extracting the first page as an image. Calling GhostScript directly to generate just the first page is the only way I found to get acceptable performance for documents with hundreds of pages.

This ticket was mentioned in Slack in #core-media by joemcgill. View the logs.


8 months ago

#11 @Stagger Lee
8 months ago

This is new track, I posted on old:


What is needed to make it work ?
I have installed latest nightly Beta. Upload PDF and nothing happens.

Imagmagick is installed in server:
https://s18.postimg.org/5o5a7kfzt/2016_11_06_183022.jpg

#12 follow-up: @dglingren
8 months ago

@Stagger Lee -

Make sure you have both Imagemagick AND Ghostscript installed and working. If you install and activate Media Library Assistant you can go to the Settings/Media Library Assistant Shortcodes tab and scroll to the "Thumbnail Substitution Support, mla_viewer" section. under the "Enable thumbnail substitution " checkbox you can see "IMPORTANT: both Ghostscript and Imagick/ImageMagick must be installed for this feature." if everything's working or a warning message if, for example, Ghostscript is not installed.

#13 in reply to: ↑ 12 @Stagger Lee
8 months ago

Replying to dglingren:

Hi @dglingren. Yes, you are right. I believed Ghostscript is part of Imagemagick. It was missing.

#14 @dglingren
8 months ago

I did more testing today with the 4.7 Beta 2 release. If you upload multiple files with high page counts you get several duplicate Media Library items for each file, and the files take many minutes to generate thumbnails even if they don't fail entirely. Please let me know if there's anything I can do to help sort this out.

This ticket was mentioned in Slack in #core-media by mike. View the logs.


8 months ago

@mikeschroder
8 months ago

Only read first page.

#16 @mikeschroder
8 months ago

  • Keywords has-patch needs-testing added; needs-patch removed

In my initial tests, reading the first page with [0] didn't function as expected, and seemed to yield the same times (as setting the iterator after the fact). Perhaps this was due to PDFs that were too small, or perhaps I did the tests incorrectly.

Either way, I gave it another look after the feedback on this ticket, and it does seem to improve the state of things significantly.

Please test it in 38522.diff.

This ticket was mentioned in Slack in #core by helen. View the logs.


8 months ago

#18 @lukecavanagh
8 months ago

@mikeschroder

Tested using a 150 page PDF file, but the PDF file was just 1mb in size, no issues on thumbnail PDF generation.

#19 follow-up: @dglingren
8 months ago

@mikeschroder

I applied the patch to my system, and rendering times are much improved! The large files with hundreds of pages render in 10 to 35 seconds, down from several minutes without the patch.

However, I still get HTTP errors about 4 times out of ten uploads, and creation of duplicates as well. I uploaded six documents (one at a time, no errors) and ended up with 13 items in the Media Library. If you can tell me how to send you the files I'd be happy to pass them on.

It seems like dragging and dropping two or more large files at one time increases the error rate.

Note that I am testing on a small, slow "server" and this may aggravate the problems.

#20 in reply to: ↑ 19 @mikeschroder
8 months ago

Replying to dglingren:

@mikeschroder

I applied the patch to my system, and rendering times are much improved! The large files with hundreds of pages render in 10 to 35 seconds, down from several minutes without the patch.

Great!

However, I still get HTTP errors about 4 times out of ten uploads, and creation of duplicates as well. I uploaded six documents (one at a time, no errors) and ended up with 13 items in the Media Library. If you can tell me how to send you the files I'd be happy to pass them on.

The duplicates issue confuses and concerns me. Is this a bare install, with no additional plugins, and a default theme?
A new attachment should get created even if the upload appears to have failed, but should only add up to the total number of attempts.

Yes, certainly any test PDFs you could share are valuable, thank you!

#21 @johnbillion
8 months ago

38522.diff makes a huge difference to PDF thumbnail generation on my site.

Thumbnail generation for a 7MB PDF with 300+ pages went down from 1.6 minutes to just 2 seconds.

I was using a PDF of the SWEBOK available here: http://www4.ncsu.edu/~tjmenzie/cs510/pdf/SWEBOKv3.pdf

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 months ago

#23 follow-up: @dglingren
8 months ago

@mikeschroder -

I am using the Twenty Fifteen theme. My original tests had a few plugins active, such as MLA, Query Monitor and Yoast SEO. I just went back and disabled all plugins. New testing still shows the HTTP errors and duplicate items. Here are links to my test files:

https://dl.dropboxusercontent.com/u/63421103/PKA-CTB%20optimized.pdf
https://dl.dropboxusercontent.com/u/63421103/PKA-CTB%20original.pdf
https://dl.dropboxusercontent.com/u/63421103/PKA-CTB%20reduced.pdf
https://dl.dropboxusercontent.com/u/63421103/PKA-CTB.pdf

My test is to drag all four files into the Media/Add New admin screen and drop them there. Duplicate items are only created when the HTTP error also occurs. Let me know when you've copied the files so I can clean up my Dropbox area; thanks.

PS One more test result. It looks like the document that fails with an HTTP error gets replaced by a duplicate of another document in the same upload batch.

Last edited 8 months ago by dglingren (previous) (diff)

#24 @mikeschroder
8 months ago

In 39187:

Media: Only load first PDF page for thumbnails.

To improve performance, directly load the first page of uploaded PDFs
to reduce the total clock time necessary to generate thumbnails.

Props dglingren, lukecavanagh, helen, johnbillion, mikeschroder.
See #38522.

#25 in reply to: ↑ 23 @mikeschroder
8 months ago

  • Keywords needs-patch added; has-patch removed

Replying to dglingren:

New testing still shows the HTTP errors and duplicate items. Here are links to my test files:

Thanks, download them and will test this out.

Do you end up with more duplicates than the number of HTTP Errors, or the same as the total of HTTP Errors + successful ones?

Just want to be sure I understand the behavior you're seeing -- It's expected behavior for there to still be an attachment generated if there is an HTTP error, since the attachment is generated before thumbs are generated (this allows uploads to succeed even if the thumb fails).

Adding back needs-patch, since I'd still love to see better messaging if we can do it within the time there is left.

#26 @lukecavanagh
8 months ago

@dglingren

Was able to upload all four test PDF files on a live dev site running WP 4.7 (4.7-beta2-39168)
Just had to wait for each PDF upload to finish in sequence. No duplicates created in the media library and all of the PDF files uploaded fine.

Last edited 8 months ago by lukecavanagh (previous) (diff)

@lukecavanagh
8 months ago

PDF uploads in media library no duplicates created

#27 @dglingren
8 months ago

@mikeschroder - The number of duplicates seems to vary form test to test; very hard to pin down.

@lukecavanagh - The duplicates don't always occur. I dragged/dropped those four files repeatedly and sometimes got duplicates, sometimes not.

I think it was much worse before the latest patch, when it took much longer to generate the thumbnails. Maybe try reverting the patch and running the test on the older code?

My test setup might contribute to the problem. I am running the client/browser and the server on the same system, so there are no network delays to speak of. It's quite possible that slower uploads mask or eliminate the problem.

Here's a new clue, which I have just repeated several times. The upload process begins with "asynch-upoad.php". The HTTP errors occur if an "admin-ajax.php" (action:heartbeat) request is sent while the earlier asynch-upload.php request is still in process.

#28 @lukecavanagh
8 months ago

@dglingren

Granted my test was on cloud hosting, so there would be some slight delay even if the server response is pretty quick.

It did take a while to upload all four PDF files.

#29 @dglingren
8 months ago

@lukecavanagh - I think the key is slow server processing time. The clue seems to be overlapping HTTP requests, such as the "action=heartbeat" example above. I believe the HTTP error also occurs if fetching the thumbnail result of one upload overlaps with generating the thumbnail for a subsequent upload.

This ticket was mentioned in Slack in #core-media by joemcgill. View the logs.


8 months ago

#31 @helen
7 months ago

Is there anything left here critical for 4.7 or can we close + move to new tickets and milestone those as appropriate?

#32 @mikeschroder
7 months ago

  • Keywords needs-patch removed
  • Resolution set to fixed
  • Status changed from assigned to closed

I haven't been able to reproduce the "duplicate" issue -- if someone does, I think it'd be good to open a new ticket.

Otherwise, while I'd much prefer to have a better error message, I think we've mitigated the issue enough for 4.7.

Feel free to comment or open new tickets if you see additional issues.

This ticket was mentioned in Slack in #core-media by joemcgill. View the logs.


7 months ago

#34 @mikeschroder
7 months ago

In 39303:

Media: Allow override of PDF setup for multiple pages or DPI.

After [39187], WordPress started loading only the first page of a PDF.

This is appropriate for performance, but made it impossible to
write plugins that read other pages without overriding load().

Introduces WP_Image_Editor_Imagick->pdf_setup(), to allow an override
to change WordPress' rendering DPI defaults or which pages are loaded.

Fixes #38832. See #38522, #31050.
Props markoheijnen, joemcgill, mikeschroder.

#35 @dglingren
7 months ago

I ran more tests with the Beta 4 release and I can generate HTTP errors on almost every upload. I edited /wp-includes/js/heartbeat.js to reduce the time between heartbeat requests from 60 seconds to 6 seconds, changing line 52:

Connect interval (in seconds)
mainInterval: 6,

Then, I went to the Media/Add New admin screen and dragged/dropped two (or more) PDF files into the upload area. If a heartbeat request overlaps with the upload and thumbnail generation process the HTTP error usually follows.

I am still getting duplicate items in about half of the failure cases, but I am willing to let that go.

This ticket was mentioned in Slack in #forums by clorith. View the logs.


7 months ago

Note: See TracTickets for help on using tickets.