Make WordPress Core

Opened 9 years ago

Closed 7 years ago

Last modified 4 years ago

#31050 closed feature request (fixed)

Better PDF Upload Management

Reported by: celloexpressions's profile celloexpressions Owned by: kirasong's profile kirasong
Milestone: 4.7 Priority: normal
Severity: normal Version:
Component: Media Keywords: has-patch needs-testing has-dev-note
Focuses: ui Cc:

Description

Over the past several releases, WordPress has added improved support for more and more media types, particularly images, audio, and video. All other file types get reasonably decent handling, with file type icons in the admin and link-insertion into content for the front-end.

But a particularly notable gap is with PDF files. Even though it's generally preferred to put content directly into html/site format, there are still many unavoidable situations where users must post PDF files.

In my recent sheet music library project, I discovered that PDFs can easily be converted to images using ImageMagick and WP_Image_Editor. If we create thumbnail images of PDFs on upload, we could use those images instead of the generic document icon in the admin media library. Additionally, we could explore improved means of displaying PDFs on the front-end. For example, this could be accomplished by offering the user the ability to insert a thumbnail-image-preview, linked to the PDF file, instead of a plain-text link.

This is the basic code for generating a thumbnail of a PDF on upload, and storing it as post meta in a plugin:

/**
 * When a PDF is uploaded, generate a thumbnail image and save it in the attachment meta.
 */
add_filter( 'wp_generate_attachment_metadata', 'prefix_generate_pdf_thumbnail_metadata', 10, 2 );
function prefix_generate_pdf_thumbnail_metadata( $metadata, $attachment_id ) {
	$attachment = get_post( $attachment_id );
	if ( 'application/pdf' === get_post_mime_type( $attachment ) ) {
		// Create a png from the pdf.
		$file = get_attached_file( $attachment_id );
		$pdf = wp_get_image_editor( $file );
		if ( ! is_wp_error( $pdf ) ) { // Most likely cause for error is that ImageMagick is not available.
			$filename = $pdf->generate_filename( 'image', null, 'png' );
			$uploaded = $pdf->save( $filename, 'image/png' );
			if ( ! is_wp_error( $uploaded ) ) {
				$upload_dir = wp_upload_dir();
				update_post_meta( $attachment_id, 'pdf_thumbnail_url', $upload_dir['url'] . '/' . $uploaded['file'] );
			}
		}
	}
	return $metadata;
}

I'd like to propose with this ticket that we explore creating thumbnails of PDFs when they're uploaded and use them for improved administration and display purposes, when the server environment supports ImageMagick, of course.

Attachments (22)

31050.patch (3.9 KB) - added by markoheijnen 8 years ago.
31050.2.patch (6.9 KB) - added by markoheijnen 8 years ago.
31050.3.patch (7.8 KB) - added by markoheijnen 8 years ago.
Remove support from audio/video.
31050.4.patch (7.7 KB) - added by markoheijnen 8 years ago.
31050.5.patch (8.4 KB) - added by kirasong 8 years ago.
Docs for filter, only support PDF, check for fallback support without loading attachment in WP_Image_Editor
Screen Shot 2016-02-12 at 3.49.37 PM.png (108.3 KB) - added by kirasong 8 years ago.
Media Library with PDF; Cropped display "landscape" => max-height 100%
Screen Shot 2016-02-12 at 3.49.55 PM.png (120.6 KB) - added by kirasong 8 years ago.
Media Library with PDF; Cropped display "portrait" => max-width 100%
31050.6.patch (9.1 KB) - added by markoheijnen 8 years ago.
Adding the full image to the sizes array
31050.7.patch (9.6 KB) - added by markoheijnen 8 years ago.
Fix linking to image instead of PDF.
31050.8.patch (9.8 KB) - added by markoheijnen 8 years ago.
Fixed the unit tests issues
31050.9.patch (10.7 KB) - added by markoheijnen 8 years ago.
Increase the quality of the PDF and fix scaling PDF image issue
31050.diff (11.0 KB) - added by kirasong 7 years ago.
Refreshed patch. Added a couple TODOs.
Emperor_s_New_Truncation.pdf (140.3 KB) - added by Ipstenu 7 years ago.
A really old and bad PDF of a public paper that causes an HTTP Error on upload and no thumbnail generation
31050.2.diff (10.9 KB) - added by kirasong 7 years ago.
SRA2.pdf (996.7 KB) - added by Ipstenu 7 years ago.
A PDF that sometimes throws HTTP errors
31050.3.diff (9.9 KB) - added by kirasong 7 years ago.
Work around black thumbnail issue. Generate more sizes; use medium by default in gallery and large by default in attachment details. Lower rendering DPI to 128.
31050.4.diff (9.5 KB) - added by joemcgill 7 years ago.
31050.5.diff (9.8 KB) - added by kirasong 7 years ago.
31050.6.diff (11.2 KB) - added by joemcgill 7 years ago.
wordpress-gsoc-flyer.pdf (12.6 KB) - added by joemcgill 7 years ago.
31050.7.diff (11.4 KB) - added by kirasong 7 years ago.
31050.8.diff (11.5 KB) - added by kirasong 7 years ago.
Skip test when PDF rendering is not supported.

Download all attachments as: .zip

Change History (127)

#1 @helen
9 years ago

  • Milestone changed from Awaiting Review to Future Release

Yes, please! Will happily move this to 4.2 if a patch materializes in a timely manner.

#2 @celloexpressions
9 years ago

Let's figure out the general strategy here first, then see how that leads to a patch.

We can do the actual image generation when it's done for audio and video in wp_generate_attachment_metadata(). The generated image would be stored as its own sub-attachment, which could in turn have multiple sizes, like any other image attachment, that could be used in the admin or on the frontend. But we would need to prevent these sub-attachments from showing up in the admin anywhere. Would it maybe be better to just generate thumbnails at a couple of arbitrary sizes (probably a thumbnail for admin and full-sized for possible frontend use) and store their URLs as attachment meta instead?

We should also back-generate images for all existing PDF uploads if possible, but I have no idea how to do that. I think we could get to a patch here for 4.2 with a team effort, but I know we'll get improvements in here eventually.

This ticket was mentioned in Slack in #core by tomauger. View the logs.


9 years ago

#4 @tomauger
9 years ago

We've got a little hackfest group working on this, and the current thinking is that we let core handle generating the PDF thumbnail upon upload, as per @celloexpressions, but that the back-generation belongs more properly in plugin territory, since it's basically a one-time affair.

#5 @celloexpressions
9 years ago

It shouldn't be difficult to build a basic mechanism into core that generates any missing thumbnails either as-needed (probably a bad idea since it will always fail if there's no IMagick), during an upgrade routine, or when the attachment post object is updated. When it comes down to it, this is core functionality and you shouldn't need a plugin to use it on already-uploaded media, it should "just work". Pretty much all users would benefit from that just happening without going through a plugin. But I wouldn't worry about that detail too much yet - let's get the actual functionality working and then dig into those compatibility details.

#6 @tomauger
9 years ago

Thanks @celloexpressions, we'll deal with the batch update piece later.

The team here is at a bit of a crossroads and could use core dev feedback. It has to do with the actual thumbnail images that we create.

  • We think that the thumbnail image of the PDF ought to be generated not just for the thumbnail size (ie: media gallery) but also for all the intermediate image sizes as well, so that the thumbnail can propberly be used on the front-end.
  • If this is the case, then we have to decide whether these generated images are their own attachment that is somehow associated with the original (PDF) attachment, or simply the -150x150.jpg versions (etc) of the original PDF attachment.
  • If we simply append the size postfix, then we have a TLA mismatch situation (.pdf vs. .jpg) - I don't know whether that could pose some issues down the road.

Interested to hear some perspectives on this.

This ticket was mentioned in Slack in #core by tomauger. View the logs.


9 years ago

#8 @helen
9 years ago

As for the "when" part of back-generating, could perhaps handle it the way we deal with ID3 tags.

#9 @tomauger
9 years ago

Good thought. I'm not currently familiar with the ID3 tag generation, but I'll look into it at the same time we look into the video thumbnail generation.

This ticket was mentioned in Slack in #core by tomauger. View the logs.


9 years ago

#11 @swissspidy
8 years ago

  • Focuses ui administration removed

Anyone willing to work on patch here?

How is image generation currently working for other non-image post types? Perhaps we can learn something from there.

#12 @melchoyce
8 years ago

#35044 was marked as a duplicate.

@markoheijnen
8 years ago

#13 @markoheijnen
8 years ago

The first part to tackle this. Currently it only works on the grid since only a full size is getting generated.

Questions:

  • Should we generate a thumbnail to or handle it differently?
  • Should we always generate images for video/audio even without theme support?

#14 @markoheijnen
8 years ago

  • Keywords has-patch dev-feedback added; needs-patch removed

#15 @kirasong
8 years ago

Tested it out.
Yeah, thinking we might want to generate a thumb as well, so it doesn't need to load the larger image when unnecessary.

Wondering if we can show the full or thumb in MCE/attachment page, when an image is available, as well.

#16 @markoheijnen
8 years ago

Shouldn't be hard. Will dive into it in the next 24 hours

#17 @joemcgill
8 years ago

I dig the idea of creating a preview image/thumbnail to make it easier to visually distinguish documents inside the media library. I think we should make sure the top level attachment metadata (e.g. mime-type, etc.) refers to the PDF and not the preview image.

We probably should include a thumbnail since that will get used in some places but I wouldn't want the preview images to be thought of as a front-end experience solution. I would think something like https://mozilla.github.io/pdf.js/ might make for better front-end previews if we were going to go down that rabbit hole.

@markoheijnen
8 years ago

Remove support from audio/video.

#18 follow-up: @markoheijnen
8 years ago

Added two new patches:

  • Generate thumbnail image
  • Changed image_downsize() to also work for non images. Please check that part.
  • Because of that change image now shows at the list table
  • Show image on the edit screen

Todo

  • Fix front-end. Breaks since it wants to show the pdf file as image
  • What to do about Video/Audio image? 31050.3.patch removes the logic added in #23673

#19 in reply to: ↑ 18 ; follow-up: @pbearne
8 years ago

Replying to markoheijnen:

I had a similar problem where I needed the image for a video the way I solved it was to get the open graph data saved the image provided there.

<?php
        /**
         * http://stackoverflow.com/questions/2068344/how-do-i-get-a-youtube-video-thumbnail-from-the-youtube-api
         *
         * @param $url
         *
         * @return string
         */
        private static function get_open_graph_data_for_url(
                $url
        ) {
                if ( empty( esc_url( $url ) ) ) {
                        return '';
                }
                $transit_id  = 'LP-open-graph-data';
                $transit_key = md5( $url );

                $open_graph_data = get_transient( $transit_id );

                if ( false === $open_graph_data || ! isset( $open_graph_data[ $transit_key ] ) ) {
                        try {
                                $scraper                         = new SimpleScraper( esc_url( $url ) );
                                $open_graph_data[ $transit_key ] = $scraper->getOgp();

                                set_transient( $transit_id, $open_graph_data, HOUR_IN_SECONDS );
                        } catch ( Exception $e ) {
                                error_log( 'Caught SimpleScrape exception: ' . $e->getMessage() );

                                return false;
                        }


                }

                return $open_graph_data[ $transit_key ];
        }

I used the lib

/*
 +---------------------------------------------------------------------------+
 | Simple Scraper (SimpleScraper.class.php )                                 |
 | Copyright (c) 2013-2015, Ramon Kayo                                       |
 +---------------------------------------------------------------------------+
 | Author        : Ramon Kayo                                                |
 | Email         : contato@ramonkayo.com                                     |
 | License       : Distributed under the MIT License                         |
 | Full license  : http://code.ramonkayo.com/simple-scraper/license.txt      |
 +---------------------------------------------------------------------------+
 | "Simplicity is the ultimate sophistication." - Leonardo Da Vinci          |
 +---------------------------------------------------------------------------+
 | Last modified : 2015-07-08                                                |
 +---------------------------------------------------------------------------+
 */

Just a throught

Paul

#20 in reply to: ↑ 19 @markoheijnen
8 years ago

Replying to pbearne:

Replying to markoheijnen:

I had a similar problem where I needed the image for a video the way I solved it was to get the open graph data saved the image provided there.

Thats something unrelated to this. This is about getting data out of video/audio and in this case with ID3 tag data. Currently there is check if is has CPT support or theme support.

#21 @markoheijnen
8 years ago

Information storage was incorrect, fixed that and updated the previous code to work with that. The image of a PDF will be displayed on the attachment page. I'm not that happy with the quality but no clue how to fix that.

This one is ready to be reviewed.

Open questions:

  • What to do about Video/Audio image?
  • Which mimetypes will we support for this? Currently application/pdf and video/avi (haven't tested that one yet)
  • Is attachment_fallback_mimetypes the best filter name? would love to hear alternatives. Maybe move the logic to a function or extensions in combination with wp_get_video_extensions().

This ticket was mentioned in Slack in #core by joemcgill. View the logs.


8 years ago

@kirasong
8 years ago

Docs for filter, only support PDF, check for fallback support without loading attachment in WP_Image_Editor

#23 @kirasong
8 years ago

  • Focuses ui added
  • Keywords needs-unit-tests needs-testing added
  • Milestone changed from Future Release to 4.5

I think we should only support PDF, but allow users to specify other types if they'd like, with the opening to expand these types for core later. As far as I can tell, FFMPEG is required for any video types.

Notes:

  • This is cool.
  • When we use thumbs for the gallery display, the top and bottom of standard pages get cut off. Wondering if we shouldn't crop for these thumbs.
  • It'd be cool to insert the thumbnail into the page in the Visual editor, if this is easily doable. If we do that, will probably want to run a user test to see if it's preferred behavior.

Added 31050.5.patch:

  • Docs for filter
  • Only support PDF in filter
  • Check for fallback support without loading attachment in WP_Image_Editor

#24 follow-up: @markoheijnen
8 years ago

Fine with only PDF. Was just checking if we could support more. Fallback can be removed since the same check is inside wp_image_editor_supports.

@kirasong
8 years ago

Media Library with PDF; Cropped display "landscape" => max-height 100%

@kirasong
8 years ago

Media Library with PDF; Cropped display "portrait" => max-width 100%

#25 in reply to: ↑ 24 ; follow-up: @kirasong
8 years ago

Replying to markoheijnen:

Fine with only PDF. Was just checking if we could support more. Fallback can be removed since the same check is inside wp_image_editor_supports.

Maybe I misunderstand -- don't we still want to be sure that only PDF is used, even if other types are supported by the image editors available, for now?

This ticket was mentioned in Slack in #core-images by mike. View the logs.


8 years ago

#27 in reply to: ↑ 25 @kirasong
8 years ago

Replying to markoheijnen:

Fine with only PDF. Was just checking if we could support more. Fallback can be removed since the same check is inside wp_image_editor_supports.

Ah, nevermind, you are correct -- instantiating with wp_get_image_editor() uses _wp_image_editor_choose() internally, which is what wp_image_editor_supports() runs before loading the image, so we're fine to skip checking for editor support ahead of time, like you did in the previous patch.

#28 @kirasong
8 years ago

  • Owner set to markoheijnen
  • Status changed from new to assigned

#29 @adamsilverstein
8 years ago

This looks good. I like the version with Cropped display "landscape" => max-height 100% - I think a typical PDF is for printing and will have portrait orientation.

#30 follow-up: @celloexpressions
8 years ago

If possible, let's handle both portrait and landscape PDFs - depending on industry/usage, both could be common so it's better not to assume. For example, construction documents are nearly always landscape, and distributed in PDF format.

I still think there's room for improvement in the way PDFs can be displayed on the front end; maybe some sort of a tile that shows the thumbnail and title, but really anything other than plain links would be an improvement. We can look into that in another ticket once image generation is implemented for the admin side, and it would be optional regardless.

#31 in reply to: ↑ 30 @markoheijnen
8 years ago

Replying to celloexpressions:

If possible, let's handle both portrait and landscape PDFs - depending on industry/usage, both could be common so it's better not to assume. For example, construction documents are nearly always landscape, and distributed in PDF format.

I still think there's room for improvement in the way PDFs can be displayed on the front end; maybe some sort of a tile that shows the thumbnail and title, but really anything other than plain links would be an improvement. We can look into that in another ticket once image generation is implemented for the admin side, and it would be optional regardless.

It's just inverting the logic of images to let the image show completely instead of cropped.

Currently it's showing the image linked to the PDF on the frontend side. So that part has been improved.

This ticket was mentioned in Slack in #core by joemcgill. View the logs.


8 years ago

#33 @azaozz
8 years ago

Been thinking about this. To work it requires ImageMagick and Postscript on the server. There are some settings that we can pass through imagick to ps to try to get better quality previews.

There is a difference between setting album art for audio, poster image for video and preview image for PDF.

  • Album art can be added separately, it should be a separate image attachment.
  • Video posters are usually generated automatically but would be good to support adding them manually, to match audio/album art.
  • PDF previews are always generated automatically. There is no need for separate image attachments.

Thinking that it's better to set another meta "key" for preview images in the attachment meta, perhaps preview_image. Then keep the image details in there. That way we won't mix non-image mime types with image meta.

The structure of the preview_image array can also be better than the existing image meta. For example we can store the sub-dirs separately and have all sizes in one array, etc.

#34 @markoheijnen
8 years ago

I could look into it. Reason why I did it this way was the easy integration in the existing code. To me the sizes part already is something separate and not linked to only to images.

About the full image we could probably move that one to sizes too. Reason why I set $metadatafile? was that it's used to get the relative path. However we could also use the meta key _wp_attached_file for that. So I'm looking if that could be used to only add the PDF data to sizes.

#35 follow-up: @dglingren
8 years ago

I added PDF thumbnail generation to my plugin, [Media Library Assistant](https://wordpress.org/support/plugin/media-library-assistant), a while back. It supports dynamic generation in the [mla_gallery] shortcode. It also has a bulk action in the Media/Assistant admin submenu to generate thumbnails and store them as separate Media Library items. You are welcome to incorporate any of the code or ideas in Core.

I also discovered that adding "Featured Image" support to Media Library items is easy and works well:

add_post_type_support( 'attachment', 'thumbnail' );

Once that's done you can assign an image to any non-image item (PDF, audio, video, Office documents) and use it in gallery displays, etc. It's a nice alternative to generating thumbnails.

@markoheijnen
8 years ago

Adding the full image to the sizes array

#36 @markoheijnen
8 years ago

Added the option to have the full image inside the sizes array. I still need to look over the code since both approaches are generating 5 failing unit tests.

This ticket was mentioned in Slack in #core-images by markoheijnen. View the logs.


8 years ago

#38 @ericlewis
8 years ago

Some notes from testing the latest patch.

  • The thumbnails while browsing in the media modal and library are great.
  • The attachment page displays the thumbnail of the PDF, and links to the PDF, great.
  • Inserting a PDF into post content creates a link to the thumbnail of the PDF. This should link to the PDF.
  • From the media modal and library, the URL in the attachment details pane points to the thumbnail of the PDF. This should link to the PDF.
  • Relatedly, it is hard to get to the original PDF from the media modal and library. Even if the URL linked to the PDF, I would have to click on the URL, select all, and navigate to that URL.

#39 follow-up: @markoheijnen
8 years ago

That's not good. I guess that was always a problem and will look into linking to the PDF instead of the Image.

Isn't it always hard to get the original link? So it's a side issue of what we are trying to fix here?

#40 in reply to: ↑ 39 ; follow-up: @ericlewis
8 years ago

Replying to markoheijnen:

Isn't it always hard to get the original link? So it's a side issue of what we are trying to fix here?

In trunk without the patch applied, embedding the PDF in a post will create a link to the original PDF, and the media modal URL points to the original PDF. So this would be a regression.

@markoheijnen
8 years ago

Fix linking to image instead of PDF.

#41 in reply to: ↑ 40 @markoheijnen
8 years ago

Replying to ericlewis:

Replying to markoheijnen:

Isn't it always hard to get the original link? So it's a side issue of what we are trying to fix here?

In trunk without the patch applied, embedding the PDF in a post will create a link to the original PDF, and the media modal URL points to the original PDF. So this would be a regression.

Okay that way. Should be fixed now. Still looking at why the unit tests are failing.

@markoheijnen
8 years ago

Fixed the unit tests issues

This ticket was mentioned in Slack in #core by jorbin. View the logs.


8 years ago

@markoheijnen
8 years ago

Increase the quality of the PDF and fix scaling PDF image issue

#43 @markoheijnen
8 years ago

In general the patch is really good however when a PDF isn't correctly build, the images will be broken. I have tried to run several imagick methods to test things out but couldn't get closer to the answer. Using mergeImageLayers(Imagick::LAYERMETHOD_FLATTEN) broke a PDF that normally did work for me. When using identifyImage the only difference was the type. but forcing the broken one to another type didn't work either.

Below the warnings that imagemagick will output for the broken PDF.

**** Warning: can't process font stream, loading font by the name.
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.

**** This file had errors that were repaired or ignored.
**** The file was produced by: 
**** >>>> Microsoft: Print To PDF <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

#44 @markoheijnen
8 years ago

Also in this patch the contruct of Imagick will not load the image anymore but readimage() will. This because setResolution() need to be called before the image is been loaded.

#45 @kirasong
8 years ago

  • Milestone changed from 4.5 to Future Release

Since @markoheijnen noted that this is not working with some unknown (but readily discoverable) quantity of PDFs in a currently undetectable way, going to punt this from 4.5.

I'd still love to see this make its way to core. It might make the most sense in a feature plugin, to allow us to test with a wider variety of PDFs.

#46 @dglingren
8 years ago

I would be very interested in testing some of the "broken PDF" files against the code in my plugin. I am using a combination of Imagick calls and direct Ghostscript exec calls and I have never seen the kind of errors reported in earlier posts to this topic.

Where can I find the PDF documents you're testing with? Thanks!

This ticket was mentioned in Slack in #core-images by mike. View the logs.


8 years ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


8 years ago

#49 @jlambe
8 years ago

Yes! Hope this feature will be accepted for 4.6 merge. Great work!

#50 @joemcgill
8 years ago

  • Milestone changed from Future Release to 4.7

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


8 years ago

#52 @simonrcodrington
8 years ago

I've been looking at this in several of my own projects. It's a bit jarring to see a wall of default images used for PDF documents. It should be relatively easy to generate a thumbnail / sample image using Imagick or similar so that users can see at a glace what this file is. I'd prefer any type of image regardless of how it's cropped, so long as you can get a feel for what type of file is being displayed.

If we can't / don't want to add a preview thumbnail image, it would be good to have a large preview image displayed when you click on the document itself (when it opens in the media attachment lightbox, where it shows you it's metadata).

How likely is it that this will be followed up in 4.7? I'm happy to try a pull together a simple solution if needed.

Cheers


This ticket was mentioned in Slack in #core by joemcgill. View the logs.


8 years ago

This ticket was mentioned in Slack in #core-images by mike. View the logs.


8 years ago

#55 in reply to: ↑ 35 @joyously
8 years ago

Replying to dglingren:

I added PDF thumbnail generation to my plugin, [Media Library Assistant](https://wordpress.org/support/plugin/media-library-assistant), a while back. It supports dynamic generation in the [mla_gallery] shortcode. It also has a bulk action in the Media/Assistant admin submenu to generate thumbnails and store them as separate Media Library items. You are welcome to incorporate any of the code or ideas in Core.

I also discovered that adding "Featured Image" support to Media Library items is easy and works well:

add_post_type_support( 'attachment', 'thumbnail' );

Once that's done you can assign an image to any non-image item (PDF, audio, video, Office documents) and use it in gallery displays, etc. It's a nice alternative to generating thumbnails.

Using the post_thumbnail seems like it would be the WordPress way to do things, as attachments are just posts anyway, and it applies to all non-image types instead of just PDFs. Then it's just a matter of generating it on upload, and maybe marking it so the image is not displayed separately in the Media Library (or not).

My opinion for generating the already uploaded ones is that it should only happen when used, as in while displaying the thumbnails in Media Library, the PDF is next, get the thumbnail but if it doesn't exist, create it. It seems to me that all image sizes should be done that way instead of littering the server with unused image sizes. (There's a plugin that does this: Dynamic Image Resizer.)

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


8 years ago

#57 @markoheijnen
8 years ago

Current status is that I'm refreshing the patch so it works again. Will happen in the next 36 hours.

For plugins, I don't know how "Media Library Assistant" should work but it doesn't generate an image for me. "PDF Image Generator" has same mistake as the current code and so do some others that solly only use imagick logic. I will work on a catalog of PDF's we can use. The current one I use is an invoice that I can't share.

Last edited 8 years ago by markoheijnen (previous) (diff)

#58 @dglingren
8 years ago

@markoheijnen,

I am the Media Library Assistant author and I'd be happy to help you in any way I can. MLA supports image generation in two ways:

  1. The [mla_gallery] shortcode has an mla_viewer parameter that generates images for its gallery display.
  2. The Media/Assistant admin submenu has a "Thumbnail" bulk action that will generate images for PDF documents and add them to the Media Library as separate items. Then, for example, the new items can be used as the Featured Image for the document to avoid generating the image each time it is displayed in an [mla_gallery] output.

These options are described in the "Thumbnail Substitution Support, mla_viewer" section of the Settings/Media Library Assistant Documentation tab. Let me know if there's anything I can do to help!

#59 @markoheijnen
8 years ago

@dglingren I found later on that I read your message here as I was sure it was on Slack. So I tried out the bulk action and with the invoice PDF I have, it only shows the logo and the rest is removed. So same behavior as others.
But I will look into your plugins code to see if there are things to learn from.

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


8 years ago

#61 @dglingren
8 years ago

@markoheijnen,

Thanks for your quick response. I am always looking for PDF documents to test with; if you get a failing document that you can share, let me know. MLA also does a lot of PDF parsing to extract IPTC, EXIF and XMP metadata and there are many odd cases that don't work without special code.

#62 @celloexpressions
8 years ago

Would it be useful to upload various types of PDFs (ideally generated from different types of software) to this ticket for testing? Or maybe setup an open dropbox/google drive folder somewhere? Probably the most unique examples I can share include sheet music, construction drawings, and construction schedules.

#63 @jorbin
7 years ago

@markoheijnen Doing some trac gardening. Are you still working on this for 4.7? I want to make sure the milestone accuretly reflects what is being worked on for 4.7

#64 @markoheijnen
7 years ago

@jorbin Awesome job. Yes, I'm still looking at it. I probably give it another 2 weeks to see if I can get it in for 4.7.

@celloexpressions I got something at https://github.com/markoheijnen/Debug-Image-editors but it does more checks. I can create categories so PDF's only can be tested as well.

@dglingren I will collect more PDF's and test it with your plugin this weekend. Sorry for the delay.

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


7 years ago

#66 @desrosj
7 years ago

  • Keywords needs-refresh added

The most recent patch is failing for me when applying. Needs a refresh.

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


7 years ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


7 years ago

#69 @kirasong
7 years ago

  • Milestone changed from 4.7 to Future Release

Chatted with @markoheijnen on this today. Since there isn't a refresh yet, going to remove it from the milestone for now. Of course, if more happens here, we can move it back in.

If I have time today or tomorrow, I'll see what the work looks like to give it a refresh -- by all means, if others are interested, please feel free to do so as well.

@kirasong
7 years ago

Refreshed patch. Added a couple TODOs.

#70 @kirasong
7 years ago

  • Keywords needs-refresh removed

In 31050.diff, refreshed patch, made a few whitespace changes, and added a couple of things we'll want to consider before a commit in TODOs.

This should do for testing, but does not resolve any of the issues found previously with broken thumbnails being generated.

Please test this with your PDFs and please attach any test PDFs you find that break the functionality!

This ticket was mentioned in Slack in #core-images by mike. View the logs.


7 years ago

#72 @kirasong
7 years ago

  • Keywords dev-feedback removed
  • Milestone changed from Future Release to 4.7

After a chat with @helen, adding back to milestone for testing attention.

Please test this with any/all types of PDFs and post any that do not work properly.

Some of the broken ones have been Word PDFs (with images in the header/top of page), but I haven't been able to reproduce this on Word Mac, so testing with PDFs from Word on Windows would be appreciated.

Since we can't detect the warnings Ghostscript/Imagick is tossing, hoping we can find a profile of the PDFs and thumbs that fail.

This ticket was mentioned in Slack in #core-images by mike. View the logs.


7 years ago

@Ipstenu
7 years ago

A really old and bad PDF of a public paper that causes an HTTP Error on upload and no thumbnail generation

#74 follow-up: @Ipstenu
7 years ago

I attached a file - Emperor_s_New_Truncation.pdf - and I apologize for the Comic Sans in it. When I upload it, I get an HTTP Error on upload as it fails to generate a thumbnail. Per my expectations, the file DID upload, just no thumbnails. It's 144kb which is pretty small so I was surprised at that. I have a few others. They're all public PDFs anyway. IIRC they were made on Windows XP back in 2008 or so, but we couldn't remember. It claims to be made by PrimoPDF http://www.primopdf.com

#75 in reply to: ↑ 74 @kirasong
7 years ago

Replying to Ipstenu:

I attached a file - Emperor_s_New_Truncation.pdf

Thanks for testing, Mika!

I wasn't able to reproduce the HTTP Error on VVV (tested in VVV first, since when we chatted you noted that was where you were testing).

Any chance you could use dev tools to check on what the HTTP response/content was, to give us an idea on why it failed?

#76 follow-up: @celloexpressions
7 years ago

I dropped a bunch of files in and this worked great for a while. Then, one of the exceeded the max upload size and another had an http error (this happened for several, which I believe was an internet connectivity issue). Now, no images will generate thumbnails, in the media modal or library. Not even the ones that worked originally. I tried re-applying the patch, but no luck. Any ideas?

#77 in reply to: ↑ 76 @kirasong
7 years ago

Replying to celloexpressions:

I dropped a bunch of files in and this worked great for a while. Then, one of the exceeded the max upload size and another had an http error (this happened for several, which I believe was an internet connectivity issue). Now, no images will generate thumbnails, in the media modal or library. Not even the ones that worked originally. I tried re-applying the patch, but no luck. Any ideas?

This sounds like it might be a hosting throttling issue. I assume it's still broken if you revert the patch?

Time will probably reset whatever throttle you hit, but you could also try killing the PHP processes via the shell, if you have access.

This ticket was mentioned in Slack in #core by desrosj. View the logs.


7 years ago

#79 @kirasong
7 years ago

  • Owner changed from markoheijnen to mikeschroder

Assigning to myself while I'm working on this -- more than happy to pass it on if/when someone else is taking point next :)

@kirasong
7 years ago

#80 @kirasong
7 years ago

Looked at this with @markoheijnen and a bit with @joemcgill (who has other comments as well), but a few notes:

Attached 31050.2.diff. I mis-applied some of the logic at the beginning of image_downsize(), which is fixed in the latest patch.

However, now that it's working "properly," the black thumbnail issue is back, because it looks like the "full" size rendered JPGs are properly saved, whereas the thumbnails are not being generated correctly with all PDFs. I'll dig into this to see whether it can be solved without intentionally making use of these.

It's great that it highlights the issue, and we'll have to decide how to proceed.

After chatting with @markoheijnen, I'm okay with proceeding without a filter for now, with the PDF detection parts within WP_Image_Editor, so removed the TODOs related to that, and documented the WP_Image_Editor changes a bit more.

Also, changed the logic in edit_form_image_editor() so that wp_edit_form_attachment_display doesn't result in double output.

Still to do: Double-check that Imagick::readImage() is available all the way back in versions of Imagick.

Last edited 7 years ago by kirasong (previous) (diff)

This ticket was mentioned in Slack in #core-images by mike. View the logs.


7 years ago

#82 @kirasong
7 years ago

There were some good details in the Slack conversation above.

A few notes:

  • I've got a workaround working where we generate the thumbs with the initially created render, which fixes the multi_resize/black thumbnail problem, and allows us to create whatever other sizes we think best.
  • Imagick supports readImage() back to at least 2.0.0 alpha (WP supports 2.2.0+).
  • @joemcgill notes that he'd like to see a different/smaller size used in the gallery, and I entirely agree.
  • It's an option to only commit the thumbs, and not the Attachment Details view, which would allow WP to only generate lower quality images (it's faster to render PDFs at a lower DPI)
  • @joemcgill notes that he'd like to explore the post thumbnail option before this gets committed (I'm going to dig into what this would mean, and connect with him and @markoheijnen about it)
  • @joemcgill notes that we'll want to address the issue of some PDFs timing out that wouldn't have previously, and had suggested a size limit for the thumbnail generation. I noted that a lower DPI would also help with this -- especially for image intensive PDFs.
  • @joemcgill would prefer if we didn’t have to hack around image_downsize(), which he thinks would be possible if we didn't need to read $meta['sizes'] from PDF attachments.

If I missed anything, please feel free to add additional comments :)

#83 @joemcgill
7 years ago

Thanks @mikeschroder. A few other bits of feedback that I would add, which still apply to 31050.2.diff, are that full size image being generated from the PDF should be named like "filename.jpg" rather than "filename.pdf.jpg" and for images we store the full size metadata outside the [sizes] array. We should probably stay consistent for PDFs unless there's a good reason not to.

@Ipstenu
7 years ago

A PDF that sometimes throws HTTP errors

#84 @Ipstenu
7 years ago

Any chance you could use dev tools to check on what the HTTP response/content was, to give us an idea on why it failed?

Sorry about the delay. Was offline.

It's a 502 Bad Gateway all along the way.

Safari's Request & Response:

POST
Cached	No
Status	Bad Gateway
Code	502
Encoded	166 B
Decoded	166 B
Transferred	329 B

Chrome's Response Headers:

Connection:keep-alive
Content-Length:568
Content-Type:text/html; charset=utf-8
Date:Wed, 19 Oct 2016 18:39:05 GMT
Server:nginx

Testing just that PDF worked a second time on a fresh setup of the same flavor. So I ran them all again and got different PDFs failing. This made me wonder if the issue was how many I was uploading at once, though /srv/log/php_errors.log was empty. Also once a PDF failed, it failed 'forever' which was weird. (Now a file called SRA2.pdf is failing... Attached for grins and giggles)

I ran an svn up again and re-tested with 31050.2.diff - same problem, but the same "new" PDF.

Turned on Xdebug and reloaded Vagrant. Same errors.

Both PDFs were made in MS PowerPoint on a Windows XP box.

This ticket was mentioned in Slack in #core by joemcgill. View the logs.


7 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


7 years ago

This ticket was mentioned in Slack in #core-media by joemcgill. View the logs.


7 years ago

#88 @desrosj
7 years ago

I did a file search for PDFs on my system and started uploading.

I am noticing some PDFs returning HTTP errors, even though the PDF file actually does get copied into the appropriate upload directory. It happens pretty consistently for files that are small in terms of file size, but large in terms of page numbers.

Each of these files weighs in at under 400KB, but each is in the 30-50 page range.

#89 @desrosj
7 years ago

Also, after refreshing the page, every PDF file that returned an HTTP error appears in the media library with the default document icon.

@kirasong
7 years ago

Work around black thumbnail issue. Generate more sizes; use medium by default in gallery and large by default in attachment details. Lower rendering DPI to 128.

#90 @kirasong
7 years ago

In 31050.3.diff:

  • Work around black thumbnail issue.
  • Name full image '.jpg' rather than '.pdf.jpg'
  • Generate more sizes (thumbnail, medium, large, full)
  • Make thumbnail not crop by default.
  • Use medium by default in gallery
  • Use large by default in attachment details.
  • Lower rendering DPI to 128. This reduces all PDF rendered image quality, but also reduces the chances of timeouts. We'll have to decide what compromise we're comfortable with here.

Left:

  • Decide what to do with full images, whether to discard, or how to store in meta if we're still going to. If we don't store in meta, we'll need to change to checking for whether there is an image representation for a non-image filetype a different way. I actually don't mind the way this is working currently, since it signals "The 'file' isn't an image, so the 'full' rendered image is here."
  • Figure out why data.size.url isn't getting set by the time we hit the checks in media-template.php. I've added the extra sizes, and they're being sent via JS like with standard images (data.sizes), but the "default" isn't getting set to medium, which is why it's being hardcoded currently. I'm probably missing something obvious here, so would appreciate any pointers here.

This ticket was mentioned in Slack in #core-media by mike. View the logs.


7 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


7 years ago

@joemcgill
7 years ago

#93 follow-up: @joemcgill
7 years ago

31050.3.diff is working much better on the performance front. 31050.4.diff cleans up a few things:

  1. Since we're saving the attachment metadata similarly to how images save data, most of the internal image functions work just fine, so I've removed some changes to image_downsize() that seemed unnecessary.
  2. I fixed two issues with the way orientations were being calculated. First, in wp_prepare_attachment_for_js() when we use the sizes fallback to set the full size data we were calculating the orientation in reverse. While it's simple to do this calculation, I'm not against adding a helper function as @markoheijnen suggested on Slack. I also fixed a typo which meant we were failing to set the orientation, height, and width attributes for the response.
  3. The only other change is to go back to the default size and crop attributes for image sizes in wp_generate_attachment_metadata(), which I don't feel strongly about if there is a good reason to hard code those values in.

Remaining issues I see:

  1. On the edit media screen wp-admin/post.php?action=edit the thumbnail is not showing up for PDFs because it's attempting to use the attachment file and not the thumbnail image.
  2. I'm not confident that adding the attachment_fallback_mimetypes filter is needed here since any mimes that are added to that array might need its own process for creating a fallback image rather than using the same process we use for PDFs.

I'm going to keep testing, but otherwise this looks good.

@kirasong
7 years ago

#94 in reply to: ↑ 93 ; follow-up: @kirasong
7 years ago

Replying to joemcgill:

  1. I fixed two issues with the way orientations were being calculated. First, in wp_prepare_attachment_for_js() when we use the sizes fallback to set the full size data we were calculating the orientation in reverse. While it's simple to do this calculation, I'm not against adding a helper function as @markoheijnen suggested on Slack. I also fixed a typo which meant we were failing to set the orientation, height, and width attributes for the response.

Also not opposed. Good catch!

  1. The only other change is to go back to the default size and crop attributes for image sizes in wp_generate_attachment_metadata(), which I don't feel strongly about if there is a good reason to hard code those values in.

This was done to make the thumbnail size more useful. By disabling crop for the thumbnail size, this makes the entire page visible, and avoids cropping out of the middle of a page (which allows visible headers in the thumbnail). I've left it the way you changed it, but please test with both and see which one you prefer (it's easiest to test by uploading various PDFs with each patch, then looking in the Media list view).

Remaining issues I see:

  1. On the edit media screen wp-admin/post.php?action=edit the thumbnail is not showing up for PDFs because it's attempting to use the attachment file and not the thumbnail image.

This was caused by the removal of the bits in image_downsize() that replaced the PDF's filename with the "full image" filename.

I've restored these in the most recent patch, but am definitely open to a different or cleaner way of doing it.

  1. I'm not confident that adding the attachment_fallback_mimetypes filter is needed here since any mimes that are added to that array might need its own process for creating a fallback image rather than using the same process we use for PDFs.

I'm not opposed to pulling this for now, and adding it back if there is a call for it.

I'm going to keep testing, but otherwise this looks good.

Great to hear! I noticed that the 'full' size is still in there. I've been assuming you want to see it pulled before a commit, rather than standardize on it, but want to be sure we're on the same page here, since there's a bit of doing to do so. :)

@joemcgill
7 years ago

#95 in reply to: ↑ 94 @joemcgill
7 years ago

  • Keywords needs-unit-tests removed

Replying to mikeschroder:

This was done to make the thumbnail size more useful. By disabling crop for the thumbnail size, this makes the entire page visible, and avoids cropping out of the middle of a page (which allows visible headers in the thumbnail).

It's probably better the way you had it. 31050.6.diff reverts back to disabling crop for thumbnails.

This was caused by the removal of the bits in image_downsize() that replaced the PDF's filename with the "full image" filename.

I've restored these in the most recent patch, but am definitely open to a different or cleaner way of doing it.

Ah, I had missed that when image_get_intermediate_size() fails, we needed to set fallback data. Looks good to me.

I'm not opposed to pulling [the attachment_fallback_mimetypes filter] for now, and adding it back if there is a call for it.

Let's do that for now. 31050.6.diff removes the filter. I'm ok to add it back if we have a need for it but would rather not get stuck with back-compat issues if we find we needed the filter to work differently.

Great to hear! I noticed that the 'full' size is still in there. I've been assuming you want to see it pulled before a commit, rather than standardize on it, but want to be sure we're on the same page here since there's a bit of doing to do so. :)

The more I think about it, the more I'm coming around to the idea of keeping the full size data in the sizes array.

I've added a unit test for wp_generate_attachment_metadata() to show that it should generate image sizes for PDFs. We could certainly do with more tests here, but it's a start. You'll need to add the wordpress-gsoc-flyer.pdf file from this ticket to your /tests/phpunit/data/images/ directory to get this test to pass.

@kirasong
7 years ago

#96 @kirasong
7 years ago

  • Keywords commit added

Thanks @joemcgill!

Made a minor changes to the test (mainly to clean up the PDF after the fact) and putting this together for a first-run commit.

Was able to test using all of the posted PDFs locally, and they end up with rendered thumbnails as expected.

Would be interested to hear from those who have tested if the change to rendering DPI makes them work better on their hosts as well.

@kirasong
7 years ago

Skip test when PDF rendering is not supported.

#97 @kirasong
7 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 38949:

Media: Add support for rendering PDF thumbnails.

When support for PDFs is available, on upload,
render 'Thumbnail', 'Medium', 'Large', and 'Full' sizes of
the first page, and save them in attachment meta.

Use these renders within Add Media, Media Gallery and List views,
Attachment Details, Post/Attachment Edit screens, and Attachment pages.

Support available by default via Imagick -> ImageMagick -> Ghostscript,
but can be provided by any WP_Image_Editor that supports PDFs.

Props adamsilverstein, azaozz, celloexpressions, desrosj, dglingren, ericlewis, ipstenu, joemcgill, joyously, markoheijnen, melchoyce, mikeschroder, tomauger.
Fixes #31050.

#98 @kirasong
7 years ago

Now that this has landed in trunk, if you still see issues with PDFs/hosting you're testing, please follow up here: #38522

#99 @Stagger Lee
7 years ago

What is needed to make it work ?
I have installed latest nightly Beta. Upload PDF and nothing happens.

Imagmagick is installed in server:
https://s18.postimg.org/5o5a7kfzt/2016_11_06_183022.jpg


Edit: I see now code and files in trunk are not the same as in nightly builds.

Last edited 7 years ago by Stagger Lee (previous) (diff)

#100 @Stagger Lee
7 years ago

Ghostscript was missing, it is not part of Imagemagick. Installs separately.
Sorry for mail notifications, but it can help someone else.

Not so easy to install on localhost/windows. Kind of 7 different tutorials with 2 of my own guesses/tweaks.

#101 @kirasong
7 years ago

In 39303:

Media: Allow override of PDF setup for multiple pages or DPI.

After [39187], WordPress started loading only the first page of a PDF.

This is appropriate for performance, but made it impossible to
write plugins that read other pages without overriding load().

Introduces WP_Image_Editor_Imagick->pdf_setup(), to allow an override
to change WordPress' rendering DPI defaults or which pages are loaded.

Fixes #38832. See #38522, #31050.
Props markoheijnen, joemcgill, mikeschroder.

#102 @joemcgill
7 years ago

In 40130:

Media: Keep PDF previews from overwriting files.

Since support for PDF previews were added in [38949], it's possible
that the generated image file could overwrite an existing image file
with the same name. This uses wp_unique_filename() to avoid this
issue and adds a '-pdf' identifier on the end of filenames.

Props gitlost, derosj, mikeschroder, joemcgill.
Fixes #39875. See #31050.

#103 @joemcgill
7 years ago

In 40133:

Media: Keep PDF previews from overwriting files.

Since support for PDF previews were added in [38949], it's possible
that the generated image file could overwrite an existing image file
with the same name. This uses wp_unique_filename() to avoid this
issue and adds a '-pdf' identifier on the end of filenames.

Props gitlost, desrosj, mikeschroder, joemcgill.
Merges [40130] and [40131] to the 4.7 branch.
Fixes #39875. See #31050.

This ticket was mentioned in Slack in #core by swissspidy. View the logs.


7 years ago

#105 @desrosj
4 years ago

  • Keywords has-dev-note added; commit removed

Linking to the published dev note for reference: https://make.wordpress.org/core/2016/11/15/enhanced-pdf-support-4-7/

Note: See TracTickets for help on using tickets.