WordPress.org

Make WordPress Core

Opened 4 years ago

Last modified 2 months ago

#18311 closed task (blessed)

Support HTML in image captions. — at Version 16

Reported by: prettyboymp Owned by: sushkov
Milestone: 3.4 Priority: highest omg bbq
Severity: blocker Version:
Component: Editor Keywords: has-patch
Focuses: Cc:

Description (last modified by azaozz)

The ability to add any HTML formatting to text in captions is very limited.

  • HTML markup can be added from the HTML editor as long as you don't switch to the visual editor and the markup does not contain any quotes.
  • Currently the visual editor removes any HTML tags from captions.

To do:

  • Handle HTML in the caption text escaping/encoding it to pass through the shortcode matching regex.
  • Provide some way of adding HTML in the captions in the gallery -> image properties. (This may be limited to links only as they seem most needed).
  • Improve editing of caption text in the visual editor after inserting an image with caption or add third "popup button" to images that would add/remove captions directly without needing to open the uploader/gallery popup.

Change History (19)

comment:1 @prettyboymp4 years ago

I toyed with using a shortcodes to implement the markup just to fix the problem, but found that having the ] in the caption also breaks the handling, see #15694

comment:2 @jane4 years ago

  • Component changed from General to Media

@h0tw1r34 years ago

use 'patch -p1 < my-patch.diff' from the base wordpress folder

comment:3 @h0tw1r34 years ago

  • Component changed from Media to TinyMCE

This is related to how the wpeditimage plugin for TinyMCE currently works. It was originally written to support images, not other "stuff" in the media library or a generic for other things. I use it as a container for the flowplayer shorttag.

The first half of the problem is to stop the plugin from stripping the caption tag unnecessarily when switching from HTML to Visual mode.

"my-patch.diff" fixes that part of the issue.

comment:5 @navjotjsingh4 years ago

  • Cc navjotjsingh@… added

comment:6 @jeremyclarke4 years ago

I don't think the source_url method is the way to go. Definitely adds an unnecessarily confusing element to the UI and the whole system. I also don't think that just adding attribution in brackets like that is nearly universal enough.

What we need is to stop the caption system from stripping out light HTML from the caption field. There's no semantic justification for not having formatting and in fact there are MANY scenarios where having it is important, not the least of which is the attribution scenario which is vital for legal reasons.

WP IDEA with 55 supporters: http://wordpress.org/extend/ideas/topic/fix-the-caption-system-to-allow-links-inside-captions

WP Tavern Post about it: http://www.wptavern.com/how-to-insert-links-inside-of-image-captions

comment:7 @nacin4 years ago

Someone just needs to write the patch.

It's not a lack of support that is holding us back, it's the lack of code to implement this.

comment:8 @nacin4 years ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release

comment:9 @jeremyclarke4 years ago

  • Cc jer@… added

The first step of the problem is in image_add_caption(), which is attached to the image_send_to_editor filter and handles actually inserting the caption shortcode into post content. Notably, the image_add_caption() function is completely undocumented ( {@internal Missing Short Description}}) so it's hard to know what was intended by it's behavior, though clearly the idea was to simply remove anything that might be offensive inside a shortcode attribute without consideration for formatting that might be important to preserve.

Interestingly the upload manager popup correctly saves whatever HTML you insert in the caption without stripping it or showing an error. The problem is that when the shortcode is inserted into the post, image_add_caption() runs the following string replacement, which completely neutralises any HTML in the caption by replacing it with HTML entities:

$caption = str_replace(	array( '>',    '<',    '"',      "'" ),
	array( '&gt;', '&lt;', '&quot;', '&#039;' ),
		$caption
	  );

This is despite the fact that re-opening the image in the gallery shows you that the HTML is still saved correctly and unmolested in the caption field, which is very confusing. At a minimum the media uploader should not save any HTML in the caption field that will not ultimately function in the caption shortcode itself.

If the str_replace is removed then inserted HTML continues to not function because when the actual $caption content is output, addslashes() is run on it first( addslashes($caption)) which creates useless broken HTML.

The issue is further complicated by the fact that the caption text is stored in the caption="" attribute of the shortcode, so any double-quotes within the caption text truly don't make any sense and absolutely SHOULD be either escaped/slashed or replaced with single-quotes for storage inside the shortcode attribute.

THE CAPTION SHORTCODE IS FUNDAMENTALLY FLAWED, NUTS

Looking at it closely it seems like the architecture of the caption shortcode is fundamentally flawed because it uses an 'attribute' of the shortcode (caption="") to store the caption text, and attributes are rightly blocked from having the kind of formatting a useful caption needs. Since the shortcode has only one piece of content that can go between the two tags (i.e. between [caption] and caption) that content should have been the caption text itself (which could then take advantage of being regular post content), rather than the image html tag that is currently there (which could have been broken down into esc_attr()-safe attributes stored inside the initial [caption] shortcode and parsed into an img tag during shortcode execution).

Unfortunately re-architecting the caption shortcode now is probably off the table, and I doubt we can get traction for fully replacing it with a new more powerful shortcode in core. That is the problem we face though, the current architecture is just wrong for captions that are any more than another kind of alt label.

WE NEED SOME WAY TO ESCAPE HTML IN THE CAPTION TEXT ON SAVE, THEN UNESCAPE ON DISPLAY

So the trick we need to pull off is getting the system to somehow allow for at least a limited set of HTML (personally I'd settle for just links) inside the 'caption' attribute of the [caption] shortcode, then making the shortcode function img_caption_shortcode() display it correctly. If that worked, then it should be possible to get the visual editor to stop messing it up when you switch back and forth.

To me it seems like the best approach within the current shortcode would be to carefully account for particular HTML tags inside the 'caption' attribute by escaping them in a particular way while inserting them into the post, then when executing the shortcode carefully accounting for the escaped HTML and reverting it into an active state.

USING ADDSLASHES()/STRIPSLASHES() TO ESCAPE CAPTION TEXT: FAIL

The simplest version of that idea is to just remove the anti-html str_replace (quoted above) from image_add_caption() and instead depend on addslashes() to keep the content safe, then, in {{img_caption_shortcode()}} using stripslashes() to get the content back to it's original state.

Unfortunately we can't just stripslashes() in the current state of the codebase because a slash-escaped link tag inside the caption attribute causes the whole shortcode's $attr array to lose it's lunch and become broken, here's and example where the caption was "Testing a link" and the link was an a tag leading to test.com which used double-quotes around the URL:

 [id] => attachment_74823
 [align] => alignnone
 [width] => 375
 [0] => caption="Testing
 [1] => a
 [2] =>  href="http://test.com">link"

That's the $attr array recieved by the shortcode function img_caption_shortcode(), so before we even have a chance to stripslashes($caption) the shortcode fails completely because the $caption attribute is mandatory for the shortcode and not present in $atts.

The culprit here is shortcode_parse_atts() which uses some intense regex voodoo I can't even begin to understand but is clearly not ready for a world where slash-escaped double-quotes are part of a shortcode attribute.

However, in this case switching to single-quotes actually fixes the problem, and results in the link showing correctly on the image. Maybe a final solution will involve enforcing single-quotes -only within caption text HTML, though that seems awkward, since it will be hard to distinguish between a double-quote used as part of a link and one used as part of regular text, and we don't want to replace text-quotes with a single-quote as that would be confusing for authors.

USING URLENCODE()/URLDECODE() TO ESCAPE CAPTION TEXT: UGLY

An alternate idea would be to use another escaping system entirely, to completely neutralize the caption text before sending it to the editor, then re-activate it when parsing the shortcode on display. I tried replacing addslashes/stripshashes with urlencode/urldecode and the results were very promising. In image_add_caption() I use urlencode($caption) before sending it to the editor, which sends things like this to the post:

caption="TEsting+a+%3Ca+href%3D%22http%3A%2F%2Ftest.com%22%3Elink%3C%2Fa%3E+with+also+some+%3Cstrong%3Ebold%3C%2Fstrong%3E+text.+"

Then in img_caption_shortcode() I run $caption = urldecode($caption), which reverts it back to it's original state and shows perfectly fine on the post preview.

This even avoids having the visual editor break the content when switching back and forth, but unfortunately it also looks AWFUL in the visual editor, because the urlencode() version of the caption is shown, which would probably confuse even users savvy enough to be adding HTML links in the caption field in the first place. This same awfulness applies to trying to read or edit the caption inside the HTML editor, which is infinitely more difficult then when the caption text is shown as plain text. Yet another place where the ugly urlencoded version shows is in the visual editor when you edit image settings.

WHAT OTHER FORMAT MIGHT WORK?

It seems like a big problem here is that any attempt to escape the HTML will need us to account for every single place where it is shown to users and unescape it for display in that space. Otherwise the escaping pattern would have to be simple/obvious enough that looking at the escaped version still made it clear what the HTML was for.

Maybe the solution is some kind of non-HTML code similar to shortcodes that could be used inside the caption attribute to identify links etc. This is not ideal as it creates yet another layer of syntax to learn and I would avoid that if at all possible, but it may be the only way to get around the problematic architecture that the caption shortcode currently suffers from. Maybe something that emulates mediawiki markup like this could work?

[http://test.com/ test]

OOF! THERE IS A REASON THIS HAS GONE SO LONG WITHOUT A FIX

Okay enough from me for today. Just wanted to dig into the code and do some research about the problem. I think that this probably needs more than just a patch, it needs thinking and decisions because so far there's still no solution that seems appropriate despite the obviousness of the need.

@kurtpayne4 years ago

Allow a subset of HTML in [caption] shortcode

comment:10 @kurtpayne4 years ago

  • Cc kpayne@… added
  • Keywords has-patch added

Perhaps something like 18311.2.patch would work?

This allows simple HTML (currently strong, em, and a) in the [caption] shortcode. It has to be turned on via the render_html=true attribute. It works with no modification to any other code, contained completely within img_caption_shortcode. It doesn't matter what encoded entities TinyMCE throws at it because it will decode it, keep good html, then re-encode it to sanitize bad html.

@kurtpayne4 years ago

Adding render_html attribute, default to false for backwards compatibility

comment:12 follow-up: @azaozz3 years ago

We need a more general approach both in the visual editor and in parsing the shortcode. Also this could involve an UI change (How to add a link in the caption in the image properties popup?) or explicitly moving editing of captions to the visual editor after the image has been inserted.

comment:13 in reply to: ↑ 12 ; follow-up: @DrewAPicture3 years ago

  • Cc xoodrew@… added

Replying to azaozz:

We need a more general approach both in the visual editor and in parsing the shortcode. Also this could involve an UI change (How to add a link in the caption in the image properties popup?) or explicitly moving editing of captions to the visual editor after the image has been inserted.

Would moving the editing of captions to the visual editor affect the pulling of EXIF data (such as a caption) from those images?

Last edited 3 years ago by DrewAPicture (previous) (diff)

comment:14 follow-up: @sushkov3 years ago

  • Cc sushkov added

@jeremyclarke has a point when says that image caption content should go between [caption] shortcode tags.

A fix could be achieved with minimum backcompat impact by separating the [caption] shortcode from img tag. To explain, we could move the [caption] before (makes more sense to me, or after) the img tag, so when the parser meets it, the context is not lost.

On the javascript side, that would mean little changes too.

Example:
[caption stuff=""]My awesome caption[/caption]<img src="" alt=""/>

The good part here, is that basically we do no make changes in the shortcode api (no overhead like mustaches {{}}) we just change the location of it inside the post_content.

comment:15 in reply to: ↑ 13 @azaozz3 years ago

Replying to DrewAPicture:

Would moving the editing of captions to the visual editor affect the pulling of EXIF data (such as a caption) from those images?

No, we "read" the EXIF data in PHP after the image is uploaded and store it as meta. This is another enhancement (that has a ticket I think): using all relevant EXIF from the images and rotating them when needed after uploading and before making smaller sizes.

comment:16 in reply to: ↑ 14 @azaozz3 years ago

  • Component changed from TinyMCE to General
  • Description modified (diff)
  • Keywords has-patch removed
  • Milestone changed from Future Release to 3.4
  • Owner set to sushkov
  • Status changed from new to assigned
  • Summary changed from Caption shortcode doesn't support markup in the caption. to Support HTML in image captions.
  • Type changed from enhancement to task (blessed)

Replying to sushkov:

@jeremyclarke has a point when says that image caption content should go between [caption] shortcode tags.

Don't think changing the structure of the image caption shortcode is possible. This would break all plugins and themes that use it directly (we explicitly allow that in img_caption_shortcode() with the 'img_caption_shortcode' filter).

I rather see that as escaping all quotes or encoding the html string in caption="...", whichever works better. If we need to encode it, we can decode before running the 'img_caption_shortcode' filter to keep it back-compat.

Note: See TracTickets for help on using tickets.