WordPress.org

Make WordPress Core

Opened 5 years ago

Closed 19 months ago

#9417 closed defect (bug) (fixed)

IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1

Reported by: kallewangstedt
Owned by: westi
Milestone: 3.5
Priority: normal
Severity: major
Version: 2.7.1
Component: Charset
Keywords: has-patch
Focuses:
Cc:

Description

When uploading a JPEG image with IPTC tags in the EXIF that are encoded as UTF-8, the tags are misinterpreted as ISO-8859-1 when they get extracted into the Title and Description fields in swfupload.

This occurs regardless of the encoding set in the WP options (which is UTF-8 by default).

Attachments (6)

9417.patch (732 bytes) - added by hakre 5 years ago.
utf-8 fix
marantz-pm-40se.jpg (51.5 KB) - added by demetris 5 years ago.
Sample JPEG file with trilingual UTF-8 text in the IPTC caption-abstract field
swfupload-utf8-iso88591.png (42.3 KB) - added by demetris 5 years ago.
Screenshot of Flash uploader with UTF-8 text in IPTC field read as ISO-8859-1
9417.2.patch (2.1 KB) - added by hakre 5 years ago.
existing UTF8 encoding of IPTC blocks preserved
9417.3.patch (5.2 KB) - added by SergeyBiryukov 3 years ago.
9417.4.patch (5.1 KB) - added by SergeyBiryukov 22 months ago.
Refreshed

Change History (57)

comment:1 kallewangstedt 5 years ago

  • Milestone changed from Unassigned to 2.8

comment:2 Denis-de-Bernardy 5 years ago

  • Keywords needs-patch reporter-feedback added

a sample image might help

comment:3 hakre 5 years ago

While checking yesterday I came across ISO-encoded files in SWFUpload. I guess this is the root of the problem. A core dev should make sure that all files are UTF-8 encoded. I will try to see whether this can be patched.

hakre 5 years ago

utf-8 fix

comment:4 follow-up: hakre 5 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Please test

comment:5 Denis-de-Bernardy 5 years ago

This actually fixed things on your end? :-)

comment:6 in reply to: ↑ 4 Denis-de-Bernardy 5 years ago

Replying to hakre:

Please test

Would love to, but wouldn't know how... or where to find the needed test data.

comment:8 hakre 5 years ago

I could not test because I do not have a JPEG file with that kind of EXIF data. But this may be related, since with encodings the whole chain of data must stay intact.

comment:9 Denis-de-Bernardy 5 years ago

hehe, sounds like one of those bugs that won't get fixed from lack of valid test cases. :D

comment:10 hakre 5 years ago

+1 to having the files consistently encoded anyway. They are loaded from UTF-8 output (normally) and should therefore be UTF-8 encoded as well.

comment:11 demetris 5 years ago

  • Cc dkikizas@… added

Bug confirmed. I see exactly what kallewangstedt describes.

hakre’s patch does not help. (I think the patched file is not used at all, as it is merged and minified into swfupload-all.js, and, in any case, the patch just changes one character in a comment block.)

I attached a picture, for anyone to see how the characters appear (utf-8 read as iso-8859-1).

The picture is in Picasa Web, if anyone wants to test:

http://picasaweb.google.gr/demetris.pics/2

The IPTC field I used for testing is: caption-abstract. Notice that the text appears fine in Picasa Web.

demetris 5 years ago

Sample JPEG file with trilingual UTF-8 text in the IPTC caption-abstract field

demetris 5 years ago

Screenshot of Flash uploader with UTF-8 text in IPTC field read as ISO-8859-1

comment:12 Denis-de-Bernardy 5 years ago

  • Milestone changed from 2.8 to Future Release

punting to future pending a patch, now that we've test data.

comment:13 hakre 5 years ago

iptcparse() is used to gather the IPTC data. The data is assumed to be ISO-8859-1 encoded and is then converted to UTF-8 by utf8_encode().

This is not a swfupload issue. It is an encoding issue caused by the handling within WordPress.

So the question is: how is the IPTC encoding done? Is it marked? At least a test for UTF-8 data seems reasonable to me. If the data is already valid UTF-8, it is good to go. If not, Latin-1 can be assumed and the data encoded into UTF-8 (as is done now).
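
A minimal sketch of the check proposed above, written in Python purely for illustration (WordPress itself is PHP, and this is not the code from any of the attached patches). The function name `iptc_to_text` is hypothetical. It tries a strict UTF-8 decode first and only falls back to ISO-8859-1 when the bytes are not well-formed UTF-8:

```python
def iptc_to_text(raw: bytes) -> str:
    """Decode an IPTC field value, preferring UTF-8 over ISO-8859-1.

    Hypothetical helper for illustration; not part of WordPress.
    """
    try:
        # Strict decoding raises if the bytes are not well-formed UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value is defined in ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


utf8_caption = "Это комментарий.".encode("utf-8")
latin1_caption = "Café".encode("iso-8859-1")

print(iptc_to_text(utf8_caption))
print(iptc_to_text(latin1_caption))
```

The check is a heuristic: a short ISO-8859-1 string can happen to be well-formed UTF-8, in which case it would be read as UTF-8.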

comment:14 Denis-de-Bernardy 5 years ago

  • Component changed from Upload to Charset
  • Keywords needs-patch 2nd-opinion added; has-patch reporter-feedback needs-testing removed
  • Milestone changed from Future Release to 2.8

@hakre: would you like charset issues to be automatically assigned to you?

comment:15 hakre 5 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Fix attached. The patch checks whether the data is already UTF-8 encoded before converting it, and that's it (for now). Tested with the provided test image and it does the job. Copied and pasted from the description textbox: "This is a comment. / Это комментарий. / Βλέπετε ένα σχόλιο.".

This does the job without following the specs; for those I dropped a link in the docblock. The subject is IPTC IIM IRB (IPTC Information Interchange Model Image Resource Blocks), noted here in case this issue needs more attention in the future.

@Denis: you can do so, let's see how it goes.

comment:16 Denis-de-Bernardy 5 years ago

done. you left a var_dump() in your patch.

hakre 5 years ago

existing UTF8 encoding of IPTC blocks preserved

comment:17 hakre 5 years ago

  • Summary changed from Character encoding in swfupload misinterpreted as ISO-8859-1 to IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1

@var_dump() - it's getting too late over here. removed it from the patch.

comment:18 westi 5 years ago

  • Cc westi added

comment:19 Denis-de-Bernardy 5 years ago

  • Milestone changed from 2.8 to 2.9

punting per IRC discussion

comment:20 hakre 5 years ago

can not find that discussion? any details on current state? any problems with the patch?

comment:21 dd32 5 years ago

IRC discussion was about punting any non-critical bugs from 2.8 due to pending release.

comment:22 Denis-de-Bernardy 5 years ago

  • Keywords 2nd-opinion removed

comment:23 hakre 5 years ago

  • Milestone changed from 2.9 to 2.8.2

Then I put this on the list for 2.8.2, because I assume 2.8.1 should be for criticals as well.

comment:24 hakre 5 years ago

  • Milestone changed from 2.8.5 to 2.9

comment:25 hakre 5 years ago

related: #9417

comment:26 hakre 5 years ago

related: #6412

comment:27 hakre 5 years ago

  • Keywords tested added; needs-testing removed

comment:28 yoavf 4 years ago

  • Cc yoav@… added

Double tested :)

comment:29 westi 4 years ago

  • Owner set to westi
  • Status changed from new to reviewing

comment:30 hakre 4 years ago

I just got some more files (from various systems) and some code from a coder I met at a conference in September. I will test this during the current bug hunt. I'm pretty sure this will move the topic forward.

comment:31 westi 4 years ago

  • Keywords needs-unit-tests added
  • Milestone changed from 2.9 to 3.0

Move to 3.0 for now.

comment:32 hakre 4 years ago

Related: #7580

comment:33 hakre 4 years ago

Reference: #11547

comment:34 hakre 4 years ago

Related: #11417

comment:36 nacin 4 years ago

  • Keywords needs-refresh early added
  • Milestone changed from 3.0 to 3.1

comment:37 nacin 3 years ago

  • Keywords has-patch tested early removed
  • Milestone changed from Awaiting Triage to Future Release

comment:38 hakre 3 years ago

from php.ini-development (as of PHP 5.3.5), see the EXIF section:

[exif]
; Exif UNICODE user comments are handled as UCS-2BE/UCS-2LE and JIS as JIS.
; With mbstring support this will automatically be converted into the encoding
; given by corresponding encode setting. When empty mbstring.internal_encoding
; is used. For the decode settings you can distinguish between motorola and
; intel byte order. A decode setting cannot be empty.
; http://php.net/exif.encode-unicode
;exif.encode_unicode = ISO-8859-15

; http://php.net/exif.decode-unicode-motorola
;exif.decode_unicode_motorola = UCS-2BE

; http://php.net/exif.decode-unicode-intel
;exif.decode_unicode_intel = UCS-2LE

; http://php.net/exif.encode-jis
;exif.encode_jis =

; http://php.net/exif.decode-jis-motorola
;exif.decode_jis_motorola = JIS

; http://php.net/exif.decode-jis-intel
;exif.decode_jis_intel = JIS

I just stumbled over this information; the problem might be related to a configuration issue as well.

SergeyBiryukov 3 years ago

comment:39 SergeyBiryukov 3 years ago

Looks like the tags are double-encoded if they are already in UTF-8. The approach from 9417.2.patch works for me. Refreshed for current trunk.
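
The double encoding described above can be reproduced in a few lines. This is a Python illustration under the assumption that the conversion step behaves like PHP's utf8_encode(), i.e. it treats every input byte as one ISO-8859-1 character; the helper name `double_encode` is made up for the example:

```python
def double_encode(text: str) -> str:
    """Show what appears when UTF-8 bytes are converted to UTF-8 a second time.

    Hypothetical helper for illustration; not part of WordPress.
    """
    stored = text.encode("utf-8")  # bytes as stored in the IPTC field
    # Reading the bytes as ISO-8859-1 and re-encoding them as UTF-8
    # encodes every non-ASCII byte a second time.
    twice = stored.decode("iso-8859-1").encode("utf-8")
    return twice.decode("utf-8")   # the text the upload form would display


original = "Βλέπετε ένα σχόλιο."
mangled = double_encode(original)
print(mangled)

# The mangling is reversible as long as no bytes were lost along the way.
repaired = mangled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

Plain ASCII text passes through unchanged, which is why the problem only shows up with accented or non-Latin captions.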

comment:40 SergeyBiryukov 3 years ago

  • Keywords needs-refresh removed

comment:41 SergeyBiryukov 3 years ago

  • Milestone changed from Future Release to 3.3

comment:42 follow-up: nacin 2 years ago

  • Keywords punt has-patch added

We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.

comment:43 in reply to: ↑ 42 westi 2 years ago

  • Keywords 3.4-early added
  • Milestone changed from 3.3 to Future Release

Replying to nacin:

We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.

Still no unit tests, punting.

comment:44 SergeyBiryukov 2 years ago

  • Keywords needs-unit-tests removed

comment:45 SergeyBiryukov 2 years ago

  • Keywords punt 3.4-early removed

comment:46 SergeyBiryukov 2 years ago

Closed #20408 as a duplicate.

SergeyBiryukov 22 months ago

Refreshed

comment:47 SergeyBiryukov 19 months ago

#21903 was marked as a duplicate.

comment:48 follow-up: nacin 19 months ago

  • Milestone changed from Future Release to 3.5

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this. Some research on the standard would be great — otherwise, this seems fine.

comment:49 in reply to: ↑ 48 ; follow-up: SergeyBiryukov 19 months ago

Replying to nacin:

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.

From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.

comment:50 in reply to: ↑ 49 chenxing 19 months ago

Replying to SergeyBiryukov:

Replying to nacin:

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.

From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.

I don't know if there is a reliable source. I got it from here: http://php.net/manual/en/function.iptcparse.php#105025

I tried to Google for a reliable source but with no luck...
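
For reference, the '1#090' key corresponds to IPTC IIM dataset 1:90 (Coded Character Set), and the value that announces UTF-8 is the three-byte escape sequence ESC % G. Below is a hedged Python sketch of combining that marker with a validity check for files where the marker is absent. The `iptc` dictionary only imitates the shape of iptcparse() output, the helper names are hypothetical, and this is not the code that was committed:

```python
UTF8_MARKER = b"\x1b%G"  # ESC % G, the ISO 2022 escape sequence for UTF-8


def declares_utf8(iptc: dict) -> bool:
    """Return True if dataset 1:90 (Coded Character Set) announces UTF-8."""
    return any(value == UTF8_MARKER for value in iptc.get("1#090", []))


def decode_caption(iptc: dict) -> str:
    """Decode dataset 2:120 (caption-abstract) from an iptcparse()-like dict."""
    values = iptc.get("2#120") or [b""]
    raw = values[0]
    if declares_utf8(iptc):
        # The file says it is UTF-8; replace any stray bytes instead of failing.
        return raw.decode("utf-8", errors="replace")
    try:
        # No marker: accept the bytes as UTF-8 only if they are well-formed.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise keep the historical assumption of ISO-8859-1.
        return raw.decode("iso-8859-1")


with_marker = {"1#090": [UTF8_MARKER], "2#120": ["Это комментарий.".encode("utf-8")]}
without_marker = {"2#120": [b"Caf\xe9"]}

print(decode_caption(with_marker))
print(decode_caption(without_marker))
```

Since the marker may be missing, as noted above, the validity check remains the deciding step for unmarked files.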

comment:51 nacin 19 months ago

  • Resolution set to fixed
  • Status changed from reviewing to closed

In [21905]:

Avoid mangling UTF-8 strings that may be present in image metadata. props SergeyBiryukov for the unit tests [UT665]. fixes #9417.
