Opened 4 years ago

Closed 8 months ago

#9417 closed defect (bug) (fixed)

IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1

Reported by: kallewangstedt Owned by: westi
Priority: normal Milestone: 3.5
Component: Charset Version: 2.7.1
Severity: major Keywords: has-patch
Cc: dkikizas@…, westi, yoav@…

Description

When uploading a JPEG image with IPTC tags in the EXIF that are encoded as UTF-8, the tags are misinterpreted as ISO-8859-1 when they get extracted into the Title and Description fields in swfupload.

This occurs regardless of the encoding set in the WP options (which is UTF-8 by default).

Attachments (6)

9417.patch (732 bytes) - added by hakre 4 years ago.
utf-8 fix
marantz-pm-40se.jpg (51.5 KB) - added by demetris 4 years ago.
Sample JPEG file with trilingual UTF-8 text in the IPTC caption-abstract field
swfupload-utf8-iso88591.png (42.3 KB) - added by demetris 4 years ago.
Screenshot of Flash uploader with UTF-8 text in IPTC field read as ISO-8859-1
9417.2.patch (2.1 KB) - added by hakre 4 years ago.
existing UTF8 encoding of IPTC blocks preserved
9417.3.patch (5.2 KB) - added by SergeyBiryukov 2 years ago.
9417.4.patch (5.1 KB) - added by SergeyBiryukov 11 months ago.
Refreshed


Change History (57)

  • Milestone changed from Unassigned to 2.8
  • Keywords needs-patch reporter-feedback added

a sample image might help

While checking yesterday I found ISO-encoded files in SWFUpload. I suspect this is the root cause. A core dev should ensure that all files are UTF-8 encoded. I'll see whether this can be patched.

hakre 4 years ago

utf-8 fix

comment:4 follow-up: ↓ 6   hakre 4 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Please test

This actually fixed things on your end? :-)

comment:6 in reply to: ↑ 4   Denis-de-Bernardy 4 years ago

Replying to hakre:

Please test

Would love to, but wouldn't know how... or where to find the needed test data.

I could not test because I do not have a JPEG file with that kind of EXIF data. But this may still be related, since with encodings the whole chain of data must stay intact.

hehe, sounds like one of those bugs that won't get fixed from lack of valid test cases. :D

+1 having the files consistently encoded anyway. they are loaded in from utf-8 output (normally) and therefore should be utf-8 encoded as well.

  • Cc dkikizas@… added

Bug confirmed. I see exactly what kallewangstedt describes.

hakre’s patch does not help. (I think the patched file is not used at all, as it is merged and minified into swfupload-all.js, and, in any case, the patch just changes one character in a comment block.)

I attached a picture, for anyone to see how the characters appear (utf-8 read as iso-8859-1).

The picture is in Picasa Web, if anyone wants to test:

http://picasaweb.google.gr/demetris.pics/2

The IPTC field I used for testing is: caption-abstract. Notice that the text appears fine in Picasa Web.

Sample JPEG file with trilingual UTF-8 text in the IPTC caption-abstract field

Screenshot of Flash uploader with UTF-8 text in IPTC field read as ISO-8859-1

  • Milestone changed from 2.8 to Future Release

punting to future pending a patch, now that we've test data.

iptcparse() is used to gather the IPTC data. The data is assumed to be ISO-8859-1 encoded and is then converted to UTF-8 by utf8_encode().

this is not a swfupload issue. this is an encoding issue in how wordpress handles the data.

so the question is: how is the iptc encoding done? is it marked? at least a test for having utf8 data in there seems reasonable to me. if it is already valid utf8 then good to go. if not, the latin-1 can be assumed and encoded into utf8 (as it is done now).
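The check proposed above (keep the value if it is already valid UTF-8, otherwise assume Latin-1) can be sketched as follows. This is a minimal illustration in Python, not the WordPress patch; the helper name `iptc_text_to_str` is hypothetical.

```python
def iptc_text_to_str(raw: bytes) -> str:
    """Decode one IPTC text value (hypothetical helper, not WordPress code)."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the old assumption: every byte is one Latin-1 character.
        return raw.decode("iso-8859-1")


# A value that is already UTF-8 comes through unchanged.
assert iptc_text_to_str("Это комментарий.".encode("utf-8")) == "Это комментарий."

# A Latin-1 value (0xE9 alone is not valid UTF-8) takes the fallback path.
assert iptc_text_to_str(b"caf\xe9") == "café"
```

The test is a heuristic: a few Latin-1 byte sequences happen to be valid UTF-8 as well (for example `b"\xc3\xa9"`), and those would be read as UTF-8.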

  • Component changed from Upload to Charset
  • Keywords needs-patch 2nd-opinion added; has-patch reporter-feedback needs-testing removed
  • Milestone changed from Future Release to 2.8

@hakre: would you like charset issues to be automatically assigned to you?

  • Keywords has-patch needs-testing added; needs-patch removed

Fix attached. The patch checks whether the data is already UTF-8 before encoding it, and that's it (for now). Tested with the provided test image and it does the job. Copied and pasted from the description textbox: "This is a comment. / Это комментарий. / Βλέπετε ένα σχόλιο.".

So this does the job without following the specs. For those I dropped a link in the docblock. The subject is IPTC IIM IRB (IPTC Information Interchange Model Image Resource Blocks), noted here in case this issue needs more attention in the future.

@Denis: you can do so, let's look how it goes.

done. you left a var_dump() in your patch.

hakre 4 years ago

existing UTF8 encoding of IPTC blocks preserved

  • Summary changed from Character encoding in swfupload misinterpreted as ISO-8859-1 to IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1

@var_dump() - it's getting too late over here. removed it from the patch.

  • Cc westi added
  • Milestone changed from 2.8 to 2.9

punting per IRC discussion

can not find that discussion? any details on current state? any problems with the patch?

IRC discussion was about punting any non-critical bugs from 2.8 due to pending release.

  • Keywords 2nd-opinion removed
  • Milestone changed from 2.9 to 2.8.2

Then I put this on the list for 2.8.2, because I assume 2.8.1 should be for criticals as well.

  • Milestone changed from 2.8.5 to 2.9

related: #9417

related: #6412

  • Keywords tested added; needs-testing removed
  • Cc yoav@… added

Double tested :)

  • Owner set to westi
  • Status changed from new to reviewing

I just got some more files (from various systems) and some code from a coder I met at a conference in September. I will test this during the current bug hunt. I'm pretty sure this will help with the issue.

  • Keywords needs-unit-tests added
  • Milestone changed from 2.9 to 3.0

Move to 3.0 for now.

Related: #7580

Reference: #11547

Related: #11417

  • Keywords needs-refresh early added
  • Milestone changed from 3.0 to 3.1
  • Keywords has-patch tested early removed
  • Milestone changed from Awaiting Triage to Future Release

from php.ini-development (as of PHP 5.3.5), see the EXIF section:

[exif]
; Exif UNICODE user comments are handled as UCS-2BE/UCS-2LE and JIS as JIS.
; With mbstring support this will automatically be converted into the encoding
; given by corresponding encode setting. When empty mbstring.internal_encoding
; is used. For the decode settings you can distinguish between motorola and
; intel byte order. A decode setting cannot be empty.
; http://php.net/exif.encode-unicode
;exif.encode_unicode = ISO-8859-15

; http://php.net/exif.decode-unicode-motorola
;exif.decode_unicode_motorola = UCS-2BE

; http://php.net/exif.decode-unicode-intel
;exif.decode_unicode_intel = UCS-2LE

; http://php.net/exif.encode-jis
;exif.encode_jis =

; http://php.net/exif.decode-jis-motorola
;exif.decode_jis_motorola = JIS

; http://php.net/exif.decode-jis-intel
;exif.decode_jis_intel = JIS

I just stumbled over that information; this might be related to a configuration issue as well.

Looks like the tags are double-encoded if they are already in UTF-8. The approach from 9417.2.patch works for me. Refreshed for current trunk.
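The double encoding is easy to reproduce outside WordPress. The sketch below is in Python for illustration only; `utf8_encode_like` is a hypothetical stand-in that performs the same ISO-8859-1 to UTF-8 conversion as PHP's utf8_encode().

```python
def utf8_encode_like(raw: bytes) -> bytes:
    """Treat every input byte as a Latin-1 character and re-encode it as UTF-8.

    Hypothetical stand-in for the unconditional conversion discussed here.
    """
    return raw.decode("iso-8859-1").encode("utf-8")


original = "Βλέπετε".encode("utf-8")   # the tag is already UTF-8 in the file
mangled = utf8_encode_like(original)   # ...and gets converted a second time

# Every byte of the Greek text is >= 0x80, so each one becomes two bytes.
assert len(mangled) == 2 * len(original)
assert mangled.decode("utf-8") != "Βλέπετε"

# The conversion is lossless, so already-mangled values can be repaired.
repaired = mangled.decode("utf-8").encode("iso-8859-1")
assert repaired == original
```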

  • Keywords needs-refresh removed
  • Milestone changed from Future Release to 3.3

comment:42 follow-up: ↓ 43   nacin 19 months ago

  • Keywords punt has-patch added

We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.

comment:43 in reply to: ↑ 42   westi 18 months ago

  • Keywords 3.4-early added
  • Milestone changed from 3.3 to Future Release

Replying to nacin:

We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.

Still no unit tests; punting.

  • Keywords needs-unit-tests removed
  • Keywords punt 3.4-early removed

Closed #20408 as a duplicate.

Refreshed

#21903 was marked as a duplicate.

comment:48 follow-up: ↓ 49   nacin 8 months ago

  • Milestone changed from Future Release to 3.5

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this. Some research on the standard would be great — otherwise, this seems fine.

comment:49 in reply to: ↑ 48 ; follow-up: ↓ 50   SergeyBiryukov 8 months ago

Replying to nacin:

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.

From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.

comment:50 in reply to: ↑ 49   chenxing 8 months ago

Replying to SergeyBiryukov:

Replying to nacin:

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.

From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.

I don't know if there is a reliable source. I got it from here: http://php.net/manual/en/function.iptcparse.php#105025

I tried to Google for a reliable source but with no luck...
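Since the marker may be absent, the two approaches can be combined: trust 1#090 when it is present and fall back to the validity check otherwise. The Python sketch below is an illustration, not either patch. It assumes that the three bytes ESC % G in dataset 1:90 (Coded Character Set) designate UTF-8, which is my understanding of the IIM convention and should be verified against the specification; the `iptc` dict is a hypothetical imitation of the array shape returned by iptcparse(), and the helper name is made up.

```python
from typing import Dict, List

# Assumption: ESC % G in dataset 1:90 declares UTF-8.
UTF8_MARKER = b"\x1b%G"


def decode_caption(iptc: Dict[str, List[bytes]], key: str = "2#120") -> str:
    """Decode one text field (hypothetical helper); 2#120 is caption-abstract."""
    raw = iptc[key][0]
    declared = iptc.get("1#090", [b""])[0]
    if declared == UTF8_MARKER:
        # The file declares UTF-8; replace stray bytes instead of failing.
        return raw.decode("utf-8", errors="replace")
    try:
        # No declaration: accept the value if it is valid UTF-8 anyway.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise keep the historical Latin-1 assumption.
        return raw.decode("iso-8859-1")


with_marker = {"1#090": [b"\x1b%G"], "2#120": ["Это".encode("utf-8")]}
without_marker = {"2#120": [b"caf\xe9"]}

assert decode_caption(with_marker) == "Это"
assert decode_caption(without_marker) == "café"
```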

  • Resolution set to fixed
  • Status changed from reviewing to closed

In [21905]:

Avoid mangling UTF-8 strings that may be present in image metadata. props SergeyBiryukov for the unit tests [UT665]. fixes #9417.
