Make WordPress Core

Opened 16 years ago

Closed 12 years ago

#9417 closed defect (bug) (fixed)

IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1

Reported by: kallewangstedt
Owned by: westi
Milestone: 3.5
Priority: normal
Severity: major
Version: 2.7.1
Component: Charset
Keywords: has-patch
Focuses:
Cc:

Description

When uploading a JPEG image with IPTC tags in the EXIF that are encoded as UTF-8, the tags are misinterpreted as ISO-8859-1 when they get extracted into the Title and Description fields in swfupload.

This occurs regardless of the encoding set in WP options (which is UTF-8 by default).

Attachments (6)

9417.patch (732 bytes) - added by hakre 16 years ago.
utf-8 fix
marantz-pm-40se.jpg (51.5 KB) - added by demetris 16 years ago.
Sample JPEG file with trilingual UTF-8 text in the IPTC caption-abstract field
swfupload-utf8-iso88591.png (42.3 KB) - added by demetris 16 years ago.
Screenshot of Flash uploader with UTF-8 text in IPTC field read as ISO-8859-1
9417.2.patch (2.1 KB) - added by hakre 16 years ago.
existing UTF8 encoding of IPTC blocks preserved
9417.3.patch (5.2 KB) - added by SergeyBiryukov 14 years ago.
9417.4.patch (5.1 KB) - added by SergeyBiryukov 12 years ago.
Refreshed


Change History (57)

#1 @kallewangstedt
16 years ago

  • Milestone changed from Unassigned to 2.8

#2 @Denis-de-Bernardy
16 years ago

  • Keywords needs-patch reporter-feedback added

a sample image might help

#3 @hakre
16 years ago

In a check yesterday I encountered ISO-encoded files in SWF-Uploader. I guess this is the root of the problem. A core dev should ensure that all files are UTF-8 encoded. I will try to see whether that is patchable.

@hakre
16 years ago

utf-8 fix

#4 follow-up: @hakre
16 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Please test

#5 @Denis-de-Bernardy
16 years ago

This actually fixed things on your end? :-)

#6 in reply to: ↑ 4 @Denis-de-Bernardy
16 years ago

Replying to hakre:

Please test

Would love to, but wouldn't know how... or where to find the needed test data.

#8 @hakre
16 years ago

I could not test because I do not have a JPEG file with that EXIF data. But maybe this is related, since with encodings the whole chain of data must stay consistent.

#9 @Denis-de-Bernardy
16 years ago

hehe, sounds like one of those bugs that won't get fixed from lack of valid test cases. :D

#10 @hakre
16 years ago

+1 having the files consistently encoded anyway. they are loaded in from utf-8 output (normally) and therefore should be utf-8 encoded as well.

#11 @demetris
16 years ago

  • Cc dkikizas@… added

Bug confirmed. I see exactly what kallewangstedt describes.

hakre’s patch does not help. (I think the patched file is not used at all, as it is merged and minified into swfupload-all.js, and, in any case, the patch just changes one character in a comment block.)

I attached a picture, for anyone to see how the characters appear (utf-8 read as iso-8859-1).

The picture is in Picasa Web, if anyone wants to test:

http://picasaweb.google.gr/demetris.pics/2

The IPTC field I used for testing is: caption-abstract. Notice that the text appears fine in Picasa Web.

@demetris
16 years ago

Sample JPEG file with trilingual UTF-8 text in the IPTC caption-abstract field

@demetris
16 years ago

Screenshot of Flash uploader with UTF-8 text in IPTC field read as ISO-8859-1

#12 @Denis-de-Bernardy
16 years ago

  • Milestone changed from 2.8 to Future Release

punting to future pending a patch, now that we've test data.

#13 @hakre
16 years ago

iptcparse() is used to gather the IPTC data. The code assumes that the IPTC data is ISO-8859-1 encoded and then converts it to UTF-8 with utf8_encode().

This is not a swfupload issue. It is an encoding issue caused by the handling within WordPress.

so the question is: how is the iptc encoding done? is it marked? at least a test for having utf8 data in there seems reasonable to me. if it is already valid utf8 then good to go. if not, the latin-1 can be assumed and encoded into utf8 (as it is done now).
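The check proposed above can be sketched as follows. This is a minimal illustration in Python rather than the PHP that WordPress uses, and `iptc_to_utf8` is a hypothetical helper name, not a function from core or from any patch on this ticket.

```python
def iptc_to_utf8(raw: bytes) -> str:
    """Decode an IPTC field, preferring UTF-8 and falling back to Latin-1."""
    try:
        # If the bytes are already valid UTF-8, keep them as they are.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume ISO-8859-1, as the existing code does.
        return raw.decode("iso-8859-1")
```

Note that this is a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8, so a validity test alone can misclassify unusual Latin-1 input.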

#14 @Denis-de-Bernardy
16 years ago

  • Component changed from Upload to Charset
  • Keywords needs-patch 2nd-opinion added; has-patch reporter-feedback needs-testing removed
  • Milestone changed from Future Release to 2.8

@hakre: would you like charset issues to be automatically assigned to you?

#15 @hakre
16 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Fix. The patch checks whether the data is already UTF-8 before encoding it. And that's it (for now). Tested with the provided test image and it does the job. Copied and pasted from the description textbox: "This is a comment. / Это комментарий. / Βλέπετε ένα σχόλιο.".

So this does the job without following the specs. For those I dropped a link in the docblock. The subject is IPTC IIM IRB (IPTC Information Interchange Model Image Resource Blocks), noted here in case this issue needs more attention in the future.

@Denis: you can do so, let's look how it goes.

#16 @Denis-de-Bernardy
16 years ago

done. you left a var_dump() in your patch.

@hakre
16 years ago

existing UTF8 encoding of IPTC blocks preserved

#17 @hakre
16 years ago

  • Summary changed from Character encoding in swfupload misinterpreted as ISO-8859-1 to IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1

@var_dump() - it's getting too late over here. removed it from the patch.

#18 @westi
16 years ago

  • Cc westi added

#19 @Denis-de-Bernardy
16 years ago

  • Milestone changed from 2.8 to 2.9

punting per IRC discussion

#20 @hakre
16 years ago

can not find that discussion? any details on current state? any problems with the patch?

#21 @dd32
16 years ago

IRC discussion was about punting any non-critical bugs from 2.8 due to pending release.

#22 @Denis-de-Bernardy
15 years ago

  • Keywords 2nd-opinion removed

#23 @hakre
15 years ago

  • Milestone changed from 2.9 to 2.8.2

Then I put this on the list for 2.8.2, because I assume 2.8.1 should be for criticals as well.

#24 @hakre
15 years ago

  • Milestone changed from 2.8.5 to 2.9

#25 @hakre
15 years ago

related: #9417

#26 @hakre
15 years ago

related: #6412

#27 @hakre
15 years ago

  • Keywords tested added; needs-testing removed

#28 @yoavf
15 years ago

  • Cc yoav@… added

Double tested :)

#29 @westi
15 years ago

  • Owner set to westi
  • Status changed from new to reviewing

#30 @hakre
15 years ago

I just got some more files (from various systems) and some code from a coder I met at a conference in September. I will test this within the current bughunt. I'm pretty sure this will improve the topic.

#31 @westi
15 years ago

  • Keywords needs-unit-tests added
  • Milestone changed from 2.9 to 3.0

Move to 3.0 for now.

#32 @hakre
15 years ago

Related: #7580

#33 @hakre
15 years ago

Reference: #11547

#34 @hakre
15 years ago

Related: #11417

#36 @nacin
15 years ago

  • Keywords needs-refresh early added
  • Milestone changed from 3.0 to 3.1

#37 @nacin
14 years ago

  • Keywords has-patch tested early removed
  • Milestone changed from Awaiting Triage to Future Release

#38 @hakre
14 years ago

from php.ini-development (as of PHP 5.3.5), see the EXIF section:

[exif]
; Exif UNICODE user comments are handled as UCS-2BE/UCS-2LE and JIS as JIS.
; With mbstring support this will automatically be converted into the encoding
; given by corresponding encode setting. When empty mbstring.internal_encoding
; is used. For the decode settings you can distinguish between motorola and
; intel byte order. A decode setting cannot be empty.
; http://php.net/exif.encode-unicode
;exif.encode_unicode = ISO-8859-15

; http://php.net/exif.decode-unicode-motorola
;exif.decode_unicode_motorola = UCS-2BE

; http://php.net/exif.decode-unicode-intel
;exif.decode_unicode_intel = UCS-2LE

; http://php.net/exif.encode-jis
;exif.encode_jis =

; http://php.net/exif.decode-jis-motorola
;exif.decode_jis_motorola = JIS

; http://php.net/exif.decode-jis-intel
;exif.decode_jis_intel = JIS

I just stumbled over that information; this might be related to a configuration issue as well.

#39 @SergeyBiryukov
14 years ago

Looks like the tags are double-encoded if they are already in UTF-8. The approach from 9417.2.patch works for me. Refreshed for current trunk.
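The double encoding can be reproduced outside WordPress. The sketch below uses Python to mimic what PHP's utf8_encode() does when it is given input that is already UTF-8; the variable names are illustrative only.

```python
original = "Это комментарий."
utf8_bytes = original.encode("utf-8")

# Reading UTF-8 bytes as ISO-8859-1 and encoding them again mirrors what
# utf8_encode() does to input that is already UTF-8.
double_encoded = utf8_bytes.decode("iso-8859-1").encode("utf-8")

# Decoding the result as UTF-8 yields mojibake instead of the original text.
garbled = double_encoded.decode("utf-8")

# The damage is reversible as long as no bytes were lost along the way.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
```

Each non-ASCII byte of the original becomes its own two-byte UTF-8 sequence, which is why the garbled text is longer than the source and why skipping the conversion for valid UTF-8 input avoids the problem.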

#40 @SergeyBiryukov
14 years ago

  • Keywords needs-refresh removed

#41 @SergeyBiryukov
13 years ago

  • Milestone changed from Future Release to 3.3

#42 follow-up: @nacin
13 years ago

  • Keywords punt has-patch added

We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.

#43 in reply to: ↑ 42 @westi
13 years ago

  • Keywords 3.4-early added
  • Milestone changed from 3.3 to Future Release

Replying to nacin:

We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.

Still no unit tests, punting.

#44 @SergeyBiryukov
13 years ago

  • Keywords needs-unit-tests removed

#45 @SergeyBiryukov
13 years ago

  • Keywords punt 3.4-early removed

#46 @SergeyBiryukov
13 years ago

Closed #20408 as a duplicate.

@SergeyBiryukov
12 years ago

Refreshed

#47 @SergeyBiryukov
12 years ago

#21903 was marked as a duplicate.

#48 follow-up: @nacin
12 years ago

  • Milestone changed from Future Release to 3.5

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this. Some research on the standard would be great — otherwise, this seems fine.

#49 in reply to: ↑ 48 ; follow-up: @SergeyBiryukov
12 years ago

Replying to nacin:

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.

From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.
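A rough sketch of how the two approaches could be combined: trust the 1#090 (coded character set) marker when it is present, and fall back to a validity check when it is missing. This is Python, not the PHP from either patch; the `iptc` dict only imitates the shape of iptcparse() output, and `decode_iptc_field` is a hypothetical name.

```python
UTF8_MARKER = b"\x1b%G"  # ESC % G, the ISO 2022 escape sequence for UTF-8


def decode_iptc_field(iptc: dict, key: str) -> str:
    """Decode one IPTC field, using the 1#090 marker when it is available."""
    raw = iptc[key][0]
    declared = iptc.get("1#090", [b""])[0]
    if declared == UTF8_MARKER:
        # The file declares UTF-8, so decode it as such.
        return raw.decode("utf-8", errors="replace")
    try:
        # No marker: accept the bytes if they are valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume ISO-8859-1, as the existing code does.
        return raw.decode("iso-8859-1")
```

For example, `decode_iptc_field({"2#120": [b"caf\xe9"]}, "2#120")` takes the ISO-8859-1 branch, while a caption accompanied by the marker is decoded as UTF-8 directly.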

#50 in reply to: ↑ 49 @chenxing
12 years ago

Replying to SergeyBiryukov:

Replying to nacin:

The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.

From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.

I don't know if there is a reliable source. I got it from here: http://php.net/manual/en/function.iptcparse.php#105025

I tried to Google for a reliable source but with no luck...

#51 @nacin
12 years ago

  • Resolution set to fixed
  • Status changed from reviewing to closed

In [21905]:

Avoid mangling UTF-8 strings that may be present in image metadata. props SergeyBiryukov for the unit tests [UT665]. fixes #9417.
