Opened 16 years ago
Closed 12 years ago
#9417 closed defect (bug) (fixed)
IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1
Reported by: | kallewangstedt | Owned by: | westi |
---|---|---|---|
Milestone: | 3.5 | Priority: | normal |
Severity: | major | Version: | 2.7.1 |
Component: | Charset | Keywords: | has-patch |
Focuses: | Cc: |
Description
When uploading a JPEG image with IPTC tags in the EXIF that are encoded as UTF-8, the tags are misinterpreted as ISO-8859-1 when they get extracted into the Title and Description fields in swfupload.
This occurs regardless of the encoding set in the WP options (which is UTF-8 by default).
Attachments (6)
Change History (57)
#3
@
16 years ago
While checking yesterday I found ISO-encoded files in SWFUpload. I guess this is the root of the problem. A core developer should ensure that all files are UTF-8 encoded. I will try to see whether this is patchable.
#4
follow-up:
↓ 6
@
16 years ago
- Keywords has-patch needs-testing added; needs-patch removed
Please test
#6
in reply to:
↑ 4
@
16 years ago
Replying to hakre:
Please test
Would love to, but wouldn't know how... or where to find the needed test data.
#8
@
16 years ago
I could not test because I do not have a JPEG file with that EXIF data. But maybe this is related, since with encodings the chain of data must stay intact.
#9
@
16 years ago
hehe, sounds like one of those bugs that won't get fixed from lack of valid test cases. :D
#10
@
16 years ago
+1 for having the files consistently encoded anyway. They are loaded from UTF-8 output (normally) and should therefore be UTF-8 encoded as well.
#11
@
16 years ago
- Cc dkikizas@… added
Bug confirmed. I see exactly what kallewangstedt describes.
hakre’s patch does not help. (I think the patched file is not used at all, as it is merged and minified into swfupload-all.js, and, in any case, the patch just changes one character in a comment block.)
I attached a picture, for anyone to see how the characters appear (utf-8 read as iso-8859-1).
The picture is in Picasa Web, if anyone wants to test:
http://picasaweb.google.gr/demetris.pics/2
The IPTC field I used for testing is: caption-abstract. Notice that the text appears fine in Picasa Web.
#12
@
16 years ago
- Milestone changed from 2.8 to Future Release
Punting to Future Release pending a patch, now that we have test data.
#13
@
16 years ago
iptcparse() is used to gather the IPTC data. The data is assumed to be ISO-8859-1 encoded and is then converted to UTF-8 by utf8_encode().
This is not a swfupload issue. It is an encoding issue caused by the handling within WordPress.
So the question is: how is the IPTC encoding done? Is it marked? At the very least, a test for UTF-8 data seems reasonable to me. If the data is already valid UTF-8, it is good to go. If not, Latin-1 can be assumed and encoded into UTF-8 (as is done now).
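The check proposed here can be sketched roughly as follows. This is only an illustration in Python, not WordPress's PHP code, and the function name `iptc_text_to_unicode` is hypothetical. It tries a strict UTF-8 decode first and falls back to ISO-8859-1 only when the bytes are not valid UTF-8.

```python
def iptc_text_to_unicode(raw: bytes) -> str:
    """Decode an IPTC field value, preferring UTF-8 over ISO-8859-1.

    Hypothetical helper for illustration; not part of WordPress.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")
```

Pure ASCII text decodes identically either way. The heuristic can still misread a Latin-1 string whose bytes happen to form valid UTF-8, which should be rare in real captions.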
#14
@
16 years ago
- Component changed from Upload to Charset
- Keywords needs-patch 2nd-opinion added; has-patch reporter-feedback needs-testing removed
- Milestone changed from Future Release to 2.8
@hakre: would you like charset issues to be automatically assigned to you?
#15
@
16 years ago
- Keywords has-patch needs-testing added; needs-patch removed
Fix. The patch checks whether the data is already UTF-8 before encoding it. And that's it (for now). Tested with the provided test image and it does the job. Copied and pasted from the description textbox: "This is a comment. / Это комментарий. / Βλέπετε ένα σχόλιο.".
So this does the job without following the specs. For those I dropped a link in the docblock. The subject is IPTC IIM IRB (IPTC Information Interchange Model Image Resource Blocks), noted here in case this issue needs more attention in the future.
@Denis: you can do so, let's see how it goes.
#17
@
16 years ago
- Summary changed from Character encoding in swfupload misinterpreted as ISO-8859-1 to IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1
@var_dump() - it's getting too late over here. removed it from the patch.
#20
@
16 years ago
I cannot find that discussion. Any details on the current state? Any problems with the patch?
#21
@
16 years ago
IRC discussion was about punting any non-critical bugs from 2.8 due to pending release.
#23
@
15 years ago
- Milestone changed from 2.9 to 2.8.2
Then I put this on the list for 2.8.2, because I assume 2.8.1 should be for criticals as well.
#30
@
15 years ago
I just got some more files (from various systems) and some code from a coder I met at a conference in September. I will test this during the current bug hunt. I'm pretty sure this will move the topic forward.
#31
@
15 years ago
- Keywords needs-unit-tests added
- Milestone changed from 2.9 to 3.0
Move to 3.0 for now.
#37
@
14 years ago
- Keywords has-patch tested early removed
- Milestone changed from Awaiting Triage to Future Release
#38
@
14 years ago
from php.ini-development (as of PHP 5.3.5), see the EXIF section:
[exif]
; Exif UNICODE user comments are handled as UCS-2BE/UCS-2LE and JIS as JIS.
; With mbstring support this will automatically be converted into the encoding
; given by corresponding encode setting. When empty mbstring.internal_encoding
; is used. For the decode settings you can distinguish between motorola and
; intel byte order. A decode setting cannot be empty.
; http://php.net/exif.encode-unicode
;exif.encode_unicode = ISO-8859-15
; http://php.net/exif.decode-unicode-motorola
;exif.decode_unicode_motorola = UCS-2BE
; http://php.net/exif.decode-unicode-intel
;exif.decode_unicode_intel = UCS-2LE
; http://php.net/exif.encode-jis
;exif.encode_jis =
; http://php.net/exif.decode-jis-motorola
;exif.decode_jis_motorola = JIS
; http://php.net/exif.decode-jis-intel
;exif.decode_jis_intel = JIS
I just stumbled over that information; this might be related to a configuration issue as well.
#39
@
14 years ago
Looks like the tags are double-encoded if they are already in UTF-8. The approach from 9417.2.patch works for me. Refreshed for current trunk.
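The double-encoding effect can be reproduced in a few lines. The sketch below uses Python rather than PHP, and assumes that an unconditional utf8_encode() amounts to reading the bytes as ISO-8859-1 and writing them out as UTF-8.

```python
original = "Это комментарий."

# Bytes as stored in the image: already UTF-8.
stored = original.encode("utf-8")

# Reading those bytes as ISO-8859-1 and converting to UTF-8 again,
# which is what an unconditional Latin-1 to UTF-8 conversion does.
double_encoded = stored.decode("iso-8859-1").encode("utf-8")

# What ends up in the Title and Description fields.
garbled = double_encoded.decode("utf-8")

# The damage is reversible as long as no bytes were dropped.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```

Every non-ASCII byte of the original grows into two bytes, which is why the text shows up as pairs of accented Latin characters.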
#42
follow-up:
↓ 43
@
13 years ago
- Keywords punt has-patch added
We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.
#43
in reply to:
↑ 42
@
13 years ago
- Keywords 3.4-early added
- Milestone changed from 3.3 to Future Release
Replying to nacin:
We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.
Still no unit tests, punting.
#48
follow-up:
↓ 49
@
12 years ago
- Milestone changed from Future Release to 3.5
The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this. Some research on the standard would be great — otherwise, this seems fine.
#49
in reply to:
↑ 48
follow-up:
↓ 50
@
12 years ago
Replying to nacin:
The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.
From my testing, the $iptc['1#090'] marker may not always be present: ticket:20408:3.
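One way to combine both approaches is to trust the marker when it is present and to validate the bytes otherwise. The sketch below is in Python, not PHP, and rests on assumptions: the helper name is hypothetical, the mapping imitates what iptcparse() returns (keys such as "2#120" pointing at lists of raw values), and the UTF-8 marker is taken to be the escape sequence ESC % G, as in the php.net comment referenced in this thread.

```python
from typing import Dict, List

# ISO 2022 escape sequence ESC % G, assumed to announce UTF-8 in
# IPTC dataset 1:90 (coded character set).
UTF8_MARKER = b"\x1b%G"


def decode_iptc_field(iptc: Dict[str, List[bytes]], key: str) -> str:
    """Decode the first value stored under `key`.

    Hypothetical helper; `iptc` imitates an iptcparse()-style mapping.
    """
    raw = iptc[key][0]
    declared = (iptc.get("1#090") or [b""])[0]
    if declared == UTF8_MARKER:
        # The file declares UTF-8 explicitly; replace malformed bytes
        # instead of failing on a damaged file.
        return raw.decode("utf-8", errors="replace")
    try:
        # No usable marker: fall back to validating the bytes.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

With this ordering, files that carry the marker are handled by declaration, and files without it still get the validity check rather than being double-encoded.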
#50
in reply to:
↑ 49
@
12 years ago
Replying to SergeyBiryukov:
Replying to nacin:
The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.
From my testing, the $iptc['1#090'] marker may not always be present: ticket:20408:3.
I don't know if there is a reliable source. I got it from here: http://php.net/manual/en/function.iptcparse.php#105025
I tried to Google for a reliable source but with no luck...
a sample image might help