Opened 4 years ago
Closed 8 months ago
#9417 closed defect (bug) (fixed)
IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | 3.5 |
| Component: | Charset | Version: | 2.7.1 |
| Severity: | major | Keywords: | has-patch |
| Cc: | dkikizas@…, westi, yoav@… |
Description
When uploading a JPEG image with IPTC tags in the EXIF that are encoded as UTF-8, the tags are misinterpreted as ISO-8859-1 when they get extracted into the Title and Description fields in swfupload.
This occurs regardless of the encoding set in WP options (which is UTF-8 by default).
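The symptom can be reproduced at the byte level. The following is a minimal Python sketch, not WordPress code: it assumes a caption stored as UTF-8 bytes (the sample string is taken from the test image mentioned later in this ticket) and shows what happens when each byte is read as one ISO-8859-1 character.

```python
# Text as it might be stored in an IPTC caption field, encoded as UTF-8.
original = "Это комментарий."
stored_bytes = original.encode("utf-8")

# Misreading: every byte is treated as one ISO-8859-1 character,
# so each multi-byte UTF-8 sequence turns into several wrong characters.
misread = stored_bytes.decode("iso-8859-1")

# Correct reading: the same bytes decoded as UTF-8.
correct = stored_bytes.decode("utf-8")

print(misread)
print(correct)
```

The misread string has one character per stored byte, which is why the garbled text is longer than the original caption.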
Attachments (6)
Change History (57)
comment:1
kallewangstedt — 4 years ago
- Milestone changed from Unassigned to 2.8
- Keywords needs-patch reporter-feedback added
In a check yesterday I encountered ISO-encoded files in SWF-Uploader. I guess this is the root of the problem. A core-dev should ensure that all files are UTF-8 encoded. I'll check whether that is patchable.
- Keywords has-patch needs-testing added; needs-patch removed
Please test
This actually fixed things on your end? :-)
comment:6
in reply to:
↑ 4
Denis-de-Bernardy — 4 years ago
Replying to hakre:
Please test
Would love to, but wouldn't know how... or where to find the needed test data.
see also #9413
I could not test because I do not have a JPEG file with that EXIF data, but maybe this is related, since with encodings the chain of data must stay intact.
hehe, sounds like one of those bugs that won't get fixed from lack of valid test cases. :D
comment:10
hakre — 4 years ago
+1 having the files consistently encoded anyway. they are loaded in from utf-8 output (normally) and therefore should be utf-8 encoded as well.
comment:11
demetris — 4 years ago
- Cc dkikizas@… added
Bug confirmed. I see exactly what kallewangstedt describes.
hakre’s patch does not help. (I think the patched file is not used at all, as it is merged and minified into swfupload-all.js, and, in any case, the patch just changes one character in a comment block.)
I attached a picture, for anyone to see how the characters appear (utf-8 read as iso-8859-1).
The picture is in Picasa Web, if anyone wants to test:
http://picasaweb.google.gr/demetris.pics/2
The IPTC field I used for testing is: caption-abstract. Notice that the text appears fine in Picasa Web.
Sample JPEG file with trilingual UTF-8 text in the IPTC caption-abstract field
Screenshot of Flash uploader with UTF-8 text in IPTC field read as ISO-8859-1
- Milestone changed from 2.8 to Future Release
punting to future pending a patch, now that we've test data.
comment:13
hakre — 4 years ago
iptcparse() is used to gather iptc data. it is assumed that the iptc data is ISO-8859-1 encoded, and it is then converted to utf8 by utf8_encode().
this is not a swfupload issue. this is an encoding issue based on the handling within wordpress.
so the question is: how is the iptc encoding done? is it marked? at least a test for having utf8 data in there seems reasonable to me. if it is already valid utf8 then good to go. if not, latin-1 can be assumed and encoded into utf8 (as it is done now).
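The check proposed here can be sketched as follows. This is a Python illustration of the idea only, not the patch attached to this ticket; the function name `iptc_text_to_unicode` is hypothetical.

```python
def iptc_text_to_unicode(raw: bytes) -> str:
    """Decode raw IPTC field bytes: accept valid UTF-8 as-is,
    otherwise assume ISO-8859-1 (the previous default behavior)."""
    try:
        # Strict decoding raises if the bytes are not well-formed UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value is defined in ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


# UTF-8 input is passed through unchanged.
print(iptc_text_to_unicode("Βλέπετε ένα σχόλιο.".encode("utf-8")))
# Latin-1 input that is not valid UTF-8 takes the fallback path.
print(iptc_text_to_unicode(b"caf\xe9"))
```

The test is a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8 and would be read as UTF-8. For ordinary captions that collision is rare, but it is a limitation of any approach that does not rely on an explicit encoding marker.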
- Component changed from Upload to Charset
- Keywords needs-patch 2nd-opinion added; has-patch reporter-feedback needs-testing removed
- Milestone changed from Future Release to 2.8
@hakre: would you like charset issues to be automatically assigned to you?
comment:15
hakre — 4 years ago
- Keywords has-patch needs-testing added; needs-patch removed
Fix. The patch checks whether the data is already UTF-8 before encoding it. And that's it (for now). Tested with the provided test-image and it does the job. Copy and Paste from the description textbox: "This is a comment. / Это комментарий. / Βλέπετε ένα σχόλιο.".
So this does the job without following the specs. For those I dropped a link in the docblock. The subject is IPTC IIM IRB (IPTC Information Interchange Model Image Resource Blocks) just for note in case this issue needs more attention in the future.
@Denis: you can do so, let's look how it goes.
done. you left a var_dump() in your patch.
comment:17
hakre — 4 years ago
- Summary changed from Character encoding in swfupload misinterpreted as ISO-8859-1 to IPTC IIM IRB character encoding (UTF-8) misinterpreted as ISO-8859-1
@var_dump() - it's getting too late over here. removed it from the patch.
comment:18
westi — 4 years ago
- Cc westi added
- Milestone changed from 2.8 to 2.9
punting per IRC discussion
comment:20
hakre — 4 years ago
can not find that discussion? any details on current state? any problems with the patch?
comment:21
dd32 — 4 years ago
IRC discussion was about punting any non-critical bugs from 2.8 due to pending release.
- Keywords 2nd-opinion removed
comment:23
hakre — 4 years ago
- Milestone changed from 2.9 to 2.8.2
Then I put this on the list for 2.8.2, because I assume 2.8.1 should be for criticals as well.
comment:24
hakre — 4 years ago
- Milestone changed from 2.8.5 to 2.9
comment:25
hakre — 4 years ago
related: #9417
comment:26
hakre — 4 years ago
related: #6412
comment:27
hakre — 4 years ago
- Keywords tested added; needs-testing removed
comment:29
westi — 4 years ago
- Owner set to westi
- Status changed from new to reviewing
comment:30
hakre — 4 years ago
I just got some more files (from various systems) and some code by a coder I met at a conf in September. I will test this within the current bughunt. I'm pretty sure this will improve the topic.
comment:31
westi — 3 years ago
- Keywords needs-unit-tests added
- Milestone changed from 2.9 to 3.0
Move to 3.0 for now.
comment:32
hakre — 3 years ago
Related: #7580
comment:33
hakre — 3 years ago
Reference: #11547
comment:34
hakre — 3 years ago
Related: #11417
comment:35
miqrogroove — 3 years ago
comment:36
nacin — 3 years ago
- Keywords needs-refresh early added
- Milestone changed from 3.0 to 3.1
comment:37
nacin — 3 years ago
- Keywords has-patch tested early removed
- Milestone changed from Awaiting Triage to Future Release
comment:38
hakre — 2 years ago
from php.ini-development (as of PHP 5.3.5), see the EXIF section:
[exif]
; Exif UNICODE user comments are handled as UCS-2BE/UCS-2LE and JIS as JIS.
; With mbstring support this will automatically be converted into the encoding
; given by corresponding encode setting. When empty mbstring.internal_encoding
; is used. For the decode settings you can distinguish between motorola and
; intel byte order. A decode setting cannot be empty.
; http://php.net/exif.encode-unicode
;exif.encode_unicode = ISO-8859-15
; http://php.net/exif.decode-unicode-motorola
;exif.decode_unicode_motorola = UCS-2BE
; http://php.net/exif.decode-unicode-intel
;exif.decode_unicode_intel = UCS-2LE
; http://php.net/exif.encode-jis
;exif.encode_jis =
; http://php.net/exif.decode-jis-motorola
;exif.decode_jis_motorola = JIS
; http://php.net/exif.decode-jis-intel
;exif.decode_jis_intel = JIS
I just stumbled over that information; this might be related to a configuration issue as well.
comment:39
SergeyBiryukov — 2 years ago
Looks like the tags are double-encoded if they are already in UTF-8. The approach from 9417.2.patch works for me. Refreshed for current trunk.
comment:40
SergeyBiryukov — 2 years ago
- Keywords needs-refresh removed
- Milestone changed from Future Release to 3.3
comment:42
follow-up:
↓ 43
nacin — 19 months ago
- Keywords punt has-patch added
We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.
comment:43
in reply to:
↑ 42
westi — 18 months ago
- Keywords 3.4-early added
- Milestone changed from 3.3 to Future Release
Replying to nacin:
We need unit tests here. Anyone want to tackle it now? Looks good though. Probably long past time to punt it.
Still no unit tests; punting.
- Keywords punt 3.4-early removed
Closed #20408 as a duplicate.
comment:47
SergeyBiryukov — 8 months ago
#21903 was marked as a duplicate.
comment:48
follow-up:
↓ 49
nacin — 8 months ago
- Milestone changed from Future Release to 3.5
The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this. Some research on the standard would be great — otherwise, this seems fine.
comment:49
in reply to:
↑ 48
follow-up:
↓ 50
SergeyBiryukov — 8 months ago
Replying to nacin:
The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.
From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.
comment:50
in reply to:
↑ 49
chenxing — 8 months ago
Replying to SergeyBiryukov:
Replying to nacin:
The patch on #21903 tries to use the IPTC identifier for UTF-8. If that is consistently accurate, I prefer it over this.
From my testing, $iptc['1#090'] marker may not always be present: ticket:20408:3.
I don't know if there is a reliable source. I got it from here: http://php.net/manual/en/function.iptcparse.php#105025
I tried to Google for a reliable source but with no luck...
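For context, the 1#090 entry corresponds to the IPTC "Coded Character Set" dataset, and the ISO 2022 escape sequence ESC % G in that field announces UTF-8. Since the marker may be missing, as noted above, one option is to trust it when present and fall back to a validity check otherwise. The sketch below is a Python illustration under assumptions: the dictionary mimics the shape of PHP's iptcparse() result (keys such as "2#120" mapping to lists of byte strings), and `decode_iptc_field` is a hypothetical helper, not part of WordPress.

```python
# ESC % G: the ISO 2022 escape sequence that selects UTF-8.
UTF8_MARKER = b"\x1b%G"


def decode_iptc_field(iptc: dict, key: str) -> str:
    """Decode the first value of an IPTC field, using the 1#090 marker
    when it is present and a UTF-8 validity check when it is not."""
    raw = iptc[key][0]
    charset = iptc.get("1#090", [b""])[0]

    if charset == UTF8_MARKER:
        # The file declares UTF-8; replace any stray invalid bytes.
        return raw.decode("utf-8", errors="replace")

    # No usable marker: accept well-formed UTF-8, else assume ISO-8859-1.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


marked = {"1#090": [b"\x1b%G"], "2#120": ["Это комментарий.".encode("utf-8")]}
unmarked = {"2#120": [b"caf\xe9"]}
print(decode_iptc_field(marked, "2#120"))
print(decode_iptc_field(unmarked, "2#120"))
```

Whether real-world files set the marker consistently is exactly the open question in this thread, so the fallback branch carries most of the weight in practice.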
comment:51
nacin — 8 months ago
- Resolution set to fixed
- Status changed from reviewing to closed
In [21905]:

a sample image might help