Opened 4 weeks ago
Last modified 3 weeks ago
#64842 new defect (bug)
Upload problems with Umlauts in ID3 Tags
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | Awaiting Review | Priority: | normal |
| Severity: | normal | Version: | 6.9.1 |
| Component: | Upload | Keywords: | has-patch needs-testing |
| Focuses: | | Cc: | |
Description
Dear developers,
We encounter problems on our site when uploading MP3s that contain German umlauts (äöüÄÖÜß) in their ID3 tags. The error returned is “Could not insert attachment into database”.
Example file: https://cba.media/wp-content/uploads/example_with_umlaut.mp3
This issue was posted for the first time here: https://wordpress.org/support/topic/upload-problems-with-umlauts-in-id3-tags/#post-16189804 and still persists.
Thanks for your help and best regards
Attachments (1)
I was able to reproduce this. Uploading the example MP3 via Media > Add Media File fails with "Could not insert attachment into database." However, uploading the same file inside a post using the Audio or File block succeeds.
This difference points to the two different code paths:
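- Media > Add Media File uses `media_handle_upload()` in `wp-admin/includes/media.php`.
- The Audio and File blocks upload through the REST API (`/wp/v2/media`) via `WP_REST_Attachments_Controller`, which handles metadata differently.

Root cause: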
The ID3v1 specification mandates ISO-8859-1 encoding for tag values. German umlauts like äöüÄÖÜß are valid ISO-8859-1 characters, but their single-byte ISO-8859-1 encodings (for example `0xE4` for ä) are not valid UTF-8 byte sequences.

The `getID3` library (bundled in `wp-includes/ID3/`) is configured with `$encoding = 'UTF-8'` and should convert ID3v1 tags from ISO-8859-1 to UTF-8. However, in certain cases, particularly when files have both ID3v1 and ID3v2 tags or when tag editors write non-standard encodings, the conversion doesn't happen correctly.

In `wp_add_id3_tag_data()`, these potentially invalid-UTF-8 tag values are passed through `wp_kses_post()`, which does not fix encoding issues. The values then flow into `media_handle_upload()`:

1. `$meta['title']` is assigned directly to `$title` without `sanitize_text_field()` (the filename-based title gets `sanitize_text_field()`, but the ID3 title does not).
2. `$title`, `$meta['album']`, `$meta['artist']`, and `$meta['genre']` are interpolated into `$content` via `sprintf()`.
3. `post_title` and `post_content` are passed to `wp_insert_attachment()` → `wp_insert_post()`.

The insert then fails, most likely because `wpdb` refuses to write text fields that contain bytes invalid for the column's character set.

Patch:
Attaching `64842.3.diff`, which addresses this in three ways:

1. `_wp_id3_ensure_utf8()`: a private helper in `media.php` that detects invalid UTF-8 and converts from Windows-1252 (a superset of ISO-8859-1 covering the ID3v1 spec encoding). This preserves the actual umlaut characters rather than stripping them.
2. `wp_add_id3_tag_data()`: each tag value is passed through `_wp_id3_ensure_utf8()` before `wp_kses_post()`, fixing the encoding at the source.
3. `sanitize_text_field()` on the ID3 title in `media_handle_upload()`: currently the ID3-sourced title is assigned raw, unlike the filename-based fallback.

I chose `mb_convert_encoding()` with `'Windows-1252'` as the source encoding over `'ISO-8859-1'` because Windows-1252 is a superset for practical purposes (it assigns printable characters to most bytes in `0x80`–`0x9F`, which ISO-8859-1 leaves undefined) and is what most real-world tag editors actually use.

Testing:
Example file: https://cba.media/wp-content/uploads/example_with_umlaut.mp3

Note: The recent UTF-8 modernization work in #63863 (WordPress 6.9) improves `wp_check_invalid_utf8()` with replacement characters, but that function is designed for strings that are *nominally* UTF-8 with some bad bytes. Here the problem is that the entire string is in a *different encoding* (ISO-8859-1), so conversion is the correct approach rather than replacement.
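
To make the conversion-versus-replacement point concrete, here is a minimal sketch in plain PHP that runs outside WordPress. It is not the code from `64842.3.diff`: the function name `example_ensure_utf8()` is hypothetical, and the sketch assumes the `mbstring` extension is available. It uses `mb_scrub()` only as a stand-in for a generic "replace bad bytes" strategy.

```php
<?php
/**
 * Hypothetical helper, for illustration only (not the patch itself).
 * Returns the value unchanged when it is already valid UTF-8; otherwise
 * assumes the bytes are Windows-1252 and converts them to UTF-8.
 */
function example_ensure_utf8( string $value ): string {
	if ( mb_check_encoding( $value, 'UTF-8' ) ) {
		return $value;
	}
	return mb_convert_encoding( $value, 'UTF-8', 'Windows-1252' );
}

// "Grüße" as an ISO-8859-1 tag stores it: ü = 0xFC, ß = 0xDF.
$latin1 = "Gr\xFC\xDFe";

// The same word as UTF-8 bytes: ü = 0xC3 0xBC, ß = 0xC3 0x9F.
$utf8 = "Gr\xC3\xBC\xC3\x9Fe";

// The raw ISO-8859-1 bytes are not a valid UTF-8 sequence.
var_dump( mb_check_encoding( $latin1, 'UTF-8' ) ); // bool(false)

// Conversion recovers the umlauts.
var_dump( example_ensure_utf8( $latin1 ) === $utf8 ); // bool(true)

// Already-valid UTF-8 passes through untouched, so the helper can be
// applied more than once without double-encoding.
var_dump( example_ensure_utf8( $utf8 ) === $utf8 ); // bool(true)

// Replacement yields valid UTF-8 too, but the umlauts are gone.
$scrubbed = mb_scrub( $latin1, 'UTF-8' );
var_dump( mb_check_encoding( $scrubbed, 'UTF-8' ) );   // bool(true)
var_dump( false === strpos( $scrubbed, "\xC3\xBC" ) ); // bool(true): no "ü" left
```

One caveat with this kind of heuristic: any string that fails the UTF-8 check is treated as Windows-1252, so tags written in another legacy encoding (ISO-8859-2, for example) would come out as valid UTF-8 but with the wrong characters.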