Make WordPress Core

Opened 4 weeks ago

Last modified 3 weeks ago

#64842 new defect (bug)

Upload problems with Umlauts in ID3 Tags

Reported by: claireschlamm Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 6.9.1
Component: Upload Keywords: has-patch needs-testing
Focuses: Cc:

Description

Dear developers,

We run into problems on our site when uploading MP3 files whose ID3 tags contain German umlauts (äöüÄÖÜß). The upload fails with the error “Could not insert attachment into database”.

Example file: https://cba.media/wp-content/uploads/example_with_umlaut.mp3

This issue was first reported here: https://wordpress.org/support/topic/upload-problems-with-umlauts-in-id3-tags/#post-16189804 and it still persists.

Thanks for your help and best regards

Attachments (1)

64842.3.diff (2.6 KB) - added by abhishekfdd 3 weeks ago.


Change History (2)

#1 @abhishekfdd
3 weeks ago

  • Keywords has-patch needs-testing added

I was able to reproduce this. Uploading the example MP3 via Media > Add Media File fails with "Could not insert attachment into database." However, uploading the same file inside a post using the Audio or File block succeeds.

The difference points to two separate code paths:

  • Media Library upload uses media_handle_upload() in wp-admin/includes/media.php.
  • Block editor upload uses the REST API (/wp/v2/media) via WP_REST_Attachments_Controller, which handles metadata differently.

Root cause:

The ID3v1 specification mandates ISO-8859-1 encoding for tag values. German umlauts like äöüÄÖÜß are valid ISO-8859-1 characters, but their single-byte ISO-8859-1 encodings (for example 0xE4 for ä) are not valid UTF-8 byte sequences.
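
To illustrate the encoding mismatch described above, here is a minimal sketch using PHP's mbstring extension. The sample word and the variable names are assumptions chosen for the example; they are not taken from the reporter's file.

```php
<?php
// "Schön" as an ISO-8859-1 tag would store it: a single byte 0xF6 for "ö".
$latin1 = "Sch\xF6n";

// The same word in UTF-8: "ö" is the two-byte sequence 0xC3 0xB6.
$utf8 = "Sch\xC3\xB6n";

// The ISO-8859-1 bytes are not a valid UTF-8 sequence; the UTF-8 bytes are.
var_dump( mb_check_encoding( $latin1, 'UTF-8' ) ); // bool(false)
var_dump( mb_check_encoding( $utf8, 'UTF-8' ) );   // bool(true)
```

A string like $latin1 is the kind of value that would be rejected once it reaches a UTF-8 database column.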

The getID3 library (bundled in wp-includes/ID3/) is configured with $encoding = 'UTF-8' and should convert ID3v1 tags from ISO-8859-1 to UTF-8. However, in certain cases — particularly when files have both ID3v1 and ID3v2 tags, or when tag editors write non-standard encodings — the conversion doesn't happen correctly.

In wp_add_id3_tag_data(), these potentially invalid-UTF-8 tag values are passed through wp_kses_post(), which does not fix encoding issues. The values then flow into media_handle_upload():

  1. $meta['title'] is assigned directly to $title without sanitize_text_field() (the filename-based title gets sanitize_text_field(), but the ID3 title does not).
  2. $title, $meta['album'], $meta['artist'], and $meta['genre'] are interpolated into $content via sprintf().
  3. Both post_title and post_content are passed to wp_insert_attachment() → wp_insert_post().
  4. MySQL rejects the invalid UTF-8, and the insertion fails.

Patch:

Attaching 64842.3.diff which addresses this in three ways:

  1. Introduces _wp_id3_ensure_utf8() — a private helper in media.php that detects invalid UTF-8 and converts from Windows-1252 (a superset of ISO-8859-1 covering the ID3v1 spec encoding). This preserves the actual umlaut characters rather than stripping them.
  2. Applies the conversion in wp_add_id3_tag_data() — each tag value is passed through _wp_id3_ensure_utf8() before wp_kses_post(), fixing the encoding at the source.
  3. Adds sanitize_text_field() on the ID3 title in media_handle_upload() — currently the ID3-sourced title is assigned raw, unlike the filename-based fallback.

I chose mb_convert_encoding() with 'Windows-1252' rather than 'ISO-8859-1' as the source encoding because the two agree on every byte except 0x80–0x9F, where ISO-8859-1 has no printable characters and Windows-1252 defines most of them (such as curly quotes and the euro sign). Windows-1252 is also what most real-world tag editors actually use.
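
The detect-then-convert approach described above could look roughly like the following sketch. It is not the code from 64842.3.diff: the function name example_id3_ensure_utf8() and the sample tag value are hypothetical, and the sketch assumes the mbstring extension is available.

```php
<?php
/**
 * Hypothetical helper, for illustration only (not the patch's implementation).
 * Returns the value unchanged if it is already valid UTF-8; otherwise treats
 * it as Windows-1252 and converts it to UTF-8.
 */
function example_id3_ensure_utf8( $value ) {
	// Leave non-strings and empty strings alone.
	if ( ! is_string( $value ) || '' === $value ) {
		return $value;
	}

	// Already valid UTF-8 (this includes plain ASCII): nothing to do.
	if ( mb_check_encoding( $value, 'UTF-8' ) ) {
		return $value;
	}

	// Assumption: anything that is not valid UTF-8 is Windows-1252.
	return mb_convert_encoding( $value, 'UTF-8', 'Windows-1252' );
}

// "Grüße aus Köln" encoded as single bytes, as an ID3v1 tag would store it.
$title = example_id3_ensure_utf8( "Gr\xFC\xDFe aus K\xF6ln" );

var_dump( mb_check_encoding( $title, 'UTF-8' ) ); // bool(true)
```

Checking validity first keeps the helper idempotent: values that are already UTF-8 pass through untouched, so they are not double-encoded.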

Testing:

  1. Download the reporter's example file from https://cba.media/wp-content/uploads/example_with_umlaut.mp3
  2. Without patch: upload via Media > Add Media File → fails with "Could not insert attachment into database"
  3. With patch: upload succeeds; the attachment title and description preserve the German umlauts correctly
  4. Also verify that uploading the same file via the Audio/File block in the editor still works (no regression)
  5. Test with a file containing only ASCII ID3 tags to confirm no regression on normal uploads

Note: The recent UTF-8 modernization work in #63863 (WordPress 6.9) improves wp_check_invalid_utf8() with replacement characters, but that function is designed for strings that are *nominally* UTF-8 with some bad bytes. Here the problem is that the entire string is in a *different encoding* (ISO-8859-1), so conversion is the correct approach rather than replacement.
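
The difference between replacement and conversion can be sketched with PHP's own mbstring functions. Here mb_scrub() is only a stand-in for a replacement-style cleanup; it is not wp_check_invalid_utf8(), and the sample value is an assumption.

```php
<?php
// "Köln" with "ö" stored as the single ISO-8859-1 byte 0xF6.
$tag = "K\xF6ln";

// Replacement: the invalid byte is substituted, so the "ö" is lost.
$scrubbed = mb_scrub( $tag, 'UTF-8' );

// Conversion: the byte is reinterpreted in its real encoding and kept.
$converted = mb_convert_encoding( $tag, 'UTF-8', 'Windows-1252' );

var_dump( "K\xC3\xB6ln" === $scrubbed );  // bool(false)
var_dump( "K\xC3\xB6ln" === $converted ); // bool(true)
```

Both results are valid UTF-8 and would be accepted by the database, but only the converted one still contains the umlaut.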

@abhishekfdd
3 weeks ago
