Make WordPress Core

Opened 11 months ago

Closed 7 months ago

#62995 closed defect (bug) (fixed)

Uploading Mac screenshots results in broken images, due to question marks inserted in filenames

Reported by: room34's profile room34 Owned by: dmsnell's profile dmsnell
Milestone: 6.8.2 Priority: normal
Severity: normal Version:
Component: Media Keywords: reporter-feedback has-patch has-unit-tests dev-reviewed fixed-major
Focuses: administration Cc:

Description

macOS 15 Sequoia (and possibly earlier versions) uses a special character, rather than a regular ASCII space, in screenshot filenames, between the time and the AM/PM. For instance:

Screenshot 2025-02-19 at 2.17.33 PM.png

It may get lost here, but the character before PM in that filename is not a regular space. I ran the filename text string through a hex converter, and it identified it as the Unicode [the narrow no-break space](https://www.fileformat.info/info/unicode/char/202f/index.htm) (U+202F).

The problem is, when uploading a file with that type of filename (tested in both Safari and Firefox), the WordPress function that standardizes Media Library filenames converts that character into a question mark, rather than a hyphen.

For obvious reasons, there should never be a question mark in filenames in the Media Library. The result is of course that the browser interprets the part after the question mark as a query string, and now the URL is requesting a non-existent file, causing a broken image on the page.

The attached screenshot shows the filename with a question mark after uploading the image to the Media Library.

Attachments (2)

Screenshot 2025-02-20 at 5.10.59 PM.png (801.7 KB) - added by room34 11 months ago.
Screenshot showing filename with question mark in Media Library
Screenshot 2025-02-20 at 5.10.59 PM.zip (688.4 KB) - added by room34 11 months ago.
Zipped version of the file for testing

Download all attachments as: .zip

Change History (31)

@room34
11 months ago

Screenshot showing filename with question mark in Media Library

#1 follow-up: @audrasjb
11 months ago

  • Focuses ui removed
  • Keywords reporter-feedback added
  • Version 6.7.2 deleted

Hello and thanks for the ticket,

Could you please attach the exact file you can't upload so we can try to reproduce the issue?

#2 in reply to: ↑ 1 @room34
11 months ago

Replying to audrasjb:

Hello and thanks for the ticket,

Could you please attach the exact file you can't upload so we can try to reproduce the issue?

The file I uploaded is a screenshot itself with the same type of filename, so it could be used for testing. However I suspect its filename has been converted by Trac in a different way and the issue would not happen. Testing can be done easily by anyone on a Mac by just creating a screenshot and trying to upload it.

#3 @audrasjb
11 months ago

@room34 I can confirm this file can't be used to reproduce the issue. Maybe compress your initial file to a zip file and try to attach it?

#4 follow-up: @room34
11 months ago

@audrasjb I think it’s possible/likely that you need to be on a Mac file system to properly reproduce the issue, so you or whoever is trying to reproduce could just take a screenshot on that system and use it for testing.

#5 in reply to: ↑ 4 @audrasjb
11 months ago

Replying to room34:

@audrasjb I think it’s possible/likely that you need to be on a Mac file system to properly reproduce the issue, so you or whoever is trying to reproduce could just take a screenshot on that system and use it for testing.

I am on a Mac system. But as my computer's locale is French, I don't have the same default file names than yours.

Version 0, edited 11 months ago by audrasjb (next)

#6 @room34
11 months ago

@audrasjb Of course… I apologize for my US-centric mindset. I've zipped the file and am adding it as another attachment.

@room34
11 months ago

Zipped version of the file for testing

#7 @audrasjb
11 months ago

Thanks! By the way I can't reproduce the issue on my side: the media is successfully uploaded. Maybe it's an environment issue on your side?
Could you please try on a fresh instance of WP? Or on this Playground instance?

#8 @room34
11 months ago

Interesting...

Tested in Playground: Uploads successfully.

Tested with clean WordPress install of a new site on the same server: Uploads successfully.

Tested on original site with all plugins deactivated, Twenty Twenty-Five theme, and WP core reinstalled: Error still occurs.

It looks like it has to do with the data tables. The original site where the error is happening is a very old installation. It's running 6.7.2, but it was originally created with WordPress 2.1 in 2007.

I noticed the installations where this was successful did not replace the narrow no-break space character in the filename. I checked the database, and the core WP tables have "latin1_swedish_ci" collation, so they're not UTF-8. That would explain the question marks.

Anyway… I do think this still points to an issue that should be addressed. Since WordPress is already modifying Media Library filenames to be maximally compatible for URLs, I would think ideally it would strip out most non-ASCII characters, at least non-essential ones, like the narrow no-break space.

Last edited 11 months ago by room34 (previous) (diff)

#9 @matt
7 months ago

#63585 was marked as a duplicate.

#10 follow-up: @dmsnell
7 months ago

For obvious reasons, there should never be a question mark in filenames in the Media Library.

As a side mention, I would pose that it’s not obvious that questions marks should be disallowed, or any character other than the ASCII “/”, since that is used across various filesystems as the path separator.

HTML has always supported URLs with arbitrary characters in them, so I think there are two issues at play.

The media library is not properly forming encoded URLs.

Even if we resolve the name transformation in sanitize_file_name() we would want to ensure that we build the proper URLs when rendering HTML to the browser. The proper way to create a URL for a filename with a question mark in it is to percent-escape it.

u = new URL( 'https://wordpress.org' );
u.pathname = '/wp-content/uploads/Screenshot 2025-02-19 at 2.17.33 PM.png';
u.href === 'https://wordpress.org/wp-content/uploads/Screenshot%202025-02-19%20at%202.17.33%E2%80%AFPM.png'

We can see how the browser is implementing the WHATWG URL spec for URL-path-segment string and "convert(ing) to percent-encoded bytes" the "Code points greater than U+007F."

There is some good news here. There have been some recent developments to add spec-compliant URL handling into PHP itself, and many of WordPress’ needs were incorporated into that work. For now we will start with a URL parser, but hopefully in the coming releases we will see added functionality for building URLs (because currently and previous to this new work, PHP has never had tools in its standard library to properly handle URLs).

This is its own bug and we could fix it today.

Filenames could be more conservatively renamed to avoid problems in buggy code elsewhere.

That being said, like most things I’ve encountered in WordPress where text or HTML encoding comes into play, there are likely a number of places that do not properly handle these valid filenames or URLs.

One thing I like about the wording in some of the linked tickets is the intentionality of the transformations. Something like shareable_filename() could communicate the intent to produce a filename which is more resilient to systems which apply their own non-standard rules. Each transformation inside that function can be linked to a particular known system or rule of thumb…

  • Facebook doesn’t allow…
  • AppEngine rejects control points…
  • Many HTML systems will trip up on question mark or hash sign/pound/octothorpe

Maybe I’ll doodle on some of this, but I think it would be helpful for us to consider both parts of why this is broken, as one is uncontroversially broken and the fix is clear (produce spec-compliant URLs to avoid sending browsers the wrong URL) while the other is more subjective and includes WordPress extending known specifications with its own rules.

#11 in reply to: ↑ 10 @room34
7 months ago

Replying to dmsnell:

For obvious reasons, there should never be a question mark in filenames in the Media Library.

As a side mention, I would pose that it’s not obvious that questions marks should be disallowed, or any character other than the ASCII “/”, since that is used across various filesystems as the path separator.

HTML has always supported URLs with arbitrary characters in them, so I think there are two issues at play.

I would respectfully disagree with this, since like the /, the ? also plays a specific function in URLs. You can't have a question mark in a filename in a URL unless you escape it. Better just to not support it at all, I think.

#12 follow-up: @dmsnell
7 months ago

Briefly confirmed:

  • sanitize_file_name() does in fact already remove the ? character from a filename.
  • The U+202F is replaced on save in the database for @room34 because the table’s collation does not have a way to represent that character and thus uses the replacement character ? instead.
    • This particular ticket is highlighting a symptom of a much broader problem, which is that WordPress is agnostic to text encodings. (see #62172)
    • This is not solved via blog_charset and solutions interacting with blog_charset risk introducing bigger issues
  • Many places perform HTML-escaping on the URL but do not attempt to perform URL-escaping on the URL.

It’s possible we could circumvent all sorts of issues simply by creating proper URLs for these attachments. The filenames are not causing the problems, and if we percent-escaped non-ASCII characters in a URL then we should avoid all sorts of issues for database tables with non-UTF-8 collations as well, because all more-or-less supported character encodings for WordPress are ASCII-compatible (with exceptions, of course).

In fact if we performed this computation we could even relax the transformation on the filename since we only need to try and avoid problems when sharing the files with other services which apply their own restrictions (see the second issue in my comment above).

sanitize_file_name() currently replaces a percent sign with a dash, but I don’t know if any system has problems with filenames including a percent sign. perhaps we could drop all of the complex logic in there and use something like this instead?

<?php

function url_encoded_filename( $raw_filename_bytes ) {
        $encoded = '';
        $length  = strlen( $raw_filename_bytes );

        for ( $i = 0; $i < $length; $i++ ) {
                $allowable_length = strspn( $raw_filename_bytes, '-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~', $i );
                if ( $allowable_length > 0 ) {
                        $encoded .= substr( $raw_filename_bytes, $i, $allowable_length );
                        $i += $allowable_length;
                        continue;
                }

                $hex      = dechex( $b );
                $encoded .= "%{$hex}";
        }

        return $encoded;
}

of course this is just some doodling. a real proposal should at a minimum adhere more closely to the WHATWG URL spec on what needs to be percent-encoded. but still, the output of this function will be:

  • ASCII safe
  • fully visible (no hidden characters)
  • URL safe (though the URLs, if properly encoded, will double-escape the % signs)

more thought is warranted here but I hope that this communicates we may have an opportunity to use a comprehensive solution that eliminates the problem altogether and doesn’t have to rely on repeated bug reports and individual appearances of failures from the same root problem.

#13 in reply to: ↑ 12 @siliconforks
7 months ago

Replying to dmsnell:

  • URL safe (though the URLs, if properly encoded, will double-escape the % signs)

Wouldn't that end up mangling all non-ASCII characters in filenames (and a number of ASCII characters too), to the point of unrecognizability?

For example, suppose I have a file named alpha-beta-gamma-αβγ.png.

Currently, if I upload this to WordPress, the filename is stored unchanged. The URL would be something like https://example.test/wp-content/uploads/2025/06/alpha-beta-gamma-αβγ.png.

If I understand your proposal correctly, the filename would become alpha-beta-gamma-%CE%B1%CE%B2%CE%B3.png? And the URL would be https://example.test/wp-content/uploads/2025/06/alpha-beta-gamma-%25CE%25B1%25CE%25B2%25CE%25B3.png?

Another example: consider a file named foo bar baz.png (with spaces).

Currently, if I upload this to WordPress, the filename becomes foo-bar-baz.png. The URL becomes https://example.test/wp-content/uploads/2025/06/foo-bar-baz.png.

Under your proposal, the filename would be foo%20bar%20baz.png? And the URL would be https://example.test/wp-content/uploads/2025/06/foo%2520bar%2520baz.png?

#14 @dmsnell
7 months ago

That’s right, @siliconforks.

I’d like to know if what happened to @matt in his report is the same thing that happened to @room34 — the database table collation doesn’t support the characters.

The URL would be something like

This is not always going to be the case, as highlighted in this ticket. If you attempt to save that and the database table is set to a limited character set such as one in the ISO-8859 family (latin1 for example), then the postmeta value will be transcoded using the replacement character for unsupported characters. Here your URL would end up like this…

alpha-beta-gamma-???.png

This is true regardless of blog_charset, which I try to stress to everyone because I think its inception and eventual use ended up conflicting.


Since the ultimate problem is that WordPress isn’t properly generating the URLs and that’s split among a thousand different places (proverbially speaking, I have not searched or counted), perhaps we could compromise and specifically percent-encode only the URL-relevant characters.

A bigger issue will exist that I think isn’t called out here, that if we don’t go all the way so to speak, we may introduce problems with wp_unique_filename() because what that function returns isn’t ultimately what we store in the database. If we want to resolve it, we need to encode the filename.

We could always decode when providing download links so that the process reverses, but hopefully the specific name of the file matters less and the encoding would be fine as is. WordPress already renames the file for a few related reasons.

This ticket was mentioned in PR #9103 on WordPress/wordpress-develop by @dmsnell.


7 months ago
#15

  • Keywords has-patch added

Trac ticket: Core-62995

The sanitize_file_name() function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.

This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.

#16 @dmsnell
7 months ago

In the linked patch #​9103 I have proposed a minor update to replace all Unicode space characters with the normal space (U+0020)

This circumvents the deeper issue but resolves the common case of macOS screenshots uploading with a ?.

@desrosj commented on PR #9103:


7 months ago
#17

@dmsnell Would you be able to add some unit tests for this? Changing anything related to formatting without tests makes me pretty nervous.

@dmsnell commented on PR #9103:


7 months ago
#18

Concerning tests I will attempt to add a few but I started a meetup yesterday and don’t have guarantees on my timing.

I welcome anyone wanting to submit some test cases — since this is a PR, feel free to fork my fork and propose a PR whose target branch is this one.

git remote add dmsnell git@github.com:dmsnell/wordpress-develop.git
git fetch dmsnell
git checkout my-tests
git reset --hard dmsnell/filenames/avoid-collation-corruption
# make edits
git push -u origin my-tests

This will create a PR against this branch in my repo, which I can merge into this branch/PR.

@dmsnell commented on PR #9103:


7 months ago
#19

@desrosj what do you think? On this one I would like to have agreement of at least two separate committers.

@jonsurrell commented on PR #9103:


7 months ago
#20

This does include tests now:

phpunit --testdox --group=16330

Difference with trunk (+3 assertions):

-OK (1 test, 4 assertions)
+OK (1 test, 7 assertions)

If wp-includes/formatting.php is reverted to trunk's version, the tests fail:

_Formatting_Sanitize File Name
 ✘ Replaces spaces
   ┐
   ├ Failed asserting that two strings are identical.
   ┊ ---·Expected
   ┊ +++·Actual
   ┊ @@ @@
   ┊ -'Screenshot-2025-02-19-at-2.17.33-PM.png'
   ┊ +'Screenshot-2025-02-19-at-2.17.33 PM.png'
   │
   ╵ …/tests/phpunit/tests/formatting/sanitizeFileName.php:52
   ┴

#21 @jonsurrell
7 months ago

  • Keywords has-unit-tests added

#22 @dmsnell
7 months ago

  • Milestone changed from Awaiting Review to 6.9

#23 @dmsnell
7 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 60399:

sanitize_file_name(): Normalize all space characters to a space.

The sanitize_file_name() function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.

This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.

Developed in https://github.com/wordpress/wordpress-develop/pull/9103
Discussed in https://core.trac.wordpress.org/ticket/62995

Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam.
Fixes #62995.

#24 @audrasjb
7 months ago

Given the limited scope of this changeset and especially if it fixes known issues with recent OSX default filenames, I think we may consider backporting it, for an earlier release. What do you think?

#25 @annezazu
7 months ago

I agree on backporting for 6.8.2. Especially without a known timeline for 6.9, I'd like to see this fixed sooner rather than later.

#26 @youknowriad
7 months ago

Ok to on 6.8.2 for me as well given the potential widespread of the issue and the simplicity of the patch.

#27 @audrasjb
7 months ago

  • Keywords dev-reviewed fixed-major added
  • Milestone changed from 6.9 to 6.8.2

Moving to 6.8.2.

This comment is also a second committer sign-off for a 6.8.2 backport.

#28 @audrasjb
7 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

#29 @audrasjb
7 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 60411:

Media: Normalize all space characters to a space in sanitize_file_name().

The sanitize_file_name() function normalizes the no-break space to a normal space (U+0020) in order to prevent issues saving files with the no-break space in it.

This patch expands the replacement to all space characters, since it’s known that macOS stores a NARROW NO-BREAK SPACE (U+202F) in screenshot filenames between the time and the am/pm indicator.

There are deeper issues with the way this function works, but this patch resolves a known and common problem without raising any of the deeper refactoring questions.

Reviewed by audrasjb, youknowriad.
Merges [60399] to the 6.8 branch.
Props audrasjb, desrosj, dmsnell, jonsurrell, matt, room34, siliconforks, zieladam, annezazu.
Fixes #62995.

Note: See TracTickets for help on using tickets.