#61182 closed enhancement (fixed)
Normalize UTF-8 charset slug detection.
Reported by: | dmsnell | Owned by: | dmsnell |
---|---|---|---|
Milestone: | 6.6 | Priority: | normal |
Severity: | normal | Version: | 6.6 |
Component: | General | Keywords: | has-unit-tests has-patch |
Focuses: | Cc: |
Description
There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The _canonical_charset() method exists, but is marked private for use.
In this patch the new unicode module provides is_utf8_charset() as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including _canonical_charset().
Finally, the test functions governing _canonical_charset() have been rewritten as a single test with a data provider instead of as separate test functions.
This patch raises a couple of questions:
Is there a need to normalize the ISO-8859-1 charsets since htmlspecialchars() does not need it?
Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that htmlspecialchars() does not choke?
One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached? If it's already cached then I suspect further caching or use of static vars will only bloat the memory footprint of Core without speeding it up.
Change History (14)
This ticket was mentioned in PR #6535 on WordPress/wordpress-develop by @dmsnell.
4 months ago
#1
#2
in reply to:
↑ description
@
4 months ago
Replying to dmsnell:
One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached?
It is already cached, so performance should not be an issue here.
#3
@
4 months ago
Thank you @SergeyBiryukov - I had hoped and thought so, but was not confident. I'll re-examine how it's used because I think we have some old code adding caches ontop of other caches, which in this case only seems like it would bloat the memory use, albeit insignificantly.
#4
@
4 months ago
- Owner set to dmsnell
- Resolution set to fixed
- Status changed from new to closed
In 58147:
#7
@
4 months ago
- Resolution fixed deleted
- Status changed from closed to reopened
Thanks for the commit!
Requiring an extra file just for a single function does not seem ideal to me, could is_utf8_charset()
be moved to either wp-includes/functions.php
or wp-includes/load.php
instead?
#8
@
4 months ago
it could @SergeyBiryukov, but there are a few reasons I think it may be fine to leave.
load.php
is already 2000 lines and can be slow in an IDE to work in if the IDE is applying styling lints and code sniffs.- there are many other common Unicode-related functions that I think fit there, including some that are already scattered and duplicated among
kses
andformatting
and elsewhere. - "unicode" is a somewhat well-formed category on its own. If looking for some related functionality I might look for a
string.php
,strings.php
,utf8.php
, orunicode.php
but wouldn't think to lookupload.php
Happy to adapt; just wanted to share that when I proposed this I did consider leaving it even inside of functions.php
where _canonical_charset()
lives, but for similar reasoning felt it was a good fit for a new home.
#9
@
4 months ago
- Keywords needs-patch added; has-patch removed
I don't think these are compelling enough reasons to create a new file just for this 1 function. If we want to move more similar functions together in a new file, then that can be done separately in one go.
Also, right now, this separate location causes a fatal error when a site is in maintenance mode.
<br /> <b>Fatal error</b>: Uncaught Error: Call to undefined function is_utf8_charset() in /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php:7481 Stack trace: #0 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(4310): _canonical_charset('utf-8') #1 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(3804): _wp_die_process_input('Briefly unavail...', 'Maintenance', Array) #2 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(3787): _default_wp_die_handler('Briefly unavail...', 'Maintenance', Array) #3 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/load.php(388): wp_die('Briefly unavail...', 'Maintenance', Array) #4 /Users/pascalb/Local Sites/media-experiments/app/public/wp-settings.php(76): wp_maintenance() #5 /Users/pascalb/Local Sites/media-experiments/app/public/wp-config.php(95): require_once('/Users/pascalb/...') #6 /Users/pascalb/Local Sites/media-experiments/app/public/wp-load.php(50): require_once('/Users/pascalb/...') #7 /Users/pascalb/Local Sites/media-experiments/app/public/wp-login.php(12): require('/Users/pascalb/...') #8 {main} thrown in <b>/Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php</b> on line <b>7481</b><br />
wp_maintenance()
is called early in wp-settings.php
, before unicode.php
is loaded, and exits. wp_maintenance()
manually loads functions.php
and then exits. Hence the error.
That's another reason why this should be moved to functions.php
This ticket was mentioned in PR #6568 on WordPress/wordpress-develop by @dmsnell.
4 months ago
#10
- Keywords has-patch added; needs-patch removed
Trac ticket: Core-61182
This caused issues in maintenance mode, and it's not warranted to have its own module. This will live alongside _canonical_charset()
, it's partner function.
Props: dmsnell, sergeybiryukov, swisspiddy.
Follow-up to: [58148].
#11
@
4 months ago
Nice catch @swissspidy - I wasn't aware of the need to test in maintenance mode. I'll try and remember that.
Since both of you felt it important anyway to merge this into functions.php
, I've done so in #6568
Trac ticket: Core-61182
There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The
_canonical_charset()
method exists, but is marked private for use.In this patch the new
unicode
module providesis_utf8_charset()
as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including_canonical_charset()
.Finally, the test functions governing
_canonical_charset()
have been rewritten as a single test with a data provider instead of as separate test functions.This patch raises a couple of questions:
htmlspecialchars()
does not need it?htmlspecialchars()
does not choke?One important question I'd like your help on, if you know the answer: _is it still important to avoid calling
get_option( 'blog_charset' )
multiple times in a row, or is that already cached?_ If it's already cached then I suspect further caching or use ofstatic
vars will only bloat the memory footprint of Core without speeding it up.