Make WordPress Core

Opened 3 weeks ago

Closed 10 days ago

Last modified 10 days ago

#61182 closed enhancement (fixed)

Normalize UTF-8 charset slug detection.

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.6 Priority: normal
Severity: normal Version: trunk
Component: General Keywords: has-unit-tests has-patch
Focuses: Cc:

Description

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The _canonical_charset() method exists, but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including _canonical_charset().

Finally, the test functions governing _canonical_charset() have been rewritten as a single test with a data provider instead of as separate test functions.

This patch raises a couple of questions:

Is there a need to normalize the ISO-8859-1 charsets since htmlspecialchars() does not need it?
Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that htmlspecialchars() does not choke?
One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached? If it's already cached then I suspect further caching or use of static vars will only bloat the memory footprint of Core without speeding it up.

Change History (13)

This ticket was mentioned in PR #6535 on WordPress/wordpress-develop by @dmsnell.


3 weeks ago
#1

Trac ticket: Core-61182

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The _canonical_charset() method exists, but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including _canonical_charset().

Finally, the test functions governing _canonical_charset() have been rewritten as a single test with a data provider instead of as separate test functions.

This patch raises a couple of questions:

  • Is there a need to normalize the ISO-8859-1 charsets since htmlspecialchars() does not need it?
  • Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that htmlspecialchars() does not choke?

One important question I'd like your help on, if you know the answer: _is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached?_ If it's already cached then I suspect further caching or use of static vars will only bloat the memory footprint of Core without speeding it up.

#2 in reply to: ↑ description @SergeyBiryukov
3 weeks ago

Replying to dmsnell:

One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached?

It is already cached, so performance should not be an issue here.

#3 @dmsnell
3 weeks ago

Thank you @SergeyBiryukov - I had hoped and thought so, but was not confident. I'll re-examine how it's used because I think we have some old code adding caches ontop of other caches, which in this case only seems like it would bloat the memory use, albeit insignificantly.

#4 @dmsnell
2 weeks ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 58147:

Normalize UTF-8 charset slug detection.

There are several exist places in Core that attempt to detect if a blog charset
is UTF-8. Each place attempts to perform the same check, except the logic is
spread throughout and there's no single method provided to make this
determination in a consistent way. The _canonical_charset() method exists,
but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method
taking an optional charset slug and indicating if it represents UTF-8,
examining all of the allowable variants of that slug. Associated code is
updated to use this new function, including _canonical_charset(). If no slug
is provided, it will look up the current get_option( 'blog_charset' ).

Finally, the test functions governing _canonical_charset() have been
rewritten as a single test with a data provider instead of as separate test
functions.

Developed in https://github.com/WordPress/wordpress-develop/pull/6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Props dmsnell, jonsurrell.

#5 @dmsnell
2 weeks ago

In 58148:

Quick-fix: Normalize UTF-8 charset slug detection.

In the merge of [58417] the new file was missed. This commit adds the missing required file.

Developed in https://github.com/WordPress/wordpress-develop/pull/6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Follow-up to [58147].
Props dmsnell, jonsurrell.

#7 @SergeyBiryukov
2 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Thanks for the commit!

Requiring an extra file just for a single function does not seem ideal to me, could is_utf8_charset() be moved to either wp-includes/functions.php or wp-includes/load.php instead?

#8 @dmsnell
2 weeks ago

it could @SergeyBiryukov, but there are a few reasons I think it may be fine to leave.

  • load.php is already 2000 lines and can be slow in an IDE to work in if the IDE is applying styling lints and code sniffs.
  • there are many other common Unicode-related functions that I think fit there, including some that are already scattered and duplicated among kses and formatting and elsewhere.
  • "unicode" is a somewhat well-formed category on its own. If looking for some related functionality I might look for a string.php, strings.php, utf8.php , or unicode.php but wouldn't think to lookup load.php

Happy to adapt; just wanted to share that when I proposed this I did consider leaving it even inside of functions.php where _canonical_charset() lives, but for similar reasoning felt it was a good fit for a new home.

#9 @swissspidy
12 days ago

  • Keywords needs-patch added; has-patch removed

I don't think these are compelling enough reasons to create a new file just for this 1 function. If we want to move more similar functions together in a new file, then that can be done separately in one go.

Also, right now, this separate location causes a fatal error when a site is in maintenance mode.

<br />
<b>Fatal error</b>:  Uncaught Error: Call to undefined function is_utf8_charset() in /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php:7481
Stack trace:
#0 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(4310): _canonical_charset('utf-8')
#1 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(3804): _wp_die_process_input('Briefly unavail...', 'Maintenance', Array)
#2 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(3787): _default_wp_die_handler('Briefly unavail...', 'Maintenance', Array)
#3 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/load.php(388): wp_die('Briefly unavail...', 'Maintenance', Array)
#4 /Users/pascalb/Local Sites/media-experiments/app/public/wp-settings.php(76): wp_maintenance()
#5 /Users/pascalb/Local Sites/media-experiments/app/public/wp-config.php(95): require_once('/Users/pascalb/...')
#6 /Users/pascalb/Local Sites/media-experiments/app/public/wp-load.php(50): require_once('/Users/pascalb/...')
#7 /Users/pascalb/Local Sites/media-experiments/app/public/wp-login.php(12): require('/Users/pascalb/...')
#8 {main}
  thrown in <b>/Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php</b> on line <b>7481</b><br />

wp_maintenance() is called early in wp-settings.php, before unicode.php is loaded, and exits. wp_maintenance() manually loads functions.php and then exits. Hence the error.

That's another reason why this should be moved to functions.php

This ticket was mentioned in PR #6568 on WordPress/wordpress-develop by @dmsnell.


11 days ago
#10

  • Keywords has-patch added; needs-patch removed

Trac ticket: Core-61182

This caused issues in maintenance mode, and it's not warranted to have its own module. This will live alongside _canonical_charset(), it's partner function.

Props: dmsnell, sergeybiryukov, swisspiddy.
Follow-up to: [58148].

#11 @dmsnell
11 days ago

Nice catch @swissspidy - I wasn't aware of the need to test in maintenance mode. I'll try and remember that.

Since both of you felt it important anyway to merge this into functions.php, I've done so in #6568

#12 @dmsnell
10 days ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 58169:

Move is_utf8_charset() into functions.php

This caused issues in maintenance mode, and it's not warranted to have
its own module. This will live alongside _canonical_charset(), it's
partner function.

Fixes: #61182.
Props: dmsnell, sergeybiryukov, swisspiddy.
Follow-up to: [58148].

Note: See TracTickets for help on using tickets.