WordPress.org

Make WordPress Core

Opened 16 months ago

Last modified 3 months ago

#44793 reviewing defect (bug)

remove_accents() doesnt escape all versions of "i"

Reported by: bagosm Owned by: SergeyBiryukov
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Formatting Keywords: has-patch dev-feedback 2nd-opinion
Focuses: Cc:
PR Number:

Description

The version with both dieresis and accent is missing. Suggested addition is the following

plus	                '΅Ι' => 'I', 'ΐ' => 'i',

Attachments (5)

44793.diff (260.0 KB) - added by sebastienserre 16 months ago.
44793-1.diff (411 bytes) - added by sebastienserre 16 months ago.
44793-2.diff (1.0 KB) - added by iamdmitrymayorov 16 months ago.
Added tests and relocated accents to the correct block under Greek Coptic.
44793-3.diff (1.0 KB) - added by iamdmitrymayorov 16 months ago.
Minor comment fix for previous patch
44793.4.diff (1.8 KB) - added by SergeyBiryukov 16 months ago.

Download all attachments as: .zip

Change History (26)

#1 @sebastienserre
16 months ago

  • Keywords has-patch dev-feedback added

Hello @bagosm
I've created a Patch with your modification.
Many thanks to opened this ticket and improve WordPress.

#2 @sebastienserre
16 months ago

  • Component changed from General to Formatting

#3 @SergeyBiryukov
16 months ago

  • Keywords needs-refresh added
  • Milestone changed from Awaiting Review to 4.9.9
  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

Hi @bagosm, welcome to WordPress Trac! Thanks for the report.

@sebastienserre, thanks for the patch :) Could you please submit a patch without all the formatting changes?

#4 @sebastienserre
16 months ago

Hello @SergeyBiryukov
HUuummm I don't know hos to do :-( I asked to PHPStorm to create a patch after my changes and it gave me this :-(
I will check

#5 @sebastienserre
16 months ago

  • Keywords needs-refresh removed

Patch refreshed.

@iamdmitrymayorov
16 months ago

Added tests and relocated accents to the correct block under Greek Coptic.

#6 @iamdmitrymayorov
16 months ago

Uploaded a patch that adds tests for these letters. Also relocated them to a separate category under Greek Coptic letters.

@iamdmitrymayorov
16 months ago

Minor comment fix for previous patch

#7 @iamdmitrymayorov
16 months ago

Updated the patch that fixes a comment.

#8 @SergeyBiryukov
16 months ago

It looks like the ticket is about the Greek letter iota with dialytika and tonos: https://en.wiktionary.org/wiki/ΐ

44793.4.diff is an attempt to cover other similar symbols for consistency, including letter upsilon and standalone dialytika and tonos.

@dyrer, @ifrountas, @kosvrouvas, could you take a look at these changes to see if they are valid and remove_accents() should indeed handle these letters? Are there any other symbols that should be added?

Last edited 16 months ago by SergeyBiryukov (previous) (diff)

#10 @pento
14 months ago

  • Milestone changed from 4.9.9 to 5.0.1

#11 @pento
12 months ago

  • Milestone changed from 5.0.1 to 5.0.2

#12 @pento
12 months ago

  • Milestone changed from 5.0.2 to 5.0.3

#13 @audrasjb
12 months ago

  • Keywords needs-testing added

Hi,

This ticket is triaged in milestone 5.0.3. Now it will needs commit and backport to the related branch. Adding dev-feedback needs-testing keywords to see if it can land in 5.0.3 at the very beginning of January.

This ticket was mentioned in Slack in #core by desrosj. View the logs.


11 months ago

#15 @SergeyBiryukov
11 months ago

  • Milestone changed from 5.0.3 to 5.1

#16 @SergeyBiryukov
11 months ago

  • Milestone changed from 5.1 to 5.2

#17 @SergeyBiryukov
9 months ago

  • Milestone changed from 5.2 to Future Release

Still needs feedback from the Greek translation team (comment:8).

#18 @SergeyBiryukov
9 months ago

  • Milestone changed from Future Release to 5.3

#19 @xkon
9 months ago

Hey there!

I just noticed this ticket as it was turned into a Future Release (good call imho as it might need a lot more discussion as well!). I can handle the Greek letters but I need some clarifications first please because I'm not sure where exactly remove_accents() is needed & used for. I saw in core it's within sanitize_title() and sanitize_user() but it might be on various other places or used in more broader scopes as well that I can't know at the moment.

Please bear with me and let me explain my thinking process and "issues" before creating a patch and if there's an outcome on what's actually needed I'll be more than happy to provide a patch for Greek letters.

The function itself is called "remove_accents()" that literally means removing accents. The description though in our Handbook says Converts all accent characters to ASCII characters. and this is something totally different, in many languages removing an accent and converting to ASCII (Latin) is whole different story and changes everything.

For example:
Removing accent ( literally so both are Greek letters ): ί = ι
Converting to ASCII ( so altering the locale ): ί = i

Questions:

Where would remove_accents() be actually used and for what purposes?
As for example if it's used to create "slugs" or titles if we only add accented characters people will end up with a mixed slug or title by having Greek/Latin letters. So most likely in this case a "full" exchange of Greek -> Latin would be needed.

What is actually needed here?
If it's simply changing from an accented Greek letter to a non accented one it would be ok (I believe). If it's to remove accents for another reason and it's used to try and convert everything into Latin maybe we need to split things up into remove_accents() and convert_to_latin() for example by introducing something new (if something equivalent doesn't already exist that I'm not aware of).

For me this isn't something as simple as it looks as for various Languages removing accents (if used to alter actual readable text) ends up on changing the meaning of the word as well.

Thanks!

#20 @johnbillion
3 months ago

  • Keywords 2nd-opinion added; needs-testing removed

#21 @azaozz
3 months ago

  • Milestone changed from 5.3 to Future Release

As far as I understand remove_accents() is meant to replace accented letters from Latin based languages with their non-accented equivalent (as much as possible) only for the Latin-1 Supplement, Latin Extended-A, and Latin Extended-B Unicode blocks.

It is not meant to "reduce" a string containing chars from the above Unicode blocks to an "ASCII equivalent", however in practice this is often the case. Looking back, seems it was added to both sanitize and make permalinks and user names easy to url-encode and look better (SEO related?), although there may have been other reasons.

Looking at the questions @xkon raises, remove_accents() would probably need better documentation, and in the longer run, a re-evaluation and refactoring.

In that terms moving to "future release" for now. Please feel free to add back to 5.3 if feasible.

Note: See TracTickets for help on using tickets.