#16530 closed enhancement (maybelater)
Implement locale based sorting
Reported by: | cyberskull | Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 3.1 |
Component: | I18N | Keywords: | |
Focuses: | Cc: |
Description
While looking at my tags list I noticed that they are sorted not according to the English locale, but by the ordinal value of the characters.
For example, this is how WordPress currently sorts things:
- A Book on Software
- all those things that bug me
- an apple a day keeps bill gates away
- books, books and more books!
- history and stuff
- lies, damned lies and statistics
- the history of stuff
- this tag is totally made up
- those guys
- under the bridge
- zaroo bugs found
A natural English locale sort would be:
- all those things that bug me
- an apple a day keeps bill gates away
- A Book on Software
- books, books and more books!
- history and stuff
- the history of stuff
- lies, damned lies and statistics
- this tag is totally made up
- those guys
- under the bridge
- zaroo bugs found
For English, this can be done with the reasonably simple regex /^((a|the)\s+)?(.+)/i, using the value of $3 for the comparison.
I was thinking of a function such as locale_sort(…) or wp_sort(…) that would take the same arguments as the built-in sort function. When invoked, it would get the locale from either the blog or the logged-in user and, if there is a custom sort for that locale, apply it before the strings are compared.
Here is a pseudocode approach:
function local_sort(…) {
    local function $locale_format = $locale_sorts{locale()};
    # Gets the appropriate formatter function from a hash of functions based
    # on the locale() or returns null/undef.
    if($locale_format) {
        $a = $locale_format($a); # format a
        $b = $locale_format($b); # format b
    }
    return sort($a, $b);
}
In the case of English, the function would look something like this:
en_locale_sort_format($string) {
    $string =~ /^((a|the)\s+)?(.+)/i;
    return $3;
}
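For anyone who wants to try the idea outside WordPress, here is a minimal runnable sketch in Python. It is only an illustration of the proposal, not WordPress code: the names `LOCALE_FORMATTERS`, `en_sort_format` and `locale_sort`, and the `en_US` key, are assumptions made up for this example. It uses the same regex as above and sorts on capture group 3.

```python
import re

# Same pattern as in the ticket: an optional leading "a" or "the", then the rest.
ARTICLE_RE = re.compile(r"^((a|the)\s+)?(.+)", re.IGNORECASE | re.DOTALL)


def en_sort_format(text):
    """Return the text without a leading 'a' or 'the' (capture group 3)."""
    match = ARTICLE_RE.match(text)
    return match.group(3) if match else text


# Hypothetical registry: locale code -> formatter applied before comparing.
LOCALE_FORMATTERS = {"en_US": en_sort_format}


def locale_sort(items, locale="en_US"):
    """Sort case-insensitively, using the locale's formatter if one exists."""
    formatter = LOCALE_FORMATTERS.get(locale, lambda s: s)
    return sorted(items, key=lambda s: formatter(s).casefold())


tags = [
    "A Book on Software",
    "all those things that bug me",
    "an apple a day keeps bill gates away",
    "books, books and more books!",
    "history and stuff",
    "lies, damned lies and statistics",
    "the history of stuff",
    "this tag is totally made up",
    "those guys",
    "under the bridge",
    "zaroo bugs found",
]
result = locale_sort(tags)
```

With the tag list from this ticket, `result` comes out in the "natural" order listed above. A locale without a registered formatter falls back to a plain case-insensitive sort.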
Change History (9)
#2 in reply to: ↑ 1, 14 years ago
Replying to pavelevap:
Related?
I would have to say that it is tangentially related. The issue there has more to do with direct ordinal sorting. In issue #11740 the character Ř (0x0158) will never come just after R (0x0052), as there is a difference of 262 (0x106) code points between the two. What I was trying to get at is that in any given locale there are some words that are ignored when sorting (a & the in English titles, nouns, etc).
So it looks like the correct thing to do is first implement proper alphabetical sorting according to the locale, then implement proper grammatical sorting.
Yeah, that is the term I have been looking for: Grammatical Sorting.
Back to the main point. In English, alphabetical sorting is ordinal sorting (whether case-sensitive or not). But for languages like Czech, ordinal ≠ alphabetical. Implementation-wise, the two can be done independently of each other. I see this issue as a bit easier to implement, since building a mechanism for sorting alphabetically is more work than sorting based on the rules of grammar, though alphabetical sorting would be a big boost to grammatical sorting.
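To make the ordinal vs. alphabetical difference concrete, here is a small Python sketch. The letter table is a deliberately simplified assumption: it ignores the Czech "ch" digraph and gives accented vowels their own positions, so it is not a real Czech collation. The names `CZECH_ORDER` and `czech_key` and the sample words are made up for the example.

```python
# Simplified Czech letter order (assumption: "ch" digraph ignored,
# accented vowels treated as separate positions).
CZECH_ORDER = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"
RANK = {ch: i for i, ch in enumerate(CZECH_ORDER)}


def czech_key(word):
    """Map each letter to its position in the table; unknown characters go last."""
    return [RANK.get(ch, len(CZECH_ORDER) + ord(ch)) for ch in word.lower()]


words = ["zebra", "Řeka", "ruka", "sova"]

# The gap mentioned above: Ř is 0x0158, R is 0x0052.
code_point_gap = ord("Ř") - ord("R")

# Ordinal sort: Ř lands after z because its code point is larger.
ordinal = sorted(words)

# Table-driven sort: ř lands between r and s.
alphabetical = sorted(words, key=czech_key)
```

Here `ordinal` puts "Řeka" last, while `alphabetical` places it right after "ruka". A grammatical formatter could be applied to each string before either kind of key is built, which is why the two mechanisms can be developed separately.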
That's enough of my ramblings for now. I hope this is of some use.
#3 follow-up: ↓ 4, 14 years ago
It's a tiny bit trickier, actually:
http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html
#4 in reply to: ↑ 3; follow-up: ↓ 5, 14 years ago
Replying to Denis-de-Bernardy:
It's a tiny bit trickier, actually:
http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html
I was deliberately ignoring sorting numbers at this point. I wanted more to get a framework in place to handle natural locale comparisons. With the framework in place, we could start plugging in functions for each language to do the sorts.
I wonder if there is a natural sort PHP package?
#5 in reply to: ↑ 4, 14 years ago
Replying to cyberskull:
Replying to Denis-de-Bernardy:
It's a tiny bit trickier, actually:
http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html
I was deliberately ignoring sorting numbers at this point. I wanted more to get a framework in place to handle natural locale comparisons. With the framework in place, we could start plugging in functions for each language to do the sorts.
I wonder if there is a natural sort PHP package?
Good news, everybody!
PHP has natural sort functions built in: natsort for arrays and strnatcmp for natural string comparison.
So to update my examples from above:
function local_compare($a, $b) {
    local function $locale_format = $locale_sorts{locale()};
    # Gets the appropriate formatter function from a hash of functions based
    # on the locale() or returns null/undef.
    if($locale_format) {
        $a = $locale_format($a); # format a
        $b = $locale_format($b); # format b
    }
    return strnatcmp($a, $b);
}
And the formatter:
en_locale_compare_format($string) {
    $string =~ /^((a|the)\s+)?(.+)/i;
    return $3;
}
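As a rough illustration of what a natural comparison does, here is a Python sketch of a comparator in the same shape as local_compare. It is not PHP's strnatcmp algorithm, only an approximation: digit runs are compared by numeric value and everything else case-insensitively (closer in spirit to PHP's strnatcasecmp). The names `natural_key` and `locale_compare` and the optional `formatter` argument are assumptions for the example.

```python
import re
from functools import cmp_to_key

DIGITS_RE = re.compile(r"([0-9]+)")


def natural_key(text):
    """Split into text and digit chunks so digit runs compare by value."""
    key = []
    # re.split with a capture group alternates: text, digits, text, ...
    for index, chunk in enumerate(DIGITS_RE.split(text.casefold())):
        if chunk == "":
            continue
        if index % 2 == 1:  # odd positions are the captured digit runs
            key.append((0, int(chunk), ""))
        else:
            key.append((1, 0, chunk))
    return key


def locale_compare(a, b, formatter=None):
    """Return -1, 0 or 1, applying an optional per-locale formatter first."""
    if formatter is not None:
        a, b = formatter(a), formatter(b)
    key_a, key_b = natural_key(a), natural_key(b)
    return (key_a > key_b) - (key_a < key_b)


files = ["img12.png", "img10.png", "img2.png", "IMG1.png"]
natural = sorted(files, key=cmp_to_key(locale_compare))
plain = sorted(files)
```

`natural` orders the names 1, 2, 10, 12, whereas `plain` puts "img10.png" and "img12.png" ahead of "img2.png". Passing a formatter (for example one that strips a leading "a" or "the") gives the combined behaviour proposed above.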
#6 follow-up: ↓ 7, 14 years ago
Just like to mention here that sorting is currently done mainly at the database level, and then a subselect is done with a LIMIT.
So at present, using your list (and a lower limit, let's say 4 per page), this is what would be retrieved in chunks:
- A Book on Software
- all those things that bug me
- an apple a day keeps bill gates away
- books, books and more books!

- history and stuff
- lies, damned lies and statistics
- the history of stuff
- this tag is totally made up

- those guys
- under the bridge
- zaroo bugs found
So each of those chunks would be a page, which could then be sorted in PHP.
Loading all items into memory, natural sorting them, and then displaying a subselection is highly inefficient and would likely exceed the memory available on many, if not all, web hosts.
Which basically leaves you with either sorting each already-paged chunk by the standard English/language-specific sorting method, or applying the ordering at the database level.
MySQL does not have a grammatical sorting system that I'm aware of, and the natural sorting methods it offers are mainly for strings containing numbers.
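The paging problem can be seen in a small Python sketch. It compares sorting each database page after the fact with sorting everything before paging. The helper names and the page size of 3 are assumptions for the illustration; with this particular list, 4 per page happens to give the same pages either way, so a size of 3 is used to expose the difference.

```python
import re

TAGS = [
    "A Book on Software",
    "all those things that bug me",
    "an apple a day keeps bill gates away",
    "books, books and more books!",
    "history and stuff",
    "lies, damned lies and statistics",
    "the history of stuff",
    "this tag is totally made up",
    "those guys",
    "under the bridge",
    "zaroo bugs found",
]


def grammatical_key(tag):
    """Drop a leading 'a' or 'the' and ignore case."""
    return re.sub(r"^(a|the)\s+", "", tag, flags=re.IGNORECASE).casefold()


def pages(items, size):
    """Cut a list into consecutive pages of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# What the database returns: ordinal order, cut into pages by LIMIT.
db_pages = pages(sorted(TAGS), 3)

# Sorting in PHP afterwards can only reorder items within one page.
sorted_within_pages = [sorted(page, key=grammatical_key) for page in db_pages]

# Sorting everything first and then paging gives the expected pages,
# but requires loading the whole list into memory.
sorted_then_paged = pages(sorted(TAGS, key=grammatical_key), 3)
```

In this run the second page of `sorted_within_pages` ends with "lies, damned lies and statistics", while the second page of `sorted_then_paged` ends with "the history of stuff". Per-page sorting therefore cannot move an item onto the page where it belongs.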
#7 in reply to: ↑ 6, 14 years ago
Replying to dd32:
Just like to mention here, That sorting is currently mainly done at the database level, and then a subselect is done with a LIMIT.
Hehe. My point entirely, if any. ;-)
That said, we *could* add some kind of sort_post column here or there to work around this.
Here's a function I use in PostgreSQL to deal with this, in case there is any interest in replicating it using MySQL and a trigger:
CREATE OR REPLACE FUNCTION natsort(text) RETURNS text AS $$
DECLARE
    _str text := $1;
    _pad int := 15; -- Maximum precision for PostgreSQL floats
BEGIN
    -- Bail if the string is empty
    IF trim(_str) = '' THEN
        RETURN '';
    END IF;

    -- Strip accents and lower the case
    _str := lower(unaccent(_str));

    -- Replace nonsensical characters
    _str := regexp_replace(_str, E'[^a-z0-9$¢£¥₤€@&%\\(\\)\\[\\]\\{\\}_:;,\\.\\?!\\+\\-]+', ' ', 'g');

    -- Trim the result
    _str := trim(_str);

    -- Todo: we'd ideally want to strip leading articles/prepositions ('a', 'the') at this stage,
    -- but to_tsvector() also strips common words (e.g. 'all').

    -- We're done if the string contains no numbers
    IF _str !~ '[0-9]' THEN
        RETURN _str;
    END IF;

    -- Force spaces between numbers, so we can use regexp_split_to_table()
    _str := regexp_replace(_str, E'((?:[0-9]+|[0-9]*\\.[0-9]+)(?:e[+-]?[0-9]+\\M)?)', E' \\1 ', 'g');

    -- Pad zeros to obtain a reasonably natural looking sort order
    RETURN array_to_string(ARRAY (
        SELECT CASE
            WHEN val !~ E'^\\.?[0-9]' THEN
                -- Not a number; return as is
                val
            ELSE
                -- Do our best...
                COALESCE(lpad(substring(val::numeric::text from '^[0-9]+'), _pad, '0'), '')
                || COALESCE(rpad(substring(val::numeric::text from E'\\.[0-9]+'), _pad, '0'), '')
        END
        FROM regexp_split_to_table(_str, E'\\s+') as val
        WHERE val <> ''
    ), ' ');
END;
$$ IMMUTABLE STRICT LANGUAGE plpgsql COST 1;
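The sort-column idea can be tried end to end with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the `terms` table, the `sort_name` column and the `sort_key` helper are hypothetical names, the key handles plain integers only (no decimals or accent stripping, unlike the function above), and the article stripping is added here to tie back to the ticket. The key is computed once at insert time, so the database can order and page on it directly.

```python
import re
import sqlite3

PAD = 15  # same padding width as _pad in the PostgreSQL function


def sort_key(title):
    """Lower-case, drop a leading 'a'/'the', and zero-pad integer runs."""
    key = re.sub(r"^(a|the)\s+", "", title.strip().lower())
    return re.sub(r"[0-9]+", lambda m: m.group(0).zfill(PAD), key)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (name TEXT, sort_name TEXT)")
names = [
    "the history of stuff",
    "Chapter 10",
    "A Book on Software",
    "chapter 2",
    "history and stuff",
]
conn.executemany(
    "INSERT INTO terms VALUES (?, ?)", [(n, sort_key(n)) for n in names]
)
# An index on the precomputed key lets ORDER BY ... LIMIT avoid a full sort.
conn.execute("CREATE INDEX terms_sort_name ON terms (sort_name)")


def fetch_page(page, per_page=2):
    """Return one page of names, ordered by the precomputed sort key."""
    rows = conn.execute(
        "SELECT name FROM terms ORDER BY sort_name LIMIT ? OFFSET ?",
        (per_page, page * per_page),
    )
    return [row[0] for row in rows]


first_page = fetch_page(0)
second_page = fetch_page(1)
third_page = fetch_page(2)
```

The pages come back in the intended order without PHP-side sorting: "chapter 2" precedes "Chapter 10", and "the history of stuff" sorts under "history". In MySQL the same column would have to be kept up to date by a trigger or by application code whenever a name changes.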
Related?
http://core.trac.wordpress.org/ticket/11740