Make WordPress Core

Opened 14 years ago

Closed 13 years ago

Last modified 13 years ago

#16530 closed enhancement (maybelater)

Implement locale based sorting

Reported by: cyberskull Owned by:
Milestone: Priority: normal
Severity: normal Version: 3.1
Component: I18N Keywords:
Focuses: Cc:

Description

While looking at my tag list, I noticed that the tags are sorted not according to the English locale but by the ordinal value of their characters.

For example, this is how WordPress currently sorts things:

  • A Book on Software
  • all those things that bug me
  • an apple a day keeps bill gates away
  • books, books and more books!
  • history and stuff
  • lies, damned lies and statistics
  • the history of stuff
  • this tag is totally made up
  • those guys
  • under the bridge
  • zaroo bugs found

A natural English locale sort would be:

  • all those things that bug me
  • an apple a day keeps bill gates away
  • A Book on Software
  • books, books and more books!
  • history and stuff
  • the history of stuff
  • lies, damned lies and statistics
  • this tag is totally made up
  • those guys
  • under the bridge
  • zaroo bugs found

For English, this can be done with the reasonably simple regex: /^((a|the)\s+)?(.+)/i and using the value of $3 for the comparison.

I was thinking of a function like either locale_sort(…) or wp_sort(…) that would take the same arguments as the built-in sort function. When invoked, it would get the locale from either the blog or the logged-in user and, if there is a custom sort for that locale, apply it before the strings are compared.

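Here is a rough approach in PHP ($locale_sorts is a hypothetical hash of formatter functions keyed by locale):

function locale_sort(array &$strings)
{
	global $locale_sorts;

	// Gets the appropriate formatter function from a hash of functions
	// based on get_locale(), or null if none is registered.
	$locale_format = $locale_sorts[get_locale()] ?? null;

	return usort($strings, function ($a, $b) use ($locale_format) {
		if ($locale_format) {
			$a = $locale_format($a); // format a
			$b = $locale_format($b); // format b
		}

		return strcasecmp($a, $b); // case-insensitive comparison
	});
}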

In the case of English, the function would look something like this:

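function en_locale_sort_format($string)
{
	// Skip a leading "a" or "the"; group 3 holds the rest of the string.
	if (preg_match('/^((a|the)\s+)?(.+)/i', $string, $matches)) {
		return $matches[3];
	}

	return $string;
}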
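
As a usage example, here is a minimal, self-contained sketch of the idea run end to end on part of the tag list. The registry `$locale_sorts`, the helper names `en_sort_key` and `locale_sort_titles`, and the case-insensitive comparison are all assumptions made for illustration; none of them is an existing WordPress API, and the locale is passed in explicitly rather than looked up.

```php
<?php
// Hypothetical English formatter: drop a leading "a" or "the".
function en_sort_key(string $title): string
{
    if (preg_match('/^((a|the)\s+)?(.+)/i', $title, $m)) {
        return $m[3];
    }
    return $title; // empty or unmatched input is compared as-is
}

// Hypothetical registry mapping a locale code to its formatter.
$locale_sorts = ['en_US' => 'en_sort_key'];

// Sort titles using the formatter registered for $locale, if there is one.
function locale_sort_titles(array $titles, string $locale, array $formatters): array
{
    $format = $formatters[$locale] ?? null;
    usort($titles, function (string $a, string $b) use ($format): int {
        if ($format !== null) {
            $a = $format($a);
            $b = $format($b);
        }
        return strcasecmp($a, $b); // case-insensitive, byte-wise comparison
    });
    return $titles;
}

$tags = [
    'A Book on Software',
    'the history of stuff',
    'lies, damned lies and statistics',
    'history and stuff',
    'all those things that bug me',
];

$sorted = locale_sort_titles($tags, 'en_US', $locale_sorts);
// $sorted is now:
//   all those things that bug me
//   A Book on Software
//   history and stuff
//   the history of stuff
//   lies, damned lies and statistics
```

A locale with no registered formatter simply falls through to the plain comparison, which is the "returns null/undef" case described above.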

Change History (9)

#1 follow-up: @pavelevap
14 years ago

  • Cc pavelevap@… added

#2 in reply to: ↑ 1 @cyberskull
14 years ago

Replying to pavelevap:

Related?

http://core.trac.wordpress.org/ticket/11740

I would have to say that it is tangentially related. The issue there has more to do with direct ordinal sorting. In issue #11740, the character Ř (0x0158) will never come just after R (0x0052), as there is a difference of 262 (0x106) code points between the two. What I was trying to get at is that, in any given locale, there are some words that are ignored when sorting (a & the in English titles, nouns, etc.).

So it looks like the correct thing to do is first implement proper alphabetical sorting according to the locale, then implement proper grammatical sorting.

Yeah, that is the term I have been looking for: Grammatical Sorting.

Back to the main point. In English, alphabetical sorting is ordinal sorting (whether case-sensitive or not). But for languages like Czech, ordinal ≠ alphabetical. Implementation-wise, these can be done independently of each other. But I see this issue as a bit easier to implement, as building a mechanism for sorting alphabetically is more work than sorting based on the rules of grammar, though the alphabetical sort would be a big boost to the grammatical sort.

That's enough of my ramblings for now. I hope this is of some use.
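
To make the ordinal vs. alphabetical distinction concrete, here is a small sketch that assumes PHP's optional intl extension is available. `Collator` is the intl class for locale-aware comparison; the wrapper name `alphabetical_sort` and the byte-wise fallback are illustrative choices, not anything proposed in this ticket.

```php
<?php
// Sort strings alphabetically for a locale. Uses the intl extension's
// Collator when it is loaded; otherwise falls back to a byte-wise sort.
function alphabetical_sort(array $strings, string $locale): array
{
    if (class_exists('Collator')) {
        $collator = new Collator($locale);
        $collator->sort($strings); // sorts the array in place
        return array_values($strings);
    }
    sort($strings, SORT_STRING); // ordinal fallback
    return $strings;
}

$names = ['S', 'Ř', 'R'];

// Ordinal (byte-wise) order puts Ř after S, because its UTF-8 bytes
// are larger than any ASCII letter.
$ordinal = $names;
sort($ordinal, SORT_STRING); // R, S, Ř

// With intl loaded, Czech collation places Ř directly after R.
$czech = alphabetical_sort($names, 'cs_CZ'); // R, Ř, S
```

The grammatical step (dropping articles) could then be layered on top by formatting each string before it is handed to the collator.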

#4 in reply to: ↑ 3 ; follow-up: @cyberskull
14 years ago

Replying to Denis-de-Bernardy:

It's a tiny bit trickier, actually:

http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html

I was deliberately ignoring sorting numbers at this point. I wanted more to get a framework in place to handle natural locale comparisons. With the framework in place, we could start plugging in functions for each language to do the sorts.

I wonder if there is a natural sort PHP package?

#5 in reply to: ↑ 4 @cyberskull
14 years ago

Replying to cyberskull:

Replying to Denis-de-Bernardy:

It's a tiny bit trickier, actually:

http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html

I was deliberately ignoring sorting numbers at this point. I wanted more to get a framework in place to handle natural locale comparisons. With the framework in place, we could start plugging in functions for each language to do the sorts.

I wonder if there is a natural sort PHP package?

Good news everybody!

PHP has natural sort functions built in: natsort for arrays and strnatcmp for natural string comparison.

So to update my examples from above:

function locale_compare($a, $b)
{
	global $locale_sorts;

	// Gets the appropriate formatter function from a hash of functions
	// based on get_locale(), or null if none is registered.
	$locale_format = $locale_sorts[get_locale()] ?? null;

	if ($locale_format) {
		$a = $locale_format($a); // format a
		$b = $locale_format($b); // format b
	}

	return strnatcmp($a, $b);
}

And the formatter:

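function en_locale_compare_format($string)
{
	// Skip a leading "a" or "the"; group 3 holds the rest of the string.
	if (preg_match('/^((a|the)\s+)?(.+)/i', $string, $matches)) {
		return $matches[3];
	}

	return $string;
}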
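
For reference, this is how the built-in natural-order functions mentioned above differ from an ordinal sort. The file names are made-up sample data.

```php
<?php
$files = ['img12.png', 'img10.png', 'img2.png', 'img1.png'];

// Ordinal sort compares character by character, so "10" sorts before "2".
$ordinal = $files;
sort($ordinal, SORT_STRING);
// img1.png, img10.png, img12.png, img2.png

// usort() with strnatcmp() compares digit runs as numbers and reindexes.
$natural = $files;
usort($natural, 'strnatcmp');
// img1.png, img2.png, img10.png, img12.png

// natsort() produces the same order but keeps the original array keys.
$kept = $files;
natsort($kept);
// array_keys($kept) is [3, 2, 1, 0]
```

Note that strnatcmp is case-sensitive; strnatcasecmp is the case-insensitive variant.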

#6 follow-up: @dd32
14 years ago

Just like to mention here that sorting is currently mainly done at the database level, and then a subselect is done with a LIMIT.

So at present, using your list (and a lower limit, let's say 4 per page), this is what would be retrieved in chunks:

  • A Book on Software
  • all those things that bug me
  • an apple a day keeps bill gates away
  • books, books and more books!

  • history and stuff
  • lies, damned lies and statistics
  • the history of stuff
  • this tag is totally made up

  • those guys
  • under the bridge
  • zaroo bugs found

So each of those chunks would be a page, which could then be sorted in PHP.

Loading all items into memory, natural sorting them, and then displaying a subselection is highly inefficient and would likely exceed the memory available on many, if not all, web hosts.

Which basically leaves you either sorting each page of already "sorted" results by the standard English/language-specific sorting method, or applying the ordering at the database level.

MySQL does not have a grammatical sorting system that I'm aware of, and the natural sorting methods it offers are mainly for strings containing numbers.
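
A small sketch of why sorting each page in PHP is not equivalent to sorting at the source. The helpers `title_key` and `by_title_key` are hypothetical, and a page size of 3 is an assumption chosen for the demonstration: with this particular list, pages of 4 happen to come out the same either way.

```php
<?php
// Comparison key for English titles: drop a leading "a" or "the".
function title_key(string $title): string
{
    return (string) preg_replace('/^(a|the)\s+(?=.)/i', '', $title);
}

function by_title_key(string $a, string $b): int
{
    return strcasecmp(title_key($a), title_key($b));
}

// Sort the whole list first, then cut it into pages.
function sort_then_page(array $titles, int $per_page): array
{
    usort($titles, 'by_title_key');
    return array_chunk($titles, $per_page);
}

// Cut into pages first (as LIMIT does), then sort each page on its own.
function page_then_sort(array $titles, int $per_page): array
{
    $pages = array_chunk($titles, $per_page);
    foreach ($pages as &$page) {
        usort($page, 'by_title_key');
    }
    unset($page);
    return $pages;
}

// Titles in the order the database currently returns them.
$db_order = [
    'A Book on Software',
    'all those things that bug me',
    'an apple a day keeps bill gates away',
    'books, books and more books!',
    'history and stuff',
    'lies, damned lies and statistics',
    'the history of stuff',
    'this tag is totally made up',
    'those guys',
    'under the bridge',
    'zaroo bugs found',
];

$global_pages = sort_then_page($db_order, 3);
$local_pages  = page_then_sort($db_order, 3);

// Page 2 differs: "the history of stuff" belongs on it, but the
// database had already placed that row on page 3.
```

Per-page sorting can only reorder rows within a page; it can never move a row to the page where it belongs.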

#7 in reply to: ↑ 6 @Denis-de-Bernardy
14 years ago

Replying to dd32:

Just like to mention here that sorting is currently mainly done at the database level, and then a subselect is done with a LIMIT.

Hehe. My point entirely, if any. ;-)

That said, we *could* add some kind of sort_post column here or there to work around this.

Here's a function I use in PostgreSQL to deal with this, in case there is any interest in replicating it using MySQL and a trigger:

CREATE OR REPLACE FUNCTION natsort(text)
	RETURNS text
AS $$
DECLARE
	_str	text := $1;
	_pad	int := 15; -- Maximum precision for PostgreSQL floats
BEGIN
	-- Bail if the string is empty
	IF	trim(_str) = ''
	THEN
		RETURN '';
	END IF;
	
	-- Strip accents and lower the case
	_str := lower(unaccent(_str));
	
	-- Replace nonsensical characters
	_str := regexp_replace(_str, E'[^a-z0-9$¢£¥₤€@&%\\(\\)\\[\\]\\{\\}_:;,\\.\\?!\\+\\-]+', ' ', 'g');
	
	-- Trim the result
	_str := trim(_str);
	
	-- Todo: we'd ideally want to strip leading articles/prepositions ('a', 'the') at this stage,
	-- but to_tsvector() also strips common words (e.g. 'all').
	
	-- We're done if the string contains no numbers
	IF	_str !~ '[0-9]'
	THEN
		RETURN _str;
	END IF;
	
	-- Force spaces between numbers, so we can use regexp_split_to_table()
	_str := regexp_replace(_str, E'((?:[0-9]+|[0-9]*\\.[0-9]+)(?:e[+-]?[0-9]+\\M)?)', E' \\1 ', 'g');
	
	-- Pad zeros to obtain a reasonably natural looking sort order
	RETURN array_to_string(ARRAY (
	SELECT	CASE
			WHEN val !~ E'^\\.?[0-9]'
			THEN
				-- Not a number; return as is
				val
			ELSE
				-- Do our best...
				COALESCE(lpad(substring(val::numeric::text from '^[0-9]+'), _pad, '0'), '') ||
				COALESCE(rpad(substring(val::numeric::text from E'\\.[0-9]+'), _pad, '0'), '')
			END
	FROM	regexp_split_to_table(_str, E'\\s+') as val
	WHERE	val <> ''
	), ' ');
END;
$$ IMMUTABLE STRICT LANGUAGE plpgsql COST 1;
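
For comparison, a much-simplified PHP sketch of the same precomputed-key idea, which could fill a sort column when a row is saved. The function name `natural_sort_key` is hypothetical, and this version is deliberately narrower than the PostgreSQL function above: it only lower-cases ASCII, collapses whitespace, and zero-pads whole-number digit runs, with no accent stripping, decimals, or exponents.

```php
<?php
// Build a key whose plain string order matches natural order:
// lower-case, collapse whitespace, left-pad digit runs with zeros.
function natural_sort_key(string $title, int $pad = 15): string
{
    $key = strtolower(trim($title));
    $key = (string) preg_replace('/\s+/', ' ', $key);

    return (string) preg_replace_callback(
        '/[0-9]+/',
        function (array $m) use ($pad): string {
            return str_pad($m[0], $pad, '0', STR_PAD_LEFT);
        },
        $key
    );
}

$titles = ['Chapter 10', 'chapter 2', 'Chapter 1'];

// Pair each key with its title, then order by key using an ordinal
// string comparison, as a database ORDER BY on the key column would.
$by_key = array_combine(array_map('natural_sort_key', $titles), $titles);
ksort($by_key, SORT_STRING);
$ordered = array_values($by_key);
// Chapter 1, chapter 2, Chapter 10
```

Because the key is computed once per row, ordering and LIMIT can both stay in the database.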

#8 @nacin
13 years ago

  • Resolution set to maybelater
  • Status changed from new to closed

Can't see this happening any time soon.

#9 @nacin
13 years ago

  • Milestone Awaiting Review deleted