Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#32136 closed defect (bug) (wontfix)

strip_invalid_text removes all russian utf8 chars

Reported by: Fahrain
Owned by: pento
Milestone:
Priority: normal
Severity: normal
Version: 4.2
Component: Database
Keywords:
Focuses:
Cc:

Description

WordPress is now updated to 4.1.3.

I have some custom tables in the WordPress database and cannot insert data into them, because the function strip_invalid_text removes all Russian characters from the input data arrays.

The data array passed to $wpdb->insert is:

array(5) {
  ["name"]=>
  string(15) "Земля Испытаний"
...
}

The format array is:

 ('%s' ...) 

The data returned by strip_invalid_text is:

array(5) {
  ["name"]=>
  array(4) {
    ["value"]=>
    string(1) " "
    ["format"]=>
    string(2) "%s"
    ["charset"]=>
    string(4) "utf8"
    ["ascii"]=>
    bool(false)
  }
...
}

The problem is in the regular expression inside this block:

if ( 'utf8' === $charset || 'utf8mb3' === $charset || 'utf8mb4' === $charset ) {
...
	$value['value'] = preg_replace( $regex, '$1', $value['value'] );
}

This preg_replace removes the text because the input value "name" is not UTF-8 encoded at all; it is in windows-1251 encoding.
I have attached a file with an example. It is in windows-1251 encoding, so if you run iconv to UTF-8 on the input string, everything works fine, but if you remove the iconv call, the result contains only digits and ASCII characters.

For now I have removed 'utf8' === $charset || from the if condition. It helps, but...
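
The stripping described above can be reproduced outside WordPress. The following is a minimal sketch, not the core code: strip_invalid_utf8_sketch() is a hypothetical helper whose pattern only approximates the byte-level UTF-8 check in strip_invalid_text, and the sample value is "Земля Испытаний" written out by hand as windows-1251 bytes.

```php
<?php
// Hypothetical helper: keeps runs of well-formed UTF-8 sequences and drops
// every other byte. The pattern approximates the check in
// wpdb::strip_invalid_text(); it is not copied from core.
function strip_invalid_utf8_sketch( $text ) {
	$regex = '/
		(
			(?: [\x00-\x7F]                    # ASCII
			|   [\xC2-\xDF][\x80-\xBF]         # two-byte sequences
			|   \xE0[\xA0-\xBF][\x80-\xBF]     # three-byte sequences
			|   [\xE1-\xEC][\x80-\xBF]{2}
			|   \xED[\x80-\x9F][\x80-\xBF]
			|   [\xEE-\xEF][\x80-\xBF]{2}
			|   \xF0[\x90-\xBF][\x80-\xBF]{2}  # four-byte sequences
			|   [\xF1-\xF3][\x80-\xBF]{3}
			|   \xF4[\x80-\x8F][\x80-\xBF]{2}
			){1,40}
		)
		| .                                    # any other byte is dropped
		/x';
	return preg_replace( $regex, '$1', $text );
}

// "Земля Испытаний" encoded as windows-1251: 15 bytes, one per character.
$cp1251 = "\xC7\xE5\xEC\xEB\xFF \xC8\xF1\xEF\xFB\xF2\xE0\xED\xE8\xE9";
// The same text converted to UTF-8, as the attached example does with iconv.
$utf8 = iconv( 'CP1251', 'UTF-8', $cp1251 );

var_dump( strip_invalid_utf8_sketch( $cp1251 ) );         // only the ASCII space survives
var_dump( strip_invalid_utf8_sketch( $utf8 ) === $utf8 ); // valid UTF-8 passes through unchanged
```

None of the windows-1251 bytes above 0x7F happen to form a valid UTF-8 sequence here, so only the space is kept, which matches the string(1) " " shown in the dump above.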

Attachments (2)

tt.php (740 bytes) - added by Fahrain 6 years ago.
32136.diff (998 bytes) - added by pento 6 years ago.

Change History (13)

@Fahrain
6 years ago

#1 @SergeyBiryukov
6 years ago

  • Component changed from General to Database
  • Milestone changed from Awaiting Review to 4.2.1

#2 @DrewAPicture
6 years ago

  • Keywords needs-patch added

Hi @Fahrain, welcome to Trac. So are you seeing this with both 4.1.3 and 4.2? Sounds like we need to test it against 4.2 and latest trunk, and probably get a patch here.

#3 @Fahrain
6 years ago

I tested it with 4.1.2 and 4.1.3. As far as I can see, 4.2 uses the same code in this function.
I'm planning to upgrade to 4.2 in the next few days; I need to do some tests.

I think the problem is incorrect encoding detection. When I remove 'utf8' from this 'if', all input data is sent to the database to be checked:

...
// We couldn't use any local conversions, send it to the DB.
$value['db'] = $db_check_string = true;
...

There, the encoding detection either works correctly or simply doesn't change the input data.

@pento
6 years ago

#4 follow-up: @pento
6 years ago

I haven't been able to build a unit test to reproduce this behaviour.

Could I get you to try 32136.diff, and see if that works for you?

#5 @samuelsidler
6 years ago

  • Milestone changed from 4.2.2 to 4.2.3

#6 in reply to: ↑ 4 @Fahrain
6 years ago

Replying to pento:

I haven't been able to build a unit test to reproduce this behaviour.

Could I get you to try 32136.diff, and see if that works for you?

Sorry for the delay.

I upgraded WordPress to the current stable version, 4.2.2.
Then I applied this patch and everything works just fine! :)

P.S.: Maybe this helps:
My database is utf8, but the input data and all my custom script files are in windows-1251 encoding, so I use

  $wpdb->query("SET CHARACTER_SET_CLIENT='cp1251'");
  $wpdb->query("SET CHARACTER_SET_RESULTS='cp1251'");

to correct the input/output encoding before calling wpdb functions.

#7 @Fahrain
6 years ago

I've solved this problem.

You can reproduce this bug with these steps:

  1. Install WordPress and the database in utf8.
  2. Create a file in windows-1251 encoding and, inside that file, try to call $wpdb->insert with text in this encoding. The function strip_invalid_text will strip any text in windows-1251 encoding because it assumes the text is UTF-8. Patch 32136 doesn't fix this problem, because it uses mb_internal_encoding(), which of course returns utf-8.

I've used

  $wpdb->query("SET CHARACTER_SET_CLIENT='cp1251'");
  $wpdb->query("SET CHARACTER_SET_RESULTS='cp1251'");

to correct the input/output encoding before calling wpdb functions, and this worked (the data was written to the database in the correct encoding, utf8), but with this new function (strip_invalid_text) it can no longer work.

I converted all my included files to utf8 and fixed the encoding of the input data, and this solved the problem.

I think this bug report can be closed.
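
The fix described above (make sure the data is UTF-8 before it reaches wpdb) could look roughly like the sketch below. to_utf8_sketch() is a hypothetical helper, not a WordPress function, and it assumes that any value which is not already well-formed UTF-8 is in windows-1251. That assumption has to come from knowledge of the data source; it cannot be derived from the bytes themselves.

```php
<?php
// Hypothetical helper: returns $text as UTF-8, assuming that anything which
// is not already well-formed UTF-8 is encoded in $source_charset.
function to_utf8_sketch( $text, $source_charset = 'CP1251' ) {
	// With the "u" modifier, preg_match() returns false on malformed UTF-8.
	if ( 1 === preg_match( '//u', $text ) ) {
		return $text;
	}
	return iconv( $source_charset, 'UTF-8', $text );
}

// "Земля" as windows-1251 bytes, standing in for data from a legacy script.
$row = array( 'name' => "\xC7\xE5\xEC\xEB\xFF" );
$row = array_map( 'to_utf8_sketch', $row );

// The value is now UTF-8 and would survive strip_invalid_text, e.g. in
// $wpdb->insert( $table, $row ) with a table name of your own.
var_dump( bin2hex( $row['name'] ) );
```

With the input converted this way, the SET CHARACTER_SET_CLIENT and SET CHARACTER_SET_RESULTS statements are no longer needed, because the data already matches the connection charset.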

#8 @obenland
6 years ago

  • Owner set to pento
  • Status changed from new to assigned

#9 @pento
6 years ago

@Fahrain - Would you mind trying 32165.3.diff from ticket #32165 as a fix for this? I'm starting to suspect it's the same issue.

#10 @pento
6 years ago

  • Keywords needs-patch removed
  • Milestone 4.2.3 deleted
  • Resolution set to wontfix
  • Status changed from assigned to closed

On further consideration, I don't think we can solve this specific problem - when the input encoding doesn't match the MySQL connection encoding, there's no way for us to know what the actual encoding should be.
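
A small illustration of why the encoding cannot be recovered from the data alone: the same bytes decode without error under more than one single-byte charset. This is only a sketch with hand-picked sample bytes, and it relies on PHP's iconv() accepting the charset names CP1251 and ISO-8859-1.

```php
<?php
// Two bytes with no accompanying encoding label.
$bytes = "\xC7\xE5";

// Each single-byte charset decodes them without error, to different text.
$as_cp1251 = iconv( 'CP1251', 'UTF-8', $bytes );     // Cyrillic "Зе"
$as_latin1 = iconv( 'ISO-8859-1', 'UTF-8', $bytes ); // Latin "Çå"

var_dump( $as_cp1251, $as_latin1 );
```

Both conversions succeed, so nothing in the bytes says which reading the caller intended.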

#11 @Fahrain
6 years ago

I don't think these changes will help. I'll try them over the holidays on my test site.

I think there is only one correct way to detect the encoding: get the current connection settings from the database. The WordPress core uses $this->set_charset( $this->dbh, $charset ); to set the encoding (and a similar function to read it back). But everywhere in the code this encoding is taken from the configuration file, and if it is wrong (not the one actually used for the database or query), we have a problem.

You could check whether the query encoding was changed with $wpdb->query("SET CHARACTER_SET_CLIENT='cp1251'"); or similar SQL commands. That would be a more reliable way to detect the encoding actually used for a wpdb query, but it can't be done without additional SQL requests. The function strip_invalid_text tries to avoid SQL queries for encoding detection when it can determine the encoding itself, but it really can't detect the correct encoding without an SQL request. So...
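
The idea of asking the database for the connection's current client charset could be sketched as follows. connection_client_charset_sketch() is a hypothetical helper, not part of WordPress. It only assumes that the object passed in offers a get_row() method like $wpdb's, and it uses the MySQL statement SHOW VARIABLES, which reports session values by default.

```php
<?php
// Hypothetical helper: asks MySQL which charset it currently assumes for
// incoming statements on this connection. $db is expected to behave like
// $wpdb, i.e. get_row() returns the first result row as an object.
function connection_client_charset_sketch( $db ) {
	$row = $db->get_row( "SHOW VARIABLES LIKE 'character_set_client'" );
	// SHOW VARIABLES returns the columns Variable_name and Value.
	return $row ? $row->Value : null;
}

// Inside WordPress this would be called as:
// global $wpdb;
// $charset = connection_client_charset_sketch( $wpdb );
```

As the comment above points out, this costs an extra round trip to the database whenever the value is not cached.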
