Make WordPress Core

Opened 11 years ago

Closed 9 years ago

#24408 closed defect (bug) (worksforme)

HTTPd Error Log: body.xml:1: parser error : Document labelled UTF-16 but has UTF-8 content

Reported by: crashnet Owned by:
Milestone: Priority: normal
Severity: normal Version: 3.5.1
Component: XML-RPC Keywords:
Focuses: Cc:

Description

I run a few pt_BR blogs and I regularly see "body.xml:1: parser error : Document labelled UTF-16 but has UTF-8 content" in the Apache error log.

I finally found a clue as to why this happens:

(because) "...é and á characters which are not valid UTF-8 according to Xerces. I can’t send the text as UTF-16 because XML-RPC doesn’t allow it – it only allows USASCII. So I think I have to find any non USASCII characters and convert them to their XML Hex equivalent." (from http://tersesystems.com/2003/04/29/xml-rpc-encoding-of-utf-16)

So the proper handling seems to be "find any non USASCII characters and convert them to their XML Hex equivalent."
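
A minimal sketch of the conversion described in that quote, assuming the input is valid UTF-8. `escape_non_ascii` is a hypothetical helper, not a WordPress or XML-RPC library function, and whether this escaping is needed at all is questioned in the comments below.

```php
<?php
// Hypothetical helper, not a WordPress function: replace every non-ASCII
// character in a UTF-8 string with an XML hexadecimal character reference.
function escape_non_ascii(string $utf8): string
{
    $escaped = preg_replace_callback(
        '/[^\x00-\x7F]/u', // the "u" flag makes PCRE match whole UTF-8 characters
        static function (array $m): string {
            $bytes = array_values(unpack('C*', $m[0]));
            $count = count($bytes);
            // Keep the payload bits of the lead byte, then append six bits
            // from each continuation byte.
            $codepoint = $bytes[0] & (0xFF >> ($count + 1));
            for ($i = 1; $i < $count; $i++) {
                $codepoint = ($codepoint << 6) | ($bytes[$i] & 0x3F);
            }
            return sprintf('&#x%X;', $codepoint);
        },
        $utf8
    );
    if ($escaped === null) {
        // preg_replace_callback returns null when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $escaped;
}

echo escape_non_ascii('é and á'), "\n"; // &#xE9; and &#xE1;
```

The result is plain ASCII, so it reads the same whatever encoding the XML declaration names.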

Change History (11)

#1 @crashnet
11 years ago

  • Cc Eddie@… added

#2 @crashnet
11 years ago

  • Keywords 2nd-opinion added

#3 @SergeyBiryukov
11 years ago

Not sure how to reproduce this. Is your blog served in UTF-16? If so, why?

XML-RPC in WordPress works fine with UTF-8, and é and á characters are also perfectly valid UTF-8, otherwise they wouldn't have been displayed on this very page.

#4 @crashnet
11 years ago

I am not sure if I get the error because the blogs are run in pt_BR, but AFAIK none of them is served as UTF-16. At least not the part meant for humans.

#5 follow-up: @SergeyBiryukov
11 years ago

  • Keywords reporter-feedback added

Could you provide the steps to reproduce the issue on a clean install?

#6 in reply to: ↑ 5 @crashnet
11 years ago

Replying to SergeyBiryukov:

Could you provide the steps to reproduce the issue on a clean install?

I could say "Install WP in pt_BR, create posts, have comments and monitor your logs". Or monitor your logs for this warning on any installs you currently run that are not in English.

Sorry I can't be of more help.

#7 @SergeyBiryukov
11 years ago

FWIW, I have a few ru_RU installs, but never saw this error in any of them.

#8 @theresa95
10 years ago

It is due to attempts to smuggle bad XML in the request body.

E.g. a similar line is logged in the ModSecurity logs for some requests made to a WP install, when the ModSecurity XML body parser is enabled:

[msg "Failed to parse request body."] [data "XML parser error: XML: Failed parsing document."][uri "/xmlrpc.php"]
body.xml:1: parser error : Document labelled UTF-16 but has UTF-8 content
<?xml version="1.0" encoding="utf-16" standalone="yes"?>

Here is how to raise the same error in some PHP versions (where the passed-in argument is in UTF-8):

<?php
// The string itself is UTF-8, but its XML declaration claims UTF-16.
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-16"?>');
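
A minimal sketch of how the same parser messages could be captured in PHP instead of being written to the log, assuming the SimpleXML and libxml extensions are available. `collect_xml_errors` is a hypothetical helper, not a WordPress function, and the messages it returns will vary with the PHP and libxml versions.

```php
<?php
// Hypothetical helper (not part of WordPress): parse a string with SimpleXML
// and return libxml's messages instead of letting them reach the error log.
function collect_xml_errors(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // buffer errors internally
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $messages;
}

// UTF-8 bytes with a UTF-16 declaration, like the request logged above.
print_r(collect_xml_errors('<?xml version="1.0" encoding="utf-16"?>'));
```

A well-formed document yields an empty array, so the caller can tell a bad request from a good one without anything being logged.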

#9 @SergeyBiryukov
10 years ago

  • Keywords close added

If some external requests cause a warning in ModSecurity, it doesn't sound like something we can fix in core.

#10 @theresa95
10 years ago

Just to clarify, I am not saying it is _because_ of ModSecurity. It is the underlying XML parser raising the error when fed bad XML. So it may very well be that some logic in WP uses the same parser (and is fed similar input), causing the error to be logged.

Last edited 10 years ago by theresa95

#11 @chriscct7
9 years ago

  • Keywords 2nd-opinion reporter-feedback close removed
  • Milestone Awaiting Review deleted
  • Resolution set to worksforme
  • Status changed from new to closed

No additional reports in 2 years, so this does not warrant an audit. Feel free to reopen if anyone can provide steps to reproduce on the latest core + translations.
