WordPress.org

Make WordPress Core

Opened 21 months ago

Last modified 15 months ago

#24408 new defect (bug)

HTTPd Error Log: body.xml:1: parser error : Document labelled UTF-16 but has UTF-8 content

Reported by: crashnet
Owned by:
Milestone: Awaiting Review
Priority: normal
Severity: normal
Version: 3.5.1
Component: XML-RPC
Keywords: 2nd-opinion reporter-feedback close
Focuses:
Cc:

Description

I run a few pt_BR blogs and regularly see "body.xml:1: parser error : Document labelled UTF-16 but has UTF-8 content" in the Apache error log.

I finally found a clue as to why this happens:

(because) "...é and á characters which are not valid UTF-8 according to Xerces. I can’t send the text as UTF-16 because XML-RPC doesn’t allow it – it only allows USASCII. So I think I have to find any non USASCII characters and convert them to their XML Hex equivalent." (from http://tersesystems.com/2003/04/29/xml-rpc-encoding-of-utf-16)

So the proper handling seems to be "find any non USASCII characters and convert them to their XML Hex equivalent."
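
For illustration only, the conversion described in the quote could be sketched as follows. The function name `escape_non_ascii_for_xml` is hypothetical and not part of WordPress, the sketch assumes PHP 7.2+ with the mbstring extension, and it does not escape `&`, `<` or `>`.

```php
<?php
// Hypothetical helper (not part of WordPress): replace every non-ASCII
// character in a UTF-8 string with an XML hexadecimal character reference.
function escape_non_ascii_for_xml(string $utf8): string
{
    $out = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        static function (array $m): string {
            // mb_ord() returns the Unicode code point of the matched character.
            return sprintf('&#x%X;', mb_ord($m[0], 'UTF-8'));
        },
        $utf8
    );

    // With the /u modifier, preg_replace_callback() returns null on invalid UTF-8.
    if ($out === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $out;
}

// "\u{E9}" is é and "\u{E1}" is á, written as escapes so the example
// does not depend on the encoding of this source file.
echo escape_non_ascii_for_xml("caf\u{E9} amanh\u{E1}"), "\n";
```

The result is pure ASCII, so it reads the same whatever encoding the XML declaration claims.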

Change History (10)

comment:1 @crashnet 21 months ago

  • Cc Eddie@… added

comment:2 @crashnet 21 months ago

  • Keywords 2nd-opinion added

comment:3 @SergeyBiryukov 21 months ago

Not sure how to reproduce this. Is your blog served in UTF-16? If so, why?

XML-RPC in WordPress works fine with UTF-8, and é and á characters are also perfectly valid UTF-8, otherwise they wouldn't have been displayed on this very page.
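
A quick way to check that claim, as a minimal sketch assuming PHP 7.0+ with the mbstring extension: both characters encode to well-formed two-byte UTF-8 sequences, while the single-byte ISO-8859-1 form of é does not validate as UTF-8.

```php
<?php
// é (U+00E9) and á (U+00E1), written as code point escapes.
$chars = ["\u{E9}", "\u{E1}"];

foreach ($chars as $ch) {
    // bin2hex() shows the raw bytes; mb_check_encoding() validates them.
    printf(
        "%s valid=%s\n",
        bin2hex($ch),
        var_export(mb_check_encoding($ch, 'UTF-8'), true)
    );
}

// The same character as a single ISO-8859-1 byte is not valid UTF-8.
var_dump(mb_check_encoding("\xE9", 'UTF-8'));
```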

comment:4 @crashnet 21 months ago

I am not sure whether I get the error because the blogs run in pt_BR, but AFAIK none of them is served as UTF-16, at least not the part meant for humans.

comment:5 follow-up: @SergeyBiryukov 20 months ago

  • Keywords reporter-feedback added

Could you provide the steps to reproduce the issue on a clean install?

comment:6 in reply to: ↑ 5 @crashnet 20 months ago

Replying to SergeyBiryukov:

> Could you provide the steps to reproduce the issue on a clean install?

I could say "Install WP in pt_BR, create posts, have comments, and monitor your logs". Or monitor your logs for this warning on any installs you currently run that are not in English.

Sorry I can't be of more help.

comment:7 @SergeyBiryukov 20 months ago

FWIW, I have a few ru_RU installs, but never saw this error in any of them.

comment:8 @theresa95 15 months ago

It is due to attempts to smuggle bad XML in the request body.

E.g. a similar line is logged in the ModSecurity logs for some requests made to a WP install when the ModSecurity XML body parser is enabled:

[msg "Failed to parse request body."] [data "XML parser error: XML: Failed parsing document."][uri "/xmlrpc.php"]
body.xml:1: parser error : Document labelled UTF-16 but has UTF-8 content
<?xml version="1.0" encoding="utf-16" standalone="yes"?>

Here is how to raise the same error in some PHP versions (where the passed-in argument is UTF-8):

<?php
// The string literal is UTF-8 bytes, but its XML declaration claims UTF-16.
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-16"?>');
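
To look at what the parser reports without anything being written to the error log, the libxml errors can be collected in memory. This is a minimal sketch using PHP's standard libxml functions; the variable names are illustrative, and the exact messages depend on the PHP and libxml2 versions in use.

```php
<?php
// Collect libxml errors in memory instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

// UTF-8 bytes whose declaration claims UTF-16; there is also no root element.
$body = '<?xml version="1.0" encoding="utf-16"?>';
$xml  = simplexml_load_string($body);

$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError with level, code, line and message.
    $messages[] = trim($error->message);
}

// Reset the error buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($xml);      // false: parsing failed
print_r($messages);  // exact messages depend on the libxml2 version
```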

comment:9 @SergeyBiryukov 15 months ago

  • Keywords close added

If some external requests cause a warning in ModSecurity, it doesn't sound like something we can fix in core.

comment:10 @theresa95 15 months ago

Just to clarify, I am not saying it is _because_ of ModSecurity. It is the underlying XML parser raising the error when fed bad XML. So it may very well be that some logic in WP uses the same parser, is fed similar input, and causes the error to be logged.

Last edited 15 months ago by theresa95