WordPress.org

Make WordPress Core

Opened 13 months ago

Closed 3 months ago

Last modified 2 months ago

#40560 closed defect (bug) (invalid)

REST API: Unicode characters are escaped in the response

Reported by: rilwis
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 4.7
Component: REST API
Keywords: close
Focuses: rest-api
Cc:

Description

When working with languages that use non-ASCII Unicode characters (Vietnamese, to be precise), the results returned by REST API requests contain escaped characters.

This is an example:

{"hoten":"Test ko dien thong tin thi sinh","email":"","sodienthoai":"123123","nganhhoc":"CNTT - \u1ee8ng d\u1ee5ng ph\u1ea7n m\u1ec1m","ma_nganhhoc":"CNTT","diadiemhoc":"H\u00e0 N\u1ed9i","ma_diadiemhoc":"HN","tongtien":8925000}

The expected result is:

{"hoten":"Test ko dien thong tin thi sinh","email":"","sodienthoai":"123123","nganhhoc":"CNTT - Ứng dụng phần mềm","ma_nganhhoc":"CNTT","diadiemhoc":"Hà Nội","ma_diadiemhoc":"HN","tongtien":8925000}

Looking at the WP_REST_Server::serve_request function, I see that it uses the wp_json_encode function. While this function accepts options, none are ever passed. In this case, the JSON_UNESCAPED_UNICODE option is all we need.

The option was only added in PHP 5.4, so we can't guarantee it works in every situation. Even so, we should make it an option in the REST API so developers have a way to get unescaped output.
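
For comparison, the two output modes can be sketched outside WordPress. The snippet below uses Python's standard `json` module, whose `ensure_ascii` switch plays a role similar to PHP's `JSON_UNESCAPED_UNICODE`. It only illustrates the behaviour being requested and is not WordPress code; the `record` dict is a trimmed copy of the example above.

```python
import json

# Trimmed copy of the example record from this ticket.
record = {"nganhhoc": "CNTT - Ứng dụng phần mềm", "diadiemhoc": "Hà Nội"}

# Default mode: every non-ASCII character is written as a \uXXXX escape,
# which matches what the REST API currently returns.
escaped = json.dumps(record)

# Unescaped mode: non-ASCII characters are written literally, which is
# roughly what JSON_UNESCAPED_UNICODE does in PHP.
unescaped = json.dumps(record, ensure_ascii=False)

print(escaped)
print(unescaped)
```

Both strings describe the same data; only the byte-level representation differs.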

Change History (9)

#1 @subrataemfluence
13 months ago

To my understanding this is reasonable enough, especially when we deal with characters like the double quote ("), forward slash (/), single quote ('), etc. If such characters are not escaped, they might cause problems when parsing.

For example I have a post with the following content:

<p>It was raining when I reached the "Amritsar ISBT" at around 9:30am </p>

When JSON-encoded by the WP REST API, it becomes:

"content":{"rendered":"<p>It was raining when I reached the &#8220;Amritsar ISBT&#8221; at around 9:30am...<\/p>"}

which is reasonable, because unescaped double quotes would be misinterpreted by the parser.

So I think it is OK for Unicode characters to be escaped up front.

#2 @johnbillion
13 months ago

  • Focuses rest-api added
  • Keywords dev-feedback added
  • Version changed from trunk to 4.7

#3 @lovelucy
13 months ago

I have encountered this one too. In my use case, a translation service translates my blog into another language. The WordPress API output uses the "\u" representation for non-ASCII characters, which breaks my translation service.

Since the "UTF-8" encoding is announced via the "Content-Type" charset, we can safely output non-"\u" representations. In fact, I found that the legacy JSON_API plugin supports output-modifying arguments:

Setting json_unescaped_unicode will replace unicode-escaped characters with their unescaped equivalents (e.g., \u00e1 becomes á)

Hope there's someone who can patch this. Thanks :)
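
In the meantime, a consumer can work around this by decoding the response and re-serialising it before handing it to a tool that cannot cope with `\u` escapes. A minimal sketch in Python, assuming the response body is already available as a string; `unescape_json` is a hypothetical helper name and not part of any WordPress or JSON_API interface:

```python
import json


def unescape_json(body: str) -> str:
    """Re-serialise a JSON document with non-ASCII characters written literally."""
    # Parsing turns every \uXXXX escape back into the character it stands for.
    data = json.loads(body)
    # ensure_ascii=False keeps those characters as-is when writing them out again.
    return json.dumps(data, ensure_ascii=False)


# Fragment of the response quoted in the ticket description.
body = r'{"diadiemhoc":"H\u00e0 N\u1ed9i","ma_diadiemhoc":"HN"}'
print(unescape_json(body))
```

The round trip does not change the data, only how the characters are written.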

#4 follow-up: @rmccue
13 months ago

  • Keywords close added; dev-feedback removed

Using the Unicode-encoded format ensures the greatest interoperability with all clients, which is part of the reason it's the default encoding used internally by PHP. The Unicode-escaped format is a valid part of JSON, and client decoders should correctly handle this.

Apart from not being as nice when you're viewing the raw JSON data, the data is a valid representation of the data (that is, it's not a lossy output format) and decoders should correctly understand these characters.

I'm in favour of closing this as wontfix.

#5 in reply to: ↑ 4 @subrataemfluence
13 months ago

@rmccue I agree with you.


#6 @jnylen0
12 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to wontfix
  • Status changed from new to closed

The Unicode-escaped format is a valid part of JSON, and client decoders should correctly handle this.

This is the key here... any library or service that claims to understand JSON, but doesn't like \u escaping, is broken.

#7 @hwgehring
3 months ago

  • Resolution wontfix deleted
  • Severity changed from normal to major
  • Status changed from closed to reopened

@subrataemfluence @rmccue @jnylen0

This is actually a major problem and tagging as wontfix sends a very concerning message about how legitimate this API actually is.

Unicode-escaped text is only valid JSON in certain exceptional cases. Whether the language itself deviates away from the standard by defaulting to escaped unicode is irrelevant. Whether API consumers can/can't/won't/don't handle unicode-escaped characters is irrelevant.

https://tools.ietf.org/html/rfc8259#section-8.1

As it stands the JSON response returned from the WP REST API is unusable, incorrectly tagged as UTF-8 charset, and conforms to no standard at all.

#8 @jnylen0
3 months ago

  • Resolution set to invalid
  • Severity changed from major to normal
  • Status changed from reopened to closed

@hwgehring sorry, but your claim that "Unicode-escaped text is only valid JSON in certain exceptional cases" is a misreading of the RFC you quoted.

From https://tools.ietf.org/html/rfc8259#section-7:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence ... Alternatively, there are two-character sequence escape representations of some popular characters.

So even the string "\u0048\u0065\u006c\u006c\u006f" is perfectly valid JSON, equivalent to "Hello". Proof: https://jsonlint.com/?json=%22\u0048\u0065\u006c\u006c\u006f%22

Again, if this is an issue for you, it means the library or technique you are using to parse JSON is incorrect.
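
The equivalence is easy to check with any conforming parser. As a small sketch, here is the same test with Python's standard `json` module (any other compliant decoder should behave the same way):

```python
import json

# The string from the comment above, written entirely with \u escapes.
escaped = r'"\u0048\u0065\u006c\u006c\u006f"'
plain = '"Hello"'

# Both documents decode to the same five-character string.
assert json.loads(escaped) == json.loads(plain) == "Hello"

# The same applies to a value from the ticket description: the escapes
# decode to the literal Vietnamese characters.
decoded = json.loads(r'"H\u00e0 N\u1ed9i"')
print(decoded)
```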


#9 @rmccue
2 months ago

There are two separate parts really: valid characters, and normalisation. As §7 notes:

All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

That is, all characters except for those characters are valid. However, the spec also notes:

Any character may be escaped.

The normalisation part is covered in §8.3 (String Comparison).

If any character apart from those characters is escaped, that's still valid JSON; it's simply denormalised. Your JSON parser should normalise this data when it parses it.
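
The distinction between characters that must be escaped and characters that merely may be escaped can be sketched as follows. This uses Python's standard `json` module as a stand-in encoder and decoder; the sample string `text` is made up for the illustration.

```python
import json

# A string mixing must-escape characters (quotation mark, reverse solidus,
# a tab control character) with one that is merely allowed to be escaped (à).
text = 'say "hi" \\ \t à'

# Literal mode: only the must-escape characters are escaped.
literal = json.dumps(text, ensure_ascii=False)
# Default mode: à is additionally written as a \u escape.
escaped = json.dumps(text)

print(literal)
print(escaped)

# Parsing either form yields the same string, which is the normalisation step.
assert json.loads(literal) == json.loads(escaped) == text
```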
