WordPress.org

Make WordPress Core

Opened 5 months ago

Closed 4 months ago

#40560 closed defect (bug) (wontfix)

REST API: Unicode characters are escaped in the response

Reported by: rilwis
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 4.7
Component: REST API
Keywords: close
Focuses: rest-api
Cc:

Description

When working with languages that use Unicode characters (Vietnamese, to be precise), the result returned by REST API requests contains escaped characters.

This is an example:

{"hoten":"Test ko dien thong tin thi sinh","email":"","sodienthoai":"123123","nganhhoc":"CNTT - \u1ee8ng d\u1ee5ng ph\u1ea7n m\u1ec1m","ma_nganhhoc":"CNTT","diadiemhoc":"H\u00e0 N\u1ed9i","ma_diadiemhoc":"HN","tongtien":8925000}

The expected result is:

{"hoten":"Test ko dien thong tin thi sinh","email":"","sodienthoai":"123123","nganhhoc":"CNTT - Ứng dụng phần mềm","ma_nganhhoc":"CNTT","diadiemhoc":"Hà Nội","ma_diadiemhoc":"HN","tongtien":8925000}

Looking at the WP_REST_Server::serve_request function, I see that it uses the wp_json_encode function. While this function accepts options, they are never used. In this case, the option JSON_UNESCAPED_UNICODE is all we need.

The option was only added in PHP 5.4, so we can't guarantee it works in every environment. Still, the REST API should provide an option so that developers have a way to get unescaped output.

Change History (6)

#1 @subrataemfluence
5 months ago

To my understanding, the current behavior is reasonable enough, especially when we deal with characters like the double quote ("), forward slash (/), single quote ('), etc. If such characters are not escaped, they can cause problems when the response is parsed.

For example I have a post with the following content:

<p>It was raining when I reached the "Amritsar ISBT" at around 9:30am </p>

When serialized to JSON by the WP REST API, it becomes:

"content":{"rendered":"<p>It was raining when I reached the &#8220;Amritsar ISBT&#8221; at around 9:30am...<\/p>"}

which is reasonable, because unescaped double quotes would be misinterpreted by the parser.

So I think it is OK for Unicode characters to be escaped upfront.
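
Two different escaping layers seem to be mixed together in the example above, so a small sketch may help separate them. It is written in Python purely for illustration (not WordPress code) and uses a shortened copy of the fragment. The `&#8220;`/`&#8221;` sequences are HTML entities that are already present in the rendered content, while `\/` is the only JSON-level escape in that string.

```python
import html
import json

# Shortened copy of the "rendered" fragment from the comment above.
fragment = r'{"rendered":"<p>the &#8220;Amritsar ISBT&#8221; at 9:30am<\/p>"}'

# JSON decoding undoes only the JSON escape: \/ becomes /
rendered = json.loads(fragment)["rendered"]

# The numeric entities are HTML, so they survive JSON decoding
# and need a separate HTML-unescaping step.
text = html.unescape(rendered)

print(rendered)
print(text)
```

In other words, the curly quotes are not a product of JSON encoding, so this example does not by itself show that `\u` escaping is needed for safe parsing.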

#2 @johnbillion
5 months ago

  • Focuses rest-api added
  • Keywords dev-feedback added
  • Version changed from trunk to 4.7

#3 @lovelucy
5 months ago

I have encountered this one too. In my use case, a translation service translates my blog into another language. The WordPress API output uses the "\u" representation for non-ASCII characters, which breaks my translation service.

Since the "UTF-8" encoding is declared via the charset parameter of the "Content-Type" header, we can safely output non-"\u" representations. In fact, I found that the legacy JSON_API plugin supports output-modifying arguments:

Setting json_unescaped_unicode will replace unicode-escaped characters with their unescaped equivalents (e.g., \u00e1 becomes á)

Hope there's someone who can patch this. Thanks :)
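
As long as the server output stays escaped, one possible client-side workaround is to decode the response and re-serialize it with literal characters before handing it to a service that cannot cope with `\u` sequences. The following is a minimal sketch in Python, not part of WordPress; the helper name `unescape_unicode_json` is hypothetical, and the sample payload is a shortened copy of the one in the ticket description.

```python
import json


def unescape_unicode_json(raw: str) -> str:
    """Re-serialize a JSON document so non-ASCII characters appear literally.

    Hypothetical helper for illustration only.
    """
    data = json.loads(raw)
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX.
    return json.dumps(data, ensure_ascii=False, separators=(",", ":"))


# Shortened sample of the escaped REST API output from the description.
raw_response = r'{"diadiemhoc":"H\u00e0 N\u1ed9i","ma_diadiemhoc":"HN"}'
readable = unescape_unicode_json(raw_response)
print(readable)
```

The re-serialized text must then be transmitted as UTF-8 for the literal characters to survive.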

#4 follow-up: @rmccue
5 months ago

  • Keywords close added; dev-feedback removed

Using the Unicode-encoded format ensures the greatest interoperability with all clients, which is part of the reason it's the default encoding used internally by PHP. The Unicode-escaped format is a valid part of JSON, and client decoders should correctly handle this.

Apart from not being as nice when you're viewing the raw JSON data, the data is a valid representation of the data (that is, it's not a lossy output format) and decoders should correctly understand these characters.

I'm in favour of closing this as wontfix.
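
The claim that the escaped output is not lossy can be checked with any conforming JSON decoder. As a small sketch (Python is used here only as an example client, with values shortened from the ticket description), both spellings decode to the same data:

```python
import json

# Escaped form, as the REST API returns it.
escaped = r'{"diadiemhoc":"H\u00e0 N\u1ed9i","tongtien":8925000}'
# Literal form, as requested in the ticket description.
literal = '{"diadiemhoc":"Hà Nội","tongtien":8925000}'

decoded_escaped = json.loads(escaped)
decoded_literal = json.loads(literal)

# A conforming decoder yields identical data for both inputs.
same = decoded_escaped == decoded_literal
print(same)
```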

#5 in reply to: ↑ 4 @subrataemfluence
5 months ago

@rmccue I agree with you.

#6 @jnylen0
4 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to wontfix
  • Status changed from new to closed

"The Unicode-escaped format is a valid part of JSON, and client decoders should correctly handle this."

This is the key here... any library or service that claims to understand JSON, but doesn't like \u escaping, is broken.
