#40560 closed defect (bug) (invalid)
REST API: Unicode characters are escaped in the response
Reported by: | rilwis | Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 4.7 |
Component: | REST API | Keywords: | close |
Focuses: | rest-api | Cc: |
Description
When working with languages that use Unicode characters (Vietnamese, to be precise), the result returned by REST API requests contains escaped characters.
This is an example:
{"hoten":"Test ko dien thong tin thi sinh","email":"","sodienthoai":"123123","nganhhoc":"CNTT - \u1ee8ng d\u1ee5ng ph\u1ea7n m\u1ec1m","ma_nganhhoc":"CNTT","diadiemhoc":"H\u00e0 N\u1ed9i","ma_diadiemhoc":"HN","tongtien":8925000}
The expected result is:
{"hoten":"Test ko dien thong tin thi sinh","email":"","sodienthoai":"123123","nganhhoc":"CNTT - Ứng dụng phần mềm","ma_nganhhoc":"CNTT","diadiemhoc":"Hà Nội","ma_diadiemhoc":"HN","tongtien":8925000}
When looking at the WP_REST_Server::serve_request function, I see that it uses the wp_json_encode function. While this function accepts options, they're never used. In this case, the option JSON_UNESCAPED_UNICODE is all we need.
The option was added in PHP 5.3, so we can't guarantee it works in every situation. Still, we should add an option to the REST API so developers have a way to accomplish this.
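To make the difference between the two outputs concrete, here is a minimal sketch in Python rather than PHP (an illustrative assumption; the ticket itself is about wp_json_encode). Python's `ensure_ascii=False` plays roughly the role that JSON_UNESCAPED_UNICODE plays for PHP's json_encode. The `record` dict is a cut-down copy of the example response above.

```python
import json

# Cut-down copy of the example response from the description.
record = {"nganhhoc": "CNTT - Ứng dụng phần mềm", "diadiemhoc": "Hà Nội"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are, similar in spirit
# to passing JSON_UNESCAPED_UNICODE to PHP's json_encode.
unescaped = json.dumps(record, ensure_ascii=False)

print(escaped)
print(unescaped)

# Both serialisations decode to exactly the same data.
assert json.loads(escaped) == json.loads(unescaped) == record
```

The two strings differ only in how they look on the wire; a conforming decoder returns the same values for both.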
Change History (11)
#2
@
7 years ago
- Focuses rest-api added
- Keywords dev-feedback added
- Version changed from trunk to 4.7
#3
@
7 years ago
I have encountered this one too. In my use case, a translation service translates my blog into another language. The WordPress API output uses the "\u" representation for non-ASCII characters, which breaks my translation service.
Since the "UTF-8" encoding is announced via the charset parameter of the "Content-Type" header, we can safely output non-"\u" representations. In fact, I found that the legacy JSON_API plugin supports output-modifying arguments:
Setting json_unescaped_unicode will replace unicode-escaped characters with their unescaped equivalents (e.g., \u00e1 becomes á)
Hope there's someone who can patch this. Thanks :)
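In the meantime, a consumer that cannot cope with escapes could re-serialise the response itself before handing it on. The sketch below is in Python and is only an illustration: `unescape_json_text` is a hypothetical helper name, and `body` is a made-up sample, not a real API response.

```python
import json


def unescape_json_text(body: str) -> str:
    """Decode JSON text and re-encode it without \\uXXXX escapes."""
    return json.dumps(json.loads(body), ensure_ascii=False)


# Made-up sample; the raw string keeps the backslash escape intact.
body = r'{"title": "\u00e1rbol"}'
result = unescape_json_text(body)
print(result)
```

This costs one extra decode and encode per response, but it needs no change on the WordPress side.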
#4
follow-up:
↓ 5
@
7 years ago
- Keywords close added; dev-feedback removed
Using the Unicode-encoded format ensures the greatest interoperability with all clients, which is part of the reason it's the default encoding used internally by PHP. The Unicode-escaped format is a valid part of JSON, and client decoders should correctly handle this.
Apart from not being as nice when you're viewing the raw JSON data, the data is a valid representation of the data (that is, it's not a lossy output format) and decoders should correctly understand these characters.
I'm in favour of closing this as wontfix.
#5
in reply to:
↑ 4
@
7 years ago
@rmccue I agree with you.
Replying to rmccue:
Using the Unicode-encoded format ensures the greatest interoperability with all clients, which is part of the reason it's the default encoding used internally by PHP. The Unicode-escaped format is a valid part of JSON, and client decoders should correctly handle this.
Apart from not being as nice when you're viewing the raw JSON data, the data is a valid representation of the data (that is, it's not a lossy output format) and decoders should correctly understand these characters.
I'm in favour of closing this as wontfix.
#6
@
7 years ago
- Milestone Awaiting Review deleted
- Resolution set to wontfix
- Status changed from new to closed
The Unicode-escaped format is a valid part of JSON, and client decoders should correctly handle this.
This is the key here... any library or service that claims to understand JSON, but doesn't like \u escaping, is broken.
#7
@
7 years ago
- Resolution wontfix deleted
- Severity changed from normal to major
- Status changed from closed to reopened
@subrataemfluence @rmccue @jnylen0
This is actually a major problem and tagging as wontfix sends a very concerning message about how legitimate this API actually is.
Unicode-escaped text is only valid JSON in certain exceptional cases. Whether the language itself deviates away from the standard by defaulting to escaped unicode is irrelevant. Whether API consumers can/can't/won't/don't handle unicode-escaped characters is irrelevant.
https://tools.ietf.org/html/rfc8259#section-8.1
As it stands the JSON response returned from the WP REST API is unusable, incorrectly tagged as UTF-8 charset, and conforms to no standard at all.
#8
@
7 years ago
- Resolution set to invalid
- Severity changed from major to normal
- Status changed from reopened to closed
@hwgehring sorry, but your claim that "Unicode-escaped text is only valid JSON in certain exceptional cases" is a misreading of the RFC you quoted.
From https://tools.ietf.org/html/rfc8259#section-7:
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence ... Alternatively, there are two-character sequence escape representations of some popular characters.
So even the string "\u0048\u0065\u006c\u006c\u006f" is perfectly valid JSON, equivalent to "Hello". Proof: https://jsonlint.com/?json=%22\u0048\u0065\u006c\u006c\u006f%22
Again, if this is an issue for you, it means the library or technique you are using to parse JSON is incorrect.
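The same check can be run locally with any conforming parser. A minimal sketch using Python's standard `json` module (the choice of Python is an assumption for illustration; any JSON library should behave the same way):

```python
import json

# Raw string, so the backslashes reach the JSON parser untouched.
escaped = r'"\u0048\u0065\u006c\u006c\u006f"'
plain = '"Hello"'

decoded_escaped = json.loads(escaped)
decoded_plain = json.loads(plain)

print(decoded_escaped)

# The fully escaped form and the plain form decode to the same string.
assert decoded_escaped == decoded_plain == "Hello"
```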
#9
@
7 years ago
There are two separate parts really: valid characters, and normalisation. As §7 notes:
All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
That is, all characters except for those characters are valid. However, the spec also notes:
Any character may be escaped.
The normalisation part is covered in §8.3 (String Comparison).
If any character other than those that must be escaped is escaped anyway, that's still valid JSON; it's simply denormalised. Your JSON parser should normalise this data when parsing it.
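The distinction between "must be escaped" and "may be escaped" can be sketched with Python's standard `json` module (Python is used here only for illustration). The variable names are hypothetical.

```python
import json

# A literal tab (U+0009) inside a JSON string is a control character
# that MUST be escaped, so a strict parser rejects it.
try:
    json.loads('"a\tb"')  # Python turns \t into a real tab before parsing
    raw_control_accepted = True
except json.JSONDecodeError:
    raw_control_accepted = False

# The escaped form of the same character is accepted.
tab_value = json.loads(r'"a\u0009b"')

# For other characters, escaping is optional. Here the first JSON text
# contains the literal character U+00E0, the second its \u escape.
literal_form = json.loads('"H\u00e0"')
escaped_form = json.loads(r'"H\u00e0"')

# After parsing, both forms are the same string.
assert literal_form == escaped_form
```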
#10
@
5 years ago
Hey there,
I totally understand your arguments on this issue. And from a purely technical point of view, it doesn't matter at all whether the characters are escaped or not.
Unfortunately there are some not-so-technical clients out there for whom the JSON looks like an error with those escaped characters.
As the reporter @rilwis already pointed out, the option JSON_UNESCAPED_UNICODE has been available since PHP 5.3, and the WordPress function wp_json_encode also accepts options to be passed to json_encode. So it would be great to be able to set the options that WP_REST_Server uses in its serve_request method :) Even a filter would be enough for this purpose.
To my understanding this is reasonable enough, especially when we deal with characters like the double quote ("), forward slash (/), single quote (') etc. If such characters are not escaped, they might cause problems when parsing.
For example I have a post with the following content:
<p>It was raining when I reached the "Amritsar ISBT" at around 9:30am </p>
When JSON-ified by the WP REST API, it becomes
"content":{"rendered":"<p>It was raining when I reached the “Amritsar ISBT” at around 9:30am...<\/p>"}
which is reasonable because unescaped double quotes would be misinterpreted by the parser.
So I think it is OK for Unicode characters to be escaped upfront.