Opened 3 years ago
Last modified 22 months ago
#55189 new defect (bug)
Automatic removal of "Zero-width non-joiner" in URL
Reported by: | man4toman | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | critical | Version: | 5.9 |
Component: | Permalinks | Keywords: | 2nd-opinion has-patch changes-requested needs-unit-tests |
Focuses: | Cc: |
Description
There is a big problem in recent version of WordPress, "Zero-width non-joiner" in the URLs is removed automatically.
This will have a very bad effect on SEO (and has affected many sites) because the page with "Zero-width non-joiner" in their URL will goes to be 404.
Change History (20)
#2
@
3 years ago
- Milestone changed from Awaiting Review to 5.9.2
If I'm reading [51984] correctly, those replacements only run in the save
context, so should only apply to new posts and should not affect any existing slugs.
This is in line with how similar replacements were added in the past, without breaking any URLs. So a deeper investigation would be appreciated here. Do the URLs start working if that commit is reverted?
Related: #55117.
#3
@
3 years ago
More information:
In older version of WordPress, I published a post with slug "تست-نیمفاصله".
After upgrade to new version, I go to edit/update my post, then the slug automatically changed to "تست-نیمفاصله".
#4
@
3 years ago
Thanks for the follow-up!
I was able to reproduce the issue with the steps from comment:3, and [51984] indeed appears to be the related changeset. It's worth noting this only happens with pages, because the old slug redirect feature does not work for them yet, see #4328 for that. In testing with posts, the old slug does redirect as expected to the new slug.
Related: #50924, which is apparently now fixed as of [51984] / #47912.
I'm not quite sure how to proceed here, as #50924 suggested this character should be removed from URLs, while this one suggests that it should not (if I understand correctly). Maybe fixing #4328 would be the way to go here, so that the old slug redirect feature works for pages too?
#5
follow-up:
↓ 6
@
3 years ago
#50924 focused on the unsightly encoding of ZWNJ in the sample paths (e.g. /پیامک%e2%80%8cها/
). The update provided by #47912 addresses that, but does so in a heavy-handed way: it strips them all out, disregarding context.
As @SergeyBiryukov noted, #4328 has the potential to address the SEO/redirect concern reported. Though the age of that ticket isn't encouraging 😔 A simpler short-term fix could help impacted site owners.
Also, apart from the [important] redirect concern, is there a concern about how this change to slugs may impact the "meaning" of paths in URLs? How important is it that the intended word(s) be maintained in the slug? It's suggested that keywords in URLs these days don't impact SERPs, but that doesn't mean the words in paths should be inaccurate.
For example, to expand on @man4toman's report, removing ZWNJ from slugs (or a -
replacement) can change the meaning of the words used*:
"Word"
کلمهای
Removal of ZWNJ:
"Cabbages"
کلمهای
In a post:
Post title is changed to a different word in slug.
This can also impact categories and tags, for example:
Potentially misleading category slug.
(*My apologies to Persian speakers for any translation inaccuracies above 😅. I used DuckDuckGo's translator for the meanings to illustrate this point.)
Related: According to Iran's NIC, IRNIC, the use of ZWNJ is acceptable in Persian domain names. (Sample strings above were taken from this site.)
I couldn't find domain rules related to ZWNJ in Arabic, but understand its use is less common than in Persian.
#6
in reply to:
↑ 5
@
3 years ago
Replying to ironprogrammer:
Also, apart from the [important] redirect concern, is there a concern about how this change to slugs may impact the "meaning" of paths in URLs? How important is it that the intended word(s) be maintained in the slug? It's suggested that keywords in URLs these days don't impact SERPs, but that doesn't mean the words in paths should be inaccurate.
Good point, we might need to check the locale and keep the character in certain locales if it changes the meaning of a post slug.
#7
@
3 years ago
- Milestone changed from 5.9.2 to 5.9.3
Moving to milestone 5.9.3 since we're about to release 5.9.2.
This ticket was mentioned in Slack in #core by audrasjb. View the logs.
3 years ago
#9
@
3 years ago
- Milestone changed from 5.9.3 to 5.9.4
Given 5.9.3 is supposed to be released soon, and as this ticket is still waiting for a patch, let's move it to milestone 5.9.4.
This ticket was mentioned in Slack in #core by mike. View the logs.
3 years ago
This ticket was mentioned in PR #2664 on WordPress/wordpress-develop by emojized.
3 years ago
#11
- Keywords has-patch added
…d "Auflage" https://core.trac.wordpress.org/ticket/55189
Trac ticket:
#13
@
3 years ago
- Milestone changed from 5.9.4 to 6.1
Moving this ticket to next major release since it wasn't addressed during this cycle. Anyone is welcome to move it back to 6.0.x minor releases cycle if a patch is ready to ship.
#14
@
2 years ago
I suggest making use of the old-save
context to avoid this issue. See source code reference.
However, I think this is probably too late for the 6.1 release now that it is in beta.
#15
@
2 years ago
- Milestone changed from 6.1 to 6.2
It looks like the patch needs an update to address comment:6 and comment:14. Including some unit tests would also be a good idea.
With 6.1 RC1 coming today, moving to 6.2 for now.
This ticket was mentioned in Slack in #core by costdev. View the logs.
22 months ago
#17
@
22 months ago
- Keywords changes-requested needs-unit-tests added
Marking for:
changes-requested
to update to address comment:6 and comment:14.needs-unit-tests
to note it needs unit tests.
This ticket was mentioned in Slack in #core by costdev. View the logs.
22 months ago
#19
@
22 months ago
@SergeyBiryukov @hellofromTonya @peterwilsoncc I'm actually thinking that (most of?) [51984] should be reverted. Removing accents is one thing, but removing or replacing characters that change the meaning of the content is pretty much "tampering" with the user content. Afaik several languages use such non-visible characters and they are integral part of writing.
To me it seems that the problem in #47912 that was solved with [51984] is far less serious. Generally it is quite easy to see when there are unneeded or encoded characters (after copy/paste) in the post slug and to edit the title or the slug. On the other hand enforcing removal of characters that the users have typed is not appropriate.
Introduced in 51984.
Pinging @audrasjb as 5.9.1 Core Tech Lead for
2nd-opinion
on whether or not this needs to be fixed.If so, I have a patch ready.