Make WordPress Core

Opened 3 years ago

Last modified 22 months ago

#55189 new defect (bug)

Automatic removal of "Zero-width non-joiner" in URL

Reported by: man4toman's profile man4toman Owned by:
Milestone: Future Release Priority: normal
Severity: critical Version: 5.9
Component: Permalinks Keywords: 2nd-opinion has-patch changes-requested needs-unit-tests
Focuses: Cc:

Description

There is a big problem in recent version of WordPress, "Zero-width non-joiner" in the URLs is removed automatically.

This will have a very bad effect on SEO (and has affected many sites) because the page with "Zero-width non-joiner" in their URL will goes to be 404.

Change History (20)

#1 @costdev
3 years ago

  • Keywords 2nd-opinion added
  • Version set to 5.9

Introduced in 51984.

Pinging @audrasjb as 5.9.1 Core Tech Lead for 2nd-opinion on whether or not this needs to be fixed.

If so, I have a patch ready.

#2 @SergeyBiryukov
3 years ago

  • Milestone changed from Awaiting Review to 5.9.2

If I'm reading [51984] correctly, those replacements only run in the save context, so should only apply to new posts and should not affect any existing slugs.

This is in line with how similar replacements were added in the past, without breaking any URLs. So a deeper investigation would be appreciated here. Do the URLs start working if that commit is reverted?

Related: #55117.

Last edited 3 years ago by SergeyBiryukov (previous) (diff)

#3 @man4toman
3 years ago

More information:

In older version of WordPress, I published a post with slug "تست-نیم‌فاصله".
After upgrade to new version, I go to edit/update my post, then the slug automatically changed to "تست-نیمفاصله".

Last edited 3 years ago by man4toman (previous) (diff)

#4 @SergeyBiryukov
3 years ago

Thanks for the follow-up!

I was able to reproduce the issue with the steps from comment:3, and [51984] indeed appears to be the related changeset. It's worth noting this only happens with pages, because the old slug redirect feature does not work for them yet, see #4328 for that. In testing with posts, the old slug does redirect as expected to the new slug.

Related: #50924, which is apparently now fixed as of [51984] / #47912.

I'm not quite sure how to proceed here, as #50924 suggested this character should be removed from URLs, while this one suggests that it should not (if I understand correctly). Maybe fixing #4328 would be the way to go here, so that the old slug redirect feature works for pages too?

#5 follow-up: @ironprogrammer
3 years ago

#50924 focused on the unsightly encoding of ZWNJ in the sample paths (e.g. /پیامک%e2%80%8cها/). The update provided by #47912 addresses that, but does so in a heavy-handed way: it strips them all out, disregarding context.

As @SergeyBiryukov noted, #4328 has the potential to address the SEO/redirect concern reported. Though the age of that ticket isn't encouraging 😔 A simpler short-term fix could help impacted site owners.

Also, apart from the [important] redirect concern, is there a concern about how this change to slugs may impact the "meaning" of paths in URLs? How important is it that the intended word(s) be maintained in the slug? It's suggested that keywords in URLs these days don't impact SERPs, but that doesn't mean the words in paths should be inaccurate.


For example, to expand on @man4toman's report, removing ZWNJ from slugs (or a - replacement) can change the meaning of the words used*:

"Word"
کلمه‌ای

Removal of ZWNJ:

"Cabbages"
کلمهای

In a post:

https://cldup.com/0OmYevLiq0.thumb.jpg
Post title is changed to a different word in slug.

This can also impact categories and tags, for example:

https://cldup.com/5kjEmgVY0Q.png
Potentially misleading category slug.

(*My apologies to Persian speakers for any translation inaccuracies above 😅. I used DuckDuckGo's translator for the meanings to illustrate this point.)


Related: According to Iran's NIC, IRNIC, the use of ZWNJ is acceptable in Persian domain names. (Sample strings above were taken from this site.)

I couldn't find domain rules related to ZWNJ in Arabic, but understand its use is less common than in Persian.

Last edited 3 years ago by ironprogrammer (previous) (diff)

#6 in reply to: ↑ 5 @SergeyBiryukov
3 years ago

Replying to ironprogrammer:

Also, apart from the [important] redirect concern, is there a concern about how this change to slugs may impact the "meaning" of paths in URLs? How important is it that the intended word(s) be maintained in the slug? It's suggested that keywords in URLs these days don't impact SERPs, but that doesn't mean the words in paths should be inaccurate.

Good point, we might need to check the locale and keep the character in certain locales if it changes the meaning of a post slug.

#7 @audrasjb
3 years ago

  • Milestone changed from 5.9.2 to 5.9.3

Moving to milestone 5.9.3 since we're about to release 5.9.2.

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


3 years ago

#9 @audrasjb
3 years ago

  • Milestone changed from 5.9.3 to 5.9.4

Given 5.9.3 is supposed to be released soon, and as this ticket is still waiting for a patch, let's move it to milestone 5.9.4.

This ticket was mentioned in Slack in #core by mike. View the logs.


3 years ago

This ticket was mentioned in PR #2664 on WordPress/wordpress-develop by emojized.


3 years ago
#11

  • Keywords has-patch added

…d "Auflage" https://core.trac.wordpress.org/ticket/55189

Trac ticket:

#12 @theode
3 years ago

I just killed it, might be wrong but now you know where to search it!

#13 @audrasjb
3 years ago

  • Milestone changed from 5.9.4 to 6.1

Moving this ticket to next major release since it wasn't addressed during this cycle. Anyone is welcome to move it back to 6.0.x minor releases cycle if a patch is ready to ship.

#14 @peterwilsoncc
2 years ago

I suggest making use of the old-save context to avoid this issue. See source code reference.

However, I think this is probably too late for the 6.1 release now that it is in beta.

#15 @SergeyBiryukov
2 years ago

  • Milestone changed from 6.1 to 6.2

It looks like the patch needs an update to address comment:6 and comment:14. Including some unit tests would also be a good idea.

With 6.1 RC1 coming today, moving to 6.2 for now.

Last edited 2 years ago by SergeyBiryukov (previous) (diff)

This ticket was mentioned in Slack in #core by costdev. View the logs.


22 months ago

#17 @hellofromTonya
22 months ago

  • Keywords changes-requested needs-unit-tests added

Marking for:

  • changes-requested to update to address comment:6 and comment:14.
  • needs-unit-tests to note it needs unit tests.

This ticket was mentioned in Slack in #core by costdev. View the logs.


22 months ago

#19 @azaozz
22 months ago

@SergeyBiryukov @hellofromTonya @peterwilsoncc I'm actually thinking that (most of?) [51984] should be reverted. Removing accents is one thing, but removing or replacing characters that change the meaning of the content is pretty much "tampering" with the user content. Afaik several languages use such non-visible characters and they are integral part of writing.

To me it seems that the problem in #47912 that was solved with [51984] is far less serious. Generally it is quite easy to see when there are unneeded or encoded characters (after copy/paste) in the post slug and to edit the title or the slug. On the other hand enforcing removal of characters that the users have typed is not appropriate.

Last edited 22 months ago by azaozz (previous) (diff)

#20 @costdev
22 months ago

  • Milestone changed from 6.2 to Future Release

This ticket was discussed in the bug scrub and it was agreed to move this ticket to Future Release.

Note: See TracTickets for help on using tickets.