Opened 6 years ago
Last modified 3 months ago
#44958 assigned defect (bug)
  character in title generates a permalink (and slug) with space
Reported by: | ace2_heart | Owned by: | |
---|---|---|---|
Milestone: | 6.7 | Priority: | normal |
Severity: | normal | Version: | 4.8 |
Component: | Permalinks | Keywords: | needs-testing has-patch has-testing-info has-screenshots needs-unit-tests |
Focuses: | Cc: |
Description
If a title has an   character copy pasted or if the permalink has a   copy pasted
It will generate a permalink and slug with an space breaking some elements on the site
I generated an empty wordpress install without any plugins to test it, and it has the same problem
Attachments (3)
Change History (42)
#1
follow-up:
↓ 4
@
6 years ago
- Keywords reporter-feedback added
Hi @ace2_heart thanks for the ticket and welcome to WordPress Trac,
I can't reproduce the issue on a fresh install (see screen capture above). I tested it in the permalink edition and in the title field and both are correctly escaping the non-breaking space. Can you provide us some further information to reproduce the bug please?
Cheers,
Jb
#2
@
6 years ago
Hi @ace2_heart, thanks for the ticket!
is supposed to be converted to a hyphen by sanitize_title_with_dashes().
Are you sure the issue still happens on a clean install with all plugins disabled and a default theme (Twenty Seventeen) activated?
#3
@
6 years ago
Hi, I created an empty install here, you can access it
- user: miguelesquirol
- password: miguelesquirol
https://bug-testing-miguelesquirol.c9users.io/wp-ad- min
These are the pages created
The parmalink are as following:
/e -securite/
/ -test/
#4
in reply to:
↑ 1
@
6 years ago
Replying to audrasjb:
I just saw the demo you posted, instead of using the text on the title, tried to copy/paste the actual space character.
e -securite
Hi @ace2_heart thanks for the ticket and welcome to WordPress Trac,
I can't reproduce the issue on a fresh install (see screen capture above). I tested it in the permalink edition and in the title field and both are correctly escaping the non-breaking space. Can you provide us some further information to reproduce the bug please?
Cheers,
Jb
#5
follow-up:
↓ 6
@
6 years ago
Ok thanks for your answer.
So the issue happens with control marks like %e2%80%
. Indeed, I'm able to reproduce this specific issue.
By the way, you are running WP 4.8.7 which is quite an old version. But the issue also happens in 4.9.8.
Thanks,
Jb
#6
in reply to:
↑ 5
@
6 years ago
So it's a really a bug? It's my first time submitting one.
Replying to audrasjb:
Ok thanks for your answer.
So the issue happens with control marks like
%e2%80%
. Indeed, I'm able to reproduce this specific issue.
By the way, you are running WP 4.8.7 which is quite an old version. But the issue also happens in 4.9.8.
Thanks,
Jb
#8
@
6 years ago
I'm hoping to submit a patch for this ticket in the next couple of days, I have seen it pop up in other places with other characters with a clear case for being sanitized out of URIs, such as the unicode line separator.
#11
@
14 months ago
- Keywords has-patch added
Here's a patch to get around any %e2%80
constructions like:
%e2%80 %e2%80test %e2%80% %e2%80%93 %e2%80%94 %e2%80%943 %e2%80%3 %e2%80%94slug my%e2%80%94slug my%e28094slug my%e2%80%A8slug ...
While I think the regex itself is quite robust, this needs testing to fit well into the sanitize_title_with_dashes
function, but I hope it helps on the way
#12
@
14 months ago
- Keywords has-testing-info added; reporter-feedback removed
- Milestone changed from Awaiting Review to 6.4
Thanks for the patch! Moving for 6.4 consideration.
This ticket was mentioned in Slack in #core-test by oglekler. View the logs.
13 months ago
#14
@
13 months ago
Reproduction Report
This report validates that the issue cannot be reproduced.
Environment
- OS: Windows 11 (22H2)
- Web Server: nginx/1.25.1
- PHP: 7.4.33
- WordPress: 6.4-alpha-56267-src
- Browser: Chrome Version 113.0.5672.126 (Official Build) (64-bit)
- Theme: Twenty Twenty-Three
Actual Results
- ❌ Was not able to reproduce the issue. Even after modifying the title with a space, the slug seems to be okay.
Supplemental Artifacts
#15
@
13 months ago
Test Report
This report validates that the indicated issue can not be reproduced
Environment
- OS: Windows 11 Pro (22H2)
- Web Server: nginx/1.25.1
- PHP: 7.4.33
- WordPress: 6.4-alpha-56267-src
- Browser: Chrome Version 116.0.5845.98 (Official Build) (64-bit)
- Theme: Twenty Twenty-Three
Actual Results
- ❌ Was not able to reproduce the issue. Tried to edit the permalinks as guided, found nothing.
Supplemental Artifacts
#16
follow-up:
↓ 17
@
13 months ago
it can be reproduced, but not with a space alone, see comment No 5
#17
in reply to:
↑ 16
@
13 months ago
I did check that comment but didn't quite understand how to test it out. Can you please clarify how you have tested it?
Replying to Presskopp:
it can be reproduced, but not with a space alone, see comment No 5
This ticket was mentioned in Slack in #core by zunaid321. View the logs.
13 months ago
#21
@
11 months ago
I checked all cases that are mentioned in the comment#11 and the patch is working as expected, but I cannot evaluate that these are all possible cases. This patch is preventing invisible symbols to be shown in the URL, and the most accurate way to test it would have been to copy-paste text from somewhere with this invisible symbols inside while having the whole list of these symbols.
This ticket was mentioned in Slack in #core by nicolefurlan. View the logs.
11 months ago
#24
@
11 months ago
the regex is replacing %e280
and then even more sequences with an optional percent sign but none of those are valid URL-encodings. do we have any examples of any system producing those outputs? At a minimum that should be interpreted/rendered as �80
but it would be valid to simply ignore decoding / leave it as %e280
.
given that %e2
by itself is invalid UTF-8/invalid URL encoding, it's a bit easier to discuss here, but I'd be worried about consuming too much input because it looks like it shouldn't be there and then further corrupting data that's otherwise fine. in other words, if we don't have any examples of systems producing output like %e280
then it might be a good call for us to limit unintentional breakage by only replacing %e2%80
and successive sequences that all carry the percent-sign.
#25
@
11 months ago
- Milestone changed from 6.4 to 6.5
This ticket was discussed in today's bug scrub.
We're going to bump this to 6.5.
Per #comment:21, it would be great if we could also find more test cases so that we can be more confident in this fix.
Props to @hellofromTonya :)
#26
@
11 months ago
After some additional testing I was unable to reproduce through the block editor, but when attempting to paste some of these characters into the title field, another bug appeared, which I'll file in the Gutenberg repository (the contents of the title duplicate).
With the Classic Editor plugin I was able to reproduce, and I assume that API calls and manual calls from within PHP would have similarly exhibited the broken behavior.
In deeper review of this patch I do think it's worth taking out the "also %e280 to be on the safe side," though on a second thought it's not unrealistic to keep it in there, but for the different reason that "%e280" only appears when something else is already broken. This is more about the question of whether we want to make a bad situation potentially worse or leave it only as bad as it is; I generally find that leaving broken things the way they are leads to less confusion and corruption.
---
If we want to explore the broader issue for 6.5 I will volunteer my time to look into it, as the function is already fairly ad-hoc and this particular ticket seems part of the broader problem that we're hitting specific cases instead of taking a systematic or formal approach in handling the slug creation.
I've started exploring a PCRE-based approach in https://github.com/WordPress/wordpress-develop/pull/5466 that relies on Unicode character classes. If that approach works, it could simplify the function while boosting its reliability and maintainability. There are thankfully a fair number of unit tests and there's a fair bit of work to do there, but it seems tractable that we could collect a lot of currently-scattered logic and cover many bases in one go.
This ticket was mentioned in Slack in #core by chaion07. View the logs.
7 months ago
This ticket was mentioned in Slack in #core by oglekler. View the logs.
3 months ago
#30
@
3 months ago
Possibly Unit tests needs to be checked if they are covering all possible cases and then last patch can be tested as the way forward.
#31
@
3 months ago
Test Report
Patch tested 44958.diff
Environment
- WordPress: 6.6-alpha-57778-src
- PHP: 8.3.7
- Server: nginx/1.25.4
- Database: mysqli (Server: 8.3.0 / Client: mysqlnd 8.3.7)
- Browser: Chrome 125.0.0.0
- OS: macOS
- Theme: Twenty Fifteen 3.7
- MU Plugins: None activated
- Plugins:
- Classic Editor 1.6.3
- FakerPress 0.6.6
- Test Reports 1.1.0
Actual Results
- ✅ Issue resolved with patch.
Before Patch
After Patch
This ticket was mentioned in PR #5466 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#32
Trac ticket: Core-44958
From time to time new issues arise with the sanitization of a post title for slug creation. The existing algorithm builds takes a hit-and-miss approach of handling specific cases of known string elements that cause problems and replacing them with normalized characters.
In this patch a given title is first converted into a normalized form and then processed with a Unicode-aware PCRE pattern which formalizes the _kinds_ of replacements which are supposed to occur. For example, instead of removing "%c2%ab" the code now removes "invisible characters" as defined by Unicode itself.
This update, if it works without breaking existing dependencies, poses a more comprehensive solution to the problem of slug generation, one that updates with advancements to the Unicode specification provided by system libraries and PHP itself instead of through custom WordPress code.
Trac ticket:
This ticket was mentioned in Slack in #core by nhrrob. View the logs.
3 months ago
#36
@
3 months ago
Today I took some time to expand the work in #5466. The new slugify()
function is the important part to examine. Here are some thoughts:
- converting all forms of allowable input escaping into Unicode code points does indeed collapse much of the logic and eliminate by default cases where something could escape the checks, for example, because it's a numeric character reference.
- this is another potentially great use case for the
WP_Token_Map
for rewriting given characters in the title.
slugify()
runs in one single pass through the input string and its main loop represents all of the core logic at each decoded cahracter. to my head this is clearer than jumping between multiple full passes and string-replaces
There's a lot to consider, but I think this has promise, and the changes to the HTML API in WordPress 6.6 are going to make it much easier to do the conversion. Looks like we could improve Core's UTF-8 functions though.
This ticket was mentioned in Slack in #core by nhrrob. View the logs.
3 months ago
#38
@
3 months ago
- Milestone changed from 6.6 to 6.7
This is ticket is for a bug report. 44958.diff is a bugfix.
However, the current direction with this PR is an enhancement which may also resolve the bug.
In discussing during today's bug scrub, @davidbaumwald suggested creating a new ticket for the enhancement and moving PR 5466 to that new ticket:
Right, but that's a separate task. "Bug: Fix this thing with simple code change" versus "ENH: Refactor this code to be better even though simpler code fixed original issue"
All that to say, I think current ticket is OK as-is, but maybe add a comment to encourage a new ticket for the ENH part
@dmsnell want to open a new Trac ticket for the enhancement scope of work?
#39
@
3 months ago
@hellofromTonya we can open a new ticket for 6.7 if we need to, but I don't understand why we are calling PR #5466 an enhancement instead of a bug fix.
The smaller patch here proposes fixing one very specific incarnation of a broader bug, and unfortunately at the current time introduces more breakage (and so I think it would be best to ensure that if we merge it, we fix that remaining issue before doing so).
The goal of the alternative patch in the PR is to comprehensively solve the category of defects that this reported issue is a manifestation of. It's goal has been to maintain the same interface and intent of the existing code, though I created sluggify()
the new function mostly to make it possible to examine the older and newer code at the same time during review.
So I don't really care what goes on, but if we create a new ticket I feel like it should still be a bug fix, since the core issue reported here is that the slug generation does not adequately parse the inbound content and therefore all of the replacement logic is fragile and prone to only work in very specific cases. Do we classify a patch as a bug fix if it resolves 1 bug, but an enhancement if it resolves 1000?
Works for me