Make WordPress Core

Opened 16 months ago

Closed 8 months ago

Last modified 8 months ago

#55117 closed defect (bug) (fixed)

Possible 5.9 Bug: Unknown character ( or %ef%bf%bc ) on content title

Reported by: cantuaria Owned by: audrasjb
Milestone: 6.1 Priority: normal
Severity: normal Version: 5.9
Component: Permalinks Keywords: has-testing-info has-screenshots has-patch needs-testing
Focuses: Cc:

Description

Since WordPress 5.9, one of my clients, revistabula.com, started having an awkward bug. Sometimes the character ￼ (%ef%bf%bc) is automatically inserted in the post title after a publish or update using the visual editor.

He uses a different environment from mine (Chrome on a Mac), so I couldn't reproduce the issue myself. As such, I don't know if this is a bug in 5.9 itself or in a plugin. In any case, I've fixed the issue with a simple string replace on the wp_insert_post_data filter.

But yesterday, one of the blogs I follow published an article containing the same bug: https://macmagazine.com.br/post/2022/02/07/promocoes-na-app-store-swim-out-911-operator-navidys-dislexia-spelling-help-e-mais%ef%bf%bc/ . It's a Mac-centered blog, so I suppose they also use a Mac for editing (P.S.: not sure if they will fix it).

I then searched Google for this issue and found a few complaints from a couple of years ago, related to the Yoast SEO plugin, but none recent. I then searched Google for that symbol on *.wordpress.* domains, for the whole of 2021 and for the last few weeks: no results in 2021, 3 pages of results for the last week.

Is anyone else having this issue, or can someone test on a Mac to try to reproduce it? Also, I'm not sure if this should be filed against Gutenberg instead, as I don't know whether the issue happens in the old editor as well.

Attachments (1)

55117.png (69.6 KB) - added by SergeyBiryukov 16 months ago.


Change History (38)

#1 @costdev
16 months ago

  • Keywords reporter-feedback needs-testing added

@cantuaria Thanks for opening this ticket.

I was unable to reproduce this issue on a Brazilian Portuguese website, with and without Yoast SEO. However, I'm on Windows and you mentioned that it may be an issue when editing on Mac.

Can you post any plugins / themes that exist on both websites, and any other information that might be helpful in trying to reproduce the issue?

Thanks!

Last edited 16 months ago by costdev

#2 @audrasjb
16 months ago

I wasn't able to reproduce the issue either.
The steps I followed:

I'm using MacOS/Chrome.

Last edited 16 months ago by audrasjb

#3 @audrasjb
16 months ago

Ah, Sergey has the answer. There is a hidden character in the title. This often occurs when copy-pasting from PDFs or other documents :)

#4 @SergeyBiryukov
16 months ago

%ef%bf%bc appears to be an object replacement character in UTF-8, used as a placeholder in text for an otherwise unspecified object.
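To check what the percent-encoded sequence above stands for, it can be decoded as UTF-8 and looked up in the Unicode database. This is a minimal Python sketch for illustration only; the `describe` helper is a hypothetical name and is not part of WordPress.

```python
import unicodedata
from urllib.parse import unquote


def describe(percent_encoded: str) -> str:
    """Decode a percent-encoded UTF-8 sequence and name each code point."""
    text = unquote(percent_encoded)  # unquote() decodes as UTF-8 by default
    return ", ".join(f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text)


print(describe("%ef%bf%bc"))  # U+FFFC OBJECT REPLACEMENT CHARACTER
```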

Since it does not add any significant value to the post title, we should be able to just remove it from the URL in sanitize_title_with_dashes(), like we already do with a bunch of other characters.

That said, looking at the example above, it is also displayed in the actual post title: 55117.png, and we don't have a precedent of filtering out characters like that from there. Maybe removing it from the URL would be enough?

#5 @costdev
16 months ago

  • Keywords reporter-feedback needs-testing removed

I agree that we should add %ef%bf%bc to the " Strip these characters entirely." list in sanitize_title_with_dashes().
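The "strip these characters entirely" idea can be sketched roughly as follows. This is a Python illustration, not the PHP code of sanitize_title_with_dashes(); the `STRIP_ENTIRELY` tuple and the `strip_from_slug` helper are hypothetical, and the list is deliberately abbreviated.

```python
from urllib.parse import quote

# Hypothetical, abbreviated strip list (assumption: not WordPress's actual list).
STRIP_ENTIRELY = ("%e2%80%8b", "%e2%80%aa", "%ef%bf%bc")


def strip_from_slug(title: str) -> str:
    """Percent-encode a lowercased title, then drop unwanted sequences."""
    encoded = quote(title.lower(), safe="").lower()
    for sequence in STRIP_ENTIRELY:
        encoded = encoded.replace(sequence, "")
    # Turn encoded spaces into dashes and trim dashes left at the edges.
    return encoded.replace("%20", "-").strip("-")


print(strip_from_slug("Test Trac 55117\ufffc"))  # test-trac-55117
print(strip_from_slug("Bücher"))                 # b%c3%bccher
```

Only the listed sequences are removed; other percent-encoded characters, such as the encoded ü, pass through unchanged.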

A good-first-bug candidate perhaps?

#6 @cantuaria
16 months ago

Hey @costdev, yeah, I thought a lot before opening this ticket, since I don't have much information to share and couldn't reproduce the error myself... I was hoping that someone here was also facing a similar issue. I don't have access to the second site mentioned, I just follow their RSS... As for my client, I have other websites which share the same plugins and theme structure, but I believe he's the only one using a Mac.

My client claims he's typing the title rather than copying and pasting. During my tests I saw that the editor doesn't even allow pasting that character into the title field in my environment. My client also claims that he noticed a different font family in the editor since the last update, but I didn't find any reference to it, nor did I see any difference.

@SergeyBiryukov Adding it to the filter should be enough for the URL, but the character will still be displayed in titles. I used the string replace in both the 'sanitize_title' and 'wp_insert_post_data' filters to fix the issue entirely for my client.
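The string replace described above is not shown in the ticket. As an assumption about its general shape, the title-level part amounts to removing the raw character before the title is stored. The Python sketch below only illustrates that string operation; `clean_title` is a hypothetical helper, and the actual workaround would live in PHP filter callbacks.

```python
OBJECT_REPLACEMENT = "\ufffc"  # the character that shows up as %ef%bf%bc in URLs


def clean_title(title: str) -> str:
    """Remove object replacement characters and trim leftover whitespace."""
    return title.replace(OBJECT_REPLACEMENT, "").strip()


print(clean_title("Promoções na App Store\ufffc"))  # Promoções na App Store
```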

#7 @costdev
16 months ago

@cantuaria One way I've been able to consistently paste the character is:

  1. Enter a title.
  2. Move to the start of the title.
  3. Paste the character.
  4. Publish.
  5. View the post. The character should show.

#8 @archon810
16 months ago

Not sure if this is related, but the timing is very suspect. We have a custom taxonomy, and it seems some slugs randomly got changed to include these weird characters all of a sudden. For example, clubhouse-drop-in-audio-cha%e2%80%aat. For some reason this also broke the URLs that use those slugs, which now return 404. Another variant: accu%e2%80%8bbattery.

So where are these %e2%80%aa and %e2%80%8b characters coming from all of a sudden? And for published taxonomies no less, not new posts.

#9 @SergeyBiryukov
16 months ago

Just noting that the %e2%80%aa and %e2%80%8b characters are different from the one above:

As of [51984] / #47912, WordPress already strips the former from URLs, but not the latter. Related: #50924.

Both seem common enough when copying titles from other sources, but I don't think this is something new in 5.9.
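To see how the two sequences differ from %ef%bf%bc, each can be decoded and looked up in the Unicode database. This Python sketch is illustrative only; the `rows` list is a hypothetical name used here to collect the results.

```python
import unicodedata
from urllib.parse import unquote

rows = []
for sequence in ("%e2%80%aa", "%e2%80%8b", "%ef%bf%bc"):
    ch = unquote(sequence)  # each sequence decodes to exactly one code point
    rows.append(
        (sequence, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
    )

for row in rows:
    print(*row)
# %e2%80%aa U+202A Cf LEFT-TO-RIGHT EMBEDDING
# %e2%80%8b U+200B Cf ZERO WIDTH SPACE
# %ef%bf%bc U+FFFC So OBJECT REPLACEMENT CHARACTER
```

The first two are invisible format characters (category Cf), while the object replacement character is classified as a symbol (category So).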

#10 follow-up: @archon810
16 months ago

Digging around, it looks like the relevant commit that may have introduced these issues may be https://github.com/WordPress/WordPress/commit/43644069eab4209ef03c2faa476219849c3f6d6a from #47912.

Looking at our data, for example, here's a pair of term/slug that was saved pre-5.9:
Accu​Battery accu%e2%80%8bbattery

The term actually has E2 80 8B after "u" which explains why the slug was created like that. So 5.9 will start stripping these characters from the slug for newly created terms, which is great. But something about this change also had the effect of breaking existing urls, which now 404.
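The relationship between the hidden bytes in the term name and the resulting slug can be reproduced outside WordPress. This is a Python sketch of plain lowercasing plus percent-encoding, under the assumption that no characters are stripped; it is not WordPress's slug code.

```python
from urllib.parse import quote

term_name = "Accu\u200bBattery"  # U+200B ZERO WIDTH SPACE after the "u"

utf8_hex = term_name.encode("utf-8").hex(" ")
slug = quote(term_name.lower()).lower()

print(utf8_hex)  # 41 63 63 75 e2 80 8b 42 61 74 74 65 72 79
print(slug)      # accu%e2%80%8bbattery
```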

#11 in reply to: ↑ 10 ; follow-up: @SergeyBiryukov
16 months ago

Replying to archon810:

The term actually has E2 80 8B after "u" which explains why the slug was created like that. So 5.9 will start stripping these characters from the slug for newly created terms, which is great. But something about this change also had the effect of breaking existing urls, which now 404.

Hmm, that commit does strip these characters for newly created terms, but I'm not sure it's related to 404 errors, as the replacement only runs on saving, and is in line with how similar replacements were added in the past, without breaking any URLs. So a deeper investigation would be appreciated here. Do the URLs start working if that commit is reverted?

#12 in reply to: ↑ 11 @maciejmackowiak
16 months ago

Replying to SergeyBiryukov:

Replying to archon810:

The term actually has E2 80 8B after "u" which explains why the slug was created like that. So 5.9 will start stripping these characters from the slug for newly created terms, which is great. But something about this change also had the effect of breaking existing urls, which now 404.

Hmm, that commit does strip these characters for newly created terms, but I'm not sure it's related to 404 errors, as the replacement only runs on saving, and is in line with how similar replacements were added in the past, without breaking any URLs. So a deeper investigation would be appreciated here. Do the URLs start working if that commit is reverted?

I've been able to reproduce the issue with the following steps:

  1. create category with name Accu​Battery (with the zero width space of course) using wp 5.8.3
  2. the slug is accu%E2%80%8Bbattery
  3. go to category page /category/accu%E2%80%8Bbattery/ - it works
  4. update wp to 5.9
  5. the category page returns 404(and the link in wp-admin to view the category is also broken)
  6. downgrade to 5.8.3 and the category page works

#13 @archon810
16 months ago

Thank you for the steps, Maciej.

We are going to fix the data and remove the invisible characters from both slugs and titles, but this is probably a bug that should be fixed regardless.

#14 @SergeyBiryukov
16 months ago

  • Component changed from Editor to Permalinks
  • Milestone changed from Awaiting Review to 5.9.2

Moving to 5.9.2 to investigate the 404 issues.

Related: #55189.

#15 @audrasjb
15 months ago

  • Milestone changed from 5.9.2 to 5.9.3

Moving to milestone 5.9.3 since we're about to release 5.9.2.

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


15 months ago

#17 @audrasjb
15 months ago

  • Owner set to audrasjb
  • Status changed from new to assigned

Self-assigning for further investigation.

#18 @cantuaria
15 months ago

Just an update: looking at Google results again, the number of results containing this symbol in the URL on WordPress sites has increased substantially. There are currently almost 30 pages of results for the last week (when I created the ticket there were only 3 pages of results for the previous week). I assume this is due to more and more sites using the latest version.

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


15 months ago

#20 @audrasjb
14 months ago

  • Milestone changed from 5.9.3 to 5.9.4

This issue still needs a patch and WordPress 5.9.3 RC1 is scheduled for tomorrow.
Moving for 5.9.4 consideration.

#21 @BaneD
14 months ago

Hi, I wanted to add that we have noticed through Google Console that all of a sudden some links have a slightly different URL. For example, one of the links now has the following appended to it: "/%EF%BF%BC%EF%BF%BC%EF%BF%BCKind"

This in turn makes the page show 404 error.

The blog post was created in 2013 and the problem started to happen recently. The post is OK for something posted so long ago; however, it is not something that we have linked to any time recently. The URL seems to be created somewhere within WP, in a place unknown to us. I am guessing most likely in some archive.

Website is in en-US language only.

  • In our process we do copy and paste, since we write outside of WordPress, but it is always an unformatted paste (CTRL+SHIFT+V). Additionally, if I copy and paste this: ￼ into our editor, it is clearly shown, so it would be obvious that it was copied.

I would recommend stripping it in the permalink functions; however, that would just be fixing the symptom. It would be best to find what is adding it.

#22 @markparnell
13 months ago

  • Keywords needs-patch added

We've started seeing this on a couple of client sites recently too. I believe in each case they also copy the content across from outside WordPress, but it's definitely something new that has only been introduced since WP 5.9.

One small thing I noticed is that the  character is visible in the title in the posts list, but not in the block editor when editing an individual post. This makes it really hard to manually remove the extra character. Not sure if that helps narrow down the cause or if it's just due to a different font being used or something.

#23 @audrasjb
13 months ago

  • Milestone changed from 5.9.4 to 6.1

Moving this ticket to next major release since it wasn't addressed during this cycle. Anyone is welcome to move it back to 6.0.x minor releases cycle if a patch is ready to ship.

#24 @ironprogrammer
11 months ago

  • Keywords has-testing-info added

Testing Information

I have been able to reproduce this issue consistently. Please refer to the steps below to test in other environments.

Important note: Please view this comment/ticket in Firefox (I'm using macOS) to most easily identify the character in question. The object replacement character will be collapsed/hidden in Safari, and display as a blank space in Chrome, making following this ticket more difficult.

Copying the Character

In Firefox, this character will appear visually as https://ironprogrammer.com/wp-content/uploads/2022/06/object-replacement-character.png (image representation), or can be described as the letters OBJ inside a box with a broken outline.

The following section includes text blocks and styled headings that can be copied and used for testing and reproducing this issue.

Character in Isolation


Character Styled with Heading Tags

H1 Heading

H2 Heading

The character can also be viewed or copied from https://apps.timwhitlock.info/unicode/inspect?s=%EF%BF%BC.

If you're having trouble viewing the character, see this gist for an image of how this section appears in Firefox on macOS 12.4.

Viewing the Copied Contents

@dmsnell has provided a browser-based Clipboard Viewer utility that displays the contents of the clipboard. Take particular note of the text/html content, and how elements may wrap the character depending on the copy source.

Testing Instructions

Steps to Reproduce

  1. From WP admin navigate to Posts > Add New.
  2. Enter a title for the post.
  3. From the "Copying the Character" section above, select and copy a character from either of the H1 or H2 styled text blocks. This will copy the character with HTML as part of the pasteboard content.
  4. To verify that the character was copied with HTML, paste into the Clipboard Viewer page and refer to the text/html row. It should include a wrapping <h1> or <h2> tag (the character will appear as a blank space between tags).
  5. 🐞 In the post, paste the character at the end of the title field. An unexpected margin will appear below the title.
  6. 🐞 Move the cursor to the start of the title and paste the character again. An unexpected margin will appear above the title.
  7. Click Publish (and Publish again if pre-publish checks are enabled).
  8. 🐞 Observe that the post title is surrounded by characters.
  9. 🐞 Copy the post address and observe that the slug name is wrapped with the UTF-8 character code (%ef%bf%bc), e.g. .../%ef%bf%bctest-trac-55117%ef%bf%bc/.
  10. Optionally hover over the View Post button and observe that the preview URL (bottom left of browser) includes the Unicode character code (%uFFFC), e.g. .../%uFFFCtest-trac-55117%uFFFC/.
  11. Optionally click to view the post and observe that the URL in the address bar includes the URL-encoded characters from Step 9.
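Steps 9 and 10 show two notations for the same character: the percent-encoded UTF-8 bytes and a %u form built from the code point. The Python sketch below derives both; the variable names are illustrative, and the %u form is a non-standard notation (the kind produced by JavaScript's legacy escape() function), so why the browser preview shows it is not established here.

```python
from urllib.parse import quote

ch = "\ufffc"  # OBJECT REPLACEMENT CHARACTER

utf8_form = quote(ch).lower()                  # percent-encoded UTF-8 bytes
code_point_form = "%u{:04X}".format(ord(ch))   # non-standard %uXXXX notation

print(utf8_form)        # %ef%bf%bc
print(code_point_form)  # %uFFFC
```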

Expected Results

  • ✅ Pasting text that may include the character should not affect styling of the title field in the editor.
  • ✅ The character should not appear in the title of a published post.
  • ✅ The character should not be part of the post slug, encoded or otherwise.

Test Report Icons:
🐞 <= Indicates where issue ("bug") occurs.
✅ <= Behavior is expected.
❌ <= Behavior is NOT expected.

#25 @ironprogrammer
11 months ago

  • Keywords has-screenshots added

Reproduction Report

I was able to reproduce this issue. Observations are with Firefox browser.

Environment

  • OS: macOS 12.4
  • Server: nginx/1.23.0
  • PHP: 7.4.30
  • WordPress: v6.1-alpha-53344-src, 6.0, 5.9.3
  • Browser: Mozilla Firefox 102.0
  • Theme: twentytwentytwo v1.2
  • Gutenberg plugin: NOT active

Actual Results

  • ❌ Pasting the character that has HTML styling applied causes the editor to format the post title in unexpected ways (e.g. added margin).
  • ❌ The character appears in the title of the published post.
  • ❌ The character is encoded and becomes part of the post slug.

Supplemental Artifacts

Title without the character (clean starting point):
https://cldup.com/I2l3QpB5td.thumb.jpg

Character pasted at end:
https://cldup.com/_uU6IbFu-e.thumb.jpg

Character pasted at start:
https://cldup.com/Ml8rQ9H09y.thumb.jpg

Post Title including undesired characters:
https://cldup.com/IegDEBNXBj.thumb.jpg

Published post page, shows character represented in slug and title:
https://cldup.com/8HX3Gcia4d.thumb.jpg

#26 @dmsnell
11 months ago

Thanks for the detailed reproducibility steps @ironprogrammer. Unfortunately I think we need to track a different sequence of steps because there's a difference between intentionally entering the object-replacement character and the object-replacement character unexpectedly appearing in a post title, which I believe is the real problem tracked in this issue (but maybe I'm wrong).

So for all involved I think there's a conflation of a few different issues here:

  • Non-ASCII characters in a slug/URL are percent-encoded. This is standard practice and "necessary" if we want to represent text people enter. If my post is named "Bücher" the appropriate URL is "B%C3%BCcher". There's another practice we don't use but could, which I think deserves its own Trac ticket and eventually I would love to see us use - Punycode, where the same "Bücher" slug would become "xn--bcher-kva" but in the browser URL bar would appear as "Bücher".
  • [OBJ] characters which are stored in the database are rendered on page view. This is probably suspect enough that we should strip them out, at least for the post title. It's debatable whether this is a problem with WordPress or not because technically we could argue that if it's there in the data it should be displayed (at least it has print=yes in its Unicode properties).
  • The [OBJ] character is appearing unintentionally in post titles which generates the slugs which stand out because of the percent-encoding.
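The two encodings mentioned in the first point above can be compared directly. This Python sketch uses the standard library's percent-encoding and its built-in "idna" codec; note that the codec is defined for domain labels and lowercases its input, so applying the same idea to slugs is the suggestion being discussed here, not existing WordPress behavior.

```python
from urllib.parse import quote

percent_encoded = quote("Bücher")                     # UTF-8 bytes, percent-encoded
ace_form = "bücher".encode("idna").decode("ascii")    # Punycode with the xn-- prefix
round_trip = ace_form.encode("ascii").decode("idna")  # back to the readable form

print(percent_encoded)  # B%C3%BCcher
print(ace_form)         # xn--bcher-kva
print(round_trip)       # bücher
```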

I'd like to address the third point in #38637 if we can since it's a Gutenberg bug. The first two are decisions more for Core and maybe more appropriate for Trac. On that point I'm going to update that issue with some findings that I found while working with @ironprogrammer yesterday.

This ticket was mentioned in PR #2937 on WordPress/wordpress-develop by ironprogrammer.


11 months ago
#27

  • Keywords has-patch added; needs-patch removed

Prevents object replacement characters from being added to slugs.

Trac ticket: https://core.trac.wordpress.org/ticket/55117

#28 @ironprogrammer
11 months ago

  • Keywords needs-testing added

Thanks, @dmsnell!

Issue at Hand

So for all involved I think there's a conflation of a few different issues here:

Yes, as also suggested earlier, I agree we should limit the scope of this ticket to the impact on slugs, and file separate tickets for when this character is [usually unintentionally] stored in the title field, or the encoding update in 5.9 that was reported to cause 404s.

As for the slug issue, I've drafted PR #2937 for consideration, which removes the object replacement character () from published URLs.

To clarify what the above PR addresses, there haven't been any suggestions to modify general URL-encoding, but to only account for the object replacement character (). Other incidental URL-encoded characters (like ü) would remain unaffected.

Testing Steps

Unfortunately I think we need to track a different sequence of steps because there's a difference between intentionally entering the object-replacement character and the object-replacement character unexpectedly appearing in a post title, which I believe is the real problem tracked in this issue (but maybe I'm wrong).

Nope, you're right. Going through this ticket and the related Gutenberg issue, there has been some difficulty in consistently reproducing this problem "naturally", which is why explicit and intentional steps to inserting this character can be useful toward thinking about and reproducing the unexpected results. (I likes me a good crowbar 😉.) But your point is well taken.

That being said, further cross-browser testing has highlighted the inconsistencies between browsers for creating and observing this issue, and I've generated an updated set of reproduction steps that focuses only on the impact to slugs/URLs. Instructions to follow.

#29 @ironprogrammer
11 months ago

Testing Instructions (Updated)

In my local testing, these steps demonstrate the unintended appending of the object replacement character at the end of a published post slug. The issue was reproducible when creating a new post using Chrome, but I was unable to reproduce the problem in Safari or Firefox. More test reports are welcome!

These instructions supersede the previous testing instructions, and focus only on the unexpected effect of this character on published post slugs.

💡 Good to Know: In Chrome, the "object replacement character" will appear as a "blank" space, both in the address bar and in page content. However, in Safari and Firefox the same URL will appear with the character URL-encoded as %ef%bf%bc.

💡 Also Good to Know: How the cursor enters the title field matters. See the Additional Information section for more information.

Preparation and reproduction updates have been adapted from related testing by @dmsnell.

Steps to Reproduce (Chrome)

  1. Create a Google Doc with two lines of text, separated by a single line break. This is the source document.
  2. On your WordPress site, in WP admin navigate to Posts > Add New.
  3. Switch back to the Google Doc, and select and copy the first line. Be sure to highlight the entire line, including the "blank" space that represents the line break at the end of the line. See Figure 1.
  4. Switch back to the New Post screen.
  5. Click in the title field. See Additional Information below for why this matters.
  6. Paste in the copied text from the Google Doc.
  7. Click Publish (and Publish again if pre-publish checks are enabled).
  8. 🐞 Click View Post and observe in the address bar that the slug name ends with a hyphen and what appears to be a blank space (e.g. .../test-trac-55117- /).
  9. 🐞 Copy the address from Chrome and paste it into the address bar of Safari or Firefox. Observe that instead of a blank space, the slug ends with "-%ef%bf%bc" (e.g. .../test-trac-55117-%ef%bf%bc/). (After pressing Return in the address bar, the browser may capitalize the encoded character sequence to "-%EF%BF%BC".)
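Because the character may show up as a blank, as lowercase %ef%bf%bc, or as uppercase %EF%BF%BC depending on the browser, a check on the decoded path is more reliable than a string search on the raw URL. This is a minimal Python sketch; `has_object_replacement` is a hypothetical helper and example.com is a placeholder host.

```python
from urllib.parse import unquote, urlsplit


def has_object_replacement(url: str) -> bool:
    """Return True if the URL path contains U+FFFC, encoded or raw."""
    path = urlsplit(url).path
    return "\ufffc" in unquote(path)  # hex case does not matter after decoding


print(has_object_replacement("https://example.com/test-trac-55117-%ef%bf%bc/"))  # True
print(has_object_replacement("https://example.com/test-trac-55117-%EF%BF%BC/"))  # True
print(has_object_replacement("https://example.com/test-trac-55117/"))            # False
```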

Expected Results

  • ✅ The saved post slug should not end with the object replacement character (), whether displayed as a blank (Chrome) or encoded (Safari and Firefox).

Supplemental Artifacts

https://cldup.com/De-s4EvRkX.gif
Figure 1.

Additional Information

The editor (or Chrome) appears to treat the pasted text differently based on how the cursor enters the title field 🙃

To cause the "[OBJ]" character to be added to the slug on Publish:

  • Paste after direct navigation to the New Post page. Then select the title and paste again.
  • Paste after pressing the arrow keys Down, then Up into the field. Then select the title and paste again.
  • Paste after pressing the arrow keys Right, then Left into the field (only paste once).
  • Paste after clicking the field (only paste once). Not shown in video, but resembles selecting and pasting again.
  • Paste after switching from another tab/window (only paste once).

The last two scenarios perhaps most closely resemble the use case underlying reports of this issue.

See this demonstration video for more detail: https://cloudup.com/cDHHsWDLofx.

#30 follow-up: @ironprogrammer
11 months ago

Reproduction Report

I was able to reproduce this issue with the updated test instructions. Observations are with Chrome browser.

Environment

  • OS: macOS 12.4
  • Server: nginx/1.23.0
  • PHP: 7.4.30
  • WordPress: 6.0
  • Browser: Google Chrome 103.0.5060.53
  • Theme: twentytwentytwo v1.2
  • Gutenberg plugin: NOT active

Actual Results

  • ❌ The character is appended to the post slug.

#31 @nikkigagency
10 months ago

Hi All! I'm seeing the same weird characters added at the end of some recently created URLs. Here's one example.

https://claibournecounseling.com/what-type-of-therapy-is-used-for-depression%EF%BF%BC/

If I remove the characters at the end, the URL still resolves without issue. The characters just show up when I copy and paste the correct URL...

I haven't tried any troubleshooting or redirects as it still resolves to the right page without issue. But, would be nice to know what's going on... Just thought I'd add a piece to the puzzle.

I'm editing WordPress 6.0.1, Divi theme, on Safari, Mac Monterey 12.5.

#32 in reply to: ↑ 30 @nikkigagency
10 months ago

Replying to ironprogrammer:

Reproduction Report

I was able to reproduce this issue with the updated test instructions. Observations are with Chrome browser.

Environment

  • OS: macOS 12.4
  • Server: nginx/1.23.0
  • PHP: 7.4.30
  • WordPress: 6.0
  • Browser: Google Chrome 103.0.5060.53
  • Theme: twentytwentytwo v1.2
  • Gutenberg plugin: NOT active

Actual Results

  • ❌ The character is appended to the post slug.

#33 @ironprogrammer
10 months ago

Hi, @nikkigagency -- Thank you for sharing your experience with this issue. Based on the example URL provided, it appears to be redirecting (HTTP 302) to the "clean" URL. Until the patch is released, the problem can be fixed by cleaning up the slug before or after publishing, if you catch it. (Per comment:29 "Good to Know", the issue can be harder to spot in Chrome.)

Please note that the underlying issue in the editor that added the special character (represented by %ef%bf%bc) was addressed separately by PR 42321, which looks like it shipped with Gutenberg 13.8.

This ticket's patch adds resiliency to the backend so that the sneaky character doesn't get saved to the post's slug in the first place.

#34 @webprom
8 months ago

This is happening not only in Firefox but also on Edge on PC.

#35 @ironprogrammer
8 months ago

PR 2937 has been refreshed against trunk.

#36 @audrasjb
8 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 54474:

Formatting: Strip object replacement characters from slugs.

This changeset prevents object replacement characters – UTF-8 %ef%bf%bc, used as a placeholder in text for an otherwise unspecified object – from being added to slugs.

Props cantuaria, costdev, audrasjb, SergeyBiryukov, archon810, maciejmackowiak, BaneD, markparnell, ironprogrammer, dmsnell, nikkigagency, webprom.
Fixes #55117.
