Make WordPress Core

Opened 4 years ago

Closed 4 years ago

#19292 closed defect (bug) (fixed)

Not found errors due to sanitization in sanitize_title_with_dashes

Reported by: xknown Owned by:
Milestone: 3.3 Priority: highest omg bbq
Severity: blocker Version: 3.3
Component: General Keywords: has-patch needs-testing commit
Focuses: Cc:


As a consecuence of #10797, the patch to remove some characters like nbsp, ndash, mdash from slugs, existing posts that contained these characters are not longer accessible.

This new code should only execute on every post/page save. However, when preparing the SQL query needed to find posts by slug, WP.org core uses the sanitize_title function, which assumes by default that we are saving the slug. Thus, if an existing post has "mymasscom%c2%a0class" as slug, then starting from that date these posts are not accesible because the searched slug is "mymasscom-class".

Attachments (5)

sanitize_title.diff (1.4 KB) - added by xknown 4 years ago.
Use sanitize_title_for_query instead of sanitize_title when querying by slug
19292.diff (1.7 KB) - added by nacin 4 years ago.
19292-dont-update-slug-sanitize-type-on-update.diff (1.2 KB) - added by dd32 4 years ago.
19292.2.diff (2.0 KB) - added by markjaquith 4 years ago.
19292.3.diff (2.7 KB) - added by nacin 4 years ago.
uses old-save instead of query, cleans up docs.

Download all attachments as: .zip

Change History (24)

4 years ago

Use sanitize_title_for_query instead of sanitize_title when querying by slug

#1 @SergeyBiryukov
4 years ago

  • Milestone changed from Awaiting Review to 3.3

#2 @nacin
4 years ago

  • Severity changed from normal to blocker

Nice catch, Alex.

#3 @sorich87
4 years ago

  • Keywords has-patch added

#4 @dd32
4 years ago

I can't fault this with posts & pages.

Using a copyright symbol and a mdash, before, pages created in 3.0~3.2 could not be viewed, posts were fine, new pages were fine. After patch, posts & pages created in 3.0~3.3 work fine.

I've not tested the Taxonomy branches yet, Attached patch fixes it for Tag slugs at first glance, however other functions such as get_category_by_path() and most Taxonomy functions uses sanitize_title() directly too.

#5 @nacin
4 years ago

Discussed this further with dd32.

The second (and worse) blocker here is as follows: Take a page (or any hierarchical post type) that was created in 3.0~3.2 that has a special character in the slug. (You can rig this to work without doing svn switches by hacking sanitize_title_with_dashes() to prevent the code inside the 'save' == $context conditional from firing, then remove the hack.)

At this point, you will be able to view the page under the bad URL. Now go into the page and save it. The page now has a good, clean URL. But the bad URL no longer works.

To fix this, one of the following needs to occur: 1) a hack to prevent sanitization from happening to hierarchical post types that are being updated, or 2) we need to implement _wp_old_slug that works for pages.

(2) Is actually not terrible to do as long as we don't care about parent changes (the focus of #4328), and focus squarely on slug changes. We can then introduce _wp_old_path to properly handle actual path changes in 3.4. On the other hand, I don't know what the heck would happen to a child page's original link when a parent page is changed. Yeah, so that won't really work.

(1) is doable, but kind of messy. Let me explain:

At first, I recommended to dd32 that sanitize_title() could be hacked to allow an integer for $context, and in that case, $context could then be assumed to be a post ID, and then $context would be set to 'save' only if the post type was not hierarchical.

On one hand, this is backwards compatible. ($context was introduced specifically for 'save' versus 'query' in 3.1.) On the other hand, this is really low level and totally lame.

The better way to do this — still hacky, but I'm warming up to it — is to modify every call to sanitize_title() that affects $post_name generation. This happens in only a few instances — three, and all in wp_insert_post(). (Also, these would need to be modified anyway to pass post IDs, as proposed and rejected above.)

The modification is simple: If you're running and update AND it's a hierarchical post type, then don't run the sanitization on save context. This means that our feature will work for all new content, and old content that has _wp_old_slug. 19292.diff reflects this.

dd32 cooked up a slightly different patch that results in it only working for new content (or old content that is sanitized identically to new content), rather than falling back onto canonical redirections via _wp_old_slug. While conservative, our original intent of the rolling upgrade was that it would simply work, so I think I would rather go with the route I've described.

4 years ago

#6 @nacin
4 years ago

I'm becoming too tired to run through the thought experiment on this, but the only other issue I see is that while we don't want the save context code in sanitize_title_with_dashes() to run, I think we *do* want the remove_accents() to be run (see sanitize_title() when in save context), in the case of building a post_name off a title. Does that sound right to anyone else groking this?

#7 @dd32
4 years ago

attachment 19292-dont-update-slug-sanitize-type-on-update.diff added

For comparison purposes, here's my thought process for it.

Put simply: (note: "post" is used here to refer to any post type, This ticket is primarily focusing on pages, but posts suffer the same problem, canonical just stuck a bit of wallpaper over the crack in the wall)

  • If we've got no slug; it's a new post, it gets new sanitization
  • If we've got a slug; it's either a pre-3.3 post, or a 3.3+ post
    • First of all, check to see if the current slug passed IS the same as what would be used for a Query, If it is, it's a post with a older (pre-3.3)pre-sanitized name OR the slug sanitizes using both algorithms the same OR the name doesn't require sanitization (ie. single word)
    • Otherwise, lets sanitize it using the new 3.3+ stuff

I believe there's a edge case in there however that I've not accounted for, for example, "tricking" it into using the older sanitization by altering the the post_name field directly.. while i can't seem to cause a problem with that, it feels like it's possilble

My prime thought process here was: Do not update slugs for existing content, Canonical is there to help when the slug was specifically changed, but we shouldn't use it to change existing URL's.

Of course, next time we alter something, my approach may fail once again, because we'll have multiple "save" contexts/remove_accent contexts. but nacin's approach has a similar problem, the slug sanitization just uses the old style for hierarchical post types, rendering any enhancements to the slug generation moot, we can add canonicalisation for the pages, but this isn't going to help old content, unless we process every page on upgrade.

Ultimately: Nacin's approach feels hacky to me, and something that should only be used as a bandaid, but it does prevent yet-another-sanitization-type happening to hierarchical post types slugs, allowing for us to have easier migrations in the future to a _wp_old_slug/_wp_old_path process for pages

#8 @nacin
4 years ago

nacin's approach has a similar problem, the slug sanitization just uses the old style for hierarchical post types, rendering any enhancements to the slug generation moot

Worth pointing out that it only abandons the new style sanitization for existing posts of hierarchical post types. New pages do get to benefit from the enhancements. All posts, old or new, are able to benefit, due to canonicalization — in this case, the wallpaper is a feature. :-)

#9 @dd32
4 years ago

Worth pointing out that it only abandons the new style sanitization for existing posts of hierarchical post types.

Ah, my apologies, I didn't catch that block of code only executes on non-publish -> publish.

#10 @ryan
4 years ago

  • Version set to 3.3

#11 @ryan
4 years ago

  • Priority changed from normal to highest omg bbq

4 years ago

#12 @markjaquith
4 years ago

I liked dd32's approach better. But it doesn't use the newest sanitization if you manually update the slug. My patch adds that. Also adds a JS fix for if you click "Edit" for an old slug with bad chars, and then click "OK" without changing it. If you don't change it, it should be like Cancel, and not fire an XHR that will show you a new-style slug (which will not be used when you click "Update").

#13 @markjaquith
4 years ago

For testing:

  1. Manually insert this into a post_name in the database:

The first char is "å", and the ones around "quoted-text" are curly quotes from Microsoft Word. This was the original source:

foo å bar “quoted text”
  1. Once you have that in a post_name, open that post for editing. First thing to do is hit Update without making any changes. The permalink slug should stay the same. Verify the post is still accessible.
  1. Click "Edit" next to the slug and then immediately press "OK" without making any changes. The slug stay the same. Try another Update to verify.
  1. Next, manually edit the slug, and append something (like "-appended"). On pressing OK, you should get a new, fully sanitized, like "foo-a-bar-quoted-text-appened". Save and verify. Make sure the 301 works from the old URL.
  1. Try creating a new post or page with a title of foo å bar “quoted text”. Make sure you get the newest sanitization when publishing.

#14 @DrewAPicture
4 years ago

Sanitizes as advertised. All steps succeeded.

Also tested single curly quotes to the mix (my biggest troublemaker).

Last edited 4 years ago by DrewAPicture (previous) (diff)

#15 @nacin
4 years ago

  • Keywords needs-testing commit added

Per IRC, ignoring the canonical benefits. Could be done later if we iron out everything. (An additional thing to look out for is imports.)

Thus, as long as the patch works as described, let's go for it. Commit candidate.

#16 @nacin
4 years ago

Let's also not forget about xknown's other sanitize_title_for_query() changes.

#17 @nacin
4 years ago

I would rather this line:

$check_name = sanitize_title( $post_name, '', 'query'); // sanitize_title_for_query(), but spelt out here for readability

$check_name = sanitize_title( $post_name, '', 'old-save');

'query' is semantically incorrect, and for it to work, it just needs to be somrthing other than 'save'.

4 years ago

uses old-save instead of query, cleans up docs.

#18 @dd32
4 years ago

In [19444]:

Switch to sanitize_title_for_query() for Query sanitization (allows for pre-3.3 page slugs to be viewable), Don't update page slugs to new slug-types when the slug is not being changed, Don't issue a XHR if the page slug hasn't changed. Group effort props xknown, markjaquith, nacin. See #19292

#19 @dd32
4 years ago

  • Resolution set to fixed
  • Status changed from new to closed

In [19444]:

Note: See TracTickets for help on using tickets.