Make WordPress Core

Opened 13 years ago

Closed 12 years ago

Last modified 12 years ago

#16687 closed task (blessed) (fixed)

%postname% permalinks should not trigger verbose rewrite rules

Reported by: nacin's profile nacin Owned by:
Milestone: 3.3 Priority: normal
Severity: normal Version:
Component: Rewrite Rules Keywords: needs-testing
Focuses: Cc:

Description

The %postname% permalink structure is very common. As it currently stands, this introduces a performance penalty with large numbers of posts.

I think we can avoid verbose rules here by mapping (.+?) not to pagename=, but to a new, temporary query variable that would trigger a query by post_name for both pages and posts.

We can recommend against it all we want, but this structure isn't going away on the internet. Fighting it only causes perception issues relating to scaling. We should instead embrace it and make at least this structure scalable.

Attachments (9)

non-verbose-page-rules.php (1.5 KB) - added by markjaquith 13 years ago.
Partial solution
non-verbose-page-rules.002.php (3.0 KB) - added by duck_ 13 years ago.
non-verbose-003.php (2.2 KB) - added by greuben 13 years ago.
16687.diff (5.5 KB) - added by dd32 13 years ago.
no_verbose_ver1.diff (3.8 KB) - added by Otto42 13 years ago.
Make verbose page rules not quite so verbose anymore
no_verbose_ver2.diff (3.5 KB) - added by Otto42 13 years ago.
Take two - fixes problems with feeds, refreshed for trunk
no_verbose_ver3.diff (6.6 KB) - added by Otto42 13 years ago.
Take 3 - optimizes the query to make it only one extra query for any number of levels
polish.16687.diff (5.1 KB) - added by scribu 13 years ago.
16687.minor-typo-fix.patch (673 bytes) - added by SergeyBiryukov 13 years ago.

Download all attachments as: .zip

Change History (72)

#1 @Denis-de-Bernardy
13 years ago

Suggestion:

select *
from posts
where post_parent is null
and post_name = %
order by post_type <> 'post'
limit 1

and for /%category%/%postname%/:

select *
from posts
where (post_parent is null) = (post_type = 'post')
and post_name = %
order by post_type <> 'post'
limit 1

limit 1 is, arguably, only useful only if we don't want to double check the result.

#2 @nacin
13 years ago

I like both.

I'm not sure we can apply the same line of thought I'm thinking here, to /%category%/%postname%/. That'll require a lot more work to get right and match our current handling.

#3 @linuxologos
13 years ago

  • Cc linuxologos@… added

#4 @scribu
13 years ago

  • Cc scribu added

Related: #9824

#5 @dd32
13 years ago

For /%postname%/ can we just not do if ( no slashes in the page query string ) then set type = post||page else set type = page (as a slash in the pagename, means' a hierarchical type, so not a post..)

That'll probably require a extra check for is_page/etc conditionals to set it to the correct one..

#6 @scribu
13 years ago

That could work, but I'm not sure it's worth it. You might stomp on other rewrite rules.

#7 @markjaquith
13 years ago

scribu — it would be last, just like our current page query catchall.

#8 @mikeschinkel
13 years ago

  • Cc mikeschinkel@… added

Suggestion: Rather than focus on this one use-case how about instead add a hook that allows this and other use-cases to be addressed by both core and by plugins, and then address this use-case in core using the new hook.

I've added ticket #16692 with a patch that can get the discussion started.

#9 @nacin
13 years ago

  • Milestone changed from Future Release to 3.2

Mark floated around a plugin POC that implemented this. It could be turned into a patch pretty easily I imagine. This is a really good enhancement in line with 3.2, moving there.

@markjaquith
13 years ago

Partial solution

#10 @markjaquith
13 years ago

That's the plugin I worked on briefly. I don't recall the specifics, but it's not complete. Something wasn't working. I throw it up here with that caveat just to get some ideas flowing.

#11 @greuben
13 years ago

A page and a post can have the same slug. If /%postname%/ permalink is used, the post is not reachable. Maybe wp_unique_post_slug() can be modified but that might break existing links.

#12 follow-up: @bike
13 years ago

I am overstepping my knowledge here, so ignore me if the following does not make sense:
I just tested the number of rewrite rules with AskApache rewrite viewer plugin. Indeed a change from an 'allowed' permalink to a %postname or %category permalink increases both WP rewrites as well as page rewrites with 11 rules each (per page!).

But every additional group of 11 rules is exactly the same (they have no reference to page/post ID's etc) according to AskApache plugin, so is there not a way to just not generate all those duplicate rules?
Anyway, I am glad you are working on this as it does prevent acceptance for CMS users. Cheers.

#13 in reply to: ↑ 12 @Otto42
13 years ago

Replying to bike:

But every additional group of 11 rules is exactly the same (they have no reference to page/post ID's etc) according to AskApache plugin, so is there not a way to just not generate all those duplicate rules?

The rules are not the same, the AskApache plugin is only showing you the end result, not what it's matching against. It's showing you the values, not the keys.

Edit the AskApache plugin on line 347 to change this:
foreach($v as $vv) {

To this:
foreach($v as $vv=>$res) {

And then take a look. The rule keys will be shown instead, and you'll see why the rules are the way they are. The actual end rules may be the same, but the regular expressions they're matching against are not, and those are the more important part.

#14 @kyounger
13 years ago

  • Cc kyounger added

#15 @azizur
13 years ago

  • Cc azizur added

#16 @duck_
13 years ago

Attaching a new PoC plugin (non-verbose-page-rules.002.php) to fix the problems Mark referred to. The problem was with sub-pages being blocked by the post attachment rule:

[^/]+/([^/]+)/?$ => index.php?attachment=$matches[1]
// ... and similar for trackbacks, feeds, comment pages

I worked around this by removing these attachment rules and dealing with them as part of the page catch-all. If there is a / then it's either a page or an attachment, it's the latter if matched by any of the attachment rewrite rules that were previously unset (transformed into non-regex form, this is where it gets a bit uglier/hackish[1]) *and* the attachment piece -- ([^/]+) -- is actually an attachment... otherwise it's a page. On the other hand we have a post or a page which can be dealt with as before.

I also added a check to only change the rules specifically for the /%postname%/ structure as it will just break other verbose structures.

I also tried a variation for the non-/ post or page branch to get the ID in advance and reduce the number of queries for top level pages by one compared to trunk. For a full table of query counts in some quick testing see http://pastie.org/1813700.

Performance tests I made before I added the OR branches to deal with post attachment trackbacks, feeds and comment pages showed that with the plugin response time remained the same as the number of pages grew (0.17s at both 59 pages and 1297 pages). Without the plugin it was a different story. 0.18s response time at 59 pages and 0.37s at 1297 pages. The figures I'm quoting are from tests with one concurrent user over 30 seconds requesting the comment feed of a post attachment (e.g. http://wp.dev/post-with-attachment/attachment-slug/feed/rss/) so very low on the rewrite stack. With the plugin enabled the /%postname%/ permalink structure has the least number of rewrite rules by default.

Also run this against the WordPress tests for WP_Query and WP_Rewrite with no failures.

[1] I thought I had managed a relatively neat solution before realising that I had forgotten the trackback, feeds and comment page rules for post attachments which turns the condition into a bit of a monster!

Last edited 13 years ago by duck_ (previous) (diff)

#17 follow-up: @scribu
13 years ago

Wouldn't it be better to look for the 'attachment' part? That's how the regexes look.

Also, you can use $wp_rewrite->feeds instead of the hardcoded array.

#18 in reply to: ↑ 17 ; follow-up: @duck_
13 years ago

Replying to scribu:

Wouldn't it be better to look for the 'attachment' part? That's how the regexes look.

Also, you can use $wp_rewrite->feeds instead of the hardcoded array.

Initial reaction is good point. However, I think it might mean just replacing it with 5 regular expressions since ^[^/]+/([^/]+) could match a request like page-1/page-2/page-3 where page-2 is also an attachment slug (although attachments are supposed to be globally unique this might not be the case see #17170). But you're right, I think there is probably a cleaner way of getting it working if someone thinks about it a bit more and actually tries it.

#19 @scribu
13 years ago

Related: #17185

#20 @duck_
13 years ago

I tested using a single regular expression and then checking the matched portion against attachment post_names instead of the set of OR'd conditions:

if ( preg_match( '#^[^/]+/([^/]+)#', $request['post_or_page'], $matches ) &&
    $id = $wpdb->get_var( $wpdb->prepare( "SELECT ID FROM $wpdb->posts WHERE post_name = %s AND post_type = 'attachment' LIMIT 1", $matches[1] ) ) )
  $request['attachment_id'] = $id;

This works well in most cases. But as I had thought there can be problems with conflicting attachment/page slugs. To test try this set-up:

  • 3 pages with slugs: page-1, conflict, page-2. Where the succeeding page is a child of the previous
  • 1 attachment (attached to a post) with slug conflict (has to be added before the page see #17170)

Now visit /page-1/conflict/ and /page-1/conflict/page-2/ in both cases the attachment is displayed when using the plugin with the above mod. On trunk the pages are displayed (though a notice is triggered for /page-1/conflict/ if the post/attachment is tested by get_page_by_path first, wp-includes.php line 3157). Plugin v002 as uploaded previously will show the attachment for the first but not the second since it will recognise there are three parts the last of which is not 'trackback', a feed or 'comment-page-xx'.

This shouldn't be a problem if attachments were guaranteed to have globally unique slugs. If this were the case then the single regular expression might be the way to go.

#21 @duck_
13 years ago

Just thought of being able to use the matches that have already been made by the rewrite rules to avoid extra slug conflicts (like I did in v002). So checking isset($request['tb']) and the like. I think exploding and counting will be faster, but something that needs to be investigated.

#22 in reply to: ↑ 18 @greuben
13 years ago

Replying to duck_:

But you're right, I think there is probably a cleaner way of getting it working if someone thinks about it a bit more and actually tries it.

I've tried it. See non-verbose-003.php

Version 0, edited 13 years ago by greuben (next)

#23 @solarissmoke
13 years ago

Related: #13459

Last edited 13 years ago by solarissmoke (previous) (diff)

#24 @jane
13 years ago

  • Milestone changed from 3.2 to Future Release

Didn't make it in before feature freeze last week, revisit 3.3 early.

@dd32
13 years ago

#25 @dd32
13 years ago

attachment 16687.diff added

Initial WIP patch, This doesn't touch the fact that the url's may not be unique (page and post with the same slug, Attachment with the same slug as a 2nd level page, 2nd level page with a same slug as an attachment, etc). Seems to work in all cases i throw at it.

The priority order for matching items appears to be: Posts > Attachments > Pages

#26 @nacin
13 years ago

  • Milestone changed from Future Release to 3.3

#27 @billerickson
13 years ago

  • Cc bill.erickson@… added

#28 @travisnorthcutt
13 years ago

  • Cc travis@… added

#29 @Otto42
13 years ago

How to make verbose page rules less painful.

For the cases that would normally invoke verbose_page_rules, use this mechanism instead:

  1. Move the page rewrite rule immediately above normal post custom rewrite rules in the rewrite_rules.
  1. Special case the page rewrite rule to use a callback or something to do the following:
  1. Split the url along slashes (post_name's can't contain slashes). Reverse number them. So they look like this: page4/page3/page2/page1 (the page you want, page1, is always at the beginning of your list.
  1. Recursive subquery fun!
select ID from wp_posts where post_name = page1 AND post_parent = 
( select ID from wp_posts where post_name = page2 AND post_parent =
( select ID from wp_posts where post_name = page3 AND post_parent =
( select ID from wp_posts where post_name = page4 
) 
) 
)

Note that this is actually pretty fast due to indexing, although if somebody decides to go too deep into the dream state, they could fall into limbo.

Alternative could use joins. Limbo here is a lot deeper down, probably.

SELECT ID FROM wp_posts a 
JOIN wp_posts b ON a.post_parent = b.ID AND post_name = page2
JOIN wp_posts c ON b.post_parent = c.ID AND post_name = page3
JOIN wp_posts d ON c.post_parent = d.ID AND post_name = page4
WHERE a.post_name = page1
  1. If this post exists, check its post_type for page and set vars accordingly. If post_type is attachment, check parent's type instead (to handle page attachments).
  1. If post does not exist, then fail and get on with life.

Advantage: This adds one extra (ugly) query to the case of using verbose page rules, and doesn't require all the verbose rules to be so verbose. It moves the performance issue to this one query only, and the performance issue here is keyed to the depth of nested pages instead of to the total number of pages. Log N improvement for most all cases.

#30 @Otto42
13 years ago

By using get_page_by_path, I was able to implement this basic idea much simpler. I don't know how efficient get_page_by_path actually is though, so I can't speak as to performance.

use_verbose_page_rules still exists, but with this it doesn't create huge rewrite rulesets anymore.

See no_verbose_ver1.diff.

@Otto42
13 years ago

Make verbose page rules not quite so verbose anymore

#31 @johanee
13 years ago

  • Cc johan.eenfeldt@… added

Very nice, good work!

Tested performance of no_verbose_ver1.diff. Time is average of 5 tests, run on a decent VPS.

Site 1 (72 pages & 3714 attachments) WP 3.2.1

  • rewrite rules: 5638 (666KB serialized)
  • flush_rewrite_rules(): 1.452s
  • failed url_to_postid(): 0.052s

Site 1 WP 3.2.1 + no_verbose_ver1

  • rewrite rules: 72 (6,5KB serialized)
  • flush_rewrite_rules(): 0.005s
  • failed url_to_postid(): 0.005s

Site 2 (27 pages & 1350 attachments) WP 3.2.1

  • rewrite rules: 15418 (1,8MB serialized)
  • flush_rewrite_rules(): 9.068s
  • failed url_to_postid(): 0.150s

Site 2 WP 3.2.1 + no_verbose_ver1

  • rewrite rules: 72 (6,6KB serialized)
  • flush_rewrite_rules(): 0.005s
  • failed url_to_postid(): 0.005s

Do you want anything else tested or profiled?

#32 @Otto42
13 years ago

Basically, this patch makes page rules not really "verbose" anymore, and so the verbose flag is now just used to determine if pages should take precedence or not and whether extra queries need to be done for that case.

So also be sure to check for unexpected side effects. Make sure that Pages and Attachments all show up properly and aren't throwing 404's or errors or what have you.

Also be aware than in the event of a URL conflict (such as if you use %postname% for the custom string and a page and post both have the same slug), then the page is going to take precedence and get shown instead of the post. Me and Nacin talked about that Monday and decided that that made the most sense. So be sure that works properly too on your sites.

With the current patch, it's doing unnecessary extra data retrieval and then discarding it. This can be optimized away, although I don't think it would result in noticeable speed improvement. One thing that would be interesting to see is the results on long hierarchical page URLs. The speed of the lookup is dependent on the depth of the URL. So a page with /page would be found faster than a page with /page/url/is/really/really/deep/here/oh/noes. It might be interesting to see how deep you need to go before it becomes a problem.

#33 @johanee
13 years ago

The patch has a bug in comment feeds for pages and attachments (/some/page/feed). You get either 404 or redirected to the actual page/attachment. If there is a post with same name as page you get comments for the post.

The rest seem to work. I get the correct stuff for a random selection of posts, pages, attachments and categories. Pages get precedence over posts. Post, post comment and category feeds work.

Performance for page lookup:

  • depth 1: 0.009s, 11 queries (was 0,055s and 5 queries)
  • depth 5: 0,021s, 47 queries (was 0,055s and 17 queries)
  • depth 10: 0,034s, 92 queries (was 0,056s and 32 queries)
  • depth 15: 0,049s, 137 queries (was 0,065s and 47 queries)
  • depth 20: 0,062s, 182 queries (was 0,076s and 62 queries)

Seems reasonable for any even close to sane depth. A number of queries but they seem fast enough.

#34 @nacin
13 years ago

  • Type changed from feature request to task (blessed)

#35 @kawauso
13 years ago

  • Cc kawauso added

#36 @tetele
13 years ago

  • Cc tm.sandu@… added

#37 @ryno267
13 years ago

  • Cc chuck@… added

#39 @nerrad
13 years ago

  • Cc nerrad added

#40 @aaroncampbell
13 years ago

  • Cc aaroncampbell added

@Otto42
13 years ago

Take two - fixes problems with feeds, refreshed for trunk

#41 @Otto42
13 years ago

Version 2 of patch uploaded. This uses a slightly more complex page query test which handles the feed problems reported by johanee.

In testing with a post with the same slug as a page while using a /%postname%/ permalink string, the page and page feeds took precedence over the post and post feeds. I believe this is the desired behavior in such a case.

Number of extra queries are not yet optimized in this patch, however they are still reasonable for sane page depths and *way* faster than the verbose rules.

#42 @Otto42
13 years ago

Version 3 of the patch creates a new find_page_by_path function to find the page given the path, using only one query.

Also optimized get_page_by_path to use this new function, which makes it actually result in less queries when retrieving deep pages than it was using before.

johanee: If you want to test this with your previous data set to get numbers or find bugs, I'd appreciate it.

#43 @dougwrites
13 years ago

  • Cc heymrpro@… added

#44 @Otto42
13 years ago

Tested with 1000 pages and /%postname%/ permalinks. Trunk code resulted in 2045 queries on the permalink settings page, and a "MySQL server has gone away" message when trying to save the resulting rewrite array. Page requests thus took 2000+ queries.

With the ver3 patch, rewrite array was 71 rules. Page requests took 17 queries.

@Otto42
13 years ago

Take 3 - optimizes the query to make it only one extra query for any number of levels

#45 @scribu
13 years ago

Coding style: Instead of:

if ( $wp_rewrite->use_verbose_page_rules == true && preg_match('/pagename=\$([^&\[]+)\[([0-9]+)\]/',$query,$varmatch) ) { 

you can just write:

if ( $wp_rewrite->use_verbose_page_rules && preg_match('/pagename=\$([^&\[]+)\[([0-9]+)\]/',$query,$varmatch) ) { 

and it's pretty tricky to figure out what this line does:

if ( ! get_page_by_path(${$varmatch[1]}[$varmatch[2]]) )

More importantly, why does this block of code appear twice:

if ( $wp_rewrite->use_verbose_page_rules == true && preg_match('/pagename=\$([^&\[]+)\[([0-9]+)\]/',$query,$varmatch) ) { 
  // this is a verbose page match, lets check to be sure about it 
  if ( ! get_page_by_path(${$varmatch[1]}[$varmatch[2]]) ) 
    continue; 
}

If it's really needed in two places, it should be factored into a helper function.

#46 @Otto42
13 years ago

A lot of the parse_request function is reproduced in url_to_postid. That's a refactor for another time, I think.

The line of code you're calling tricky is, indeed, tricky. It's basically looking at a string like this:

index.php?pagename=$matches[1]&page=$matches[2]

And trying to extract the "$matches[1]" from that string in a way that it can then figure out how to get the value in $matches[1]. Simplification would be gladly accepted, I banged my head for a while on that one.

#47 @scribu
13 years ago

A lot of the parse_request function is reproduced in url_to_postid. That's a refactor for another time, I think.

Yeah, that's what the next guy says and it gets worse and worse.

Simplification would be gladly accepted, I banged my head for a while on that one.

Ok, I'll sleep on it and see if I dream up a better solution. :)

#48 follow-up: @scribu
13 years ago

What I mean is that existing duplication doesn't justify additional duplication.

#49 in reply to: ↑ 48 @Otto42
13 years ago

I get ya, but I think that removing that duplication is best done in a separate ticket.

#50 @markjaquith
13 years ago

In [18541]:

Eliminate verbose rewrite rules for ambiguous rewrite structures, resulting in massive performance gains. props andy, otto42, duck_. Nice work everyone! see #16687

#51 @scribu
13 years ago

Couldn't sleep. :)

And trying to extract the "$matches[1]" from that string in a way that it can then figure out how to get the value in $matches[1].

The thing is that we only have $matches available as a variable. Trying to use anything else won't work anyway. Patch on the way.

@scribu
13 years ago

#52 @scribu
13 years ago

polish.16687.diff:

Last edited 13 years ago by scribu (previous) (diff)

#53 @duck_
13 years ago

In [18566]:

Fix rewrite documentation typo and clarify with "rules array" over "this/the structure". props SergeyBiryukov. see #16687

#54 follow-up: @SergeyBiryukov
13 years ago

[18541] broke get_page_by_path() for non-English slugs: ticket:10249:42.

#55 in reply to: ↑ 54 @ocean90
13 years ago

  • Keywords needs-testing added

Replying to SergeyBiryukov:

[18541] broke get_page_by_path() for non-English slugs: ticket:10249:42.

That's right, related PHP warnings which I get:

  • NOTICE: wp-includes/post.php:3175 - Undefined index: 0
  • NOTICE: wp-includes/post.php:3171 - Trying to get property of non-object
  • NOTICE: wp-includes/post.php:3178 - Trying to get property of non-object

I think [18627] is related to this ticket.

Last edited 13 years ago by ocean90 (previous) (diff)

#56 @dd32
13 years ago

That's right, related PHP warnings which I get:

Those are somewhat unrelated, same function, but not the cause there. See #17670 for fixing those. (The changes here did increase the number of notices though)

#57 follow-up: @rockstyle
13 years ago

So obviously /%postname%/ gets improved. What about /%category%/%postname%/? You could easily detect if first segment is category and if yes, look for entry, if not check page.

#58 in reply to: ↑ 57 @Otto42
13 years ago

Replying to rockstyle:

So obviously /%postname%/ gets improved. What about /%category%/%postname%/? You could easily detect if first segment is category and if yes, look for entry, if not check page.

The patch that has already been committed accounts for *all* cases. Not just the postname case.

#60 @ryan
12 years ago

  • Resolution set to fixed
  • Status changed from new to closed

#61 @scribu
12 years ago

Follow-up: #18877

#62 @landwire
12 years ago

  • Cc landwire added

Hi all,
as you seem to be the experts on this field, I come straight to you:

Is there a possibility of adding either of those two structure tags %userrole% and %posttype% to create permalinks? Or is this an utterly stupid idea?
It would help me a lot making the following permalink structure possible, especially if you have 1000+ authors it would be nice to get them structured according to their role:

%role%/%author%/%posttype%/%postname%

I appreciate any answers. If I know it is an absolute silly idea, then I can at least drop it!
Thanks for doing all this magic with wordpress,
Sascha

#63 @scribu
12 years ago

Yes, it should be possible, but trac is not the right place for asking such questions.

Please try the support forums or WordPress Answers.

Note: See TracTickets for help on using tickets.