WordPress.org

Make WordPress Core

Opened 6 years ago

Closed 2 years ago

#8384 closed feature request (maybelater)

Improving WP hierarchical data structure / Use SQL trees

Reported by: hailin Owned by: hailin
Milestone: Priority: low
Severity: minor Version:
Component: Optimization Keywords:
Focuses: Cc:

Description

Currently WordPress uses simple Adjacency List Model to manage hierarchical data, including pages, categories, terms.

As the size of data sets grow larger and larger, we run into inherent limitations imposed by our current data structure.

For example:

When there are 5500 pages, in order to list a particular subset of pages,
we have to retrieve all of them from the db (step 1), and enumerate through each one to construct a sub-tree (step 2).

We did some algorithmic improvement before to improves part (#2) by reducing the complexity from O(n2) to O(n) in
function page_rows($pages, $pagenum = 1, $per_page = 20)

it reduced the processing time in step 2 from 20 sec to 0.3 sec in a case when there are ~2000 pages.

However, step 1 is still a major hurdle, because limitations of our current pages data structure.

In the case of one blog with 5500+ pages, it takes 33 seconds to display wp-admin/edit-pages.php. In particular, 17 seconds are spent in update_post_caches(), and 11 seconds are spent in apply_filters('the_posts', $this->posts), simply because we are operating over all 5500+ pages! And because the we only store rudimentary parent-child information in the table, we had to query the whole 5500 pages.

If we improve the data structure and store the pages in efficient hierarchical order in db, then for every operation, we could retrieve only the ones we want, eg, 30 pages at a time. This can bring down the page load time from 33 seconds to sub-second, substantially improving user experience.

The above is just one example, the same can be applied to any other cases involving
hierarchical data sets.

The potential change will have a lot of ripple effects, as it may affect a lot of other functions or even maybe themes, which depend on the existing behavior. So we should proceed cautiously and pay great attention to backward compatibility, etc.

We could consider:
Modified Preorder Tree Traversal Algorithm
http://www.sitepoint.com/article/hierarchical-data-database/

Or the ones recommended in the classical book:

http://www.amazon.com/Hierarchies-Smarties-Kaufmann-Management-Systems/dp/1558609202/ref=pd_bxgy_b_img_b

Besides, WordPress is evolving into a CMS, and that mandates us having a better foundation on which to manipulate various data formats. Such a solid, efficient, elegant, robust data structure will serve as a strong foundation for us to evolve well into the future.

Attachments (1)

pages.sql (21.8 KB) - added by Denis-de-Bernardy 6 years ago.

Download all attachments as: .zip

Change History (18)

comment:1 @ryan6 years ago

  • Priority changed from high to normal
  • Type changed from enhancement to feature request

comment:2 @FFEMTcJ6 years ago

  • Milestone changed from 2.8 to Future Release

comment:3 follow-up: @Denis-de-Bernardy6 years ago

  • Component changed from General to Optimization

I wish... but someone would need to give this some love, and it's not easy to maintain such an index without triggers.

comment:4 @Denis-de-Bernardy6 years ago

  • Priority changed from normal to low
  • Severity changed from normal to minor
  • Summary changed from Improving WP hierarchical data structure to Improving WP hierarchical data structure / Use SQL trees

comment:5 @hailin6 years ago

  • Owner changed from anonymous to hailin
  • Status changed from new to assigned

Two GoSC students are working on MPTT, one with Ryan and the other with Tott. It's going to affect the core data structure substantially and many areas. Let's see how it goes and I can help out if Ryan wants me to

comment:6 follow-up: @hailin6 years ago

There are some existing research work on how to speed up operations on Adjacency List Model. Before we jump to MPTT (which would be a major heart surgery),
it'd be a good idea to fully explore those research studies to see if we can improve the existing models.

comment:7 @filosofo6 years ago

Is there any public information about the GSoC work? Unlike past years, this year I wasn't able to find a public mailing list, but I'd be interested to see what the students are doing with WP and MPTT.

comment:8 in reply to: ↑ 6 @Denis-de-Bernardy6 years ago

Replying to hailin:

There are some existing research work on how to speed up operations on Adjacency List Model. Before we jump to MPTT (which would be a major heart surgery),
it'd be a good idea to fully explore those research studies to see if we can improve the existing models.

The main issues relate to needing to maintain the index in order. If you move two elements in one go, in such a way that their lft/rgt fields before and after share values, it can get potentially nasty.

Attached is an implementation in pgsql for reference, to give you some idea of how tricky it can get.

@Denis-de-Bernardy6 years ago

comment:9 @Denis-de-Bernardy6 years ago

quick notes on the pages.sql file I uploaded:

  • it's best read using pgadmin
  • it contains more than just rgt/lft -- also revisions and permissions for a page tree

the key points of interest is the page_tree() function, when then gets used in the various triggers. with PG and triggers, you can get a serialized transaction and ensure you've a properly indexed table at all times. with MyISAM, you're pretty much guaranteed to get none of that.

comment:10 @Denis-de-Bernardy6 years ago

Oh, adding to this, that code I uploaded has been abandoned and never rolled out. Since then, PG has gotten a new contrib module that makes this stuff built-in.

comment:11 @diegocaro6 years ago

Hello! The two students are me and Dan Larkin. We were working on MPTT for WordPress in pages and categories.

In a short resume, I only get performance in admin panel (comparision between trunk WP and WP-MPTT with the new metadata in the database). For any reason, I can't get more improvement in the other sections of WordPress. I think that when WP get the data from mysql, WP take more time beacause get the metadata (left, right and depth fields). May if you read the code, you can get the reason of the non-perfermance.

Dan Larking work was great, him make code with acceleration in normal WordPress page, but no in admin panel (I guess).

I think that a good idea is put my code of admin panel, and put the code of Dan in WordPress Core.

My patch is in http://google-summer-of-code-2009-wordpress.googlecode.com/files/Diego_Caro.tar.bz2 and Dan Larkin patch is in http://google-summer-of-code-2009-wordpress.googlecode.com/files/Daniel_Larkin.tar.bz2

You can see the history of our work in MPTT-WordPress in http://gsoc2009wp.wordpress.com/tag/mptt/

Bye bye!

comment:13 in reply to: ↑ 3 @miqrogroove5 years ago

Replying to Denis-de-Bernardy:

it's not easy to maintain such an index without triggers.

That was my overall impression of this thread. Reading MPTT made the algorithm sound like it had two back-ends; one has the parent-child relationships stored in a table, and the other should be maintained as an index at the storage layer. Otherwise, you're potentially locking the posts table for a many-row UPDATE command every time you save a new page draft. And is it just my imagination, or do the Left and Right columns have to be indexed anyway? If that's the case, then using sequential Left and Right values is a mistake.

You could take 1 to MAX(Right) and distribute the values, evenly for the sake of this example. If you want to add a value with parent Fruit in the original version, you're stuck with (2)Fruit(11) and (7)Yellow(10). But if you had distributed values like (200)Fruit(1100) and (700)Yellow(1000), now you can freely insert (1025)Green(1075) without locking the table.

comment:14 @hakre5 years ago

Before I can make up my mind about 5000 Pages and their relationship, I would ask myself how I would display them on the post editor. I mean, the dropdown isn't fitting for such large numbers.

Related to dropdowns: #11433, #8592

comment:15 @hakre5 years ago

Related: #10277

comment:16 @voyagerfan57615 years ago

  • Cc WordPress@… added

comment:17 @westi2 years ago

  • Milestone Future Release deleted
  • Resolution set to maybelater
  • Status changed from assigned to closed

Closing this ticket off for now, we have had 3 years without any activity and the MPTT GSOC projects didn't lead to anything that was really suitable for core.

We probably need to review this again at some point but we should probably start with a clean slate and a fresh set of research into the most practical solutions.

Closing as maybelater

Note: See TracTickets for help on using tickets.