Make WordPress Core

Opened 17 years ago

Closed 17 years ago

#4189 closed defect (bug) (fixed)

Taxonomy Schema and API

Reported by: ryan's profile ryan Owned by:
Milestone: 2.3 Priority: normal
Severity: normal Version: 2.2
Component: General Keywords: term taxonomy tags categories
Focuses: Cc:

Description

Create a generic taxonomy and term API and schema that can be used for tags, post categories, link categories, and arbitrary future taxonomies.

See the hackers list discussion:

http://www.nabble.com/Taxonomy-Schema-Proposal-tf3582783.html

See also the discussion on the tagging ticket, #3723

Attachments (7)

taxonomy.php (6.8 KB) - added by majelbstoat 17 years ago.
WPTaxonomy skeleton
taxonomy-schema.diff (3.1 KB) - added by majelbstoat 17 years ago.
Schema alterations
4189b.dff (10.7 KB) - added by majelbstoat 17 years ago.
Second iteration, no taxonomy table, more functions.
4189c.diff (10.5 KB) - added by ryan 17 years ago.
4189b + syntax fixes
terms.diff (9.3 KB) - added by ryan 17 years ago.
terms.2.diff (15.4 KB) - added by ryan 17 years ago.
installation_errors_at_r5602.txt (1.7 KB) - added by johnbillion 17 years ago.
Installation errors at r5602

Download all attachments as: .zip

Change History (92)

#1 @majelbstoat
17 years ago

Seeing as categories are going to be changing so much, I wondered if it might be a good time to consider an alternative hierarchical implementation in the database? This article makes a good case for nested set categories:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

To implement this, we'd need a left and right column instead of a parent column. I quite like this anyway, as it also allows for more complicated taxonomy schemes, such as linked lists. There are lots of benefits to using this as opposed to our current method, the main of which is that it is optimised for retrieving data, rather than inserting.

#2 @majelbstoat
17 years ago

That being said, here is a basic schema (still with the parent column) and the beginnings of a WPTaxonomy class. I stayed away from WPTerm, because, even though it is a nice construct, you don't want to have to create a new instance of a term every time you want to do an operation. Taxonomies are stored in a separate table for permanence, but will be retrieved from get_options for minimal db impact. I think that's what was agreed on on the list? is_int() is used to distinguish between term_ids and term_slugs.

add_term_relationship() is worth a mention as it is designed to be quite flexible. It can take an array of term ids or slugs, an object_id and an array of taxonomy ids or taxonomy slugs. It should create the terms that don't exist (though, in general this shouldn't be required), then create or increment the term-taxonomy relationship (inefficient at the moment as it is db queries in a loop), then finally create the object-term-taxonomy relationships. An example call might be

$wp_taxonomy->add_term_relationship(array('cities', 'vacation', 5), 3, array('post_category', 'post_tag'));

Which would add both tags and categories for cities, vacation and the term with id 5 to the post with id 3. The eventual idea is to be able to add a whole load of taxonomy to an object with only one or two queries, rather than repeated calls for each category/tag etc. I'm thinking this will be a very common case, for example when saving a new post. To realise this fully, add_term_taxonomy() will have to be modified to accept arrays as well, so that we don't end up calling 3 queries per term per taxonomy.

This is just a starting point - possibly misguided and probably buggy, but it's something to work from...

Cheers,

Jamie.

@majelbstoat
17 years ago

WPTaxonomy skeleton

@majelbstoat
17 years ago

Schema alterations

#3 @ryan
17 years ago

I'm still on the fence regarding the taxonomies table. In most cases the contents of that table will never change. It will be static. I'm wondering just what the performance difference between an int join and a join on a short string would be. Using a taxonomy string would simplify things.

#4 @majelbstoat
17 years ago

Yeah, to be honest, we don't need that lookup table at all. Because I restructured the term_relationships table to an int join, we only store the taxonomy string in one place and we don't ever need to join on taxonomy type. In fact, adding a lookup table will needlessly complicate things now that I think about it, and is probably a good case of over-normalisation. I'll refactor the functions to just accept a taxonomy string and add a bit more functionality to the WPTaxonomy class to flesh it out a bit. When the basics are there, we'll have to start reworking the category API internals. Speaking of which, did you have any comments on the right/left structure for hierarchical data from the above link? This is the right time to implement it, if we're going to...

#5 @jhodgdon
17 years ago

Would it be possible to make the term name field wider (say, 200 characters) or unlimited width (TEXT rather than VARCHAR) in the schema? See #3729 -- it is useful to have category/term names be larger for some multi-lingual plugins. I am also unclear why the slug would be 200 characters if the name itself is much shorter -- they should probably be of similar length.

#6 @jhodgdon
17 years ago

I would also like to vote for making the schema and API easier to understand.

With the current proposal, it is not at all clear to me from reading the code how I would I go about doing common tasks, such as defining a category, setting the categories or tags for a post, figuring out what categories go with a post, etc. It is also not clear to me what entries in the MySQL tables would hold such information. All of which makes it very difficult to comment on the proposal.

#7 follow-up: @majelbstoat
17 years ago

There was lots of discussion about the proposed schema on the hackers list, linked in the description of this ticket.

The slug and term_name being different sizes is a typo. Personally, I would prefer shorter fields as multilingual plugins are used by a minority of users, and having size 200 is pretty inefficient for the vast majority of people. I would certainly by -1 for a text field. Plugins that need to extend the size of a field can do so easily by making an ALTER TABLE statement when they are activated.

The API as in the file is obviously nowhere near complete. Unfortunately, I'm a bit under the weather, so it's been a little while since I worked on it. The retrieval functions aren't written for example, but and example might be something like:

$wp_taxonomy->get_object_terms(3, array('post_category', 'post_tag'));

Returning all the categories and tags for the specified post. The first argument may also be an array to retrieve whole subsets of taxonomy in one go. The aim is to provide functions for every interaction with the database. Setting the categories or tags for a post is pretty easy, and is explained above:

$wp_taxonomy->add_term_relationship(array('cities', 'vacation', 'Japan'), 3, array('post_category', 'post_tag'));

Adds post 3 to categories, 'cities', 'vacation' and 'Japan', as well as tagging them with the same. Obviously, the arguments can be changed to add link_categories or your_taxonomy or whatever.

The proposal isn't difficult to understand, it's just very different from what is there now. What code is there is extensively commented and I'd encourage anyone else who works on this to continue to do the same. Terms are stored in the terms table and shared by all taxonomies. Terms are bound to taxonomies in the terms_taxonomies table, where we also store counts and hierarchical information. Term taxonomies and objects are bound in the term relationships table. Again, though, the idea is that plugin developers won't need to fully understand the structure to add new taxonomy information. (I can understand why you'd want to though!).

If you can outline the specific functionality that you'd like to see, we can make sure it goes in to the API.

#8 @jhodgdon
17 years ago

Well, the schema is one reason why I was confused: as discussed in the hackers list, if the relationships table is storing all relationships between posts and term-taxonomy pairings, and also between links and term-taxonomy, then you cannot use object ID as the primary key. Two reasons, unless again I am not understanding things correctly:

1) You might have a link with ID=3 as well as a post with ID=3.

2) I think you will also make one relationship entry for each tag and for each category, right? So for your example on post 3, you will end up with 6 entries all having ID=3.

#9 follow-up: @majelbstoat
17 years ago

Both of those statements are correct. object_id is not unique. However, the tuple of object_id and term_taxonomy_id will be unique. And term_taxonomy_id is associated with a tuple of term_id and taxonomy which again will be unique. How you interpret the object_id is down to how you query the data. If you're looking for the categories of post 3, you'd choose get_term_relationships(3, 'post-category'), which would return all the terms associated with post 3 as a category. If you're looking for the categories of link 3, you'd choose get_term_relationships(3, 'link-category'), which would return all the terms associated with link 3 as a link-category. Sure, it is possible to add link-category taxonomy to a post, but why would you? The core certainly won't be doing that. Providing sufficient API for adding and querying will help to eliminate basic mis-classifying errors, whilst allowing more complex usage for those who need it.

#10 in reply to: ↑ 7 @jhodgdon
17 years ago

Replying to majelbstoat:

The slug and term_name being different sizes is a typo. Personally, I would prefer shorter fields as multilingual plugins are used by a minority of users, and having size 200 is pretty inefficient for the vast majority of people. I would certainly by -1 for a text field. Plugins that need to extend the size of a field can do so easily by making an ALTER TABLE statement when they are activated.

I agree that multilingual blogs are the minority, but how much storage space are we really talking? 145 extra bytes per term is not very much, and I don't think the WP database is really optimized for space anyway.

Also, your suggestion of ALTER TABLE in the plugin will not work. What you have to do as a plugin author is either distribute a hacked version of upgrade_schema.php or tell the plugin user to edit it themselves, and then remember to re-edit every time they upgrade WP, and BEFORE running the upgrade script. That is because when the user upgrades to a new version of WP, the database schema update will "helpfully" set the fields back to the original field widths, thereby dropping the painstakingly entered extra characters that were stored under the altered table structure. It's not pretty.

#11 in reply to: ↑ 9 @jhodgdon
17 years ago

Replying to majelbstoat:

Both of those statements are correct. object_id is not unique. However, the tuple of object_id and term_taxonomy_id will be unique.

Thanks, makes more sense now... I was hung up on trying to understand the database structure by reading the schema diff, but I guess it's not quite complete/correct yet and I would have been better off starting with the code.

The idea of adding various helper functions to do the queries is excellent, as it will save people time and also serve as models for how to do custom queries in plugins for other purposes.

#12 @ryan
17 years ago

The slug should be longer than the name. The slug can hold encoded UTF-8 strings. One character in a double-byte language will be encoded to six characters: %xx%xx. To hold the entire name, the slug really should be six times longer.

#13 @ryan
17 years ago

So, I think we need to implement the "get tags and categories for post x" query, make sure the schema supports that the way we want it to, and get a first revision in so we can play with it.

When querying post x, we want to be able to get all taxonomy for the post and then sort it into tag and category buckets (and buckets for any taxonomies added via plugin). We have to make sure we don't accidentally get link taxonomies for a link that has the same ID as the post.

#14 @ryan
17 years ago

Wondering if we need an object_type(post|link|foo) in term_relationships so we can slurp all taxonomies for the given object id without getting the taxonomies for other objects of different types with the same ID.

#15 @majelbstoat
17 years ago

You might be right about that, though it is possible we could avoid having it. To not get link_categories for a post, don't request link_categories in combination with a post object_id (reply 11). In the end though, it might be better to have it in the long term, to avoid ambiguity so we can avoid similar issues to "I can't get link cats without post cats" (even if it's possible).

That said, just adding version 2 here, which adds the basis of get_object_terms. It can take an array of objects and taxonomies (or singles of each) and does one query to get it all out in one go. The return structure will vary on what combination of object and taxonomy counts were passed. Should be fairly easy to follow - still obviously quite fluid at this point - we might want to do more work on cleaning up the data before we return it, so we don't return the object_id and taxonomy back with it, but it might be quite useful.

I've also removed the taxonomies table and changed taxonomy back to a string everywhere. There's a couple more function skeletons in there too. I've left the tag length at 55 for now, with the extended slug length of 200. Parent is left as a single column for now, though still keen to hear people's thoughts on using two columns and potentially moving to InnoDB for these new tables (as per the thread on the hackers list).

@majelbstoat
17 years ago

Second iteration, no taxonomy table, more functions.

#16 @ryan
17 years ago

That's looking good. I don't know if we'll really need to be able to ask for all post taxonomies. If we do, we can add something register_post_taxonomy() and get_post_taxonomies(). Plugins could add their particular post taxonomy to the list. We can wait on that until we deem it really needed.

I'm of the inclination to hold off on doing nest set categories until we get the other stuff done. Doing both is a lot of change at once, and I don't want nested set to bog down taxonomy. Requiring a particular storage engine might be a non-starter.

@ryan
17 years ago

4189b + syntax fixes

#17 @ryan
17 years ago

4189c.diff just fixes some errors and warnings. Getting ready to plug things in.

Let's do a sample implementation of get_terms($taxonomy) and see how well it plays with our schema. It will get all terms used in a given taxonomy or taxonomies. Think get_categories() and get_tags(). If we like the way it looks, we can start hooking taxonomy into our cat and tag functions and working out the bugs.

@ryan
17 years ago

#18 @ryan
17 years ago

Patch starts putting taxonomy to use for tags. I removed the WPTaxonomy class since we don't really need a class with what we currently have. add_term() now takes a taxonomy argument. I figure most of the term manipulation will be done within the context of a taxonomy, so just have add_term() add the term to the terms table and take care of adding the term_taxonomy relation.

Things are basically working with tags. You can add tags writing/editing a post and they will show up in the tags field after being added. There's a bug where the query will blow up if the tag is already there.

[Duplicate entry '7' for key 1]
INSERT INTO wp_trunk_term_relationships(object_id, term_taxonomy_id) SELECT '7', term_taxonomy_id FROM wp_trunk_term_taxonomy AS tt INNER JOIN wp_trunk_terms AS t ON tt.term_id = t.term_id WHERE (t.term_slug IN ('test', ' Testy')) AND tt.taxonomy IN ('post_tag')

Anyhow, just a first pass.

#19 @majelbstoat
17 years ago

Do you have a complete patch that includes the updated taxonomy functions, not inside WPTaxonomy - think it was missing from terms.diff?

@ryan
17 years ago

#20 @ryan
17 years ago

Oops, forgot to svn add before diffing.

#21 @ryan
17 years ago

(In [5510]) Very rough initial commit of taxonomy for everyone's hacking pleasure. There be dragons. see #4189

#22 @ryan
17 years ago

(In [5515]) Fix primary key on term_relationships. Add wp_set_object_terms(). Setting post tags working now. see #4189

#23 @ryan
17 years ago

I got rid of REPLACE INTO because it was changing the term's ID each time. One of the INSERT SELECTs was giving me trouble so I changed the code to do a one at a time queries. Then I realized the primary key on term_relationships was the problem. I left my changed code in, but we could probably convert back to the INSERT SELECT to save queries. Anyhow, settting/getting post tags is working now.

#24 @ryan
17 years ago

(In [5516]) Keep the old tables in the schema until everything is converted. see #4189

#25 @ryan
17 years ago

(In [5521]) Add get_terms() and get_term(). Move more of tagging to taxonomy. see #4189

#26 @ryan
17 years ago

(In [5522]) wp_insert_term() and wp_update_term(). see #4189

#27 @ryan
17 years ago

(In [5523]) Start moving link categories to taxonomy. see #4189

#28 @ryan
17 years ago

(In [5524]) Add some term hooks. see #4189

#29 @ryan
17 years ago

(In [5525]) Add get_term_by() and taxonomy registration bits. Move more category stuff to taxonomy. see #4189

#30 @ryan
17 years ago

(In [5528]) wp_insert_category(), cat_rows(), and others using taxonomy. see #4189

#31 @ryan
17 years ago

(In [5529]) Set and get post cats to taxonomy. see #4189

#32 @ryan
17 years ago

(In [5530]) Convert category queries and list cats to taxonomy. see #4189

#33 @markjaquith
17 years ago

WordPress database error: [Unknown column 'rel_type' in 'field list']
SELECT post_id, category_id, rel_type FROM wp_post2cat WHERE post_id IN (82,81,80,79,77,76,68,70,56,53)

After upgrade from 2.2

I see rel_type in some queries, but not in the schema.

#34 @ryan
17 years ago

(In [5532]) More rel_type elimination. see #4189

#35 @ryan
17 years ago

(In [5533]) wp_delete_term(). see #4189

#36 @ryan
17 years ago

(In [5534]) No more TAXONOMY bit fiels. see #4189

#37 @ryan
17 years ago

TODO

  • Update UTW and cat2tag importers
  • Upgrade from 2.2 and from the aborted tags schema
  • Provide separate link category management. Some API, AJAX, and UI separation needed.
  • Add more filters and actions in taxonomy API
  • Provide more back compat.
  • Fix a zillion bugs

When migrating from categories/post2cat to taxonomy, preserving the category IDs would be nice. I'll leave the migration and import work as projects for others.

#38 @ryan
17 years ago

Oh, and resurrect the term and term relationship caches in a more term generic fashion.

#39 @ryan
17 years ago

(In [5540]) Don't create old tables. see #4189

#40 @ryan
17 years ago

(In [5544]) Don't load category cache from old tables. see #4189

#41 @ryan
17 years ago

(In [5551]) Fix schema formatting to appease dbDelta. see #4189

#42 @ryan
17 years ago

[5549] fixes a typo

#43 @ryan
17 years ago

Currently we have a unique key on the slug. This is causing problems during upgrade because a fair number of folks have multiple categories that reduce down to the same slug. Trying to insert them will generate duplicate key errors. Do we remove the unique and make use of term_group for grouping terms that have the same slug, do we maintain the unique and append an incrementing "-x" to the duplicate slugs to make them unique, or do we do both -- make the slugs unique and put them in the same term group? The last option has the benefit of allowing querying a specific term by slug, or by querying all related terms by term group.

#44 @ryan
17 years ago

Hmmm, category children cause a problem. We've never forced slugs to be unique because categories can have different parents. The full hierarchy will make the slug unique when used in a cruft-free link. If we enforce unique slugs now, we will change links. For crazy people who put the category in their post permalinks (category permalink is an oxymoron), changed slugs will break their post links.

So, is the back compat concern worth allowing duplicate slugs? How many people have multiple categories with the same slug (taking into account cats with same slugs and different parents), and how many use category post permalinks?

#45 @ryan
17 years ago

(In [5552]) During upgrade, make slugs unique and put like slugs in a term group. see #4189

#46 @ryan
17 years ago

(In [5553]) Update cat2tag converter. Some term API tweaks. see #4189

#47 @ryan
17 years ago

(In [5554]) cat2tag fixes. see #4189

#48 @ryan
17 years ago

Upgrade works with categories that have duplicate slugs now. We make the slugs unique and put the terms in the same term group.

The category to tag converter is working again. I converted over 3000 categories to tags with one click. Conversion took just a few seconds.

#49 @ryan
17 years ago

(In [5555]) Term cache work. see #4189

#50 @ryan
17 years ago

[5556] fixes some relationship deletion and count issues.

#51 @ryan
17 years ago

(In [5557]) Remove debug. see #4189

#52 @ryan
17 years ago

(In [5558]) Change term count callback style. see #4189

#53 @ryan
17 years ago

(In [5559]) Delete term if no taxonomies use it. see #4189

#54 @ryan
17 years ago

(In [5560]) Link category fixes. see #4189

#55 @ryan
17 years ago

(In [5561]) Typo fix from flinkflonk. fixes #4347 see #4189

#56 @ryan
17 years ago

(In [5562]) Fix default cat options. see #4189

#57 @ryan
17 years ago

(In [5563]) Fix up exporter. Remove more refrences to the categories table. see #4189

#58 @ryan
17 years ago

(In [5564]) Return true when deleting link to satisfy admin-ajax error check. see #4189

#59 @ryan
17 years ago

TODO

  • Recalculate counts during upgrade
  • Provide link category management UI
  • Provide separate link category add and delete methods for AJAX and the category quick add
  • Fully restore caching
  • Update importers that still select on the categories table
  • Provide back compat for plugins

#60 @ryan
17 years ago

[5567] fixes a bug that was breaking setting of cats and tags.

#61 @markjaquith
17 years ago

(In [5571]) Fix category/tag base wording. see #4189

#62 @ryan
17 years ago

(In [5576]) Fix category permalinks. Update cat_ID to term_id. fixes #4349 see #4189

#63 @ryan
17 years ago

(In [5586]) Use API instead of SELECT. see #4189

#65 @markjaquith
17 years ago

(In [5590]) switch to term_id and name for category sorting. see #4189. sort by term_id for category permalinks, by name for get_the_category(). fixes #4335 for trunk. Props Erik Barzeski for the find.

#66 @ryan
17 years ago

(In [5592]) Use taxonomy instead of post2cat. see #4189

#67 @ryan
17 years ago

[5593] get_term_children and category query fixes

#68 @ryan
17 years ago

(In [5594]) Fix listing of categories with parents. see #4189

#69 @ryan
17 years ago

(In [5595]) Add missing taxonomy argument. Props nbachiyski. fixes #4362 see #4189

#70 @ryan
17 years ago

(In [5596]) Add ignore_empty option to wp_count_terms(). see #4189

#71 @ryan
17 years ago

(In [5597]) Category hierarchy fixes. see #4189

#72 @ryan
17 years ago

(In [5598]) Some term caching. see #4189

#73 @ryan
17 years ago

(In [5599]) Fix slug queries. see #4189

@johnbillion
17 years ago

Installation errors at r5602

#74 @johnbillion
17 years ago

Haven't been following the new taxonomy system too closely but I'm pretty sure no-one else has reported this yet. At Step 2 of installation of trunk at r5602, I get these database errors.

#75 @ryan
17 years ago

(In [5612]) If a term has no children, return an empty array for child_of queries. see #4189

#76 @ryan
17 years ago

(In [5616]) Abstract object term cache a bit more. see #4189

#77 @ryan
17 years ago

(In [5621]) Use new term tables in wp_install_defaults(). see #4380, #4189

#78 @ryan
17 years ago

(In [5622]) query by parent, not child_of, in get_nested_categories(). see #4189

#79 @ryan
17 years ago

(In [5637]) Separate AJAX cat adder into post and link flavors. see #4189

#80 @ryan
17 years ago

(In [5642]) Fix link category dropdown and filter. Add some back compat to get_categories. see #4189

#81 @ryan
17 years ago

(In [5643]) Upgrade fixes. see #4189

#82 @ryan
17 years ago

(In [5652]) Term sanitization. see #4189

#83 @ryan
17 years ago

(In [5660]) Use get_term_children() and get_term_field(). see #4189

#85 @ryan
17 years ago

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.