Opened 17 years ago
Closed 17 years ago
#4189 closed defect (bug) (fixed)
Taxonomy Schema and API
Reported by: | ryan | Owned by: | |
---|---|---|---|
Milestone: | 2.3 | Priority: | normal |
Severity: | normal | Version: | 2.2 |
Component: | General | Keywords: | term taxonomy tags categories |
Focuses: | Cc: |
Description
Create a generic taxonomy and term API and schema that can be used for tags, post categories, link categories, and arbitrary future taxonomies.
See the hackers list discussion:
http://www.nabble.com/Taxonomy-Schema-Proposal-tf3582783.html
See also the discussion on the tagging ticket, #3723
Attachments (7)
Change History (92)
#2
@
17 years ago
That being said, here is a basic schema (still with the parent column) and the beginnings of a WPTaxonomy class. I stayed away from WPTerm, because, even though it is a nice construct, you don't want to have to create a new instance of a term every time you want to do an operation. Taxonomies are stored in a separate table for permanence, but will be retrieved from get_options for minimal db impact. I think that's what was agreed on on the list? is_int() is used to distinguish between term_ids and term_slugs.
add_term_relationship() is worth a mention as it is designed to be quite flexible. It can take an array of term ids or slugs, an object_id and an array of taxonomy ids or taxonomy slugs. It should create the terms that don't exist (though, in general this shouldn't be required), then create or increment the term-taxonomy relationship (inefficient at the moment as it is db queries in a loop), then finally create the object-term-taxonomy relationships. An example call might be
$wp_taxonomy->add_term_relationship(array('cities', 'vacation', 5), 3, array('post_category', 'post_tag'));
Which would add both tags and categories for cities, vacation and the term with id 5 to the post with id 3. The eventual idea is to be able to add a whole load of taxonomy to an object with only one or two queries, rather than repeated calls for each category/tag etc. I'm thinking this will be a very common case, for example when saving a new post. To realise this fully, add_term_taxonomy() will have to be modified to accept arrays as well, so that we don't end up calling 3 queries per term per taxonomy.
This is just a starting point - possibly misguided and probably buggy, but it's something to work from...
Cheers,
Jamie.
#3
@
17 years ago
I'm still on the fence regarding the taxonomies table. In most cases the contents of that table will never change. It will be static. I'm wondering just what the performance difference between an int join and a join on a short string would be. Using a taxonomy string would simplify things.
#4
@
17 years ago
Yeah, to be honest, we don't need that lookup table at all. Because I restructured the term_relationships table to an int join, we only store the taxonomy string in one place and we don't ever need to join on taxonomy type. In fact, adding a lookup table will needlessly complicate things now that I think about it, and is probably a good case of over-normalisation. I'll refactor the functions to just accept a taxonomy string and add a bit more functionality to the WPTaxonomy class to flesh it out a bit. When the basics are there, we'll have to start reworking the category API internals. Speaking of which, did you have any comments on the right/left structure for hierarchical data from the above link? This is the right time to implement it, if we're going to...
#5
@
17 years ago
Would it be possible to make the term name field wider (say, 200 characters) or unlimited width (TEXT rather than VARCHAR) in the schema? See #3729 -- it is useful to have category/term names be larger for some multi-lingual plugins. I am also unclear why the slug would be 200 characters if the name itself is much shorter -- they should probably be of similar length.
#6
@
17 years ago
I would also like to vote for making the schema and API easier to understand.
With the current proposal, it is not at all clear to me from reading the code how I would I go about doing common tasks, such as defining a category, setting the categories or tags for a post, figuring out what categories go with a post, etc. It is also not clear to me what entries in the MySQL tables would hold such information. All of which makes it very difficult to comment on the proposal.
#7
follow-up:
↓ 10
@
17 years ago
There was lots of discussion about the proposed schema on the hackers list, linked in the description of this ticket.
The slug and term_name being different sizes is a typo. Personally, I would prefer shorter fields as multilingual plugins are used by a minority of users, and having size 200 is pretty inefficient for the vast majority of people. I would certainly by -1 for a text field. Plugins that need to extend the size of a field can do so easily by making an ALTER TABLE statement when they are activated.
The API as in the file is obviously nowhere near complete. Unfortunately, I'm a bit under the weather, so it's been a little while since I worked on it. The retrieval functions aren't written for example, but and example might be something like:
$wp_taxonomy->get_object_terms(3, array('post_category', 'post_tag'));
Returning all the categories and tags for the specified post. The first argument may also be an array to retrieve whole subsets of taxonomy in one go. The aim is to provide functions for every interaction with the database. Setting the categories or tags for a post is pretty easy, and is explained above:
$wp_taxonomy->add_term_relationship(array('cities', 'vacation', 'Japan'), 3, array('post_category', 'post_tag'));
Adds post 3 to categories, 'cities', 'vacation' and 'Japan', as well as tagging them with the same. Obviously, the arguments can be changed to add link_categories or your_taxonomy or whatever.
The proposal isn't difficult to understand, it's just very different from what is there now. What code is there is extensively commented and I'd encourage anyone else who works on this to continue to do the same. Terms are stored in the terms table and shared by all taxonomies. Terms are bound to taxonomies in the terms_taxonomies table, where we also store counts and hierarchical information. Term taxonomies and objects are bound in the term relationships table. Again, though, the idea is that plugin developers won't need to fully understand the structure to add new taxonomy information. (I can understand why you'd want to though!).
If you can outline the specific functionality that you'd like to see, we can make sure it goes in to the API.
#8
@
17 years ago
Well, the schema is one reason why I was confused: as discussed in the hackers list, if the relationships table is storing all relationships between posts and term-taxonomy pairings, and also between links and term-taxonomy, then you cannot use object ID as the primary key. Two reasons, unless again I am not understanding things correctly:
1) You might have a link with ID=3 as well as a post with ID=3.
2) I think you will also make one relationship entry for each tag and for each category, right? So for your example on post 3, you will end up with 6 entries all having ID=3.
#9
follow-up:
↓ 11
@
17 years ago
Both of those statements are correct. object_id is not unique. However, the tuple of object_id and term_taxonomy_id will be unique. And term_taxonomy_id is associated with a tuple of term_id and taxonomy which again will be unique. How you interpret the object_id is down to how you query the data. If you're looking for the categories of post 3, you'd choose get_term_relationships(3, 'post-category'), which would return all the terms associated with post 3 as a category. If you're looking for the categories of link 3, you'd choose get_term_relationships(3, 'link-category'), which would return all the terms associated with link 3 as a link-category. Sure, it is possible to add link-category taxonomy to a post, but why would you? The core certainly won't be doing that. Providing sufficient API for adding and querying will help to eliminate basic mis-classifying errors, whilst allowing more complex usage for those who need it.
#10
in reply to:
↑ 7
@
17 years ago
Replying to majelbstoat:
The slug and term_name being different sizes is a typo. Personally, I would prefer shorter fields as multilingual plugins are used by a minority of users, and having size 200 is pretty inefficient for the vast majority of people. I would certainly by -1 for a text field. Plugins that need to extend the size of a field can do so easily by making an ALTER TABLE statement when they are activated.
I agree that multilingual blogs are the minority, but how much storage space are we really talking? 145 extra bytes per term is not very much, and I don't think the WP database is really optimized for space anyway.
Also, your suggestion of ALTER TABLE in the plugin will not work. What you have to do as a plugin author is either distribute a hacked version of upgrade_schema.php or tell the plugin user to edit it themselves, and then remember to re-edit every time they upgrade WP, and BEFORE running the upgrade script. That is because when the user upgrades to a new version of WP, the database schema update will "helpfully" set the fields back to the original field widths, thereby dropping the painstakingly entered extra characters that were stored under the altered table structure. It's not pretty.
#11
in reply to:
↑ 9
@
17 years ago
Replying to majelbstoat:
Both of those statements are correct. object_id is not unique. However, the tuple of object_id and term_taxonomy_id will be unique.
Thanks, makes more sense now... I was hung up on trying to understand the database structure by reading the schema diff, but I guess it's not quite complete/correct yet and I would have been better off starting with the code.
The idea of adding various helper functions to do the queries is excellent, as it will save people time and also serve as models for how to do custom queries in plugins for other purposes.
#12
@
17 years ago
The slug should be longer than the name. The slug can hold encoded UTF-8 strings. One character in a double-byte language will be encoded to six characters: %xx%xx. To hold the entire name, the slug really should be six times longer.
#13
@
17 years ago
So, I think we need to implement the "get tags and categories for post x" query, make sure the schema supports that the way we want it to, and get a first revision in so we can play with it.
When querying post x, we want to be able to get all taxonomy for the post and then sort it into tag and category buckets (and buckets for any taxonomies added via plugin). We have to make sure we don't accidentally get link taxonomies for a link that has the same ID as the post.
#14
@
17 years ago
Wondering if we need an object_type(post|link|foo) in term_relationships so we can slurp all taxonomies for the given object id without getting the taxonomies for other objects of different types with the same ID.
#15
@
17 years ago
You might be right about that, though it is possible we could avoid having it. To not get link_categories for a post, don't request link_categories in combination with a post object_id (reply 11). In the end though, it might be better to have it in the long term, to avoid ambiguity so we can avoid similar issues to "I can't get link cats without post cats" (even if it's possible).
That said, just adding version 2 here, which adds the basis of get_object_terms. It can take an array of objects and taxonomies (or singles of each) and does one query to get it all out in one go. The return structure will vary on what combination of object and taxonomy counts were passed. Should be fairly easy to follow - still obviously quite fluid at this point - we might want to do more work on cleaning up the data before we return it, so we don't return the object_id and taxonomy back with it, but it might be quite useful.
I've also removed the taxonomies table and changed taxonomy back to a string everywhere. There's a couple more function skeletons in there too. I've left the tag length at 55 for now, with the extended slug length of 200. Parent is left as a single column for now, though still keen to hear people's thoughts on using two columns and potentially moving to InnoDB for these new tables (as per the thread on the hackers list).
#16
@
17 years ago
That's looking good. I don't know if we'll really need to be able to ask for all post taxonomies. If we do, we can add something register_post_taxonomy() and get_post_taxonomies(). Plugins could add their particular post taxonomy to the list. We can wait on that until we deem it really needed.
I'm of the inclination to hold off on doing nest set categories until we get the other stuff done. Doing both is a lot of change at once, and I don't want nested set to bog down taxonomy. Requiring a particular storage engine might be a non-starter.
#17
@
17 years ago
4189c.diff just fixes some errors and warnings. Getting ready to plug things in.
Let's do a sample implementation of get_terms($taxonomy) and see how well it plays with our schema. It will get all terms used in a given taxonomy or taxonomies. Think get_categories() and get_tags(). If we like the way it looks, we can start hooking taxonomy into our cat and tag functions and working out the bugs.
#18
@
17 years ago
Patch starts putting taxonomy to use for tags. I removed the WPTaxonomy class since we don't really need a class with what we currently have. add_term() now takes a taxonomy argument. I figure most of the term manipulation will be done within the context of a taxonomy, so just have add_term() add the term to the terms table and take care of adding the term_taxonomy relation.
Things are basically working with tags. You can add tags writing/editing a post and they will show up in the tags field after being added. There's a bug where the query will blow up if the tag is already there.
[Duplicate entry '7' for key 1] INSERT INTO wp_trunk_term_relationships(object_id, term_taxonomy_id) SELECT '7', term_taxonomy_id FROM wp_trunk_term_taxonomy AS tt INNER JOIN wp_trunk_terms AS t ON tt.term_id = t.term_id WHERE (t.term_slug IN ('test', ' Testy')) AND tt.taxonomy IN ('post_tag')
Anyhow, just a first pass.
#19
@
17 years ago
Do you have a complete patch that includes the updated taxonomy functions, not inside WPTaxonomy - think it was missing from terms.diff?
#23
@
17 years ago
I got rid of REPLACE INTO because it was changing the term's ID each time. One of the INSERT SELECTs was giving me trouble so I changed the code to do a one at a time queries. Then I realized the primary key on term_relationships was the problem. I left my changed code in, but we could probably convert back to the INSERT SELECT to save queries. Anyhow, settting/getting post tags is working now.
#33
@
17 years ago
WordPress database error: [Unknown column 'rel_type' in 'field list']
SELECT post_id, category_id, rel_type FROM wp_post2cat WHERE post_id IN (82,81,80,79,77,76,68,70,56,53)
After upgrade from 2.2
I see rel_type
in some queries, but not in the schema.
#37
@
17 years ago
TODO
- Update UTW and cat2tag importers
- Upgrade from 2.2 and from the aborted tags schema
- Provide separate link category management. Some API, AJAX, and UI separation needed.
- Add more filters and actions in taxonomy API
- Provide more back compat.
- Fix a zillion bugs
When migrating from categories/post2cat to taxonomy, preserving the category IDs would be nice. I'll leave the migration and import work as projects for others.
#38
@
17 years ago
Oh, and resurrect the term and term relationship caches in a more term generic fashion.
#43
@
17 years ago
Currently we have a unique key on the slug. This is causing problems during upgrade because a fair number of folks have multiple categories that reduce down to the same slug. Trying to insert them will generate duplicate key errors. Do we remove the unique and make use of term_group for grouping terms that have the same slug, do we maintain the unique and append an incrementing "-x" to the duplicate slugs to make them unique, or do we do both -- make the slugs unique and put them in the same term group? The last option has the benefit of allowing querying a specific term by slug, or by querying all related terms by term group.
#44
@
17 years ago
Hmmm, category children cause a problem. We've never forced slugs to be unique because categories can have different parents. The full hierarchy will make the slug unique when used in a cruft-free link. If we enforce unique slugs now, we will change links. For crazy people who put the category in their post permalinks (category permalink is an oxymoron), changed slugs will break their post links.
So, is the back compat concern worth allowing duplicate slugs? How many people have multiple categories with the same slug (taking into account cats with same slugs and different parents), and how many use category post permalinks?
#48
@
17 years ago
Upgrade works with categories that have duplicate slugs now. We make the slugs unique and put the terms in the same term group.
The category to tag converter is working again. I converted over 3000 categories to tags with one click. Conversion took just a few seconds.
#59
@
17 years ago
TODO
- Recalculate counts during upgrade
- Provide link category management UI
- Provide separate link category add and delete methods for AJAX and the category quick add
- Fully restore caching
- Update importers that still select on the categories table
- Provide back compat for plugins
#74
@
17 years ago
Haven't been following the new taxonomy system too closely but I'm pretty sure no-one else has reported this yet. At Step 2 of installation of trunk at r5602, I get these database errors.
Seeing as categories are going to be changing so much, I wondered if it might be a good time to consider an alternative hierarchical implementation in the database? This article makes a good case for nested set categories:
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
To implement this, we'd need a left and right column instead of a parent column. I quite like this anyway, as it also allows for more complicated taxonomy schemes, such as linked lists. There are lots of benefits to using this as opposed to our current method, the main of which is that it is optimised for retrieving data, rather than inserting.