Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#5964 closed defect (bug) (worksforme)

Multi-word tags encoded incorrectly

Reported by: DavidMeade
Owned by:
Milestone:
Priority: normal
Severity: normal
Version:
Component: General
Keywords: tags, tagging
Focuses:
Cc:

Description

When tagging posts in wordpress with a two-word tag (let's say the tag is "tag name"), wordpress replaces the space with a dash (-), resulting in "tag-name".

However, technorati states that spaces should be replaced with a plus (+) and is thus expecting "tag+name". It sees "tag-name" as an entirely different thing, and returns different results for each.
(see: http://support.technorati.com/support/siteguide/tags )

The Microformats folks (microformats.org) also state that the plus (+) is to be used to represent spaces in multi-word tags (see: http://microformats.org/wiki/rel-tag#Encoding_issues )

Meanwhile, wordpress doesn't even seem to allow an author to use a plus (+) manually: if they do, wordpress just concatenates the two words into one ("tagname").

It seems that Wordpress is defying the convention at technorati and the microformats specification -- and that makes wordpress rather Technorati unfriendly. At technorati (or any other site which conforms to the microformats specification), I can't search for "tag name" and expect wordpress posts tagged "tag name" to show up.

Shouldn't wordpress present multi-word tags according to the microformats specification?
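
The behaviour reported above can be reproduced with a small sketch. This is only an approximation of what the description observes, not WordPress's actual slug code; the function name `reported_slug` and the exact character rules are assumptions made for illustration.

```python
import re


def reported_slug(tag_name: str) -> str:
    """Approximate the slug behaviour described in this ticket (hypothetical helper)."""
    slug = tag_name.strip().lower()
    # Assumption: anything other than letters, digits, spaces and dashes is
    # dropped, which is why "tag+name" collapses into "tagname".
    slug = re.sub(r"[^a-z0-9 \-]", "", slug)
    # Runs of whitespace become a single dash.
    return re.sub(r"\s+", "-", slug)


print(reported_slug("tag name"))  # tag-name
print(reported_slug("tag+name"))  # tagname
```

Neither input produces "tag+name", which is the form the description says technorati expects.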

Change History (15)

#1 @fitztrev
14 years ago

WordPress passes the tag name to the RSS feed with no alterations (no dashes or plus signs). Technorati takes care of formatting it on their end. The dashes added to the tag's slug by WordPress are only for within your site.

That part of the ticket is invalid. I'm not sure about the author name, though. I don't know if WordPress supports "+" characters in usernames.

#2 @DavidMeade
14 years ago

It's not just an RSS issue. Even though the links are to pages within our own sites, they are marked with rel="tag" -- and the rel="tag" microformat clearly states that the words in multi-word tags should be separated with the + sign. (See above links) This is not happening in Wordpress right now and causes posts tagged with a multi-word tag to not show up in searches for that tag at technorati.

Technorati tag links do not have to point to technorati (they can point to our own pages) they just have to conform to the rel="tag" specs ... which wordpress does not. Technorati also does not rely solely on an RSS feed to get tags - in fact they tell you that in order to get indexed correctly you should just use a rel="tag" in the body and that this is the most reliable method ... but wordpress is not correctly formatting those rel="tag" links according to the specification.
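
As a rough illustration of why the slug matters to a consumer of rel="tag" links, here is a sketch of how an aggregator might read the tag out of a link. It assumes the consumer takes the last path segment of the href and decodes "+" as a space, which is my reading of the rel-tag page linked above. The function name is hypothetical and this is not Technorati's actual code.

```python
from urllib.parse import urlsplit, unquote_plus


def tag_from_rel_tag_href(href: str) -> str:
    """Return the tag a rel-tag consumer would likely derive from a link (hypothetical helper)."""
    path = urlsplit(href).path
    # Assumption: the tag is the last path segment of the URL.
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    # "+" and "%20" both decode to a space; a dash is left untouched.
    return unquote_plus(last_segment)


print(tag_from_rel_tag_href("http://example.com/tag/tag+name"))  # tag name
print(tag_from_rel_tag_href("http://example.com/tag/tag-name"))  # tag-name
```

Under that reading, the dash form is taken as a different tag from "tag name".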

Slugs for tags/categories should take into account that they are going to be used in rel="tag" links and conform to the appropriate microformat specification that technorati and so many other aggregators expect. (IMHO)

#4 @lloydbudd
14 years ago

-1 I vote for wontfix

This issue has actually existed since long before tags. It has existed ever since categories have had the rel-tag. A dash dividing words is the standard for words in URLs on the web (what search engines understand), and it is easier to read than pluses and easier to enter manually (US keyboards). It would be better for technorati and the microformat to reflect this.

Plus is used within WordPress to find posts with both of those tags.
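
A minimal sketch of that existing meaning of the plus sign, with made-up post data: the segment after /tag/ is split on "+" and a post matches only if it carries every listed slug. The function names and the sample posts are hypothetical, and this is not WordPress's query code.

```python
def parse_tag_query(segment: str) -> list[str]:
    """Split a tag URL segment on '+' into slugs that must all match."""
    return [slug for slug in segment.split("+") if slug]


def posts_matching_all(posts: dict[str, set[str]], segment: str) -> list[str]:
    """Return titles of posts that carry every slug named in the segment."""
    wanted = set(parse_tag_query(segment))
    return sorted(title for title, slugs in posts.items() if wanted <= slugs)


# Hypothetical posts mapped to their tag slugs.
posts = {
    "Sourdough basics": {"bread", "baking"},
    "Oven tips": {"baking"},
    "Knife care": {"tools"},
}

print(posts_matching_all(posts, "bread+baking"))  # ['Sourdough basics']
```

If "+" also stood for a space inside a single slug, a segment like "daisy+chains" would be ambiguous between one two-word tag and two separate tags.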

#5 @davidmeade
14 years ago

I think it's better for tools (wordpress) to respect published standards (microformats) than for those tools to expect standards to change.

A dash isn't the standard for spaces in urls. Spaces are encoded to %20 and the plus sign is shorthand for this as microformats, technorati, and countless search engines understand.

The fact that wordpress hasn't conformed to this standard in the past shouldn't preclude it from meeting it in the future.

The location of dash and plus on the keyboard is irrelevant, as in either case the user only hits the space bar ... it's how that space is represented in the tag slug that is important ... and that's got nothing to do with which key any particular user finds easier to press.

Without this fix, wordpress will be actively excluding its users from being accurately indexed at technorati and other search engines/aggregators that honor the standards that have been in place for quite a while now.

#6 follow-up: @lloydbudd
14 years ago

davidmeade, you are very generous with applying the term standard. As I wrote, long before this Microformat, web developers and search engines have standardized on using dash in URLs to mean a space.

%20 is ugly and awkward. Suggesting part of my argument is irrelevant does not further yours. It is relevant when I go to a blog and know that you have tagged articles about "daisy chains" so I can enter the URL http://example.com/tag/daisy-chains . That is an elegant web.

#7 in reply to: ↑ 6 @Nazgul
14 years ago

Replying to lloydbudd:

davidmeade, you are very generous with applying the term standard. As I wrote, long before this Microformat, web developers and search engines have standardized on using dash in URLs to mean a space.

Lloyd, the + sign as a space is more prevalent than you might think.

If we look at the good old HTML 4.01 specs, and section 17.13.4 in particular (http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4) we see:

Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2.

You can also see this behaviour in the PHP urlencode function which converts spaces to plus signs.

%20 is ugly and awkward. Suggesting part of my argument is irrelevant does not further yours. It is relevant when I go to a blog and know that you have tagged articles about "daisy chains" so I can enter the URL http://example.com/tag/daisy-chains . That is an elegant web.

I agree that %20 is ugly, but http://example.com/tag/daisy+chains isn't and it's standards compliant.
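
For comparison, the two encodings can be shown with Python's standard library, used here as a stand-in for the PHP function mentioned above: `quote` percent-encodes the space, while `quote_plus` applies the form-style rule and emits a plus.

```python
from urllib.parse import quote, quote_plus, unquote_plus

tag = "daisy chains"

print(quote(tag))       # daisy%20chains
print(quote_plus(tag))  # daisy+chains

# Decoding the form-style value restores the original space.
print(unquote_plus("daisy+chains"))  # daisy chains
```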

#8 @davidmeade
14 years ago

Right.

lloydbudd - you are probably correct that the word 'specification' is better suited to the microformats publication. However, as Nazgul points out, + representing a space is part of the html standard. This standard has influenced specifications like the rel="tag" microformat and has been honored to such a degree that it could be considered a de facto standard.

With a "+" used in tag slugs rather than the "-", Wordpress can still use "pretty" urls (http://example.com/tag/daisy+chains) that meet the html standard and the microformats defacto-standard (which so many search engines/aggregators expect). It cannot do so with the "-" used in tag slugs.

#9 @lloydbudd
14 years ago

I'm not arguing that the + as a space isn't prevalent, but you also won't find people with a taste for aesthetics using it.

this-is-a-sentence is much easier to enter and read than this+is+a+sentence

davidmeade, you are confused if you think that using - as a word delimiter breaks the HTML standard. Dash is the standard for space in URLs on the modern web.

Further, you haven't demonstrated a real cost of WordPress not adopting it. Technorati could easily choose to recognize WP's multi-word tags, and anyway navigation and searching by tags on Technorati is a niche experience.

#10 @davidmeade
14 years ago

lloydbudd: the cost is that wordpress does not conform to the standards and specifications that countless others have adopted and thus prevents wordpress users from being indexed at countless sites (such as technorati). It is NOT technorati's responsibility to recognize the malformed rel=tag links that wordpress generates. Technorati (and countless others) understand rel-tags according to the published specification/standard being used by so many others. It is wordpress' responsibility to meet those standards so that wordpress users can take advantage of those countless sites the same way blogger users or users of other tools can. It's not just technorati, it's a long list of sites and services ... and it's wordpress' responsibility to implement their rel=tag feature correctly.

Wordpress clearly acknowledges the value to its users that presenting rel=tag links offers or they wouldn't be rel=tag'ing the links. That value is largely negated by the fact that wordpress has implemented the rel=tag links in a way that violates the rel=tag specification. It is not technorati's or anyone else's duty to acknowledge rel=tags that don't meet the specification. It's the blogging tool's responsibility to meet it.

Again: this has NOTHING to do with aesthetics. Wordpress will still display a space and not the plus. Again: it has NOTHING to do with what is easier to "enter" as in BOTH cases you hit the space bar.

This has to do with one thing: Wordpress presents rel=tag links so its users can be indexed by the countless sites that honor the microformat specification for rel=tag (this specification was influenced by and conforms to the HTML standard) ... and yet Wordpress has implemented rel=tags incorrectly according to that standard - and this should be fixed.

The advantage to this fix is clear. The cost of not adopting it is clear. I still fail to understand how there could be a reasonable argument against this fix.

#11 @andreashaugstrup
14 years ago

+1 for fixing

As has been pointed out, this is a technical issue and not an aesthetic one. As has also been pointed out, URL encodings were standardized long before anyone ever thought about Wordpress. RFC 1738 from 1994 (!) designates the space as an unsafe character that must always be encoded within a URL as %20. Later the HTML specification allows for the use of a plus sign.

No matter what you think, dashes are not standard just because a popular blogging engine uses them. Wordpress can of course use any format it wishes for internal use, making any kind of substitution. However, when wishing to interact with other parties Wordpress should always follow the established standards rather than making up its own. Wordpress cannot choose to adopt half the rel-tag specification, but not the other half. Either this bug should be fixed or rel-tag support in Wordpress should be dropped.

While dashes are used in Wordpress as a substitute for spaces in tags, it is not a good solution and that's why RFC 1738 describes a different route (%20). When you have hyphenated compound words it becomes impossible to tell a hyphenated compound apart from separate words.

Take for example the tags "my wet suit" and "my wet-suit". These are distinct tags that carry different meanings (feel free to make up other examples, my native language is not English). But in Wordpress they would both result in the same tag URL: "my-wet-suit" even though they are separate tags. Using the only correct way to encode URLs (RFC 1738) they would remain distinct "my%20wet%20suit" and "my%20wet-suit" respectively - or as accepted in HTML: "my+wet+suit" and "my+wet-suit".
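
The collision can be checked with a short sketch. The `dash_slug` helper below is a hypothetical simplification (lowercase, spaces to dashes) and not WordPress's real slug function; the two encodings come from Python's standard library.

```python
from urllib.parse import quote, quote_plus


def dash_slug(tag: str) -> str:
    """Simplified stand-in for dash-based slugs: lowercase, spaces to dashes."""
    return "-".join(tag.lower().split())


for tag in ["my wet suit", "my wet-suit"]:
    print(f"{tag!r}: dash={dash_slug(tag)} percent={quote(tag)} plus={quote_plus(tag)}")

# The dash form maps both tags to the same slug; the encoded forms do not.
print(dash_slug("my wet suit") == dash_slug("my wet-suit"))    # True
print(quote_plus("my wet suit") == quote_plus("my wet-suit"))  # False
```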

This should be fixed in Wordpress. Other software systems like Drupal get this right.

#12 @karyrogers
14 years ago

+1 for fixing.

Reading over the opinions it looks to me that one side doesn't want to conform to a published and well known specification because "that's not how we've always done it." I can't see a compelling reason to NOT fix this.

I think the posts by davidmeade at 02/29/08 19:12:34 and by andreashaugstrup at 02/29/08 19:35:02 state a clear case as to why this should be fixed.

#13 @matt
14 years ago

  • Resolution set to worksforme
  • Status changed from new to closed

Looks like there are still some questions around multi-word tags:

http://microformats.org/wiki/rel-tag-faq#Multi-word_tags

rel-tag is still a DRAFT specification and has a number of issues: for example, it doesn't work with the query string URLs that we generate, and it seems to be quite unfriendly to internationalization. If the spec still has these issues when it is made final, we should probably just remove it, as it wouldn't be worth supporting.

#14 @lloydbudd
14 years ago

  • Milestone 2.6 deleted

#15 @davidmeade
14 years ago

First of all, the question here isn't about multi-word tags, it's about multiple tags. The specification clearly states spaces should be encoded as +.

Secondly, I do not believe that rel=tag is a draft. Drafts and specifications are listed separately on the main page (http://microformats.org/wiki/Main_Page), and you can see that rel=tag is not in the draft section.

The unanswered question you've linked to in the wiki FAQ is asking what can be done to alias tag usage at sites that have multiple ways of including multi-word tags. It does not speak to how the specification states multiple tags should be encoded. It is quite clear on that point.

#16 @davidmeade
14 years ago

sorry that last reply was confusing:

The issue isn't about compound-tags (like "wet-suit"), it's about multiple word tags (like "wet suit"). One you go scuba diving in, and one you rush home from the office to change ... wordpress treats them as the same thing.

The specification clearly states that spaces should be encoded as a +. But I can't do that. There is no way for me to tag "wet suit" in wordpress (walking to work in the rain?); it forces me to use "wet-suit" (scuba diving).

I can't even MANUALLY use the + to accomplish this as then wordpress just concatenates the two as "wetsuit".

How to alias the various sites out there is an entirely different matter.
The specification clearly states that spaces should be encoded as a +. Wordpress does not do this ... it does not even allow the user to do it manually.

Wordpress isn't adhering to the specification it has half-implemented, and thus wordpress users aren't getting their posts tagged correctly out on the web. Blogger users are. Drupal users are. But not WordPress users.
