Darwinian Web
Adam Green's thoughts on the evolution of the Internet

Problem with RSS categories

Posted on Sunday, January 15, 2006 at 8:04 AM

The first version of my RubyRiver aggregator displays all the items in a feed without any filtering, which allows items unrelated to Ruby to appear. This morning I decided to explore filtering based on the category tag. The RSS 2.0 specification states that "You may include as many category elements as you need to, for different domains, and to have an item cross-referenced in different parts of the same domain." So there should be no problem. All I had to do was extract the category tag and select items that contained "Ruby" in that tag.

Here is an example from the blog Eric's Ponderings, which contains some useful Ruby programming posts, but also switches to football whenever the University of Texas Longhorns win a big game. Here is a portion of one of his RSS feed items about Ruby and Java:

<category>Software Development</category>
<category>Ruby</category>
<category>Java</category>
<category>Ruby On Rails</category>
<category>java</category>
<category>ruby</category>
<category>build</category>
<category>maven</category>
I don't know why he includes Ruby twice, but that shouldn't get in the way of my code, as long as there is at least one category I can match.
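
As a rough illustration of the matching described above, here is a minimal sketch using REXML. It is not the actual RubyRiver code: the method names, the sample feed, and the item titles are all made up for the example, and the match is a simple case-insensitive substring test, so "Ruby", "ruby", and "Ruby On Rails" all count.

```ruby
require 'rexml/document'

# True if any <category> of the item mentions the topic (case-insensitive).
def item_matches_topic?(item, topic)
  item.get_elements('category').any? do |category|
    category.text.to_s.downcase.include?(topic.downcase)
  end
end

# Return the <item> elements of an RSS 2.0 document whose categories match.
def filter_items(rss_xml, topic)
  doc = REXML::Document.new(rss_xml)
  doc.get_elements('//item').select { |item| item_matches_topic?(item, topic) }
end

# Hypothetical feed: the titles and the football category are invented.
SAMPLE_FEED = <<~XML
  <rss version="2.0">
    <channel>
      <title>Sample feed</title>
      <item>
        <title>Building Ruby and Java projects</title>
        <category>Software Development</category>
        <category>Ruby</category>
        <category>Java</category>
        <category>ruby</category>
      </item>
      <item>
        <title>Longhorns win</title>
        <category>Football</category>
      </item>
    </channel>
  </rss>
XML

titles = filter_items(SAMPLE_FEED, 'Ruby').map { |item| item.elements['title'].text }
puts titles
```

Because the check uses any?, a duplicated category such as Ruby and ruby does no harm. An item with no category elements at all is simply dropped, which is exactly the weakness that shows up with single-topic feeds.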

Checking the rest of my feeds, however, brought out a problem with the way some blogs use categories. For example, the O'Reilly Ruby blog is entirely about Ruby, so the authors don't feel the need to include Ruby in the categories. This is apparently assumed from the context. Instead the categories are terms like Opinion, News, and Articles. This makes sense within the blog, but doesn't help when the feed is aggregated with many others.

I can solve the problem in my own code by marking each feed in my .opml feed list as either Ruby-specific or multi-topic. My code can then apply the category filter only when parsing multi-topic feeds. Unfortunately, this means extra effort whenever I add a new feed, and any user of my code will have to understand the issue as well. General-purpose aggregators aren't likely to adopt this solution, so filtering for Ruby categories in a generic aggregator will filter out some of the blog posts that the user would want.
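
One way this could look, sketched with REXML. The scope attribute on each outline element is my own invention for the example and is not defined by OPML; the feed URLs and method names are likewise placeholders, and a feed with no scope is assumed to be multi-topic.

```ruby
require 'rexml/document'

# Hypothetical feed list: "scope" is an assumed, non-standard attribute.
SAMPLE_OPML = <<~XML
  <opml version="1.1">
    <head><title>Ruby feeds</title></head>
    <body>
      <outline text="Single-topic Ruby blog" xmlUrl="http://example.com/ruby.xml" scope="ruby"/>
      <outline text="Multi-topic blog" xmlUrl="http://example.com/mixed.xml" scope="multi"/>
      <outline text="Unmarked blog" xmlUrl="http://example.com/other.xml"/>
    </body>
  </opml>
XML

# Map each feed URL to its scope; unmarked feeds default to multi-topic.
def feed_scopes(opml_xml)
  doc = REXML::Document.new(opml_xml)
  doc.get_elements('//outline').each_with_object({}) do |outline, scopes|
    url = outline.attributes['xmlUrl']
    next unless url # skip grouping outlines that carry no feed URL
    scopes[url] = outline.attributes['scope'] || 'multi'
  end
end

# Keep every item from a Ruby-specific feed; for multi-topic feeds,
# keep an item only if one of its category names mentions the topic.
def keep_item?(scope, categories, topic = 'ruby')
  return true if scope == 'ruby'
  categories.any? { |name| name.downcase.include?(topic) }
end

scopes = feed_scopes(SAMPLE_OPML)
puts keep_item?(scopes['http://example.com/ruby.xml'], ['Opinion', 'News'])
puts keep_item?(scopes['http://example.com/mixed.xml'], ['Football'])
```

Defaulting unmarked feeds to multi-topic errs on the side of filtering, so a forgotten flag hides some Ruby posts rather than letting football through. The opposite default would be just as easy to argue for.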

This type of inconsistency in applying tags to blog posts illustrates the hurdles that still must be overcome before RSS can fulfill its promise. The specification is willing, but the patterns of usage are weak.