A Unique Problem

Friday, August 18th, 2006

James Holderness writes on how various feed aggregators attempt to judge the uniqueness of items in feeds:

Detecting duplicate items in an RSS feed is something of a black art. How does one uniquely identify an item in a feed while still allowing for that item to be updated? …

I can’t say for sure what algorithms applications are using, but after running 150 tests on more than 20 different aggregators, I think I have a fair idea how many of them work.

He summarises some reasonable ways of judging uniqueness and brings up good arguments for and against each.

He goes on to say:

I would recommend you also include a unique link element for each item in your feed, to allow for aggregators that don’t handle guids very well. No two items should ever have the same link element,

Unfortunately the link element is as abused as dates and GUIDs, and while this principle is ideal, it isn’t how many feeds are constructed. The link element is meant to be a permalink to the item itself, not to whatever the item is talking about or to any other link. But many linkblogs put the link they are talking about in the link element. In fact a big source of RSS, del.icio.us, gets it wrong and points at the link being mentioned. Now consider that many people on del.icio.us link to the same URL, and you suddenly have separate feed items that some aggregators may treat as duplicates.

So ideally, yes, GUID and link are good identifiers, but in practice, sadly, neither can be relied on. This is the way of much feed parsing, as we have found out in the FeedHenry.com project: feeds need fuzzy logic before you can make much sense of them.
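To make the problem concrete, here is a rough sketch in Python of the kind of fallback heuristic an aggregator might use. It is not the algorithm of any aggregator James tested, nor the FeedHenry code; the function names (`item_key`, `dedupe`) and the dict-shaped items are assumptions made purely for illustration. The idea is to trust the guid when there is one, and otherwise to never let the link decide identity on its own.

```python
import hashlib


def item_key(item):
    """Return a best-effort identity key for a feed item (a plain dict).

    Hypothetical helper: the field names mirror RSS elements, but the
    fallback order is an assumption, not a documented standard.
    """
    guid = (item.get("guid") or "").strip()
    if guid:
        # A guid, when present, is the strongest signal we have.
        return ("guid", guid)

    link = (item.get("link") or "").strip()
    title = (item.get("title") or "").strip()
    if link:
        # The link alone is not trusted: several distinct items may
        # point at the same URL, so the title is folded into the key.
        return ("link+title", link, title)

    # Last resort: hash whatever text the item carries.
    text = title + "\n" + (item.get("description") or "").strip()
    return ("hash", hashlib.sha1(text.encode("utf-8")).hexdigest())


def dedupe(items):
    """Keep the first occurrence of each item, in feed order."""
    seen = set()
    unique = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Two people link to the same page; the first item is later re-sent edited.
items = [
    {"guid": "a1", "link": "http://example.com/story", "title": "Alice's link"},
    {"guid": "b2", "link": "http://example.com/story", "title": "Bob's link"},
    {"guid": "a1", "link": "http://example.com/story", "title": "Alice's link (edited)"},
]
print(len(dedupe(items)))  # 2
```

An aggregator that keyed on the link alone would collapse all three of these into one item. The sketch has its own weakness, of course: without a guid, an item whose title is edited looks like a new item, which is exactly the trade-off James describes.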

(James’ article is worth reading.)